Probabilistic mapping of imbalanced data for groundwater contamination using classification algorithms: Performance and reliability
Published in: | Groundwater for Sustainable Development 2025-02, Vol.28, p.101393, Article 101393 |
---|---|
Main authors: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | The probabilistic mapping of groundwater contamination is a crucial foundation for sustainable groundwater management. However, groundwater data are often imbalanced, which makes precise and reliable probability mapping difficult. This study focused on the Jianghan Plain and evaluated the performance and reliability of various sampling and ensemble techniques on a small, imbalanced dataset (n = 246, Class0/Class1 = 0.84/0.16). The probabilistic maps revealed significant spatial variability: high-probability areas were concentrated in the western (Yichang City), eastern (Wuhan), and northern (north bank of the Han River) regions, while low-probability areas lay in the central and southern regions. Over-sampling methods outperformed the others by maintaining class balance and enhancing the reliability of the mapping outcomes. For over-sampling methods, the high to very high probability areas ranged from 15.5% to 18.9%, with larger very low to low areas (60.5%–66.3%). In contrast, under-sampling and ensemble methods showed larger high to very high probability areas (34.0%–53.1%) and smaller very low to low areas (21.6%–46.3%). Over-sampling methods achieved higher F1 scores (0.27–0.33) and precision (0.375–0.43) than the other methods. SHAP analysis demonstrated that over-sampling methods balance datasets while preserving information integrity, which enhances the credibility of the mapping results. Conversely, ensemble methods faced challenges in statistical analysis, which hindered interpretability. We strongly recommend that probabilistic mapping of groundwater contamination adequately consider dataset imbalance and not rely solely on metrics such as AUC and OA. For small datasets like the one in this study, SMOTE and ADASYN are the recommended sampling methods: they not only yield high-precision mapping results but also ensure interpretability, thereby providing a more reliable basis for sustainable groundwater management.
• Multiple sampling and ensemble methods were used to classify a small imbalanced dataset.
• The methods were evaluated along three dimensions: performance, data structure, and model interpretability.
• SMOTE and ADASYN were evaluated as the best methods for small imbalanced datasets.
• Some ensemble methods show good performance, but their interpretability is restricted. |
---|---|
ISSN: | 2352-801X |
DOI: | 10.1016/j.gsd.2024.101393 |
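Note on the abstract's warning about OA: the recommendation not to rely solely on overall accuracy (OA) can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes that the reported class ratio of 0.84/0.16 for n = 246 corresponds to roughly 39 Class1 samples (a rounded figure, not stated in the record), and it scores a hypothetical baseline that always predicts Class0. All function and variable names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def overall_accuracy(y_true, y_pred):
    """Fraction of samples classified correctly (OA)."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class; 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumption: class counts reconstructed from the reported ratio, not from the data.
N_TOTAL = 246
N_POS = round(N_TOTAL * 0.16)            # about 39 Class1 samples
y_true = [1] * N_POS + [0] * (N_TOTAL - N_POS)

# Hypothetical baseline that never predicts Class1.
y_majority = [0] * N_TOTAL

oa = overall_accuracy(y_true, y_majority)
precision, recall, f1 = precision_recall_f1(y_true, y_majority)
print(f"OA={oa:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Under these assumptions the baseline reaches an OA of about 0.84 while its F1 score is zero, because it never identifies a Class1 sample. This is why class-sensitive metrics such as precision and F1 are more informative than OA for a dataset with this degree of imbalance.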