Unsupervised drill core pseudo-log generation in raw and filtered data, a case study in the Rio Salitre greenstone belt, São Francisco Craton, Brazil
Published in: Journal of Geochemical Exploration, 2022-01, Vol. 232, p. 106885, Article 106885
Format: Article
Language: English
Online access: Full text
Abstract: The goal of this work is to emulate a situation where an analyst with little or no previous geological knowledge of the samples must rely on an unsupervised approach to gain insights into drill core samples, and to compare the results of two main unsupervised algorithms with and without filtering methods. We used in situ portable X-ray fluorescence data acquired on sawn drill core samples of rocks from the Sabiá prospect, in the Rio Salitre greenstone belt, São Francisco Craton, Brazil, for automatic pseudo-log generation by running unsupervised learning models to group distinct lithotypes. We tested the K-means and Model-Based Cluster (MBC) algorithms and compared their performance on the raw and filtered data against a manual macroscopic log description. From the initial 47 available elements, 20 variables were selected for modeling using the criterion of presenting at least 95% uncensored values. Additionally, a Shapiro-Wilk test indicated a non-normal distribution, with P-values below the 5% significance level, supporting a non-parametric treatment of the data. We also checked whether the dataset's distribution was statistically equivalent to that of the duplicates with the assistance of a Kruskal–Wallis test, which would confirm the representativity of the measurements at the same 5% significance level. After this step, the pseudo-log models were created from reduced-dimension data, compressed by a centered Principal Component Analysis with the data rescaled by its range. To reduce the high-frequency noise in the selected features, we employed an exponential weighted moving average filter with a window of five samples. Based on the analysis of the Average Silhouette Width over the sample space, the optimum number of clusters for K-means was fixed at two, and the first models were then generated for the raw and filtered data. From the MBC perspective, the sample space is interpreted as a finite mixture of groups with distinct Gaussian probability distributions. The number of clusters is defined by analysis of the Bayesian Information Criterion (BIC), where several models are tested and the one at the first local maximum defines the number of groups and the type of probabilistic model in the simulation. For the data used in this work, the optimum group number for MBC is four, and the probabilistic model type determined by the BIC is ellipsoidal with equal volume, shape, and orientation. Thus, Model-Based Cluster has detected four different cluster groups with almost t…
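As a non-authoritative sketch of the screening and testing steps summarized in the abstract (the study's own workflow is not published as code), the snippet below filters variables by the share of uncensored values and runs the Shapiro-Wilk and Kruskal–Wallis tests with SciPy; the DataFrame names, element columns, and synthetic values are placeholders, not data from the study.

```python
# Sketch of the variable screening and distribution tests described in the abstract.
# Assumptions (not from the paper): readings live in a pandas DataFrame `xrf` with
# censored values stored as NaN, and `dups` holds the duplicate measurements.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
xrf = pd.DataFrame(rng.lognormal(size=(300, 4)), columns=["Fe", "Si", "Ca", "K"])
dups = xrf.sample(30, random_state=0)            # stand-in for field duplicates

# Keep only elements with at least 95% uncensored (non-missing) values.
keep = [c for c in xrf.columns if xrf[c].notna().mean() >= 0.95]

for col in keep:
    # Shapiro-Wilk: p < 0.05 rejects normality, pointing to non-parametric methods.
    _, p_sw = stats.shapiro(xrf[col])
    # Kruskal-Wallis: p >= 0.05 means originals and duplicates are statistically
    # equivalent, supporting the representativity of the measurements.
    _, p_kw = stats.kruskal(xrf[col], dups[col])
    print(f"{col}: Shapiro-Wilk p={p_sw:.3f}, Kruskal-Wallis p={p_kw:.3f}")
```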
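The dimensionality reduction and smoothing steps could look roughly like the following sketch: rescaling by range, a centered PCA, and an exponential weighted moving average over the principal component scores. Mapping the five-sample window onto pandas' `span=5` is an assumption, as are all variable names.

```python
# Sketch: rescale by range, compress with a centered PCA, and smooth the scores
# with an exponential weighted moving average, mirroring the workflow in the abstract.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
xrf = pd.DataFrame(rng.lognormal(size=(300, 4)), columns=["Fe", "Si", "Ca", "K"])

scaled = MinMaxScaler().fit_transform(xrf)           # rescale each element by its range
scores = PCA(n_components=2).fit_transform(scaled)   # PCA mean-centers the data internally

# Exponential weighted moving average along the hole to damp high-frequency noise;
# span=5 is used here as the "window of five samples".
scores_filtered = pd.DataFrame(scores).ewm(span=5, adjust=False).mean().to_numpy()
```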
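For the clustering step, a minimal scikit-learn sketch is shown below, with a Gaussian mixture model standing in for the mclust-style Model-Based Cluster used in the paper: the Average Silhouette Width selects k for K-means, and the BIC selects the number of mixture components and the covariance structure. The data here are synthetic, and the 'tied' covariance choice is only an approximate analogue of the equal volume, shape, and orientation model.

```python
# Sketch: pick k for K-means by the Average Silhouette Width, and pick the number
# of Gaussian mixture components (a scikit-learn stand-in for Model-Based
# Clustering) by the BIC. `scores_filtered` is synthetic here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
scores_filtered = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])

# K-means: retain the k with the highest Average Silhouette Width.
asw = {k: silhouette_score(
          scores_filtered,
          KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores_filtered))
       for k in range(2, 7)}
best_k = max(asw, key=asw.get)

# Gaussian mixtures: scikit-learn's BIC is minimized (mclust reports a sign-flipped
# BIC that is maximized); 'tied' covariance approximates the equal volume, shape,
# and orientation (EEE) model mentioned in the abstract.
bic = {(g, cov): GaussianMixture(n_components=g, covariance_type=cov, random_state=0)
                 .fit(scores_filtered).bic(scores_filtered)
       for g in range(1, 7) for cov in ("full", "tied", "diag", "spherical")}
best_g, best_cov = min(bic, key=bic.get)
print(best_k, best_g, best_cov)
```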
ISSN: 0375-6742, 1879-1689
DOI: 10.1016/j.gexplo.2021.106885