CAST: Cluster-Aware Self-Training for Tabular Data via Reliable Confidence
Main Authors:
Format: Article
Language: eng
Online Access: Order full text
Abstract: Tabular data is one of the most widely used data modalities, encompassing numerous datasets with substantial amounts of unlabeled data. Despite this prevalence, there is a notable lack of simple and versatile methods for utilizing unlabeled data in the tabular domain, where both gradient-boosting decision trees and neural networks are employed. In this context, self-training has gained traction due to its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence. Several solutions have been proposed to handle this problem, but they often compromise the inherent advantages of self-training, resulting in limited applicability in the tabular domain. To address this issue, we explore a novel direction of reliable confidence in self-training contexts and conclude that self-training can be improved by ensuring that the confidence, which represents the value of the pseudo-label, aligns with the cluster assumption. In this regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at a negligible cost while maintaining simplicity and versatility. Concretely, CAST calibrates confidence by regularizing the classifier's confidence based on local density for each class in the labeled training data, resulting in lower confidence for pseudo-labels in low-density regions. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.
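The calibration step described in the abstract, regularizing the classifier's confidence by the per-class local density of the labeled training data, can be made concrete with a small sketch. The following is a minimal illustration under stated assumptions, not the paper's implementation: the kNN-based density estimate, the names `cast_confidence`, `per_class_density`, `K_NEIGHBORS`, and `DENSITY_WEIGHT`, the multiplicative blending rule, and the 0.9 selection threshold are all hypothetical choices for the sake of the example.

```python
# Minimal sketch of density-regularized confidence for self-training.
# All names and constants here are illustrative assumptions, not the
# authors' implementation of CAST.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors

K_NEIGHBORS = 5       # assumed neighborhood size for the density estimate
DENSITY_WEIGHT = 1.0  # assumed strength of the density regularizer

def per_class_density(X_labeled, y_labeled, X_query, k=K_NEIGHBORS):
    """Crude per-class local density: inverse mean distance from each query
    point to its k nearest labeled points of each class (one column per
    class, in sorted class order)."""
    classes = np.unique(y_labeled)
    dens = np.zeros((len(X_query), len(classes)))
    for j, c in enumerate(classes):
        nn = NearestNeighbors(n_neighbors=k).fit(X_labeled[y_labeled == c])
        dist, _ = nn.kneighbors(X_query)
        dens[:, j] = 1.0 / (dist.mean(axis=1) + 1e-12)
    # Normalize rows so the density acts like a class-conditional prior.
    return dens / dens.sum(axis=1, keepdims=True)

def cast_confidence(clf, X_labeled, y_labeled, X_unlabeled):
    """Blend classifier probabilities with per-class local density so that
    pseudo-labels in low-density regions receive lower confidence.
    Assumes clf.classes_ matches np.unique(y_labeled) (true for sklearn)."""
    proba = clf.predict_proba(X_unlabeled)
    dens = per_class_density(X_labeled, y_labeled, X_unlabeled)
    scores = proba * dens ** DENSITY_WEIGHT
    return scores / scores.sum(axis=1, keepdims=True)

# Usage: one self-training round with density-aware pseudo-labeling.
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(100, 8)), rng.integers(0, 2, 100)
X_u = rng.normal(size=(500, 8))
clf = GradientBoostingClassifier().fit(X_l, y_l)
conf = cast_confidence(clf, X_l, y_l, X_u)
keep = conf.max(axis=1) > 0.9          # assumed confidence threshold
pseudo_labels = conf.argmax(axis=1)[keep]
```

The key property of such a scheme is that the classifier's raw confidence alone no longer decides which pseudo-labels enter the next training round: a point far from any labeled cluster of its predicted class is down-weighted even if the model is overconfident there, which is consistent with the cluster assumption the abstract appeals to.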
DOI: 10.48550/arxiv.2310.06380