An empirical evaluation of deep semi-supervised learning
| Published in: | International journal of data science and analytics, 2025-01 |
|---|---|
| Main authors: | , , |
| Format: | Article |
| Language: | English |
| Online access: | Full text |
Abstract: Obtaining labels for supervised learning is time-consuming, and practitioners seek to minimize manual labeling. Semi-supervised learning lets practitioners reduce manual labeling by including unlabeled data in the training process. With many deep semi-supervised algorithms and applications available, practitioners need guidelines for selecting the optimal labeling algorithm for their problem, yet the performance of new algorithms is rarely compared against existing algorithms on real-world data. This study empirically evaluates 16 deep semi-supervised learning algorithms to fill this research gap. To investigate whether the algorithms perform differently in different scenarios, they are run on 15 commonly known datasets spanning three data types (image, text, and sound). Since manual data labeling is expensive, practitioners need to know how many manually labeled instances are required to achieve the lowest error rates; the study therefore uses different configurations of the number of available labels to measure the manual effort required for an optimal error rate. Additionally, to study how the algorithms perform on real-world datasets, noise is added to the datasets to mirror real-world conditions. The study uses the Bradley–Terry model to rank the algorithms by error rate and a Binomial model to estimate the probability of achieving an error rate below 10%. The results demonstrate that utilizing unlabeled data with semi-supervised learning can improve classification accuracy over supervised learning. Based on the results, the authors recommend FreeMatch, SimMatch, and SoftMatch, since these provide the lowest error rates and have a high probability of achieving an error rate below 10% on noisy datasets.
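The Bradley–Terry ranking used in the abstract can be sketched as follows. This is a minimal illustration, not the study's implementation: the win counts are invented toy data (three hypothetical algorithms A, B, C, where `wins[i][j]` counts how often algorithm i achieved a lower error rate than algorithm j), and `bradley_terry` is a hypothetical helper using the standard MM (minorization–maximization) estimator.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths p_i, where the model posits
    P(i beats j) = p_i / (p_i + p_j), via the MM algorithm."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = 0.0
            for j in range(n):
                if j != i:
                    n_ij = wins[i, j] + wins[j, i]  # comparisons between i and j
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()  # normalize: strengths are only identifiable up to scale
    return p

# Toy pairwise outcomes: A beats B 8/10, B beats C 7/10, A beats C 9/10.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = bradley_terry(wins)  # larger strength = better-ranked algorithm

# A simple Binomial view of "probability of error rate below 10%":
# if an algorithm reached error < 10% on k of n dataset/label configurations,
# the maximum-likelihood success probability is k / n.
k, n = 12, 15
p_below_10 = k / n
```

The MM update is guaranteed to converge for this model when every algorithm wins and loses at least once; the normalization step only fixes the scale and does not change the ranking.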
ISSN: 2364-415X; 2364-4168
DOI: 10.1007/s41060-024-00713-8