Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Bayesian deep learning seeks to equip deep neural networks with the ability
to precisely quantify their predictive uncertainty, and has promised to make
deep learning more reliable for safety-critical real-world applications. Yet,
existing Bayesian deep learning methods fall short of this promise; new methods
continue to be evaluated on unrealistic test beds that do not reflect the
complexities of downstream real-world tasks that would benefit most from
reliable uncertainty quantification. We propose the RETINA Benchmark, a set of
real-world tasks that accurately reflect such complexities and are designed to
assess the reliability of predictive models in safety-critical scenarios.
Specifically, we curate two publicly available datasets of high-resolution
human retina images exhibiting varying degrees of diabetic retinopathy, a
medical condition that can lead to blindness, and use them to design a suite of
automated diagnosis tasks that require reliable predictive uncertainty
quantification. We use these tasks to benchmark well-established and
state-of-the-art Bayesian deep learning methods on task-specific evaluation
metrics. We provide an easy-to-use codebase for fast and easy benchmarking
following reproducibility and software design principles. We provide
implementations of all methods included in the benchmark as well as results
computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and
evaluation on at least 6 random seeds each. |
---|---|
DOI: | 10.48550/arxiv.2211.12717 |