Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography
Main authors: , , , , , , , , , , , , , , , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovations in medical AI and improved patient care.
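The summary's core method, contrastive language-image pretraining over scan/report pairs, can be made concrete with a short sketch. The following is a minimal PyTorch illustration only, assuming generic encoder interfaces; the class names, prompt wording, and hyperparameters are assumptions for exposition and are not taken from the released CT-CLIP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTContrastiveModel(nn.Module):
    """CLIP-style pairing of a 3D image encoder with a text encoder.

    Both encoders are placeholders: any module mapping a CT volume
    (B, 1, D, H, W) or a tokenized report (B, T) to a (B, dim) vector fits.
    """

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Learnable temperature, initialized to log(1/0.07) as in the original CLIP.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

    def forward(self, volumes: torch.Tensor, reports: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_encoder(volumes), dim=-1)  # (B, dim)
        txt = F.normalize(self.text_encoder(reports), dim=-1)   # (B, dim)
        logits = self.logit_scale.exp() * img @ txt.t()         # (B, B) similarities
        targets = torch.arange(img.size(0), device=img.device)  # matched pairs lie on the diagonal
        # Symmetric InfoNCE: pick the right report for each scan, and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


@torch.no_grad()
def abnormality_probability(model, volume, present_emb, absent_emb):
    """Zero-shot detection sketch: softmax over 'present' vs. 'absent' prompts.

    present_emb / absent_emb are text-encoder outputs for prompts such as
    "Lung nodule is present." / "Lung nodule is absent." (illustrative wording).
    """
    img = F.normalize(model.image_encoder(volume.unsqueeze(0)), dim=-1)  # (1, dim)
    pair = F.normalize(torch.stack([present_emb, absent_emb]), dim=-1)   # (2, dim)
    sims = (img @ pair.t()).squeeze(0)                                   # (2,)
    return sims.softmax(dim=-1)[0].item()  # probability mass on "present"
```

Case retrieval, the second task named in the summary, falls out of the same geometry: precompute embeddings for a corpus of scans and reports, embed the query (image or text) with the matching encoder, and rank cases by cosine similarity.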
DOI: 10.48550/arxiv.2403.17834
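The summary also describes building CT-CHAT by combining CT-CLIP's vision encoder with a pretrained large language model. A common way to wire this up, sketched below as an assumption in the spirit of LLaVA-style designs rather than CT-CHAT's actual architecture, is a small projector that maps vision features into the LLM's token-embedding space so that visual tokens can be prepended to the embedded question.

```python
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    """Projects CT vision features into an LLM's embedding space (illustrative).

    The two-layer MLP projector is an assumption; any LLM that accepts
    precomputed input embeddings could consume the output sequence.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_features: (B, V, vision_dim) patch/slice features from the CT encoder
        # text_embeds:     (B, T, llm_dim) embedded question tokens
        visual_tokens = self.proj(vision_features)             # (B, V, llm_dim)
        # The LLM then attends over [visual tokens; text tokens] as one sequence.
        return torch.cat([visual_tokens, text_embeds], dim=1)  # (B, V + T, llm_dim)
```

Finetuning such a bridge on question-answer pairs (the summary reports over 2.7 million derived from CT-RATE) is what turns the contrastive encoder into a chat assistant.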