A fine-tuned multimodal large model for power defect image-text question-answering
Published in: Signal, Image and Video Processing, 2024-12, Vol. 18 (12), pp. 9191-9203
Main authors: , , , , , , ,
Format: Article
Language: English
Online access: Full text
Abstract: In power defect detection, the complexity of scenes and the diversity of defects make manual defect identification challenging. To address these issues, this paper proposes using a multimodal large model to assist power professionals in identifying power scenes and defects through image-text interaction, thereby improving work efficiency. The paper presents a fine-tuned multimodal large model for power defect image-text question-answering, addressing challenges such as training difficulty and the lack of image-text knowledge specific to power defects. YOLOv8 is used to create a dataset for multimodal power defect detection, enriching the image-text information available in the power defect domain. By integrating the LoRA and Q-Former methods for model fine-tuning, the approach improves the extraction of visual and semantic features and aligns visual and semantic information. Experimental results show that the proposed multimodal large model significantly outperforms other popular multimodal models on power defect question-answering.
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-024-03539-w
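
The abstract describes fine-tuning a multimodal large model with LoRA and a Q-Former. The snippet below is a minimal sketch of how such a setup could look using Hugging Face transformers and peft with a BLIP-2 backbone (which already contains a Q-Former); the model name, target modules, and hyperparameters are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch (assumptions, not the paper's code): attach LoRA adapters to a
# BLIP-2 vision-language model, whose built-in Q-Former bridges the frozen
# image encoder and the language model.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

base_model = "Salesforce/blip2-opt-2.7b"  # assumed backbone; the paper's base model may differ
processor = Blip2Processor.from_pretrained(base_model)
model = Blip2ForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.float16)

# LoRA injects trainable low-rank matrices into selected attention projections,
# so only a small fraction of the parameters is updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in the OPT language model
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A defect image-question pair from a YOLOv8-derived dataset would then be
# processed like this (image and question are placeholders):
# inputs = processor(images=image, text=question, return_tensors="pt")
# outputs = model.generate(**inputs, max_new_tokens=64)
# answer = processor.batch_decode(outputs, skip_special_tokens=True)[0]
```

Note that the sketch trains only the LoRA adapters; a fine-tuning stage along the lines described in the abstract might also update the Q-Former weights to better align visual and semantic features.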