A fine-tuned multimodal large model for power defect image-text question-answering

Bibliographic details
Published in: Signal, Image and Video Processing, 2024-12, Vol. 18 (12), p. 9191-9203
Authors: Wang, Qiqi, Zhang, Jie, Du, Jianming, Zhang, Ke, Li, Rui, Zhao, Feng, Zou, Le, Xie, Chengjun
Format: Article
Language: English
Online access: Full text
Description
Abstract: In power defect detection, the complexity of scenes and the diversity of defects make manual defect identification challenging. To address these issues, this paper proposes using a multimodal large model to assist power professionals in identifying power scenes and defects through image-text interactions, thereby improving work efficiency. The paper presents a fine-tuned multimodal large model for power defect image-text question-answering, addressing challenges such as training difficulty and the lack of image-text knowledge specific to power defects. YOLOv8 is used to construct a dataset for multimodal power defect detection, enriching the image-text information available in the power defect domain. By integrating the LoRA and Q-Former methods for model fine-tuning, the approach improves the extraction of visual and semantic features and aligns visual and semantic information. Experimental results demonstrate that the proposed multimodal large model significantly outperforms other popular multimodal models on power defect question-answering.
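The abstract names LoRA as one of the fine-tuning methods. As a rough illustration of the core idea, a frozen weight matrix augmented with a trainable low-rank update scaled by alpha/r, here is a minimal pure-Python sketch; the class, helper names, and initialization are hypothetical and are not taken from the paper's implementation:

```python
# Minimal sketch of a LoRA-style low-rank update (hypothetical, for
# illustration only; the paper's actual fine-tuning code is not shown).
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

class LoRALinear:
    """Computes y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only the low-rank factors
    A (r x in_dim) and B (out_dim x r) would be trained.
    """
    def __init__(self, W, A, B, alpha=8):
        self.W, self.A, self.B = W, A, B
        self.scale = alpha / len(A)  # r = rank = number of rows of A

    def __call__(self, x):
        base = matvec(self.W, x)                  # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]
```

In the standard LoRA recipe, B is initialized to zero, so the adapted layer starts out identical to the frozen pretrained layer and the update grows only as A and B are trained.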
ISSN:1863-1703
1863-1711
DOI:10.1007/s11760-024-03539-w