A fine-tuned multimodal large model for power defect image-text question-answering

Bibliographic details
Published in: Signal, Image and Video Processing, 2024-12, Vol. 18 (12), p. 9191-9203
Authors: Wang, Qiqi, Zhang, Jie, Du, Jianming, Zhang, Ke, Li, Rui, Zhao, Feng, Zou, Le, Xie, Chengjun
Format: Article
Language: English
Online access: Full text
Description
Abstract: In power defect detection, the complexity of scenes and the diversity of defects make manual defect identification challenging. To address these issues, this paper proposes using a multimodal large model to assist power professionals in identifying power scenes and defects through image-text interactions, thereby improving work efficiency. The paper presents a fine-tuned multimodal large model for power defect image-text question-answering, addressing challenges such as training difficulty and the lack of image-text knowledge specific to power defects. YOLOv8 is used to construct a dataset for multimodal power defect detection, enriching the image-text information available in the power defect domain. By integrating the LoRA and Q-Former methods for model fine-tuning, the approach improves the extraction of visual and semantic features and aligns visual and semantic information. Experimental results demonstrate that the proposed multimodal large model significantly outperforms other popular multimodal models on power defect question-answering.
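The abstract names LoRA as one of the fine-tuning methods. As a rough illustration of the core idea, a frozen weight matrix augmented with a trainable low-rank update scaled by alpha/r, here is a minimal pure-Python sketch; the class, helper names, and initialization are hypothetical and are not taken from the paper's implementation:

```python
# Minimal sketch of a LoRA-style low-rank update (hypothetical, for
# illustration only; the paper's actual fine-tuning code is not shown).
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

class LoRALinear:
    """Computes y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only the low-rank factors
    A (r x in_dim) and B (out_dim x r) would be trained.
    """
    def __init__(self, W, A, B, alpha=8):
        self.W, self.A, self.B = W, A, B
        self.scale = alpha / len(A)  # r = rank = number of rows of A

    def __call__(self, x):
        base = matvec(self.W, x)                  # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]
```

In the standard LoRA recipe, B is initialized to zero, so the adapted layer starts out identical to the frozen pretrained layer and the update grows only as A and B are trained.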
ISSN:1863-1703
1863-1711
DOI:10.1007/s11760-024-03539-w