Lightweight multi-modal image description generation method based on CLIP encoder

Detailed Description

Bibliographic Details
Main authors: HUANG WENMING, CHEN JICHU
Format: Patent
Language: Chinese; English
Description
Summary: The invention discloses a lightweight multi-modal image description generation method based on a CLIP encoder. The method comprises the following steps: first, preprocessing the image data and generating image feature vectors; second, generating text with a language model; and finally, producing the required image description model. The method expands and synthesizes existing descriptions to generate more accurate and diverse descriptions. It adopts a hybrid multi-modal model in which an advanced CLIP encoder is introduced at the image feature generation stage; this encoder performs contrastive learning over the image and text embedding spaces to produce semantically richer feature representations. The invention provides a simple, portable and efficient multi-modal text generation technique and offers a powerful solution to the challenges of multi-modal tasks. The method is expected to promote the development of the field of multi-modal text generation.
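
As an illustration of the described pipeline, the following Python sketch first encodes an image into a feature vector with a pretrained CLIP encoder and then conditions a small language model on that vector to generate a caption. The patent abstract does not specify an implementation; the model checkpoints ("openai/clip-vit-base-patch32", "gpt2"), the prefix-mapping network, and all hyper-parameters below are assumptions added for illustration, loosely following a ClipCap-style prefix-conditioning scheme rather than the patent's own method.

# A minimal sketch of the two-stage pipeline described in the abstract:
# (1) encode the image with a pretrained CLIP image encoder,
# (2) feed the resulting feature vector, via a small mapping network, into a
#     lightweight language model that decodes a description token by token.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: CLIP image features (learned by contrastive image/text pre-training).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval().to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("example.jpg")                      # placeholder image path
inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    image_feat = clip.get_image_features(**inputs)     # shape (1, 512)

# Stage 2: map the CLIP feature to a "prefix" of pseudo-token embeddings and
# let a GPT-2 decoder continue it into a caption. The mapper would be trained
# on paired image/caption data in practice; here it is randomly initialised.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval().to(device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prefix_len, hidden = 10, gpt2.config.n_embd            # 10 pseudo-tokens, 768 dims
mapper = torch.nn.Sequential(
    torch.nn.Linear(image_feat.shape[-1], prefix_len * hidden),
    torch.nn.Tanh(),
).to(device)

with torch.no_grad():
    embeds = mapper(image_feat).view(1, prefix_len, hidden)
    token_ids = []
    for _ in range(30):                                 # greedy decoding
        logits = gpt2(inputs_embeds=embeds).logits
        next_id = logits[0, -1].argmax()
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).view(1, 1, hidden)
        embeds = torch.cat([embeds, next_embed], dim=1)

print(tokenizer.decode(token_ids))

Because the mapping network above is untrained, the printed text is not a meaningful caption; training the mapper (and optionally fine-tuning the decoder) on paired image/caption data is what lets the CLIP feature steer the language model toward an accurate description, which is the role the abstract assigns to the contrastively learned image/text embedding space.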