i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment

Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in language model alignment, particular...

Full description

Bibliographic details
Published in: arXiv.org 2024-06
Main authors: Ahn, Daechul, Choi, Yura, Kim, San, Yu, Youngjae, Kang, Dongyeop, Choi, Jonghyun
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Ahn, Daechul
Choi, Yura
Kim, San
Yu, Youngjae
Kang, Dongyeop
Choi, Jonghyun
description Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in language model alignment, particularly on reasoning tasks, self-alignment applied to large video-language models often results in lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior work. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3069344620</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3069344620</sourcerecordid><originalsourceid>FETCH-proquest_journals_30693446203</originalsourceid><addsrcrecordid>eNqNit0KgjAYQEcQJOU7fNC1sDa1n7uIoqJuVLrpQlZ-ymQ622bQ2xfRA3R14JwzIB7jfBYsQsZGxLe2ppSyeM6iiHvkKoM0yVawVrJqZVvBSZgK4dwrJxtdCAVnXaCyUGoDF1mgtnB7wcGhEU4-EVJUZZCgM9p2eP-qY19UDbZuQoalUBb9H8dkuttmm33QGf3o0bq81r1pPynnNF7yMIwZ5f9db_MYQmA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3069344620</pqid></control><display><type>article</type><title>i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment</title><source>Free E- Journals</source><creator>Ahn, Daechul ; Choi, Yura ; Kim, San ; Yu, Youngjae ; Kang, Dongyeop ; Choi, Jonghyun</creator><creatorcontrib>Ahn, Daechul ; Choi, Yura ; Kim, San ; Yu, Youngjae ; Kang, Dongyeop ; Choi, Jonghyun</creatorcontrib><description>Aligning Video Large Multimodal Models (VLMMs) face challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) recently showed a significant improvement in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often result in lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, and call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preference in loop, i-SRT improves the alignment between textual and visual modalities, reduce verbosity, and enhances content relevance. 
Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior arts. We are committed to opensourcing our code, models, and datasets to encourage further investigation.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Misalignment ; Self alignment</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Ahn, Daechul</creatorcontrib><creatorcontrib>Choi, Yura</creatorcontrib><creatorcontrib>Kim, San</creatorcontrib><creatorcontrib>Yu, Youngjae</creatorcontrib><creatorcontrib>Kang, Dongyeop</creatorcontrib><creatorcontrib>Choi, Jonghyun</creatorcontrib><title>i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment</title><title>arXiv.org</title><description>Aligning Video Large Multimodal Models (VLMMs) face challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) recently showed a significant improvement in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often result in lengthy and irrelevant responses. 
To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, and call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preference in loop, i-SRT improves the alignment between textual and visual modalities, reduce verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior arts. We are committed to opensourcing our code, models, and datasets to encourage further investigation.</description><subject>Misalignment</subject><subject>Self alignment</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNit0KgjAYQEcQJOU7fNC1sDa1n7uIoqJuVLrpQlZ-ymQ622bQ2xfRA3R14JwzIB7jfBYsQsZGxLe2ppSyeM6iiHvkKoM0yVawVrJqZVvBSZgK4dwrJxtdCAVnXaCyUGoDF1mgtnB7wcGhEU4-EVJUZZCgM9p2eP-qY19UDbZuQoalUBb9H8dkuttmm33QGf3o0bq81r1pPynnNF7yMIwZ5f9db_MYQmA</recordid><startdate>20240617</startdate><enddate>20240617</enddate><creator>Ahn, Daechul</creator><creator>Choi, Yura</creator><creator>Kim, San</creator><creator>Yu, Youngjae</creator><creator>Kang, Dongyeop</creator><creator>Choi, Jonghyun</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240617</creationdate><title>i-SRT: 
Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment</title><author>Ahn, Daechul ; Choi, Yura ; Kim, San ; Yu, Youngjae ; Kang, Dongyeop ; Choi, Jonghyun</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30693446203</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Misalignment</topic><topic>Self alignment</topic><toplevel>online_resources</toplevel><creatorcontrib>Ahn, Daechul</creatorcontrib><creatorcontrib>Choi, Yura</creatorcontrib><creatorcontrib>Kim, San</creatorcontrib><creatorcontrib>Yu, Youngjae</creatorcontrib><creatorcontrib>Kang, Dongyeop</creatorcontrib><creatorcontrib>Choi, Jonghyun</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ahn, Daechul</au><au>Choi, Yura</au><au>Kim, San</au><au>Yu, 
Youngjae</au><au>Kang, Dongyeop</au><au>Choi, Jonghyun</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment</atitle><jtitle>arXiv.org</jtitle><date>2024-06-17</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Aligning Video Large Multimodal Models (VLMMs) face challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) recently showed a significant improvement in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often result in lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, and call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preference in loop, i-SRT improves the alignment between textual and visual modalities, reduce verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior arts. We are committed to opensourcing our code, models, and datasets to encourage further investigation.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3069344620
source Free E- Journals
subjects Misalignment
Self alignment
title i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T09%3A30%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=i-SRT:%20Aligning%20Large%20Multimodal%20Models%20for%20Videos%20by%20Iterative%20Self-Retrospective%20Judgment&rft.jtitle=arXiv.org&rft.au=Ahn,%20Daechul&rft.date=2024-06-17&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3069344620%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3069344620&rft_id=info:pmid/&rfr_iscdi=true