PWS-DVC: Enhancing Weakly Supervised Dense Video Captioning with Pretraining Approach

Bibliographic details
Published in: IEEE Access, 2023-01, Vol. 11, p. 1-1
Main authors: Choi, Wangyu; Chen, Jiasi; Yoon, Jongwon
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1
container_issue
container_start_page 1
container_title IEEE access
container_volume 11
creator Choi, Wangyu
Chen, Jiasi
Yoon, Jongwon
description In recent years, efforts to jointly understand vision and language have grown substantially, driven by the availability of video datasets and advances in language models in natural language processing. Dense video captioning poses a significant challenge: understanding an untrimmed video and generating several event-based sentences that describe it. Numerous efforts have sought to improve dense video captioning through various approaches, such as bottom-up, top-down, parallel-pipeline, and pretraining methods. In contrast, weakly supervised dense video captioning is a promising strategy that generates dense captions from captions alone, without any knowledge of ground-truth events, which distinguishes it from the widely employed approaches. However, this approach has the drawback that inadequate captions can hurt both event localization and captioning. This paper introduces PWS-DVC, a novel approach aimed at improving weakly supervised dense video captioning. PWS-DVC's event captioning module is first trained on video-clip datasets, which are widely accessible, exploiting the fact that no ground-truth event annotations are used during training. It is then fine-tuned specifically for dense video captioning. To demonstrate the efficacy of PWS-DVC, we conduct comparative experiments with state-of-the-art methods on the ActivityNet Captions dataset. The results show that PWS-DVC outperforms current approaches to weakly supervised dense video captioning.
doi_str_mv 10.1109/ACCESS.2023.3331756
format Article
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2023-01, Vol.11, p.1-1
issn 2169-3536
2169-3536
language eng
recordid cdi_crossref_primary_10_1109_ACCESS_2023_3331756
source IEEE Open Access Journals; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals
subjects Context modeling
Cross-modal video-text comprehension
Datasets
Dense video captioning
Dogs
Effectiveness
event localization in videos
fine-tuning for dense captioning
Location awareness
Natural language processing
natural language processing in videos
Pretraining
Task analysis
Training
Transformers
Uncertainty
Weakly supervised
title PWS-DVC: Enhancing Weakly Supervised Dense Video Captioning with Pretraining Approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-13T03%3A00%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PWS-DVC:%20Enhancing%20Weakly%20Supervised%20Dense%20Video%20Captioning%20with%20Pretraining%20Approach&rft.jtitle=IEEE%20access&rft.au=Choi,%20Wangyu&rft.date=2023-01-01&rft.volume=11&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2023.3331756&rft_dat=%3Cproquest_cross%3E2892376555%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2892376555&rft_id=info:pmid/&rft_ieee_id=10314490&rft_doaj_id=oai_doaj_org_article_f87ea683316e4289a61f19cf9550ca74&rfr_iscdi=true
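
Note: the description above outlines a two-stage training recipe, i.e. the event captioning module is first pretrained on widely available video-clip caption datasets and then fine-tuned for dense video captioning. The minimal PyTorch sketch below only illustrates that pretrain-then-fine-tune workflow; the toy EventCaptioner architecture, feature dimensions, and randomly generated tensors are placeholder assumptions for illustration, not the authors' model or data pipeline.

# Hypothetical two-stage sketch (not the authors' released code): stage 1
# pretrains a caption decoder on clip-caption pairs; stage 2 fine-tunes the
# same decoder on event segments taken from untrimmed videos.
import torch
import torch.nn as nn

VOCAB, FEAT_DIM, HID = 1000, 512, 256   # assumed sizes, for illustration only

class EventCaptioner(nn.Module):
    """Toy captioning module: per-frame clip features -> token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, clip_feats):                  # (batch, time, FEAT_DIM)
        hidden, _ = self.rnn(self.proj(clip_feats))
        return self.head(hidden)                    # (batch, time, VOCAB)

def train_step(model, optimizer, feats, tokens):
    """One supervised step: predict caption tokens from visual features."""
    logits = model(feats)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                       tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = EventCaptioner()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: pretraining on (trimmed clip, caption) pairs from video-clip datasets.
clip_feats = torch.randn(8, 20, FEAT_DIM)           # dummy clip features
clip_tokens = torch.randint(0, VOCAB, (8, 20))      # dummy caption token ids
print("pretrain loss:", train_step(model, opt, clip_feats, clip_tokens))

# Stage 2: fine-tuning on candidate event segments from untrimmed videos; in the
# weakly supervised setting the segments are proposed, not ground-truth events.
event_feats = torch.randn(8, 20, FEAT_DIM)
event_tokens = torch.randint(0, VOCAB, (8, 20))
print("fine-tune loss:", train_step(model, opt, event_feats, event_tokens))

Because no ground-truth event annotations are consumed in either stage, any clip-caption corpus can in principle serve as the pretraining source, which is the point the description makes about the weakly supervised setting.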