HMTV: hierarchical multimodal transformer for video highlight query on baseball

With the increasing popularity of watching baseball videos, there is a growing desire among fans to enjoy the highlights of these videos. However, extracting highlights from lengthy baseball videos is a significant challenge due to its time-consuming and labor-intensive nature. To address this challenge, this paper proposes a novel mechanism, called Hierarchical Multimodal Transformer for Video query (HMTV). The proposed HMTV incorporates a two-phase process involving Coarse-Grained clipping for candidate videos and Fine-Grained identification for highlights. In the Coarse-Grained phase, a pitching detection model is employed to extract relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder and pre-trained Bidirectional Encoder Representations from Transformers (BERT) are utilized to capture relationship features between frames of candidate videos and words from users' questions, respectively. These relationship features are then fed into the Video Query (VideoQ) model, implemented by the Text Video Attention (TVA). The VideoQ model identifies the start and end positions of the highlights mentioned in the query within the candidate videos. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.

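The abstract sketches a two-stage pipeline: a Coarse-Grained pitching detector clips candidate segments, and a Fine-Grained VideoQ module fuses Transformer-encoded frame features with BERT-encoded query words via Text Video Attention (TVA) to predict the start and end of the highlight. The following is a minimal PyTorch sketch of that fine-grained span prediction; the module names, dimensions, and the cross-attention formulation are assumptions for illustration only, not the authors' implementation.

# Minimal sketch (not the authors' code): fine-grained highlight span prediction.
# Assumed inputs: frame features from a Transformer video encoder and word
# features from a pre-trained BERT encoder, both projected to a common dimension.
import torch
import torch.nn as nn

class TextVideoAttention(nn.Module):
    # Cross-attention from video frames (queries) to query words (keys/values),
    # standing in for the TVA module described in the abstract.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats, word_feats):
        fused, _ = self.attn(query=frame_feats, key=word_feats, value=word_feats)
        return fused  # (batch, num_frames, dim)

class VideoQ(nn.Module):
    # Predicts per-frame start/end logits for the queried highlight.
    def __init__(self, dim: int = 256):
        super().__init__()
        self.tva = TextVideoAttention(dim)
        self.span_head = nn.Linear(dim, 2)  # one logit each for start and end

    def forward(self, frame_feats, word_feats):
        fused = self.tva(frame_feats, word_feats)               # (B, T, dim)
        start_logits, end_logits = self.span_head(fused).unbind(dim=-1)
        return start_logits, end_logits                         # each (B, T)

# Toy usage: a 120-frame candidate clip and a 12-token query, feature dim 256.
frames = torch.randn(2, 120, 256)
words = torch.randn(2, 12, 256)
start_logits, end_logits = VideoQ(256)(frames, words)
start, end = start_logits.argmax(-1), end_logits.argmax(-1)     # predicted span per clip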

Bibliographic details
Published in: Multimedia systems, 2024-10, Vol. 30 (5), Article 285
Main authors: Zhang, Qiaoyun; Chang, Chih-Yung; Su, Ming-Yang; Chang, Hsiang-Chuan; Roy, Diptendu Sinha
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s00530-024-01479-6
ISSN: 0942-4962
EISSN: 1432-1882
Source: SpringerLink Journals
Subjects: Baseball; Coders; Computer Communication Networks; Computer Graphics; Computer Science; Cryptology; Data Storage Representation; Multimedia Information Systems; Operating Systems; Queries; Regular Paper; Transformers; Video