A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces

Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar s...

Bibliographic Details
Published in:	ACM transactions on software engineering and methodology 2023-07, Vol.32 (5), p.1-28, Article 123
Main Authors:	Wei, Hongwei; Su, Xiaohong; Gao, Cuiyun; Zheng, Weining; Tao, Wenxin
Format:	Article
Language:	eng
Subjects:	Search-based software engineering; Software and its engineering
Online Access:	Full text

container_end_page	28
container_issue	5
container_start_page	1
container_title	ACM transactions on software engineering and methodology
container_volume	32
creator	Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, Wenxin
description	Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.
doi_str_mv	10.1145/3591868
format	Article
fullrecord	<record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3591868</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3591868</sourcerecordid><originalsourceid>FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</originalsourceid><addsrcrecordid>eNo9kEFLAzEUhIMoWKt495Sbp2iSt9lsjqVYKxQEt4K3Jd19aVe7m5JES_-9La2eZmA-hmEIuRX8QYhMPYIyosiLMzIQSmmmwcjzveeZYQDi45JcxfjJuQAuswFZjeh0t_FphbGNdI4xtf2SLWzEhk6C7XDrwxd1PtDSu7S1Aek4-BhZ5xu7pm-YQos_e9f2dIoJg19ij_470hI726e2puXG1hivyYWz64g3Jx2S98nTfDxls9fnl_FoxqwEkxhowVHLRjWLDBaqkQhOZ0ZjXkvHjURhRI7AsUGbF844m2sQGSipbWGMgiG5P_bWh5kBXbUJbWfDrhK8OhxUnQ7ak3dH0tbdP_QX_gKYamEB</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><source>ACM Digital Library</source><creator>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, Wenxin</creator><creatorcontrib>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, Wenxin</creatorcontrib><description>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). 
In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</description><identifier>ISSN: 1049-331X</identifier><identifier>EISSN: 1557-7392</identifier><identifier>DOI: 10.1145/3591868</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Search-based software engineering ; Software and its engineering</subject><ispartof>ACM transactions on software engineering and methodology, 2023-07, Vol.32 (5), p.1-28, Article 123</ispartof><rights>Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 
Request permissions from</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</cites><orcidid>0000-0002-8584-0716 ; 0000-0002-5607-1065 ; 0000-0003-4774-2434 ; 0000-0001-6818-5118 ; 0000-0003-3668-3600</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3591868$$EPDF$$P50$$Gacm$$H</linktopdf><link.rule.ids>314,777,781,2276,27905,27906,40177,75977</link.rule.ids></links><search><creatorcontrib>Wei, Hongwei</creatorcontrib><creatorcontrib>Su, Xiaohong</creatorcontrib><creatorcontrib>Gao, Cuiyun</creatorcontrib><creatorcontrib>Zheng, Weining</creatorcontrib><creatorcontrib>Tao, Wenxin</creatorcontrib><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><title>ACM transactions on software engineering and methodology</title><addtitle>ACM TOSEM</addtitle><description>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. 
Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</description><subject>Search-based software engineering</subject><subject>Software and its engineering</subject><issn>1049-331X</issn><issn>1557-7392</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNo9kEFLAzEUhIMoWKt495Sbp2iSt9lsjqVYKxQEt4K3Jd19aVe7m5JES_-9La2eZmA-hmEIuRX8QYhMPYIyosiLMzIQSmmmwcjzveeZYQDi45JcxfjJuQAuswFZjeh0t_FphbGNdI4xtf2SLWzEhk6C7XDrwxd1PtDSu7S1Aek4-BhZ5xu7pm-YQos_e9f2dIoJg19ij_470hI726e2puXG1hivyYWz64g3Jx2S98nTfDxls9fnl_FoxqwEkxhowVHLRjWLDBaqkQhOZ0ZjXkvHjURhRI7AsUGbF844m2sQGSipbWGMgiG5P_bWh5kBXbUJbWfDrhK8OhxUnQ7ak3dH0tbdP_QX_gKYamEB</recordid><startdate>20230721</startdate><enddate>20230721</enddate><creator>Wei, Hongwei</creator><creator>Su, Xiaohong</creator><creator>Gao, Cuiyun</creator><creator>Zheng, Weining</creator><creator>Tao, Wenxin</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-8584-0716</orcidid><orcidid>https://orcid.org/0000-0002-5607-1065</orcidid><orcidid>https://orcid.org/0000-0003-4774-2434</orcidid><orcidid>https://orcid.org/0000-0001-6818-5118</orcidid><orcidid>https://orcid.org/0000-0003-3668-3600</orcidid></search><sort><creationdate>20230721</creationdate><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><author>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, 
Wenxin</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Search-based software engineering</topic><topic>Software and its engineering</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wei, Hongwei</creatorcontrib><creatorcontrib>Su, Xiaohong</creatorcontrib><creatorcontrib>Gao, Cuiyun</creatorcontrib><creatorcontrib>Zheng, Weining</creatorcontrib><creatorcontrib>Tao, Wenxin</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on software engineering and methodology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wei, Hongwei</au><au>Su, Xiaohong</au><au>Gao, Cuiyun</au><au>Zheng, Weining</au><au>Tao, Wenxin</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</atitle><jtitle>ACM transactions on software engineering and methodology</jtitle><stitle>ACM TOSEM</stitle><date>2023-07-21</date><risdate>2023</risdate><volume>32</volume><issue>5</issue><spage>1</spage><epage>28</epage><pages>1-28</pages><artnum>123</artnum><issn>1049-331X</issn><eissn>1557-7392</eissn><abstract>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. 
Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/3591868</doi><tpages>28</tpages><orcidid>https://orcid.org/0000-0002-8584-0716</orcidid><orcidid>https://orcid.org/0000-0002-5607-1065</orcidid><orcidid>https://orcid.org/0000-0003-4774-2434</orcidid><orcidid>https://orcid.org/0000-0001-6818-5118</orcidid><orcidid>https://orcid.org/0000-0003-3668-3600</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1049-331X
ispartof	ACM transactions on software engineering and methodology, 2023-07, Vol.32 (5), p.1-28, Article 123
issn	1049-331X 1557-7392
language	eng
recordid	cdi_crossref_primary_10_1145_3591868
source	ACM Digital Library
subjects	Search-based software engineering ; Software and its engineering
title	A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T17%3A54%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Hypothesis%20Testing-based%20Framework%20for%20Software%20Cross-modal%20Retrieval%20in%20Heterogeneous%20Semantic%20Spaces&rft.jtitle=ACM%20transactions%20on%20software%20engineering%20and%20methodology&rft.au=Wei,%20Hongwei&rft.date=2023-07-21&rft.volume=32&rft.issue=5&rft.spage=1&rft.epage=28&rft.pages=1-28&rft.artnum=123&rft.issn=1049-331X&rft.eissn=1557-7392&rft_id=info:doi/10.1145/3591868&rft_dat=%3Cacm_cross%3E3591868%3C/acm_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true