A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces

Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar s...

Full description

Saved in:
Bibliographic details
Published in: ACM transactions on software engineering and methodology 2023-07, Vol.32 (5), p.1-28, Article 123
Main authors: Wei, Hongwei; Su, Xiaohong; Gao, Cuiyun; Zheng, Weining; Tao, Wenxin
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 28
container_issue 5
container_start_page 1
container_title ACM transactions on software engineering and methodology
container_volume 32
creator Wei, Hongwei
Su, Xiaohong
Gao, Cuiyun
Zheng, Weining
Tao, Wenxin
description Software cross-modal retrieval, which covers tasks such as bug localization and code search, is a popular yet challenging research direction. Previous studies generally map natural language text and code into a homogeneous semantic space for similarity measurement. However, because of the semantic gap between the two modalities, it is difficult to accurately capture their shared semantics in a homogeneous semantic space. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated, using hypothesis testing, according to how well the sample set follows the distribution. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.
doi_str_mv 10.1145/3591868
format Article
fullrecord <record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3591868</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3591868</sourcerecordid><originalsourceid>FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</originalsourceid><addsrcrecordid>eNo9kEFLAzEUhIMoWKt495Sbp2iSt9lsjqVYKxQEt4K3Jd19aVe7m5JES_-9La2eZmA-hmEIuRX8QYhMPYIyosiLMzIQSmmmwcjzveeZYQDi45JcxfjJuQAuswFZjeh0t_FphbGNdI4xtf2SLWzEhk6C7XDrwxd1PtDSu7S1Aek4-BhZ5xu7pm-YQos_e9f2dIoJg19ij_470hI726e2puXG1hivyYWz64g3Jx2S98nTfDxls9fnl_FoxqwEkxhowVHLRjWLDBaqkQhOZ0ZjXkvHjURhRI7AsUGbF844m2sQGSipbWGMgiG5P_bWh5kBXbUJbWfDrhK8OhxUnQ7ak3dH0tbdP_QX_gKYamEB</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><source>ACM Digital Library</source><creator>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, Wenxin</creator><creatorcontrib>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, Wenxin</creatorcontrib><description>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). 
In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</description><identifier>ISSN: 1049-331X</identifier><identifier>EISSN: 1557-7392</identifier><identifier>DOI: 10.1145/3591868</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Search-based software engineering ; Software and its engineering</subject><ispartof>ACM transactions on software engineering and methodology, 2023-07, Vol.32 (5), p.1-28, Article 123</ispartof><rights>Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 
Request permissions from</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</cites><orcidid>0000-0002-8584-0716 ; 0000-0002-5607-1065 ; 0000-0003-4774-2434 ; 0000-0001-6818-5118 ; 0000-0003-3668-3600</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3591868$$EPDF$$P50$$Gacm$$H</linktopdf><link.rule.ids>314,777,781,2276,27905,27906,40177,75977</link.rule.ids></links><search><creatorcontrib>Wei, Hongwei</creatorcontrib><creatorcontrib>Su, Xiaohong</creatorcontrib><creatorcontrib>Gao, Cuiyun</creatorcontrib><creatorcontrib>Zheng, Weining</creatorcontrib><creatorcontrib>Tao, Wenxin</creatorcontrib><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><title>ACM transactions on software engineering and methodology</title><addtitle>ACM TOSEM</addtitle><description>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. 
Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</description><subject>Search-based software engineering</subject><subject>Software and its engineering</subject><issn>1049-331X</issn><issn>1557-7392</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNo9kEFLAzEUhIMoWKt495Sbp2iSt9lsjqVYKxQEt4K3Jd19aVe7m5JES_-9La2eZmA-hmEIuRX8QYhMPYIyosiLMzIQSmmmwcjzveeZYQDi45JcxfjJuQAuswFZjeh0t_FphbGNdI4xtf2SLWzEhk6C7XDrwxd1PtDSu7S1Aek4-BhZ5xu7pm-YQos_e9f2dIoJg19ij_470hI726e2puXG1hivyYWz64g3Jx2S98nTfDxls9fnl_FoxqwEkxhowVHLRjWLDBaqkQhOZ0ZjXkvHjURhRI7AsUGbF844m2sQGSipbWGMgiG5P_bWh5kBXbUJbWfDrhK8OhxUnQ7ak3dH0tbdP_QX_gKYamEB</recordid><startdate>20230721</startdate><enddate>20230721</enddate><creator>Wei, Hongwei</creator><creator>Su, Xiaohong</creator><creator>Gao, Cuiyun</creator><creator>Zheng, Weining</creator><creator>Tao, Wenxin</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-8584-0716</orcidid><orcidid>https://orcid.org/0000-0002-5607-1065</orcidid><orcidid>https://orcid.org/0000-0003-4774-2434</orcidid><orcidid>https://orcid.org/0000-0001-6818-5118</orcidid><orcidid>https://orcid.org/0000-0003-3668-3600</orcidid></search><sort><creationdate>20230721</creationdate><title>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</title><author>Wei, Hongwei ; Su, Xiaohong ; Gao, Cuiyun ; Zheng, Weining ; Tao, 
Wenxin</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a239t-3710e72d5db43b5d2e3f7497e6c2f092e1916e30edea68f9fa673143527a89953</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Search-based software engineering</topic><topic>Software and its engineering</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wei, Hongwei</creatorcontrib><creatorcontrib>Su, Xiaohong</creatorcontrib><creatorcontrib>Gao, Cuiyun</creatorcontrib><creatorcontrib>Zheng, Weining</creatorcontrib><creatorcontrib>Tao, Wenxin</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on software engineering and methodology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wei, Hongwei</au><au>Su, Xiaohong</au><au>Gao, Cuiyun</au><au>Zheng, Weining</au><au>Tao, Wenxin</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces</atitle><jtitle>ACM transactions on software engineering and methodology</jtitle><stitle>ACM TOSEM</stitle><date>2023-07-21</date><risdate>2023</risdate><volume>32</volume><issue>5</issue><spage>1</spage><epage>28</epage><pages>1-28</pages><artnum>123</artnum><issn>1049-331X</issn><eissn>1557-7392</eissn><abstract>Software cross-modal retrieval is a popular yet challenging direction, such as bug localization and code search. Previous studies generally map natural language texts and codes into a homogeneous semantic space for similarity measurement. However, it is not easy to accurately capture their similar semantics in a homogeneous semantic space due to the semantic gap. Therefore, we propose to map the multi-modal data into heterogeneous semantic spaces to capture their unique semantics. 
Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated according to how well the sample set obeys the distribution using hypothesis testing. The experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/3591868</doi><tpages>28</tpages><orcidid>https://orcid.org/0000-0002-8584-0716</orcidid><orcidid>https://orcid.org/0000-0002-5607-1065</orcidid><orcidid>https://orcid.org/0000-0003-4774-2434</orcidid><orcidid>https://orcid.org/0000-0001-6818-5118</orcidid><orcidid>https://orcid.org/0000-0003-3668-3600</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1049-331X
ispartof ACM transactions on software engineering and methodology, 2023-07, Vol.32 (5), p.1-28, Article 123
issn 1049-331X
1557-7392
language eng
recordid cdi_crossref_primary_10_1145_3591868
source ACM Digital Library
subjects Search-based software engineering
Software and its engineering
title A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T17%3A54%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Hypothesis%20Testing-based%20Framework%20for%20Software%20Cross-modal%20Retrieval%20in%20Heterogeneous%20Semantic%20Spaces&rft.jtitle=ACM%20transactions%20on%20software%20engineering%20and%20methodology&rft.au=Wei,%20Hongwei&rft.date=2023-07-21&rft.volume=32&rft.issue=5&rft.spage=1&rft.epage=28&rft.pages=1-28&rft.artnum=123&rft.issn=1049-331X&rft.eissn=1557-7392&rft_id=info:doi/10.1145/3591868&rft_dat=%3Cacm_cross%3E3591868%3C/acm_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
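
The description field above outlines a matching scheme in three steps: control flow paths (CFPs) of the code become a sample set, the text becomes a distribution, and a hypothesis test scores how well the samples follow that distribution. The Python sketch below is only a toy illustration of that idea and is not the DeepHT implementation. Everything concrete in it is an assumption made for illustration: the adjacency-dict control flow graph, the hypothetical `NODE_WEIGHT` table that stands in for a learned path encoder, the use of a one-dimensional normal distribution for the text, and the choice of a one-sample Kolmogorov-Smirnov statistic as the test. The record does not say which encoders or which test the authors use.

```python
import math

def enumerate_paths(cfg, entry, exit_node):
    """Return every acyclic entry-to-exit path of a CFG given as an adjacency dict."""
    paths, stack = [], [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == exit_node:
            paths.append(path)
            continue
        for succ in cfg.get(node, []):
            if succ not in path:  # skip back edges so loops do not expand forever
                stack.append((succ, path + [succ]))
    return paths

def normal_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(samples, mean, std):
    """One-sample Kolmogorov-Smirnov statistic D against N(mean, std^2)."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, std)
        # compare the model CDF with the empirical CDF just after and just before x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def matching_score(samples, mean, std):
    """Higher when the samples follow the distribution more closely (assumed scoring rule)."""
    return 1.0 - ks_statistic(samples, mean, std)

# Toy CFG for: if cond: then-branch else: else-branch (hypothetical example).
CFG = {"entry": ["cond"], "cond": ["then", "else"], "then": ["exit"], "else": ["exit"]}

# Hypothetical per-node weights standing in for a learned CFP encoder.
NODE_WEIGHT = {"entry": 0.0, "cond": 0.1, "then": 0.5, "else": -0.5, "exit": 0.0}

def path_feature(path):
    """Map one CFP to a scalar sample (placeholder for a learned embedding)."""
    return sum(NODE_WEIGHT[node] for node in path)

cfp_samples = [path_feature(p) for p in enumerate_paths(CFG, "entry", "exit")]

# Two candidate "text distributions": one near the samples, one far away.
score_related = matching_score(cfp_samples, mean=0.0, std=1.0)
score_unrelated = matching_score(cfp_samples, mean=5.0, std=1.0)
```

In this toy setup the text whose distribution lies near the CFP samples receives the higher score, which is the ranking behaviour the description attributes to hypothesis-testing-based matching. A real system would replace the scalar feature and the fixed normal parameters with learned, higher-dimensional representations.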