An in-text citation classification predictive model for a scholarly search system

We argue that citations in scholarly documents do not always perform equivalent functions or possess equal importance. To address this problem, we worked with a corpus of over 21 k citations from the Association for Computational Linguistics, from which 465 citations were randomly annotated by exper...

Bibliographic details
Published in: Scientometrics 2021-07, Vol.126 (7), p.5509-5529
Main authors: Aljohani, Naif Radi ; Fayoumi, Ayman ; Hassan, Saeed-Ul
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 5529
container_issue 7
container_start_page 5509
container_title Scientometrics
container_volume 126
creator Aljohani, Naif Radi
Fayoumi, Ayman
Hassan, Saeed-Ul
description We argue that citations in scholarly documents do not always perform equivalent functions or possess equal importance. To address this problem, we worked with a corpus of over 21 k citations from the Association for Computational Linguistics, from which 465 citations were randomly annotated by experts as either important or unimportant. We used an array of machine-learning models on these annotated citations: Random Forest (RF); Support Vector Machine (SVM); and Decision Tree (DT). For the classification task, the selected models employed 15 novel features: contextual; quantitative; and qualitative. We show that the RF model outperformed the comparative model by 9.52%, achieving a 92% precision-recall area under the curve. We present a prototype of a scientific publication search system based on the RF prediction model for feature engineering. This was used on a dataset of 4138 full-text articles indexed by PLOS ONE that consists of 31,839 unique references. The empirical evaluation shows that the proposed search system improves visibility of a given scientific document by including, along with its index terms, terms from the works that it cites that are predicted to be important. Overall, this yields improved search results against the queries by the user.
doi_str_mv 10.1007/s11192-021-03986-z
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2544895453</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2544895453</sourcerecordid><originalsourceid>FETCH-LOGICAL-c319t-46eff9e874014356a4367e3ee80812afe7bec4874fde7660839e8860749f1ee83</originalsourceid><addsrcrecordid>eNp9kE1LAzEQhoMoWKt_wFPAczSzyWazx1L8goIIeg4xndiU7W5NUrH-eqMrePM0DPO878BDyDnwS-C8uUoA0FaMV8C4aLVinwdkArXWrNIKDsmEg9CsBcGPyUlKa15CgusJeZz1NPQs40emLmSbw9BT19mUgg9uXLcRl8Hl8I50Myyxo36I1NLkVkNnY7enCW10K5r2KePmlBx52yU8-51T8nxz_TS_Y4uH2_v5bMGcgDYzqdD7FnUjOUhRKyuFalAgaq6hsh6bF3SynP0SG6W4FgXWijey9VAoMSUXY-82Dm87TNmsh13sy0tT1VLqtpa1KFQ1Ui4OKUX0ZhvDxsa9AW6-1ZlRnSnqzI8681lCYgylAvevGP-q_0l9AS0Ecig</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2544895453</pqid></control><display><type>article</type><title>An in-text citation classification predictive model for a scholarly search system</title><source>SpringerLink Journals - AutoHoldings</source><creator>Aljohani, Naif Radi ; Fayoumi, Ayman ; Hassan, Saeed-Ul</creator><creatorcontrib>Aljohani, Naif Radi ; Fayoumi, Ayman ; Hassan, Saeed-Ul</creatorcontrib><description>We argue that citations in scholarly documents do not always perform equivalent functions or possess equal importance. To address this problem, we worked with a corpus of over 21 k citations from the Association for Computational Linguistics, from which 465 citations were randomly annotated by experts as either important or unimportant. We used an array of machine-learning models on these annotated citations: Random Forest (RF); Support Vector Machine (SVM); and Decision Tree (DT). For the classification task, the selected models employed 15 novel features: contextual; quantitative; and qualitative. We show that the RF model outperformed the comparative model by 9.52%, achieving a 92% precision-recall area under the curve. 
We present a prototype of a scientific publication search system based on the RF prediction model for feature engineering. This was used on a dataset of 4138 full-text articles indexed by PLOS ONE that consists of 31,839 unique references. The empirical evaluation shows that the proposed search system improves visibility of a given scientific document by including, along with its index terms, terms from the works that it cites that are predicted to be important. Overall, this yields improved search results against the queries by the user.</description><identifier>ISSN: 0138-9130</identifier><identifier>EISSN: 1588-2861</identifier><identifier>DOI: 10.1007/s11192-021-03986-z</identifier><language>eng</language><publisher>Cham: Springer International Publishing</publisher><subject>Bibliometrics ; Citations ; Classification ; Computational linguistics ; Computer applications ; Computer Science ; Decision trees ; Information Storage and Retrieval ; Learning algorithms ; Library Science ; Linguistics ; Machine learning ; Mental task performance ; Prediction models ; Prototypes ; Searching ; Support vector machines ; Visibility</subject><ispartof>Scientometrics, 2021-07, Vol.126 (7), p.5509-5529</ispartof><rights>Akadémiai Kiadó, Budapest, Hungary 2021</rights><rights>Akadémiai Kiadó, Budapest, Hungary 
2021.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c319t-46eff9e874014356a4367e3ee80812afe7bec4874fde7660839e8860749f1ee83</citedby><cites>FETCH-LOGICAL-c319t-46eff9e874014356a4367e3ee80812afe7bec4874fde7660839e8860749f1ee83</cites><orcidid>0000-0002-6509-9190</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11192-021-03986-z$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11192-021-03986-z$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,778,782,27907,27908,41471,42540,51302</link.rule.ids></links><search><creatorcontrib>Aljohani, Naif Radi</creatorcontrib><creatorcontrib>Fayoumi, Ayman</creatorcontrib><creatorcontrib>Hassan, Saeed-Ul</creatorcontrib><title>An in-text citation classification predictive model for a scholarly search system</title><title>Scientometrics</title><addtitle>Scientometrics</addtitle><description>We argue that citations in scholarly documents do not always perform equivalent functions or possess equal importance. To address this problem, we worked with a corpus of over 21 k citations from the Association for Computational Linguistics, from which 465 citations were randomly annotated by experts as either important or unimportant. We used an array of machine-learning models on these annotated citations: Random Forest (RF); Support Vector Machine (SVM); and Decision Tree (DT). For the classification task, the selected models employed 15 novel features: contextual; quantitative; and qualitative. We show that the RF model outperformed the comparative model by 9.52%, achieving a 92% precision-recall area under the curve. We present a prototype of a scientific publication search system based on the RF prediction model for feature engineering. 
This was used on a dataset of 4138 full-text articles indexed by PLOS ONE that consists of 31,839 unique references. The empirical evaluation shows that the proposed search system improves visibility of a given scientific document by including, along with its index terms, terms from the works that it cites that are predicted to be important. Overall, this yields improved search results against the queries by the user.</description><subject>Bibliometrics</subject><subject>Citations</subject><subject>Classification</subject><subject>Computational linguistics</subject><subject>Computer applications</subject><subject>Computer Science</subject><subject>Decision trees</subject><subject>Information Storage and Retrieval</subject><subject>Learning algorithms</subject><subject>Library Science</subject><subject>Linguistics</subject><subject>Machine learning</subject><subject>Mental task performance</subject><subject>Prediction models</subject><subject>Prototypes</subject><subject>Searching</subject><subject>Support vector machines</subject><subject>Visibility</subject><issn>0138-9130</issn><issn>1588-2861</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp9kE1LAzEQhoMoWKt_wFPAczSzyWazx1L8goIIeg4xndiU7W5NUrH-eqMrePM0DPO878BDyDnwS-C8uUoA0FaMV8C4aLVinwdkArXWrNIKDsmEg9CsBcGPyUlKa15CgusJeZz1NPQs40emLmSbw9BT19mUgg9uXLcRl8Hl8I50Myyxo36I1NLkVkNnY7enCW10K5r2KePmlBx52yU8-51T8nxz_TS_Y4uH2_v5bMGcgDYzqdD7FnUjOUhRKyuFalAgaq6hsh6bF3SynP0SG6W4FgXWijey9VAoMSUXY-82Dm87TNmsh13sy0tT1VLqtpa1KFQ1Ui4OKUX0ZhvDxsa9AW6-1ZlRnSnqzI8681lCYgylAvevGP-q_0l9AS0Ecig</recordid><startdate>20210701</startdate><enddate>20210701</enddate><creator>Aljohani, Naif Radi</creator><creator>Fayoumi, Ayman</creator><creator>Hassan, Saeed-Ul</creator><general>Springer International Publishing</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7T9</scope><scope>E3H</scope><scope>F2A</scope><orcidid>https://orcid.org/0000-0002-6509-9190</orcidid></search><sort><creationdate>20210701</creationdate><title>An in-text citation classification predictive model for a scholarly search system</title><author>Aljohani, Naif Radi ; Fayoumi, Ayman ; Hassan, Saeed-Ul</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c319t-46eff9e874014356a4367e3ee80812afe7bec4874fde7660839e8860749f1ee83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Bibliometrics</topic><topic>Citations</topic><topic>Classification</topic><topic>Computational linguistics</topic><topic>Computer applications</topic><topic>Computer Science</topic><topic>Decision trees</topic><topic>Information Storage and Retrieval</topic><topic>Learning algorithms</topic><topic>Library Science</topic><topic>Linguistics</topic><topic>Machine learning</topic><topic>Mental task performance</topic><topic>Prediction models</topic><topic>Prototypes</topic><topic>Searching</topic><topic>Support vector machines</topic><topic>Visibility</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Aljohani, Naif Radi</creatorcontrib><creatorcontrib>Fayoumi, Ayman</creatorcontrib><creatorcontrib>Hassan, Saeed-Ul</creatorcontrib><collection>CrossRef</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>Library &amp; Information Sciences Abstracts (LISA)</collection><collection>Library &amp; Information Science Abstracts (LISA)</collection><jtitle>Scientometrics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Aljohani, Naif Radi</au><au>Fayoumi, Ayman</au><au>Hassan, 
Saeed-Ul</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An in-text citation classification predictive model for a scholarly search system</atitle><jtitle>Scientometrics</jtitle><stitle>Scientometrics</stitle><date>2021-07-01</date><risdate>2021</risdate><volume>126</volume><issue>7</issue><spage>5509</spage><epage>5529</epage><pages>5509-5529</pages><issn>0138-9130</issn><eissn>1588-2861</eissn><abstract>We argue that citations in scholarly documents do not always perform equivalent functions or possess equal importance. To address this problem, we worked with a corpus of over 21 k citations from the Association for Computational Linguistics, from which 465 citations were randomly annotated by experts as either important or unimportant. We used an array of machine-learning models on these annotated citations: Random Forest (RF); Support Vector Machine (SVM); and Decision Tree (DT). For the classification task, the selected models employed 15 novel features: contextual; quantitative; and qualitative. We show that the RF model outperformed the comparative model by 9.52%, achieving a 92% precision-recall area under the curve. We present a prototype of a scientific publication search system based on the RF prediction model for feature engineering. This was used on a dataset of 4138 full-text articles indexed by PLOS ONE that consists of 31,839 unique references. The empirical evaluation shows that the proposed search system improves visibility of a given scientific document by including, along with its index terms, terms from the works that it cites that are predicted to be important. Overall, this yields improved search results against the queries by the user.</abstract><cop>Cham</cop><pub>Springer International Publishing</pub><doi>10.1007/s11192-021-03986-z</doi><tpages>21</tpages><orcidid>https://orcid.org/0000-0002-6509-9190</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 0138-9130
ispartof Scientometrics, 2021-07, Vol.126 (7), p.5509-5529
issn 0138-9130
1588-2861
language eng
recordid cdi_proquest_journals_2544895453
source SpringerLink Journals - AutoHoldings
subjects Bibliometrics
Citations
Classification
Computational linguistics
Computer applications
Computer Science
Decision trees
Information Storage and Retrieval
Learning algorithms
Library Science
Linguistics
Machine learning
Mental task performance
Prediction models
Prototypes
Searching
Support vector machines
Visibility
title An in-text citation classification predictive model for a scholarly search system
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T03%3A33%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20in-text%20citation%20classification%20predictive%20model%20for%20a%20scholarly%20search%20system&rft.jtitle=Scientometrics&rft.au=Aljohani,%20Naif%20Radi&rft.date=2021-07-01&rft.volume=126&rft.issue=7&rft.spage=5509&rft.epage=5529&rft.pages=5509-5529&rft.issn=0138-9130&rft.eissn=1588-2861&rft_id=info:doi/10.1007/s11192-021-03986-z&rft_dat=%3Cproquest_cross%3E2544895453%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2544895453&rft_id=info:pmid/&rfr_iscdi=true
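
The description field above says the proposed search system indexes a document under its own terms plus terms from the cited works that are predicted to be important. The following is a minimal sketch of that idea only, not the authors' implementation: the function names `build_index` and `search`, the toy documents, and the hard-coded `True`/`False` importance flags (standing in for the output of a classifier such as the Random Forest model) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs, expand=True):
    """Map each term to the set of document ids it should retrieve."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        terms = set(doc["terms"])
        if expand:
            for cited_id, important in doc["cites"].items():
                # Only citations flagged as important contribute their terms.
                if important and cited_id in docs:
                    terms |= set(docs[cited_id]["terms"])
        for term in terms:
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Toy, made-up data: the boolean flags are hypothetical classifier predictions.
DOCS = {
    "A": {"terms": ["citation", "classification"], "cites": {"B": True, "C": False}},
    "B": {"terms": ["random", "forest"], "cites": {}},
    "C": {"terms": ["visibility"], "cites": {}},
}

plain = build_index(DOCS, expand=False)
expanded = build_index(DOCS, expand=True)

print(search(plain, "forest"))         # only the document that owns the term
print(search(expanded, "forest"))      # also the document citing it as important
print(search(expanded, "visibility"))  # an unimportant citation adds nothing
```

In this sketch, document A becomes retrievable for "forest" because it cites B and that citation is flagged important, while its citation of C, flagged unimportant, leaves A's index entry unchanged.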