Feature engineering for MEDLINE citation categorization with MeSH

Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of bi...

Bibliographic details
Published in:	BMC bioinformatics 2015-04, Vol.16 (1), p.113-113, Article 113
Main authors:	Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R
Format:	Article
Language:	eng
Subjects:	Abstracting and Indexing as Topic - methods; Algorithms; Analysis; Artificial Intelligence; Comparative analysis; Data mining; Humans; Information Storage and Retrieval; Medical Subject Headings; MEDLINE; Semantics
Online access:	Full text

container_end_page	113
container_issue	1
container_start_page	113
container_title	BMC bioinformatics
container_volume	16
creator	Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R
description	Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. 
Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.
doi_str_mv	10.1186/s12859-015-0539-7
format	Article
fullrecord	<record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4407321</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A541358755</galeid><sourcerecordid>A541358755</sourcerecordid><originalsourceid>FETCH-LOGICAL-c570t-1a925459c403494279a4e66ea2f6cfd86184750dcdb64445e5b6285f4b4a4c9a3</originalsourceid><addsrcrecordid>eNptkV1rFDEUhoNYbK3-AG9kwBu9mDaZyeeNsNStXdhWsHodspkz08hs0iYZv369WaaWLpRcJCfneV8450XoDcEnhEh-mkgjmaoxYTVmrarFM3REqCB1QzB7_uh9iF6m9ANjIiRmL9Bhw6QUQjVHaHEOJk8RKvCD8wDR-aHqQ6wul5_Wq6tlZV022QVfWZNhCNH9nctfLt9Ul3B98Qod9GZM8Pr-Pkbfz5ffzi7q9ZfPq7PFurZM4FwToxpGmbIUt1TRRihDgXMwTc9t30lOJBUMd7bbcEopA7bhZbqebqihVpn2GH2cfW-nzRY6Cz5HM-rb6LYm_tHBOL3f8e5GD-GnphSLtiHF4P29QQx3E6Ssty5ZGEfjIUxJEy6Y5EoyWdB3MzqYEbTzfSiOdofrBaOkZVIwVqiTJ6hyOtg6Gzz0rvzvCT7sCQqT4XcezJSSXl1_3WfJzNoYUorQP0xKsN6lr-f0dUlf79LXomjePl7Rg-J_3O0_BxCn9g</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1675869858</pqid></control><display><type>article</type><title>Feature engineering for MEDLINE citation categorization with MeSH</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>PubMed Central Open Access</source><source>Springer Nature OA Free Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>SpringerLink Journals - AutoHoldings</source><creator>Jimeno Yepes, Antonio Jose ; Plaza, Laura ; Carrillo-de-Albornoz, Jorge ; Mork, James G ; Aronson, Alan R</creator><creatorcontrib>Jimeno Yepes, Antonio Jose ; Plaza, Laura ; Carrillo-de-Albornoz, Jorge ; Mork, James G ; Aronson, Alan R</creatorcontrib><description>Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. 
In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. 
Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.</description><identifier>ISSN: 1471-2105</identifier><identifier>EISSN: 1471-2105</identifier><identifier>DOI: 10.1186/s12859-015-0539-7</identifier><identifier>PMID: 25887792</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>Abstracting and Indexing as Topic - methods ; Algorithms ; Analysis ; Artificial Intelligence ; Comparative analysis ; Data mining ; Humans ; Information Storage and Retrieval ; Medical Subject Headings ; MEDLINE ; Semantics</subject><ispartof>BMC bioinformatics, 2015-04, Vol.16 (1), p.113-113, Article 113</ispartof><rights>COPYRIGHT 2015 BioMed Central Ltd.</rights><rights>Jimeno Yepes et al.; licensee BioMed Central. 2015</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c570t-1a925459c403494279a4e66ea2f6cfd86184750dcdb64445e5b6285f4b4a4c9a3</citedby><cites>FETCH-LOGICAL-c570t-1a925459c403494279a4e66ea2f6cfd86184750dcdb64445e5b6285f4b4a4c9a3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4407321/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4407321/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,864,885,27922,27923,53789,53791</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/25887792$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Jimeno Yepes, Antonio Jose</creatorcontrib><creatorcontrib>Plaza, Laura</creatorcontrib><creatorcontrib>Carrillo-de-Albornoz, Jorge</creatorcontrib><creatorcontrib>Mork, James 
G</creatorcontrib><creatorcontrib>Aronson, Alan R</creatorcontrib><title>Feature engineering for MEDLINE citation categorization with MeSH</title><title>BMC bioinformatics</title><addtitle>BMC Bioinformatics</addtitle><description>Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. 
Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.</description><subject>Abstracting and Indexing as Topic - methods</subject><subject>Algorithms</subject><subject>Analysis</subject><subject>Artificial Intelligence</subject><subject>Comparative analysis</subject><subject>Data mining</subject><subject>Humans</subject><subject>Information Storage and Retrieval</subject><subject>Medical Subject Headings</subject><subject>MEDLINE</subject><subject>Semantics</subject><issn>1471-2105</issn><issn>1471-2105</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNptkV1rFDEUhoNYbK3-AG9kwBu9mDaZyeeNsNStXdhWsHodspkz08hs0iYZv369WaaWLpRcJCfneV8450XoDcEnhEh-mkgjmaoxYTVmrarFM3REqCB1QzB7_uh9iF6m9ANjIiRmL9Bhw6QUQjVHaHEOJk8RKvCD8wDR-aHqQ6wul5_Wq6tlZV022QVfWZNhCNH9nctfLt9Ul3B98Qod9GZM8Pr-Pkbfz5ffzi7q9ZfPq7PFurZM4FwToxpGmbIUt1TRRihDgXMwTc9t30lOJBUMd7bbcEopA7bhZbqebqihVpn2GH2cfW-nzRY6Cz5HM-rb6LYm_tHBOL3f8e5GD-GnphSLtiHF4P29QQx3E6Ssty5ZGEfjIUxJEy6Y5EoyWdB3MzqYEbTzfSiOdofrBaOkZVIwVqiTJ6hyOtg6Gzz0rvzvCT7sCQqT4XcezJSSXl1_3WfJzNoYUorQP0xKsN6lr-f0dUlf79LXomjePl7Rg-J_3O0_BxCn9g</recordid><startdate>20150408</startdate><enddate>20150408</enddate><creator>Jimeno Yepes, Antonio Jose</creator><creator>Plaza, Laura</creator><creator>Carrillo-de-Albornoz, Jorge</creator><creator>Mork, James G</creator><creator>Aronson, Alan R</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20150408</creationdate><title>Feature engineering for MEDLINE citation categorization with MeSH</title><author>Jimeno Yepes, Antonio Jose ; Plaza, Laura ; Carrillo-de-Albornoz, Jorge ; Mork, James G ; Aronson, Alan R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c570t-1a925459c403494279a4e66ea2f6cfd86184750dcdb64445e5b6285f4b4a4c9a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Abstracting and Indexing as Topic - methods</topic><topic>Algorithms</topic><topic>Analysis</topic><topic>Artificial Intelligence</topic><topic>Comparative analysis</topic><topic>Data mining</topic><topic>Humans</topic><topic>Information Storage and Retrieval</topic><topic>Medical Subject Headings</topic><topic>MEDLINE</topic><topic>Semantics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Jimeno Yepes, Antonio Jose</creatorcontrib><creatorcontrib>Plaza, Laura</creatorcontrib><creatorcontrib>Carrillo-de-Albornoz, Jorge</creatorcontrib><creatorcontrib>Mork, James G</creatorcontrib><creatorcontrib>Aronson, Alan R</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>BMC bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Jimeno Yepes, Antonio 
Jose</au><au>Plaza, Laura</au><au>Carrillo-de-Albornoz, Jorge</au><au>Mork, James G</au><au>Aronson, Alan R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Feature engineering for MEDLINE citation categorization with MeSH</atitle><jtitle>BMC bioinformatics</jtitle><addtitle>BMC Bioinformatics</addtitle><date>2015-04-08</date><risdate>2015</risdate><volume>16</volume><issue>1</issue><spage>113</spage><epage>113</epage><pages>113-113</pages><artnum>113</artnum><issn>1471-2105</issn><eissn>1471-2105</eissn><abstract>Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. 
We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>25887792</pmid><doi>10.1186/s12859-015-0539-7</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1471-2105
ispartof	BMC bioinformatics, 2015-04, Vol.16 (1), p.113-113, Article 113
issn	1471-2105; 1471-2105
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4407321
source	MEDLINE; DOAJ Directory of Open Access Journals; PubMed Central Open Access; Springer Nature OA Free Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central; SpringerLink Journals - AutoHoldings
subjects	Abstracting and Indexing as Topic - methods; Algorithms; Analysis; Artificial Intelligence; Comparative analysis; Data mining; Humans; Information Storage and Retrieval; Medical Subject Headings; MEDLINE; Semantics
title	Feature engineering for MEDLINE citation categorization with MeSH
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T13%3A20%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20engineering%20for%20MEDLINE%20citation%20categorization%20with%20MeSH&rft.jtitle=BMC%20bioinformatics&rft.au=Jimeno%20Yepes,%20Antonio%20Jose&rft.date=2015-04-08&rft.volume=16&rft.issue=1&rft.spage=113&rft.epage=113&rft.pages=113-113&rft.artnum=113&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/s12859-015-0539-7&rft_dat=%3Cgale_pubme%3EA541358755%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1675869858&rft_id=info:pmid/25887792&rft_galeid=A541358755&rfr_iscdi=true