Literature mining of host―pathogen interactions: comparing feature-based supervised learning and language-based approaches

In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which ar...

Detailed description

Bibliographic details
Published in: Bioinformatics (Oxford, England), 2012-03, Vol.28 (6), p.867-875
Main authors: THIEU, Thanh, JOSHI, Sneha, WARREN, Samantha, KORKIN, Dmitry
Format: Article
Language: eng
Subject headings:
Online access: Full text
container_end_page 875
container_issue 6
container_start_page 867
container_title Bioinformatics (Oxford, England)
container_volume 28
creator THIEU, Thanh
JOSHI, Sneha
WARREN, Samantha
KORKIN, Dmitry
description In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which are often specialized to target a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data. Here, we introduce and compare two new approaches to automatically detect whether the title or abstract of a PubMed publication contains HPI data, and extract the information about organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on the features derived from the individual sentences. These features include names of the host/pathogen organisms and corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods and other statistical information. The language-based method employed a link grammar parser combined with semantic patterns derived from the training examples. The approaches have been trained and tested on manually curated HPI data. When compared to a naïve approach based on the existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The most accurate, feature-based, approach achieved 66-73% accuracy, depending on the test protocol.
doi_str_mv 10.1093/bioinformatics/bts042
format Article
pmid 22285561
publisher Oxford: Oxford University Press
date 2012-03-15
rights 2015 INIST-CNRS
tpages 9
fulltext fulltext
identifier ISSN: 1367-4803
ispartof Bioinformatics (Oxford, England), 2012-03, Vol.28 (6), p.867-875
issn 1367-4803
1367-4811
language eng
recordid cdi_proquest_miscellaneous_929504035
source Oxford Journals Open Access Collection; MEDLINE; PubMed Central; Alma/SFX Local Collection; EZB Electronic Journals Library
subjects Animals
Biological and medical sciences
Data Mining
Fundamental and applied biological sciences. Psychology
General aspects
Host-Pathogen Interactions
Humans
Infection - metabolism
Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects)
Natural Language Processing
Proteins - metabolism
PubMed
Software
Support Vector Machine
title Literature mining of host―pathogen interactions: comparing feature-based supervised learning and language-based approaches
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T21%3A19%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Literature%20mining%20of%20host%E2%80%95pathogen%20interactions:%20comparing%20feature-based%20supervised%20learning%20and%20language-based%20approaches&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=THIEU,%20Thanh&rft.date=2012-03-15&rft.volume=28&rft.issue=6&rft.spage=867&rft.epage=875&rft.pages=867-875&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/bts042&rft_dat=%3Cproquest_cross%3E929504035%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=929504035&rft_id=info:pmid/22285561&rfr_iscdi=true
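
The description field above says the SVM models are trained on features derived from individual sentences, including organism names, interaction keywords and experimental methods. A minimal sketch of what such per-sentence feature extraction might look like is given below. The keyword sets, the tokenizer and the function name `sentence_features` are hypothetical and chosen only for illustration; the record does not list the dictionaries or the exact feature set used in the paper, and the classifier itself is not shown.

```python
import re

# Hypothetical keyword lists, for illustration only. The dictionaries
# actually used in the paper are not given in this record.
HOST_TERMS = {"human", "host", "mouse"}
PATHOGEN_TERMS = {"hiv-1", "virus", "salmonella"}
INTERACTION_TERMS = {"binds", "interacts", "targets", "inhibits"}
METHOD_TERMS = {"two-hybrid", "co-immunoprecipitation", "pull-down"}

# Lowercase alphanumeric tokens, keeping internal hyphens ("two-hybrid").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def sentence_features(sentence):
    """Return a fixed-length count vector for one sentence.

    Order: host terms, pathogen terms, interaction keywords,
    experimental-method keywords, total token count.
    """
    tokens = TOKEN_RE.findall(sentence.lower())
    return [
        sum(t in HOST_TERMS for t in tokens),
        sum(t in PATHOGEN_TERMS for t in tokens),
        sum(t in INTERACTION_TERMS for t in tokens),
        sum(t in METHOD_TERMS for t in tokens),
        len(tokens),
    ]


example = "HIV-1 Nef binds human CD4 in a two-hybrid assay."
print(sentence_features(example))
```

Vectors of this kind, one per sentence, could then be passed to any SVM implementation together with curated labels. Simple counts are used here because they keep the vector length fixed regardless of sentence length.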