Literature mining of host―pathogen interactions: comparing feature-based supervised learning and language-based approaches
In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which ar...
Published in: | Bioinformatics (Oxford, England), 2012-03, Vol.28 (6), p.867-875 |
---|---|
Main authors: | THIEU, Thanh; JOSHI, Sneha; WARREN, Samantha; KORKIN, Dmitry |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 875 |
---|---|
container_issue | 6 |
container_start_page | 867 |
container_title | Bioinformatics (Oxford, England) |
container_volume | 28 |
creator | THIEU, Thanh; JOSHI, Sneha; WARREN, Samantha; KORKIN, Dmitry |
description | In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which are often specialized to target a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data.
Here, we introduce and compare two new approaches to automatically detect whether the title or abstract of a PubMed publication contains HPI data, and extract the information about organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on the features derived from the individual sentences. These features include names of the host/pathogen organisms and corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods and other statistical information. The language-based method employed a link grammar parser combined with semantic patterns derived from the training examples. The approaches have been trained and tested on manually curated HPI data. When compared to a naïve approach based on the existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The most accurate, feature-based, approach achieved 66-73% accuracy, depending on the test protocol. |
doi_str_mv | 10.1093/bioinformatics/bts042 |
format | Article |
pmid | 22285561 |
publisher | Oxford: Oxford University Press |
date | 2012-03-15 |
rights | 2015 INIST-CNRS |
tpages | 9 |
fulltext | fulltext |
identifier | ISSN: 1367-4803 |
ispartof | Bioinformatics (Oxford, England), 2012-03, Vol.28 (6), p.867-875 |
issn | 1367-4803 1367-4811 |
language | eng |
recordid | cdi_proquest_miscellaneous_929504035 |
source | Oxford Journals Open Access Collection; MEDLINE; PubMed Central; Alma/SFX Local Collection; EZB Electronic Journals Library |
subjects | Animals; Biological and medical sciences; Data Mining; Fundamental and applied biological sciences. Psychology; General aspects; Host-Pathogen Interactions; Humans; Infection - metabolism; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Natural Language Processing; Proteins - metabolism; PubMed; Software; Support Vector Machine |
title | Literature mining of host―pathogen interactions: comparing feature-based supervised learning and language-based approaches |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T21%3A19%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Literature%20mining%20of%20host%E2%80%95pathogen%20interactions:%20comparing%20feature-based%20supervised%20learning%20and%20language-based%20approaches&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=THIEU,%20Thanh&rft.date=2012-03-15&rft.volume=28&rft.issue=6&rft.spage=867&rft.epage=875&rft.pages=867-875&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/bts042&rft_dat=%3Cproquest_cross%3E929504035%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=929504035&rft_id=info:pmid/22285561&rfr_iscdi=true |
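
The abstract describes SVM models trained on features derived from individual sentences: organism names, protein or gene names, HPI-specific and general protein-protein interaction keywords, experimental methods, and other statistical information. As a rough illustration of what such a per-sentence feature vector could look like, here is a minimal sketch. The lexicons, the tokenizer, and the function name `sentence_features` are all assumptions made for this example; the record does not give the dictionaries or the feature encoding the authors used, and no classifier is trained here.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only);
# the actual dictionaries used in the paper are not given in this record.
HOST_TERMS = {"human", "host", "mouse"}
PATHOGEN_TERMS = {"hiv-1", "virus", "salmonella"}
HPI_KEYWORDS = {"infects", "invades", "hijacks"}
PPI_KEYWORDS = {"binds", "interacts", "phosphorylates"}
METHOD_KEYWORDS = {"co-immunoprecipitation", "two-hybrid"}

# Lowercase alphanumeric tokens, keeping internal hyphens (e.g. "hiv-1").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def sentence_features(sentence):
    """Map one sentence to a fixed-length list of counts.

    Order: host terms, pathogen terms, HPI keywords, PPI keywords,
    experimental-method keywords, total token count.
    """
    tokens = TOKEN_RE.findall(sentence.lower())

    def count(lexicon):
        # Number of tokens that appear in the given lexicon.
        return sum(1 for token in tokens if token in lexicon)

    return [
        count(HOST_TERMS),
        count(PATHOGEN_TERMS),
        count(HPI_KEYWORDS),
        count(PPI_KEYWORDS),
        count(METHOD_KEYWORDS),
        len(tokens),  # a simple statistical feature: sentence length
    ]


if __name__ == "__main__":
    example = "HIV-1 Tat binds human cyclin T1, as shown by co-immunoprecipitation."
    print(sentence_features(example))
```

Vectors of this kind, paired with manually curated labels, could then be passed to any SVM implementation. Real systems would likely replace the toy lexicons with curated organism and protein dictionaries.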