Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts

Background Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured for...

Bibliographic details
Published in: Health informatics journal 2023-04, Vol.29 (2), p.14604582231164696-14604582231164696
Main authors: Ladas, Nektarios; Borchert, Florian; Franz, Stefan; Rehberg, Alina; Strauch, Natalia; Sommer, Kim Katrin; Marschollek, Michael; Gietzelt, Matthias
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 14604582231164696
container_issue 2
container_start_page 14604582231164696
container_title Health informatics journal
container_volume 29
creator Ladas, Nektarios
Borchert, Florian
Franz, Stefan
Rehberg, Alina
Strauch, Natalia
Sommer, Kim Katrin
Marschollek, Michael
Gietzelt, Matthias
description Background Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy. Objectives In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them into a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how software programming techniques can improve the readability of rule-based (RB) information extraction (IE) pipelines, and thereby minimize the time needed to correct or update the rules and to port them efficiently to another programming language. Methods The extraction rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. First, we implemented each rule directly in one line for each extraction item. Second, we reprogrammed the rules in a readable fashion through decomposition and intention-revealing names for the variable declarations. To measure the impact of both methods, we measured the time for fine-tuning and programming the extractions on test data of semi-structured and unstructured texts. Results We analyze the benefits of improving the readability of rule writing by programming the rules in parallel with regular expressions (REGEX) and the Apache UIMA Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX were decomposed, and intention-revealing declarations were reprogrammed in AURL in 5 min.
Conclusion We discuss the importance of readability as a factor and how it can be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can prove beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.
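
The contrast the abstract draws between one-line rules and decomposed rules with intention-revealing names can be illustrated with a minimal Python sketch. This is not the authors' rule set: the simplified TNM pattern, the names PREFIX, TUMOR, NODES, METASTASIS, SEPARATOR, and the helper extract_tnm are all hypothetical and chosen only for illustration. Both patterns are meant to match the same text; only the way they are written differs.

```python
import re

# Hypothetical, simplified TNM pattern; NOT the rules from the paper.

# Variant 1: the whole rule written directly in one line.
TNM_ONE_LINE = re.compile(
    r"\b[cpyra]{0,2}T(?:is|X|[0-4][a-d]?)[\s,]*[cpyra]{0,2}N(?:X|[0-3][a-c]?)[\s,]*[cpyra]{0,2}M(?:X|[01][a-c]?)\b"
)

# Variant 2: the same rule decomposed into named parts.
PREFIX = r"[cpyra]{0,2}"                              # optional staging prefix, e.g. "p" or "yp"
TUMOR = rf"{PREFIX}T(?P<tumor>is|X|[0-4][a-d]?)"      # primary tumor category
NODES = rf"{PREFIX}N(?P<nodes>X|[0-3][a-c]?)"         # regional lymph node category
METASTASIS = rf"{PREFIX}M(?P<metastasis>X|[01][a-c]?)"  # distant metastasis category
SEPARATOR = r"[\s,]*"                                 # whitespace or commas between parts

TNM_READABLE = re.compile(
    rf"\b{TUMOR}{SEPARATOR}{NODES}{SEPARATOR}{METASTASIS}\b"
)


def extract_tnm(text):
    """Return the T, N, and M values as a dict, or None if no match is found."""
    match = TNM_READABLE.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(extract_tnm("Staging: pT2a pN1 M0."))
```

In the decomposed variant, a correction such as allowing an additional prefix letter is made once in PREFIX, whereas the one-line variant needs the same change in three places.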
doi_str_mv 10.1177/14604582231164696
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2802425583</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1177_14604582231164696</sage_id><sourcerecordid>2800074555</sourcerecordid><originalsourceid>FETCH-LOGICAL-c439t-ea3d91cebfc70804d0ea5ccb2ea30a26e633489a4abfa292142828951abb3b323</originalsourceid><addsrcrecordid>eNp9kctu1TAQhiNERUvLA7BBltiwSfE1cZao4iZVgkW7jsbO5OAqcQ52jOhz8YJMespFIFh5NPPNP-P5q-qp4OdCtO1LoRuujZVSCdHopmseVCei1aKWVoiHFFO93oDj6nHON5xzxY16VB2rljeWS3tSffuYll2CeQ5xx1b0n2L4XDCzcUkszPu0fNkKqUzIEsIALkxhvb0rb8naQcaBhUiJGdawRIZf1wT-LoywlgQTmyDuCuyQkZ7HnDfJfdjjFCKNWkZWYl5T8USTGMSBZZxD_VtuxiF4UlpJPZ9VRyNMGZ_cv6fV9ZvXVxfv6ssPb99fvLqsvVbdWiOooRMe3ehbbrkeOILx3kkqcJANNkpp24EGN4LspNDSStsZAc4pp6Q6rV4cdGnt7ShrP4fscaLv4FJyL-mEWhpjFaHP_0BvlpIibUeUEm0jtRH_p8idVhtjiBIHyqcl54Rjv09hhnTbC95vvvd_-U49z-6Vi6Nb_ez4YTQB5wcgkw-_xv5b8Ts0c7oD</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2800074555</pqid></control><display><type>article</type><title>Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Sage Journals GOLD Open Access 2024</source><source>EZB-FREE-00999 freely available EZB journals</source><creator>Ladas, Nektarios ; Borchert, Florian ; Franz, Stefan ; Rehberg, Alina ; Strauch, Natalia ; Sommer, Kim Katrin ; Marschollek, Michael ; Gietzelt, Matthias</creator><creatorcontrib>Ladas, Nektarios ; Borchert, Florian ; Franz, Stefan ; Rehberg, Alina ; Strauch, Natalia ; Sommer, Kim Katrin ; Marschollek, Michael ; Gietzelt, Matthias</creatorcontrib><description>Background Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports 
can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy. Objectives In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them further to a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how with the help of software programming techniques the readability of rule-based (RB) information extraction (IE) pipelines can be improved, and therefore minimize the time to correct or update the rules, and efficiently import them to another programming language. Methods The extract rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. Firstly we implemented each rule directly in one line for each extraction item. Secondly, we reprogrammed them in a readable fashion through decomposition and intention-revealing names for the variable declaration. To measure the impact of both methods we measure the time for the fine-tuning and programming of the extractions through test data of semi-structured and unstructured texts. Results We analyze the benefits of improving through readability of the writing of rules, through parallel programming with regular expressions (REGEX), and the Apache Uima Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX are decomposed and intention-revealing declarations were reprogrammed in AURL in 5 min. Conclusion We discuss the importance of factor readability and how can it be improved when programming RB text IE pipelines. 
Independent of the features of the programming language and the tools applied, a readable coding strategy can be proven beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.</description><identifier>ISSN: 1460-4582</identifier><identifier>EISSN: 1741-2811</identifier><identifier>DOI: 10.1177/14604582231164696</identifier><identifier>PMID: 37068028</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Algorithms ; Comprehension ; Decomposition ; Design specifications ; Electronic Health Records ; Humans ; Information Storage and Retrieval ; Natural Language Processing ; Programming languages</subject><ispartof>Health informatics journal, 2023-04, Vol.29 (2), p.14604582231164696-14604582231164696</ispartof><rights>The Author(s) 2023</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c439t-ea3d91cebfc70804d0ea5ccb2ea30a26e633489a4abfa292142828951abb3b323</citedby><cites>FETCH-LOGICAL-c439t-ea3d91cebfc70804d0ea5ccb2ea30a26e633489a4abfa292142828951abb3b323</cites><orcidid>0000-0001-5918-8384</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1177/14604582231164696$$EPDF$$P50$$Gsage$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1177/14604582231164696$$EHTML$$P50$$Gsage$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,864,21964,27851,27922,27923,44943,45331</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37068028$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ladas, Nektarios</creatorcontrib><creatorcontrib>Borchert, 
Florian</creatorcontrib><creatorcontrib>Franz, Stefan</creatorcontrib><creatorcontrib>Rehberg, Alina</creatorcontrib><creatorcontrib>Strauch, Natalia</creatorcontrib><creatorcontrib>Sommer, Kim Katrin</creatorcontrib><creatorcontrib>Marschollek, Michael</creatorcontrib><creatorcontrib>Gietzelt, Matthias</creatorcontrib><title>Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts</title><title>Health informatics journal</title><addtitle>Health Informatics J</addtitle><description>Background Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy. Objectives In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them further to a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how with the help of software programming techniques the readability of rule-based (RB) information extraction (IE) pipelines can be improved, and therefore minimize the time to correct or update the rules, and efficiently import them to another programming language. Methods The extract rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. Firstly we implemented each rule directly in one line for each extraction item. 
Secondly, we reprogrammed them in a readable fashion through decomposition and intention-revealing names for the variable declaration. To measure the impact of both methods we measure the time for the fine-tuning and programming of the extractions through test data of semi-structured and unstructured texts. Results We analyze the benefits of improving through readability of the writing of rules, through parallel programming with regular expressions (REGEX), and the Apache Uima Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX are decomposed and intention-revealing declarations were reprogrammed in AURL in 5 min. Conclusion We discuss the importance of factor readability and how can it be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can be proven beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.</description><subject>Algorithms</subject><subject>Comprehension</subject><subject>Decomposition</subject><subject>Design specifications</subject><subject>Electronic Health Records</subject><subject>Humans</subject><subject>Information Storage and Retrieval</subject><subject>Natural Language Processing</subject><subject>Programming 
languages</subject><issn>1460-4582</issn><issn>1741-2811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>AFRWT</sourceid><sourceid>EIF</sourceid><recordid>eNp9kctu1TAQhiNERUvLA7BBltiwSfE1cZao4iZVgkW7jsbO5OAqcQ52jOhz8YJMespFIFh5NPPNP-P5q-qp4OdCtO1LoRuujZVSCdHopmseVCei1aKWVoiHFFO93oDj6nHON5xzxY16VB2rljeWS3tSffuYll2CeQ5xx1b0n2L4XDCzcUkszPu0fNkKqUzIEsIALkxhvb0rb8naQcaBhUiJGdawRIZf1wT-LoywlgQTmyDuCuyQkZ7HnDfJfdjjFCKNWkZWYl5T8USTGMSBZZxD_VtuxiF4UlpJPZ9VRyNMGZ_cv6fV9ZvXVxfv6ssPb99fvLqsvVbdWiOooRMe3ehbbrkeOILx3kkqcJANNkpp24EGN4LspNDSStsZAc4pp6Q6rV4cdGnt7ShrP4fscaLv4FJyL-mEWhpjFaHP_0BvlpIibUeUEm0jtRH_p8idVhtjiBIHyqcl54Rjv09hhnTbC95vvvd_-U49z-6Vi6Nb_ez4YTQB5wcgkw-_xv5b8Ts0c7oD</recordid><startdate>20230401</startdate><enddate>20230401</enddate><creator>Ladas, Nektarios</creator><creator>Borchert, Florian</creator><creator>Franz, Stefan</creator><creator>Rehberg, Alina</creator><creator>Strauch, Natalia</creator><creator>Sommer, Kim Katrin</creator><creator>Marschollek, Michael</creator><creator>Gietzelt, Matthias</creator><general>SAGE Publications</general><general>SAGE PUBLICATIONS, INC</general><scope>AFRWT</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-5918-8384</orcidid></search><sort><creationdate>20230401</creationdate><title>Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts</title><author>Ladas, Nektarios ; Borchert, Florian ; Franz, Stefan ; Rehberg, Alina ; Strauch, Natalia ; Sommer, Kim Katrin ; Marschollek, Michael ; Gietzelt, 
Matthias</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c439t-ea3d91cebfc70804d0ea5ccb2ea30a26e633489a4abfa292142828951abb3b323</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Comprehension</topic><topic>Decomposition</topic><topic>Design specifications</topic><topic>Electronic Health Records</topic><topic>Humans</topic><topic>Information Storage and Retrieval</topic><topic>Natural Language Processing</topic><topic>Programming languages</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ladas, Nektarios</creatorcontrib><creatorcontrib>Borchert, Florian</creatorcontrib><creatorcontrib>Franz, Stefan</creatorcontrib><creatorcontrib>Rehberg, Alina</creatorcontrib><creatorcontrib>Strauch, Natalia</creatorcontrib><creatorcontrib>Sommer, Kim Katrin</creatorcontrib><creatorcontrib>Marschollek, Michael</creatorcontrib><creatorcontrib>Gietzelt, Matthias</creatorcontrib><collection>Sage Journals GOLD Open Access 2024</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Library &amp; Information Sciences Abstracts (LISA)</collection><collection>Library &amp; Information Science Abstracts (LISA)</collection><collection>MEDLINE - Academic</collection><jtitle>Health informatics journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ladas, Nektarios</au><au>Borchert, Florian</au><au>Franz, Stefan</au><au>Rehberg, Alina</au><au>Strauch, Natalia</au><au>Sommer, Kim Katrin</au><au>Marschollek, Michael</au><au>Gietzelt, Matthias</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Programming 
techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts</atitle><jtitle>Health informatics journal</jtitle><addtitle>Health Informatics J</addtitle><date>2023-04-01</date><risdate>2023</risdate><volume>29</volume><issue>2</issue><spage>14604582231164696</spage><epage>14604582231164696</epage><pages>14604582231164696-14604582231164696</pages><issn>1460-4582</issn><eissn>1741-2811</eissn><abstract>Background Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy. Objectives In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them further to a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how with the help of software programming techniques the readability of rule-based (RB) information extraction (IE) pipelines can be improved, and therefore minimize the time to correct or update the rules, and efficiently import them to another programming language. Methods The extract rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. Firstly we implemented each rule directly in one line for each extraction item. Secondly, we reprogrammed them in a readable fashion through decomposition and intention-revealing names for the variable declaration. 
To measure the impact of both methods we measure the time for the fine-tuning and programming of the extractions through test data of semi-structured and unstructured texts. Results We analyze the benefits of improving through readability of the writing of rules, through parallel programming with regular expressions (REGEX), and the Apache Uima Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX are decomposed and intention-revealing declarations were reprogrammed in AURL in 5 min. Conclusion We discuss the importance of factor readability and how can it be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can be proven beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><pmid>37068028</pmid><doi>10.1177/14604582231164696</doi><orcidid>https://orcid.org/0000-0001-5918-8384</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1460-4582
ispartof Health informatics journal, 2023-04, Vol.29 (2), p.14604582231164696-14604582231164696
issn 1460-4582
1741-2811
language eng
recordid cdi_proquest_miscellaneous_2802425583
source MEDLINE; DOAJ Directory of Open Access Journals; Sage Journals GOLD Open Access 2024; EZB-FREE-00999 freely available EZB journals
subjects Algorithms
Comprehension
Decomposition
Design specifications
Electronic Health Records
Humans
Information Storage and Retrieval
Natural Language Processing
Programming languages
title Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T23%3A07%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Programming%20techniques%20for%20improving%20rule%20readability%20for%20rule-based%20information%20extraction%20natural%20language%20processing%20pipelines%20of%20unstructured%20and%20semi-structured%20medical%20texts&rft.jtitle=Health%20informatics%20journal&rft.au=Ladas,%20Nektarios&rft.date=2023-04-01&rft.volume=29&rft.issue=2&rft.spage=14604582231164696&rft.epage=14604582231164696&rft.pages=14604582231164696-14604582231164696&rft.issn=1460-4582&rft.eissn=1741-2811&rft_id=info:doi/10.1177/14604582231164696&rft_dat=%3Cproquest_cross%3E2800074555%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2800074555&rft_id=info:pmid/37068028&rft_sage_id=10.1177_14604582231164696&rfr_iscdi=true