Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data
The Biodiversity Heritage Library contains 57 million pages of biological information. The majority of this information is a scanned and digitized non-structured text. This "raw" text is hard to access by computers or humans, without the addition of rich metadata. Recent improvements in na...
Published in: | Biodiversity Information Science and Standards 2019-06, Vol.3 |
---|---|
Main authors: | Mozzherin, Dmitry; Myltsev, Alexander; Patterson, David |
Format: | Article |
Language: | eng |
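Subjects: | Big Data; Biodiversity; Computer applications; Computers; Digitization; Learning algorithms; Metadata; Names; Special libraries; Vernacular names |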
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | Biodiversity Information Science and Standards |
container_volume | 3 |
creator | Mozzherin, Dmitry Myltsev, Alexander Patterson, David |
description | The Biodiversity Heritage Library contains 57 million pages of biological information. The majority of this information is a scanned and digitized non-structured text. This "raw" text is hard to access by computers or humans, without the addition of rich metadata. Recent improvements in natural language processing (NLP) and machine learning (ML) promise to facilitate the creation of such metadata.
One obvious approach to improving BHL usability is to extract and provide an index of scientific names, thereby enabling biologists to find useful information more easily and quickly. The Global Names Architecture (GNA) detects, verifies, collects, and indexes scientific names from many sources. Six years ago GNA developers created an index of the scientific names in the BHL by parsing every page one by one. This took 45 days to accomplish. Almost immediately BHL users began to find problems in the index and suggest improvements. However, the cost of repeating such a gigantic job was prohibitive and, as a result, the index remained nearly unchanged for 6 years.
Two problems were at the heart of dealing with the “Big Data” of the BHL: the time it took to transfer the raw data prior to processing, and the computational time it took to detect the names themselves. To solve these problems we could either throw more hardware resources at the problem (expensive), or find ways to dramatically improve performance of the tasks (cheaper). We decided to achieve our goal by utilizing hardware more effectively, and by using fast, scalable programming languages.
We wrote several open-source applications in Go and Scala to detect candidate scientific names and then verify them by comparing them to 27 million scientific name-strings aggregated by GNA. We were able to speed up data mobilization from 24 hours to 11 minutes, and decrease the time for name detection from 35 days to 5 hours. Name-verification time decreased from 10 days to 9 hours. Overall, our computing requirements shrank from 4 high-end servers to one modern laptop. As a result, we achieved our goal and indexed BHL in only 14 hours, which makes iterative improvement of the scientific name index practical.
We also wanted to make it possible to study BHL data in its entirety remotely, in real-time. We created an HTTP2 service that is able to stream gigantic amounts of BHL textual data together with scientific names to a researcher. Sending the text of 50 million pages with an associated 250 million name occurrence |
doi_str_mv | 10.3897/biss.3.35353 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2282396336</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2282396336</sourcerecordid><originalsourceid>FETCH-LOGICAL-c1463-3a22d716f7c2a33a65ed4aeef1d4041f10265d630ca3cd83725a8eadbdbdab843</originalsourceid><addsrcrecordid>eNpNkEFPAjEQhRujiUS5-QOaeGWx7ex2l6OiiAmJFzw33bYLg9Jiu2j49xbxYN7hTTJfZvIeITecjaGZ1HctpjSGMVRZZ2QgshcsL87_zZdkmNKGMSYmQjSyGZDlDL1Fv6LJoPM9dmio11uXKHr6gMHil4sJ-wOdu4i9Xjm6wDbqeBjREOk6fNM-0LSO6N8zv6KPutfX5KLTH8kN__yKvM2eltN5sXh9fpneLwrDSwkFaCFszWVXG6EBtKycLbVzHbclK3nHmZCVlcCMBmMbqEWlG6dtm6XbpoQrcnu6u4vhc-9SrzZhH31-qXI8ARMJIDM1OlEmhpSi69Qu4jYnUJypY3XqWJ0C9Vsd_AATyGHN</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2282396336</pqid></control><display><type>article</type><title>Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data</title><source>Pensoft Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><creator>Mozzherin, Dmitry ; Myltsev, Alexander ; Patterson, David</creator><creatorcontrib>Mozzherin, Dmitry ; Myltsev, Alexander ; Patterson, David</creatorcontrib><description>The Biodiversity Heritage Library contains 57 million pages of biological information. The majority of this information is a scanned and digitized non-structured text. This "raw" text is hard to access by computers or humans, without the addition of rich metadata. Recent improvements in natural language processing (NLP) and machine learning (ML) promise to facilitate the creation of such metadata.
One obvious approach to improve BHL usability is to extract and provide an index of scientific names thereby enabling biologists to find useful information easier and faster. The Global Names Architecture (GNA) detects, verifies, collects, and indexes scientific names from many sources. Six years ago GNA developers created an index of the scientific names in the BHL by parsing every page one by one. This took 45 days to accomplish. Almost immediately BHL users began to find problems in the index and suggest improvements. However, the cost of repeating such a gigantic job was insurmountable and as a result the index remained nearly unchanged for 6 years.
Two problems were at the heart of dealing with the “Big Data” of the BHL: the time it took to transfer the raw data prior to processing, and the computational time it took to detect the names themselves. To solve these problems we could either throw more hardware resources at the problem (expensive), or find ways to dramatically improve performance of the tasks (cheaper). We decided to achieve our goal by utilizing hardware more effectively, and by using fast, scalable programming languages.
We wrote several Open Source applications in Go and Scala to detect candidate scientific names then verify them as names by comparing them to 27 million scientific name-strings aggregated by GNA. We were able to speed up data mobilization from 24 hours to 11 minutes, and decrease the time for name detection from 35 days to 5 hours. Name-verification time decreased from 10 days to 9 hours. Overall our computing requirements shrank from 4 high-end servers to one modern laptop. As a result we achieved our goal and indexed BHL in only 14 hours and unlocked the reality of iterative improvements to the scientific name index.
We also wanted to make it possible to study BHL data in its entirety remotely, in real-time. We created an HTTP2 service that is able to stream gigantic amounts of BHL textual data together with scientific names to a researcher. Sending the text of 50 million pages with an associated 250 million name occurrences takes ~5 hours. For comparison, simply copying BHL text data from Smithsonian Institute to University of Illinois using more traditional methods took us 10 days.
What do we hope to achieve with these tools as next steps? To make it possible for everyone to make new discoveries by computing in real-time across the complete BHL text. For example 20% of all names in BHL are abbreviated, and, as a result, very poorly searchable given their existing full-text indexing. We plan to develop algorithms to expand abbreviated genera reliably. Digitized texts contain huge amounts of character recognition mistakes. The tools might help to detect badly digitized pages and mark them for re-digitization. Tools can help to extract scientific names that are identical to "normal" words, such as "Atlanta", or "America", to find common names in texts, and to localize information on locations, adding new search contexts. Finally, we are exploring tools that allow researchers to stream such results back to source thereby growing the “Big Data” and ultimately improving the BHL’s end-user experience.</description><identifier>ISSN: 2535-0897</identifier><identifier>EISSN: 2535-0897</identifier><identifier>DOI: 10.3897/biss.3.35353</identifier><language>eng</language><publisher>Sofia: Pensoft Publishers</publisher><subject>Big Data ; Biodiversity ; Computer applications ; Computers ; Digitization ; Learning algorithms ; Metadata ; Names ; Special libraries ; Vernacular names</subject><ispartof>Biodiversity Information Science and Standards, 2019-06, Vol.3</ispartof><rights>2019. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c1463-3a22d716f7c2a33a65ed4aeef1d4041f10265d630ca3cd83725a8eadbdbdab843</citedby><orcidid>0000-0003-2645-7335 ; 0000-0003-1593-1417</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27903,27904</link.rule.ids></links><search><creatorcontrib>Mozzherin, Dmitry</creatorcontrib><creatorcontrib>Myltsev, Alexander</creatorcontrib><creatorcontrib>Patterson, David</creatorcontrib><title>Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data</title><title>Biodiversity Information Science and Standards</title><description>The Biodiversity Heritage Library contains 57 million pages of biological information. The majority of this information is a scanned and digitized non-structured text. This "raw" text is hard to access by computers or humans, without the addition of rich metadata. Recent improvements in natural language processing (NLP) and machine learning (ML) promise to facilitate the creation of such metadata.
One obvious approach to improve BHL usability is to extract and provide an index of scientific names thereby enabling biologists to find useful information easier and faster. The Global Names Architecture (GNA) detects, verifies, collects, and indexes scientific names from many sources. Six years ago GNA developers created an index of the scientific names in the BHL by parsing every page one by one. This took 45 days to accomplish. Almost immediately BHL users began to find problems in the index and suggest improvements. However, the cost of repeating such a gigantic job was insurmountable and as a result the index remained nearly unchanged for 6 years.
Two problems were at the heart of dealing with the “Big Data” of the BHL: the time it took to transfer the raw data prior to processing, and the computational time it took to detect the names themselves. To solve these problems we could either throw more hardware resources at the problem (expensive), or find ways to dramatically improve performance of the tasks (cheaper). We decided to achieve our goal by utilizing hardware more effectively, and by using fast, scalable programming languages.
We wrote several Open Source applications in Go and Scala to detect candidate scientific names then verify them as names by comparing them to 27 million scientific name-strings aggregated by GNA. We were able to speed up data mobilization from 24 hours to 11 minutes, and decrease the time for name detection from 35 days to 5 hours. Name-verification time decreased from 10 days to 9 hours. Overall our computing requirements shrank from 4 high-end servers to one modern laptop. As a result we achieved our goal and indexed BHL in only 14 hours and unlocked the reality of iterative improvements to the scientific name index.
We also wanted to make it possible to study BHL data in its entirety remotely, in real-time. We created an HTTP2 service that is able to stream gigantic amounts of BHL textual data together with scientific names to a researcher. Sending the text of 50 million pages with an associated 250 million name occurrences takes ~5 hours. For comparison, simply copying BHL text data from Smithsonian Institute to University of Illinois using more traditional methods took us 10 days.
What do we hope to achieve with these tools as next steps? To make it possible for everyone to make new discoveries by computing in real-time across the complete BHL text. For example 20% of all names in BHL are abbreviated, and, as a result, very poorly searchable given their existing full-text indexing. We plan to develop algorithms to expand abbreviated genera reliably. Digitized texts contain huge amounts of character recognition mistakes. The tools might help to detect badly digitized pages and mark them for re-digitization. Tools can help to extract scientific names that are identical to "normal" words, such as "Atlanta", or "America", to find common names in texts, and to localize information on locations, adding new search contexts. Finally, we are exploring tools that allow researchers to stream such results back to source thereby growing the “Big Data” and ultimately improving the BHL’s end-user experience.</description><subject>Big Data</subject><subject>Biodiversity</subject><subject>Computer applications</subject><subject>Computers</subject><subject>Digitization</subject><subject>Learning algorithms</subject><subject>Metadata</subject><subject>Names</subject><subject>Special libraries</subject><subject>Vernacular 
names</subject><issn>2535-0897</issn><issn>2535-0897</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNpNkEFPAjEQhRujiUS5-QOaeGWx7ex2l6OiiAmJFzw33bYLg9Jiu2j49xbxYN7hTTJfZvIeITecjaGZ1HctpjSGMVRZZ2QgshcsL87_zZdkmNKGMSYmQjSyGZDlDL1Fv6LJoPM9dmio11uXKHr6gMHil4sJ-wOdu4i9Xjm6wDbqeBjREOk6fNM-0LSO6N8zv6KPutfX5KLTH8kN__yKvM2eltN5sXh9fpneLwrDSwkFaCFszWVXG6EBtKycLbVzHbclK3nHmZCVlcCMBmMbqEWlG6dtm6XbpoQrcnu6u4vhc-9SrzZhH31-qXI8ARMJIDM1OlEmhpSi69Qu4jYnUJypY3XqWJ0C9Vsd_AATyGHN</recordid><startdate>20190613</startdate><enddate>20190613</enddate><creator>Mozzherin, Dmitry</creator><creator>Myltsev, Alexander</creator><creator>Patterson, David</creator><general>Pensoft Publishers</general><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FH</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>LK8</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><orcidid>https://orcid.org/0000-0003-2645-7335</orcidid><orcidid>https://orcid.org/0000-0003-1593-1417</orcidid></search><sort><creationdate>20190613</creationdate><title>Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data</title><author>Mozzherin, Dmitry ; Myltsev, Alexander ; Patterson, David</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c1463-3a22d716f7c2a33a65ed4aeef1d4041f10265d630ca3cd83725a8eadbdbdab843</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Big 
Data</topic><topic>Biodiversity</topic><topic>Computer applications</topic><topic>Computers</topic><topic>Digitization</topic><topic>Learning algorithms</topic><topic>Metadata</topic><topic>Names</topic><topic>Special libraries</topic><topic>Vernacular names</topic><toplevel>online_resources</toplevel><creatorcontrib>Mozzherin, Dmitry</creatorcontrib><creatorcontrib>Myltsev, Alexander</creatorcontrib><creatorcontrib>Patterson, David</creatorcontrib><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Biological Science Collection</collection><collection>Biological Science Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><jtitle>Biodiversity Information Science and Standards</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mozzherin, Dmitry</au><au>Myltsev, Alexander</au><au>Patterson, David</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data</atitle><jtitle>Biodiversity Information Science and 
Standards</jtitle><date>2019-06-13</date><risdate>2019</risdate><volume>3</volume><issn>2535-0897</issn><eissn>2535-0897</eissn><abstract>The Biodiversity Heritage Library contains 57 million pages of biological information. The majority of this information is a scanned and digitized non-structured text. This "raw" text is hard to access by computers or humans, without the addition of rich metadata. Recent improvements in natural language processing (NLP) and machine learning (ML) promise to facilitate the creation of such metadata.
One obvious approach to improve BHL usability is to extract and provide an index of scientific names thereby enabling biologists to find useful information easier and faster. The Global Names Architecture (GNA) detects, verifies, collects, and indexes scientific names from many sources. Six years ago GNA developers created an index of the scientific names in the BHL by parsing every page one by one. This took 45 days to accomplish. Almost immediately BHL users began to find problems in the index and suggest improvements. However, the cost of repeating such a gigantic job was insurmountable and as a result the index remained nearly unchanged for 6 years.
Two problems were at the heart of dealing with the “Big Data” of the BHL: the time it took to transfer the raw data prior to processing, and the computational time it took to detect the names themselves. To solve these problems we could either throw more hardware resources at the problem (expensive), or find ways to dramatically improve performance of the tasks (cheaper). We decided to achieve our goal by utilizing hardware more effectively, and by using fast, scalable programming languages.
We wrote several Open Source applications in Go and Scala to detect candidate scientific names then verify them as names by comparing them to 27 million scientific name-strings aggregated by GNA. We were able to speed up data mobilization from 24 hours to 11 minutes, and decrease the time for name detection from 35 days to 5 hours. Name-verification time decreased from 10 days to 9 hours. Overall our computing requirements shrank from 4 high-end servers to one modern laptop. As a result we achieved our goal and indexed BHL in only 14 hours and unlocked the reality of iterative improvements to the scientific name index.
We also wanted to make it possible to study BHL data in its entirety remotely, in real-time. We created an HTTP2 service that is able to stream gigantic amounts of BHL textual data together with scientific names to a researcher. Sending the text of 50 million pages with an associated 250 million name occurrences takes ~5 hours. For comparison, simply copying BHL text data from Smithsonian Institute to University of Illinois using more traditional methods took us 10 days.
What do we hope to achieve with these tools as next steps? To make it possible for everyone to make new discoveries by computing in real-time across the complete BHL text. For example 20% of all names in BHL are abbreviated, and, as a result, very poorly searchable given their existing full-text indexing. We plan to develop algorithms to expand abbreviated genera reliably. Digitized texts contain huge amounts of character recognition mistakes. The tools might help to detect badly digitized pages and mark them for re-digitization. Tools can help to extract scientific names that are identical to "normal" words, such as "Atlanta", or "America", to find common names in texts, and to localize information on locations, adding new search contexts. Finally, we are exploring tools that allow researchers to stream such results back to source thereby growing the “Big Data” and ultimately improving the BHL’s end-user experience.</abstract><cop>Sofia</cop><pub>Pensoft Publishers</pub><doi>10.3897/biss.3.35353</doi><orcidid>https://orcid.org/0000-0003-2645-7335</orcidid><orcidid>https://orcid.org/0000-0003-1593-1417</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2535-0897 |
ispartof | Biodiversity Information Science and Standards, 2019-06, Vol.3 |
issn | 2535-0897 2535-0897 |
language | eng |
recordid | cdi_proquest_journals_2282396336 |
source | Pensoft Open Access Journals; EZB-FREE-00999 freely available EZB journals |
subjects | Big Data Biodiversity Computer applications Computers Digitization Learning algorithms Metadata Names Special libraries Vernacular names |
title | Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T10%3A19%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Finding%20scientific%20names%20in%20Biodiversity%20Heritage%20Library,%20or%20how%20to%20shrink%20Big%20Data&rft.jtitle=Biodiversity%20Information%20Science%20and%20Standards&rft.au=Mozzherin,%20Dmitry&rft.date=2019-06-13&rft.volume=3&rft.issn=2535-0897&rft.eissn=2535-0897&rft_id=info:doi/10.3897/biss.3.35353&rft_dat=%3Cproquest_cross%3E2282396336%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2282396336&rft_id=info:pmid/&rfr_iscdi=true |
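
The abstract describes a two-stage pipeline: detect candidate scientific names in page text, then verify them against name-strings aggregated by GNA. The following is a minimal sketch of that idea, written in Python for brevity (the authors' tools are written in Go and Scala). The regular expression, the tiny `KNOWN_NAMES` set, and the function names `find_candidates` and `verify` are all illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical heuristic: a capitalized genus (or an abbreviation such as "P.")
# followed by a lowercase epithet of at least three letters. Real name finders
# rely on much richer rules and dictionaries; this pattern is an assumption.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")

# Tiny stand-in for the 27 million name-strings mentioned in the abstract.
KNOWN_NAMES = {"Homo sapiens", "Parus major", "Quercus alba"}


def find_candidates(text):
    """Return (offset, name) pairs for every candidate binomial in text."""
    return [
        (m.start(), f"{m.group(1)} {m.group(2)}")
        for m in CANDIDATE.finditer(text)
    ]


def verify(candidates, known=KNOWN_NAMES):
    """Keep only candidates that exactly match a known name-string."""
    return [(pos, name) for pos, name in candidates if name in known]


page = (
    "The great tit, Parus major, is common. "
    "Many oaks such as Quercus alba grow here."
)

# Detection is deliberately permissive; verification removes false positives
# such as ordinary capitalized words followed by a lowercase word.
for pos, name in verify(find_candidates(page)):
    print(pos, name)
```

In this sketch, detection also proposes ordinary phrases such as "The great", and the lookup step discards them. An exact-match lookup like this one would also discard abbreviated forms such as "P. major", which is consistent with the abstract's remark that abbreviated names are poorly searchable and need dedicated expansion algorithms.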