Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)
Main author: | |
---|---|
Format: | Dataset |
Language: | eng |
Subjects: | |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Lino Ferreira da Silva Barros, Maicon Herverton |
description | The data were collected from SciVal, a platform that hosts Scopus statistics. All metadata were obtained for the top 25 Brazilian universities, as ranked by the Center for World University Rankings (CWUR) in 2023, and cover the years 2013 to 2023. The dataset contains abstracts of published scientific papers classified according to the Sustainable Development Goals (SDGs) by the Scopus team. The original dataset consists of 15,488 records and 20 columns. We preprocessed the data to train a language model capable of classifying Brazilian research projects according to the SDGs.
During preprocessing, we removed duplicate records, multi-label entries, samples missing abstracts, and unnecessary columns. The preprocessed dataset contains 13,789 records and two columns; the SDG classification is stored in the "label" column, with values from 1 to 17 representing the 17 SDGs in order.
After preprocessing, we balanced the dataset to 300 records per class. Majority classes with more than 300 records were reduced to 300. For minority classes with fewer than 300 records, we generated the missing records with the generative model Mixtral-8x7B-Instruct-v0.1, using the real abstracts as examples. The dataset is intended for training language models that classify Brazilian scientific texts by SDG.
The 17 SDGs are:
1. No Poverty
2. Zero Hunger
3. Good Health and Well-being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry, Innovation and Infrastructure
10. Reduced Inequalities
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace, Justice and Strong Institutions
17. Partnerships for the Goals |
doi_str_mv | 10.17632/hzgs5kz2bc.1 |
format | Dataset |
fullrecord | <record><control><sourceid>datacite_PQ8</sourceid><recordid>TN_cdi_datacite_primary_10_17632_hzgs5kz2bc_1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_17632_hzgs5kz2bc_1</sourcerecordid><originalsourceid>FETCH-datacite_primary_10_17632_hzgs5kz2bc_13</originalsourceid><addsrcrecordid>eNqVjj0LwkAQBa-xELW031LBaO6CWqvxoxTU-tjEVRfPGG5PIfn1igjWVsOD4TFKdXU81NNJYkaX-izja22yfKibys091uwYCzgU_CQvHCrYYkkeZpkEj3kQSDGgUICeiXUSmdgk_QEsHIrwiekIWQW7hwTkAjNHkNKT3L28URFgfUcn0Nula-m3VeP0XtT5sqWi1XK_2ETH93_OgWzp-Ya-sjq2n1r7q7U6-dd_Ab03Tvc</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>dataset</recordtype></control><display><type>dataset</type><title>Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)</title><source>DataCite</source><creator>Lino Ferreira da Silva Barros, Maicon Herverton</creator><creatorcontrib>Lino Ferreira da Silva Barros, Maicon Herverton</creatorcontrib><description>The data was collected from SciVal, a platform that hosts Scopus statistics. All metadata were obtained from the top 25 Brazilian universities between 2013 and 2023, according to the Center for World University Ranking (CWUR) in 2023. The dataset contains abstracts of published scientific papers classified according to the Sustainable Development Goals (SDGs) by the Scopus team. The original dataset consists of 15,488 records and 20 columns. We preprocessed the data to train a language model capable of classifying Brazilian research projects according to the SDGs.
During preprocessing, we removed duplicate records, multi-label entries, samples missing abstracts, and unnecessary columns. The preprocessed dataset contains 13,789 records and two columns, where the SDG classification is represented in the "label" column. The classification ranged from 1 to 17 representing all 17 SDGs in order.
After preprocessing the dataset, we balanced it by equalizing the majority and minority classes to 300 records per class. In other words, for majority classes with more than 300 records, we reduced the count to 300. For minority classes with fewer than 300 records, we generated the remaining records using the generative model Mixtral-8x7B-Instruct-v0.1, using the real abstracts as examples. This dataset serves as a valuable resource for training language models tailored to classify scientific texts from Brazil based on the SDGs.
The 17 SDGs are:
1. No Poverty
2. Zero Hunger
3. Good Health and Well-being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry Innovation and Infrastructure
10. Reduced Inequality
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace Justice and Strong Institutions
17. Partnerships for the Goals</description><identifier>DOI: 10.17632/hzgs5kz2bc.1</identifier><language>eng</language><publisher>Mendeley Data</publisher><subject>Deep Learning ; Large Language Model ; Machine Learning</subject><creationdate>2024</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0002-0275-3298</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,1887</link.rule.ids><linktorsrc>$$Uhttps://commons.datacite.org/doi.org/10.17632/hzgs5kz2bc.1$$EView_record_in_DataCite.org$$FView_record_in_$$GDataCite.org$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>Lino Ferreira da Silva Barros, Maicon Herverton</creatorcontrib><title>Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)</title><description>The data was collected from SciVal, a platform that hosts Scopus statistics. All metadata were obtained from the top 25 Brazilian universities between 2013 and 2023, according to the Center for World University Ranking (CWUR) in 2023. The dataset contains abstracts of published scientific papers classified according to the Sustainable Development Goals (SDGs) by the Scopus team. The original dataset consists of 15,488 records and 20 columns. We preprocessed the data to train a language model capable of classifying Brazilian research projects according to the SDGs.
During preprocessing, we removed duplicate records, multi-label entries, samples missing abstracts, and unnecessary columns. The preprocessed dataset contains 13,789 records and two columns, where the SDG classification is represented in the "label" column. The classification ranged from 1 to 17 representing all 17 SDGs in order.
After preprocessing the dataset, we balanced it by equalizing the majority and minority classes to 300 records per class. In other words, for majority classes with more than 300 records, we reduced the count to 300. For minority classes with fewer than 300 records, we generated the remaining records using the generative model Mixtral-8x7B-Instruct-v0.1, using the real abstracts as examples. This dataset serves as a valuable resource for training language models tailored to classify scientific texts from Brazil based on the SDGs.
The 17 SDGs are:
1. No Poverty
2. Zero Hunger
3. Good Health and Well-being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry Innovation and Infrastructure
10. Reduced Inequality
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace Justice and Strong Institutions
17. Partnerships for the Goals</description><subject>Deep Learning</subject><subject>Large Language Model</subject><subject>Machine Learning</subject><fulltext>true</fulltext><rsrctype>dataset</rsrctype><creationdate>2024</creationdate><recordtype>dataset</recordtype><sourceid>PQ8</sourceid><recordid>eNqVjj0LwkAQBa-xELW031LBaO6CWqvxoxTU-tjEVRfPGG5PIfn1igjWVsOD4TFKdXU81NNJYkaX-izja22yfKibys091uwYCzgU_CQvHCrYYkkeZpkEj3kQSDGgUICeiXUSmdgk_QEsHIrwiekIWQW7hwTkAjNHkNKT3L28URFgfUcn0Nula-m3VeP0XtT5sqWi1XK_2ETH93_OgWzp-Ya-sjq2n1r7q7U6-dd_Ab03Tvc</recordid><startdate>20240624</startdate><enddate>20240624</enddate><creator>Lino Ferreira da Silva Barros, Maicon Herverton</creator><general>Mendeley Data</general><scope>DYCCY</scope><scope>PQ8</scope><orcidid>https://orcid.org/0000-0002-0275-3298</orcidid></search><sort><creationdate>20240624</creationdate><title>Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)</title><author>Lino Ferreira da Silva Barros, Maicon Herverton</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-datacite_primary_10_17632_hzgs5kz2bc_13</frbrgroupid><rsrctype>datasets</rsrctype><prefilter>datasets</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Deep Learning</topic><topic>Large Language Model</topic><topic>Machine Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Lino Ferreira da Silva Barros, Maicon Herverton</creatorcontrib><collection>DataCite (Open Access)</collection><collection>DataCite</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Lino Ferreira da Silva Barros, Maicon Herverton</au><format>book</format><genre>unknown</genre><ristype>DATA</ristype><title>Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)</title><date>2024-06-24</date><risdate>2024</risdate><abstract>The data 
was collected from SciVal, a platform that hosts Scopus statistics. All metadata were obtained from the top 25 Brazilian universities between 2013 and 2023, according to the Center for World University Ranking (CWUR) in 2023. The dataset contains abstracts of published scientific papers classified according to the Sustainable Development Goals (SDGs) by the Scopus team. The original dataset consists of 15,488 records and 20 columns. We preprocessed the data to train a language model capable of classifying Brazilian research projects according to the SDGs.
During preprocessing, we removed duplicate records, multi-label entries, samples missing abstracts, and unnecessary columns. The preprocessed dataset contains 13,789 records and two columns, where the SDG classification is represented in the "label" column. The classification ranged from 1 to 17 representing all 17 SDGs in order.
After preprocessing the dataset, we balanced it by equalizing the majority and minority classes to 300 records per class. In other words, for majority classes with more than 300 records, we reduced the count to 300. For minority classes with fewer than 300 records, we generated the remaining records using the generative model Mixtral-8x7B-Instruct-v0.1, using the real abstracts as examples. This dataset serves as a valuable resource for training language models tailored to classify scientific texts from Brazil based on the SDGs.
The 17 SDGs are:
1. No Poverty
2. Zero Hunger
3. Good Health and Well-being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry Innovation and Infrastructure
10. Reduced Inequality
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace Justice and Strong Institutions
17. Partnerships for the Goals</abstract><pub>Mendeley Data</pub><doi>10.17632/hzgs5kz2bc.1</doi><orcidid>https://orcid.org/0000-0002-0275-3298</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.17632/hzgs5kz2bc.1 |
ispartof | |
issn | |
language | eng |
recordid | cdi_datacite_primary_10_17632_hzgs5kz2bc_1 |
source | DataCite |
subjects | Deep Learning; Large Language Model; Machine Learning |
title | Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs) |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T13%3A36%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-datacite_PQ8&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.au=Lino%20Ferreira%20da%20Silva%20Barros,%20Maicon%20Herverton&rft.date=2024-06-24&rft_id=info:doi/10.17632/hzgs5kz2bc.1&rft_dat=%3Cdatacite_PQ8%3E10_17632_hzgs5kz2bc_1%3C/datacite_PQ8%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
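
The preprocessing and balancing steps named in the description can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code. The record keys `abstract` and `labels`, the function names, deduplication by exact abstract text, and random down-sampling are all assumptions; only the "label" column, the 1 to 17 label range, and the target of 300 records per class come from the description. The synthetic generation step is not reproduced: the sketch only reports how many records each minority class would still need.

```python
import random
from collections import Counter, defaultdict

TARGET_PER_CLASS = 300  # per-class size stated in the description


def preprocess(records):
    """Drop duplicates, multi-label entries, and records without an abstract.

    Each input record is a dict with the hypothetical keys "abstract"
    (str or None) and "labels" (list of SDG numbers). The output keeps
    two columns: "abstract" and "label".
    """
    seen = set()
    cleaned = []
    for rec in records:
        abstract = (rec.get("abstract") or "").strip()
        labels = rec.get("labels") or []
        if not abstract:       # sample missing its abstract
            continue
        if len(labels) != 1:   # multi-label (or unlabeled) entry
            continue
        if abstract in seen:   # duplicate record (assumed: same abstract text)
            continue
        seen.add(abstract)
        cleaned.append({"abstract": abstract, "label": labels[0]})
    return cleaned


def balance(rows, target=TARGET_PER_CLASS, seed=0):
    """Cap majority classes at `target` and report each minority deficit.

    Returns (balanced_rows, deficit), where deficit maps an SDG label to
    the number of records still missing to reach `target`.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    balanced, deficit = [], {}
    for label in range(1, 18):  # SDGs 1 to 17
        group = by_label.get(label, [])
        if len(group) > target:
            group = rng.sample(group, target)  # down-sample a majority class
        elif len(group) < target:
            deficit[label] = target - len(group)
        balanced.extend(group)
    return balanced, deficit


# Toy data with a small target so the behaviour is easy to follow.
toy = [
    {"abstract": "A", "labels": [3]},
    {"abstract": "A", "labels": [3]},      # duplicate
    {"abstract": "B", "labels": [3, 13]},  # multi-label
    {"abstract": None, "labels": [4]},     # missing abstract
    {"abstract": "C", "labels": [3]},
    {"abstract": "D", "labels": [3]},
    {"abstract": "E", "labels": [4]},
]
rows = preprocess(toy)
balanced, deficit = balance(rows, target=2)
counts = Counter(row["label"] for row in balanced)
```

With the toy input, SDG 3 is capped at the target, SDG 4 is one record short, and every class with no records is short by the full target. In the published dataset those shortfalls are the records generated with Mixtral-8x7B-Instruct-v0.1.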