OpenLLMText Dataset

The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL). 60k of them are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019. 60k of t...

Detailed description

Bibliographic details
Main authors: Chen, Yutian; Kang, Hao; Zhai, Yiyan; Li, Liangze; Singh, Rita; Raj, Bhiksha
Format: Dataset
Language: eng
creator Chen, Yutian
Kang, Hao
Zhai, Yiyan
Li, Liangze
Singh, Rita
Raj, Bhiksha
description The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL). 60k of them are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019. 60k of them are ChatGPT's (gpt3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data. 60k of them are PaLM's (Pathways Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data. 60k of them are LLaMA-7B's (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data. 60k of them are adapted from the GPT2-output dataset released by OpenAI (GPT2-XL).
doi_str_mv 10.5281/zenodo.8285325
format Dataset
fullrecord <record><control><sourceid>datacite_PQ8</sourceid><recordid>TN_cdi_datacite_primary_10_5281_zenodo_8285325</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_5281_zenodo_8285325</sourcerecordid><originalsourceid>FETCH-datacite_primary_10_5281_zenodo_82853253</originalsourceid><addsrcrecordid>eNpjYBAzNNAzNbIw1K9KzctPydezMLIwNTYy5WQQ9i9IzfPx8Q1JrShRcEksSSxOLeFhYE1LzClO5YXS3Ax6bq4hzh66KUD55MyS1PiCoszcxKLKeEODeJCp8RBT46GmGpOsAQB4Wy_O</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>dataset</recordtype></control><display><type>dataset</type><title>OpenLLMText Dataset</title><source>DataCite</source><creator>Chen, Yutian ; Kang, Hao ; Zhai, Yiyan ; Li, Liangze ; Singh, Rita ; Raj, Bhiksha</creator><creatorcontrib>Chen, Yutian ; Kang, Hao ; Zhai, Yiyan ; Li, Liangze ; Singh, Rita ; Raj, Bhiksha</creatorcontrib><description>The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL). 60k of them are Human-written, randomly selected from the OpenWebText dataset. These entries are collected from the user generated content from Reddit before 2019. 60k of them are the ChatGPT's (gpt3.5-turbo) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the PaLM's (Pathway Language Model, text-bison-001) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the LLaMA-7B's (Large Language Model Meta AI) paragraph-by-pargraph rephrasing for the human written data. 
60k of them are the data adapted from the GPT2-output dataset released by the OpenAI (GPT2-XL).</description><identifier>DOI: 10.5281/zenodo.8285325</identifier><language>eng</language><publisher>Zenodo</publisher><creationdate>2023</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0001-8008-9014</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,1894</link.rule.ids><linktorsrc>$$Uhttps://commons.datacite.org/doi.org/10.5281/zenodo.8285325$$EView_record_in_DataCite.org$$FView_record_in_$$GDataCite.org$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>Chen, Yutian</creatorcontrib><creatorcontrib>Kang, Hao</creatorcontrib><creatorcontrib>Zhai, Yiyan</creatorcontrib><creatorcontrib>Li, Liangze</creatorcontrib><creatorcontrib>Singh, Rita</creatorcontrib><creatorcontrib>Raj, Bhiksha</creatorcontrib><title>OpenLLMText Dataset</title><description>The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL). 60k of them are Human-written, randomly selected from the OpenWebText dataset. These entries are collected from the user generated content from Reddit before 2019. 60k of them are the ChatGPT's (gpt3.5-turbo) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the PaLM's (Pathway Language Model, text-bison-001) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the LLaMA-7B's (Large Language Model Meta AI) paragraph-by-pargraph rephrasing for the human written data. 
60k of them are the data adapted from the GPT2-output dataset released by the OpenAI (GPT2-XL).</description><fulltext>true</fulltext><rsrctype>dataset</rsrctype><creationdate>2023</creationdate><recordtype>dataset</recordtype><sourceid>PQ8</sourceid><recordid>eNpjYBAzNNAzNbIw1K9KzctPydezMLIwNTYy5WQQ9i9IzfPx8Q1JrShRcEksSSxOLeFhYE1LzClO5YXS3Ax6bq4hzh66KUD55MyS1PiCoszcxKLKeEODeJCp8RBT46GmGpOsAQB4Wy_O</recordid><startdate>20230826</startdate><enddate>20230826</enddate><creator>Chen, Yutian</creator><creator>Kang, Hao</creator><creator>Zhai, Yiyan</creator><creator>Li, Liangze</creator><creator>Singh, Rita</creator><creator>Raj, Bhiksha</creator><general>Zenodo</general><scope>DYCCY</scope><scope>PQ8</scope><orcidid>https://orcid.org/0000-0001-8008-9014</orcidid></search><sort><creationdate>20230826</creationdate><title>OpenLLMText Dataset</title><author>Chen, Yutian ; Kang, Hao ; Zhai, Yiyan ; Li, Liangze ; Singh, Rita ; Raj, Bhiksha</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-datacite_primary_10_5281_zenodo_82853253</frbrgroupid><rsrctype>datasets</rsrctype><prefilter>datasets</prefilter><language>eng</language><creationdate>2023</creationdate><toplevel>online_resources</toplevel><creatorcontrib>Chen, Yutian</creatorcontrib><creatorcontrib>Kang, Hao</creatorcontrib><creatorcontrib>Zhai, Yiyan</creatorcontrib><creatorcontrib>Li, Liangze</creatorcontrib><creatorcontrib>Singh, Rita</creatorcontrib><creatorcontrib>Raj, Bhiksha</creatorcontrib><collection>DataCite (Open Access)</collection><collection>DataCite</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Chen, Yutian</au><au>Kang, Hao</au><au>Zhai, Yiyan</au><au>Li, Liangze</au><au>Singh, Rita</au><au>Raj, Bhiksha</au><format>book</format><genre>unknown</genre><ristype>DATA</ristype><title>OpenLLMText Dataset</title><date>2023-08-26</date><risdate>2023</risdate><abstract>The dataset contains 
approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL). 60k of them are Human-written, randomly selected from the OpenWebText dataset. These entries are collected from the user generated content from Reddit before 2019. 60k of them are the ChatGPT's (gpt3.5-turbo) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the PaLM's (Pathway Language Model, text-bison-001) paragraph-by-paragraph rephrasing for the human written data. 60k of them are the LLaMA-7B's (Large Language Model Meta AI) paragraph-by-pargraph rephrasing for the human written data. 60k of them are the data adapted from the GPT2-output dataset released by the OpenAI (GPT2-XL).</abstract><pub>Zenodo</pub><doi>10.5281/zenodo.8285325</doi><orcidid>https://orcid.org/0000-0001-8008-9014</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.5281/zenodo.8285325
language eng
recordid cdi_datacite_primary_10_5281_zenodo_8285325
source DataCite
title OpenLLMText Dataset
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T19%3A07%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-datacite_PQ8&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.au=Chen,%20Yutian&rft.date=2023-08-26&rft_id=info:doi/10.5281/zenodo.8285325&rft_dat=%3Cdatacite_PQ8%3E10_5281_zenodo_8285325%3C/datacite_PQ8%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
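
The composition stated in the description field can be written down as a small sketch. The per-source counts below are the approximate figures from the record; the integer label encoding (`LABELS`) and the helper `class_shares` are hypothetical names introduced only for illustration, since the record does not define a label scheme or file layout.

```python
# Approximate per-source entry counts, as stated in the record's description.
SOURCE_COUNTS = {
    "Human": 60_000,    # randomly selected from OpenWebText
    "ChatGPT": 60_000,  # gpt3.5-turbo rephrasing of the human-written text
    "PaLM": 60_000,     # text-bison-001 rephrasing of the human-written text
    "LLaMA": 60_000,    # LLaMA-7B rephrasing of the human-written text
    "GPT2-XL": 60_000,  # adapted from the GPT2-output dataset
}

# Hypothetical integer label per source (assumption: the record defines none).
LABELS = {name: idx for idx, name in enumerate(SOURCE_COUNTS)}


def class_shares(counts):
    """Return each source's fraction of the total number of entries."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_entries = sum(SOURCE_COUNTS.values())
shares = class_shares(SOURCE_COUNTS)

print(f"total entries: {total_entries}")
for name, share in shares.items():
    print(f"{LABELS[name]} {name}: {share:.0%}")
```

Under these figures the five sources are balanced, so a classifier that always predicts one source would be right about one time in five.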