Lost in Translation: Large Language Models in Non-English Content Analysis

In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as ch...

Bibliographic details
Published in:	arXiv.org 2023-06
Main authors:	Gabriel, Nicholas ; Bhatia, Aliya
Format:	Article
Language:	eng
Subjects:	Artificial intelligence ; Content analysis ; English language ; Language ; Languages ; Large language models ; Multilingualism ; Non-English languages ; Search engines
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Gabriel, Nicholas ; Bhatia, Aliya
description	In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2825642450</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2825642450</sourcerecordid><originalsourceid>FETCH-proquest_journals_28256424503</originalsourceid><addsrcrecordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2825642450</pqid></control><display><type>article</type><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><source>Free E- Journals</source><creator>Gabriel, Nicholas ; Bhatia, Aliya</creator><creatorcontrib>Gabriel, Nicholas ; Bhatia, Aliya</creatorcontrib><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. 
Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Artificial intelligence ; Content analysis ; English language ; Language ; Languages ; Large language models ; Multilingualism ; Non-English languages ; Search engines</subject><ispartof>arXiv.org, 2023-06</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><title>arXiv.org</title><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. 
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><subject>Artificial intelligence</subject><subject>Content analysis</subject><subject>English language</subject><subject>Language</subject><subject>Languages</subject><subject>Large language models</subject><subject>Multilingualism</subject><subject>Non-English languages</subject><subject>Search engines</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</recordid><startdate>20230612</startdate><enddate>20230612</enddate><creator>Gabriel, Nicholas</creator><creator>Bhatia, Aliya</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20230612</creationdate><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><author>Gabriel, Nicholas ; Bhatia, Aliya</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28256424503</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial intelligence</topic><topic>Content analysis</topic><topic>English language</topic><topic>Language</topic><topic>Languages</topic><topic>Large language models</topic><topic>Multilingualism</topic><topic>Non-English languages</topic><topic>Search engines</topic><toplevel>online_resources</toplevel><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition 
(DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gabriel, Nicholas</au><au>Bhatia, Aliya</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Lost in Translation: Large Language Models in Non-English Content Analysis</atitle><jtitle>arXiv.org</jtitle><date>2023-06-12</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. 
Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2023-06
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2825642450
source	Free E- Journals
subjects	Artificial intelligence ; Content analysis ; English language ; Language ; Languages ; Large language models ; Multilingualism ; Non-English languages ; Search engines
title	Lost in Translation: Large Language Models in Non-English Content Analysis
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T09%3A43%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Lost%20in%20Translation:%20Large%20Language%20Models%20in%20Non-English%20Content%20Analysis&rft.jtitle=arXiv.org&rft.au=Gabriel,%20Nicholas&rft.date=2023-06-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2825642450%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2825642450&rft_id=info:pmid/&rfr_iscdi=true