Lost in Translation: Large Language Models in Non-English Content Analysis

In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as ch...

Bibliographic details
Published in: arXiv.org 2023-06
Main authors: Gabriel, Nicholas; Bhatia, Aliya
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Gabriel, Nicholas
Bhatia, Aliya
description In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when researching, developing, and deploying large and multilingual language models.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2825642450</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2825642450</sourcerecordid><originalsourceid>FETCH-proquest_journals_28256424503</originalsourceid><addsrcrecordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2825642450</pqid></control><display><type>article</type><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><source>Free E- Journals</source><creator>Gabriel, Nicholas ; Bhatia, Aliya</creator><creatorcontrib>Gabriel, Nicholas ; Bhatia, Aliya</creatorcontrib><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. 
Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Artificial intelligence ; Content analysis ; English language ; Language ; Languages ; Large language models ; Multilingualism ; Non-English languages ; Search engines</subject><ispartof>arXiv.org, 2023-06</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><title>arXiv.org</title><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. 
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><subject>Artificial intelligence</subject><subject>Content analysis</subject><subject>English language</subject><subject>Language</subject><subject>Languages</subject><subject>Large language models</subject><subject>Multilingualism</subject><subject>Non-English languages</subject><subject>Search engines</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</recordid><startdate>20230612</startdate><enddate>20230612</enddate><creator>Gabriel, Nicholas</creator><creator>Bhatia, Aliya</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20230612</creationdate><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><author>Gabriel, Nicholas ; Bhatia, Aliya</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28256424503</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial intelligence</topic><topic>Content analysis</topic><topic>English language</topic><topic>Language</topic><topic>Languages</topic><topic>Large language models</topic><topic>Multilingualism</topic><topic>Non-English languages</topic><topic>Search engines</topic><toplevel>online_resources</toplevel><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern 
Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gabriel, Nicholas</au><au>Bhatia, Aliya</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Lost in Translation: Large Language Models in Non-English Content Analysis</atitle><jtitle>arXiv.org</jtitle><date>2023-06-12</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. 
Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2825642450
source Free E-Journals
subjects Artificial intelligence
Content analysis
English language
Language
Languages
Large language models
Multilingualism
Non-English languages
Search engines
title Lost in Translation: Large Language Models in Non-English Content Analysis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T09%3A43%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Lost%20in%20Translation:%20Large%20Language%20Models%20in%20Non-English%20Content%20Analysis&rft.jtitle=arXiv.org&rft.au=Gabriel,%20Nicholas&rft.date=2023-06-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2825642450%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2825642450&rft_id=info:pmid/&rfr_iscdi=true