Lost in Translation: Large Language Models in Non-English Content Analysis
In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as ch...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2023-06 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Gabriel, Nicholas Bhatia, Aliya |
description | In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2825642450</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2825642450</sourcerecordid><originalsourceid>FETCH-proquest_journals_28256424503</originalsourceid><addsrcrecordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2825642450</pqid></control><display><type>article</type><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><source>Free E- Journals</source><creator>Gabriel, Nicholas ; Bhatia, Aliya</creator><creatorcontrib>Gabriel, Nicholas ; Bhatia, Aliya</creatorcontrib><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Artificial intelligence ; Content analysis ; English language ; Language ; Languages ; Large language models ; Multilingualism ; Non-English languages ; Search engines</subject><ispartof>arXiv.org, 2023-06</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><title>arXiv.org</title><description>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</description><subject>Artificial intelligence</subject><subject>Content analysis</subject><subject>English language</subject><subject>Language</subject><subject>Languages</subject><subject>Large language models</subject><subject>Multilingualism</subject><subject>Non-English languages</subject><subject>Search engines</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNyrEKwjAUheEgCBbtOwScC_GmqcVNSkWkOnUvgcaaEm40Nx18eyv4AC7nP8O3YAlIucvKHGDFUqJRCAHFHpSSCbs0niK3yNugkZyO1uOBNzoMZl4cJj2fq--No6-6ecxqHJylB688RoORH1G7N1nasOVdOzLpr2u2PdVtdc6ewb8mQ7Eb_RRmTB2UoIocciXkf-oDZd48CA</recordid><startdate>20230612</startdate><enddate>20230612</enddate><creator>Gabriel, Nicholas</creator><creator>Bhatia, Aliya</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20230612</creationdate><title>Lost in Translation: Large Language Models in Non-English Content Analysis</title><author>Gabriel, Nicholas ; Bhatia, Aliya</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28256424503</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial intelligence</topic><topic>Content analysis</topic><topic>English language</topic><topic>Language</topic><topic>Languages</topic><topic>Large language models</topic><topic>Multilingualism</topic><topic>Non-English languages</topic><topic>Search engines</topic><toplevel>online_resources</toplevel><creatorcontrib>Gabriel, Nicholas</creatorcontrib><creatorcontrib>Bhatia, Aliya</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gabriel, Nicholas</au><au>Bhatia, Aliya</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Lost in Translation: Large Language Models in Non-English Content Analysis</atitle><jtitle>arXiv.org</jtitle><date>2023-06-12</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-06 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2825642450 |
source | Free E- Journals |
subjects | Artificial intelligence Content analysis English language Language Languages Large language models Multilingualism Non-English languages Search engines |
title | Lost in Translation: Large Language Models in Non-English Content Analysis |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T09%3A43%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Lost%20in%20Translation:%20Large%20Language%20Models%20in%20Non-English%20Content%20Analysis&rft.jtitle=arXiv.org&rft.au=Gabriel,%20Nicholas&rft.date=2023-06-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2825642450%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2825642450&rft_id=info:pmid/&rfr_iscdi=true |