Lost in Translation: Large Language Models in Non-English Content Analysis
In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as ch...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa,
Google's PaLM) have become the dominant approach for building AI systems to
analyze and generate language online. However, the automated systems that
increasingly mediate our interactions online -- such as chatbots, content
moderation systems, and search engines -- are primarily designed for and work
far more effectively in English than in the world's other 7,000 languages.
Recently, researchers and technology companies have attempted to extend the
capabilities of large language models into languages other than English by
building what are called multilingual language models.
In this paper, we explain how these multilingual language models work and
explore their capabilities and limits. Part I provides a simple technical
explanation of how large language models work, why there is a gap in available
data between English and other languages, and how multilingual language models
attempt to bridge that gap. Part II accounts for the challenges of doing
content analysis with large language models in general and multilingual
language models in particular. Part III offers recommendations for companies,
researchers, and policymakers to keep in mind when considering researching,
developing and deploying large and multilingual language models. |
---|---|
DOI: | 10.48550/arxiv.2306.07377 |