Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings

An essential part of monitoring machine learning models in production is measuring input and output data drift. In this paper, we present a system for measuring distributional shifts in natural language data, and we highlight and investigate the potential advantage of using large language models (LLMs) for this problem. Recent advancements in LLMs and their successful adoption in different domains indicate their effectiveness in capturing semantic relationships for solving various natural language processing problems. The power of LLMs comes largely from the encodings (embeddings) generated in the hidden layers of the corresponding neural network. First, we propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings. Then, we study the effectiveness of our approach when applied to text embeddings generated by both LLMs and classical embedding algorithms. Our experiments show that general-purpose LLM-based embeddings provide high sensitivity to data drift compared to other embedding methods. We propose drift sensitivity as an important evaluation metric to consider when comparing language models. Finally, we present insights and lessons learned from deploying our framework as part of the Fiddler ML Monitoring platform over a period of 18 months.
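The record itself contains no code, but the clustering-based drift measure described in the abstract is easy to illustrate. The following is a minimal sketch, not the authors' exact algorithm: it assumes text embeddings are already available as NumPy arrays, partitions a baseline sample with k-means, histograms both the baseline and a production window over the resulting clusters, and compares the two histograms with the Jensen-Shannon distance. The function name drift_score, the choice of k-means, and the Jensen-Shannon comparison are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def drift_score(baseline_emb: np.ndarray,
                production_emb: np.ndarray,
                n_clusters: int = 10,
                seed: int = 0) -> float:
    """Hypothetical clustering-based drift score.

    Partitions the baseline embeddings into clusters, histograms both
    windows over those clusters, and returns the Jensen-Shannon
    distance between the histograms (0 = identical, 1 = disjoint).
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    baseline_labels = kmeans.fit_predict(baseline_emb)
    # Assign production embeddings to the nearest baseline centroid.
    production_labels = kmeans.predict(production_emb)

    # Normalized cluster-occupancy histograms for each window.
    p = np.bincount(baseline_labels, minlength=n_clusters) / len(baseline_labels)
    q = np.bincount(production_labels, minlength=n_clusters) / len(production_labels)

    return float(jensenshannon(p, q, base=2))
```

Under this sketch, the paper's notion of drift sensitivity could be probed by running the same baseline/production text pairs through several embedding models and comparing the resulting scores: a more drift-sensitive embedding yields a larger score for the same semantic shift.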

Bibliographic Details
Main Authors: Gupta, Gyandev; Rastegarpanah, Bashir; Iyer, Amalendu; Rubin, Joshua; Kenthapadi, Krishnaram
Format: Article
Language: English
Published: 2023-12-04
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2312.02337
Source: arXiv.org
Rights: http://creativecommons.org/licenses/by-nc-nd/4.0
Online Access: https://arxiv.org/abs/2312.02337