Représentations lexicales pour la détection non supervisée d'événements dans un flux de tweets : étude sur des corpus français et anglais

In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French ann...

Detailed description

Bibliographic details
Main authors: Mazoyer, Béatrice; Hervé, Nicolas; Hudelot, Céline; Cage, Julia
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Mazoyer, Béatrice
Hervé, Nicolas
Hudelot, Céline
Cage, Julia
description In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in many applications, are not well suited to this task. We also experiment with different types of fine-tuning to improve these results on French data. Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.
doi_str_mv 10.48550/arxiv.2001.04139
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2001_04139</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2001_04139</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2001_041393</originalsourceid><addsrcrecordid>eNqFj0FrAjEQhXPpQWp_gKfOzZNrtiqoV7H0LD0Ky5DMlkDMhkyyXSn9If23nRXvHh5vePP44Ck1q3W13m42eolpcH31pnVd6XW92k3U34liOs-JKWTMrgsMngZn0BND7EoCj2ClkMmMbwgiLpFS71hiAjsX60WBLgJhsCiQEqD1ZQBLkL-JJN7DSCkSsFCt4E2XYmFoE4az-TG_6BgoA4YvL-dUPbXomV7u_qxe34-fh4_FbUQTk7tgujbjmOY2ZvW48Q8ALlcu</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Repr\'esentations lexicales pour la d\'etection non supervis\'ee d'\'ev\'enements dans un flux de tweets : \'etude sur des corpus fran\c{c}ais et anglais</title><source>arXiv.org</source><creator>Mazoyer, Béatrice ; Hervé, Nicolas ; Hudelot, Céline ; Cage, Julia</creator><creatorcontrib>Mazoyer, Béatrice ; Hervé, Nicolas ; Hudelot, Céline ; Cage, Julia</creatorcontrib><description>In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem.Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising on many applications, are not very suitable for this task. We also experiment with different types of fine-tuning to improve these results on French data. 
Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.</description><identifier>DOI: 10.48550/arxiv.2001.04139</identifier><language>eng</language><subject>Computer Science - Information Retrieval ; Computer Science - Social and Information Networks</subject><creationdate>2020-01</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2001.04139$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2001.04139$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Mazoyer, Béatrice</creatorcontrib><creatorcontrib>Hervé, Nicolas</creatorcontrib><creatorcontrib>Hudelot, Céline</creatorcontrib><creatorcontrib>Cage, Julia</creatorcontrib><title>Repr\'esentations lexicales pour la d\'etection non supervis\'ee d'\'ev\'enements dans un flux de tweets : \'etude sur des corpus fran\c{c}ais et anglais</title><description>In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem.Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising on many applications, are not very suitable for this task. We also experiment with different types of fine-tuning to improve these results on French data. 
Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.</description><subject>Computer Science - Information Retrieval</subject><subject>Computer Science - Social and Information Networks</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNqFj0FrAjEQhXPpQWp_gKfOzZNrtiqoV7H0LD0Ky5DMlkDMhkyyXSn9If23nRXvHh5vePP44Ck1q3W13m42eolpcH31pnVd6XW92k3U34liOs-JKWTMrgsMngZn0BND7EoCj2ClkMmMbwgiLpFS71hiAjsX60WBLgJhsCiQEqD1ZQBLkL-JJN7DSCkSsFCt4E2XYmFoE4az-TG_6BgoA4YvL-dUPbXomV7u_qxe34-fh4_FbUQTk7tgujbjmOY2ZvW48Q8ALlcu</recordid><startdate>20200113</startdate><enddate>20200113</enddate><creator>Mazoyer, Béatrice</creator><creator>Hervé, Nicolas</creator><creator>Hudelot, Céline</creator><creator>Cage, Julia</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20200113</creationdate><title>Repr\'esentations lexicales pour la d\'etection non supervis\'ee d'\'ev\'enements dans un flux de tweets : \'etude sur des corpus fran\c{c}ais et anglais</title><author>Mazoyer, Béatrice ; Hervé, Nicolas ; Hudelot, Céline ; Cage, Julia</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2001_041393</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Computer Science - Information Retrieval</topic><topic>Computer Science - Social and Information Networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Mazoyer, Béatrice</creatorcontrib><creatorcontrib>Hervé, Nicolas</creatorcontrib><creatorcontrib>Hudelot, Céline</creatorcontrib><creatorcontrib>Cage, Julia</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Mazoyer, Béatrice</au><au>Hervé, Nicolas</au><au>Hudelot, Céline</au><au>Cage, Julia</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Repr\'esentations lexicales pour la d\'etection non supervis\'ee d'\'ev\'enements dans un flux de tweets : \'etude sur des corpus fran\c{c}ais et anglais</atitle><date>2020-01-13</date><risdate>2020</risdate><abstract>In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem.Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising on many applications, are not very suitable for this task. We also experiment with different types of fine-tuning to improve these results on French data. Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.</abstract><doi>10.48550/arxiv.2001.04139</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2001.04139
ispartof
issn
language eng
recordid cdi_arxiv_primary_2001_04139
source arXiv.org
subjects Computer Science - Information Retrieval
Computer Science - Social and Information Networks
title Représentations lexicales pour la détection non supervisée d'événements dans un flux de tweets : étude sur des corpus français et anglais
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A18%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Repr%5C'esentations%20lexicales%20pour%20la%20d%5C'etection%20non%20supervis%5C'ee%20d'%5C'ev%5C'enements%20dans%20un%20flux%20de%20tweets%20:%20%5C'etude%20sur%20des%20corpus%20fran%5Cc%7Bc%7Dais%20et%20anglais&rft.au=Mazoyer,%20B%C3%A9atrice&rft.date=2020-01-13&rft_id=info:doi/10.48550/arxiv.2001.04139&rft_dat=%3Carxiv_GOX%3E2001_04139%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true