CoAM: Corpus of All-Type Multiword Expressions
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.
MWE identification, i.e., detecting MWEs in text, can play a key role in
downstream tasks such as machine translation. Existing datasets for MWE
identification are inconsistently annotated, limited to a single type of MWE,
or limited in size. To enable reliable and comprehensive evaluation, we created
CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences
constructed through a multi-step process of human annotation, human review, and
automated consistency checking designed to enhance data quality. MWEs in
CoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained
error analysis. Annotations for CoAM were collected using a new interface
created with our interface generator, which allows easy and flexible annotation
of MWEs in any form, including discontinuous ones. Through experiments using
CoAM, we find that a fine-tuned large language model outperforms the current
state-of-the-art approach for MWE identification. Furthermore, analysis using
our MWE-type-tagged data reveals that Verb MWEs are easier than Noun MWEs to
identify across approaches. |
---|---|
DOI: | 10.48550/arxiv.2412.18151 |
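
The summary mentions two ideas that a small example may make more concrete: MWEs can be discontinuous, and type tags allow a per-type error analysis. The Python sketch below is illustrative only and is not taken from CoAM or its tooling. The `GoldMWE` record, the `recall_by_type` function, the exact-match criterion, and the toy data are all assumptions made for this example. Each MWE is stored as a set of token indices, so a discontinuous expression such as "looked ... up" needs no special handling.

```python
from collections import defaultdict
from typing import NamedTuple


class GoldMWE(NamedTuple):
    """Hypothetical record for one annotated MWE (not the CoAM data format)."""
    sentence_id: int
    token_indices: frozenset  # indices need not be contiguous
    mwe_type: str             # e.g. "Noun" or "Verb"


def recall_by_type(gold, predicted):
    """Return, per MWE type, the fraction of gold MWEs that were predicted.

    `predicted` is a set of (sentence_id, token_indices) pairs. A gold MWE
    counts as found only if a prediction covers exactly the same tokens
    (an assumed matching criterion).
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for mwe in gold:
        total[mwe.mwe_type] += 1
        if (mwe.sentence_id, mwe.token_indices) in predicted:
            found[mwe.mwe_type] += 1
    return {t: found[t] / total[t] for t in total}


# Toy data invented for illustration.
# Sentence 0: "She looked the word up" -> "looked ... up" is tokens {1, 4}.
gold = [
    GoldMWE(0, frozenset({1, 4}), "Verb"),
    GoldMWE(1, frozenset({2, 3}), "Noun"),
    GoldMWE(2, frozenset({0, 1}), "Noun"),
]
predicted = {
    (0, frozenset({1, 4})),
    (1, frozenset({2, 3})),
}

scores = recall_by_type(gold, predicted)
print(scores)
```

In this toy setup, only the gold annotations carry a type, so the per-type figure is a recall. Reporting precision by type would additionally require typed predictions.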