LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias
Textual noise, such as typos or abbreviations, is a well-known issue that penalizes vanilla Transformers for most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g. matching, retrieval or paraphrasing. Sentence similarity can be...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Textual noise, such as typos or abbreviations, is a well-known issue that
penalizes vanilla Transformers for most downstream tasks. We show that this is
also the case for sentence similarity, a fundamental task in multiple domains,
e.g. matching, retrieval or paraphrasing. Sentence similarity can be approached
using cross-encoders, where the two sentences are concatenated in the input
allowing the model to exploit the inter-relations between them. Previous works
addressing the noise issue mainly rely on data augmentation strategies, showing
improved robustness when dealing with corrupted samples that are similar to the
ones used for training. However, all these methods still suffer from the token
distribution shift induced by typos. In this work, we propose to tackle textual
noise by equipping cross-encoders with a novel LExical-aware Attention module
(LEA) that incorporates lexical similarities between words in both sentences.
By using raw text similarities, our approach avoids the tokenization shift
problem obtaining improved robustness. We demonstrate that the attention bias
introduced by LEA helps cross-encoders to tackle complex scenarios with textual
noise, specially in domains with short-text descriptions and limited context.
Experiments using three popular Transformer encoders in five e-commerce
datasets for product matching show that LEA consistently boosts performance
under the presence of noise, while remaining competitive on the original
(clean) splits. We also evaluate our approach in two datasets for textual
entailment and paraphrasing showing that LEA is robust to typos in domains with
longer sentences and more natural context. Additionally, we thoroughly analyze
several design choices in our approach, providing insights about the impact of
the decisions made and fostering future research in cross-encoders dealing with
typos. |
---|---|
DOI: | 10.48550/arxiv.2307.02912 |