Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | A distinction is often drawn between a model's ability to predict a label for
an evaluation sample that is directly memorised from highly similar training
samples versus an ability to predict the label via some method of
generalisation. In the context of using Language Models for question-answering,
debate continues over the extent to which questions are answered
through memorisation. We consider this issue for questions that would ideally
be answered through reasoning over an associated context. We propose a method
of identifying evaluation samples for which it is very unlikely our model would
have memorised the answers. Our method is based on semantic similarity of input
tokens and label tokens between training and evaluation samples. We show that
our method offers advantages over some prior approaches in that it is able to
surface evaluation-train pairs that have overlap in either contiguous or
discontiguous sequences of tokens. We use this method to identify unmemorisable
subsets of our evaluation datasets. We train two Language Models in a multitask
fashion whereby the second model differs from the first only in that it has two
additional datasets added to the training regime that are designed to impart
simple numerical reasoning strategies of a sort known to improve performance on
some of our evaluation datasets but not on others. We then show that there is
performance improvement between the two models on the unmemorisable subsets of
the evaluation datasets that were expected to benefit from the additional
training datasets. Specifically, performance on unmemorisable subsets of two of
our evaluation datasets, DROP and ROPES, improves significantly by 9.0% and
25.7% respectively, while the other evaluation datasets show no significant change
in performance. |
---|---|
DOI: | 10.48550/arxiv.2311.12337 |
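
The summary describes selecting evaluation samples that are unlikely to have been memorised, based on the similarity of both input tokens and label tokens to the training samples. The following is a minimal sketch of that idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the bag-of-words `embed` function stands in for whatever semantic encoder the paper uses, the equal weighting of input and label similarity is arbitrary, and the `threshold` value of 0.6 is hypothetical. The helper names (`embed`, `cosine`, `max_train_similarity`, `unmemorisable_subset`) are invented here.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding (assumption): an order-independent bag-of-words vector.

    A real system would likely use a learned sentence encoder instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_train_similarity(eval_sample, train_samples):
    """Highest combined input/label similarity to any training sample."""
    eval_input = embed(eval_sample["input"])
    eval_label = embed(eval_sample["label"])
    best = 0.0
    for train in train_samples:
        sim_input = cosine(eval_input, embed(train["input"]))
        sim_label = cosine(eval_label, embed(train["label"]))
        # Equal weighting of the two similarities is an illustrative choice.
        best = max(best, 0.5 * (sim_input + sim_label))
    return best


def unmemorisable_subset(eval_samples, train_samples, threshold=0.6):
    """Keep evaluation samples whose closest training sample is below threshold."""
    return [
        sample
        for sample in eval_samples
        if max_train_similarity(sample, train_samples) < threshold
    ]
```

Because the stand-in vectors ignore token order, overlap is detected whether the shared tokens are contiguous or scattered, which loosely mirrors the property the summary claims for the proposed method. Model performance would then be compared on the returned subset only.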