Toloka Visual Question Answering Benchmark
In this paper, we present Toloka Visual Question Answering, a new crowdsourced dataset allowing comparing performance of machine learning systems against human level of expertise in the grounding visual question answering task. In this task, given an image and a textual question, one has to draw the...
Gespeichert in:
Hauptverfasser: | , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In this paper, we present Toloka Visual Question Answering, a new
crowdsourced dataset allowing comparing performance of machine learning systems
against human level of expertise in the grounding visual question answering
task. In this task, given an image and a textual question, one has to draw the
bounding box around the object correctly responding to that question. Every
image-question pair contains the response, with only one correct response per
image. Our dataset contains 45,199 pairs of images and questions in English,
provided with ground truth bounding boxes, split into train and two test
subsets. Besides describing the dataset and releasing it under a CC BY license,
we conducted a series of experiments on open source zero-shot baseline models
and organized a multi-phase competition at WSDM Cup that attracted 48
participants worldwide. However, by the time of paper submission, no machine
learning model outperformed the non-expert crowdsourcing baseline according to
the intersection over union evaluation score. |
---|---|
DOI: | 10.48550/arxiv.2309.16511 |