A Brazilian Portuguese Dataset for Offline Handwritten Text Recognition (BRESSAY)

Bibliographic Details
Main authors: Arthur F. S. Neto, Byron L. D. Bezerra, Sávio S. Araújo, Wiliane M. A. S. Souza, Kléberson F. Alves, Macileide F. Oliveira, Samara V. S. Lins, Hugo J. F. Hazin, Pedro H. V. Rocha, Alejandro H. Toselli
Format: Dataset
Language: Portuguese (por)
Description
Summary: The BRESSAY dataset comprises images of handwritten essays in Brazilian Portuguese, which pose a series of challenges to optical recognition models. The images were sourced from multiple online platforms, which limited our ability to standardize the capture process. Because of these varied sources and the lack of a uniform collection method, the dataset reflects real-world conditions. Each essay is written by a different writer and addresses a specific topic. Furthermore, the constraints placed on the writers often lead to difficult handwriting scenarios, including hard-to-read words, connected words, noise, overwriting, and struck-through text.

Technical Details
The BRESSAY dataset is a comprehensive collection of handwritten essays in Brazilian Portuguese covering a wide range of handwriting scenarios. It contains 1,000 pages, each contributed by a unique writer, resulting in 1,000 distinct handwriting styles. In total, it comprises 4,214 paragraphs, 30,090 lines, and 416,826 words. The vocabulary consists of 41,318 unique words and 107 unique characters.

Data Structure
The dataset is organized as follows:
data/: Main folder containing segmented essay images
  lines/: Images of individual lines
    PNG files: Line images
    TXT files: Transcriptions of lines
  pages/: Full page essay images
    PNG files: Page images
    TXT files: Transcriptions of pages
  paragraphs/: Images of paragraphs
    PNG files: Paragraph images
    TXT files: Transcriptions of paragraphs
  words/: Images of individual words
    PNG files: Word images
    TXT files: Transcriptions of words
sets/: Contains partition files
  test.txt: Names of images in the test set
  validation.txt: Names of images in the validation set
  training.txt: Names of images in the training set

Dataset Usage and Annotations
Each name in test.txt, validation.txt, and training.txt is the name of a page, and all content of that page (words, lines, paragraphs) must be in the respective partition.
Annotations used in the dataset:
  ##@@???@@##: Superscript text that has become unidentifiable and unreadable.
  $$@@???@@$$: Subscript text that has become unidentifiable and unreadable.
  @@???@@: Text that cannot be read or identified due to its illegibility.
  ##--xxx--##: Text that has been added as a superscript and subsequently crossed out, rendering it illegible.
  $$--xxx--$$: Tex
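
The partition rule and the annotation markers described above could be handled along the following lines. This is a minimal Python sketch, not the authors' tooling. It assumes that each partition file lists one page name per line and that the markers appear literally in the TXT transcriptions; neither detail is confirmed by the description. The function names (load_partitions, check_disjoint, count_markers) are hypothetical, and only the four markers whose descriptions are complete in this record are included.

```python
import re
from collections import Counter
from pathlib import Path

# Markers copied from the dataset description. The labels are our own shorthand.
MARKERS = {
    "##@@???@@##": "illegible superscript",
    "$$@@???@@$$": "illegible subscript",
    "##--xxx--##": "crossed-out superscript",
    "@@???@@": "illegible text",
}

# Longer markers are listed first so that "@@???@@" never wins over a
# marker that wraps it (for example "##@@???@@##").
_MARKER_RE = re.compile(
    "|".join(re.escape(m) for m in sorted(MARKERS, key=len, reverse=True))
)


def load_partitions(sets_dir):
    """Read training/validation/test page names from a sets/ folder.

    Assumption: one page name per line; blank lines are ignored.
    """
    partitions = {}
    for split in ("training", "validation", "test"):
        text = (Path(sets_dir) / f"{split}.txt").read_text(encoding="utf-8")
        partitions[split] = {ln.strip() for ln in text.splitlines() if ln.strip()}
    return partitions


def check_disjoint(partitions):
    """Raise ValueError if any page name appears in more than one partition."""
    names = list(partitions)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = partitions[first] & partitions[second]
            if overlap:
                raise ValueError(
                    f"{first} and {second} share pages: {sorted(overlap)}"
                )


def count_markers(transcription):
    """Count occurrences of each known annotation marker in a transcription."""
    counts = Counter()
    for match in _MARKER_RE.finditer(transcription):
        counts[MARKERS[match.group(0)]] += 1
    return counts
```

Calling check_disjoint(load_partitions(path_to_sets)) before training is one way to confirm that no page leaks across partitions, and count_markers can help decide whether to filter or mask lines that contain illegible spans.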
DOI: 10.5281/zenodo.11637680