A Brazilian Portuguese Dataset for Offline Handwritten Text Recognition (BRESSAY)
Format: Dataset
Language: Portuguese (por)
Summary: The BRESSAY dataset comprises images of handwritten essays in Brazilian Portuguese, which present a series of challenges to optical recognition models. These images were sourced from multiple online platforms, limiting our ability to standardize the capture process. Because of these varied sources and the lack of a uniform collection method, the dataset reflects real-world conditions. Each essay is unique, contributed by a different writer, and addresses a specific topic. Furthermore, the constraints placed on the writers often lead to various handwriting scenarios, including hard-to-read words, connected words, noise, overwriting, and struck-through text.
Technical Details
The BRESSAY dataset is a collection of handwritten essays in Brazilian Portuguese that covers a range of handwriting scenarios. It contains 1,000 pages, each contributed by a unique writer, giving 1,000 distinct handwriting styles. In total there are 4,214 paragraphs, 30,090 lines, and 416,826 words, with 41,318 unique words and 107 unique characters.
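As a rough orientation, the totals above can be turned into per-page and per-line averages. The short sketch below only divides the quoted figures; the results are means and say nothing about how the counts are distributed across individual essays.

```python
# Totals quoted in the description above.
pages, paragraphs, lines, words = 1_000, 4_214, 30_090, 416_826

# Simple ratios of the quoted totals (means only).
averages = {
    "paragraphs_per_page": paragraphs / pages,
    "lines_per_page": lines / pages,
    "words_per_page": words / pages,
    "words_per_line": words / lines,
}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```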
Data Structure
The dataset is organized as follows:
data/: Main folder containing segmented essay images
    lines/: Images of individual lines
        PNG files: Line images
        TXT files: Transcriptions of lines
    pages/: Full page essay images
        PNG files: Page images
        TXT files: Transcriptions of pages
    paragraphs/: Images of paragraphs
        PNG files: Paragraph images
        TXT files: Transcriptions of paragraphs
    words/: Images of individual words
        PNG files: Word images
        TXT files: Transcriptions of words
sets/: Contains partition files
    test.txt: Names of images in the test set
    validation.txt: Names of images in the validation set
    training.txt: Names of images in the training set
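A minimal sketch of how this layout could be read in Python. It assumes that each PNG image has a transcription TXT file with the same base name in the same folder, that extensions are lowercase, that the text files are UTF-8 encoded, and that data/ sits directly under the dataset root; the description suggests but does not state these points. The function name `load_samples` and the `root` argument are hypothetical.

```python
from pathlib import Path


def load_samples(root, level="lines"):
    """Pair each PNG under data/<level>/ with its same-named TXT transcription.

    Assumption: image and transcription share a base name (x.png / x.txt).
    `level` is one of the folders listed above: lines, pages, paragraphs, words.
    """
    folder = Path(root) / "data" / level
    samples = []
    # rglob is used because the folder may or may not contain subfolders.
    for image_path in sorted(folder.rglob("*.png")):
        text_path = image_path.with_suffix(".txt")
        if not text_path.exists():
            continue  # skip images that have no transcription next to them
        text = text_path.read_text(encoding="utf-8").strip()
        samples.append((image_path, text))
    return samples
```

Calling `load_samples("path/to/dataset", level="words")` would return a list of `(image_path, transcription)` pairs for the word level, leaving image decoding to whichever imaging library the reader prefers.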
Dataset Usage and Annotations
Each name in test.txt, validation.txt and training.txt is the name of a page. All content derived from that page (words, lines, paragraphs) must be placed in the same partition as the page.
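The page-level split can be loaded and sanity-checked with a few lines of Python. This is a sketch under stated assumptions: the partition files are taken to hold one page name per line, to be UTF-8 encoded, and to live in sets/ directly under the dataset root. The function names `read_partitions` and `check_disjoint` are hypothetical.

```python
from pathlib import Path


def read_partitions(root):
    """Read sets/{training,validation,test}.txt into sets of page names.

    Assumption: one page name per line; blank lines are ignored.
    """
    partitions = {}
    for name in ("training", "validation", "test"):
        path = Path(root) / "sets" / f"{name}.txt"
        lines = path.read_text(encoding="utf-8").splitlines()
        partitions[name] = {line.strip() for line in lines if line.strip()}
    return partitions


def check_disjoint(partitions):
    """Return a page -> partition mapping; raise ValueError on any overlap."""
    seen = {}
    for name, pages in partitions.items():
        for page in pages:
            if page in seen:
                raise ValueError(f"{page} is in both {seen[page]} and {name}")
            seen[page] = name
    return seen
```

The mapping returned by `check_disjoint` can then route every word, line, and paragraph image to the partition of its page. The rule that links an item's file name to its page name is not given in the description, so it is left out here.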
Annotations used in the dataset:
##@@???@@##: Superscript text that has become unidentifiable and unreadable.
$$@@???@@$$: Subscript text that has become unidentifiable and unreadable.
@@???@@: Text that cannot be read or identified due to its illegibility.
##--xxx--##: Text that has been added as a superscript and subsequently crossed out, rendering it illegible.
$$--xxx--$$: Tex...
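When transcriptions are used as recognition targets, the markers listed above usually need some handling. The sketch below treats the five markers as literal strings exactly as they are written in the list (an assumption) and replaces each with a caller-chosen placeholder. Whether to drop the markers, keep them, or map them to a special token depends on the task; the description does not prescribe one. The names `MARKERS` and `replace_markers` are hypothetical.

```python
import re

# Literal markers copied from the annotation list above.
MARKERS = [
    "##@@???@@##",  # unreadable superscript text
    "$$@@???@@$$",  # unreadable subscript text
    "##--xxx--##",  # crossed-out, illegible superscript text
    "$$--xxx--$$",  # listed above; its meaning is not shown in full
    "@@???@@",      # unreadable text
]

# re.escape is required because '$' and '?' are regex metacharacters.
# The regex scans left to right, so a wrapped form such as ##@@???@@##
# is consumed whole before the bare @@???@@ inside it can match.
MARKER_RE = re.compile("|".join(re.escape(m) for m in MARKERS))


def replace_markers(text, placeholder=""):
    """Replace every listed marker with `placeholder`, then collapse spaces."""
    # A function is passed to sub() so the placeholder is inserted literally.
    replaced = MARKER_RE.sub(lambda _match: placeholder, text)
    return re.sub(r"\s+", " ", replaced).strip()
```

For example, `replace_markers("a ##@@???@@## b")` gives `"a b"`, while passing `placeholder="<unk>"` keeps a visible token wherever a marker stood.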
DOI: 10.5281/zenodo.11637680