Consistent Text Categorization using Data Augmentation in e-Commerce
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | The categorization of massive e-Commerce data is a crucial, well-studied
task, which is prevalent in industrial settings. In this work, we aim to
improve an existing product categorization model that is already in use by a
major web company, serving multiple applications. At its core, the product
categorization model is a text classification model that takes a product title
as an input and outputs the most suitable category out of thousands of
available candidates. Upon closer inspection, we found inconsistencies in the
labeling of similar items. For example, minor modifications of the product
title pertaining to colors or measurements substantially changed the model's output.
This phenomenon can negatively affect downstream recommendation or search
applications, leading to a sub-optimal user experience.
To address this issue, we propose a new framework for consistent text
categorization. Our goal is to improve the model's consistency while
maintaining its production-level performance. We use a semi-supervised approach
for data augmentation and present two different methods for utilizing
unlabeled samples. One method relies directly on existing catalogs, while the
other uses a generative model. We compare the pros and cons of each approach
and present our experimental results. |
---|---|
DOI: | 10.48550/arxiv.2305.05402 |