Curated-Processed-Reannotated Turkish e-commerce sentimet analysis dataset
The dataset was compiled from publicly available sources, including Hugging Face, GitHub, and Kaggle. To ensure data quality, we performed preprocessing steps such as deduplication, removal of non-Turkish entries, and exclusion of short reviews (fewer than three words). Python and the pandas library...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Dataset |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | The dataset was compiled from publicly available sources, including Hugging Face, GitHub, and Kaggle. To ensure data quality, we performed preprocessing steps such as deduplication, removal of non-Turkish entries, and exclusion of short reviews (fewer than three words). Python and the pandas library were used for data cleaning and formatting.
For sentiment labeling, we used ChatGPT4-o-mini in a zero-shot approach, batch-processing approximately 100 reviews per request. We chose zero-shot labeling after observing that providing additional instructions led to a decline in labeling accuracy with ChatGPT4-o-mini. The prompt instructed the model to classify each review’s sentiment as Positive, Negative, or Neutral, without any specific examples or prior information. The prompt format was:
• System Message: "You are a sentiment analysis assistant."
• User Message: "Please analyze the sentiment of the following review dictionary and return the result in the format 'id,label' where label should be one of these: Positive, Negative, or Neutral."
This zero-shot approach resulted in high consistency, which we validated by comparing the model's output with human annotations, observing a strong correlation in sentiment labeling accuracy. This ensured reliable labeling across the entire dataset. |
---|---|
DOI: | 10.17632/nvkcfnkh47.1 |