A comparative study of automated legal text classification using random forests and deep learning

Bibliographic details
Published in:	Information processing & management 2022-03, Vol.59 (2), p.102798, Article 102798
Main authors:	Chen, Haihua, Wu, Lei, Chen, Jiangping, Lu, Wei, Ding, Junhua
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial neural networks; Automation; Classification; Comparative studies; Deep learning; Documents; Domain concept; Domains; Information processing; Legal information; Legal text classification; Machine learning; Neural networks; Random forests; Text categorization; Texts; Word embedding
Online access:	Full text

Description
Abstract:	Automated legal text classification is a prominent research topic in the legal field. It lays the foundation for building an intelligent legal system. Current literature focuses on international legal texts, such as Chinese cases, European cases, and Australian cases. Little attention is paid to text classification for U.S. legal texts. Deep learning has been applied to improve text classification performance, but its effectiveness needs further exploration in domains such as the legal field. This paper investigates legal text classification with a large collection of labeled U.S. case documents by comparing the effectiveness of different text classification techniques. We propose a machine learning algorithm using domain concepts as features and random forests as the classifier. Our experimental results on 30,000 full U.S. case documents in 50 categories demonstrate that our approach significantly outperforms a deep learning system built on multiple pre-trained word embeddings and deep neural networks. In addition, applying only the top 400 domain concepts as features for building the random forests could achieve the best performance. This study provides a reference for selecting machine learning techniques when building high-performance text classification systems in the legal domain or other fields.
•We apply domain concepts to legal text classification based on PCA and RFs, demonstrating the strength of this approach for legal text.
•We conduct a systematic comparative study on a legal area classification dataset, using domain concept-based machine learning algorithms and pre-trained word embedding-based deep learning algorithms.
•We propose a framework that includes a strategy for selecting machine learning models according to four indicators: data, performance, computation, and interpretation.
ISSN:	0306-4573 1873-5371
DOI:	10.1016/j.ipm.2021.102798
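
The feature-selection idea mentioned in the abstract (keeping only the top-ranked domain concepts as features) can be illustrated with a minimal sketch. This record does not describe how the authors extract or rank concepts, so everything below is an assumption made for illustration: the concept list, the toy documents, the ranking by document frequency, and the function names `concept_counts`, `top_k_concepts`, and `vectorize` are all hypothetical, and matching is naive substring counting.

```python
from collections import Counter


def concept_counts(text, concepts):
    """Count occurrences of each concept phrase in a document (naive substring match)."""
    lowered = text.lower()
    return Counter({c: lowered.count(c) for c in concepts if c in lowered})


def top_k_concepts(docs, concepts, k):
    """Keep the k concepts that appear in the most documents.

    Ranking by document frequency is an assumed criterion, not the paper's.
    """
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(concept_counts(text, concepts).keys())
    # Descending document frequency, ties broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda c: (-doc_freq[c], c))
    return ranked[:k]


def vectorize(text, selected):
    """Turn a document into a count vector over the selected concepts."""
    counts = concept_counts(text, selected)
    return [counts[c] for c in selected]


# Hypothetical toy data; the study itself used 30,000 U.S. case documents.
CONCEPTS = ["breach of contract", "negligence", "patent", "damages"]
DOCS = [
    "The plaintiff alleges breach of contract and seeks damages.",
    "Negligence was established; damages were awarded.",
    "The patent claim was found invalid.",
]

selected = top_k_concepts(DOCS, CONCEPTS, k=2)
features = [vectorize(d, selected) for d in DOCS]
print(selected)
print(features)
```

The resulting count vectors could then be passed, together with category labels, to any random forest implementation, for example scikit-learn's `RandomForestClassifier`; that step is omitted here to keep the sketch free of third-party dependencies.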