TOPIC EXTRACTION USING CLAUSE SEGMENTATION AND HIGH-FREQUENCY WORDS

Bibliographic Details
Main authors: Markman Vita G, Martell Craig H, Finger Lutz T, Zhang Yongzheng
Format: Patent
Language: English
Description
Summary: The disclosed embodiments provide a system for processing data. During operation, the system obtains a set of clauses in a first set of content items comprising unstructured data. Next, the system obtains a set of stop words comprising high-frequency words that occur in a second set of content items. The system then automatically extracts a set of topics from the set of clauses by generating a set of n-grams from the set of clauses and excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram. Finally, the system displays the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items.
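
The filtering step in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `high_frequency_words` and `extract_topics`, the `top_k` cutoff, whitespace tokenization, and the choice of the first and last positions as the "pre-specified" positions are all assumptions made for illustration. Clauses are assumed to arrive already segmented, since the summary does not say how segmentation is performed.

```python
from collections import Counter


def high_frequency_words(reference_docs, top_k):
    """Return the top_k most frequent words in a reference corpus.

    Assumption: the "second set of content items" is a list of strings,
    and a fixed top_k cutoff defines "high-frequency".
    """
    freq = Counter(w for doc in reference_docs for w in doc.lower().split())
    return {word for word, _ in freq.most_common(top_k)}


def extract_topics(clauses, stop_words, n=2, blocked_positions=(0, -1)):
    """Count n-grams per clause, skipping any n-gram that has a stop word
    at one of the blocked positions (assumed here: first and last)."""
    counts = Counter()
    for clause in clauses:
        tokens = clause.lower().split()
        # n-grams are built inside one clause and never span two clauses
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            if any(ngram[p] in stop_words for p in blocked_positions):
                continue
            counts[ngram] += 1
    return counts


# Hypothetical data, for illustration only.
reference_docs = ["the cat is on the mat", "the dog is in the house", "it is the best"]
clauses = ["the battery life is great", "battery life drains fast"]

stops = high_frequency_words(reference_docs, top_k=2)
topics = extract_topics(clauses, stops)
print(topics.most_common(3))
```

With these inputs the stop-word set is `{"the", "is"}`, so bigrams such as "the battery" and "is great" are dropped while "battery life" is counted in both clauses. Restricting the check to specific positions, rather than rejecting any n-gram that contains a stop word, would let longer n-grams keep a stop word in the middle (for example a trigram like "quality of service") while still discarding candidates that begin or end with one.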