TOPIC EXTRACTION USING CLAUSE SEGMENTATION AND HIGH-FREQUENCY WORDS

The disclosed embodiments provide a system for processing data. During operation, the system obtains a set of clauses in a first set of content items comprising unstructured data. Next, the system obtains a set of stop words comprising high-frequency words that occur in a second set of content items...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Markman Vita G, Martell Craig H, Finger Lutz T, Zhang Yongzheng
Format:	Patent
Sprache:	eng
Schlagworte:	CALCULATING COMPUTING COUNTING ELECTRIC DIGITAL DATA PROCESSING PHYSICS
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	The disclosed embodiments provide a system for processing data. During operation, the system obtains a set of clauses in a first set of content items comprising unstructured data. Next, the system obtains a set of stop words comprising high-frequency words that occur in a second set of content items. The system then automatically extracts a set of topics from the set of clauses by generating a set of n-grams from the set of clauses and excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram. Finally, the system displays the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items.