PatSeg: A Sequential Patent Segmentation Approach
Published in: | Big Data Research 2020-03, Vol.19-20, p.100133, Article 100133 |
---|---|
Main authors: | , , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Patents are an important source of information in industry and academia. However, quickly grasping the essence of a given patent is difficult because patents are typically very long and written in a rather inaccessible style. The essential information, especially the invention itself and its experimental part, is usually contained in the description section. However, in many patents the parts of the description are neither annotated nor easily detectable. Here, we describe our novel PatSeg method for patent segmentation, which aims to identify the most important parts of a patent automatically and directly. PatSeg uses a two-step approach: a patent is first segmented into text blocks in an unsupervised fashion, and each identified segment is then classified in a supervised step. In contrast to previous work, PatSeg uses semantic word embeddings in both phases and applies a sequential learning algorithm in the second step. Compared with a baseline, these modifications yield an average improvement of 9.47% in F1-score (8.78% in precision, 9.00% in recall) and 7.29 in accuracy, as evaluated on two novel, manually segmented gold-standard patent corpora. The method is also fast and easily parallelizable, making it applicable to truly large patent collections. |
---|---|
ISSN: | 2214-5796 2214-580X |
DOI: | 10.1016/j.bdr.2020.100133 |
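
The first step named in the abstract, unsupervised segmentation of a patent into text blocks using word embeddings, can be illustrated with a minimal sketch. This is not the PatSeg implementation: the record does not specify the algorithm, so the boundary rule (a cosine-similarity drop between adjacent paragraphs), the `threshold` value, the function names, and the tiny `TOY_VECTORS` table are all assumptions made for illustration. The second, supervised sequential classification step is not shown.

```python
import math

# Hypothetical 2-d word vectors, for illustration only. A real system would
# load pretrained embeddings instead of this hand-written table.
TOY_VECTORS = {
    "invention": (1.0, 0.1),
    "device": (0.9, 0.2),
    "apparatus": (0.95, 0.15),
    "experiment": (0.1, 1.0),
    "sample": (0.2, 0.9),
    "measured": (0.15, 0.95),
}


def embed(paragraph):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [TOY_VECTORS[w] for w in paragraph.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def segment(paragraphs, threshold=0.8):
    """Group consecutive paragraphs into blocks.

    A new block starts whenever the similarity between a paragraph and its
    predecessor falls below `threshold` (an assumed, untuned value).
    """
    if not paragraphs:
        return []
    blocks = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            blocks.append([cur])
        else:
            blocks[-1].append(cur)
    return blocks


paragraphs = [
    "The invention is a device",
    "The apparatus device",
    "Each sample was measured",
    "The experiment sample",
]
blocks = segment(paragraphs)
for i, block in enumerate(blocks):
    print(i, block)
```

In this toy input the first two paragraphs share "invention"-like vocabulary and the last two share "experiment"-like vocabulary, so the similarity drops sharply between the second and third paragraph. Because each adjacent pair is scored independently, the comparisons could be distributed across workers, which is in the spirit of the parallelizability the abstract mentions.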