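Introducing scoring into author identification using text mining: effects of the number of characters, the number of texts, and writing-style features on score distributions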
Published in: | Nihon Hokagaku Gijutsu Gakkai Shi = Japanese Journal of Forensic Science and Technology 2017, Vol.22(2), pp.91-108 |
---|---|
Main authors: | , |
Format: | Article |
Language: | jpn |
Subjects: | |
Online access: | Full text |
Abstract: | Author identification through text mining aims to judge whether the suspected author of a given text is the same as the author of the control texts. This study examined the validity of a scoring method for author identification. In one unit of analysis, we conducted 18 analyses (six writing-style features × three multivariate analyses) across one suspected text of a blogger, one control text of a blogger, and irrelevant texts of four bloggers. The writing-style features were (1) rate of usage of non-independent words, (2) part-of-speech bigrams, (3) bigrams of postpositional particles, (4) positioning of commas, (5) rate of usage of Kanji, Hiragana, etc., and (6) sentence length. The multivariate analyses were (1) principal component analysis, (2) correspondence analysis, and (3) multidimensional scaling. Scores were obtained from the arrangement of the texts on two dimensions. A score of 0 was given when the convex hull polygon (CHP) of the control texts overlapped that of the irrelevant texts. When the CHPs of the control and irrelevant texts did not overlap, a score of +2 was given if the suspected text fell inside the CHP of the control texts, +1 if it fell outside that CHP but near a control text, and −1 if it fell near an irrelevant text. We totaled the scores in one unit of analysis (18 results) and analyzed the total scores of the 240 units of analysis for 10 bloggers under the following design: 2 (author combination of suspected and control texts: same, different) × 4 (number of characters: 250, 500, 1000, 1500) × 3 (number of control and irrelevant texts: 3, 6, 9). The results indicated that the scoring method was able to identify the authors. The effect of the number of characters on the AUCs was statistically significant, whereas the effect of the number of texts was not. Furthermore, the rate of usage of non-independent words and the part-of-speech bigrams were particularly useful for identifying authors. |
---|---|
ISSN: | 1880-1323 1881-4689 |
DOI: | 10.3408/jafst.715 |
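
Note on the scoring rule: the per-analysis score described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`convex_hull`, `inside`, `hulls_overlap`, `score`) are hypothetical, and several details the abstract leaves open are assumptions here: "near" is read as the nearest text by Euclidean distance, a tie is resolved toward the irrelevant text, hulls that merely touch count as overlapping, and each group of texts is assumed to contain at least three non-collinear points.

```python
from math import dist


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return n >= 3 and all(
        cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n)
    )


def hulls_overlap(a, b):
    """Separating-axis test for two convex polygons (touching counts as overlap)."""
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            ax, ay = -(y2 - y1), x2 - x1  # normal of the current edge
            proj_a = [ax * x + ay * y for x, y in a]
            proj_b = [ax * x + ay * y for x, y in b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False  # found a separating axis
    return True


def score(suspected, control, irrelevant):
    """Score one analysis: 0, +2, +1, or -1, following the abstract's rule."""
    hull_c, hull_i = convex_hull(control), convex_hull(irrelevant)
    if hulls_overlap(hull_c, hull_i):
        return 0
    if inside(hull_c, suspected):
        return 2
    d_control = min(dist(suspected, c) for c in control)
    d_irrelevant = min(dist(suspected, q) for q in irrelevant)
    # Assumption: a tie counts as "near an irrelevant text".
    return 1 if d_control < d_irrelevant else -1


control = [(0, 0), (2, 0), (1, 2)]
irrelevant = [(10, 10), (12, 10), (11, 12)]
```

With the made-up coordinates above, `score((1, 0.5), control, irrelevant)` gives +2 and `score((3, 0), control, irrelevant)` gives +1. Under this reading, the total for one unit of analysis would be the sum of `score(...)` over its 18 analyses.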