Method of identifying data type and locating in a file
Main authors: | , |
---|---|
Format: | Patent |
Language: | eng |
Subjects: | |
Summary: | A method of identifying the types of data contained in an electronic file of unknown data type by gathering exemplary files of each data type of interest; counting the number of unique n-grams within each exemplary file; determining a weight for each unique n-gram; listing, for each data type of interest, the unique n-grams in the exemplary files of that data type by descending magnitude of weight; selecting the top m weighted n-grams and their associated weights; establishing a threshold for each data type of interest; selecting a length of data from the electronic file; listing every n-gram in the selected data; assigning each listed n-gram that is also among the selected top m n-grams the weight it was given for each data type of interest; summing the assigned weights according to data type; comparing the sums to the established thresholds to determine the types, if any, of the selected data; recording the location of the selected data if it is of a data type of interest; stopping if the number of selected lengths of data has reached a user-definable number, and otherwise selecting another length of data from the file that is the same length as the one selected previously, where the newly selected data overlaps the previously selected data by at least one position; and repeating the steps from listing every n-gram to stopping, using the newly selected data. |
---|
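The steps in the summary can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation. The summary does not say how a weight is computed or how a sum is compared to its threshold, so the sketch assumes that a weight is the n-gram's relative frequency across the exemplary files of a type and that a window matches a type when its sum is at least the threshold. All names (`ngrams`, `build_models`, `scan`, `step`, `max_windows`) are hypothetical.

```python
from collections import Counter


def ngrams(data, n):
    """Return every n-gram (as a bytes slice) in data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_models(examples, n, m):
    """Map each data type to its top-m n-grams and their weights.

    examples: dict of data type name -> list of exemplary files (bytes).
    ASSUMPTION: weight = count of the n-gram / total n-grams for that type.
    """
    models = {}
    for dtype, files in examples.items():
        counts = Counter()
        for content in files:
            counts.update(ngrams(content, n))
        total = sum(counts.values())
        # most_common(m) orders by descending count, which is also
        # descending weight because every weight shares the same divisor.
        models[dtype] = {g: c / total for g, c in counts.most_common(m)}
    return models


def scan(data, models, thresholds, n, window, step, max_windows):
    """Slide a fixed-length window over data and record typed locations.

    Returns a list of (offset, data type, score) tuples.
    ASSUMPTION: a window is of a given type when score >= threshold.
    """
    # Consecutive windows must overlap by at least one position.
    if not 1 <= step < window:
        raise ValueError("step must satisfy 1 <= step < window")
    hits = []
    last_start = max(len(data) - window, 0)
    for k, start in enumerate(range(0, last_start + 1, step)):
        if k >= max_windows:  # user-definable stopping count
            break
        grams = ngrams(data[start:start + window], n)
        for dtype, weights in models.items():
            # n-grams outside the selected top m contribute nothing.
            score = sum(weights.get(g, 0.0) for g in grams)
            if score >= thresholds[dtype]:
                hits.append((start, dtype, score))
    return hits
```

Because each window is scored against every data type independently, a single window can be recorded under several types or under none, which matches the "types, if any" wording of the summary.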