Concept exploration and discovery from business documents for software engineering projets using dual mode filtering
This thesis present a framework for the discovery, extraction and relevance-oriented ordering of conceptual knowledge based on their potential of reuse within a software project. The goal is to support software engineering experts in the first knowledge acquisition phase of a development project by...
Gespeichert in:
1. Verfasser: | |
---|---|
Format: | Dissertation |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This thesis present a framework for the discovery, extraction and relevance-oriented ordering of conceptual knowledge based on their potential of reuse within a software project. The goal is to support software engineering experts in the first knowledge acquisition phase of a development project by extracting relevant concepts from the textual documents of the client’s organization. Such a time-consuming task is usually done manually which is prone to fatigue, errors, and omissions. The business documents are considered unstructured and are less formal and straightforward than software requirements specifications created by an expert. In addition, our research is done on documents written in French, for which text analysis tools are less accessible or advanced than those written in English. As a result, the presented system integrates accessible tools in a processing pipeline with the goal of increasing the quality of the extracted list of concepts.
Our first contribution is the definition of a high-level process used to extract domain concepts which can help the rapid discovery of knowledge by software experts. To avoid undesirable noise from high level linguistic tools, the process is mainly composed of positive and negative base filters which are less error prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We introduce a new metric to assess the performance discovery speed of relevant concepts. We also present a method to help obtain a gold standard definition of software engineering oriented concepts for knowledge extraction tasks.
Our second contribution is a statistical method to extract large and complex multiword expressions which are found in business documents. These concepts, which can sometimes be exemplified as named entities or standard expressions, are essential to the full comprehension of business corpora but are seldom extracted by existing methods because of their form, the sparseness of occurrences and the fact that they are usually excluded by the candidate generation step. Current extraction methods usually do not target these types of expressions and perform poorly on their length range. This article describes a hybrid method based on the local maxima technique with added lin |
---|