Concept exploration and discovery from business documents for software engineering projects using dual mode filtering

Bibliographic details
Main author:	Ménard, Pierre André
Format:	Dissertation
Language:	English
Subjects:	acronym; complex multiword; concept; administrative documents; data mining (computer science); expression; extraction; text mining; expression identification; software engineering; computational linguistics; software development; domain model; information retrieval

Description
Summary:	This thesis presents a framework for the discovery, extraction, and relevance-oriented ordering of conceptual knowledge based on its potential for reuse within a software project. The goal is to support software engineering experts in the first knowledge acquisition phase of a development project by extracting relevant concepts from the textual documents of the client’s organization. This time-consuming task is usually done manually, which makes it prone to fatigue, errors, and omissions. The business documents are considered unstructured and are less formal and straightforward than software requirements specifications created by an expert. In addition, our research is done on documents written in French, for which text analysis tools are less accessible or advanced than those for English. As a result, the presented system integrates accessible tools in a processing pipeline with the goal of increasing the quality of the extracted list of concepts. Our first contribution is the definition of a high-level process used to extract domain concepts, which can help software experts discover knowledge rapidly. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters, which are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We introduce a new metric to assess the discovery speed of relevant concepts. We also present a method to help obtain a gold standard definition of software engineering oriented concepts for knowledge extraction tasks. Our second contribution is a statistical method to extract large and complex multiword expressions which are found in business documents.
These concepts, which can sometimes be exemplified as named entities or standard expressions, are essential to the full comprehension of business corpora but are seldom extracted by existing methods because of their form, the sparseness of occurrences and the fact that they are usually excluded by the candidate generation step. Current extraction methods usually do not target these types of expressions and perform poorly on their length range. This article describes a hybrid method based on the local maxima technique with added lin