Thematic knowledge extraction

Zhang, S., 2003. Thematic knowledge extraction. PhD, Nottingham Trent University.

10183412.pdf - Published version

Abstract

This thesis describes research into automatic knowledge extraction (AKE) from text. In particular, the aim is to produce, or help to produce, knowledge in the format of an existing hypermedia knowledge-based system, HyperTutor. The AKE process is divided into two main stages: concept extraction and relation extraction.

Automatic thematic concept (keyword) extraction (TCE) is described in detail. Two approaches to TCE are presented and evaluated. One of them is a machine learning approach based on artificial neural networks (ANNs). New measures for evaluating this novel approach are introduced, based on the concept of generalisation. These include natural generalisation (NG) and pure generalisation (PG). Measures commonly used in knowledge extraction research, i.e. recall and precision, are applied in their normal binary form, but analogue versions are developed to assess the performance of the ANN-based approach. A comparison with chance (CWC) measure is also applied to the results. A stemming analysis method has also been attempted at sense level and word level. The results show that thematic concepts can be automatically extracted from text using an ANN plus a lexical semantic resource. The ANN alone produces the best results for non-keywords and overall. Word-level stemming analysis alone is the best for identifying keywords, while sense-level analysis provides the most balanced results between keywords and non-keywords. The baseline comparison for the ANN method shows that the ANN method adds value to the external lexicon. The CWC measure shows that both the ANN and stemming methods work much better than chance.
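As an illustration of the binary form of recall and precision mentioned above, the following minimal Python sketch scores a set of extracted keywords against a reference set. The function name `precision_recall` and the example keyword sets are hypothetical and are not taken from the thesis; the analogue versions and the NG, PG and CWC measures are not defined in this abstract and are not reproduced here.

```python
def precision_recall(extracted, gold):
    """Binary precision and recall for a set of extracted keywords.

    precision = correct extractions / all extractions
    recall    = correct extractions / all reference keywords
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not taken from the thesis.
gold_keywords = {"neuron", "synapse", "axon", "dendrite"}
extracted_keywords = {"neuron", "axon", "cell"}

precision, recall = precision_recall(extracted_keywords, gold_keywords)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted words are reference keywords, and two of the four reference keywords are found.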

Domain portability of the keyword extraction techniques developed is addressed. Although the ANN itself is not transferable, the ANN method is transferable with consistent performance between domains. The stemming analysis approach also transfers well between domains, although not as well as the ANN method.

An attempt to answer the question of how the ANN learns the problem of thematic concept extraction is also presented, based on analysis of the weights in the trained ANN. Analysis is carried out on three different aspects: relation-level analysis tries to find out whether some kinds of relations are more important than others, or whether they are equally important to the ANN; path-level analysis aims to identify what kinds of paths are more likely to lead to a noun being classified as a keyword; and analysis of category information helps to explain how category affects keyword recognition. This analysis has confirmed the hypothesis of close relationships that are distinct and characteristic of the seed word-keyword (KW) relationship.

The relation extraction work targets an important type of relation, the verb-noun relation, which is a novel target. Parse-tree-based and tagger-based approaches have been investigated. The parse-tree-based method produces high precision and low recall. The main reason for the low recall is parser failure, which reflects a current limitation: parsing techniques are not yet advanced enough to process real-world text. The tagger-based approach produces high recall and low precision. The style of writing may have a great impact on the results of relation extraction: the experiments have shown that both approaches perform better on one of the documents than on the other.
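To give a rough idea of what a tagger-based approach to verb-noun relations can look like, the sketch below pairs each verb in a part-of-speech-tagged sentence with the first noun that follows it within a small window. This is an assumption-laden illustration, not the method used in the thesis: the function `verb_noun_pairs`, the `window` parameter, the use of Penn Treebank style tags (`VB*`, `NN*`) and the example sentence are all hypothetical.

```python
def verb_noun_pairs(tagged_tokens, window=4):
    """Pair each verb with the first noun that follows it within `window` tokens.

    `tagged_tokens` is a list of (word, tag) tuples with Penn Treebank style
    tags (assumed here): verbs start with "VB", nouns start with "NN".
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if not tag.startswith("VB"):
            continue
        for next_word, next_tag in tagged_tokens[i + 1 : i + 1 + window]:
            if next_tag.startswith("VB"):
                break  # another verb begins; stop searching for this verb
            if next_tag.startswith("NN"):
                pairs.append((word.lower(), next_word.lower()))
                break
    return pairs


# Hypothetical pre-tagged sentence, not taken from the thesis.
tagged = [
    ("The", "DT"), ("kidney", "NN"), ("filters", "VBZ"), ("blood", "NN"),
    ("and", "CC"), ("removes", "VBZ"), ("the", "DT"), ("waste", "NN"),
]

pairs = verb_noun_pairs(tagged)
print(pairs)
```

A purely positional rule like this accepts any nearby noun as related to the verb, which is consistent with the high-recall, low-precision behaviour reported above for the tagger-based approach.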

This thesis also proposes issues for future work in AKE research.

Item Type: Thesis
Creators: Zhang, S.
Date: 2003
ISBN: 9781369316179
Identifiers: PQ10183412 (Type: Other)
Divisions: Schools > School of Science and Technology
Record created by: Linda Sullivan
Date Added: 25 Sep 2020 13:38
Last Modified: 23 Aug 2023 13:07
URI: https://irep.ntu.ac.uk/id/eprint/40942
