A hybrid NLP & semantic knowledgebase approach for the intelligent exploration of Arabic documents

Khalil, H., 2017. A hybrid NLP & semantic knowledgebase approach for the intelligent exploration of Arabic documents. PhD, Nottingham Trent University.

Hussein Khalil 2018.pdf - Published version


Abstract

In the contemporary era, a colossal amount of information is published daily on the Web in the form of articles, documents, reviews, blogs and social media posts. Because most of this data is available only as unstructured documents, extracting non-trivial, previously unknown, and potentially useful knowledge from it is challenging and time-consuming. Hence, extracting useful knowledge from unstructured text, i.e., Information Extraction, is becoming an increasingly significant aspect of knowledge discovery.
This work focuses on Information Extraction from unstructured Arabic text, which is an especially challenging task because Arabic is a highly inflectional and derivational language. The problem is compounded by the lack of mature tools and advanced research in Arabic Natural Language Processing (NLP) compared with, for instance, European languages.
The principal objective of this research is to present a comprehensive methodology for integrating domain knowledge with Natural Language Processing techniques that have proven effective in solving most classification problems, in order to improve the Information Extraction process from online unstructured data. NLP tools are important because they play a key role in enabling semantic concept tagging of unstructured text, and so help realize the Semantic Web. This work presents a novel rule-based approach that uses linguistic grammar-based techniques to extract Arabic composite names from Arabic text. Our approach uniquely exploits the Arabic genitive grammar rules; in particular, the rules regarding the identification of definite nouns (معرفة) and indefinite nouns (نكرة) to support the process of extracting composite names. Furthermore, this approach does not place any constraints on the length of the Arabic composite name. The results of our experiments show an improvement in recognizing Arabic composite name entities in Arabic text.
Our research also contributes a novel, knowledge-based approach to relation extraction from unstructured Arabic text, which is based on the principles of Functional Discourse Grammar (FDG). We further improve the approach by integrating it with Machine Learning relation classification, resulting in a hybrid relation extraction algorithm that can handle especially complex Arabic sentence structures. Our relation classification was evaluated extensively through experiments, which evidenced the accuracy of the FDG relation extraction approach and the improvement gained by the Machine Learning integration.
The essential NLP algorithms of entity recognition and relation extraction were deployed in a Semantic Knowledge-base that was built from the outset to model the knowledge of the problem domain. The semantic modelling of the Knowledge-base helped improve the accuracy of the NLP algorithms by leveraging relevant domain knowledge published in Open Linked Datasets. Moreover, the extracted information was semantically tagged and inserted into the Semantic Knowledge-base, which made it possible to build advanced rules that infer new, interesting information from the extracted knowledge, and to use advanced query mechanisms for intelligently exploring the mined problem domain knowledge.

Item Type: Thesis
Creators: Khalil, H.
Date: August 2017
Rights: This work is the intellectual property of the author, and may be owned by Nottingham Trent University. You may copy up to 5 percent of this work for private or personal study and non-commercial research. Any reuse of the information contained within this document should be fully referenced, citing author, title, university, degree level and pagination. Queries and requests for commercial use, or the use of substantial copy should be referred to the author at first instance.
Divisions: Schools > School of Science and Technology
Record created by: Linda Sullivan
Date Added: 13 Mar 2018 14:03
Last Modified: 13 Mar 2018 14:03
URI: https://irep.ntu.ac.uk/id/eprint/32938
