Corpus-based connectionist parsing.

Tepper, J.A. ORCID: 0000-0001-7339-0132, 2000. Corpus-based connectionist parsing. PhD, Nottingham Trent University.

Abstract

The syntactic analysis or parsing of realistic subsets of natural language is the greatest obstacle to achieving practical natural language processing (NLP) systems. Classical symbolic models of syntactic parsing are typically constrained by large sets of grammar rules that attempt to capture numerous linguistic exceptions and generalisations that are prevalent in the everyday use of language. However, the grammar rules are constantly open to amendment and are unable to encode the necessary level of abstraction required to gain a reasonable coverage of the language. The greater the coverage of language, the more difficult it is to implement the rules in a parser due to the increased complexity. A further computational inadequacy of current symbolic parsing systems is that processing is serial in the majority of implementations and therefore cannot simultaneously make use of multiple sources of constraints and information.

This dissertation explores the use of connectionist networks for parsing realistic natural language domains. Connectionist natural language processing (CNLP) is a relatively new approach to language processing compared with classical symbolic methods. A decade of connectionist research has provided new methods of representation and approaches to parsing. Connectionist networks' inherently distributed knowledge representation and their parallel processing behaviour have enabled linguistic information to be represented as soft-rules or constraints rather than by 'hard-wired' symbolic rules. Connectionist networks therefore possess numerous attributes that are well-suited to language modelling.

The parsing model proposed in this dissertation integrates both connectionist and symbolic techniques to formulate a hybrid data-orientated parsing system that is trainable and able to acquire linguistic knowledge directly from preparsed sentence examples extracted from a large parsed corpus. The connectionist modules of the system enable the automatic learning of linguistic structure and provide an inherently distributed representation of linguistic information that exhibits tolerance to unfamiliar input data and that is able to generalise from previous sentence examples. The Lancaster Parsed Corpus (LPC) is used as the source of the training and test data. Three connectionist architectures are used. A Temporal Auto-Associative Simple Recurrent Network (TASRN) is required to discover the beginning of a phrase; another TASRN is required to discover the end of a phrase; and a feed-forward Multi-Layer Perceptron (MLP) network is required to recognise the phrase that has been extracted by the TASRN networks. This method of phrase segmenting and recognition provides a powerful technique for processing arbitrarily long and complex sentences. The symbolic components allow information to be stored in an easily interpretable and manipulable manner and provide the basis for organising the parse. The connectionist and symbolic components interact to form a deterministic shift-reduce parser that parses sentences from right-to-left. A modification of the Back-propagation learning algorithm that enables MLP networks to dynamically focus on training patterns that have high errors is also presented. As learning takes place, the learning rate coefficient is adjusted in response to each individual pattern error. The results obtained from experiments with artificial and natural language domains are encouraging. The modification improved training times in most instances and in some cases allowed the removal of pattern replication used to balance the training data.
Also, the evaluation method used to test the performance of the connectionist networks is based upon the natural and pure generalisation levels produced by the networks in response to unique linguistic input.
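
The idea of adjusting the learning rate in response to each individual pattern error can be illustrated with a minimal sketch. The abstract does not state the actual update rule, so everything below is an assumption made for illustration: the scaling rule (pattern error divided by the mean pattern error, clipped to a fixed range), the function names, and the use of a single sigmoid unit in place of a full MLP.

```python
import math
import random


def pattern_learning_rate(base_rate, pattern_error, mean_error, max_scale=4.0):
    """Hypothetical rule: scale the base rate by pattern_error / mean_error.

    The scale is clipped to [1 / max_scale, max_scale]. This is NOT the
    formula from the thesis, which the abstract does not give.
    """
    if mean_error <= 0.0:
        return base_rate
    scale = pattern_error / mean_error
    return base_rate * min(max(scale, 1.0 / max_scale), max_scale)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, inputs):
    """Output of a single sigmoid unit; the last weight is the bias."""
    x = list(inputs) + [1.0]
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_unit(patterns, base_rate=0.5, epochs=2000, seed=0):
    """Online gradient descent on squared error with a per-pattern rate."""
    rng = random.Random(seed)
    n_inputs = len(patterns[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    errors = [1.0] * len(patterns)  # most recent squared error per pattern
    for _ in range(epochs):
        for i, (inputs, target) in enumerate(patterns):
            x = list(inputs) + [1.0]
            out = predict(weights, inputs)
            errors[i] = (target - out) ** 2
            mean_error = sum(errors) / len(errors)
            # Patterns with above-average error get a larger step.
            rate = pattern_learning_rate(base_rate, errors[i], mean_error)
            delta = (target - out) * out * (1.0 - out)
            weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
    return weights
```

Under this assumed rule, a pattern whose error is twice the current mean is trained with twice the base rate, which is one way a network could concentrate on hard or under-represented patterns without replicating them in the training set.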

In contrast with previous approaches to syntactic parsing with connectionist networks, the corpus-based model proposed is able to process large and varied samples of naturally occurring English text and sentences that are of arbitrary length and complexity. The system exhibits high levels of syntactic generalisation at both the module level and the sentence level, which provides the system with a realistic coverage of the language, a feature lacking in previous hybrid parsers. Crucially, the model is adaptable to the grammatical framework of the training corpus used and is not predisposed to a particular grammatical formalism, thus widening the scope and reusability of the parser.

Item Type: Thesis
Creators: Tepper, J.A.
Date: 2000
ISBN: 9781369313123
Identifiers: PQ10183020 (Other)
Rights: © Copyright by Jonathan Andrew Tepper 2000.
Divisions: Schools > School of Science and Technology
Record created by: Jeremy Silvester
Date Added: 02 Sep 2020 11:17
Last Modified: 15 Jun 2023 10:22
URI: https://irep.ntu.ac.uk/id/eprint/40613
