Algorithms for the recognition of poor quality documents

Raza, G

Raza, G, 1998. Algorithms for the recognition of poor quality documents. PhD, Nottingham Trent University.

10183032.pdf - Published version

Official URL: http://gateway.proquest.com/openurl?url_ver=Z39.88...

Abstract

Optical Character Recognition (OCR) has engaged many researchers in developing algorithms and systems that translate human-readable characters into machine-readable codes accurately and at high speed. Extensive prior work reports successful recognition of good quality documents such as printed text, but results remain unsatisfactory for poor quality documents such as low quality prints, photocopies, screen images, scanned old documents and facsimile messages.

Considering the limitations observed in past work, the present research investigates a recognizer that can satisfactorily recognize poor quality documents. The work to date includes methods for finding text lines, extracting objects, finding word gaps and finding words. It also includes methods for extracting a set of independent features. The features extracted during the current research are: top side open, bottom side open, left side open, right side open, holes, top left corner open, top right corner open, bottom left corner open, bottom right corner open, vertical bars, horizontal bars, centre of gravity, dots of 'i' and 'j', and zones. These features are expected to remain the same across characters of different fonts and sizes and to be tolerant to noise, and hence can be used for the recognition of poor quality documents. Each feature carries information such as its position in the object, its length and its width.
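
To make two of the listed features concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes a character object is available as a small binary grid ('#' for ink, '.' for background), and it assumes that a "hole" can be approximated as a 4-connected background region that does not touch the border of the object's bounding box. The function names `parse_glyph`, `centre_of_gravity` and `count_holes` are hypothetical.

```python
from collections import deque


def parse_glyph(rows):
    """Convert strings of '#' (ink) and '.' (background) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def centre_of_gravity(grid):
    """Return the mean (row, column) of the ink pixels, or None if there is no ink."""
    points = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v]
    if not points:
        return None
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def count_holes(grid):
    """Count background regions that never reach the border (assumed hole definition)."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    holes = 0
    for r0 in range(height):
        for c0 in range(width):
            if grid[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one background region with 4-connectivity.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            touches_border = False
            while queue:
                r, c = queue.popleft()
                if r in (0, height - 1) or c in (0, width - 1):
                    touches_border = True
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and not grid[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if not touches_border:
                holes += 1
    return holes


# Illustrative glyphs drawn by hand; they are not data from the thesis.
LETTER_O = parse_glyph([".###.", "#...#", "#...#", "#...#", ".###."])
LETTER_B = parse_glyph(["###.", "#..#", "###.", "#..#", "###."])
LETTER_C = parse_glyph([".###", "#...", "#...", "#...", ".###"])
```

With these hand-drawn glyphs, `count_holes` gives 1 for the 'O', 2 for the 'B' and 0 for the 'C', and `centre_of_gravity(LETTER_O)` is `(2.0, 2.0)`. The hole count does not depend on the exact stroke shape, which is the kind of font and size independence the abstract describes.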

A method for the automatic creation of a database of both single and touching letters of any font and point size has been developed. Two methods (undercut and adding noise) have been developed for joining different letter combinations artificially and hence obtaining touching objects. A word recognition algorithm for poor quality documents, based on object identification and dictionary lookup, is described. The recognizer has two steps: finding object alternatives, and making words using the alternatives and a dictionary. The recognizer has been tested on fifty different facsimile messages containing 6029 machine printed words of different fonts, sizes and varied print qualities. The same data was also processed with a commercial OCR package to provide a comparison between the recognizer and the commercial software. Overall recognition rates of 61.8% and 55.5% were obtained for all facsimile messages using the recognizer and the commercial software respectively, an improvement of 6.3 percentage points for the developed recognizer. The recognizer has also been tested on fifteen artificially created sample documents of different fonts, where it gave an overall improvement of 10.3% compared with the commercial software. The results obtained confirm the effectiveness of the developed recognizer compared with the commercial system.
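
The two-step recognizer described above can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the algorithm from the thesis: it assumes each object already has a list of alternative readings (a single letter, or several letters for a touching object) and that the dictionary is a plain set of words. The function name `candidate_words` and the sample data are hypothetical.

```python
from itertools import product


def candidate_words(alternatives, dictionary):
    """Return dictionary words formed by picking one alternative per object.

    alternatives: one list of candidate strings per object, in reading order.
    dictionary:   a set of valid words.
    Words are returned in the order they are first generated, without duplicates.
    """
    found = []
    for combo in product(*alternatives):
        word = "".join(combo)
        if word in dictionary and word not in found:
            found.append(word)
    return found


# Hypothetical example: the first and last objects are ambiguous between the
# single letter 'm' and the touching pair 'rn'.
alternatives = [["m", "rn"], ["o", "c"], ["d"], ["e", "c"], ["rn", "m"]]
dictionary = {"modern", "modem", "morn"}
words = candidate_words(alternatives, dictionary)
print(words)  # ['modern', 'modem']
```

Exhaustive enumeration grows exponentially with the number of ambiguous objects, so a practical system would probably prune combinations early, for example by walking a prefix tree of the dictionary. The sketch only shows how dictionary lookup narrows the object alternatives to valid words.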

Further work is proposed to improve the efficiency of the developed recognizer across a wide range of poor quality documents. It involves improving the existing methods for line and word extraction and for feature extraction, and extracting additional new features. Future work also considers methods for dealing with touching objects, document and context layout analysis, integration of a postulation algorithm into the developed recognizer, and postprocessing.

Item Type:	Thesis
Creators:	Raza, G.
Date:	1998
ISBN:	9781369313246
Identifiers:	PQ10183032 (type: Other)
Divisions:	Schools > School of Science and Technology
Record created by:	Linda Sullivan
Date Added:	28 Aug 2020 12:59
Last Modified:	21 Jun 2023 10:46
URI:	https://irep.ntu.ac.uk/id/eprint/40583