A foray into optical character recognition software.
Abstract:
Optical character recognition is the reading of characters using computers. Many computer programs exist for recognition of printed text, but few exist for the recognition of hand written text. Accordantly, the purpose of this project is optical character recognition of hand written text. In order to accomplish this, a five step process will be used. First, lines of text are separated from each other using a basic algorithm. Second, characters in each line are separated from each other using a principle based on seam carving. After individual characters are located recognition becomes a two part process. First each character is typed, and the written character is added to its corresponding letter in a database. After a database is constructed, characters to be recognized are compared to characters in the database and are assigned the closest character match. Lastly a spell check is run on the output of the recognized characters. Optical character recognition is not perfect, but will improve over time with cheaper computing power and improved algorithms. By improving optical character recognition of hand written text, it is possible to scan and convert to text handwritten notes and documents.
Discussion:
Optical character recognition is the conversion of printed or handwritten text into a machine readable format commonly involving digitalization. The basic format is simple: find and convert characters but the implementation is not. Processes are needed to separate lines, characters, and words, and then to convert these parts into a machine readable format. Spell check is then generally used to improve results, thus making the recognition of words more accurate than the recognition of characters. Separation of lines is needed to keep the recognized characters in the correct order. Separation of characters is needed to identify them, and separation of words keeps characters grouped together.
In this project a simple form of line separation is used. This is all that is needed, as the separation is used only to keep the order of the characters correct. The height of characters is trimmed later on. Characters are separated using a newer and more novel approach than some other optical character recognition programs. A variation of seam carving, a technique first published is August 2007, is used. Seams are carved between characters, separating the characters but not parts of the characters, such as the dot above an i, as is done by other methods, such as separating all contiguous blocks of color. The formatting of characters is then made uniform by cropping them and putting them into the center of a square bitmap. This bitmap is then scaled to ten by ten pixels for faster recognition and less storage space in the recognition profile. This bitmap is then converted into a string of zeros and ones for easy storage within an XML file and easy matching for recognition.
In order to create recognition data to match characters, a training process is used. Uniformly formatted characters are matched with text typed in a text box, typed before training is started, and added to the recognition profile XML file for storage. To recognize characters, a simple matching method is used in which the character in the recognition data with the most pixels matching the pixels of the character to be recognized is outputted.
To separate words, the space between all characters is averaged, and spaces are inserted between characters when the interval between them is at least ten percent higher than the average. Recognized words are then put through a statistical spell checker. A dictionary, which holds words and their frequency, is created using a text file that contains books from Project Gutenberg and a list of words from wiktionary. The word to be checked is matched with the dictionary and returned unchanged if there is a correlation. If there is no match, all possible one character deletions, insertions, alterations, and transpositions are created and added to an array. These permutations are then checked with the dictionary and all correct possibilities are added to an array. If there is one correct possibility, it is returned. Otherwise the correct possibility with the highest frequency in the dictionary is returned. If there are no correct possibilities all possible one character deletions, insertions, alterations, and transpositions for each permutation from the first array are created and added to another array, creating a list of all possible two character deletions, insertions, alterations, and transpositions. These permutations are then checked with the dictionary and all correct possibilities are added to an array. Again if there is one correct possibility it is returned, otherwise the correct possibility with the highest frequency in the dictionary is returned. If no correct possibilities have been found after two character permutations, the word is returned unchanged as no correction was found.