Sahil Badla

Publication Date

Spring 2014

Degree Type

Master's Project


Computer Science


This project investigates the principles of optical character recognition (OCR) used in the Tesseract OCR engine and techniques to improve its efficiency and runtime. OCR has been used to convert printed text into editable text in various applications on a variety of devices, such as scanners, computers, and tablets. Mobile devices are now taking over from computers in all domains, but OCR remains a field they have not yet conquered, so programmers need to improve the efficiency of OCR systems to make them run properly on mobile devices. This paper focuses on improving the efficiency of Tesseract OCR for the Hindi language on mobile devices, because few applications exist for this purpose and most of them are either not open source or not built for mobile devices. Improving Hindi text extraction will increase Tesseract's performance in mobile phone apps and, in turn, draw developers to contribute to Hindi OCR. This paper presents a preprocessing technique applied to the Tesseract engine that improves character recognition while keeping the runtime low. As a result, the system runs as smoothly and efficiently on mobile devices (Android) as it does on larger machines.