Tesseract (software)
Tesseract is a free software optical character recognition (OCR) engine for various operating systems.

Originally developed as proprietary software at Hewlett-Packard between 1985 and 1995, it had very little work done on it in the following decade. It was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development is currently sponsored by Google. It is released under the Apache License, Version 2.0.

Tesseract is considered one of the most accurate free software OCR engines currently available.

History

The Tesseract engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado, between 1985 and 1994, with further changes made in 1996 to port it to Windows and some migration from C to C++ in 1998. Much of the code was written in C, with later additions in C++. Since then, all the code has been converted so that it at least compiles with a C++ compiler.

Currently Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a list system using macros. This predates the C++ Standard Template Library and may be more efficient than Standard Template Library lists, but is reportedly harder to debug in the event of a segmentation fault. Another side-effect of the C/C++ split is that the C++ data structures get converted to C data structures to call the low-level C code. The migration to C++ is a step towards eliminating this conversion, though it is not yet complete.

Features

Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by developers.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, so inputting multi-column text, images, or equations produced garbled output. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.
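hOCR carries positional information in HTML title attributes, typically as a property of the form "bbox x0 y0 x1 y1". The following is a minimal sketch of reading word bounding boxes from such output, using only the Python standard library. The sample markup is hand-written for illustration and is not actual Tesseract output, and the names parse_bbox, WordBoxParser and parse_word_boxes are hypothetical helpers; real output may use different class names or carry extra properties depending on the version.

```python
from html.parser import HTMLParser

# Hand-written hOCR-style fragment for illustration (an assumption,
# not captured from a real Tesseract run).
SAMPLE_HOCR = """
<span class='ocr_line' title="bbox 36 92 580 122">
  <span class='ocr_word' title="bbox 36 92 96 116">Hello</span>
  <span class='ocr_word' title="bbox 109 92 193 116">world</span>
</span>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title string, or None."""
    # Properties in a title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collects (text, bbox) pairs for elements whose class mentions 'word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_box = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "word" in (attr_map.get("class") or ""):
            self._pending_box = parse_bbox(attr_map.get("title") or "")

    def handle_data(self, data):
        text = data.strip()
        if self._pending_box is not None and text:
            self.words.append((text, self._pending_box))
            self._pending_box = None


def parse_word_boxes(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) tuples found in hocr."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, box in parse_word_boxes(SAMPLE_HOCR):
        print(text, box)
```

A frontend could use boxes like these to highlight recognized words over the original page image.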

The initial versions of Tesseract could only recognize English language text. Starting with version 2, Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3, it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish (standard and Fraktur script), German, Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.

If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though it were left-to-right text.

Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.

User interfaces

Tesseract does not come with a GUI and is instead run from the command-line interface.
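As a rough illustration of command-line use, the sketch below builds and runs a Tesseract invocation from Python. It assumes the conventional argument order "tesseract imagename outputbase [-l lang] [configfile]" and that a config file named "hocr" is installed to request hOCR output; both should be checked against the installed version. The function names build_tesseract_command and run_tesseract are hypothetical helpers introduced here.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", hocr=False):
    """Return the argument list for a Tesseract command-line call.

    Assumes the usage form: tesseract imagename outputbase [-l lang] [configfile]
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        # Assumption: a config file called "hocr" is available and switches
        # the output to hOCR; the output file extension varies by version.
        cmd.append("hocr")
    return cmd


def run_tesseract(image_path, output_base, lang="eng", hocr=False):
    """Run Tesseract on one image; raises if the program is missing or fails."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, lang, hocr)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the command that would recognize page.tif as English text.
    print(" ".join(build_tesseract_command("page.tif", "out")))
```

Keeping the command construction separate from execution makes it possible to inspect or log the exact call before running it, which is convenient when a GUI or script wraps the engine.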

There are several separate projects which provide a GUI for Tesseract:
  • FreeOCR – a Windows Tesseract GUI
  • gImageReader – GTK GUI frontend for Tesseract that supports selecting columns and parts of the document. It can open multipage PDF files or images, supports all formats, can transmit a selected area to Tesseract for recognition and spell check the output.
  • gscan2pdf – A GUI to produce PDFs or DjVus from scanned documents
  • OCRFeeder – Features a complete GTK graphical user interface that allows users to correct any unrecognized characters, define or correct bounding boxes, set paragraph styles, clean the input images, import PDFs, save and load the project, export everything to multiple formats, etc.
  • OcrGui – A Linux GUI written in C using the GLib and GTK+ frameworks; it supports both Tesseract and GOCR. It includes spell checking using Hunspell, an open source spell checker.
  • Qiqqa – A freeware PDF reference management tool that uses Tesseract to interpret scanned PDFs for full-index searching.
  • Tesseract GUI – A Mac OS X free software GUI
  • TextRipper – a Linux Tesseract and/or Ocrad GUI with multiple-page, multiple-column, and multiple-file selection support.
  • VietOCR – A Java-based cross-platform GUI that includes a language pack for Vietnamese and special post-processing tools for Vietnamese
  • YAGF – Graphical front-end (Qt 4.x) for Cuneiform and Tesseract

Libraries using Tesseract engine

  • ABCocr .NET – an OCR component for Microsoft's .NET Framework, with support for 64-bit systems, built around a custom version of the Tesseract 3 engine.
  • hOcr2Pdf.NET – a .NET library to convert Tesseract-recognized images into PDF with search capabilities using HtmlAgilityPack and iTextSharp.

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 