OCRopus
OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. Its design is highly modular: components are implemented as plugins, so they can be swapped out easily.

OCRopus is currently developed under the lead of Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google.

OCRopus is developed for Linux; however, users have reported success with OCRopus on Mac OS X, and an application called TakOCR has been developed that installs OCRopus on Mac OS X and provides a simple droplet interface.

How it works

OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling.
It is aimed primarily at high-volume document conversion, namely for Google Book Search, but also at desktop and office use and at visually impaired users.

OCRopus originally used Tesseract as its only character recognition plugin, but since the 0.4 release it uses its own engine. This is especially useful for extending the system to additional languages and writing systems. OCRopus also contains disabled code for a handwriting recognition engine, which may be repaired in the future.

OCRopus's layout analysis plugin does image preprocessing and layout analysis: it chops up the scanned document and passes the sections to a character recognition plugin for line-by-line or character-by-character recognition.

As of the alpha release, OCRopus uses the language modeling code from OpenFST, another Google-supported project. OpenFST became optional as of version pre-0.4.
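
The three pluggable stages described in this section (layout analysis, character recognition, language modeling) can be illustrated with a minimal Python sketch. Every name in it (Pipeline, split_on_blank_rows, join_rows, fix_common_confusions) is hypothetical and chosen for illustration only; it is not the OCRopus plugin API, and the "image" is just a list of text rows standing in for pixel data.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical; this is NOT the OCRopus API.
Image = List[str]  # toy stand-in: a page is a list of text rows

LayoutAnalyzer = Callable[[Image], List[Image]]  # page -> line regions
LineRecognizer = Callable[[Image], str]          # line region -> raw text
LanguageModel = Callable[[str], str]             # raw text -> corrected text


@dataclass
class Pipeline:
    """Chains three interchangeable stages; any stage can be swapped."""
    layout: LayoutAnalyzer
    recognizer: LineRecognizer
    language_model: LanguageModel

    def run(self, page: Image) -> List[str]:
        regions = self.layout(page)                    # chop up the page
        raw = [self.recognizer(r) for r in regions]    # recognize each line
        return [self.language_model(t) for t in raw]   # post-correct


def split_on_blank_rows(page: Image) -> List[Image]:
    """Toy layout analysis: blank rows separate text lines."""
    regions, current = [], []
    for row in page:
        if row.strip():
            current.append(row)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


def join_rows(region: Image) -> str:
    """Toy recognizer: the 'image' already holds its text."""
    return " ".join(row.strip() for row in region)


def fix_common_confusions(text: str) -> str:
    """Toy language model: a lookup table of known misreadings."""
    corrections = {"rn0dular": "modular"}
    return " ".join(corrections.get(word, word) for word in text.split())


page = ["OCRopus is", "", "rn0dular"]
pipeline = Pipeline(split_on_blank_rows, join_rows, fix_common_confusions)
print(pipeline.run(page))
```

Because each stage is only a callable with a fixed signature, replacing the language model with a pass-through (for example `lambda text: text`) changes nothing else in the pipeline, which is the property the plugin design is meant to provide.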

History

Release history:
  • Initial announcement - 9 April 2007
  • 0.1.0 (alpha) - 22 Oct 2007
  • 0.1.1 (alpha) - 14 Dec 2007 (improved build system)
  • 0.2 (alpha 2) - 31 May 2008
  • 0.3 (alpha 3) - 16 Oct 2008
  • pre-0.4 (alpha 4) - May 2009 (available for download)
  • 0.4.3 - July 2009
  • 0.4.4 - March 2010

Usage

OCRopus can be used from the command line or inside gscan2pdf. Once installed, it can be invoked by specifying the input images. It writes hOCR (HTML-based) code to standard output. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).
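
Because hOCR is HTML-based, the output can be post-processed with an ordinary HTML parser. The sketch below uses Python's standard html.parser module to pull the text and bounding box of each line out of an hOCR document. It relies only on the hOCR conventions that a text line carries the class ocr_line and that its title attribute holds a "bbox x0 y0 x1 y1" property. The sample markup is hand-written for illustration and is not actual OCRopus output, and the helper names (parse_bbox, LineExtractor) are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style; NOT actual OCRopus output.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 50">OCRopus is a free</span>
  <span class="ocr_line" title="bbox 10 60 280 90">document analysis system</span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collects (bbox, text) for every element with class ocr_line.

    Simplification: nesting is tracked by counting start/end tags, so
    void elements such as <br> inside a line are not handled.
    """

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # > 0 while inside an ocr_line element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            self._depth += 1  # nested element inside the current line
            return
        attr = dict(attrs)
        if "ocr_line" in (attr.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # the ocr_line element just closed
                self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


extractor = LineExtractor()
extractor.feed(SAMPLE)
for bbox, text in extractor.lines:
    print(bbox, text)
```

In practice the string fed to the parser would be the text captured from OCRopus's standard output rather than the inline sample.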

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 