OCRopus
OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. It has a very modular design based on plugins, which allow OCRopus to swap out components easily.
OCRopus is currently developed under the lead of Thomas Breuel from the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany, and is sponsored by Google.
OCRopus is developed for Linux; however, users have reported success with OCRopus on Mac OS X, and an application called TakOCR has been developed that installs OCRopus on Mac OS X and provides a simple droplet interface.
How it works
OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling.
It is aimed primarily at high-volume document conversion, namely for Google Book Search, but also at desktop and office use and at use by visually impaired people.
OCRopus originally used Tesseract as its only character recognition plugin, but as of the 0.4 release it uses its own engine. This is especially useful in expanding functionality to include additional languages and writing systems. OCRopus also contains disabled code for a handwriting recognition engine, which may be repaired in the future.
OCRopus's layout analysis plugin does image preprocessing and layout analysis: it chops up the scanned document and passes the sections to a character recognition plugin for line-by-line or character-by-character recognition.
As of the alpha release, OCRopus uses language modeling code from OpenFST, another Google-supported project; the dependency is optional as of version pre-0.4.
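The three pluggable stages described above can be pictured as a small pipeline in which each stage is a replaceable function. The following Python sketch is only an illustration of that idea: the names `Pipeline`, `split_into_lines`, `confusion_recognizer`, `identity_recognizer` and `dictionary_model` are hypothetical, the "page" is a plain string standing in for a scanned image, and none of this reflects OCRopus's actual plugin interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures (assumptions for illustration, not OCRopus APIs).
LayoutAnalyzer = Callable[[str], List[str]]   # page -> text-line regions
LineRecognizer = Callable[[str], List[str]]   # line region -> candidate readings
LanguageModel = Callable[[List[str]], str]    # candidates -> chosen reading

VOCABULARY = {"hello", "world", "ocr", "line"}


def split_into_lines(page: str) -> List[str]:
    """Toy layout analysis: treat each non-empty row of the page as one line."""
    return [row.strip() for row in page.splitlines() if row.strip()]


def confusion_recognizer(line: str) -> List[str]:
    """Toy recognizer: offer the raw reading plus one with 0/o and 1/l swapped."""
    return [line, line.replace("0", "o").replace("1", "l")]


def identity_recognizer(line: str) -> List[str]:
    """Alternative recognizer that offers only the raw reading."""
    return [line]


def dictionary_model(candidates: List[str]) -> str:
    """Toy language model: prefer the candidate with the most known words."""
    return max(candidates, key=lambda c: sum(w in VOCABULARY for w in c.split()))


@dataclass
class Pipeline:
    layout: LayoutAnalyzer
    recognizer: LineRecognizer
    language_model: LanguageModel

    def run(self, page: str) -> List[str]:
        # Layout analysis chops the page up; each piece is recognized and
        # the language model picks the final reading for that piece.
        return [self.language_model(self.recognizer(line))
                for line in self.layout(page)]


PAGE = "hell0 w0rld\n\nocr 1ine"

default = Pipeline(split_into_lines, confusion_recognizer, dictionary_model)
print(default.run(PAGE))   # ['hello world', 'ocr line']

# Swapping one stage leaves the other two untouched.
swapped = Pipeline(split_into_lines, identity_recognizer, dictionary_model)
print(swapped.run(PAGE))   # ['hell0 w0rld', 'ocr 1ine']
```

Because every stage is passed in as a value, replacing the recognizer (as OCRopus did when it moved from Tesseract to its own engine) does not require changes to layout analysis or language modeling.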
History
Release history:
- Initial announcement - 9 April 2007
- 0.1.0 (alpha) - 22 Oct 2007
- 0.1.1 (alpha) - 14 Dec 2007 - Improved build system
- 0.2 (alpha 2) - 31 May 2008
- 0.3 (alpha 3) - 16 Oct 2008
- pre-0.4 (alpha 4) - available for download May 2009
- 0.4.3 - July 2009
- 0.4.4 - March 2010
- Updated Roadmap
Usage
OCRopus can be used from the command line or inside gscan2pdf. Once installed, it can be invoked by specifying the input images. It will output hOCR (HTML-based) code to standard output. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).
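Because hOCR is ordinary HTML, the captured output can be post-processed with a standard HTML parser. The sketch below uses only Python's standard library and relies on two hOCR conventions: text lines are elements with the class `ocr_line`, and their bounding box is stored in the `title` attribute as `bbox x0 y0 x1 y1`. The sample markup is hand-written for illustration and is not actual OCRopus output; `HocrLineParser` and `parse_hocr_lines` are hypothetical names.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished (bbox, text) pairs
        self._in_line = False  # currently inside an ocr_line element?
        self._tag = None       # tag name of the current ocr_line element
        self._depth = 0        # nesting depth of same-named tags inside it
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self._in_line:
            if "ocr_line" in (attrs.get("class") or "").split():
                match = BBOX_RE.search(attrs.get("title") or "")
                self._bbox = tuple(map(int, match.groups())) if match else None
                self._in_line, self._tag, self._depth = True, tag, 1
                self._chunks = []
        elif tag == self._tag:
            self._depth += 1   # nested element with the same tag name

    def handle_endtag(self, tag):
        if self._in_line and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the text pieces and normalize whitespace.
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))
                self._in_line = False

    def handle_data(self, data):
        if self._in_line:
            self._chunks.append(data)


def parse_hocr_lines(hocr: str):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hand-written sample in hOCR style (an assumption, not real tool output).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 45'>Hello world</span>
  <span class='ocr_line' title='bbox 10 60 280 85'>second <span class='ocrx_word'>line</span></span>
</div>
"""

for bbox, text in parse_hocr_lines(SAMPLE):
    print(bbox, text)
```

In practice the string passed to `parse_hocr_lines` would be the text read from the recognizer's standard output rather than the inline sample.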
External links
- OCRopus (project page on Google Code)
- OCRopus Wiki
- IUPR Publication Server (papers behind many of the algorithms used in OCRopus)
- OCRopus course (outline of OCRopus code and how to contribute)