ReCAPTCHA
reCAPTCHA is a system originally developed at Carnegie Mellon University's main Pittsburgh campus. It uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas. On September 16, 2009, Google acquired reCAPTCHA. reCAPTCHA is currently digitizing the archives of The New York Times and books from Google Books. Twenty years of The New York Times have been digitized, and the project planned to complete the remaining years by the end of 2010.

reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.

The system is reported to display over 200 million CAPTCHAs every day, and among its subscribers are such popular sites as Facebook, Ticketmaster, Twitter, 4chan, CNN.com, and StumbleUpon. Craigslist began using reCAPTCHA in June 2008. The U.S. National Telecommunications and Information Administration also used reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.

Origin

The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles."

Operation

Scanned text is subjected to analysis by two different optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word whose reading is already known. The system assumes that if the human types the control word correctly, the questionable word is also correct. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered resolved. Those words that are consistently given a single identity by human judges are recycled as control words.
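The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the project's actual code: the class and method names are hypothetical, and the case-insensitive comparison of the control word is an assumption.

```python
from collections import defaultdict

OCR_WEIGHT = 0.5     # each OCR program's reading counts for half a point
HUMAN_WEIGHT = 1.0   # each accepted human reading counts for a full point
THRESHOLD = 2.5      # score at which a reading is treated as settled


class UnknownWord:
    """Tallies candidate readings for one word the OCR programs disagreed on."""

    def __init__(self, ocr_guesses):
        self.scores = defaultdict(float)
        for guess in ocr_guesses:
            self.scores[guess] += OCR_WEIGHT

    def add_human_answer(self, control_expected, control_typed, unknown_typed):
        """Count the reading of the unknown word only if the control word matches.

        Assumption: the control word is compared case-insensitively after
        trimming whitespace.
        """
        if control_typed.strip().lower() != control_expected.strip().lower():
            return False  # control word wrong, so the answer is discarded
        self.scores[unknown_typed.strip()] += HUMAN_WEIGHT
        return True

    def resolved(self):
        """Return the winning reading once it reaches the threshold, else None."""
        if not self.scores:
            return None
        best = max(self.scores, key=self.scores.get)
        return best if self.scores[best] >= THRESHOLD else None
```

With two disagreeing OCR readings, a reading that one OCR program proposed needs two matching human answers to reach 2.5 points, while a reading that neither program proposed needs three.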

Implementation

reCAPTCHA tests are taken from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment), but the reCAPTCHA software itself is not open source.

reCAPTCHA offers plugins for several web-application platforms, such as ASP.NET, Ruby, and PHP, to ease the implementation of the service.
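The server-side callback mentioned above amounts to forwarding the visitor's answer to the reCAPTCHA service and reading back a pass/fail reply. The sketch below shows only the two pure steps, building the form body and parsing a reply, and makes no network call. The function names are hypothetical, and the field names and the two-line reply format are assumptions modeled on the legacy verification interface, so they should be checked against current documentation before use.

```python
from urllib.parse import urlencode


def build_verify_body(private_key, remote_ip, challenge, response):
    """Encode the fields a site's server would post back to the service.

    Assumption: the four field names below mirror the legacy interface.
    """
    return urlencode({
        "privatekey": private_key,  # the site's secret key
        "remoteip": remote_ip,      # address of the visitor who answered
        "challenge": challenge,     # identifier of the CAPTCHA that was shown
        "response": response,       # what the visitor typed
    })


def parse_verify_reply(body):
    """Interpret a reply whose first line is 'true' or 'false'.

    Assumption: on failure, the second line carries an error code.
    Returns a (passed, error_code) tuple.
    """
    lines = body.splitlines()
    passed = bool(lines) and lines[0].strip() == "true"
    if passed:
        return True, None
    error = lines[1].strip() if len(lines) > 1 else "unknown"
    return False, error
```

Keeping the request building and reply parsing separate from the HTTP transport makes both easy to test, and lets a site swap in whichever HTTP client its platform provides.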

Security

The purpose of a CAPTCHA system is to prevent automated access to a system by computer programs or "bots". On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed a solve rate of 18%. On August 1, 2010, Chad Houck gave a presentation to the DEF CON 18 Hacking Conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time. The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck adapted his method to what he described as an "easier" CAPTCHA, determining a valid response 31.8% of the time. Houck also mentioned security defenses in the system, such as a high-security lockout if no valid response is given 32 times in a row. reCAPTCHA frequently modifies its system, which would require the author of a similar program to update the decoding method just as often and may frustrate potential abusers.

Mailhide

reCAPTCHA has also created the Mailhide project, which protects email addresses on web pages from being harvested by spammers. By default, the email address is converted into a format that does not allow a crawler to see the full email address. For example, "mailme@example.com" would be converted to "mai...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address. One can also edit the popup code so that none of the address is visible.
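The masking step in the example above can be illustrated with a short Python function. It is a sketch, not Mailhide's own code: the function name is hypothetical, and showing exactly three leading characters is an assumption taken from the single example in the text.

```python
def mask_email(address, visible=3):
    """Hide most of the local part of an email address.

    `visible` is the number of leading characters left readable.
    Assumption: the default of 3 simply matches the example in the text.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local[:visible]}...@{domain}"
```

Calling `mask_email("mailme@example.com")` gives "mai...@example.com", and passing `visible=0` hides the whole local part. In the real service the "..." would be a link to the CAPTCHA; this sketch covers only the text transformation.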

External links

  • The reCAPTCHA project
  • Try reCAPTCHA at google.com
  • ReCAPTCHA: The job you didn't even know you had. Two-page article in The Walrus magazine
  • Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham and Manuel Blum. 2008. "CAPTCHA: Human-Based Character Recognition via Web Security Measures" Science 12 September 2008: Vol. 321 no. 5895 pp. 1465-1468. http://dx.doi.org/10.1126/science.1160379


The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 