CAPTCHA
A CAPTCHA is a type of challenge-response test used in computing as an attempt to ensure that the response is generated by a person. The process usually involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. Because other computers are assumed to be unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. Thus, it is sometimes described as a reverse Turing test, because it is administered by a machine and targeted at a human, in contrast to the standard Turing test that is typically administered by a human and targeted at a machine. A common type of CAPTCHA requires the user to type letters or digits from a distorted image that appears on the screen.

The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford (all of Carnegie Mellon University). It is an acronym based on the word "capture" and standing for "Completely Automated Public Turing test to tell Computers and Humans Apart". Carnegie Mellon University attempted to trademark the term, but the trademark application was abandoned on 21 April 2008.

Characteristics

A CAPTCHA is a means of automatically generating challenges which is intended to:
  • Provide a problem easy enough for all humans to solve.
  • Prevent standard automated software from filling out a form.

A check box in a form that reads "check this box please" is the simplest (and perhaps least effective) form of a CAPTCHA.
CAPTCHAs do not have to rely on difficult problems in artificial intelligence, although they can.

Applications

CAPTCHAs are used in attempts to prevent automated software from performing actions which degrade the quality of service of a given system, whether due to abuse or resource expenditure. CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services of Gmail, Hotmail, and Yahoo! Mail.

CAPTCHAs are also used to minimize automated posting to blogs, forums and wikis, whether as a result of commercial promotion, or harassment and vandalism. CAPTCHAs also serve an important function in rate limiting. Automated usage of a service might be desirable until such usage is done to excess and to the detriment of human users. In such cases, administrators can use CAPTCHA to enforce automated usage policies based on given thresholds. The article rating systems used by many news web sites are another example of an online facility vulnerable to manipulation by automated software.

Accessibility

Because CAPTCHAs rely on visual perception, users unable to view a CAPTCHA due to a disability will be unable to perform the task protected by a CAPTCHA. Therefore, sites implementing CAPTCHAs may provide an audio version of the CAPTCHA in addition to the visual method. The official CAPTCHA site recommends providing an audio CAPTCHA for accessibility reasons, but it is not usable for deafblind people or for users of text web browsers. This combination is not universally adopted, with most websites (including Wikipedia) offering only the visual CAPTCHA, with or without providing the option of generating a new image if one is too difficult to read.

Attempts at more accessible CAPTCHAs

Even audio and visual CAPTCHAs will require manual intervention for some users, such as those who have disabilities. There have been various attempts at creating more accessible CAPTCHAs, including the use of JavaScript, mathematical questions ("how much is 1+1") and common sense questions ("what color is the sky on a clear day"). However, these approaches may worsen accessibility for people with intellectual and developmental disabilities, for instance dyscalculia. Some CAPTCHAs of this kind do not meet the criteria for a successful CAPTCHA because they are not automatically generated or do not present a new problem or test for each attack.

One interesting approach to text-based CAPTCHAs is to create a central "anti-bot server", used by many websites, which selects for each call one puzzle, randomly, from a very large set of many different automatically-generated puzzles, of many different kinds. Such a solution can be made useable for blind and visually impaired people who otherwise find prevalent image-based CAPTCHAs to be insurmountable obstacles to completing web forms.

Circumvention

There are several approaches available to defeating CAPTCHAs:
  • exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA,
  • improving character recognition software, or
  • using cheap human labor to process the tests (see below).

Insecure implementation

Like any security system, a CAPTCHA can be undermined by design flaws in its implementation that prevent its theoretical security from being realized. Many CAPTCHA implementations, especially those which have not been designed and reviewed by experts in the fields of security, are prone to common attacks.

Some CAPTCHA protection systems can be bypassed without using OCR simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does not allow multiple solution attempts at one CAPTCHA. This prevents the reuse of a correct CAPTCHA solution or making a second guess after an incorrect OCR attempt. Other CAPTCHA implementations use a hash (such as an MD5 hash) of the solution as a key passed to the client to validate the CAPTCHA. Often the CAPTCHA is of small enough size that this hash could be cracked. Further, the hash could assist an OCR based attempt. A more secure scheme would use an HMAC. Finally, some implementations use only a small fixed pool of CAPTCHA images. Eventually, when enough CAPTCHA image solutions have been collected by an attacker over a period of time, the CAPTCHA can be broken by simply looking up solutions in a table, based on a hash of the challenge image.

Computer character recognition

A number of research projects have attempted (often successfully) to beat visual CAPTCHAs by creating programs that contain the following functionality:
  1. Pre-processing (noise reduction): Removal of background clutter and noise.
  2. Segmentation: Splitting the image into regions which each contain a single character.
  3. Classification: Identifying the character in each region.


Steps 1 and 3 are easy tasks for computers. The only step where humans still outperform computers is segmentation. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on the segmentation.
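The reason the article tells designers to concentrate on segmentation is easiest to see on a toy version of the three-stage pipeline. The following is an added example written for this page, not code from any of the research projects mentioned: it runs noise reduction, projection-based segmentation and template classification over two constructed 11x5 bitmaps of the glyphs T, L and I. In the first image a single stray pixel in a gap column is removed by pre-processing and the three letters are read correctly; in the second the same glyphs are joined by a stroke of letter-like clutter along the baseline, and the segmenter can only produce one region, so classification has nothing sensible to work on. The glyph templates, the `#`/`.` bitmap format and the scoring rule are assumptions of the example.

```python
INK, BG = "#", "."

def preprocess(rows):
    """Step 1, noise reduction: drop ink pixels with no ink among their 8 neighbours."""
    h, w = len(rows), len(rows[0])
    def ink(r, c):
        return 0 <= r < h and 0 <= c < w and rows[r][c] == INK
    cleaned = []
    for r in range(h):
        line = []
        for c in range(w):
            keep = rows[r][c] == INK and any(
                ink(r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0))
            line.append(INK if keep else BG)
        cleaned.append("".join(line))
    return cleaned

def segment(rows):
    """Step 2, segmentation by vertical projection: every maximal run of columns
    that contains ink becomes one region. Ink bridging two glyphs merges them."""
    w = len(rows[0])
    inked = [any(row[c] == INK for row in rows) for c in range(w)] + [False]
    regions, start = [], None
    for c, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = c
        elif not has_ink and start is not None:
            regions.append([row[start:c] for row in rows])
            start = None
    return regions

# Constructed 3x5 glyph templates for the classifier.
TEMPLATES = {
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
}

def classify(region):
    """Step 3, classification: nearest template by pixel agreement over the
    left-aligned overlap, normalised by the larger of the two widths."""
    best, best_score = "?", -1.0
    for label, tpl in TEMPLATES.items():
        width = min(len(region[0]), len(tpl[0]))
        matches = sum(region[r][c] == tpl[r][c]
                      for r in range(len(tpl)) for c in range(width))
        score = matches / (len(tpl) * max(len(region[0]), len(tpl[0])))
        if score > best_score:
            best, best_score = label, score
    return best

def read_captcha(rows):
    """Run the three stages; returns (decoded text, number of regions found)."""
    regions = segment(preprocess(rows))
    return "".join(classify(r) for r in regions), len(regions)

# Constructed input: "TLI" with a stray noise pixel in a gap column (row 2, col 7).
CLEAN = [
    "###.#...###",
    ".#..#....#.",
    ".#..#..#.#.",
    ".#..#....#.",
    ".#..###.###",
]
# Constructed input: the same glyphs joined by a clutter stroke along the bottom row.
CONNECTED = [
    "###.#...###",
    ".#..#....#.",
    ".#..#....#.",
    ".#..#....#.",
    ".#.########",
]
```

`read_captcha(CLEAN)` yields `("TLI", 3)`, while `read_captcha(CONNECTED)` finds a single region spanning the whole image. Column-gap segmentation fails exactly when the clutter is connected to the glyphs, which is why a stray dot is harmless (step 1 removes it) but a connecting stroke is not.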

Several research projects have broken real world CAPTCHAs, including one of Yahoo's early CAPTCHAs called "EZ-Gimpy", the CAPTCHAs used by popular sites such as PayPal, LiveJournal, phpBB, the e-banking CAPTCHAs used by a lot of financial institutions, and CAPTCHAs used by other services. In January 2008 Network Security Research released their program for automated Yahoo! CAPTCHA recognition. Windows Live Hotmail and Gmail, the other two major free email providers, were cracked shortly after.

In February 2008 it was reported that spammers had achieved a success rate of 30% to 35%, using a bot, in responding to CAPTCHAs for Microsoft's Live Mail service and a success rate of 20% against Google's Gmail CAPTCHA. A Newcastle University research team has defeated the segmentation part of Microsoft's CAPTCHA with a 90% success rate, and claim that this could lead to a complete crack with a greater than 60% rate.

Human solvers

CAPTCHA is vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a group of human operators who can solve CAPTCHAs. In this scheme, a computer fills out a form and when it reaches a CAPTCHA, it gives the CAPTCHA to the human operator to solve.

Spammers pay about $0.80 to $1.20 for each 1,000 solved CAPTCHAs to companies employing human solvers in Bangladesh, China, India, and many other developing nations. Other sources cite a price tag of as low as $0.50 for each 1,000 solved.

Another approach involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. With enough traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site. In October 2007, a piece of malware appeared in the wild which enticed users to solve CAPTCHAs in order to see progressively further into a series of striptease images. A more recent view is that this is unlikely to work due to unavailability of high-traffic sites and competition by similar sites.

These methods have been used by spammers to set up thousands of accounts on free email services such as Gmail and Yahoo!. Since Gmail and Yahoo! are unlikely to be blacklisted by anti-spam systems, spam sent through these compromised accounts is less likely to be blocked.

Legal concerns

The circumvention of CAPTCHAs may violate the anti-circumvention clause of the Digital Millennium Copyright Act (DMCA) in the United States. In 2007, Ticketmaster sued software maker RMG Technologies for its product which circumvented the ticket seller's CAPTCHAs on the basis that it violated the anti-circumvention clause of the DMCA. In October 2007, an injunction was issued stating that Ticketmaster would likely succeed in making its case. In June 2008, Ticketmaster filed for default judgment against RMG. The Court granted Ticketmaster the default and entered an $18.2M judgment in favor of Ticketmaster.

Image-recognition CAPTCHAs

Some researchers promote image recognition CAPTCHAs as a possible alternative for text-based CAPTCHAs. Computer-based recognition algorithms require the extraction of color, texture, shape, or special point features, which cannot be correctly extracted after the designed distortions. However, humans can still recognize the original concept depicted in the images even with these distortions.

A recent example of image recognition CAPTCHA is to present the website visitor with a grid of random pictures and instruct the visitor to click on specific pictures to verify that they are not a bot (such as “Click on the pictures of the airplane, the boat and the clock”).

Image recognition CAPTCHAs face many potential problems which have not been fully studied. It is difficult for a small site to acquire a large dictionary of images to which an attacker does not have access, and without a means of automatically acquiring new labelled images, an image-based challenge does not usually meet the definition of a CAPTCHA. KittenAuth, by default, had only 42 images in its database. Microsoft's "Asirra", which it is providing as a free web service, attempts to address this by means of Microsoft Research's partnership with Petfinder.com, which has provided it with more than three million images of cats and dogs, classified by people at thousands of US animal shelters. Researchers claim to have written a program that can break the Microsoft Asirra CAPTCHA. The IMAGINATION CAPTCHA, however, uses a sequence of randomized distortions on the original images to create the CAPTCHA images. Its original images can be made public without risk of image-retrieval or image-annotation based attacks.

Human solvers are a potential weakness for strategies such as Asirra. If the database of cat and dog photos can be downloaded, paying workers $0.01 to classify each photo as of either a dog or a cat means that almost the entire database of photos can be deciphered for $30,000. Photos that are subsequently added to the Asirra database are then a relatively small data set that can be classified as they first appear. Causing minor changes to images each time they appear will not prevent a computer from recognizing a repeated image as there are robust image comparator functions (e.g., image hashes, color histograms) that are insensitive to many simple image distortions. Warping an image sufficiently to fool a computer will likely also be troublesome to a human.

Researchers at Google used image orientation and collaborative filtering as a CAPTCHA. Generally speaking, people know what "up" is, but computers have a difficult time determining it for a broad range of images. Images were pre-screened to select those in which "up" is hard to detect automatically (e.g. no skies, no faces, no text). Images were also collaboratively filtered by showing a "candidate" image along with known-good images for the person to rotate. If there was a large variance in answers for the candidate image, it was deemed too hard for people as well and discarded.

Many users of the phpBB forum software (which has suffered greatly from spam) have implemented an open source image recognition CAPTCHA system in the form of an addon called KittenAuth which in its default form presents a question requiring the user to select a stated type of animal from an array of thumbnail images of assorted animals. The images (and the challenge questions) can be customized, for example to present questions and images which would be easily answered by the forum's target userbase. Furthermore, for a time, RapidShare free users had to get past a CAPTCHA where they had to enter letters attached only to a cat, while others were attached to dogs. This was later removed because (legitimate) users had trouble entering the correct letters.

See also

  • Image spam, where spammers exploit the inability of computers to read text in images to avoid junkmail filtering
  • One-way function
  • reCAPTCHA
  • Web scraping

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.