History of machine translation
The history of machine translation generally starts in the 1950s, although work can be found from earlier periods. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The experiment was a great success and ushered in an era of significant funding for machine translation research in the United States. The authors claimed that within three to five years, machine translation would be a solved problem. In the Soviet Union, similar experiments were performed shortly after.
The real progress was much slower, however, and after the ALPAC report of 1966, which found that ten years of research had failed to fulfill expectations, funding was dramatically reduced. Starting in the late 1980s, as computational power increased and became less expensive, more interest began to be shown in statistical models for machine translation.
Today there is still no system that provides the holy grail of "fully automatic high quality translation of unrestricted text" (FAHQUT). However, there are many programs now available that are capable of providing useful output within strict constraints; several of them are available online, such as Google Translate and the SYSTRAN system which powers AltaVista's (Yahoo's since May 9, 2008) Babel Fish.
The beginning

The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.

The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by the Russian Peter Troyanskii, was more detailed. It included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto. The system was split into three stages: the first was for a native-speaking editor in the source language to organise the words into their logical forms and syntactic functions; the second was for the machine to "translate" these forms into the target language; and the third was for a native-speaking editor in the target language to normalise this output. His scheme remained unknown until the late 1950s, by which time computers were well known.
The early years

The first proposals for machine translation using computers were put forward by Warren Weaver, a researcher at the Rockefeller Foundation, in his July 1949 memorandum. These proposals were based on information theory, successes of code breaking during the Second World War, and speculation about universal underlying principles of natural language.
A few years after these proposals, research began in earnest at many universities in the United States. On 7 January 1954, the Georgetown-IBM experiment, the first public demonstration of an MT system, was held in New York at the head office of IBM. The demonstration was widely reported in the newspapers and received much public interest. The system itself, however, was no more than what today would be called a "toy" system, having just 250 words and translating just 49 carefully selected Russian sentences into English — mainly in the field of chemistry. Nevertheless, it encouraged the view that machine translation was imminent — and in particular stimulated the financing of research, not just in the US but worldwide.
Early systems used large bilingual dictionaries and hand-coded rules for fixing the word order in the final output. This was eventually found to be too restrictive, and developments in linguistics at the time, for example generative linguistics and transformational grammar, were proposed as ways to improve the quality of translations.
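Conceptually, this early "direct" approach can be sketched in a few lines: a word-for-word dictionary lookup followed by a hand-coded reordering rule. The miniature Spanish-English dictionary and the single noun-adjective rule below are invented for illustration; real systems relied on dictionaries of tens of thousands of entries and many such rules.

```python
# Toy sketch of early "direct" machine translation: dictionary lookup
# plus one hand-coded word-order rule. All vocabulary is invented.

DICTIONARY = {
    "el": "the", "gato": "cat", "negro": "black", "come": "eats",
}
ADJECTIVES = {"negro"}  # flagged so the reordering rule can find them

def translate(sentence):
    words = sentence.lower().split()
    # Rule: Spanish noun + adjective becomes English adjective + noun.
    for i in range(len(words) - 1):
        if words[i + 1] in ADJECTIVES:
            words[i], words[i + 1] = words[i + 1], words[i]
    # Word-for-word lookup; unknown words pass through unchanged.
    return " ".join(DICTIONARY.get(w, w) for w in words)

print(translate("el gato negro come"))  # -> "the black cat eats"
```

The restrictiveness the text mentions is easy to see here: every new syntactic pattern needs another hand-written rule.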
During this time, operational systems were installed. The United States Air Force used a system produced by IBM and Washington University, while the Atomic Energy Commission in the United States and Euratom in Italy used a system developed at Georgetown University. While the quality of the output was poor, it nevertheless met many of the customers' needs, chiefly in terms of speed.
At the end of the 1950s, an argument against the possibility of "Fully Automatic High Quality Translation" by machines was put forward by Yehoshua Bar-Hillel, a researcher asked by the US government to look into machine translation. The argument is one of semantic ambiguity or double meaning. Consider the following sentence:

- Little John was looking for his toy box. Finally he found it. The box was in the pen.

The word pen may have two meanings: the first, an instrument used for writing; the second, an enclosure or container of some kind. To a human the meaning is obvious, but Bar-Hillel claimed that without a "universal encyclopedia" a machine would never be able to deal with this problem. Today, this type of semantic ambiguity can be addressed by writing source texts for machine translation in a controlled language that uses a vocabulary in which each word has exactly one meaning.
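A rough sketch of how such a controlled language is enforced in practice: an authoring check that flags any word outside the approved one-sense-per-word vocabulary before the text is sent to the translator. The word lists below are invented for illustration.

```python
# Toy sketch of a controlled-language vocabulary check. Each approved
# word carries exactly one sense; anything outside the approved list
# (including ambiguous words such as "pen") is flagged so the author
# can rewrite it before machine translation. All entries are invented.

APPROVED = {
    "the": "definite article",
    "box": "container",
    "was": "past tense of 'be'",
    "in": "inside of",
    "enclosure": "fenced area",
}

def flag_unapproved(sentence):
    """Return the words an author must rewrite before translation."""
    return [w for w in sentence.lower().split() if w not in APPROVED]

print(flag_unapproved("the box was in the enclosure"))  # -> []
print(flag_unapproved("the box was in the pen"))        # -> ['pen']
```

The author would be prompted to replace "pen" with the single-sense word "enclosure", removing the ambiguity Bar-Hillel identified.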
The 1960s, the ALPAC report and the seventies

Research in the 1960s in both the Soviet Union and the United States concentrated mainly on the Russian-English language pair. The objects of translation were chiefly scientific and technical documents, such as articles from scientific journals. The rough translations produced were sufficient to get a basic understanding of the articles. If an article discussed a subject deemed to be of security interest, it was sent to a human translator for a complete translation; if not, it was discarded.
A great blow came to machine translation research in 1966 with the publication of the ALPAC report. The report was commissioned by the US government and produced by ALPAC, the Automatic Language Processing Advisory Committee, a group of seven scientists convened in 1964. The government was concerned about the lack of progress being made despite significant expenditure. The report concluded that machine translation was more expensive, less accurate and slower than human translation, and that despite the expense, machine translation was not likely to reach the quality of a human translator in the near future. It recommended, however, that tools be developed to aid translators — automatic dictionaries, for example — and that some research in computational linguistics should continue to be supported.
The publication of the report had a profound impact on research into machine translation in the United States, and to a lesser extent the Soviet Union and United Kingdom. Research, at least in the US, was almost completely abandoned for over a decade. In Canada, France and Germany, however, research continued. In the US the main exceptions were the founders of Systran (Peter Toma) and Logos (Bernard Scott), who established their companies in 1968 and 1970 respectively and served the US Department of Defense. In 1970, the Systran system was installed for the United States Air Force, and subsequently, in 1976, for the Commission of the European Communities. The METEO System, developed at the Université de Montréal, was installed in Canada in 1977 to translate weather forecasts from English to French, and was translating close to 80,000 words per day, or 30 million words per year, until it was replaced by a competitor's system on 30 September 2001.
While research in the 1960s concentrated on limited language pairs and input, demand in the 1970s was for low-cost systems that could translate a range of technical and commercial documents. This demand was spurred by increasing globalisation and the demand for translation in Canada, Europe, and Japan.
The 1980s and early 1990s

By the 1980s, both the diversity and the number of installed systems for machine translation had increased. A number of systems relying on mainframe technology were in use, such as Systran, Logos, Ariane-G5, and Metal.
As a result of the improved availability of microcomputers, there was a market for lower-end machine translation systems. Many companies took advantage of this in Europe, Japan, and the USA. Systems were also brought onto the market in China, Eastern Europe, Korea, and the Soviet Union.
During the 1980s there was a great deal of MT activity in Japan especially. With the Fifth Generation Computer project, Japan intended to leap over its competition in computer hardware and software, and one project that many large Japanese electronics firms found themselves involved in was creating software for translating to and from English (Fujitsu, Toshiba, NTT, Brother, Catena, Matsushita, Mitsubishi, Sharp, Sanyo, Hitachi, NEC, Panasonic, Kodensha, Nova, Oki).
Research during the 1980s typically relied on translation through some variety of intermediary linguistic representation involving morphological, syntactic, and semantic analysis.
At the end of the 1980s there was a surge in novel methods for machine translation. One system, developed at IBM, was based on statistical methods. Makoto Nagao and his group used methods based on large numbers of example translations, a technique which is now termed example-based machine translation. A defining feature of both of these approaches was the lack of syntactic and semantic rules and a reliance instead on the manipulation of large text corpora.
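The flavour of these corpus-driven methods can be sketched with a toy word co-occurrence count over an invented three-sentence French-English parallel corpus. This is a crude stand-in for the actual IBM alignment models, not an implementation of them, but it shows the key idea: translation knowledge emerges from counting over parallel text, with no hand-written rules.

```python
# Toy flavour of corpus-based statistical MT: estimate word-level
# translation preferences purely from co-occurrence counts in a tiny,
# invented parallel corpus. Real systems (e.g. the IBM models) refine
# such counts iteratively; here a single counting pass suffices.
from collections import Counter, defaultdict

corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("une maison", "a house"),
]

# Count how often each source word co-occurs with each target word.
counts = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            counts[s][t] += 1

def best_translation(word):
    """Pick the target word that co-occurred most often with `word`."""
    return counts[word].most_common(1)[0][0]

print(best_translation("maison"))  # -> "house"
print(best_translation("la"))      # -> "the"
```

Even on three sentence pairs, "maison" co-occurs with "house" twice but with "the" and "a" only once each, so the counts alone recover the correct pairing.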
During the 1990s, encouraged by successes in speech recognition and speech synthesis, research began into speech translation with the development of the German Verbmobil project.
There was significant growth in the use of machine translation as a result of the advent of low-cost and more powerful computers. It was in the early 1990s that machine translation began to make the transition away from large mainframe computers toward personal computers and workstations. Two companies that led the PC market for a time were Globalink and MicroTac; the two companies merged in December 1994, a move found to be in the corporate interest of both. Intergraph and Systran also began to offer PC versions around this time. Sites also became available on the internet, such as AltaVista's Babel Fish (using Systran technology) and Google Language Tools (also initially using Systran technology exclusively).
Recent research has focused on statistical machine translation and example-based machine translation.
In the area of speech translation, research has focused on moving from domain-limited systems to domain-unlimited translation systems. In various research projects in Europe and in the United States (STR-DUST among them), solutions for automatically translating parliamentary speeches and broadcast news have been developed. In these scenarios the domain of the content is no longer limited to any special area; rather, the speeches to be translated cover a variety of topics.
More recently, the French-German project Quaero has investigated possibilities for making use of machine translation for a multilingual internet. The project seeks to translate not only webpages, but also videos and audio files found on the internet.
Today, only a few companies use statistical machine translation commercially, e.g. Asia Online, SDL International / Language Weaver (sells translation products and services), Google (uses its proprietary statistical MT system for some language combinations in Google's language tools), Microsoft (uses its proprietary statistical MT system to translate knowledge base articles), and Ta with you (offers a domain-adapted machine translation solution based on statistical MT with some linguistic knowledge). There has been a renewed interest in hybridisation, with researchers combining syntactic and morphological (i.e., linguistic) knowledge with statistical systems, as well as combining statistics with existing rule-based systems.
Georgetown-IBM experiment
The Georgetown-IBM experiment was an influential demonstration of machine translation, which was performed during January 7, 1954. Developed jointly by the Georgetown University and IBM, the experiment involved completely automatic translation of more than sixty Russian sentences into...
in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The experiment was a great success and ushered in an era of significant funding for machine translation research in the United States. The authors claimed that within three or five years, machine translation would be a solved problem. In the Soviet Union, similar experiments were performed shortly after.
However, the real progress was much slower, and after the ALPAC report
ALPAC
ALPAC was a committee of seven scientists led by John R. Pierce, established in 1964 by the U. S. Government in order to evaluate the progress in computational linguistics in general and machine translation in particular...
in 1966, which found that the ten years of research had failed to fulfill the expectations, and funding was dramatically reduced. Starting in the late 1980s, as computational power increased and became less expensive, more interest began to be shown in statistical models for machine translation
Statistical machine translation
Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora...
.
Today there is still no system that provides the holy grail of "fully automatic high quality translation of unrestricted text" (FAHQUT). However, there are many programs now available that are capable of providing useful output within strict constraints; several of them are available online, such as Google Translate
Google Translate
Google Translate is a free statistical machine translation service provided by Google Inc. to translate a section of text, document or webpage, into another language.The service was introduced in April 28, 2006 for the Arabic language...
and the SYSTRAN
SYSTRAN
SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission....
system which powers AltaVista's (Yahoo's since May 9, 2008) BabelFish.
The beginning
The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.The first patents for "translating machines" were applied for in the mid 1930s. One proposal, by Georges Artsrouni was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian
Russians
The Russian people are an East Slavic ethnic group native to Russia, speaking the Russian language and primarily living in Russia and neighboring countries....
, was more detailed. It included both the bilingual dictionary, and a method for dealing with grammatical roles between languages, based on Esperanto
Esperanto
is the most widely spoken constructed international auxiliary language. Its name derives from Doktoro Esperanto , the pseudonym under which L. L. Zamenhof published the first book detailing Esperanto, the Unua Libro, in 1887...
. The system was split up into three stages: the first was for a native-speaking editor in the sources language to organise the words into their logical form
Logical form
In logic, the logical form of a sentence or set of sentences is the form obtained by abstracting from the subject matter of its content terms or by regarding the content terms as mere placeholders or blanks on a form...
s and syntactic functions; the second was for the machine to "translate" these forms into the target language; and the third was for a native-speaking editor in the target language to normalise this output. His scheme remained unknown until the late 1950s, by which time computers were well-known.
The early years
The first proposals for machine translation using computers were put forward by Warren WeaverWarren Weaver
Warren Weaver was an American scientist, mathematician, and science administrator...
, a researcher at the Rockefeller Foundation
Rockefeller Foundation
The Rockefeller Foundation is a prominent philanthropic organization and private foundation based at 420 Fifth Avenue, New York City. The preeminent institution established by the six-generation Rockefeller family, it was founded by John D. Rockefeller , along with his son John D. Rockefeller, Jr...
, in his July, 1949 memorandum. These proposals were based on information theory
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...
, successes of code breaking
Cryptography
Cryptography is the practice and study of techniques for secure communication in the presence of third parties...
during the second world war
World War II
World War II, or the Second World War , was a global conflict lasting from 1939 to 1945, involving most of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis...
and speculation about universal underlying principles of natural language.
A few years after these proposals, research began in earnest at many universities in the United States
United States
The United States of America is a federal constitutional republic comprising fifty states and a federal district...
. On 7 January 1954, the Georgetown-IBM experiment
Georgetown-IBM experiment
The Georgetown-IBM experiment was an influential demonstration of machine translation, which was performed during January 7, 1954. Developed jointly by the Georgetown University and IBM, the experiment involved completely automatic translation of more than sixty Russian sentences into...
, the first public demonstration of an MT system, was held in New York at the head office of IBM. The demonstration was widely reported in the newspapers and received much public interest. The system itself, however, was no more than what today would be called a "toy" system, having just 250 words and translating just 49 carefully selected Russian sentences into English — mainly in the field of chemistry
Chemistry
Chemistry is the science of matter, especially its chemical reactions, but also its composition, structure and properties. Chemistry is concerned with atoms and their interactions with other atoms, and particularly with the properties of chemical bonds....
. Nevertheless it encouraged the view that machine translation was imminent — and in particular stimulated the financing of the research, not just in the US but worldwide.
Early systems used large bilingual dictionaries and hand-coded rules for fixing the word order in the final output. This was eventually found to be too restrictive, and developments in linguistics at the time, for example generative linguistics
Generative linguistics
Generative linguistics is a school of thought within linguistics that makes use of the concept of a generative grammar. The term "generative grammar" is used in different ways by different people, and the term "generative linguistics" therefore has a range of different, though overlapping,...
and transformational grammar
Transformational grammar
In linguistics, a transformational grammar or transformational-generative grammar is a generative grammar, especially of a natural language, that has been developed in the Chomskyan tradition of phrase structure grammars...
were proposed to improve the quality of translations.
During this time, operational systems were installed. The United States Air Force
United States Air Force
The United States Air Force is the aerial warfare service branch of the United States Armed Forces and one of the American uniformed services. Initially part of the United States Army, the USAF was formed as a separate branch of the military on September 18, 1947 under the National Security Act of...
used a system produced by IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
and Washington University, while the Atomic Energy Commission
United States Atomic Energy Commission
The United States Atomic Energy Commission was an agency of the United States government established after World War II by Congress to foster and control the peace time development of atomic science and technology. President Harry S...
in the United States
United States
The United States of America is a federal constitutional republic comprising fifty states and a federal district...
and Euratom in Italy
Italy
Italy , officially the Italian Republic languages]] under the European Charter for Regional or Minority Languages. In each of these, Italy's official name is as follows:;;;;;;;;), is a unitary parliamentary republic in South-Central Europe. To the north it borders France, Switzerland, Austria and...
used a system developed at Georgetown University
Georgetown University
Georgetown University is a private, Jesuit, research university whose main campus is in the Georgetown neighborhood of Washington, D.C. Founded in 1789, it is the oldest Catholic university in the United States...
. While the quality of the output was poor, it nevertheless met many of the customers' needs, chiefly in terms of speed.
At the end of the 1950s, an argument was put forward by Yehoshua Bar-Hillel
Yehoshua Bar-Hillel
Yehoshua Bar-Hillel was an Israeli philosopher, mathematician, and linguist at the Hebrew University of Jerusalem, best known for his pioneering work in machine translation and formal linguistics.- Biography :...
, a researcher asked by the US government to look into machine translation against the possibility of "Fully Automatic High Quality Translation" by machines. The argument is one of semantic ambiguity or double-meaning. Consider the following sentence:
- Little John was looking for his toy box. Finally he found it. The box was in the pen.
The word pen may have two meanings, the first meaning something you use to write with, the second meaning a container of some kind. To a human, the meaning is obvious, but he claimed that without a "universal encyclopedia" a machine would never be able to deal with this problem. Today, this type of semantic ambiguity can be solved by writing source texts for machine translation in a controlled language
Controlled natural language
Controlled natural languages are subsets of natural languages, obtained byrestricting the grammar and vocabulary in orderto reduce or eliminate ambiguity and complexity.Traditionally, controlled languages fall into two major types:...
that uses a vocabulary
Vocabulary
A person's vocabulary is the set of words within a language that are familiar to that person. A vocabulary usually develops with age, and serves as a useful and fundamental tool for communication and acquiring knowledge...
in which each word has exactly one meaning.
The 1960s, the ALPAC report and the seventies
Research in the 1960s in both the Soviet UnionSoviet Union
The Soviet Union , officially the Union of Soviet Socialist Republics , was a constitutionally socialist state that existed in Eurasia between 1922 and 1991....
and the United States concentrated mainly on the Russian
Russian language
Russian is a Slavic language used primarily in Russia, Belarus, Uzbekistan, Kazakhstan, Tajikistan and Kyrgyzstan. It is an unofficial but widely spoken language in Ukraine, Moldova, Latvia, Turkmenistan and Estonia and, to a lesser extent, the other countries that were once constituent republics...
-English language
English language
English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...
pair. Chiefly the objects of translation were scientific and technical documents, such as articles from scientific journal
Scientific journal
In academic publishing, a scientific journal is a periodical publication intended to further the progress of science, usually by reporting new research. There are thousands of scientific journals in publication, and many more have been published at various points in the past...
s. The rough translations produced were sufficient to get a basic understanding of the articles. If an article discussed a subject deemed to be of security interest, it was sent to a human translator for a complete translation; if not, it was discarded.
A great blow came to machine translation research in 1966 with the publication of the ALPAC report. The report was commissioned by the US government and performed by ALPAC
ALPAC
ALPAC was a committee of seven scientists led by John R. Pierce, established in 1964 by the U. S. Government in order to evaluate the progress in computational linguistics in general and machine translation in particular...
, the Automatic Language Processing Advisory Committee, a group of seven scientists convened by the US government in 1964. The US government was concerned that there was a lack of progress being made despite significant expenditure. It concluded that machine translation was more expensive, less accurate and slower than human translation, and that despite the expenses, machine translation was not likely to reach the quality of a human translator in the near future.
The report, however, recommended that tools be developed to aid translators — automatic dictionaries, for example — and that some research in computational linguistics should continue to be supported.
The publication of the report had a profound impact on research into machine translation in the United States, and to a lesser extent the Soviet Union
Soviet Union
The Soviet Union , officially the Union of Soviet Socialist Republics , was a constitutionally socialist state that existed in Eurasia between 1922 and 1991....
and United Kingdom
United Kingdom
The United Kingdom of Great Britain and Northern IrelandIn the United Kingdom and Dependencies, other languages have been officially recognised as legitimate autochthonous languages under the European Charter for Regional or Minority Languages...
. Research, at least in the US, was almost completely abandoned for over a decade. In Canada
Canada
Canada is a North American country consisting of ten provinces and three territories. Located in the northern part of the continent, it extends from the Atlantic Ocean in the east to the Pacific Ocean in the west, and northward into the Arctic Ocean...
, France
France
The French Republic , The French Republic , The French Republic , (commonly known as France , is a unitary semi-presidential republic in Western Europe with several overseas territories and islands located on other continents and in the Indian, Pacific, and Atlantic oceans. Metropolitan France...
and Germany
Germany
Germany , officially the Federal Republic of Germany , is a federal parliamentary republic in Europe. The country consists of 16 states while the capital and largest city is Berlin. Germany covers an area of 357,021 km2 and has a largely temperate seasonal climate...
, however, research continued. In the US the main exceptions were the founders of Systran (Peter Toma
Peter Toma
Dr. Peter Toma is a Hungarian-born computer scientist and linguistics researcher.Toma developed and commercialised the SYSTRAN machine translation system...
) and Logos
OpenLogos
OpenLogos is the Open Source version of the Logos Machine Translation System, one of the earliest and longest running commercial machine translation products in the world. It was developed by Logos Corporation in the United States, with additional development teams in Germany and Italy...
(Bernard Scott), who established their companies in 1968 and 1970 respectively and served the US Dept of Defense. In 1970, the Systran
SYSTRAN
SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission....
system was installed for the United States Air Force
United States Air Force
The United States Air Force is the aerial warfare service branch of the United States Armed Forces and one of the American uniformed services. Initially part of the United States Army, the USAF was formed as a separate branch of the military on September 18, 1947 under the National Security Act of...
and subsequently in 1976 by the Commission of the European Communities. The METEO System
METEO System
The METEO System is a machine translation system specifically designed for the translation of the weather forecasts issued daily by Environment Canada. The system was used from 1981 to the 30th of September 2001 by Environment Canada to translate forecasts issued in French in the province of Quebec...
, developed at the Université de Montréal
Université de Montréal
The Université de Montréal is a public francophone research university in Montreal, Quebec, Canada. It comprises thirteen faculties, more than sixty departments and two affiliated schools: the École Polytechnique and HEC Montréal...
, was installed in Canada
Canada
Canada is a North American country consisting of ten provinces and three territories. Located in the northern part of the continent, it extends from the Atlantic Ocean in the east to the Pacific Ocean in the west, and northward into the Arctic Ocean...
in 1977 to translate weather forecasts from English to French, and was translating close to 80,000 words per day or 30 million words per year until it was replaced by a competitor's system on the 30th September, 2001.
While research in the 1960s concentrated on limited language pairs and input, demand in the 1970s was for low-cost systems that could translate a range of technical and commercial documents. This demand was spurred by the increase of globalisation and the demand for translation in Canada
Canada
Canada is a North American country consisting of ten provinces and three territories. Located in the northern part of the continent, it extends from the Atlantic Ocean in the east to the Pacific Ocean in the west, and northward into the Arctic Ocean...
, Europe
Europe
Europe is, by convention, one of the world's seven continents. Comprising the westernmost peninsula of Eurasia, Europe is generally 'divided' from Asia to its east by the watershed divides of the Ural and Caucasus Mountains, the Ural River, the Caspian and Black Seas, and the waterways connecting...
, and Japan
Japan
Japan is an island nation in East Asia. Located in the Pacific Ocean, it lies to the east of the Sea of Japan, China, North Korea, South Korea and Russia, stretching from the Sea of Okhotsk in the north to the East China Sea and Taiwan in the south...
.
The 1980s and early 1990s
By the 1980s, both the diversity and the number of installed systems for machine translation had increased. A number of systems relying on mainframeMainframe computer
Mainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...
technology were in use, such as Systran
SYSTRAN
SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission....
, Logos
OpenLogos
OpenLogos is the Open Source version of the Logos Machine Translation System, one of the earliest and longest running commercial machine translation products in the world. It was developed by Logos Corporation in the United States, with additional development teams in Germany and Italy...
, Ariane-G5, and Metal
METAL MT
A machine translation system developed at the University of Texas and at Siemens which ran on Lisp Machines.- Background :Originally titled the Linguistics Research System , it was later renamed METAL...
.
As a result of the improved availability of microcomputers, there was a market for lower-end machine translation systems. Many companies took advantage of this in Europe, Japan, and the USA. Systems were also brought onto the market in China, Eastern Europe, Korea, and the Soviet Union.
During the 1980s there was a great deal of MT activity, especially in Japan. With the Fifth Generation Computer project, Japan intended to leap over its competition in computer hardware and software, and one project that many large Japanese electronics firms found themselves involved in was creating software for translating to and from English (Fujitsu, Toshiba, NTT, Brother, Catena, Matsushita, Mitsubishi, Sharp, Sanyo, Hitachi, NEC, Panasonic, Kodensha, Nova, Oki).
Research during the 1980s typically relied on translation through some variety of intermediary linguistic representation involving morphological, syntactic, and semantic analysis.
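The analysis-transfer-generation idea behind these systems can be sketched in a few lines. The lexicons, rules, and representation below are invented for illustration and do not come from any historical system; real implementations used far richer morphological and syntactic machinery.

```python
# Toy illustration of 1980s-style translation via an intermediary
# representation: analyse the source sentence into a language-neutral
# form, then generate the target sentence from it. All lexicons and
# rules here are invented for the example.

# Morphological/lexical analysis: map English surface forms to concepts.
EN_LEXICON = {"the": None, "cat": "CAT", "sees": ("SEE", "PRES"), "dog": "DOG"}

# Generation lexicons: map concepts to German surface forms.
DE_NOUNS = {"CAT": "Katze", "DOG": "Hund"}
DE_VERBS = {("SEE", "PRES"): "sieht"}

def analyse(sentence):
    """Return an intermediary triple (agent, predicate, patient)."""
    words = [w for w in sentence.lower().split() if EN_LEXICON.get(w) is not None]
    agent, pred, patient = words  # assumes a simple SVO sentence
    return (EN_LEXICON[agent], EN_LEXICON[pred], EN_LEXICON[patient])

def generate_de(ir):
    """Generate a German main clause from the intermediary form."""
    agent, pred, patient = ir
    # Toy handling of gender and case via fixed articles.
    return f"Die {DE_NOUNS[agent]} {DE_VERBS[pred]} den {DE_NOUNS[patient]}."

print(generate_de(analyse("the cat sees the dog")))
```

Because translation passes through the neutral middle representation, adding a new target language in such a design only requires a new generation component, not a new analyser per language pair.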
At the end of the 1980s there was a surge of novel methods for machine translation. One system, developed at IBM, was based on statistical methods. Makoto Nagao and his group used methods based on large numbers of example translations, a technique now termed example-based machine translation. A defining feature of both of these approaches was the lack of syntactic and semantic rules and a reliance instead on the manipulation of large text corpora.
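The corpus-driven idea can be illustrated with a drastically simplified sketch: translation knowledge is estimated from a parallel corpus by counting, with no hand-written rules. Real statistical systems (such as the IBM models) estimate word-alignment probabilities with expectation-maximisation; this toy version uses only raw co-occurrence counts, and the corpus is invented.

```python
# Minimal sketch of corpus-based translation: learn word correspondences
# purely from a (tiny, invented) French-English parallel corpus.
from collections import Counter, defaultdict

parallel_corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("une maison", "a house"),
]

# Count how often each source word co-occurs with each target word.
cooc = defaultdict(Counter)
for src, tgt in parallel_corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def best_translation(word):
    """Return the most frequently co-occurring target word."""
    return cooc[word].most_common(1)[0][0]

print(best_translation("maison"))
```

Even this crude count correctly pairs "maison" with "house", because "house" appears in both sentence pairs containing "maison" while "the" and "a" appear in only one each; scaling the same principle to millions of sentence pairs is what made the statistical approach viable.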
During the 1990s, encouraged by successes in speech recognition and speech synthesis, research began into speech translation with the development of the German Verbmobil project.
There was significant growth in the use of machine translation as a result of the advent of low-cost, more powerful computers. It was in the early 1990s that machine translation began to make the transition away from large mainframe computers toward personal computers and workstations. Two companies that led the PC market for a time were Globalink and MicroTac, which merged in December 1994 in the corporate interest of both. Intergraph and Systran also began to offer PC versions around this time. Sites also became available on the internet, such as AltaVista's Babel Fish (using SYSTRAN technology) and Google Language Tools (also initially using SYSTRAN technology exclusively).
Recent research
The field of machine translation has seen major changes in the last few years. Currently a large amount of research is being done into statistical machine translation and example-based machine translation.
In the area of speech translation, research has focused on moving from domain-limited systems to domain-unlimited translation systems. In various research projects in Europe and in the United States (such as STR-DUST), solutions for automatically translating parliamentary speeches and broadcast news have been developed. In these scenarios the domain of the content is no longer limited to any special area; instead, the speeches to be translated cover a variety of topics.
More recently, the French-German project Quaero has investigated the possibility of using machine translation for a multilingual internet. The project seeks to translate not only webpages, but also videos and audio files found on the internet.
Today, only a few companies use statistical machine translation commercially, e.g. Asia Online, SDL International / Language Weaver (sells translation products and services), Google (uses its proprietary statistical MT system for some language combinations in its language tools), Microsoft (uses its proprietary statistical MT system to translate knowledge-base articles), and Ta with you (offers a domain-adapted machine translation solution based on statistical MT with some linguistic knowledge). There has been renewed interest in hybridisation, with researchers combining syntactic and morphological (i.e., linguistic) knowledge with statistical systems, as well as combining statistics with existing rule-based systems.
See also
- ALPAC report
- Computer-aided translation
- Lighthill report
- Machine translation
Further reading
- Hutchins, J. (1986) Machine Translation: past, present, future (Chichester: Ellis Horwood) ISBN 0-85312-788-3