Natural language processing toolkits
The following natural language processing toolkits are popular collections of natural language processing software. They are suites of libraries, frameworks, and applications for symbolic and statistical natural language and speech processing. NLP tools usually perform sentence detection, tokenization, POS tagging, text chunking, lemmatisation, coreference analysis and resolution, and named-entity detection, among others.
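As a rough illustration of the first two tasks named above, sentence detection and tokenization, the following minimal Python sketch uses only the standard library. It is not how any toolkit in the table implements these steps; the function names `split_sentences` and `tokenize` are hypothetical, and the regular expressions are deliberately naive (for example, abbreviations such as "Dr." would be split incorrectly).

```python
import re


def split_sentences(text):
    """Naive sentence detection: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical sample input, chosen only for illustration.
text = "NLP toolkits split text into sentences. Then they split sentences into tokens!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence)
    print(sentence_tokens)
```

Real toolkits typically replace these regular expressions with trained models or language-specific rules, and feed the resulting tokens into later stages such as POS tagging and chunking.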
Name | Language | License | Creators | Website |
---|---|---|---|---|
AlchemyAPI | C, C++, C#, Java, Python, Perl, Ruby | Free or Commercial | Orchestr8 | http://www.alchemyapi.com/ |
Antelope framework | C#, VB.net | Free for research | Proxem | http://www.proxem.com/ |
Apertium | C++, Java | GPL | (various) | http://wiki.apertium.org/ |
Cogito | | Commercial | Expert System S.p.A. | http://www.expertsystem.net/page.asp?id=1521 |
Carabao Language Kit | Any COM+ compliant language. Customization is via data entry | Commercial with free development tools | Digital Sonata Pty Ltd | http://www.digitalsonata.com/default.aspx |
DELPH-IN | Lisp, C++ | LGPL, MIT, ... | Deep Linguistic Processing with HPSG Initiative | http://www.delph-in.net/ |
Distinguo | C++ | Commercial | Ultralingua Inc. | http://ultralingua.com/en/semantic-search.htm |
Ellogon | C / C++ | LGPL | Georgios Petasis | http://www.ellogon.org/ |
FreeLing | C++ | GPL | Universitat Politècnica de Catalunya | http://nlp.lsi.upc.edu/freeling/ |
General Architecture for Text Engineering | Java | LGPL | GATE open source community | http://gate.ac.uk/ |
Graph Expression | Java | Apache License | Startup huti.ru | http://code.google.com/p/graph-expression/ |
Learning Based Java | Java | BSD | Cognitive Computation Group at the University of Illinois | http://cogcomp.cs.illinois.edu/page/software_view/11 |
LingPipe | Java | royalty free or commercial | Alias-i | http://alias-i.com/lingpipe/index.html |
LinguaStream | Java | Free for research | University of Caen, France | http://www.linguastream.org/ |
Mallet | Java | Common Public License | University of Massachusetts Amherst | http://mallet.cs.umass.edu/ |
MII nlp toolkit | Java | LGPL | UCLA Medical Imaging Informatics (MII) Group | http://www.mii.ucla.edu/nlp/ |
Modular Audio Recognition Framework | Java | BSD | The MARF Research and Development Group, Concordia University | http://marf.sf.net |
MontyLingua | Python, Java | Free for research | MIT | http://web.media.mit.edu/~hugo/montylingua/ |
Natural Language Toolkit (NLTK) | Python | Apache 2.0 | | http://www.nltk.org/Home |
NooJ (based on INTEX) | .NET Framework-based | Free for research | University of Franche-Comté, France | http://www.nooj4nlp.net/ |
OpenNLP | Java | Apache License 2.0 | Online community | http://incubator.apache.org/opennlp/index.html |
Rosette | C, C++, Java, .NET | Commercial | Basis Technology | http://rosette.basistech.com |
ScalaNLP | Scala | Apache License | David Hall and Daniel Ramage | http://www.scalanlp.org/ |
Stanford NLP | Java | GPL | The Stanford Natural Language Processing Group | http://nlp.stanford.edu/software/index.shtml |
Rasp | C++ | LGPL | University of Cambridge, University of Sussex | http://www.informatics.susx.ac.uk/research/groups/nlp/rasp/index.html |
Natural | JavaScript, Node.js | GPL | Chris Umbel | https://github.com/NaturalNode/natural |
Text Engineering Software Laboratory (Tesla) | Java | Eclipse Public License | University of Cologne | http://tesla.spinfo.uni-koeln.de/index.html |
Thinktelligence Delegator | Java | Commercial | Thinktelligence Corporation | http://www.thinktelligence.com |
UIMA | Java / C++ | Apache 2.0 | Apache | http://incubator.apache.org/uima/index.html |
WebLab-project | Java | LGPL | OW2 | http://weblab-project.org/ |
UniteX | Java & C++ | LGPL | Laboratoire d'Automatique Documentaire et Linguistique | http://www-igm.univ-mlv.fr/~unitex/ |
The Dragon Toolkit | Java | GPL | Drexel University | http://dragon.ischool.drexel.edu/ |
Factorie | Java | Apache License | University of Massachusetts Amherst | http://code.google.com/p/factorie/ |
Silpa Indic Language Processing Toolkit | Python | AGPL | Silpa opensource community developers | http://smc.org.in/silpa |
External links
- LingPipe's Competition (short description for every tool)
- Open directory of free NLP software at OpenNLP
- Text Analytics Wiki: Software and Tools (directory apparently up to date)
- GATE plugins chapter (links to and cites working open source tools)