DBpedia
DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. DBpedia has been described by Tim Berners-Lee as one of the more famous parts of the Linked Data project.

Background

The project was started by people at the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software, and the first publicly available dataset was published in 2007. It is made available under free licences, allowing others to reuse the dataset.

Wikipedia articles consist mostly of free text, but also include structured information embedded in the articles, such as "infobox" tables, categorisation information, images, geo-coordinates and links to external Web pages. This structured information is extracted and put in a uniform dataset which can be queried.
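
As a rough illustration of what extracting structured information from an article can look like, the following minimal Python sketch pulls key/value pairs out of an infobox template. It is not DBpedia's actual extraction code: the function name parse_infobox, the sample wikitext and the simplifying assumptions (one "key = value" pair per line, no nested templates) are all hypothetical.

```python
import re

def parse_infobox(wikitext: str) -> dict:
    """Return the key/value pairs of the first {{Infobox ...}} template found.

    Simplification (assumption): one '| key = value' pair per line and
    no nested templates inside the infobox.
    """
    match = re.search(r"\{\{Infobox[^\n]*\n(.*?)\n\}\}", wikitext, re.DOTALL)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        line = line.strip()
        # Skip anything that is not a '| key = value' line.
        if not line.startswith("|") or "=" not in line:
            continue
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields

# Made-up sample wikitext, not copied from a real article.
SAMPLE = """{{Infobox person
| name = Mia Ikumi
| occupation = Manga artist
}}
Free text of the article follows here."""

fields = parse_infobox(SAMPLE)
print(fields)
```

Real infoboxes contain nested templates, links and multi-line values, so a production extractor needs a proper wikitext parser rather than a regular expression.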

Dataset

The DBpedia dataset describes more than 3.64 million things, out of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases. The DBpedia dataset features labels and abstracts for these 3.64 million things in up to 97 different languages; 2,724,000 links to images and 6,300,000 links to external web pages; 6,200,000 external links into other RDF datasets; 740,000 Wikipedia categories; and 18,100,000 YAGO2 categories. From this dataset, information spread across multiple pages can be extracted; for example, book authorship can be put together from pages about the work or the author.

The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information. The DBpedia dataset consists of over 1 billion pieces of information (RDF triples), out of which 385 million were extracted from the English edition of Wikipedia and 665 million were extracted from other language editions.
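
To make the notion of an RDF triple concrete, here is a small Python sketch that serialises one (subject, predicate, object) statement in the N-Triples line format. The helper to_ntriple is hypothetical, and the resource name Mia_Ikumi is an assumption chosen for illustration; only the two dbpedia.org namespaces are taken from the query shown in the Example section.

```python
# Namespaces as used by the SPARQL query in the Example section.
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

def to_ntriple(subject: str, predicate: str, obj: str, obj_is_iri: bool = True) -> str:
    """Serialise one statement as a single N-Triples line."""
    if obj_is_iri:
        obj_term = f"<{obj}>"
    else:
        # Plain literal: escape backslashes and double quotes.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_term} ."

# "Tokyo Mew Mew has illustrator Mia Ikumi" as one triple
# (the object resource name is an illustrative assumption).
triple = to_ntriple(
    RESOURCE + "Tokyo_Mew_Mew",
    PROPERTY + "illustrator",
    RESOURCE + "Mia_Ikumi",
)
print(triple)
```

Each such line is one "piece of information" in the sense used above; the dataset is a very large collection of them.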

One of the challenges in extracting information from Wikipedia is that the same concept can be expressed using different properties in templates, such as birthplace and placeofbirth. Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results. To address this, the DBpedia Mapping Language has been developed to help map these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.
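
The idea of collapsing synonymous properties can be sketched in a few lines of Python. This is not the DBpedia Mapping Language itself, only an illustration of the underlying idea: the synonym table, the canonical name birthPlace, the function normalise and the sample values are all assumptions made for the example.

```python
# Hypothetical synonym table: raw infobox keys -> one canonical property name.
CANONICAL = {
    "birthplace": "birthPlace",
    "placeofbirth": "birthPlace",
}

def normalise(raw_fields: dict) -> dict:
    """Rename known synonyms to their canonical property; keep unknown keys unchanged."""
    result = {}
    for key, value in raw_fields.items():
        # Compare case-insensitively and ignore underscores in raw keys.
        canonical = CANONICAL.get(key.lower().replace("_", ""), key)
        # If two synonyms occur together, the first value encountered wins.
        result.setdefault(canonical, value)
    return result

# Two differently written infoboxes end up with the same property name.
a = normalise({"birthplace": "Leipzig"})
b = normalise({"placeofbirth": "Leipzig"})
print(a == b)
```

Once both spellings map to one property, a query about birthplaces only needs to ask for that single property.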

Example

DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across many different Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL. For example, imagine you were interested in the Japanese shōjo manga series Tokyo Mew Mew, and wanted to find the genres of other works written by its illustrator. DBpedia combines information from Wikipedia's entries on Tokyo Mew Mew, Mia Ikumi and on works such as Super Doll Licca-chan and Koi Cupid. Since DBpedia normalises information into a single database, the following query, which can be submitted to the public SPARQL endpoint at http://dbpedia.org/sparql, can be asked without needing to know exactly which entry carries each fragment of information, and will list related genres:


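# Namespace prefixes for DBpedia properties and resources
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?who ?work ?genre WHERE {
  # Find the illustrator of Tokyo Mew Mew
  db:Tokyo_Mew_Mew dbprop:illustrator ?who .
  # Find works that credit the same person as author
  ?work dbprop:author ?who .
  # Genre is optional, so works without one are still listed
  OPTIONAL { ?work dbprop:genre ?genre } .
}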
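
A query like this is typically sent to a SPARQL endpoint as an HTTP request. The Python sketch below only builds the request URL and does not contact the server. The parameter names default-graph-uri, query and format, the default result format, and the helper build_request_url are assumptions about how the public endpoint is addressed, so check the endpoint's own documentation before relying on them.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?who ?work ?genre WHERE {
  db:Tokyo_Mew_Mew dbprop:illustrator ?who .
  ?work dbprop:author ?who .
  OPTIONAL { ?work dbprop:genre ?genre } .
}"""

def build_request_url(query: str, result_format: str = "text/html") -> str:
    """Build a GET URL for the endpoint (parameter names are assumptions)."""
    params = {
        "default-graph-uri": "http://dbpedia.org",
        "query": query,
        "format": result_format,
    }
    # urlencode percent-encodes the query text so it is safe inside a URL.
    return ENDPOINT + "?" + urlencode(params)

url = build_request_url(QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; requesting a machine-readable result format instead of HTML would be the usual choice for applications.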

Uses

DBpedia has a broad scope of entities covering different areas of human knowledge. This makes it a natural hub for connecting datasets, where external datasets could link to its concepts. The DBpedia dataset is interlinked on the RDF level with various other Open Data datasets on the Web. This enables applications to enrich DBpedia data with data from these datasets. There are more than 6.5 million interlinks between DBpedia and external datasets, including Freebase, OpenCyc, UMBEL, GeoNames, MusicBrainz, CIA World Factbook, DBLP, Project Gutenberg, DBtune Jamendo, Eurostat, UniProt, Bio2RDF, and US Census data. The Thomson Reuters initiative OpenCalais, the Linked Open Data project of the New York Times, the Zemanta API and DBpedia Spotlight also include links to DBpedia. The BBC uses DBpedia to help organize its content.
Faviki uses DBpedia for semantic tagging.

Amazon provides a DBpedia Public Data Set that can be integrated into Amazon Web Services applications.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 