Knowledge extraction
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodologically similar to information extraction (NLP) and ETL (data warehousing), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.

The W3C RDB2RDF working group is currently standardizing a language (R2RML) for the extraction of RDF from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).

Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding the transformation of relational databases into RDF, entity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from information extraction and ETL, which transform the data from the sources into structured formats.

The following criteria can be used to categorize approaches in this topic (some of them only account for extraction from relational databases):
  • Source: Which data sources are covered (text, relational databases, XML, CSV)?
  • Exposition: How is the extracted knowledge made explicit (ontology file, semantic database)? How can it be queried?
  • Synchronization: Is the knowledge extraction process executed once to produce a dump (static), or is the result synchronized with the source (dynamic)? Are changes to the result written back (bi-directional)?
  • Reuse of vocabularies: Whether the tool is able to reuse existing vocabularies in the extraction. For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies.
  • Automation: The degree to which the extraction is assisted or automated (manual, GUI, semi-automatic, automatic).
  • Requires a domain ontology: Whether a pre-existing ontology is needed to map to. Either a mapping is created or a schema is learned from the source (ontology learning).

Entity Linking

  1. DBpedia Spotlight, OpenCalais, the Zemanta API, and Extractiv analyze free text via named entity recognition, then disambiguate candidates via name resolution and link the found entities to the DBpedia knowledge repository (DBpedia Spotlight web demo).


For example, consider the sentence: "President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance."

As President Obama is linked to a DBpedia LinkedData resource, further information can be retrieved automatically and a semantic reasoner can, for example, infer that the mentioned entity is of type Person (using FOAF) and of type Presidents of the United States (using YAGO). Counterexamples: methods that only recognize entities, or that link to Wikipedia articles and other targets that do not provide further retrieval of structured data and formal knowledge.
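
The follow-up retrieval step can be illustrated with a SPARQL query against the public DBpedia endpoint. The following Python sketch is illustrative only (it assumes network access to https://dbpedia.org/sparql and the third-party requests library), not part of any of the tools above:

# Fetch the formal types of a linked DBpedia resource via SPARQL.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
SELECT DISTINCT ?type WHERE {
  <http://dbpedia.org/resource/Barack_Obama> a ?type .
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["type"]["value"])  # e.g. http://xmlns.com/foaf/0.1/Person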

Relational Databases to RDF

  1. Triplify, D2R Server and Virtuoso RDF Views are tools that transform relational databases to RDF. During the conversion process they allow the reuse of existing vocabularies and ontologies. When transforming a typical relational table named users, one column (e.g. name) or an aggregation of columns (e.g. first_name and last_name) has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity. Then properties with formally defined semantics are used (and reused) to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation and a column homepage can be converted to a property from the FOAF vocabulary called foaf:homepage, thus qualifying it as an inverse functional property. Then each entry of the user table can be made an instance of the class foaf:Person (ontology population). Additionally, domain knowledge (in the form of an ontology) could be created from the status_id, either by manually created rules (if status_id is 2, the entry belongs to class Teacher) or by (semi-)automated methods (ontology learning). Here is an example transformation:

Name | marriedTo | homepage | status_id
Peter | Marry | http://example.org/Peters_page | 1
Claus | Eva | http://example.org/Claus_page | 2


:Peter :marriedTo :Marry .
:marriedTo a owl:SymmetricProperty .
:Peter foaf:homepage <http://example.org/Peters_page> .
:Peter a foaf:Person .
:Peter a :Student .
:Claus a :Teacher .
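
The rules described above can be sketched in a few lines of Python. This is an illustrative toy (the helper function and the status_id rule table are hypothetical), not the actual Triplify or D2R implementation:

# Convert rows of the users table to Turtle-style triples, reusing the
# FOAF vocabulary and a manual rule that maps status_id to a class.
STATUS_CLASSES = {1: ":Student", 2: ":Teacher"}

def row_to_triples(row):
    subject = f":{row['name']}"          # the name column supplies the URI
    yield f"{subject} :marriedTo :{row['marriedTo']} ."
    yield f"{subject} foaf:homepage <{row['homepage']}> ."
    yield f"{subject} a foaf:Person ."
    yield f"{subject} a {STATUS_CLASSES[row['status_id']]} ."

users = [
    {"name": "Peter", "marriedTo": "Marry",
     "homepage": "http://example.org/Peters_page", "status_id": 1},
    {"name": "Claus", "marriedTo": "Eva",
     "homepage": "http://example.org/Claus_page", "status_id": 2},
]

print(":marriedTo a owl:SymmetricProperty .")
for user in users:
    for triple in row_to_triples(user):
        print(triple)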

1:1 Mapping from RDB Tables/Views to RDF Entities/Attributes/Values

When building a RDB representation of a problem domain, the starting point is frequently an entity-relationship diagram (ERD). Typically, each entity is represented as a database table, each attribute of the entity becomes a column in that table, and relationships between entities are indicated by foreign keys. Each table typically defines a particular class of entity, each column one of its attributes. Each row in the table describes an entity
instance, uniquely identified by a primary key. The table rows collectively describe an entity set. In an equivalent RDF representation of the same entity set:
  • Each column in the table is an attribute (i.e., predicate)
  • Each column value is an attribute value (i.e., object)
  • Each row key represents an entity ID (i.e., subject)
  • Each row represents an entity instance
  • Each row (entity instance) is represented in RDF by a collection of triples with a common subject (entity ID).


So, to render an equivalent view based on RDF semantics, the basic mapping algorithm would be as follows (a minimal code sketch follows the list):
  1. create an RDFS class for each table
  2. convert all primary keys and foreign keys into IRIs
  3. assign a predicate IRI to each column
  4. assign an rdf:type predicate for each row, linking it to an RDFS class IRI corresponding to the table
  5. for each column that is neither part of a primary or foreign key, construct a triple containing the primary key IRI as the subject, the column IRI as the predicate and the column's value as the object.
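
The following Python sketch implements steps 1-5 under simplifying assumptions: single-column primary keys, no foreign keys, and an illustrative base IRI. It is a toy direct mapping, not the W3C specification:

# Direct mapping of one table to N-Triples.
BASE = "http://example.org/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def direct_mapping(table, primary_key, rows):
    class_iri = f"<{BASE}{table}>"                       # step 1: a class per table
    for row in rows:
        subject = f"<{BASE}{table}/{row[primary_key]}>"  # step 2: key becomes an IRI
        yield f"{subject} {RDF_TYPE} {class_iri} ."      # step 4: type the row
        for column, value in row.items():
            if column == primary_key:
                continue
            predicate = f"<{BASE}{table}#{column}>"      # step 3: a predicate per column
            yield f'{subject} {predicate} "{value}" .'   # step 5: one triple per cell

for triple in direct_mapping("users", "id",
                             [{"id": 1, "name": "Peter"}, {"id": 2, "name": "Claus"}]):
    print(triple)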


An early mention of this basic or direct mapping can be found in Tim Berners-Lee's comparison of the ER model to the RDF model.

Complex mappings of relational databases to RDF

The 1:1 mapping mentioned above exposes the legacy data as RDF in a straightforward way; additional refinements can be employed to improve the usefulness of the RDF output with respect to the given use cases. Normally, information is lost during the transformation of an entity-relationship diagram (ERD) to relational tables (details can be found in the object-relational impedance mismatch) and has to be reverse engineered. From a conceptual view, approaches for extraction can come from two directions. The first direction tries to extract or learn an OWL schema from the given database schema. Early approaches used a fixed set of manually created mapping rules to refine the 1:1 mapping. More elaborate methods employ heuristics or learning algorithms to induce schematic information (these methods overlap with ontology learning). While some approaches try to extract the information from the structure inherent in the SQL schema (analysing, e.g., foreign keys), others analyse the content and the values in the tables to create conceptual hierarchies (e.g., columns with few distinct values are candidates for becoming categories). The second direction tries to map the schema and its contents to a pre-existing domain ontology (see also: ontology alignment). Often, however, a suitable domain ontology does not exist and has to be created first.
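
The content-analysis heuristic just mentioned can be sketched as follows. The threshold and data are illustrative assumptions; real tools use more elaborate methods:

# Flag low-cardinality columns as candidates for categories/classes.
def category_candidates(rows, max_distinct=2):
    candidates = {}
    for column in rows[0]:
        distinct = {row[column] for row in rows}
        if len(distinct) <= max_distinct:
            candidates[column] = distinct
    return candidates

users = [
    {"name": "Peter", "status_id": 1},
    {"name": "Claus", "status_id": 2},
    {"name": "Eva", "status_id": 2},
]
print(category_candidates(users))  # {'status_id': {1, 2}}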

XML

As XML is structured as a tree, any data can be easily represented in RDF, which is structured as a graph. XML2RDF is one example of an approach that uses RDF blank nodes and transforms XML elements and attributes to RDF properties. The topic, however, is more complex than in the case of relational databases. In a relational table the primary key is an ideal candidate for becoming the subject of the extracted triples. An XML element, however, can be transformed, depending on the context, into the subject, the predicate or the object of a triple. XSLT can be used as a standard transformation language to manually convert XML to RDF.
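
A toy sketch in the spirit of such approaches (illustrative, not the actual XML2RDF tool): each XML element becomes an RDF blank node, and attributes, child elements and text content become properties of that node. The property IRIs are assumptions:

# Walk an XML tree and emit triples with blank-node subjects.
import xml.etree.ElementTree as ET
from itertools import count

node_ids = count()

def xml_to_triples(element, subject=None):
    if subject is None:
        subject = f"_:b{next(node_ids)}"
    for name, value in element.attrib.items():  # attributes -> properties
        yield f'{subject} <http://example.org/attr/{name}> "{value}" .'
    for child in element:                       # child elements -> blank nodes
        child_node = f"_:b{next(node_ids)}"
        yield f"{subject} <http://example.org/elem/{child.tag}> {child_node} ."
        yield from xml_to_triples(child, child_node)
    if element.text and element.text.strip():   # text content -> literal
        yield f'{subject} <http://example.org/value> "{element.text.strip()}" .'

doc = ET.fromstring('<user status="2"><name>Claus</name></user>')
for triple in xml_to_triples(doc):
    print(triple)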

Survey of Methods / Tools

Name | Data Source | Data Exposition | Data Synchronisation | Mapping Language | Vocabulary Reuse | Mapping Automation | Req. Domain Ontology | Uses GUI
A Direct Mapping of Relational Data to RDF | Relational Data | | | | | | |
CSV2RDF4LOD | CSV | ETL | static | none | true | manual | false | false
Convert2RDF | Delimited text file | ETL | static | RDF/DAML | true | manual | false | true
D2R Server | RDB | SPARQL | bi-directional | D2R Map | true | manual | false | false
DartGrid | RDB | own query language | dynamic | Visual Tool | true | manual | false | true
DataMaster | RDB | ETL | static | proprietary | true | manual | true | true
Google Refine's RDF Extension | CSV, XML | ETL | static | none | | semi-automatic | false | true
Krextor | XML | ETL | static | XSLT | true | manual | true | false
MAPONTO | RDB | ETL | static | proprietary | true | manual | true | false
METAmorphoses | RDB | ETL | static | proprietary XML-based mapping language | true | manual | false | true
MappingMaster | CSV | ETL | static | MappingMaster | true | GUI | false | true
ODEMapster | RDB | ETL | static | proprietary | true | manual | true | true
OntoWiki CSV Importer Plug-in - DataCube & Tabular | CSV | ETL | static | The RDF Data Cube Vocabulary | true | semi-automatic | false | true
Poolparty Extraktor (PPX) | XML, Text | LinkedData | dynamic | RDF (SKOS) | true | semi-automatic | true | false
RDBToOnto | RDB | ETL | static | none | false | automatic (the user can fine-tune results) | false | true
RDF 123 | CSV | ETL | static | false | false | manual | false | true
RDOTE | RDB | ETL | static | SQL | true | manual | true | true
Relational.OWL | RDB | ETL | static | none | false | automatic | false | false
T2LD | CSV | ETL | static | false | false | automatic | false | false
The RDF Data Cube Vocabulary | Multidimensional statistical data in spreadsheets | | | Data Cube Vocabulary | true | manual | false |
TopBraid Composer | CSV | ETL | static | SKOS | false | semi-automatic | false | true
Triplify | RDB | LinkedData | dynamic | SQL | true | manual | false | false
Virtuoso RDF Views | RDB | SPARQL | dynamic | Meta Schema Language | true | semi-automatic | false | true
Virtuoso Sponger | structured and semi-structured data sources | SPARQL | dynamic | Virtuoso PL & XSLT | true | semi-automatic | false | false
VisAVis | RDB | RDQL | dynamic | SQL | true | manual | true | true
XLWrap: Spreadsheet to RDF | CSV | ETL | static | TriG Syntax | true | manual | false | false
XML to RDF | XML | ETL | static | false | false | manual | false | false

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain, and is closely related to it both in terms of methodology and terminology.

The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may itself become additional data that can be used for further discovery.

Another promising application of knowledge discovery is in the area of software modernization, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity relationship is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery on existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous business value, key for the evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as database schemas.