Knowledge extraction
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodologically similar to information extraction (NLP) and ETL (data warehousing), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.

The W3C RDB2RDF working group is currently standardizing a language (R2RML) for the extraction of RDF from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).

Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding the transformation of relational databases into RDF, entity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from information extraction and ETL, which transform the data from the sources into structured formats.

The following criteria can be used to categorize approaches in this topic (some of them only account for extraction from relational databases):
  • Source: Which data sources are covered (text, relational databases, XML, CSV)?
  • Exposition: How is the extracted knowledge made explicit (ontology file, semantic database)? How can it be queried?
  • Synchronization: Is the knowledge extraction process executed once to produce a dump (static), or is the result synchronized with the source (dynamic)? Are changes to the result written back (bi-directional)?
  • Reuse of vocabularies: Whether the tool is able to reuse existing vocabularies in the extraction. For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies.
  • Automation: The degree to which the extraction is assisted or automated (manual, GUI, semi-automatic, automatic).
  • Requires a domain ontology: Whether a pre-existing ontology is needed to map to. Either a mapping is created or a schema is learned from the source (ontology learning).

Entity Linking

  1. DBpedia Spotlight, OpenCalais, the Zemanta API, and Extractiv analyze free text via named entity recognition, then disambiguate candidates via name resolution and link the found entities to the DBpedia knowledge repository (DBpedia Spotlight web demo).


For example, consider the sentence: "President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance."

As President Obama is linked to a DBpedia LinkedData resource, further information can be retrieved automatically and a semantic reasoner can, for example, infer that the mentioned entity is of type Person (using FOAF) and of type Presidents of the United States (using YAGO). Counterexamples: methods that only recognize entities, or that link to Wikipedia articles and other targets that do not provide further retrieval of structured data and formal knowledge.
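
The follow-up retrieval step can be illustrated with a SPARQL query against the public DBpedia endpoint. The following Python sketch is illustrative only (it assumes network access to https://dbpedia.org/sparql and the third-party requests library), not part of any of the tools above:

# Fetch the formal types of a linked DBpedia resource via SPARQL.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
SELECT DISTINCT ?type WHERE {
  <http://dbpedia.org/resource/Barack_Obama> a ?type .
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["type"]["value"])  # e.g. http://xmlns.com/foaf/0.1/Person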

Relational Databases to RDF

  1. Triplify, D2R Server and Virtuoso RDF Views are tools that transform relational databases to RDF. During the conversion process they allow the reuse of existing vocabularies and ontologies. When transforming a typical relational table named users, one column (e.g. name) or an aggregation of columns (e.g. first_name and last_name) has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity. Then properties with formally defined semantics are used (and reused) to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation and a column homepage can be converted to a property from the FOAF vocabulary called foaf:homepage, thus qualifying it as an inverse functional property. Then each entry of the user table can be made an instance of the class foaf:Person (ontology population). Additionally, domain knowledge (in the form of an ontology) could be created from the status_id, either by manually created rules (if status_id is 2, the entry belongs to class Teacher) or by (semi-)automated methods (ontology learning). Here is an example transformation:

Name | marriedTo | homepage | status_id
Peter | Marry | http://example.org/Peters_page | 1
Claus | Eva | http://example.org/Claus_page | 2


:Peter :marriedTo :Marry .
:marriedTo a owl:SymmetricProperty .
:Peter foaf:homepage <http://example.org/Peters_page> .
:Peter a foaf:Person .
:Peter a :Student .
:Claus a :Teacher .
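
The rules described above can be sketched in a few lines of Python. This is an illustrative toy (the helper function and the status_id rule table are hypothetical), not the actual Triplify or D2R implementation:

# Convert rows of the users table to Turtle-style triples, reusing the
# FOAF vocabulary and a manual rule that maps status_id to a class.
STATUS_CLASSES = {1: ":Student", 2: ":Teacher"}

def row_to_triples(row):
    subject = f":{row['name']}"          # the name column supplies the URI
    yield f"{subject} :marriedTo :{row['marriedTo']} ."
    yield f"{subject} foaf:homepage <{row['homepage']}> ."
    yield f"{subject} a foaf:Person ."
    yield f"{subject} a {STATUS_CLASSES[row['status_id']]} ."

users = [
    {"name": "Peter", "marriedTo": "Marry",
     "homepage": "http://example.org/Peters_page", "status_id": 1},
    {"name": "Claus", "marriedTo": "Eva",
     "homepage": "http://example.org/Claus_page", "status_id": 2},
]

print(":marriedTo a owl:SymmetricProperty .")
for user in users:
    for triple in row_to_triples(user):
        print(triple)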

1:1 Mapping from RDB Tables/Views to RDF Entities/Attributes/Values

When building a RDB representation of a problem domain, the starting point is frequently an entity-relationship diagram (ERD). Typically, each entity is represented as a database table, each attribute of the entity becomes a column in that table, and relationships between entities are indicated by foreign keys. Each table typically defines a particular class of entity, each column one of its attributes. Each row in the table describes an entity
instance, uniquely identified by a primary key. The table rows collectively describe an entity set. In an equivalent RDF representation of the same entity set:
  • Each column in the table is an attribute (i.e., predicate)
  • Each column value is an attribute value (i.e., object)
  • Each row key represents an entity ID (i.e., subject)
  • Each row represents an entity instance
  • Each row (entity instance) is represented in RDF by a collection of triples with a common subject (entity ID).


So, to render an equivalent view based on RDF semantics, the basic mapping algorithm would be as follows (a minimal code sketch follows the list):
  1. create an RDFS class for each table
  2. convert all primary keys and foreign keys into IRIs
  3. assign a predicate IRI to each column
  4. assign an rdf:type predicate for each row, linking it to an RDFS class IRI corresponding to the table
  5. for each column that is neither part of a primary or foreign key, construct a triple containing the primary key IRI as the subject, the column IRI as the predicate and the column's value as the object.
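
The following Python sketch implements steps 1-5 under simplifying assumptions: single-column primary keys, no foreign keys, and an illustrative base IRI. It is a toy direct mapping, not the W3C specification:

# Direct mapping of one table to N-Triples.
BASE = "http://example.org/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def direct_mapping(table, primary_key, rows):
    class_iri = f"<{BASE}{table}>"                       # step 1: a class per table
    for row in rows:
        subject = f"<{BASE}{table}/{row[primary_key]}>"  # step 2: key becomes an IRI
        yield f"{subject} {RDF_TYPE} {class_iri} ."      # step 4: type the row
        for column, value in row.items():
            if column == primary_key:
                continue
            predicate = f"<{BASE}{table}#{column}>"      # step 3: a predicate per column
            yield f'{subject} {predicate} "{value}" .'   # step 5: one triple per cell

for triple in direct_mapping("users", "id",
                             [{"id": 1, "name": "Peter"}, {"id": 2, "name": "Claus"}]):
    print(triple)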


An early mention of this basic or direct mapping can be found in Tim Berners-Lee's comparison of the ER model to the RDF model.

Complex mappings of relational databases to RDF

The 1:1 mapping mentioned above exposes the legacy data as RDF in a straightforward way; additional refinements can be employed to improve the usefulness of the RDF output with respect to the given use cases. Normally, information is lost during the transformation of an entity-relationship diagram (ERD) to relational tables (details can be found in the object-relational impedance mismatch) and has to be reverse engineered. From a conceptual view, approaches for extraction can come from two directions. The first direction tries to extract or learn an OWL schema from the given database schema. Early approaches used a fixed set of manually created mapping rules to refine the 1:1 mapping. More elaborate methods employ heuristics or learning algorithms to induce schematic information (these methods overlap with ontology learning). While some approaches try to extract the information from the structure inherent in the SQL schema (analysing, e.g., foreign keys), others analyse the content and the values in the tables to create conceptual hierarchies (e.g., columns with few distinct values are candidates for becoming categories). The second direction tries to map the schema and its contents to a pre-existing domain ontology (see also: ontology alignment). Often, however, a suitable domain ontology does not exist and has to be created first.
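
The content-analysis heuristic just mentioned can be sketched as follows. The threshold and data are illustrative assumptions; real tools use more elaborate methods:

# Flag low-cardinality columns as candidates for categories/classes.
def category_candidates(rows, max_distinct=2):
    candidates = {}
    for column in rows[0]:
        distinct = {row[column] for row in rows}
        if len(distinct) <= max_distinct:
            candidates[column] = distinct
    return candidates

users = [
    {"name": "Peter", "status_id": 1},
    {"name": "Claus", "status_id": 2},
    {"name": "Eva", "status_id": 2},
]
print(category_candidates(users))  # {'status_id': {1, 2}}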

XML

As XML is structured as a tree, any data can be easily represented in RDF, which is structured as a graph. XML2RDF is one example of an approach that uses RDF blank nodes and transforms XML elements and attributes to RDF properties. The topic, however, is more complex than in the case of relational databases. In a relational table the primary key is an ideal candidate for becoming the subject of the extracted triples. An XML element, however, can be transformed, depending on the context, into the subject, the predicate or the object of a triple. XSLT can be used as a standard transformation language to manually convert XML to RDF.
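
A toy sketch in the spirit of such approaches (illustrative, not the actual XML2RDF tool): each XML element becomes an RDF blank node, and attributes, child elements and text content become properties of that node. The property IRIs are assumptions:

# Walk an XML tree and emit triples with blank-node subjects.
import xml.etree.ElementTree as ET
from itertools import count

node_ids = count()

def xml_to_triples(element, subject=None):
    if subject is None:
        subject = f"_:b{next(node_ids)}"
    for name, value in element.attrib.items():  # attributes -> properties
        yield f'{subject} <http://example.org/attr/{name}> "{value}" .'
    for child in element:                       # child elements -> blank nodes
        child_node = f"_:b{next(node_ids)}"
        yield f"{subject} <http://example.org/elem/{child.tag}> {child_node} ."
        yield from xml_to_triples(child, child_node)
    if element.text and element.text.strip():   # text content -> literal
        yield f'{subject} <http://example.org/value> "{element.text.strip()}" .'

doc = ET.fromstring('<user status="2"><name>Claus</name></user>')
for triple in xml_to_triples(doc):
    print(triple)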

Survey of Methods / Tools

Name | Data Source | Data Exposition | Data Synchronisation | Mapping Language | Vocabulary Reuse | Mapping Automation | Req. Domain Ontology | Uses GUI
A Direct Mapping of Relational Data to RDF | Relational Data | | | | | | |
CSV2RDF4LOD | CSV | ETL | static | none | true | manual | false | false
Convert2RDF | Delimited text file | ETL | static | RDF/DAML | true | manual | false | true
D2R Server | RDB | SPARQL | bi-directional | D2R Map | true | manual | false | false
DartGrid | RDB | own query language | dynamic | Visual Tool | true | manual | false | true
DataMaster | RDB | ETL | static | proprietary | true | manual | true | true
Google Refine's RDF Extension | CSV, XML | ETL | static | none | | semi-automatic | false | true
Krextor | XML | ETL | static | XSLT | true | manual | true | false
MAPONTO | RDB | ETL | static | proprietary | true | manual | true | false
METAmorphoses | RDB | ETL | static | proprietary XML-based mapping language | true | manual | false | true
MappingMaster | CSV | ETL | static | MappingMaster | true | GUI | false | true
ODEMapster | RDB | ETL | static | proprietary | true | manual | true | true
OntoWiki CSV Importer Plug-in - DataCube & Tabular | CSV | ETL | static | The RDF Data Cube Vocabulary | true | semi-automatic | false | true
Poolparty Extraktor (PPX) | XML, Text | LinkedData | dynamic | RDF (SKOS) | true | semi-automatic | true | false
RDBToOnto | RDB | ETL | static | none | false | automatic (the user can fine-tune results) | false | true
RDF 123 | CSV | ETL | static | false | false | manual | false | true
RDOTE | RDB | ETL | static | SQL | true | manual | true | true
Relational.OWL | RDB | ETL | static | none | false | automatic | false | false
T2LD | CSV | ETL | static | false | false | automatic | false | false
The RDF Data Cube Vocabulary | Multidimensional statistical data in spreadsheets | | | Data Cube Vocabulary | true | manual | false |
TopBraid Composer | CSV | ETL | static | SKOS | false | semi-automatic | false | true
Triplify | RDB | LinkedData | dynamic | SQL | true | manual | false | false
Virtuoso RDF Views | RDB | SPARQL | dynamic | Meta Schema Language | true | semi-automatic | false | true
Virtuoso Sponger | structured and semi-structured data sources | SPARQL | dynamic | Virtuoso PL & XSLT | true | semi-automatic | false | false
VisAVis | RDB | RDQL | dynamic | SQL | true | manual | true | true
XLWrap: Spreadsheet to RDF | CSV | ETL | static | TriG Syntax | true | manual | false | false
XML to RDF | XML | ETL | static | false | false | manual | false | false

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain, and is closely related to it both in terms of methodology and terminology.

The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may itself become additional data that can be used for further discovery.

Another promising application of knowledge discovery is in the area of software modernization, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity relationship is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery on existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous business value, key for the evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as database schemas.