Schema matching
The terms schema matching and mapping are often used interchangeably. For this article, we differentiate the two as follows: schema matching is the process of identifying that two objects are semantically related (the scope of this article), while mapping refers to the transformations between the objects. For example, given the two schemas DB1.Student (Name, SSN, Level, Major, Marks) and DB2.Grad-Student (Name, ID, Major, Grades), possible matches would be DB1.Student ≈ DB2.Grad-Student, DB1.SSN = DB2.ID, etc., and a possible transformation or mapping would be DB1.Marks to DB2.Grades (100-90 A; 90-80 B; ...).
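
The distinction between a match and a mapping can be sketched in a few lines of Python. This is only an illustration built on the DB1/DB2 example: the function name marks_to_grade is hypothetical, placing the shared boundary value 90 in band A is an assumption, and bands below 80 are left undefined because the example does not specify them.

```python
# Matches record THAT two schema elements correspond (from the example above).
matches = {
    "DB1.Student": "DB2.Grad-Student",
    "DB1.SSN": "DB2.ID",
}


def marks_to_grade(marks):
    """Hypothetical mapping DB1.Marks -> DB2.Grades.

    Only the two bands given in the example are encoded. Treating 90 as an
    "A" is an assumption; lower bands are unspecified, so None is returned.
    """
    if 90 <= marks <= 100:
        return "A"
    if 80 <= marks < 90:
        return "B"
    return None


print(matches["DB1.SSN"])   # DB2.ID
print(marks_to_grade(95))   # A
print(marks_to_grade(85))   # B
```

A matcher would produce something like the matches table; the mapping function is the separate transformation step that this article leaves out of scope.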

Automating matching and mapping has been one of the fundamental tasks of data integration. In general, it is not possible to determine the correspondences between two schemas fully automatically, primarily because the semantics of the two schemas differ and are often not made explicit or documented.

Impediments to Schema Matching

Common challenges to automating matching and mapping have been classified in earlier work, in one case specifically for relational database schemas and in another as a fairly comprehensive list of heterogeneities, not limited to the relational model, that distinguishes schematic from semantic differences. Most of these heterogeneities exist because schemas use different representations or definitions for the same information (schema conflicts), or because different expressions, units, and precision result in conflicting representations of the same data (data conflicts) (see Figure 1).
Research in schema matching seeks to provide automated support for finding semantic matches between two schemas. This process is made harder by heterogeneities at the following levels:
  • Syntactic heterogeneity - differences in the language used for representing the elements
  • Structural heterogeneity - differences in the types and structures of the elements
  • Model / Representational heterogeneity - differences in the underlying models (database, ontologies) or their representations (relational, object-oriented, RDF, OWL)
  • Semantic heterogeneity - where the same real-world entity is represented using different terms (synonyms), or vice versa, where the same term represents different entities (homonyms)

Schema Matching

Comprehensive surveys of automatic schema matching approaches are available in the literature.

Methodology

A generic methodology for the task of schema integration, and the activities it involves, has been discussed in the literature. According to its authors, the integration process can be viewed as comprising four main steps:
  • Preintegration - An analysis of schemas is carried out before integration to decide upon some integration policy. This governs the choice of schemas to be integrated, the order of integration, and a possible assignment of preferences to entire schemas or portions of schemas.
  • Comparison of the Schemas - Schemas are analyzed and compared to determine the correspondences among concepts and detect possible conflicts. Interschema properties may be discovered while comparing schemas.
  • Conforming the Schemas - Once conflicts are detected, an effort is made to resolve them so that the merging of various schemas is possible.
  • Merging and Restructuring - Now the schemas are ready to be superimposed, giving rise to some intermediate integrated schema(s). The intermediate results are analyzed and, if necessary, restructured in order to achieve several desirable qualities.


Approaches

Approaches to schema integration can be broadly classified as those that exploit only schema information and those that exploit both schema-level and instance-level information (see Figure 2; lists of prototypes appear in the surveys of the field).

Schema-level matchers consider only schema information, not instance data. The available information includes the usual properties of schema elements, such as name, description, data type, relationship types (part-of, is-a, etc.), constraints, and schema structure. Working at the element level (atomic elements such as attributes of objects) or at the structure level (combinations of elements that appear together in a structure), these matchers use such properties to identify matching elements in two schemas. Language-based or linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema elements. Constraint-based matchers exploit the constraints often contained in schemas. Such constraints define data types and value ranges, uniqueness, optionality, relationship types, cardinalities, etc. Constraints in the two input schemas are compared to determine the similarity of the schema elements.
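
A linguistic, element-level matcher can be sketched as follows. This is a minimal illustration, not the method of any particular tool: it uses the standard-library difflib.SequenceMatcher as a stand-in string similarity measure, and the SYNONYMS table, the function names, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for a dictionary or thesaurus.
SYNONYMS = {frozenset({"ssn", "id"}), frozenset({"marks", "grades"})}


def name_similarity(a, b):
    """Score two attribute names in [0, 1] using synonyms, then string similarity."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def match_by_name(source, target, threshold=0.8):
    """Pair each source attribute with its best-scoring target, if good enough."""
    result = []
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        score = name_similarity(s, best)
        if score >= threshold:
            result.append((s, best, score))
    return result


db1 = ["Name", "SSN", "Level", "Major", "Marks"]
db2 = ["Name", "ID", "Major", "Grades"]
for s, t, score in match_by_name(db1, db2):
    print(s, "->", t, round(score, 2))
```

With these inputs, Level finds no counterpart above the threshold and is left unmatched, which is the behaviour one would want for an attribute that exists in only one schema.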

Instance-level matchers use instance data to gain insight into the contents and meaning of the schema elements. They are typically used in addition to schema-level matchers to boost confidence in the match results, especially when the information available at the schema level is insufficient. Matchers at this level use linguistic and constraint-based characterizations of instances. For example, using linguistic techniques, it might be possible to examine the Dept, DeptName and EmpName instances and conclude that DeptName is a better match candidate for Dept than EmpName. Constraints, such as zip codes being five digits long or phone numbers following a fixed format, may allow instance data of these types to be matched.
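
The constraint-based side of instance-level matching can be illustrated with a small sketch. It assumes two hypothetical value patterns (a five-digit zip code and a NNN-NNN-NNNN phone format), made-up column names and sample values, and an arbitrary 90% share threshold; real matchers would use much richer instance profiles.

```python
import re

# Hypothetical value patterns; both formats are assumptions for this example.
PATTERNS = {
    "zip5": re.compile(r"\d{5}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def profile(values):
    """Fraction of values that fully match each pattern."""
    return {
        name: sum(bool(p.fullmatch(v)) for v in values) / len(values)
        for name, p in PATTERNS.items()
    }


def dominant_pattern(values, min_share=0.9):
    """Name of the pattern most values follow, or None if none dominates."""
    shares = profile(values)
    name = max(shares, key=shares.get)
    return name if shares[name] >= min_share else None


# Made-up sample instances for columns with unrelated names.
schema_a = {"PostCode": ["45435", "30602", "10001"],
            "Tel": ["937-555-0100", "706-555-0199"]}
schema_b = {"Zip": ["94305", "02139"],
            "Contact": ["212-555-0123", "415-555-0188"]}

candidates = [
    (a, b)
    for a, va in schema_a.items()
    for b, vb in schema_b.items()
    if dominant_pattern(va) is not None
    and dominant_pattern(va) == dominant_pattern(vb)
]
print(candidates)  # [('PostCode', 'Zip'), ('Tel', 'Contact')]
```

The column names share almost no characters, so a purely name-based matcher would likely miss these pairs; the instance formats are what suggest the correspondence.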

Hybrid matchers directly combine several matching approaches to determine match candidates based on multiple criteria or information sources.
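
One simple way to combine criteria is a weighted sum of individual matcher scores, sketched below. The weights, the element descriptions, and the DeptNo attribute are assumptions chosen for illustration, and difflib.SequenceMatcher again stands in for a real linguistic matcher; actual hybrid matchers may combine their criteria quite differently.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """String similarity of two element names in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(a, b):
    """Crude constraint check: 1.0 if the declared data types agree."""
    return 1.0 if a == b else 0.0


def hybrid_score(src, tgt, w_name=0.6, w_type=0.4):
    """Weighted sum of the two scores; the weights are arbitrary assumptions."""
    return (w_name * name_score(src["name"], tgt["name"])
            + w_type * type_score(src["type"], tgt["type"]))


dept = {"name": "Dept", "type": "string"}
targets = [
    {"name": "DeptName", "type": "string"},
    {"name": "EmpName", "type": "string"},
    {"name": "DeptNo", "type": "integer"},   # hypothetical attribute
]
best = max(targets, key=lambda t: hybrid_score(dept, t))
print(best["name"])  # DeptName
```

Here DeptNo has the closest name but the wrong type, and EmpName has the right type but a distant name, so only the combination of both criteria ranks DeptName first.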

Most of these techniques also employ additional information such as dictionaries, thesauri, and user-provided match or mismatch information.

Reusing matching information
Another initiative has been to reuse previous matching information as auxiliary information for future matching tasks. The motivation for this work is that structures or substructures often repeat, for example in schemas in the e-commerce domain. Such reuse of previous matches, however, must be a careful choice: it may make sense only for some part of a new schema or only in some domains. For example, Salary and Income may be considered identical in a payroll application but not in a tax reporting application. Several open challenges in such reuse deserve further work.
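
The domain sensitivity of reuse can be sketched by keying stored matches on the application domain. The PAST_MATCHES store, the domain labels, and the reusable function are hypothetical names introduced only for this example.

```python
# Hypothetical store of previously confirmed matches, keyed by domain.
PAST_MATCHES = {
    "payroll": {frozenset({"Salary", "Income"})},
    "tax_reporting": set(),
}


def reusable(a, b, domain):
    """True only if the pair was confirmed as a match in the same domain."""
    return frozenset({a, b}) in PAST_MATCHES.get(domain, set())


print(reusable("Salary", "Income", "payroll"))        # True
print(reusable("Income", "Salary", "tax_reporting"))  # False
```

Storing each pair as a frozenset makes the lookup independent of argument order, and scoping by domain keeps a match confirmed for payroll from being applied silently to tax reporting.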

Sample Prototypes
Typically, implementations of such matching techniques can be classified as either rule-based or learner-based systems. The complementary nature of these approaches has motivated a number of applications that combine techniques depending on the nature of the domain or application under consideration. Such prototype systems, along with the dimensions of their classification, are discussed in the literature.

Identified Relationships

The relationship types between objects identified at the end of a matching process are typically those with set semantics, such as overlap, disjointness, exclusion, equivalence, and subsumption. The logical encodings of these relationships define their meaning. An early attempt to use description logics for schema integration and for identifying such relationships has been reported in the literature. Several state-of-the-art matching tools, including those benchmarked in the Ontology Alignment Evaluation Initiative, are capable of identifying many such simple (1:1 / 1:n / n:1 element-level) and complex (n:1 / n:m element- or structure-level) matches between objects.
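
Read extensionally, several of these relationships reduce to comparisons between the sets of instances that two schema elements denote. The sketch below is one such reading, using Python's built-in set operators; the function name and sample identifiers are assumptions, exclusion is not modelled, and matching tools may well derive these relationships by other means (for example, by logical reasoning rather than instance comparison).

```python
def relationship(a, b):
    """Classify two collections of instances by set semantics."""
    a, b = set(a), set(b)
    if a == b:
        return "equivalence"
    if a < b:                      # proper subset
        return "subsumption (B subsumes A)"
    if a > b:                      # proper superset
        return "subsumption (A subsumes B)"
    if a & b:                      # non-empty intersection
        return "overlap"
    return "disjointness"


# Made-up instance identifiers for illustration.
students = {1, 2, 3, 4}
grad_students = {3, 4}
employees = {4, 5}
courses = {"c1", "c2"}

print(relationship(grad_students, students))  # subsumption (B subsumes A)
print(relationship(students, employees))      # overlap
print(relationship(students, courses))        # disjointness
```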

See also

  • Data Integration
  • Dataspaces
  • Ontology alignment
  • Minimal Mappings
  • Federated database system
  • Schema crosswalk

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.