Metadata discovery
In metadata, metadata discovery is the process of using automated tools to discover the semantics of a data element in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry.

Metadata discovery is also known as metadata scanning.

Data source formats for metadata discovery

Data sets may be in a variety of different forms including:
  1. Relational databases
  2. Spreadsheets
  3. XML files
  4. Web services
  5. Software source code such as Fortran, Jovial, COBOL, Assembler, RPG, PL/1, EasyTrieve, Java, C# or C++ classes, and hundreds of other software languages
  6. Unstructured text documents such as Microsoft Word or PDF files

A taxonomy of metadata matching algorithms

There are three distinct categories of automated metadata discovery: lexical matching, semantic matching and statistical matching.

Lexical Matching

  1. Exact match - where data element linkages are made based on the exact name of a column in a database, the name of an XML element or a label on a screen. For example, if a database column has the name "PersonBirthDate" and a data element in a metadata registry also has the name "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
  2. Synonym match - where the discovery tool is given not just a single name but a set of synonyms for each data element.
  3. Pattern match - where the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
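The three lexical strategies above can be sketched as a single lookup that tries each rule in turn. This is a minimal illustration, not the method of any particular discovery tool: the registry contents, the synonym sets and the function name `lexical_match` are assumptions made up for the example, and the wildcard patterns use the shell-style syntax of Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical registry (an assumption for this sketch): each registered
# data element name maps to its known synonyms and wildcard patterns.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"DateOfBirth", "DOB"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def lexical_match(column_name):
    """Return (data_element, match_kind) for the first rule that fires, else None."""
    # 1. Exact match: the source name equals a registered name.
    if column_name in REGISTRY:
        return column_name, "exact"

    lowered = column_name.lower()

    # 2. Synonym match: case-insensitive comparison against each synonym set.
    for element, rules in REGISTRY.items():
        if lowered in {s.lower() for s in rules["synonyms"]}:
            return element, "synonym"

    # 3. Pattern match: shell-style wildcards against the lowercased name.
    for element, rules in REGISTRY.items():
        if any(fnmatchcase(lowered, pattern) for pattern in rules["patterns"]):
            return element, "pattern"

    return None
```

Trying the strictest rule first means a pattern hit is reported only when nothing more precise applies. Patterns are also the most error-prone rule: "*sex*" would match a column named "Essex", so pattern matches are usually treated as candidates for review rather than confirmed mappings.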

Semantic Matching

Semantic matching attempts to use semantics to associate target data with registered data elements.
  1. Semantic similarity - this algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other. The terms "Person", "Individual" and "Human" may be highly similar concepts.

Statistical Matching

Statistical matching uses statistics about the data held in a data source to derive similarities with registered data elements.
  1. Distinct value analysis - by analyzing all the distinct values in a column, its similarity to a registered data element may be established. For example, if a column has only the two distinct values 'male' and 'female', it could be mapped to 'PersonGenderCode'.
  2. Data distribution analysis - by analyzing the distribution of values within a single column and comparing this distribution with those of known data elements, a semantic linkage could be inferred.
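Both statistical techniques can be sketched in a few lines. The article does not say how similarity is measured, so the measures below are assumptions: Jaccard overlap between value sets for distinct value analysis, and total variation distance between frequency distributions for data distribution analysis. The registry contents and function names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical registry profiles (assumptions for this sketch): the set of
# permitted values for each registered data element.
REGISTERED_VALUES = {
    "PersonGenderCode": {"male", "female"},
    "YesNoIndicator": {"y", "n"},
}


def normalize(values):
    """Lowercase and trim values, dropping missing entries."""
    return [str(v).strip().lower() for v in values if v is not None]


def distinct_value_match(column_values):
    """Return (data_element, score) for the best Jaccard overlap of value sets."""
    distinct = set(normalize(column_values))
    best_element, best_score = None, 0.0
    if not distinct:
        return best_element, best_score
    for element, domain in REGISTERED_VALUES.items():
        # Jaccard index: shared values divided by all values seen in either set.
        score = len(distinct & domain) / len(distinct | domain)
        if score > best_score:
            best_element, best_score = element, score
    return best_element, best_score


def distribution_distance(column_values, reference):
    """Total variation distance between a column and a reference distribution.

    reference maps each value to its expected relative frequency.
    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    cleaned = normalize(column_values)
    if not cleaned:
        return 1.0
    counts = Counter(cleaned)
    total = len(cleaned)
    keys = set(counts) | set(reference)
    return 0.5 * sum(abs(counts[k] / total - reference.get(k, 0.0)) for k in keys)
```

A column would be mapped only when the overlap score is high or the distance is low. Where to set those thresholds is a tuning decision that the article does not address.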

Vendors

The following vendors (listed in alphabetical order) provide metadata discovery and metadata mapping software and solutions

Research

  • INDUS project at the Iowa State University (see http://www.cild.iastate.edu/software/indus.html)
  • Mercury - A Distributed Metadata Management and Data Discovery System developed at the Oak Ridge National Laboratory DAAC (see http://mercury.ornl.gov)

See also

  • metadata
  • data mapping
  • data warehouse
  • semantic web
  • Defense Discovery Metadata Specification

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 