Dataspaces
Dataspaces are an abstraction in data management that aims to overcome some of the problems encountered in data integration systems. The goal is to reduce the effort required to set up a data integration system by relying on existing matching and mapping generation techniques, and to improve the system in a "pay-as-you-go" fashion as it is used. Labor-intensive aspects of data integration are postponed until they are absolutely needed.
Traditionally, data integration and data exchange systems have aimed to offer many of the purported services of dataspace systems.
Dataspaces can be viewed as a next step in the evolution of data integration architectures, but are distinct from current data integration systems in the following way. Data integration systems require semantic integration before any services can be provided. Hence, although there is not a single schema to which all the data conforms and the data resides in a multitude of host systems, the data integration system knows the precise relationships between the terms used in each schema. As a result, significant up-front effort is required in order to set up a data integration system.
Dataspaces shift the emphasis to a data co-existence approach, providing base functionality over all data sources regardless of how integrated they are. For example, a dataspace support platform (DSSP) can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, additional effort can be applied to integrate those sources more closely in an incremental fashion. Similarly, in terms of traditional database guarantees, a dataspace system can initially provide only weaker guarantees of consistency and durability. As stronger guarantees are desired, more effort can be put into making agreements among the various owners of data sources and opening up certain interfaces (e.g., for commit protocols).
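The pay-as-you-go idea can be illustrated with a minimal Python sketch. It is not the interface of any actual DSSP: the `Source` and `Dataspace` classes, the `add_mapping` method, and the sample records are all hypothetical. Keyword search works over every source immediately, while a structured query only returns results from sources for which someone has invested in an attribute mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A participant in the dataspace: a name plus records as plain dicts."""
    name: str
    records: list


@dataclass
class Dataspace:
    sources: list = field(default_factory=list)
    # Mappings are added incrementally, only when a query needs them.
    # Key: (source name, local attribute) -> shared attribute name.
    mappings: dict = field(default_factory=dict)

    def keyword_search(self, keyword):
        """Base service: works over every source with no integration effort."""
        kw = keyword.lower()
        hits = []
        for src in self.sources:
            for rec in src.records:
                if any(kw in str(v).lower() for v in rec.values()):
                    hits.append((src.name, rec))
        return hits

    def add_mapping(self, source_name, local_attr, shared_attr):
        """Pay-as-you-go step: relate one local attribute to a shared one."""
        self.mappings[(source_name, local_attr)] = shared_attr

    def select(self, shared_attr, value):
        """Structured query: only sources with a mapping can answer it."""
        hits = []
        for src in self.sources:
            for rec in src.records:
                for attr, v in rec.items():
                    mapped = self.mappings.get((src.name, attr))
                    if mapped == shared_attr and v == value:
                        hits.append((src.name, rec))
        return hits


# Hypothetical sources that use different attribute names for the same concept.
crm = Source("crm", [{"customer": "Ada Lovelace", "town": "London"}])
billing = Source("billing", [{"payer": "Ada Lovelace", "city": "London"}])
ds = Dataspace([crm, billing])

everywhere = ds.keyword_search("london")   # answered by both sources at once
before = ds.select("city", "London")       # no mappings yet, so no results
ds.add_mapping("crm", "town", "city")
ds.add_mapping("billing", "city", "city")
after = ds.select("city", "London")        # now answered by both sources
```

The design choice to note is that the mapping table starts empty: the system is useful before any integration work is done, and each added mapping widens what structured queries can reach.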
Personal Information Management
The goal of Personal Information Management (PIM) is to offer easy access to, and manipulation of, all of the information on a person’s desktop, with possible extension to mobile devices, personal information on the Web, or even all the information accessed during a person’s lifetime.
Recent desktop search tools are an important first step for PIM, but are limited to keyword queries. Our desktops typically contain some structured data (e.g., spreadsheets) and there are important associations between disparate items on the desktop. Hence, the next step for PIM is to allow the user to search the desktop in more meaningful ways. For example, “find the list of juniors who took my database course last quarter,” or “compute the aggregate balance of my bank accounts.” We would also like to search by association, e.g., “find the email that John sent me the day I came back from Hawaii,” or “retrieve the experiment files associated with my SIGMOD paper this year.” Finally, we would like to query about sources, e.g., “find all the papers where I acknowledged a particular grant,” “find all the experiments run by a particular student,” or “find all spreadsheets that have a variance column.”
The principles of dataspaces in play in this example are that
- a PIM tool must enable accessing all the information on the desktop, and not just an explicitly or implicitly chosen subset, and
- while PIM often involves integrating data from multiple sources, we cannot assume users will invest the time to integrate. Instead, most of the time the system will have to provide best-effort results, and tighter integrations will be created only in cases where the benefits will clearly outweigh the investment.
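Two of the query styles mentioned above, querying about sources and searching by association, can be sketched in Python as follows. The item records, field names, and dates are invented for illustration, and a real PIM tool would extract such metadata from files rather than hold it in a list.

```python
from datetime import date

# Hypothetical desktop items; every item is searchable, structured or not.
items = [
    {"id": 1, "kind": "spreadsheet", "name": "runs.xls",
     "columns": ["trial", "mean", "variance"]},
    {"id": 2, "kind": "spreadsheet", "name": "budget.xls",
     "columns": ["month", "amount"]},
    {"id": 3, "kind": "email", "sender": "John", "date": date(2005, 3, 14),
     "subject": "Draft comments"},
    {"id": 4, "kind": "calendar", "date": date(2005, 3, 14),
     "subject": "Return from Hawaii"},
    {"id": 5, "kind": "email", "sender": "John", "date": date(2005, 3, 20),
     "subject": "Lunch?"},
]


def spreadsheets_with_column(items, column):
    """Query about sources: inspect the structure of items, not their cells."""
    return [i["name"] for i in items
            if i["kind"] == "spreadsheet" and column in i.get("columns", [])]


def emails_on_day_of(items, sender, event_keyword):
    """Search by association: link emails to a calendar event via a shared date."""
    kw = event_keyword.lower()
    days = {i["date"] for i in items
            if i["kind"] == "calendar" and kw in i["subject"].lower()}
    return [i["subject"] for i in items
            if i["kind"] == "email"
            and i["sender"] == sender
            and i["date"] in days]


variance_sheets = spreadsheets_with_column(items, "variance")
hawaii_emails = emails_on_day_of(items, "John", "hawaii")
```

The association here is inferred from a shared date rather than from an explicit link, which is one way to give best-effort answers without asking the user to integrate anything first.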
Scientific data management
Consider a scientific research group working on environmental observation and forecasting, such as the CORIE System. They may be monitoring a coastal ecosystem through weather stations, shore- and buoy-mounted sensors and remote imagery. In addition they could be running atmospheric and fluid-dynamics models that simulate past, current and near future conditions. The computations may require importing data and model outputs from other groups, such as river flows and ocean circulation forecasts. The observations and simulations are the inputs to programs that generate a wide range of data products, for use within the group and by others: comparison plots between observed and simulated data, images of surface-temperature distributions, animations of salt-water intrusion into an estuary.
Such a group can easily amass millions of data products in just a few years. While it may be that for each file, someone in the group knows where it is and what it means, no one person may know the entire holdings nor what every file means. People accessing this data, particularly from outside the group, would like to search a master inventory that had basic file attributes, such as time period covered, geographic region, height or depth, physical variable (salinity, temperature, wind speed), kind of data product (graph, isoline plot, animation), forecast or hindcast, and so forth. Once data products of interest are located, understanding the lineage is paramount in being able to analyze and compare products: What code version was used? Which finite element grid? How long was the simulation time step? Which atmospheric dataset was used as input?
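A master inventory with basic attributes and lineage could look roughly like the following Python sketch. The `DataProduct` class, the attribute and lineage keys, and the sample entries are assumptions made for illustration and do not describe how the CORIE system stores its metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    path: str
    attributes: dict                              # searchable file attributes
    lineage: dict = field(default_factory=dict)   # how the product was made


# Hypothetical catalog entries.
catalog = [
    DataProduct("plots/salinity_a.png",
                {"variable": "salinity", "kind": "isoline plot", "mode": "forecast"},
                {"code_version": "v2.1", "grid": "grid-A", "time_step_s": 90}),
    DataProduct("plots/salinity_b.png",
                {"variable": "salinity", "kind": "isoline plot", "mode": "forecast"},
                {"code_version": "v2.1", "grid": "grid-B", "time_step_s": 90}),
    DataProduct("anim/temperature.mp4",
                {"variable": "temperature", "kind": "animation", "mode": "hindcast"},
                {"code_version": "v1.8", "grid": "grid-A", "time_step_s": 120}),
]


def search(catalog, **criteria):
    """Return the products whose attributes match every given criterion."""
    return [p for p in catalog
            if all(p.attributes.get(k) == v for k, v in criteria.items())]


def lineage_diff(a, b):
    """Report the lineage fields on which two products differ."""
    keys = sorted(set(a.lineage) | set(b.lineage))
    return {k: (a.lineage.get(k), b.lineage.get(k))
            for k in keys if a.lineage.get(k) != b.lineage.get(k)}


forecasts = search(catalog, variable="salinity", mode="forecast")
difference = lineage_diff(forecasts[0], forecasts[1])
```

Comparing lineage field by field answers the questions above directly: in this sample, the two salinity forecasts differ only in the grid that was used.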
Groups will need to federate with other groups to create scientific dataspaces of regional or national scope. They will need to easily export their data in standard scientific formats, and at granularities (sub-file or multiple file) that don’t necessarily correspond to the partitions they use to store the data. Users of the federated dataspace may want to see collections of data that cut across the groups in the federation, such as all observations and data products related to water velocity, or all data related to a certain stretch of coastline for the past two months. Such collections may require local copies or additional indices for fast search.
This scenario illustrates several dataspace requirements, including
- a dataspace-wide catalog,
- support for data lineage and
- creating collections and indexes over entities that span more than one participating source.
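The third requirement, collections and indexes over entities that span several sources, can be sketched as an index built across the holdings of federated groups. The group names, entry fields, and the `build_index` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical holdings of two federated groups.
group_holdings = {
    "estuary_lab": [
        {"id": "e1", "variable": "water velocity", "kind": "observation"},
        {"id": "e2", "variable": "salinity", "kind": "animation"},
    ],
    "ocean_center": [
        {"id": "o1", "variable": "water velocity", "kind": "forecast"},
    ],
}


def build_index(holdings, attribute):
    """Map each attribute value to (group, id) pairs across all groups."""
    index = defaultdict(list)
    for group, entries in holdings.items():
        for entry in entries:
            index[entry[attribute]].append((group, entry["id"]))
    return dict(index)


by_variable = build_index(group_holdings, "variable")

# A collection that cuts across the groups in the federation.
velocity_collection = by_variable["water velocity"]
```

The index stores only references, so it can be kept as a local copy for fast search while the data itself remains with the group that owns it.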
Further reading
- Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, Cornelia Hedeler: Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces. EDBT 2010: 573-584
- Partha Pratim Talukdar, Zachary G. Ives, Fernando Pereira: Automatically incorporating new sources in keyword search-based data integration. SIGMOD Conference 2010: 387-398
- Anish Das Sarma, Xin Luna Dong, Alon Y. Halevy: Data Modeling in Dataspace Support Platforms. Conceptual Modeling: Foundations and Applications 2009: 122-138
- Xin Luna Dong, Alon Y. Halevy, Cong Yu: Data integration with uncertainty. VLDB J. 18(2): 469-500 (2009)
- Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha: Learning to create data-integrating queries. PVLDB 1(1): 785-796 (2008)
- Michael J. Franklin, Alon Y. Halevy, David Maier: A first tutorial on dataspaces. PVLDB 1(2): 1516-1517 (2008)
- Bill Howe, David Maier, Nicolas Rayner, James Rucker: Quarrying dataspaces: Schemaless profiling of unfamiliar information sources. IIMAS 2008.
- Xin Dong, Alon Y. Halevy: Indexing dataspaces. SIGMOD Conference 2007: 43-54.
- Jens-Peter Dittrich, Marcos Antonio Vaz Salles: iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB 2006: 367-378.
- Michael J. Franklin, Alon Y. Halevy, David Maier: From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34(4): 27-33 (2005).