Synthetic data
Synthetic data are "any production data applicable to a given situation that are not obtained by direct measurement", according to the McGraw-Hill Dictionary of Scientific and Technical Terms. Craig S. Mullins, an expert in data management, defines production data as "information that is persistently stored and used by professionals to conduct business processes."
The creation of synthetic data is an involved process of data anonymization; that is to say, synthetic data is a subset of anonymized data. Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. Often these aspects take the form of personal information (e.g. name, home address, IP address, telephone number, social security number, credit card number).
Usefulness
Synthetic data are generated to meet specific needs or conditions that may not be found in the original, real data. This is useful when designing any type of system, because the synthetic data serve as a simulation or as theoretical values and situations. Designers can thereby take unexpected results into account and have a basic remedy ready if the results prove unsatisfactory. Synthetic data are often generated to represent the authentic data and allow a baseline to be set.[4] Another use of synthetic data is to protect the privacy and confidentiality of authentic data. Synthetic data are also used in testing and creating many different types of systems; the following quote, from the abstract of an article describing software that generates synthetic data for testing fraud detection systems, further explains this use and its importance: "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment."[4]
History
The generation of synthetic data dates back to 1993, when Rubin introduced the original idea of fully synthetic data. Rubin designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records, thereby preserving the anonymity of the households. Later that year, Little introduced the idea of partially synthetic data, which he used to synthesize the sensitive values on the public use file.
In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling.[5] Other important later contributors to the development of synthetic data generation were Raghunathan, Reiter, Rubin, Abowd, and Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. They likewise developed the technique of Sequential Regression Multivariate Imputation.[5]
Applications
Synthetic data are used in the process of data mining. Fraud detection systems, confidentiality systems and many other types of system are tested and trained using synthetic data. Synthetic data may seem to be just a compilation of "made up" data, but specific algorithms and generators are designed to create realistic data.[6] This synthetic data assists in teaching a system how to react to certain situations or criteria. Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data were not used, the software would only be trained to react to the situations provided by the authentic data, and it might not recognize another type of intrusion.[4]
Synthetic data is also used to protect the privacy and confidentiality of a set of data. Real data contains personal, private or confidential information that a programmer, software creator or research project may not want disclosed.[7] Synthetic data holds no personal information and cannot be traced back to any individual; therefore, the use of synthetic data reduces confidentiality and privacy issues.
Calculations
Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms".[10]
"Synthetic data can be generated with random orientations and positions."[8] Datasets can get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data best. This model or equation is called a synthesizer build. The build can then be used to generate more data.[9]
Constructing a synthesizer build involves constructing a statistical model. In a linear regression example, the original data can be plotted, and a best-fit line can be created from the data. This line is a synthesizer created from the original data. The next step is to generate more synthetic data from the synthesizer build, that is, from the equation of this line. In this way, the new data can be used for studies and research while protecting the confidentiality of the original data.[9]
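The linear regression example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the cited source: the sample values, the function names (`fit_line`, `residual_std`, `synthesize`), the uniform sampling of x over the observed range, and the Gaussian noise scaled to the residual spread are all choices made for this sketch.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residual_std(xs, ys, slope, intercept):
    """Standard deviation of the residuals around the fitted line."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return statistics.pstdev(residuals)


def synthesize(xs, ys, n, seed=0):
    """Draw n synthetic (x, y) pairs from the fitted line plus Gaussian noise."""
    rng = random.Random(seed)
    slope, intercept = fit_line(xs, ys)          # the "synthesizer build"
    sigma = residual_std(xs, ys, slope, intercept)
    lo, hi = min(xs), max(xs)
    synthetic = []
    for _ in range(n):
        x = rng.uniform(lo, hi)                  # assumption: x uniform on observed range
        y = slope * x + intercept + rng.gauss(0.0, sigma)
        synthetic.append((x, y))
    return synthetic


# Hypothetical "original" data, made up for this illustration.
original_x = [1.0, 2.0, 3.0, 4.0, 5.0]
original_y = [2.1, 3.9, 6.2, 8.1, 9.8]

synthetic_rows = synthesize(original_x, original_y, n=10)
```

Only the fitted slope, intercept and residual spread are carried over from the original data. The generated rows are new draws and do not reproduce original records. Without the noise term every synthetic point would lie exactly on the line, which is rarely a realistic stand-in for measured data.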
David Jensen from the Knowledge Discovery Laboratory described how to generate synthetic data in chapter 6 of his "Proximity 4.3 Tutorial": "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, Proximity can generate synthetic data having one of several types of graph structure:[10] random graphs generated by some random process; lattice graphs having a ring structure; lattice graphs having a grid structure; and so on.
In all cases, the data generation process follows the same steps:
1. Generate the empty graph structure.
2. Generate attribute values based on user-supplied prior probabilities.
Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.[10]
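The two steps can be illustrated with a small Python sketch for the ring-structured lattice case. It is not Proximity's actual algorithm or API: the function names, the `copy_prob` and `sweeps` parameters, the example labels, and the neighbour-copying rule used to make linked objects depend on each other are all assumptions made for illustration.

```python
import random


def ring_lattice(n):
    """Step 1: empty ring structure; node i is linked to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def assign_attributes(graph, priors, copy_prob=0.5, sweeps=3, seed=0):
    """Step 2: draw labels from the priors, then let neighbours influence each other."""
    rng = random.Random(seed)
    labels = list(priors)
    weights = [priors[label] for label in labels]
    # Initial independent draw from the user-supplied prior probabilities.
    values = {node: rng.choices(labels, weights=weights)[0] for node in graph}
    # Repeated sweeps over the whole graph: with probability copy_prob a node
    # copies the value of a random neighbour, so linked objects become correlated.
    for _ in range(sweeps):
        for node, neighbours in graph.items():
            if rng.random() < copy_prob:
                values[node] = values[rng.choice(neighbours)]
    return values


# Hypothetical labels and prior probabilities, chosen only for this example.
graph = ring_lattice(8)
values = assign_attributes(graph, priors={"fraud": 0.2, "legit": 0.8})
```

Raising `copy_prob` or `sweeps` strengthens the dependence between neighbouring objects. Setting `copy_prob` to 0 leaves every value as an independent draw from the priors.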
External links
The datgen synthetic data generator: http://www.datasetgenerator.com
References
Fienberg, S. E. (1994). “Conflicts between the needs for access to statistical information and demands for confidentiality,” Journal of Official Statistics, 10, 115–132.
Little, R. (1993). “Statistical Analysis of Masked Data,” Journal of Official Statistics, 9, 407–426.
Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). “Multiple Imputation for Statistical Disclosure Limitation,” Journal of Official Statistics, 19, 1–16.
Reiter, J. P. (2004). “Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation,” Survey Methodology, 30, 235–242.