Data quality
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose; for example, a person's age and birth date may conflict within different parts of a database. The first two views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept as it relates to business data processing, although other kinds of data have quality issues as well.
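The age and birth date conflict mentioned above can be illustrated with a small internal consistency check. The following is a minimal sketch, not a standard method: the record layout, the field names (`age`, `birth_date`) and the reference date are assumptions made for the example.

```python
from datetime import date


def age_on(birth_date: date, as_of: date) -> int:
    """Return the number of whole years between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def is_age_consistent(stored_age: int, birth_date: date, as_of: date) -> bool:
    """True when the stored age agrees with the age derived from the birth date."""
    return stored_age == age_on(birth_date, as_of)


# Hypothetical records; in practice the two fields might live in different tables.
records = [
    {"id": 1, "age": 40, "birth_date": date(1980, 6, 15)},
    {"id": 2, "age": 25, "birth_date": date(1990, 1, 1)},
]
as_of = date(2021, 1, 1)

# Collect the ids of records whose two fields contradict each other.
conflicts = [
    r["id"]
    for r in records
    if not is_age_consistent(r["age"], r["birth_date"], as_of)
]
```

A check of this kind says nothing about which of the two fields is wrong; it only flags that the record cannot be correct as stored.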
Definitions
1. The degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
2. The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. Government of British Columbia
3. The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. Glossary of Quality Assurance Terms
4. Glossary of data quality terms published by IAIDQ
5. Data quality: The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria
6. Complete, standards based, consistent, accurate and time stamped. GS1, http://www.gs1.org/gdsn/dqf
History
Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address data so that the mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars compared to manually correcting customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available.
Companies with an emphasis on marketing often focus their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found in the enterprise. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) improving the understanding of vendor purchases to negotiate volume discounts; and 3) avoiding logistics costs in stocking and shipping parts across a large organization.
While name and address data has a clear standard as defined by local postal authorities, other types of data have few recognized standards. There is a movement in the industry today to standardize certain non-address data. The non-profit group GS1 is among the groups spearheading this movement.
For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.
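Two of the techniques listed above, bounds checking and outlier detection, can be sketched in a few lines. This is an illustrative example only: the readings, the permitted range of 0 to 100, and the choice of the interquartile-range rule with a multiplier of 1.5 are all assumptions, and other outlier tests could be substituted.

```python
import statistics


def out_of_bounds(values, low, high):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical measurements; valid readings are assumed to lie in [0, 100].
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0, -3.0]
LOW, HIGH = 0.0, 100.0

# Step 1: bounds check rejects values that are impossible by definition.
rejected = out_of_bounds(readings, LOW, HIGH)

# Step 2: outlier detection flags values that are possible but suspicious.
in_range = [v for v in readings if LOW <= v <= HIGH]
suspects = iqr_outliers(in_range)
```

The two steps answer different questions: a bounds check applies a rule known in advance, while outlier detection compares each value with the rest of the sample and only marks it for review.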
Overview
There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand and Wang, 1996).
A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness and relevance. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognise this as a similar problem to "ilities".
MIT has a Total Data Quality Management program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ).
In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the US economy of data quality problems at over US$600 billion per annum (Eckerson, 2002). Incorrect data, which includes invalid and outdated information, can originate from different data sources: through data entry, or through data migration and conversion projects.
In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed.
One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year.
In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of a larger regulatory compliance function, in recognition of the importance of data and information quality to organizations.
Problems with data quality do not arise only from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.
Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.
The market is going some way to providing data quality assurance. A number of vendors make tools for analysing and repairing poor quality data in situ, service providers can clean the data on a contract basis and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of tools for improving data, which may include some or all of the following:
- Data profiling - initially assessing the data to understand its quality challenges
- Data standardization - a business rules engine that ensures that data conforms to quality rules
- Geocoding - for name and address data. Corrects data to US and worldwide postal standards
- Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. It might be able to manage 'householding', or finding links between husband and wife at the same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.
- Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
- Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean.
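The matching step described in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular commercial tool works: the nickname table, the record fields (`first`, `last`, `address`) and the similarity threshold of 0.75 are hypothetical, and string similarity is approximated with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real tools ship much larger reference lists.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def canonical_first_name(name: str) -> str:
    """Lower-case the name and map a known nickname to its formal form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_match(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> bool:
    """Treat two records as the same person when the first names agree after
    nickname resolution and both surname and address are sufficiently similar."""
    same_first = canonical_first_name(rec_a["first"]) == canonical_first_name(rec_b["first"])
    last_close = similarity(rec_a["last"], rec_b["last"]) >= threshold
    address_close = similarity(rec_a["address"], rec_b["address"]) >= threshold
    return same_first and last_close and address_close


a = {"first": "Bob", "last": "Smith", "address": "12 Main Street"}
b = {"first": "Robert", "last": "Smyth", "address": "12 Main St"}
c = {"first": "Alice", "last": "Jones", "address": "7 Oak Avenue"}

duplicate_found = is_probable_match(a, b)   # nickname plus near-identical fields
unrelated_found = is_probable_match(a, c)   # different person
```

The threshold trades missed duplicates against false merges, so in practice it would be tuned on sample data and borderline pairs would be sent for manual review.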
There are several well-known authors and self-styled experts, with Larry English perhaps the most popular guru. In addition, the International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this field.
ISO 8000 is the international standard for data quality.
See also
- Accuracy and precision
- Data cleansing
- Data governance
- Data integrity
- Data management
- Data profiling
- Data validation
- Dirty data
- Identity resolution
- Information quality
- Master data management
- Noisy text analytics
Further reading
- Eckerson, W. (2002) "Data Warehousing Special Report: Data quality and the bottom line".
- Ivanov, K. (1972) "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems". The University of Stockholm and The Royal Institute of Technology. Doctoral dissertation.
- Kahn, B., Strong, D., Wang, R. (2002) "Information Quality Benchmarks: Product and Service Performance," Communications of the ACM, April 2002. pp. 184–192.
- Price, R. and Shanks, G. (2004) A Semiotic Information Quality Framework, Proc. IFIP International Conference on Decision Support Systems (DSS2004): Decision Support in an Uncertain and Complex World, Prato.
- Redman, T. C. (2004) Data: An Unfolding Quality Disaster.
- Wand, Y. and Wang, R. (1996) "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November 1996. pp. 86–95.
- Wang, R., Kon, H. & Madnick, S. (1993), Data Quality Requirements Analysis and Modelling, Ninth International Conference of Data Engineering, Vienna, Austria.
- Fournel Michel, Accroitre la qualité et la valeur des données de vos clients, éditions Publibook, 2007. ISBN 978-2748338478.
- Daniel F., Casati F., Palpanas T., Chayka O., Cappiello C. (2008) "Enabling Better Decisions through Quality-aware Reports", International Conference on Information Quality (ICIQ), MIT.