Data profiling
Encyclopedia
Data profiling is the process of examining the data available in an existing data source (e.g. a database
or a file
) and collecting statistics
and information about that data. The purpose of these statistics may be to:
Additional metadata information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.
The metadata can then be used to discover problems such as illegal values, misspelling, missing values, varying value representation, and duplicates.
Different analyses are performed for different structural levels. E.g. single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in cross-columns analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.
Normally purpose build tools are used for data profiling to ease the process. The computation complexity increases when going from single column, to single table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.
More detailed profiling is done prior to the dimensional modeling process in order to see what it will require to convert data into the dimensional model, and extends into the ETL system design process to establish what data to extract and which filters to apply.
Although data profiling is effective, then do remember to find a suitable balance and do not slip in to “analysis paralysis”.
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
or a file
Computer file
A computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished...
) and collecting statistics
Descriptive statistics
Descriptive statistics quantitatively describe the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics , in that descriptive statistics aim to summarize a data set, rather than use the data to learn about the population that the data are...
and information about that data. The purpose of these statistics may be to:
- Find out whether existing data can easily be used for other purposes
- Improve the ability to search the data by taggingTag (metadata)In online computer systems terminology, a tag is a non-hierarchical keyword or term assigned to a piece of information . This kind of metadata helps describe an item and allows it to be found again by browsing or searching...
it with keywordsKeywordsKeywords are the words that are used to reveal the internal structure of an author's reasoning. While they are used primarily for rhetoric, they are also used in a strictly grammatical sense for structural composition, reasoning, and comprehension...
, descriptions, or assigning it to a category - Give metricsSoftware metricA software metric is a measure of some property of a piece of software or its specifications. Since quantitative measurements are essential in all sciences, there is a continuous effort by computer science practitioners and theoreticians to bring similar approaches to software development...
on data qualityData qualityData are of high quality "if they are fit for their intended uses in operations, decision making and planning" . Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer...
, including whether the data conforms to particular standards or patterns - Assess the risk involved in integrating dataData integrationData integration involves combining data residing in different sources and providing users with a unified view of these data.This process becomes significant in a variety of situations, which include both commercial and scientific domains...
for new applications, including the challenges of joinJoinJoin may refer to:* Join , to include additional counts or additional defendants on an indictment* Join , a least upper bound of set orders in lattice theory* Join , a type of binary operator...
s - Assess whether metadataMetadataThe term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...
accurately describes the actual values in the source database - Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
- Have an enterprise view of all data, for uses such as master data managementMaster Data ManagementIn computing, master data management comprises a set of processes and tools that consistently defines and manages the non-transactional data entities of an organization...
where key data is needed, or data governanceData governanceData governance is an emerging discipline with an evolving definition. The discipline embodies a convergence of data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization...
for improving data quality.
Introduction
Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data. Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata. Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not. The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors’ expectations.How to do Data Profiling
Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation as well as other aggregates such as count and sum.Additional metadata information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.
The metadata can then be used to discover problems such as illegal values, misspelling, missing values, varying value representation, and duplicates.
Different analyses are performed for different structural levels. E.g. single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in cross-columns analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.
Normally purpose build tools are used for data profiling to ease the process. The computation complexity increases when going from single column, to single table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.
When to Conduct Data Profiling
According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse developing process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified right after the acquisition of the business requirements for the DW/BI. The purpose is to clarify at an early stage if the right data is available at the right detail level and that anomalies can be handled subsequently. If this is not the case the project might have to be canceled.More detailed profiling is done prior to the dimensional modeling process in order to see what it will require to convert data into the dimensional model, and extends into the ETL system design process to establish what data to extract and which filters to apply.
Benefits of Data Profiling
The benefits of data profiling is to improve data quality, shorten the implementation cycle of major projects, and improve understanding of data for the users. Discovering business knowledge embedded in data itself is one of the significant benefits derived from data profiling. Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.Although data profiling is effective, then do remember to find a suitable balance and do not slip in to “analysis paralysis”.
See also
- Data qualityData qualityData are of high quality "if they are fit for their intended uses in operations, decision making and planning" . Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer...
- Data governanceData governanceData governance is an emerging discipline with an evolving definition. The discipline embodies a convergence of data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization...
- Master data managementMaster Data ManagementIn computing, master data management comprises a set of processes and tools that consistently defines and manages the non-transactional data entities of an organization...
- Database normalizationDatabase normalizationIn the design of a relational database management system , the process of organizing data to minimize redundancy is called normalization. The goal of database normalization is to decompose relations with anomalies in order to produce smaller, well-structured relations...
- Data presentation architectureData Presentation ArchitectureData presentation architecture is a skill-set that seeks to identify, locate, manipulate, format and present data in such a way as to optimally communicate meaning and proffer knowledge.-Origin and context:...