DataCleaner
DataCleaner is the flagship application of the eobjects.org open source community. DataCleaner is a data quality application suite with functionality for data profiling, transformation and reporting. The project was founded in late 2007 by Danish student Kasper Sørensen, who wrote a term paper on the process of establishing the project and on the practices of open source software development.
Supported datastores
DataCleaner supports read access to many types of datastores:
- JDBC-compliant databases (such as Oracle, MySQL, Microsoft SQL Server, PostgreSQL, Firebird, SQLite, HSQLDB and Derby/JavaDB)
- Comma-separated values (.csv) files
- Microsoft Excel (.xls and .xlsx) spreadsheets
- XML files
- OpenDocument database (.odb) files
- Microsoft Access (.mdb) database files
- dBase (.dbf) database files
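As a rough illustration of what uniform read access across datastore types can look like, the following Python sketch yields rows from either a CSV file or a SQLite database through a single function. It is not DataCleaner's API: the function name `read_rows`, the dispatch on file extension, and the use of Python's standard `csv` and `sqlite3` modules in place of JDBC are all assumptions made for this example.

```python
import csv
import sqlite3
from pathlib import Path
from typing import Dict, Iterator, Optional


def read_rows(path: str, table: Optional[str] = None) -> Iterator[Dict[str, object]]:
    """Yield every row of a datastore as a dict, without modifying the source.

    Hypothetical helper for illustration only; not part of DataCleaner.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        # CSV datastore: the first line is treated as the column header.
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield dict(row)
    elif suffix in (".db", ".sqlite"):
        if table is None:
            raise ValueError("a table name is required for SQLite datastores")
        # Build a file: URI so that mode=ro opens the database read-only.
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            conn.row_factory = sqlite3.Row
            # Table names cannot be bound as parameters, so quote the identifier.
            quoted = '"' + table.replace('"', '""') + '"'
            for row in conn.execute(f"SELECT * FROM {quoted}"):
                yield dict(row)
        finally:
            conn.close()
    else:
        raise ValueError(f"unsupported datastore type: {suffix}")
```

A caller would iterate over `read_rows("customers.csv")` or `read_rows("customers.sqlite", table="customers")` in the same way (both file names are placeholders). CSV values arrive as strings, while SQLite values keep their stored types.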
0.x: A school project
The early 0.x versions of DataCleaner were released as part of Kasper Sørensen's term paper project. They had a user concept similar to that of the later 1.x versions, but the underlying querying mechanism was based on a single data factory pattern, in which the application could retrieve data from the various datastores using only one method of retrieval (get all rows).
1.x: An independent OSS project
The 1.x versions of DataCleaner became popular among data quality (DQ) professionals. The application was partitioned into three data quality function areas:
Profiler
The profiler in DataCleaner enables the user to gain insight into the content of the datastore. It can calculate and present metrics that help the user become aware of and understand data quality issues. Examples of such metrics are the distribution of values, the maximum, minimum and average values, and the patterns used in values.
Validator
The validator assumes a higher degree of data insight, since it enables the user to create business rules that the data must honor. Rules can be defined in a variety of ways, including JavaScript, lookup dictionaries and regular expressions.
Comparator
The comparator enables a user to compare two separate datastores and to look for values from one datastore within the other, and vice versa.
2.x: Acquisition by Human Inference
On 14 February 2011, it was announced that the data quality vendor Human Inference had acquired eobjects.org, hired Kasper Sørensen, and sponsored and participated in the development of DataCleaner 2.0. DataCleaner 2.0 was released the same day. It introduced a new user experience in which all of the previous function areas were unified into a single workbench.
License history
As of version 1.5, DataCleaner changed its license from the Apache License version 2.0 to the Lesser General Public License. According to the DataCleaner website, the change was made to "ensure that improvements are submitted back to the projects" and that "we don't risk that anyone sell modified versions of our projects".
External links
- the DataCleaner community
- roadmap for the DataCleaner project