Environment for DeveLoping KDD-Applications Supported by Index-Structures
ELKI is a knowledge discovery in databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims to allow the development and evaluation of advanced data mining algorithms and their interaction with database index structures.
Description
The ELKI framework is written in Java and built around a modular architecture. Most of the currently included algorithms perform cluster analysis or outlier detection, and the framework also provides database index structures. A key concept of ELKI is to allow arbitrary combinations of algorithms, data types, distance functions, and indexes, and to evaluate these combinations. When developing new algorithms or index structures, the existing components can be reused and combined.
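To illustrate the kind of modular composition described above, the following is a minimal Java sketch in which a clustering routine is parameterized by an interchangeable distance function. The interface and class names (Distance, ClusteringAlgorithm, NearestSeedAssignment) are hypothetical and chosen purely for illustration; they are not ELKI's actual API.

    import java.util.List;

    // Hypothetical interfaces illustrating the modular idea: an algorithm is
    // parameterized by a distance function, so either side can be swapped freely.
    interface Distance<T> {
        double distance(T a, T b);
    }

    interface ClusteringAlgorithm<T> {
        int[] cluster(List<T> data, Distance<T> distance);
    }

    // One interchangeable distance function: Euclidean distance on double vectors.
    class EuclideanDistance implements Distance<double[]> {
        public double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }

    // A toy assignment "algorithm": each point joins the cluster of the nearest of
    // k fixed seeds. It only shows how an algorithm consumes an arbitrary Distance;
    // it is not one of ELKI's actual implementations.
    class NearestSeedAssignment implements ClusteringAlgorithm<double[]> {
        private final List<double[]> seeds;
        NearestSeedAssignment(List<double[]> seeds) { this.seeds = seeds; }

        public int[] cluster(List<double[]> data, Distance<double[]> distance) {
            int[] labels = new int[data.size()];
            for (int i = 0; i < data.size(); i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int s = 0; s < seeds.size(); s++) {
                    double d = distance.distance(data.get(i), seeds.get(s));
                    if (d < bestDist) { bestDist = d; best = s; }
                }
                labels[i] = best;
            }
            return labels;
        }
    }

    public class ModularityDemo {
        public static void main(String[] args) {
            List<double[]> data = List.of(
                new double[]{0, 0}, new double[]{0.2, 0.1},
                new double[]{5, 5}, new double[]{5.1, 4.9});
            List<double[]> seeds = List.of(new double[]{0, 0}, new double[]{5, 5});

            ClusteringAlgorithm<double[]> algo = new NearestSeedAssignment(seeds);
            // Swapping in a different Distance implementation requires no change
            // to the algorithm itself -- the combination is chosen at call time.
            int[] labels = algo.cluster(data, new EuclideanDistance());
            System.out.println(java.util.Arrays.toString(labels));
        }
    }

The point of such a design is that each side of the combination can be replaced independently, which is how arbitrary algorithms, data types, distance functions, and indexes can be paired and evaluated against one another.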
The university project is developed for use in teaching and research. The source code is written with extensibility, readability, and reusability in mind, but it is not extensively optimized for performance. A scientific evaluation comparing run times is therefore only sound when both algorithms are implemented within ELKI, so that they share the same implementation cost. ELKI currently offers neither integration with business intelligence applications nor an interface to common database management systems via SQL. Applying the algorithms requires knowledge of their use and study of the documentation. The intended audience consists of students, researchers, and software engineers.
The visualization modules use SVG for scalable graphics output and Apache Batik for rendering the user interface, as well as for lossless export into PostScript and PDF for easy inclusion in scientific publications written in LaTeX.
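As a rough sketch of how SVG output via Apache Batik typically works (this is generic Batik usage, not ELKI's own visualization code), a Batik SVGGraphics2D object accepts ordinary Java2D drawing calls and records them as an SVG document, which can then be serialized; the output file name plot.svg below is arbitrary.

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.batik.dom.GenericDOMImplementation;
    import org.apache.batik.svggen.SVGGraphics2D;
    import org.w3c.dom.DOMImplementation;
    import org.w3c.dom.Document;

    public class SvgExportSketch {
        public static void main(String[] args) throws Exception {
            // Create an empty SVG document to draw into.
            DOMImplementation dom = GenericDOMImplementation.getDOMImplementation();
            String svgNS = "http://www.w3.org/2000/svg";
            Document doc = dom.createDocument(svgNS, "svg", null);

            // SVGGraphics2D is a Graphics2D that records drawing calls as SVG.
            SVGGraphics2D g = new SVGGraphics2D(doc);
            g.drawOval(10, 10, 80, 80);        // a simple scatter-plot-like glyph
            g.drawString("example", 20, 110);

            // Serialize the recorded drawing as an SVG file.
            try (Writer out = new OutputStreamWriter(
                    Files.newOutputStream(Paths.get("plot.svg")), StandardCharsets.UTF_8)) {
                g.stream(out, true /* use CSS style attributes */);
            }
        }
    }

Because the drawing is recorded as vector primitives rather than pixels, the resulting graphics scale losslessly, which is what makes the export to PostScript and PDF attractive for publication-quality figures.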
Awards
ELKI started as an implementation of the doctoral dissertation of Dr. Arthur Zimek, which was awarded the "SIGKDD Doctoral Dissertation Award 2009 Runner-up" by the Association for Computing Machinery for its contributions to correlation clustering. The algorithms published as part of the dissertation (4C, COPAC, HiCO, ERiC, CASH) are available in ELKI.
Version 0.4, presented at the Symposium on Spatial and Temporal Databases 2011 and including various methods for spatial outlier detection, won the conference's "best demonstration paper award".
Included algorithms
A selection of the included algorithms:

Cluster analysis:
- K-means clustering
- Expectation-maximization algorithm
- Single-linkage clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise); a minimal sketch follows this list
- OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
- SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)

Anomaly detection:
- LOF (Local Outlier Factor)
- OPTICS-OF
- DB-Outlier (Distance-Based Outliers)
- LOCI (Local Correlation Integral)
- LDOF (Local Distance-Based Outlier Factor)
- EM-Outlier

Spatial index structures:
- R-tree
- R*-tree
- M-tree

Evaluation:
- Receiver operating characteristic (ROC curve)
- Scatter plot
- Histogram
- Parallel coordinates

Other:
- Apriori algorithm
- Dynamic time warping
- Principal component analysis
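As an example from the density-based clustering family listed above, the following is a minimal, textbook-style DBSCAN sketch with a naive O(n²) neighborhood search and Euclidean distance. It illustrates only the core-point/noise logic of the algorithm; it is not ELKI's implementation, which can accelerate the neighborhood queries with index structures such as the R*-tree.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Minimal DBSCAN over 2D points with a naive neighborhood search.
    // Labels: -1 = noise, 0..k = cluster id. Assumes Euclidean distance.
    public class SimpleDbscan {
        static final int NOISE = -1, UNVISITED = -2;

        public static int[] dbscan(double[][] pts, double eps, int minPts) {
            int n = pts.length;
            int[] label = new int[n];
            java.util.Arrays.fill(label, UNVISITED);
            int cluster = -1;
            for (int p = 0; p < n; p++) {
                if (label[p] != UNVISITED) continue;
                List<Integer> neighbors = regionQuery(pts, p, eps);
                if (neighbors.size() < minPts) { label[p] = NOISE; continue; }
                cluster++;                      // start a new cluster from this core point
                label[p] = cluster;
                Deque<Integer> seeds = new ArrayDeque<>(neighbors);
                while (!seeds.isEmpty()) {
                    int q = seeds.pop();
                    if (label[q] == NOISE) label[q] = cluster;   // border point
                    if (label[q] != UNVISITED) continue;
                    label[q] = cluster;
                    List<Integer> qNeighbors = regionQuery(pts, q, eps);
                    if (qNeighbors.size() >= minPts) seeds.addAll(qNeighbors); // q is core: expand
                }
            }
            return label;
        }

        // All points within eps of pts[p], including p itself (O(n) per query).
        static List<Integer> regionQuery(double[][] pts, int p, double eps) {
            List<Integer> result = new ArrayList<>();
            for (int i = 0; i < pts.length; i++) {
                double dx = pts[p][0] - pts[i][0], dy = pts[p][1] - pts[i][1];
                if (Math.sqrt(dx * dx + dy * dy) <= eps) result.add(i);
            }
            return result;
        }

        public static void main(String[] args) {
            double[][] pts = {{0, 0}, {0.1, 0.2}, {0.2, 0.1}, {5, 5}, {5.1, 5.2}, {9, 9}};
            System.out.println(java.util.Arrays.toString(dbscan(pts, 0.5, 2)));
            // Expected: two clusters and one noise point, e.g. [0, 0, 0, 1, 1, -1]
        }
    }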
Licensing
The website and source code do not state an explicit license; the software should therefore be considered copyrighted. The authors have stated that research use is acceptable but attribution is required. For commercial use, an explicit license is required.

Version history
Version 0.1 (July 2008) contained several algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms.
Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series.
Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules.
Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures.
Related applications
- Weka: a similar project by the University of Waikato, with a focus on classification algorithms.
- RapidMiner: an application available both as open source and commercially, with a focus on machine learning.
- KNIME (Konstanz Information Miner): an open source data analytics platform integrated into Eclipse.
External links
- Official web page of ELKI with download and documentation.