Predictive Model Markup Language
The Predictive Model Markup Language (PMML) is an XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining and to share those models between PMML-compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML schema.

PMML Components

PMML follows an intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model. Sequentially, it can be described by the following components:
  • Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version. It also contains an element for a timestamp, which can be used to specify the date of model creation.

  • Data Dictionary: contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined, as well as the data type (such as string or double).

  • Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
    • Normalization: map values to numbers, the input can be continuous or discrete.
    • Discretization: map continuous values to discrete values.
    • Value mapping: map discrete values to discrete values.
    • Functions: derive a value by applying a function to one or more parameters.
    • Aggregation: used to summarize or collect groups of values.

  • Model: contains the definition of the data mining model. A multi-layered feedforward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy associated with its training algorithm known as backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:
    • Model Name (attribute modelName)
    • Function Name (attribute functionName)
    • Algorithm Name (attribute algorithmName)
    • Activation Function (attribute activationFunction)
    • Number of Layers (attribute numberOfLayers)

    This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These elements are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models including support vector machines, association rules, Naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.

  • Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
    • Name (attribute name): must refer to a field in the data dictionary
    • Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
    • Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
    • Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified, then a missing value is automatically replaced by the given value.
    • Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).

  • Targets: allow for post-processing of the predicted value in the form of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, for example, if an input value is missing and there is no other method for treating missing values.

  • Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity (for clustering models), standard error, etc.
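
As a rough illustration of how these components fit together, the following Python sketch assembles a minimal PMML-like document with the standard library and reads its mining schema back. The field names (age, income, risk_score), the model name, and the application name are hypothetical. The XML namespace and the body of the regression model are omitted for brevity, so the result is not validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def build_minimal_pmml():
    """Assemble a small PMML-like tree: Header, DataDictionary, and one model."""
    pmml = ET.Element("PMML", version="4.0")

    # Header: general information about the document and the generating application.
    header = ET.SubElement(pmml, "Header", copyright="Example Corp",
                           description="Hypothetical risk model")
    ET.SubElement(header, "Application", name="ExampleApp", version="1.0")
    ET.SubElement(header, "Timestamp").text = "2009-06-16T00:00:00"

    # Data Dictionary: every field the model may use, with optype and dataType.
    fields = [("age", "continuous", "double"),
              ("income", "continuous", "double"),
              ("risk_score", "continuous", "double")]
    dictionary = ET.SubElement(pmml, "DataDictionary",
                               numberOfFields=str(len(fields)))
    for name, optype, data_type in fields:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype=optype, dataType=data_type)

    # Model and its Mining Schema: a subset of the dictionary plus usage types.
    # The regression table itself is left out of this sketch.
    model = ET.SubElement(pmml, "RegressionModel", modelName="risk_model",
                          functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name="age", usageType="active",
                  missingValueReplacement="40", missingValueTreatment="asMean")
    ET.SubElement(schema, "MiningField", name="risk_score", usageType="predicted")
    return pmml


def mining_fields(pmml):
    """Return {field name: usage type} for the first MiningSchema found."""
    schema = pmml.find(".//MiningSchema")
    return {f.get("name"): f.get("usageType") for f in schema.findall("MiningField")}


def undeclared_fields(pmml):
    """List mining fields that do not refer to a field in the data dictionary."""
    declared = {f.get("name") for f in pmml.findall("./DataDictionary/DataField")}
    return [name for name in mining_fields(pmml) if name not in declared]
```

Calling undeclared_fields on the tree returned by build_minimal_pmml gives an empty list, which mirrors the rule that every mining field name must refer to a field in the data dictionary.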

PMML 4.0

The latest version of PMML, 4.0, was released on June 16, 2009.

Examples of new features include:
  • Improved Pre-Processing Capabilities: Additions to built-in functions include a range of Boolean operations and an If-Then-Else function.

  • Time Series Models: New exponential smoothing models; also placeholders for ARIMA, Seasonal Trend Decomposition, and Spectral Analysis, which are to be supported in the near future.

  • Model Explanation: Saving of evaluation and model performance measures to the PMML file itself.

  • Multiple Models: Capabilities for model composition, ensembles, and segmentation (e.g., combining regression and decision trees).

  • Extensions of Existing Elements: Addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.
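
The Boolean operations and the If-Then-Else function mentioned above can be pictured as nested function applications. The Python sketch below evaluates a small expression written with Apply, FieldRef, and Constant elements. It covers only a handful of functions, requires an explicit else branch, and parses constants with a simplistic number-or-string rule. The field name age and the category labels are hypothetical, and the evaluator is an illustration rather than a conforming PMML engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical expression: label a record "adult" when 17 < age < 65.
EXPRESSION = """
<Apply function="if">
  <Apply function="and">
    <Apply function="greaterThan"><FieldRef field="age"/><Constant>17</Constant></Apply>
    <Apply function="lessThan"><FieldRef field="age"/><Constant>65</Constant></Apply>
  </Apply>
  <Constant>adult</Constant>
  <Constant>other</Constant>
</Apply>
"""

# Small subset of built-in functions handled by this sketch.
FUNCTIONS = {
    "and": lambda *args: all(args),
    "or": lambda *args: any(args),
    "not": lambda a: not a,
    "equal": lambda a, b: a == b,
    "greaterThan": lambda a, b: a > b,
    "lessThan": lambda a, b: a < b,
}


def parse_constant(text):
    """Simplification: treat a constant as a number if it parses, else as a string."""
    try:
        return float(text)
    except ValueError:
        return text


def evaluate(node, record):
    """Recursively evaluate an expression element against a dict of field values."""
    if node.tag == "Constant":
        return parse_constant(node.text)
    if node.tag == "FieldRef":
        return record[node.get("field")]
    if node.tag == "Apply":
        name = node.get("function")
        if name == "if":
            # This sketch expects exactly three children: condition, then, else.
            condition, then_branch, else_branch = list(node)
            chosen = then_branch if evaluate(condition, record) else else_branch
            return evaluate(chosen, record)
        args = [evaluate(child, record) for child in node]
        return FUNCTIONS[name](*args)
    raise ValueError(f"unsupported element: {node.tag}")
```

With tree = ET.fromstring(EXPRESSION), the call evaluate(tree, {"age": 30}) returns "adult", while evaluate(tree, {"age": 70}) returns "other".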

Release history

  • Version 0.7: July 1997
  • Version 0.9: July 1998
  • Version 1.0: August 1999
  • Version 1.1: August 2000
  • Version 2.0: August 2001
  • Version 2.1: March 2003
  • Version 3.0: October 2004
  • Version 3.1: December 2005
  • Version 3.2: May 2007
  • Version 4.0: June 2009

PMML Products

A range of products is offered to produce and consume PMML:
  • Angoss KnowledgeSTUDIO: produces PMML 3.2 for regression models (logistic and linear), decision trees, clustering, neural networks and ruleset models (used to represent scorecards).


  • Angoss StrategyBuilder (a standard module in KnowledgeSEEKER and KnowledgeSTUDIO): produces PMML 3.2 for decision trees (used to represent strategy trees).


  • IBM InfoSphere Warehouse: produces PMML 3.0 and 3.1 for sequences only models. Consumes (scores and visualizes) PMML 3.1 and earlier.

  • IBM SPSS Modeler: produces and scores PMML 3.2 and 4.0 for a variety of models.


  • KNIME: produces and consumes PMML 4.0 for neural networks, decision trees, clustering models, regression models, and support vector machines. As of release 2.4.0, KNIME also offers extensive pre-processing support in PMML, including the ability to edit existing PMML code.

  • KXEN: produces PMML 3.2 for regression models (including mining models) and clustering.

  • Microsoft SQL Server 2008 Analysis Services: produces and consumes PMML 2.1 for decision trees and clustering.

  • MicroStrategy: supports PMML 2.0, 2.1, 3.0, 3.1, 3.2 and 4.0 for linear regression, logistic regression, decision trees, clustering, association rules, time series, neural networks and support vector machines.

  • Open Data Group's Augustus: produces PMML 4.0 for tree, naive bayes and ruleset models. It consumes PMML 4.0 tree, naive bayes, ruleset and regression models. Older versions produce and consume PMML 3.0 regression, tree and naive bayes models.

  • Oracle Data Mining: supports the core features of PMML 3.1 for regression models. The imported models become native Oracle Data Mining (ODM) models capable of Exadata offload.

  • Pervasive DataRush: produces and consumes PMML 3.2 for regression models, decision trees, and naive bayes. Produces PMML 3.2 for association rules and clustering (K-means Center-Based).

  • Predixion PMML Connexion: consumes PMML 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 for several mining models, including decision trees, ruleset models, support vector machines, neural networks, naive bayes, linear and logistic regression models as well as clustering models.

  • RapidMiner: exports several types of models to PMML using the free PMML extension.

  • Rattle/R: uses the R programming language to build several predictive models. It offers a PMML package to export models built in R to PMML 3.2. This package includes export support for support vector machines, linear regression, logistic regression, decision trees, random forests, random survival forests, neural networks, K-means and hierarchical clustering, and association rules.


  • SAND CDBMS 6.1 PMML Extension: consumes PMML versions 3.1 and 3.2 for several mining models, including association rules, clustering, regression, neural networks, naive bayes, support vector machines, rulesets, and decision trees. It also consumes pre-processing elements and built-in functions.

  • SAS Enterprise Miner: produces PMML 2.1 and 3.1 for several mining models, including linear regression, logistic regression, decision trees, neural networks, K-means clustering, and association rules.

  • STATISTICA: generates PMML 2.0 and 3.0 for analyses such as linear regression, logistic regression, decision trees, support vector machines, and neural networks.


  • TIBCO Spotfire Miner 8.1: produces and consumes PMML 2.0 for regression models, decision trees, neural networks, clustering, and naive bayes models.

  • TERADATA Warehouse Miner 5.3.1: consumes PMML 2.1 through 3.2 for regression models, decision trees, neural networks, clustering, and mining models (regression type).

  • Weka (Pentaho): consumes PMML 3.2 for regression models, decision trees, neural networks, rule sets, and support vector machines.

  • Zementis ADAPA: batch and real-time scoring of PMML 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 for several mining models, including decision trees, association rules, support vector machines, neural networks, naive bayes, ruleset models, linear and logistic regression models as well as Cox regression models, clustering models and model ensembles. ADAPA also consumes all pre- and post-processing PMML elements, including transformations, built-in functions, outputs, and targets.

  • Zementis PMML Converter: validates, corrects, and converts PMML files expressed in versions 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0.


  • Zementis Universal PMML Plug-in for Hadoop: scoring of PMML 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 for the Datameer Analytics Solution (DAS), an end-to-end BI solution that includes data source integration, an analytics engine, visualization and dashboarding. DAS uses Apache Hadoop as its back-end storage and processing engine, scaling to 4000 servers and petabytes of data. Hadoop is a Java-based framework that supports the parallel storage and processing of large data sets in a distributed environment.

Transformations Generator

PMML provides a variety of data transformations, including value mapping, normalization, and discretization. It also offers several built-in functions as well as arithmetic and logical operators which can be combined to represent complex pre-processing steps. With the Transformations Generator tool, one can graphically design a transformation and obtain the respective PMML code.
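
To make the three basic transformations concrete, here is a minimal Python sketch of what normalization, discretization, and value mapping compute. The function names, the bin labels, and the sample values are hypothetical. Clamping values outside the outermost normalization points is an assumption of this sketch, and the code illustrates the ideas only; it does not read or write PMML.

```python
from bisect import bisect_right


def normalize(value, points):
    """Piecewise-linear normalization through sorted (original, normalized) pairs."""
    xs = [p[0] for p in points]
    # Assumption of this sketch: clamp values that fall outside the outermost points.
    if value <= xs[0]:
        return points[0][1]
    if value >= xs[-1]:
        return points[-1][1]
    # Find the segment [x0, x1) containing the value and interpolate linearly.
    i = bisect_right(xs, value) - 1
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return y0 + (value - x0) * (y1 - y0) / (x1 - x0)


def discretize(value, bins):
    """Map a continuous value to the label of the first half-open bin [low, high)."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    return None  # no bin matched


def map_value(value, table, default=None):
    """Map one discrete value to another using a lookup table."""
    return table.get(value, default)


# Hypothetical inputs for the three transformations.
AGE_POINTS = [(0, 0.0), (50, 0.5), (100, 1.0)]
AGE_BINS = [(0, 18, "minor"), (18, 65, "adult"), (65, float("inf"), "senior")]
GENDER_TABLE = {"M": "male", "F": "female"}
```

For example, normalize(25, AGE_POINTS) returns 0.25, discretize(42, AGE_BINS) returns "adult", and map_value("M", GENDER_TABLE) returns "male".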

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.