Data set

Encyclopedia

A

, usually presented in tabular

form. Each column

represents a particular variable. Each row

corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random number

s. Each value is known as a datum

. The data set may comprise data for one or more members, corresponding to the number of rows.

Nontabular data sets can take the form of marked up

strings of characters, such as an XML

file.

Almost all data sets, although they may often be written using high-level languages and base-ten numbers

, end up transcoded into machine language for processing by the computer

s involved. Thus, for all their semantic diversity and tabular or nontabular forms, most datasets can be expressed in binary code

as a long string of zeros and ones.

, where it had a well-defined meaning

, very close to contemporary

and kurtosis

.

In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. In spite of the name, such a univariate

data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and then the collection of values may be considered to be a multiset

rather than an (ordered) list.

The values may be numbers, such as real number

s or integer

s, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement

. For each variable, the values will normally all be of the same kind. However, there may also be "missing values

", which need to be indicated in some way.

In statistics

data sets usually come from actual observations obtained by sampling

a statistical population

, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software such as PSPP

still present their data in the classical data set fashion.

**data set**is a collection of dataData

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...

, usually presented in tabular

Table (database)

In relational databases and flat file databases, a table is a set of data elements that is organized using a model of vertical columns and horizontal rows. A table has a specified number of columns, but can have any number of rows...

form. Each column

Column (database)

In the context of a relational database table, a column is a set of data values of a particular simple type, one for each row of the table. The columns provide the structure according to which the rows are composed....

represents a particular variable. Each row

Row (database)

In the context of a relational database, a row—also called a record or tuple—represents a single, implicitly structured data item in a table. In simple terms, a database table can be thought of as consisting of rows and columns or fields...

corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random number

Random number

Random number may refer to:* A number generated for or part of a set exhibiting statistical randomness.* A random sequence obtained from a stochastic process.* An algorithmically random sequence in algorithmic information theory....

s. Each value is known as a datum

Datum

A geodetic datum is a reference from which measurements are made. In surveying and geodesy, a datum is a set of reference points on the Earth's surface against which position measurements are made, and an associated model of the shape of the earth to define a geographic coordinate system...

. The data set may comprise data for one or more members, corresponding to the number of rows.

Nontabular data sets can take the form of marked up

Markup language

A markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the "marking up" of manuscripts, i.e. the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts...

strings of characters, such as an XML

XML

Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

file.

Almost all data sets, although they may often be written using high-level languages and base-ten numbers

Decimal

The decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations....

, end up transcoded into machine language for processing by the computer

Computer

A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...

s involved. Thus, for all their semantic diversity and tabular or nontabular forms, most datasets can be expressed in binary code

Binary code

A binary code is a way of representing text or computer processor instructions by the use of the binary number system's two-binary digits 0 and 1. This is accomplished by assigning a bit string to each particular symbol or instruction...

as a long string of zeros and ones.

## History

Historically, the term originated in the mainframe fieldMainframe computer

Mainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...

, where it had a well-defined meaning

Data set (IBM mainframe)

data set , dataset , is a computer file having a record organization. The term pertains to the IBM mainframe operating system line, starting with OS/360, and is still used by its successors, including the current z/OS. Those systems historically preferred this term over a file...

, very close to contemporary

*computer file*

. This topic is not covered here.Computer file

A computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished...

## Properties

A data set has several characteristics which define its structure and properties. These include the number and types of the attributes or variables and the various statistical measures which may be applied to them such as standard deviationStandard deviation

Standard deviation is a widely used measure of variability or diversity used in statistics and probability theory. It shows how much variation or "dispersion" there is from the average...

and kurtosis

Kurtosis

In probability theory and statistics, kurtosis is any measure of the "peakedness" of the probability distribution of a real-valued random variable...

.

In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. In spite of the name, such a univariate

Univariate

In mathematics, univariate refers to an expression, equation, function or polynomial of only one variable. Objects of any of these types but involving more than one variable may be called multivariate...

data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and then the collection of values may be considered to be a multiset

Multiset

In mathematics, the notion of multiset is a generalization of the notion of set in which members are allowed to appear more than once...

rather than an (ordered) list.

The values may be numbers, such as real number

Real number

In mathematics, a real number is a value that represents a quantity along a continuum, such as -5 , 4/3 , 8.6 , √2 and π...

s or integer

Integer

The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...

s, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement

Level of measurement

The "levels of measurement", or scales of measure are expressions that typically refer to the theory of scale types developed by the psychologist Stanley Smith Stevens. Stevens proposed his theory in a 1946 Science article titled "On the theory of scales of measurement"...

. For each variable, the values will normally all be of the same kind. However, there may also be "missing values

Missing values

In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data....

", which need to be indicated in some way.

In statistics

Statistics

Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

data sets usually come from actual observations obtained by sampling

Sampling (statistics)

In statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population....

a statistical population

Statistical population

A statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest...

, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software such as PSPP

PSPP

PSPP is a free software application for analysis of sampled data. It has a graphical user interface and conventional command line interface. It is written in C, uses GNU Scientific Library for its mathematical routines, and plotutils for generating graphs....

still present their data in the classical data set fashion.

## Classic data sets

Several classic data sets have been used extensively in the statistical literature:- Iris flower data setIris flower data setThe Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher as an example of discriminant analysis...

- multivariate data set introduced by Ronald FisherRonald FisherSir Ronald Aylmer Fisher FRS was an English statistician, evolutionary biologist, eugenicist and geneticist. Among other things, Fisher is well known for his contributions to statistics by creating Fisher's exact test and Fisher's equation...

(1936). -
*Categorical data analysis*- Data sets used in the book,*An Introduction to Categorical Data Analysis*, by Agresti are provided on-line by StatLib. *Robust statistics*- Data sets used inRobust statisticsRobust statistics provides an alternative approach to classical statistical methods. The motivation is to produce estimators that are not unduly affected by small departures from model assumptions.- Introduction :...*Robust Regression and Outlier Detection*(RousseeuwPeter RousseeuwPeter J. Rousseeuw is a Belgian statistician known for his work on robust statistics and cluster analysis.-Books:...

and Leroy, 1986). Provided on-line at the University of Cologne.*Time series*- Data used in Chatfield's book,Time seriesIn statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...*The Analysis of Time Series*, are provided on-line by StatLib.*Extreme values*- Data used in the book,*An Introduction to the Statistical Modeling of Extreme Values*are provided on-line by Stuart Coles, the book's author.**[Dead link]**

*Bayesian Data Analysis*- Data used in the book are provided on-line by Andrew GelmanAndrew GelmanAndrew Gelman is an American statistician, professor of statistics and political science, and director of the Applied Statistics Center at Columbia University. He earned an S.B. in mathematics and in physics from MIT in 1986 and a Ph.D...

, one of the book's authors.- The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature.

## External links

- Research Pipeline - A wiki/website with links to datasets on many different topics.
- StatLib--Datasets Archive
- StatLib--JASA Data Archive
- Data.gov
- UK Government Public Data
- GCMD - The Global Change Master Directory contains more than 20,000 descriptions of Earth science data sets and services covering all aspects of Earth and environmental sciences.