Caltech 101
Caltech 101 is a dataset of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology
. It is intended to facilitate computer vision research and techniques, and is most applicable to techniques interested in recognition, classification, and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories (including faces, watches, ants, pianos, etc.) and a background category (for a total of 102 categories). Provided with the images are a set of annotations describing the outlines of each image, along with a MATLAB script for viewing.
Purpose
Most computer vision and machine learning algorithms function by training on a large set of example inputs.
To work effectively, most of these techniques require a large and varied set of training data. For example, the relatively well-known real-time face detection method of Paul Viola and Michael J. Jones was trained on 4,916 hand-labeled faces.
However, acquiring a large volume of appropriate and usable images is often difficult. Furthermore, cropping and resizing many images, as well as marking points of interest by hand, is a tedious and time-intensive task.
Historically, most datasets used in computer vision research have been tailored to the specific needs of the project being worked on.
A major problem in comparing different computer vision techniques is that most groups use their own datasets. Each of these datasets may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images, and level of occlusion and clutter present can lead to varying results.
The Caltech 101 dataset aims to alleviate many of these common problems.
- The work of collecting a large set of images, and cropping and resizing them appropriately has been taken care of.
- A large number of different categories are represented, which benefits both single-class and multi-class recognition algorithms.
- Detailed object outlines have been marked for each image.
- By being released for general use, the Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different datasets.
However, a recent study demonstrates that tests based on uncontrolled natural images (like the Caltech 101 dataset) can be seriously misleading, potentially guiding progress in the wrong direction.
Images
The Caltech 101 dataset consists of a total of 9,146 images, split between 101 different object categories, as well as an additional background/clutter category. Each object category contains between 40 and 800 images; common and popular categories such as faces tend to have more images than less common categories.
Each image is roughly 300 × 200 pixels in size.
Images of oriented objects such as airplanes and motorcycles were mirrored to be left-right aligned, and vertically oriented structures such as buildings were rotated to be off axis.
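The directory layout is not described in the text above, but the public archive commonly extracts to a root folder (e.g. `101_ObjectCategories`) containing one subdirectory per category, each holding JPEG files. Assuming that layout (the folder and file names here are assumptions, not an official API), the per-category counts described above can be tallied with a short sketch:

```python
# Sketch: tally Caltech 101 images per category, assuming the archive
# extracts to a root directory with one subdirectory per category.
from pathlib import Path

def images_per_category(root):
    """Map each category directory name to a sorted list of its image paths."""
    root = Path(root)
    return {
        d.name: sorted(p for p in d.iterdir() if p.suffix.lower() in {".jpg", ".jpeg"})
        for d in sorted(root.iterdir())
        if d.is_dir()
    }

def total_images(by_category):
    """Total image count across all categories."""
    return sum(len(paths) for paths in by_category.values())
```

On a real copy of the dataset, `total_images(images_per_category("101_ObjectCategories"))` should come out near the 9,146 figure quoted above, depending on whether the background category is included.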
Annotations
As a supplement to the images, a set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located, and a detailed human-specified outline enclosing the object.
A MATLAB script, provided along with the annotations, loads an image and its corresponding annotation file and displays them as a MATLAB figure.
The bounding box is yellow and the outline is red.
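A rough Python equivalent of that viewer can be sketched as follows. The annotation field names (`box_coord`, `obj_contour`) and their coordinate conventions are assumptions about the `.mat` files and should be checked against an actual annotation file before use:

```python
# Sketch of a viewer equivalent to the bundled MATLAB script: draw the
# bounding box in yellow and the object outline in red. Assumptions
# (verify against your copy of the dataset): each annotation .mat file
# stores "box_coord" as [row_min, row_max, col_min, col_max] and
# "obj_contour" as a 2xN array of points relative to the box corner.

def box_to_rect(box_coord):
    """Convert an assumed [row_min, row_max, col_min, col_max] box to
    matplotlib's (x, y, width, height) convention."""
    r0, r1, c0, c1 = (float(v) for v in box_coord)
    return (c0, r0, c1 - c0, r1 - r0)

def show_annotated(image_path, annotation_path):
    """Display an image with its annotation overlaid."""
    # Imported here so box_to_rect stays usable without these packages.
    from scipy.io import loadmat
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    from PIL import Image

    ann = loadmat(annotation_path)
    x, y, w, h = box_to_rect(ann["box_coord"].ravel())
    contour = ann["obj_contour"]  # assumed 2xN: row 0 = x, row 1 = y
    fig, ax = plt.subplots()
    ax.imshow(Image.open(image_path))
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor="yellow"))
    ax.plot(contour[0] + x, contour[1] + y, color="red")
    plt.show()
```

The coordinate flip in `box_to_rect` reflects the usual mismatch between matrix (row, column) indexing and plotting (x, y) conventions.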
Uses
The Caltech 101 dataset has been used to train and test several computer vision recognition and classification algorithms. The first paper to make use of Caltech 101 was an incremental Bayesian approach to one-shot learning, an attempt to learn a class of object using only a few examples by building on prior knowledge of many other classes.
The Caltech 101 images, along with the annotations, were used for another one-shot learning paper at Caltech.
L. Fei-Fei, R. Fergus and P. Perona. One-Shot learning of object categories
Other Computer Vision papers that report using the Caltech 101 dataset:
- Shape Matching and Object Recognition using Low Distortion Correspondence. Alexander C. Berg, Tamara L. Berg, Jitendra Malik. CVPR 2005
- The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005
- Combining Generative Models and Fisher Kernels for Object Class Recognition. A. D. Holub, M. Welling, P. Perona. International Conference on Computer Vision (ICCV), 2005
- Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005
- SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alexander C. Berg, Michael Maire, Jitendra Malik. CVPR, 2006
- Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006
- Empirical study of multi-scale filter banks for object categorization. M. J. Marín-Jiménez and N. Pérez de la Blanca. December 2005
- Multiclass Object Recognition with Sparse, Localized Features. Jim Mutch and David G. Lowe. pg. 11-18, CVPR 2006, IEEE Computer Society Press, New York, June 2006
- Using Dependent Regions for Object Categorization in a Generative Framework. G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006
Advantages
Caltech 101 has several advantages over other similar datasets:
- Uniform size and presentation:
Almost all the images within each category are uniform in image size and in the relative position of the objects of interest. This means that, in general, users of the Caltech 101 dataset do not need to spend extra time cropping and scaling the images before they can be used.
- Low level of clutter/occlusion:
Algorithms concerned with recognition usually function by storing features unique to the object to be recognized. However, the majority of photographs contain varying degrees of background clutter, so algorithms trained on cluttered images can potentially build incorrect models that incorporate features of the background rather than the object. The relatively clean backgrounds in Caltech 101 reduce this risk.
- Detailed Annotations:
The detailed annotations of object outlines are another advantage of using the dataset.
Weaknesses
There are several weaknesses to the Caltech 101 dataset. Some of them are conscious trade-offs for the advantages it provides, and some are simply limitations of the dataset itself.
- Limited number of categories:
The Caltech 101 dataset represents only a small fraction of the possible object categories.
- Some categories contain few images:
Certain categories are not represented as well as others, containing as few as 31 images.
This means that the number of images used for training must be at most 30 per category, which is not sufficient for all purposes.
- Can be too easy:
Images are very uniform in presentation, left-right aligned, and usually not occluded. As a result, the images are not always representative of the practical inputs that the algorithm being trained might be expected to see. Under practical conditions, there is usually more clutter, occlusion, and variance in the relative position and orientation of the objects of interest.
- Aliasing and artifacts due to manipulation:
Some images have been rotated and scaled from their original orientation, and suffer from some amount of compression artifacts or aliasing.
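The 30-image training limit mentioned above underlies the evaluation protocol most papers use with this dataset: sample at most 30 training images per category, test on the remainder, and report the unweighted mean of per-category accuracies so that large categories such as faces do not dominate the score. A minimal sketch, with illustrative function names:

```python
# Sketch of the common Caltech 101 evaluation protocol: cap training
# images per category at 30 and average accuracy over categories.
import random

def split_category(image_ids, n_train=30, seed=0):
    """Return (train, test) lists for one category, capping train at n_train."""
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def mean_per_class_accuracy(correct_by_class):
    """correct_by_class: {category: (n_correct, n_test)} -> unweighted mean."""
    accs = [c / n for c, n in correct_by_class.values() if n]
    return sum(accs) / len(accs)
```

Because `mean_per_class_accuracy` weights every category equally, a classifier that only performs well on the largest categories is not overrated by this metric.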
Other datasets
- Caltech 256 is another image dataset created at the California Institute of Technology in 2007, a successor to Caltech 101. It is intended to address some of the weaknesses inherent to Caltech 101. Overall, it is a more difficult dataset than Caltech 101 (though it suffers from some of the same problems):
- 30,607 images, covering a larger number of categories.
- Minimum number of images per category raised to 80.
- Images not left-right aligned.
- More variation in image presentation.
- LabelMe is an open, dynamic dataset created at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). LabelMe takes a different approach to the problem of creating a large image dataset, with different trade-offs.
- 106,739 images, 41,724 annotated images, and 203,363 labeled objects.
- Users may add images to the dataset by upload, and add labels or annotations to existing images.
- Due to its open nature, LabelMe has many more images covering a much wider scope than Caltech 101. However, since each person decides what images to upload, and how to label and annotate each image, there can be a lack of consistency between images.
- VOC 2008 is a European effort to collect images for benchmarking visual categorization methods. Compared to Caltech 101/256, a smaller number of categories (about 20) is collected; however, the number of images in each category is larger.
- Overhead Imagery Research Data Set (OIRDS) is an annotated library of imagery and tools to aid in the development of computer vision algorithms. OIRDS v1.0 is composed of passenger vehicle objects annotated in overhead imagery. Passenger vehicles in the OIRDS include cars, trucks, vans, etc. In addition to the object outlines, the OIRDS includes subjective and objective statistics that quantify each vehicle within the image's context. For example, subjective measures of image clutter, clarity, noise, and vehicle color are included, along with more objective statistics such as ground sample distance (GSD), time of day, and day of year.
- ~900 images, containing ~1,800 annotated objects
- ~30 annotations per object
- ~60 statistical measures per object
- wide variation in object context
- limited to passenger vehicles in overhead imagery
External links
- http://www.vision.caltech.edu/Image_Datasets/Caltech101/ - Caltech 101 Homepage (includes download)
- http://www.vision.caltech.edu/Image_Datasets/Caltech256/ - Caltech 256 Homepage (includes download)
- http://labelme.csail.mit.edu/ - LabelMe Homepage
- http://www2.it.lut.fi/project/visiq/ - Randomized Caltech 101 download page (includes download)