Hierarchical Data Format
Hierarchical Data Format (HDF, HDF4, or HDF5) is the name of a set of file formats and libraries designed to store and organize large amounts of numerical data. Originally developed at the National Center for Supercomputing Applications, it is currently supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies, and the continued accessibility of data currently stored in HDF.
In keeping with this goal, the HDF format, libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB, IDL, and Python. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).
There currently exist two major versions of HDF, HDF4 and HDF5, which differ significantly in design and API.
HDF4
HDF4 is the older version of the format, although it is still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.
HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."
The HDF4 format has many limitations. It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.
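The 2 GB ceiling follows directly from the width of the address field. The arithmetic can be sketched as below; the helper `fits_in_int32` and the 3 GiB figure are illustrative names and values, not part of any HDF library.

```python
# Largest byte offset representable in a signed 32-bit integer.
INT32_BITS = 32
max_offset = 2 ** (INT32_BITS - 1) - 1  # one bit is used for the sign

GIB = 2 ** 30
print(max_offset)              # 2147483647
print((max_offset + 1) / GIB)  # 2.0


def fits_in_int32(offset: int) -> bool:
    """Return True if a byte offset can be stored in a signed 32-bit field."""
    return 0 <= offset <= max_offset


# The last byte of a hypothetical 3 GiB file lies beyond the addressable range.
print(fits_in_int32(3 * GIB - 1))  # False
```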
HDF5
The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 Award.
HDF5 simplifies the file structure to include only two major types of object:
- Datasets, which are multidimensional arrays of a homogeneous type
- Groups, which are container structures that can hold datasets and other groups
This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are even accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
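The object model described above can be illustrated with a small, self-contained toy model in plain Python. It is only a sketch of the idea and does not read or write HDF5 files. The classes `Group` and `Dataset`, the `resolve` method, and the sample names such as `run1` and `temperature` are assumptions made for illustration. The method names `create_group`, `create_dataset` and `attrs` are chosen to resemble those found in bindings such as h5py.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy stand-in for an HDF5 dataset: a list of same-typed values."""
    data: List[float]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Toy stand-in for an HDF5 group: a container of groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def create_group(self, name: str) -> "Group":
        group = Group()
        self.children[name] = group
        return group

    def create_dataset(self, name: str, data: List[float]) -> Dataset:
        dataset = Dataset(list(data))
        self.children[name] = dataset
        return dataset

    def resolve(self, path: str) -> Union["Group", Dataset]:
        """Walk a POSIX-like path such as /run1/temperature from this group."""
        node: Union[Group, Dataset] = self
        for part in path.strip("/").split("/"):
            if not part:
                continue  # "/" alone refers to the starting group
            if not isinstance(node, Group):
                raise KeyError(f"{part!r} lies below a dataset, not a group")
            node = node.children[part]
        return node


root = Group()
run1 = root.create_group("run1")
run1.attrs["operator"] = "example"           # metadata attached to a group
temperature = run1.create_dataset("temperature", [20.5, 21.0, 21.4])
temperature.attrs["units"] = "degC"          # metadata attached to a dataset

found = root.resolve("/run1/temperature")
print(found.data, found.attrs["units"])
```

Because only two node types and free-form attributes exist, richer structures such as images or tables become conventions layered on top rather than separate interfaces.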
In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
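One common kind of dataspace selection is the regular pattern that the HDF5 C library calls a hyperslab, described by start, stride, count and block values. The following one-dimensional sketch computes which element indices such a selection would cover. The function `hyperslab_indices`, the overlap check and the sample dataset are assumptions made for illustration, and no HDF5 library is called.

```python
from typing import List


def hyperslab_indices(start: int, stride: int, count: int, block: int) -> List[int]:
    """Indices picked by a regular 1-D selection.

    Selects `count` blocks of `block` consecutive elements each; block
    origins are `stride` elements apart, beginning at `start`.
    """
    if block > stride:
        # Assumption of this sketch: overlapping blocks are rejected.
        raise ValueError("blocks would overlap: block must not exceed stride")
    return [start + i * stride + j for i in range(count) for j in range(block)]


dataset = list(range(100, 120))  # hypothetical 20-element dataset
selection = hyperslab_indices(start=2, stride=5, count=3, block=2)
print(selection)                        # [2, 3, 7, 8, 12, 13]
print([dataset[i] for i in selection])  # [102, 103, 107, 108, 112, 113]
```

Describing a region by four numbers per dimension, rather than by an explicit list of indices, keeps the selection compact however large the dataset is.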
The latest version of NetCDF, version 4, is based on HDF5.
Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.
Officially supported APIs
- C
- C++
- Fortran, Fortran 90
- HDF5 Lite (H5LT) – a light-weight interface for C
- HDF5 Image (H5IM) – a C interface for images or rasters
- HDF5 Table (H5TB) – a C interface for tables
- HDF5 Packet Table (H5PT) – interfaces for C and C++ to handle "packet" data, accessed at high speeds
- HDF5 Dimension Scale (H5DS) – allows dimension scales to be added to HDF5; to be introduced in the HDF5-1.8 release
- Java
Third-party bindings
- GNU Data Language
- Huygens Software – has used HDF5 as its primary storage format since version 3.5
- IDL
- JHDF5 – an alternative Java binding that takes a different approach from the official HDF5 Java binding, which some users find simpler
- MATLAB – uses HDF5 as primary storage format in recent releases
- Mathematica – immediate analysis of HDF and HDF5 data
- Perl
- Python – supports HDF5 via h5py (a thin wrapper) and via PyTables (a higher-level interface)
See also
- Common Data Format (CDF)
- FITS, a data format used in astronomy
- GRIB (GRIdded Binary), a data format used in meteorology
- HDF Explorer
External links
- HDF Group
- What is HDF5?
- A presentation on how to handle large datasets in Quantum Chemistry using hdf5
- NASA HDF file example, its structure generated and shown online as CreativeCommons image