NetCDF
Warning: This guide needs additional information.
NetCDF (Network Common Data Form) is a file format that stores scientific data in arrays. Array values can be accessed directly, without knowing how the data are stored, and metadata can be stored alongside the data.
Binary file format commonly used for scientific data
Self-describing, includes metadata
Multi-dimensional array data model
The netCDF data model consists of the following:
variable
Multi-dimensional array
Column-oriented: each variable as a separate entity
dimension
Usually temporal, spatial, spectral, …
Can be of unlimited length. At most one unlimited dimension is recommended, typically for a growing time dimension
attribute
Metadata: global and variable level
group
Akin to directories
Avoid unless you really need the complex structure
Why use NetCDF
NetCDF is commonly used at LASP because it is the “highly preferred” format for NASA Earth Observing System Data and Information System (EOSDIS) data products, per the Data Product Development Guide for Data Producers. This affects all NASA Earth Science missions.
NetCDF features:
Self-describing
structure captures coordinate system (functional relationship)
includes metadata
Efficient storage
packing
compression
Efficient access
chunking
HTTP byte range requests
parallel I/O
Open specification (unlike IDL save files)
Options available
There are two netCDF data models:
NetCDF-3 classic
NetCDF-4 built on HDF5
Recommended, but prefer classic-model constructs (dimensions, variables, attributes) over NetCDF-4-only features such as groups
How to use this data format
NetCDF Files
Binary format with open specification
Requires software libraries to read and write; libraries are available for C, Fortran, Java, Python, IDL, …
Internal compression is supported, so there is no need to compress NetCDF files externally
HTTP byte range requests
Parallel I/O
.nc file extension
Don’t be afraid of big files
Coordinate System
Dimensions should be used to define a coordinate system
e.g. temporal, spatial, spectral
Avoid using dimensions to group data
Think “functional relationship”. Each independent variable should represent a dimension.
coordinate variable
1D variable with dimension of the same name
strictly monotonic (ordered)
no missing values
Independent variable of functional relationship
Every dimension should have one
shared dimensions
Variables should reuse dimensions to indicate that they share the same coordinates (domain set)
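The coordinate variable rules above (strictly monotonic, no missing values) can be checked with a small helper. This is a minimal sketch in plain Python; the function name `is_valid_coordinate` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def is_valid_coordinate(values: Sequence[float]) -> bool:
    """Return True if values are strictly monotonic with no NaN entries."""
    if any(math.isnan(v) for v in values):
        return False
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


# Hypothetical wavelength coordinates in nm.
good = [400.0, 500.0, 600.0]
repeated = [400.0, 500.0, 500.0]          # not strictly monotonic
with_gap = [400.0, float("nan"), 600.0]   # contains a missing value

good_ok = is_valid_coordinate(good)
repeated_ok = is_valid_coordinate(repeated)
with_gap_ok = is_valid_coordinate(with_gap)
```

A check like this could run before writing a file, so that every dimension ends up with a usable coordinate variable.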
Time as Coordinate Variable
If the data are a function of a single time dimension, there should be a single time variable
avoid breaking time up by date and time of day
Prefer numeric time units
time unit since an epoch
e.g. “seconds since 1970-01-01”, “microseconds since 1980-01-06”
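A numeric time value relative to an epoch can be computed with the Python standard library. The sketch below uses the "seconds since 1970-01-01" example and assumes the epoch is in UTC, which the unit string does not state; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Epoch from the "seconds since 1970-01-01" example; UTC is an assumption.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_time(when: datetime) -> float:
    """Convert a timezone-aware datetime to seconds since EPOCH."""
    return (when - EPOCH).total_seconds()


def decode_time(seconds: float) -> datetime:
    """Convert seconds since EPOCH back to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


# A single numeric value carries both the date and the time of day.
stamp = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
encoded = encode_time(stamp)
decoded = decode_time(encoded)
```

Storing one numeric time variable this way avoids splitting time into separate date and time-of-day variables.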
Metadata
Optional, but useful for making a NetCDF file self-describing
attribute
global (dataset level)
title
history (provenance)
variable
long_name
units
Conventions
Other useful variable attributes
_FillValue
missing_value is considered deprecated and is not recommended by the NetCDF Users Guide (NUG).
NaN is another option. However, NaNs in files are handled differently across languages, so for official data products with many users it may be better to pick an explicit fill value
valid_range, valid_min, valid_max
scale_factor, add_offset (packed values)
cell_methods: a convention for representing data cells (bins)
e.g. daily average, wavelength bins
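Packed values are conventionally recovered as unpacked = packed * scale_factor + add_offset, with _FillValue marking missing entries. The sketch below shows that arithmetic in plain Python, assuming 16-bit integer storage; the constants, function names, and sample values are hypothetical.

```python
from typing import List, Optional

FILL_VALUE = -32768    # hypothetical _FillValue for 16-bit packed data
SCALE_FACTOR = 0.01    # hypothetical scale_factor
ADD_OFFSET = 273.15    # hypothetical add_offset


def pack(values: List[Optional[float]]) -> List[int]:
    """Pack floats to integers; None (missing) becomes FILL_VALUE."""
    return [
        FILL_VALUE if v is None else round((v - ADD_OFFSET) / SCALE_FACTOR)
        for v in values
    ]


def unpack(packed: List[int]) -> List[Optional[float]]:
    """Recover floats: unpacked = packed * scale_factor + add_offset."""
    return [
        None if p == FILL_VALUE else p * SCALE_FACTOR + ADD_OFFSET
        for p in packed
    ]


# Hypothetical temperatures in kelvin, with one missing sample.
temperatures = [273.15, 280.0, None, 300.5]
packed = pack(temperatures)
restored = unpack(packed)
```

Packing trades precision for size: restored values agree with the originals only to within about half of scale_factor.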
Useful Links
Credit: Content taken from a Confluence guide written by Doug Lindholm