NetCDF and xarray - Geophysical Data Analysis Book

Learning Objectives¶

Describe the netcdf data format as it is used to store climate data
Describe how xarray can be used to read netCDF files
Learn how to create well-formatted datasets from numpy arrays
Lean how to save xarray datasets as netCDF files

What is netCDF Data?¶

NetCDF (network Common Data Form) is a hierarchical data format. It is what is known as a “self-describing” data structure which means that metadata, or descriptions of the data, are included in the file itself and can be parsed programmatically, meaning that they can be accessed using code dto build automated and reproducible workflows.

The NetCDF format can store data with multiple dimensions. It can also store different types of data through arrays that can contain geospatial imagery, rasters, terrain data, climate data, and text. These arrays support metadata, making the netCDF format highly flexible. NetCDF was developed and is supported by UCAR who maintains standards and software that support the use of the format.

Data in netCDF format is:¶

Self-Describing. A netCDF file includes information about the data it contains.
Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

NetCDF4 Format for Climate Data¶

The hierarchical and flexible nature of netcdf files supports storing data in many different ways. The netCDF4 data standard is used broadly by the climate science community to store climate data. Climate data are:

often delivered in a time series format (months and years of historic or future projected data).
spatial in nature, covering regions such as the United States or even the world.
driven by models which require documentation making the self describing aspect of netCDF files useful.

Xarray¶

In the previous set of lectures, we saw how Pandas provided a way to keep track of additional “metadata” surrounding tabular datasets, including “indexes” for each row and labels for each column. These features, together with Pandas’ many useful routines for all kinds of data munging and analysis, have made Pandas one of the most popular python packages in the world.

However, not all Earth science datasets easily fit into the “tabular” model (i.e. rows and columns) imposed by Pandas. In particular, we often deal with multidimensional data. By multidimensional data (also often called N-dimensional), I mean data with many independent dimensions or axes. For example, we might represent Earth’s surface temperature $T$ as a three dimensional variable

T(x, y, t)

(1)

where $x$ is longitude, $y$ is latitude, and $y$ is time.

The point of xarray is to provide pandas-level convenience for working with this type of data.

Xarray data structures¶

Like Pandas, xarray has two fundamental data structures:

a DataArray, which holds a single multi-dimensional variable and its coordinates
a Dataset, which holds multiple variables that potentially share the same coordinates

DataArray¶

A DataArray has four essential attributes:

values: a numpy.ndarray holding the array’s values
dims: dimension names for each axis (e.g., (‘x’, ‘y’, ‘z’))
coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs: an OrderedDict to hold arbitrary metadata (attributes)

Let’s start by constructing some DataArrays manually

The Argo program¶

We will get some practice with xarray using data from Argo floats

Data from Argo floats are available from several data centers. Here, we will use the data available form the French Institure for Ocean Research IFREMER

import numpy as np
import xarray as xr
from matplotlib import pyplot as plt

import cartopy.crs as ccrs

# We use xarray.load_dataset to load our profile data 
ds_raw = xr.load_dataset('../data/5901429_prof.nc')

xarray will read the netCDF data as an xarray.Dataset object. Below, we see that our dataset has 64 variables and 5 dimension. Looking at the file’s Attributes it becomes clear what we mean by metadata and self-describing.

ds_raw

Similarly to pandas, we can visualize the data directly from xarray. Below, we use the “dot” notation to access the variable TEMP_ADJUSTED and make a plot. Note that xarray used the metadata to already add information to the plot in the form of axes labels (it shows even the units 🤯)

ds_raw.TEMP_ADJUSTED.plot()

What we have here is sea temperature as a function of N_LEVELS and N_RPOFILES. This is not very intuitive. Maybe it would make more sense to analize temperature as a function of depth and time. Also, we don’t need all variables from this files and the variable names are a bit annoying to type. Let’s go head and do some data cleanup.

# Define a list with the variables that we want to keep
variables = ['PRES_ADJUSTED','TEMP_ADJUSTED', 'PSAL_ADJUSTED', 'LATITUDE', 'LONGITUDE', 'JULD']
# Select only these variables from the whole dataset
ds = ds_raw[variables]
ds

Now, we can rename variables in xarray, using the method rename and passing the current variable names and the respective new variable names in the form of a dictionary {'current_name1':'new_name1', 'current_name2':'new_name2'}. For example

ds = ds.rename({'JULD':'time'})
ds

Now, let’s do this for the othe variables

ds = ds.rename({'PRES_ADJUSTED':'pressure', 'TEMP_ADJUSTED':'temperature',
               'PSAL_ADJUSTED':'salinity', 'LATITUDE':'latitude',
                'LONGITUDE':'longitude'})
ds

ds.temperature

We have succefully changed the variable names. Now, we see that the dataset dimensions are profile number and level number (N_PROF, N_LEVELS), but we would prefer to have time as a dimension. We can swap the dimension N_PROF with time

ds = ds.swap_dims({'N_PROF':'time'})
ds

Nice! Now, if we try to plot temperature xarray will disply it as a function of time.

ds.temperature.plot()

Operations in xarray are dimension aware¶

Back when we were using numpy, if we wanted to perform an operation on a given array, we had to specify the axis on which to operate. For example, np.mean(data, axis=1). In xarray this is much more intuitive: you specify the dimension on which we want to operate.

ds.temperature.mean(dim='N_LEVELS').plot()

🤔 Pressure or depth?¶

The pressure in this file is given in decibar. Discuss with your peers what is a decibar and how it relates to depth.
It seems like the pressure values in your dataset are not exactly the same for each profile. Discuss with your peers some strategies the you could use to have a comon range of depths for all profiles and map the dimension ‘N_LEVELS’ to depth in your dataset.

# Hide this cell
ds.pressure.mean(dim='time').plot()
ds.pressure.median(dim='time').plot()
ds.pressure.max(dim='time').plot()
ds.pressure.min(dim='time').plot()

depths = ds.pressure.median(dim='time')
ds['depth'] = depths
ds = ds.swap_dims({'N_LEVELS':'depth'})
ds

ds.temperature.plot()

ds.temperature.plot(x='time', y='depth', yincrease=False)

Data selection in xarray¶

Similarly to pandas loc and iloc, in xarray you can select data by index or by the actual value of the data. For example:

ds.temperature.isel(time=10).plot(y='depth', yincrease=False)

# You can also ask for a slice
ds.temperature.isel(time=slice(0, 20)).plot()

# Which also works with multiple dimensions
ds.temperature.isel(time=slice(0, 20), depth=slice(0, 10)).plot()

ds.temperature.sel(time="2007-01-15")

ds.temperature.sel(time="2007-01-15", method='nearest')

Creating Xarray atasets¶

A Comparison to Numpy¶

Remembering we looked at data for the maximum recorded temperature at the Denver Water Department for a number of years. You used numpy to load the dataset as an array, and performed operations like taking the mean and standard deviation on various axes of the data. Let’s say we want to do this with xarray datasets instead of numpy arrays. We used np.loadtxt() to load the data as a numpy dataset, so lets try using the function we learned last lecture (xr.load_dataset()) to open the data as an xarray dataset.

xr.load_dataset('../data/meteo_denver_tmax_2000_2022.txt')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 xr.load_dataset('../data/meteo_denver_tmax_2000_2022.txt')

File /mnt/c/Ryan_Data/Teaching/GPGN268/GPGN268-BOOK/.pixi/envs/default/lib/python3.14/site-packages/xarray/backends/api.py:165, in load_dataset(filename_or_obj, **kwargs)
    162 if "cache" in kwargs:
    163     raise TypeError("cache has no effect in this context")
--> 165 with open_dataset(filename_or_obj, **kwargs) as ds:
    166     return ds.load()

File /mnt/c/Ryan_Data/Teaching/GPGN268/GPGN268-BOOK/.pixi/envs/default/lib/python3.14/site-packages/xarray/backends/api.py:588, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, create_default_indexes, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    585     kwargs.update(backend_kwargs)
    587 if engine is None:
--> 588     engine = plugins.guess_engine(filename_or_obj)
    590 if from_array_kwargs is None:
    591     from_array_kwargs = {}

File /mnt/c/Ryan_Data/Teaching/GPGN268/GPGN268-BOOK/.pixi/envs/default/lib/python3.14/site-packages/xarray/backends/plugins.py:217, in guess_engine(store_spec, must_support_groups)
    214     if not is_remote_uri(store_spec_str) and not os.path.exists(store_spec_str):
    215         raise FileNotFoundError(f"No such file: '{store_spec_str}'")
--> 217 raise ValueError(error_msg)

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html

Why didn’t this command work? Because xarray uses the netCDF data format, which is a special way of encoding and saving data. NetCDF files are saved with the .nc extension. So xarray doesn’t know what to do with a .txt file. If we want to make this data into an xarray dataset, we will need to do it ourselves.

DataArrays¶

One of the simplest ways to make a numpy array is to create it from a list:

np.array([0,1,2])

array([0, 1, 2])

In the same way, we can make a DataArray from a numpy array using the command xr.DataArray(). If you recall, a data array takes a single numpy array and “wraps” it so we can using xarray dimensions and coordinates to access it. From the xarray documentation, we can see that xr.DataArray() takes several arguments. We will go over a few of them here, but as always, refer back to the documentation for a more complete description of the function.

First, let’s load our maximum temperature data as a numpy array, and try converting it to a DataArray:

tmax_array = np.loadtxt('../data/meteo_denver_tmax_2000_2022.txt')
tmax_array.shape # To remind us what tmax_array looks like

(23, 12)

tmax_dataArray1 = xr.DataArray(data=tmax_array)
tmax_dataArray1

tmax_dataArray1.mean(dim='dim_0')

Here we just specified the data argument of xr.DataArray(). We did successfully create a DataArray, but it doesn’t have most of the things we like about dataArrays and datasets. The dimensions are just dim_0 and dim_1, which doesn’t tell us very much more than just using axis=0 or axis=1 in a numpy array. We also don’t have and coordinates or indexes, so we still have to select our data using its raw position in the array. Finally, we aren’t making use of the attributes to communicate more information about our data. In order to fix these, we will have to specify some other arguments of xr.DataArray(). Lets start by giving our DataArray some coordinates, corresponding to the months and years our data encompasses. Based on the documentation, we need to make a dictionary in the form {'dimension_name':array}.

months = np.array(['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'])
years = np.arange(2000, 2023)
tmax_dataArray2 = xr.DataArray(data=tmax_array, coords={'year':years, 'month':months})
tmax_dataArray2

tmax_dataArray2.sel(year=2013, month='MAR')

tmax_dataArray2.mean(dim='year')

This works well! We now have 2 labelled dimensions (“year” and “month”) and corresponding indexes that we can use to select datapoints directly by their year and month. We can now do all of the data analysis operations (mean, median, standard deviation) by specifying named dimensions instead of axes, just like you saw in the previous lecture.

Note: There are a lot of different ways to initialize coordinates and dimensions for DataArrays. This is the simplest and most common, where each dimension corresponds to a 1D coordinate array that is used to index your data (Like using x, y, and z values to pick a point in a mathematical function of the form f(x, y, z)). Xarray allows for the flexibility of providing coordinates that correspond to multiple dimensions, dimensions without coordinates, . . . to give more flexibility. We won’t go over it here, but if you’re curious, check out the documenation, scroll through some examples, and experiment a little to see just how much xarray is capable of.

The final thing we want to add is attributes, so that people can understand what our data means. We do this by passing a dictionary as the argument to attrs.

attrs_dict = {'title': 'Max Temperature at Denver Water Department', 'units': 'Degrees Farenheit', 'date_created': '2025-03-20',
              'description': 'Max temp by month and year from 2000 - 2022'}

tmax_dataArray3 = xr.DataArray(data=tmax_array, coords={'year':years, 'month':months}, attrs=attrs_dict)
tmax_dataArray3

We have now added attributes that explain what the data is, where it was collected, when the dataset was created, and what the units of our data are. This is very helpful when sharing data, since everything someone needs to understand data is included in the file itself. One last thing to highlight before we move on to Datasets is that we can also add attributes after the fact, using the structure shown below.

tmax_dataArray3.month.attrs['units'] = '3-letter code for months'
tmax_dataArray3

This structure also allows us to add attributes to specific variables or coordinates to add further clarification. In the example above, I added an attribute to the month coordinate specifically, to clarify that months were indexed as 3-letter strings.

Datasets¶

If a DataArray corresponds to a single numpy array, a Dataset is a collection of several numpy arrays of the same shape. Datasets can store multiple variables that have the same dimensions and coordinates together. To explore how to create them, lets load a few more meteorological arrays:

precip_array = np.loadtxt('../data/meteo_denver_precip_2000_2022.txt')
snow_array = np.loadtxt('../data/meteo_denver_snow_2000_2022.txt')

As always, we follow the xarray documentation for Datasets. The simplest way to initialize a Dataset is to give it a few well-formatted DataArrays to stack together. To see this, lets turn our precipitation and snowfall into some well formated dataArrays, and pass them to xr.Dataset() as data variables.

precip_attrs_dict = {'title': 'Precipitation at Denver Water Department', 'units': 'inches', 'date_created': '2025-03-20',
              'description': 'Precipitation by month and year from 2000 - 2022'}
precip_dataArray = xr.DataArray(data=precip_array, coords={'year':years, 'month':months}, attrs=precip_attrs_dict)

snow_attrs_dict = {'title': 'Snowfall at Denver Water Department', 'units': 'inches', 'date_created': '2025-03-20',
              'description': 'Total Snow Accumulation by month and year from 2000 - 2022'}
snow_dataArray = xr.DataArray(data=snow_array, coords={'year':years, 'month':months}, attrs=snow_attrs_dict)

meteo_ds1 = xr.Dataset(data_vars={'max_temp':tmax_dataArray3, 'precip':precip_dataArray, 'snow':snow_dataArray})
meteo_ds1

This works pretty well. Our dataset is well formatted, with “year” and “month” as indexed coordinates, and all of our DataArrays as variables. Xarray even kept our attributes, bundling them into each variable. However, creating DataArrays for each array, then using those DataArrays to make a Dataset is rather inefficient. By using a few more arguments of xr.Dataset(), we can do the whole thing in one go. To do this, we need to use coords as well as data_vars, and we need to be a little more sophisticated in how we input the arguments. In our previous example, we used xarray DataArrays. These DataArrays already have the information on which dimensions correspond to which axes and what coordinate to use to index each of those dimensions. When we use numpy arrays instead of premade DataArrays, we have to specify this to xarray manually.

data_vars: We want to use the form {'variable_name': (('dimension_names'), array)}. This tells xarray what name to assign the array (variable_name), what the data values are (array), and which dimensions this array should be indexed by and in what order (‘dimension_names’).
coords: We use the same general form of {'coordinate_name': (('dimension_names'), array)}. As before, this tells xarray what name to assign the array (coordinate_name), what the data values are (array), and which dimensions this array corresponds to and in what order (‘dimension_names’). Oftentimes (unless you have multi-dimensional coordinates or a specific need), “coordinate_name” is the same as “dimension_name” and you only have a single dimension corresponding to a 1D arary.

Now, we apply this to our meteorological data:

meteo_ds2 = xr.Dataset(data_vars={'max_temp':(('year', 'month'), tmax_array),
                                  'precip':(('year', 'month'), precip_array),
                                  'snow':(('year', 'month'), snow_array)},
                       coords={'year':('year', years), 'month':('month', months)})
meteo_ds2

Success! We have managed to make the Dataset we wanted without having to first make DataArrays. The only thing we need to do now is add attributes, which we can do in exactly the same way as we did for DataArrays. We can use the attrs argument for global attributes, or assign attributes to individual variables and coordinates using the ds['variable_name'].attrs['attribute_name'] = 'attribute_description' format.

global_attrs = {'title': 'Meteorological Data from Denver Water Department',
                'description': 'Meteorological data by month and year from 2000 - 2022',
                'date_created': '2025-02-20'}

meteo_ds3 = xr.Dataset(data_vars={'max_temp':(('year', 'month'), tmax_array),
                                  'precip':(('year', 'month'), precip_array),
                                  'snow':(('year', 'month'), snow_array)},
                       coords={'year':('year', years), 'month':('month', months)}, attrs=global_attrs)

meteo_ds3['month'].attrs['units'] = '3-letter code for months'
meteo_ds3['max_temp'].attrs['units'] = 'Degrees Farenheit'
meteo_ds3['precip'].attrs['units'] = 'inches'
meteo_ds3['snow'].attrs['units'] = 'inches'

meteo_ds3

Saving DataArrays and Datasets¶

Now that we’ve created these nice DataArrays and Datasets, we want to save them. We can do this by saving them as netCDF files using the command Dataset.to_netcdf() or DataArray.to_netcdf(). Conveniently, both work in exactly the same way. From the documentation, we can see that these functions take a lot of arguments. Fortunately, however, they are essentially all optional, and in most cases we don’t need to worry about them. For now, let’s just specify the path.

meteo_ds3.to_netcdf('../data/den_water_dpt_meteo_data.nc')

Success! We have now created a new xarray dataset from numpy arrays and saved it to a file, so that we or anyone else can open it using xarray and have access to all of the features xarray offers.