The goal of this seminar is to demonstrate some of the capabilities of the xarray and pandas python packages.
To be able to use these packages, only a very superficial knowledge of the python syntax is needed.
Most of you already know Matlab or R. For you, learning python will be easy if you want to. Try resources such as Think Python and Python for Data Analysis.
Now we will go through a very brief crash course on python syntax for those of you seeing it for the first time.
In python we always deal with objects. The "=" sign is the assignment operator, like "<-" in R.
Python is designed to be readable and intuitive:
text = "Hello world"
print(text)
There are different types of objects, and the user can define their own kinds of objects, that is, classes.
The most important built-in classes are str (like the text in the previous example), tuple, set, list and dict.
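For reference, a minimal sketch of these built-in classes (the names and values here are just illustrative):

```python
# A quick look at the basic built-in classes
text = "Hello"                   # str: text
point = (1.5, 2.5)               # tuple: immutable sequence
unique = {1, 2, 3}               # set: unordered, no duplicates
values = [10, 20, 30]            # list: mutable sequence
ages = {"Ana": 30, "Luis": 25}   # dict: key-value mapping
print(values[0], ages["Ana"])
```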
More interestingly, there are high-level packages, like pandas and xarray, that define custom classes to easily deal with structured data such as time series, data frames and netCDFs.
We will be skipping the introduction to basic python classes, and see an example with the pandas package.
We use the import statement to get the pandas package into our namespace.
import pandas as pd
All the pandas functions will be available under the "pd" name. We don't use the full pandas name to keep the names shorter.
For example, the read_csv function, which we can use to read data from comma-separated ASCII files, is called by typing pd.read_csv(arguments).
In python, parentheses () are always used to call functions.
We are going to open the famous iris flower dataset.
data = pd.read_csv("iris.csv")
data
The name "data" is now assigned to an object that belongs to the pandas.DataFrame class.
These objects are indexable. This means that it is possible to subset them using python indexing syntax, with square brackets: data[start:stop:step].
In the case of data frames, this subsets them by rows. Let's see the first ten rows of data.
data[0:10]
We can also subset by column, using the column name:
data["SepalLength"]
Python objects may contain attributes, which are objects that can be called by typing object_name.attribute_name.
For example, the "columns" attribute of data frames is a list of the column names (not exactly a list, but a pandas.Index).
data.columns
Methods are attributes that are functions. The first argument of these functions is implied to be the object itself. In the object-oriented programming paradigm, they can be interpreted as the object acting on itself and other objects.
But you don't have to worry about this. They are just functions. For example, in pandas you can easily compute the mean of each column of the data frame with the mean method.
Note the parentheses (), needed because we are calling a function, unlike in the .columns example.
When we call this function, there are no explicit arguments but, as "mean" is a method, the first argument is implicitly the "data" object.
data.mean()
Methods can be chained, so we can do many actions in one line, and keep it readable.
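As a minimal, self-contained sketch of chaining (we build a small stand-in data frame here, with illustrative values rather than the real iris data):

```python
import pandas as pd

# A small stand-in data frame (illustrative values)
df = pd.DataFrame({"SepalLength": [5.1, 4.9, 6.3],
                   "SepalWidth": [3.5, 3.0, 3.3]})

# Chain two methods: compute the column means, then round to one decimal
result = df.mean().round(1)
print(result)
```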
The pandas plot method is great for easy and fast visualization.
%matplotlib inline is a "magic command", used to tell ipython to embed the plots in the notebook.
%matplotlib inline
data.mean().plot(kind='bar')
To whet your appetite, here is an example with an advanced visualization library called seaborn, designed to work with pandas data frames.
Note how few lines of code are needed to produce this plot.
import seaborn as sns
sns.pairplot(data, hue="Name")
Xarray is a python package designed to manipulate labeled N-dimensional arrays. It is very well documented at http://xray.readthedocs.io/en/stable/.
An "unlabeled" array is simply an n-dimensional matrix of numbers. In python, the numpy package is the way to deal with these arrays: defining them, subsetting them, and performing all kinds of arithmetic operations.
Together with the matplotlib package for plotting, numpy provides an environment similar to Matlab. scipy, statsmodels, scikit-learn and pandas are the other essential packages for scientific programming with python. Except for pandas, they will not be part of this seminar.
The following is an example of a 1D unlabeled array.
import numpy as np
array1d = np.random.normal(size=365) # Define a random 1D array with 365 numbers
array1d[10:20]
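Arithmetic on numpy arrays is element-wise and vectorized; a minimal sketch with illustrative values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

c = a + b   # element-wise sum
d = a * 2   # a scalar is broadcast over the whole array
print(c, d, a.mean())
```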
The pandas package provides an interface for dealing with 1D and 2D labeled arrays, named Series and DataFrames.
Series are 1D arrays (i.e. vectors) with an index, which can be, for example, a list of times:
times = pd.date_range("2001-01-01", "2001-12-31", freq="D") # Define a list of 365 daily dates
time_series = pd.Series(array1d, index=times) # Defines the pandas.Series object
time_series
We can use the "loc" attribute to select by label, for example a range of dates:
time_series.loc["2001-03-01":"2001-05-31"]
We can directly use the plot method to see what we selected.
pandas also labels the x axis automatically, with the dates appropriately formatted.
time_series.loc["2001-03-01":"2001-05-31"].plot()
In order to do this with the standard indexing syntax, we would need to know the index positions of the first day of March and the last day of May.
We can still use this syntax with the "iloc" attribute, but it is less intuitive and more error-prone!
time_series.iloc[59:151]
Pandas is an extremely powerful and useful package. However, it is not able to deal with labeled arrays of more than two dimensions, and gridded climate data usually has at least three: time, latitude and longitude.
xarray is devoted exactly to this: labeled N-dimensional arrays. As they say on their website:
Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multidimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.
These are the basic classes of xarray.
xarray.DataArray objects are the labeled arrays themselves, while xarray.Dataset objects are essentially collections (dictionaries) of named DataArrays.
xarray supports attributes, so Datasets are a full in-memory representation of netCDF files.
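As a minimal sketch of how these classes relate (all values here are made up for illustration), we can build a small Dataset by hand:

```python
import numpy as np
import xarray as xr

# A small 2D DataArray with named dimensions, coordinates and attributes
temps = xr.DataArray(
    np.zeros((2, 3)),
    dims=("lat", "lon"),
    coords={"lat": [35.0, 36.0], "lon": [-6.0, -5.0, -4.0]},
    attrs={"units": "degC"},  # attributes travel with the array
)

# A Dataset is a dictionary-like collection of named DataArrays
ds = xr.Dataset({"tas": temps})
print(ds)
```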
For example, let's play with this netCDF. It contains global gridded temperatures from the Berkeley Earth dataset, starting in the year 1750.
!ncdump -h BEST_tasmean.nc # Prepending ! we can call the shell from ipython
We can open it easily with xarray by using the open_dataset function:
import xarray as xr
tasmean_file = "BEST_tasmean.nc"
dataset = xr.open_dataset(tasmean_file)
dataset
At this point, xarray has not loaded the data in the netCDF into memory; it has just parsed its structure. This philosophy is called "lazy loading".
The netCDF is opened as an xarray.Dataset object with one data variable and three coordinates.
We can get the data variable as an xarray.DataArray object, as an attribute with the variable name.
dataset.tasmean
We can define new variables with arithmetic operations, using the dictionary syntax. For example, the temperature in Kelvin:
dataset["tasmean_K"] = dataset.tasmean + 273.15
dataset
We can add an attribute to tasmean_K easily, for example the units, using the dictionary syntax.
After this, saving to a netCDF is trivial, with the "to_netcdf" method.
dataset.tasmean_K.attrs["units"] = "degrees_kelvin"
dataset.to_netcdf("dataset.nc")
!ncdump -h dataset.nc
The following is a non-exhaustive list of examples of how xarray makes your life easier when dealing with netCDFs.
I know from experience that programming these tasks with lower-level packages (in python we would use scipy.io.netcdf or netCDF4, numpy and datetime) requires many more lines and is much more error-prone.
Suppose we want to know how the temperature evolved in a given location during the past decades. We know the longitude and latitude of the location.
With xarray we can extract this time series almost immediately by using the "sel" method.
lon_st = -5.3795 # Location of the Grazalema station of CLIMADAT
lat_st = 36.6956
tasmean_st = dataset.tasmean.sel(lon=lon_st, lat=lat_st, method='nearest')
tasmean_st
With method="nearest" we are telling xarray to automatically search for the grid point nearest to the given point. In this case, we can see that it is (-5.5, 36.5).
Note that xarray automatically decoded the time coordinate.
Now let's plot the series.
tasmean_st.plot()
Now, that was easy! But the plot is noisy. Let's say we want the data averaged by year. We can use the resample method.
tasmean_st_yearly = tasmean_st.resample(dim='time', freq='AS', how='mean')
tasmean_st_yearly.plot()
We have seen how to select a (lon, lat) point. But what if we want to select an interval?
To subset an interval from a labeled axis, we can use python's slice(start, stop, step) objects. Here we use one to get the last 115 years:
tasmean_lastc = tasmean_st_yearly.sel(time=slice("1900-01-01", "2015-01-01"))
tasmean_lastc.plot()
Task: Plot the 1971-2000 temperature climatology over the Iberian Peninsula. We can do this easily with the sel and the mean methods.
# Select the box
tasmean_ip = dataset.tasmean.sel(lon=slice(-11, 3), lat=slice(35, 45))
# Select the time period
tasmean_ip_bp = tasmean_ip.sel(time=slice("1971-01-01", "2000-12-31"))
# Compute the mean along the time axis
tasmean_ip_clim = tasmean_ip_bp.mean(dim="time")
# Plot!
tasmean_ip_clim.plot()
We get axis and colorbar labels automatically, but no coastline.
In order to plot the coastline, we need to use the cartopy package. It can be difficult to install... try conda install -c scitools cartopy.
Apart from cartopy, the basemap package can also be used to plot maps. It is a good package, but it is not as well integrated with xarray as cartopy is.
We need to import pyplot from matplotlib as well.
import cartopy.crs as crs
from matplotlib import pyplot as plt
# Define an axes object with a projection
ax = plt.axes(projection=crs.PlateCarree())
# Plot the coastline
ax.coastlines(lw=1)
# Plot the mesh
tasmean_ip_clim.plot()
Now let's say that we want to see the anomaly of the year 2010 with respect to this baseline.
Arrays with different dimensions are automatically aligned when doing arithmetics, taking into account dimension names.
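A minimal sketch of this alignment, with illustrative values: subtracting a (lat, lon) array from a (time, lat, lon) array broadcasts over time automatically, because xarray matches dimensions by name.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(12.0).reshape(2, 2, 3),
    dims=("time", "lat", "lon"),
)
clim = data.mean(dim="time")   # dimensions (lat, lon)

anom = data - clim             # broadcast over "time" by dimension name
print(anom)
```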
# Aggregate by years
tasmean_ip_yearly = tasmean_ip.resample(dim='time', freq='AS', how='mean')
# Select the year 2010
tasmean_ip_2010 = tasmean_ip_yearly.sel(time="2010-01-01")
# Compute the anomaly
tasmean_ip_2010_anom = tasmean_ip_2010 - tasmean_ip_clim
# Plot!
tasmean_ip_2010_anom.plot()
Now suppose we want to plot the climatology for the four seasons separately. We can do this using groupby.
This method lets us divide a dataset into groups and then apply a function over them.
In order to group our dataset by seasons, we need a vector with the season of each date. In xarray there is a fast way to get this:
tasmean_ip["time.season"]
So we can use this feature to group the data by seasons and average each group.
dates_seasons = tasmean_ip["time.season"]
tasmean_ip_seas_clim = tasmean_ip.groupby(dates_seasons).mean(dim='time')
Now, the keyword arguments "col" and "col_wrap" can be used to plot many maps at the same time, along a dimension:
tasmean_ip_seas_clim.plot(col='season', col_wrap=2)
One aspect in which xarray excels is when our data has many dimensions.
In our last example we had three dimensions, but a typical ensemble has four: time, ensemble member, latitude and longitude.
In this example we are going to open a temperature forecast from the ECMWF sub-seasonal ensemble, with 10 members, started on July 30th, 2003.
ds = xr.open_dataset("tas_ens_20030730.nc")
ds
tas_0803 = ds.tas.sel(time="2003-08-03T12:00")
tas_0803
We can easily plot the ensemble mean:
ax = plt.axes(projection=crs.PlateCarree())
ax.coastlines(lw=1)
tas_0803.mean(dim="member").plot()
We can plot the ensemble spread, using the standard deviation, to see where the members disagree the most:
ax = plt.axes(projection=crs.PlateCarree())
ax.coastlines(color="white", lw=1)
tas_0803.std(dim="member").plot()
We can also plot all the ensemble members, although the coastline trick doesn't work here.
tas_0803.plot(col="member", col_wrap=3)
Finally, we can plot the forecast for Grazalema.
lon_st = -5.3795 # Location of the Grazalema station of CLIMADAT
lat_st = 36.6956
# Read the data
tas_graz = ds.tas.sel(lon=lon_st, lat=lat_st, method='nearest')
# Convert from 12-hourly freq to daily.
tas_graz_daily = tas_graz.resample(dim="time", freq="D", how="mean")
# Convert to celsius
tas_graz_daily = tas_graz_daily - 273.15
# Convert to pandas.DataFrame, transpose and plot
tas_graz_daily.to_pandas().T.plot(legend=False)
# Improve the plot
plt.legend(ncol=4, loc="lower center")
plt.ylabel("Temperature (Celsius)")
plt.title("ECMWF forecast for Grazalema")
We can see how the ensemble members de-correlate as the forecast time increases.
We saw how xarray can be used to solve typical problems in climate data analysis in an elegant way.
Indexing arrays based on labels instead of positions makes the code more readable and is less error-prone.
Dealing with dates, months, seasons, etc. is greatly simplified compared to lower-level libraries such as netCDF4.
It only requires a very superficial knowledge of python.
It is integrated with pandas and with plotting libraries (matplotlib, cartopy) to easily produce professional quality figures.
It is very well integrated with netCDF Input/Output as it uses the same data model.
It is open source (forget about paying licenses).
R functions and packages can be called from python with the rpy2 package.
xarray can be used to remotely access OpenDAP datasets, subset them and download only the data you need.
It supports opening datasets scattered in a collection of files, and concatenating datasets over existing or new dimensions.
It can deal with large data that does not fit in the RAM memory of the computer, by using the dask package.
It is under active development, with more features coming: https://github.com/pydata/xarray
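As an in-memory sketch of the concatenation feature mentioned above (illustrative values; open_mfdataset applies the same idea across a collection of files):

```python
import xarray as xr

# Two arrays representing, say, two consecutive time steps
jan = xr.DataArray([1.0, 2.0], dims="x")
feb = xr.DataArray([3.0, 4.0], dims="x")

# Concatenate them along a new "time" dimension
combined = xr.concat([jan, feb], dim="time")
print(combined.shape)
```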