Dask Arrays with Xarray
Dask Array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed Numpy.
Parallel: Uses all of the cores on your computer
Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
Blocked Algorithms: Perform large computations by performing many smaller computations
This notebook demonstrates one of Xarray’s most powerful features: the ability to wrap dask arrays and allow users to seamlessly execute analysis code in parallel.
Learning Objectives
Learn the distinction between eager and lazy execution, and how Xarray can work either way
Understand key features of dask arrays
Work with Dask Arrays in much the same way you would work with a NumPy array
Learn that xarray DataArrays and Datasets are “dask collections” i.e. you can execute top-level dask functions such as dask.visualize(xarray_object)
Learn that all xarray built-in operations can transparently use dask
Prerequisites
Concepts |
Importance |
Notes |
---|---|---|
Necessary |
Familiarity with Data Arrays |
|
Necessary |
Familiarity with Xarray Data Structures |
Time to learn: 30-40 minutes
Imports
import dask
import dask.array as da
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
from dask.utils import format_bytes
from pythia_datasets import DATASETS
Blocked algorithms
A blocked algorithm executes on a large dataset by breaking it up into many small blocks.
For example, consider taking the sum of a billion numbers, in a single computation. This would take a while. We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.
We achieve the intended result (one sum on one billion numbers) by performing many smaller results (one thousand sums on one million numbers each, followed by another sum of a thousand numbers.)
dask.array
contains these algorithms
dask.array
implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using multiple cores. Dask coordinates these blocked algorithms using Dask graphs. Dask Arrays are also lazy, meaning that they do not evaluate until you explicitly ask for a result using the compute method.
Create a dask.array
object
If we want to create a 3D NumPy array of random values, we do it like this:
shape = (600, 200, 200)
arr = np.random.random(shape)
arr
array([[[0.72276826, 0.81939887, 0.9648053 , ..., 0.75422707,
0.45518683, 0.47013516],
[0.2290464 , 0.63189861, 0.84197834, ..., 0.95693948,
0.34410545, 0.60407552],
[0.47441026, 0.89965927, 0.34468886, ..., 0.2928259 ,
0.22009898, 0.18637576],
...,
[0.18973873, 0.46510587, 0.46521359, ..., 0.18479059,
0.55565091, 0.06490163],
[0.57006225, 0.2034277 , 0.39312264, ..., 0.26304905,
0.72284644, 0.97309118],
[0.89723195, 0.58876465, 0.17007025, ..., 0.25063083,
0.66576471, 0.34711286]],
[[0.03189923, 0.63182747, 0.52015635, ..., 0.52873366,
0.37550708, 0.52723972],
[0.10853188, 0.86012608, 0.23250878, ..., 0.96925617,
0.4080025 , 0.11750864],
[0.34739026, 0.65877474, 0.24510548, ..., 0.43640453,
0.26519044, 0.15538208],
...,
[0.98680704, 0.79659648, 0.02355848, ..., 0.63637454,
0.3990214 , 0.71100916],
[0.8661744 , 0.13790301, 0.0125664 , ..., 0.80836702,
0.90570354, 0.7084646 ],
[0.44348446, 0.33321438, 0.31369808, ..., 0.03410643,
0.17001342, 0.00466907]],
[[0.64001902, 0.19091414, 0.54252349, ..., 0.1115369 ,
0.05270739, 0.06863388],
[0.40351744, 0.71001541, 0.33376057, ..., 0.30531527,
0.33031184, 0.13564408],
[0.07036178, 0.48092483, 0.96371773, ..., 0.67454873,
0.47154619, 0.10551196],
...,
[0.97262951, 0.39756604, 0.78641545, ..., 0.86980993,
0.31590761, 0.45856953],
[0.08358006, 0.34354443, 0.18241253, ..., 0.33423266,
0.03606792, 0.45384353],
[0.51660469, 0.95032595, 0.05279355, ..., 0.58324791,
0.82246304, 0.87204739]],
...,
[[0.68204816, 0.07275065, 0.20457907, ..., 0.09093161,
0.88241962, 0.15121851],
[0.88598357, 0.01084879, 0.14406535, ..., 0.10483288,
0.91305729, 0.25706039],
[0.24111709, 0.66483059, 0.54047741, ..., 0.15209455,
0.83898073, 0.36562187],
...,
[0.37751588, 0.94306476, 0.63077786, ..., 0.9365931 ,
0.10586243, 0.51788884],
[0.52342304, 0.85619135, 0.16879181, ..., 0.65677019,
0.34328332, 0.57585425],
[0.51016987, 0.31792985, 0.63630915, ..., 0.99628686,
0.86683737, 0.45705513]],
[[0.9514087 , 0.74533276, 0.61719225, ..., 0.91601753,
0.61506059, 0.48001096],
[0.67675331, 0.22747699, 0.93086825, ..., 0.59713751,
0.7652143 , 0.67412224],
[0.92851744, 0.97182227, 0.51673962, ..., 0.9955057 ,
0.35330034, 0.10288631],
...,
[0.97263319, 0.75665321, 0.03746669, ..., 0.4454221 ,
0.17415345, 0.96588314],
[0.22532136, 0.25524652, 0.59883557, ..., 0.04544202,
0.13989966, 0.15248656],
[0.13363666, 0.10799611, 0.6050669 , ..., 0.49287486,
0.26213513, 0.20101287]],
[[0.02542001, 0.13312112, 0.9086651 , ..., 0.0463805 ,
0.95551927, 0.46371885],
[0.99416949, 0.66771427, 0.56380208, ..., 0.34877048,
0.93971194, 0.99581419],
[0.38703958, 0.16842792, 0.79062505, ..., 0.25638897,
0.84084678, 0.18494357],
...,
[0.38027878, 0.02393925, 0.02133541, ..., 0.66678385,
0.01565977, 0.15606277],
[0.23938421, 0.01135041, 0.81674295, ..., 0.86088857,
0.72929162, 0.01937985],
[0.40862767, 0.00714657, 0.76860079, ..., 0.0275748 ,
0.68829306, 0.85973984]]])
format_bytes(arr.nbytes)
'183.11 MiB'
This array contains ~183 MB
of data
Now let’s create the same array using Dask’s array interface.
darr = da.random.random(shape, chunks=(300, 100, 200))
A chunk size to tell us how to block up our array, like (300, 100, 200)
.
Specifying Chunks
There are several ways to specify chunks. In this tutorial, we will use a block shape.
darr
|
Notice that we just see a symbolic representation of the array, including its shape
, dtype
, and chunksize
. No data has been generated yet. Let’s visualize the constructed task graph.
darr.visualize()

Our array has four chunks. To generate it, Dask calls np.random.random
four times and then concatenates this together into one array.
Manipulate a dask.array
object as you would a numpy array
Now that we have an Array
we perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc…
The interface is familiar, but the actual work is different. dask_array.sum()
does not do the same thing as numpy_array.sum()
.
What’s the difference?
dask_array.sum()
builds an expression of the computation. It does not do the computation yet, also known as lazy execution. numpy_array.sum()
computes the sum immediately (eager execution).
Why the difference?
A dask.array
is split into chunks. Each chunk must have computations run on that chunk explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory.
total = darr.sum()
total
|
total.visualize()

Compute the result
dask.array
objects are lazily evaluated. Operations like .sum
build up a graph of blocked tasks to execute.
We ask for the final result with a call to .compute()
. This triggers the actual computation.
%%time
total.compute()
CPU times: user 320 ms, sys: 36.5 ms, total: 357 ms
Wall time: 249 ms
12002132.029623773
Exercise with dask.arrays
Modify the chunk size (or shape) in the random dask array, call .sum()
on the new array, and visualize how the task graph changes.
da.random.random(shape, chunks=(50, 200, 400)).sum().visualize()

Here we see Dask’s strategy for finding the sum. This simple example illustrates the beauty of Dask: it automatically designs an algorithm appropriate for custom operations with big data.
If we make our operation more complex, the graph gets more complex:
For an example, we use an arbitrarily complex calculation.
z = darr.dot(darr.T).mean(axis=0)[::2, :].std(axis=1)
z
|
z.visualize()

Testing a bigger calculation
The examples above were toy examples; the data (180 MB) is probably not big enough to warrant the use of Dask.
We can make it a lot bigger! Let’s create a new, big array
darr = da.random.random((4000, 100, 4000), chunks=(1000, 100, 500)).astype('float32')
darr
|
This dataset is ~6 GB
, rather than 32 MB! This is probably close to or greater than the amount of available RAM than you have in your computer. Nevertheless, Dask has no problem working on it.
z = (darr + darr.T)[::2, :].mean(axis=2)
z.visualize()

with ProgressBar():
computed_ds = z.compute()
[ ] | 0% Completed | 509.80 us
[ ] | 0% Completed | 108.35 ms
[ ] | 0% Completed | 209.26 ms
[ ] | 0% Completed | 309.97 ms
[ ] | 0% Completed | 411.12 ms
[ ] | 0% Completed | 511.96 ms
[ ] | 0% Completed | 613.21 ms
[ ] | 0% Completed | 717.10 ms
[ ] | 0% Completed | 818.03 ms
[ ] | 1% Completed | 918.92 ms
[ ] | 2% Completed | 1.02 s
[# ] | 3% Completed | 1.12 s
[# ] | 3% Completed | 1.22 s
[# ] | 4% Completed | 1.32 s
[## ] | 5% Completed | 1.42 s
[## ] | 5% Completed | 1.52 s
[## ] | 5% Completed | 1.62 s
[## ] | 5% Completed | 1.73 s
[## ] | 5% Completed | 1.83 s
[## ] | 5% Completed | 1.93 s
[## ] | 6% Completed | 2.03 s
[### ] | 8% Completed | 2.13 s
[### ] | 9% Completed | 2.23 s
[#### ] | 11% Completed | 2.33 s
[#### ] | 11% Completed | 2.43 s
[#### ] | 11% Completed | 2.53 s
[#### ] | 11% Completed | 2.64 s
[#### ] | 11% Completed | 2.74 s
[#### ] | 11% Completed | 2.84 s
[#### ] | 11% Completed | 2.94 s
[#### ] | 11% Completed | 3.04 s
[##### ] | 13% Completed | 3.15 s
[##### ] | 13% Completed | 3.25 s
[###### ] | 15% Completed | 3.35 s
[###### ] | 15% Completed | 3.45 s
[###### ] | 16% Completed | 3.55 s
[###### ] | 16% Completed | 3.65 s
[###### ] | 16% Completed | 3.75 s
[###### ] | 16% Completed | 3.86 s
[###### ] | 16% Completed | 3.96 s
[###### ] | 16% Completed | 4.06 s
[###### ] | 16% Completed | 4.16 s
[###### ] | 16% Completed | 4.26 s
[###### ] | 17% Completed | 4.36 s
[####### ] | 19% Completed | 4.46 s
[####### ] | 19% Completed | 4.56 s
[######## ] | 20% Completed | 4.67 s
[######## ] | 21% Completed | 4.77 s
[######## ] | 22% Completed | 4.87 s
[######## ] | 22% Completed | 4.97 s
[######## ] | 22% Completed | 5.07 s
[######## ] | 22% Completed | 5.17 s
[######## ] | 22% Completed | 5.27 s
[######## ] | 22% Completed | 5.37 s
[######### ] | 23% Completed | 5.47 s
[######### ] | 23% Completed | 5.57 s
[########## ] | 25% Completed | 5.68 s
[########## ] | 26% Completed | 5.78 s
[########### ] | 27% Completed | 5.88 s
[########### ] | 27% Completed | 5.98 s
[########### ] | 27% Completed | 6.08 s
[########### ] | 27% Completed | 6.18 s
[########### ] | 27% Completed | 6.28 s
[########### ] | 27% Completed | 6.38 s
[########### ] | 27% Completed | 6.48 s
[########### ] | 28% Completed | 6.59 s
[############ ] | 30% Completed | 6.69 s
[############ ] | 31% Completed | 6.79 s
[############# ] | 33% Completed | 6.89 s
[############# ] | 33% Completed | 6.99 s
[############# ] | 34% Completed | 7.09 s
[############# ] | 34% Completed | 7.19 s
[############# ] | 34% Completed | 7.30 s
[############# ] | 34% Completed | 7.40 s
[############## ] | 35% Completed | 7.50 s
[############## ] | 36% Completed | 7.60 s
[############## ] | 37% Completed | 7.70 s
[############### ] | 37% Completed | 7.80 s
[############### ] | 39% Completed | 7.90 s
[################ ] | 40% Completed | 8.00 s
[################ ] | 42% Completed | 8.10 s
[################# ] | 42% Completed | 8.20 s
[################# ] | 42% Completed | 8.30 s
[################# ] | 42% Completed | 8.40 s
[################# ] | 42% Completed | 8.50 s
[################# ] | 42% Completed | 8.61 s
[################# ] | 44% Completed | 8.71 s
[################## ] | 46% Completed | 8.81 s
[################## ] | 47% Completed | 8.91 s
[################### ] | 48% Completed | 9.01 s
[################### ] | 49% Completed | 9.11 s
[################### ] | 49% Completed | 9.21 s
[################### ] | 49% Completed | 9.31 s
[################### ] | 49% Completed | 9.41 s
[################### ] | 49% Completed | 9.51 s
[#################### ] | 51% Completed | 9.61 s
[#################### ] | 52% Completed | 9.71 s
[##################### ] | 53% Completed | 9.82 s
[##################### ] | 54% Completed | 9.92 s
[##################### ] | 54% Completed | 10.02 s
[##################### ] | 54% Completed | 10.12 s
[##################### ] | 54% Completed | 10.22 s
[##################### ] | 54% Completed | 10.32 s
[###################### ] | 55% Completed | 10.42 s
[###################### ] | 56% Completed | 10.52 s
[####################### ] | 58% Completed | 10.62 s
[####################### ] | 59% Completed | 10.72 s
[####################### ] | 59% Completed | 10.82 s
[####################### ] | 59% Completed | 10.92 s
[####################### ] | 59% Completed | 11.03 s
[####################### ] | 59% Completed | 11.13 s
[######################## ] | 60% Completed | 11.23 s
[######################### ] | 62% Completed | 11.33 s
[######################### ] | 63% Completed | 11.43 s
[######################### ] | 64% Completed | 11.53 s
[########################## ] | 65% Completed | 11.63 s
[########################## ] | 65% Completed | 11.73 s
[########################## ] | 65% Completed | 11.83 s
[########################## ] | 65% Completed | 11.93 s
[########################## ] | 65% Completed | 12.03 s
[########################## ] | 67% Completed | 12.14 s
[########################### ] | 68% Completed | 12.24 s
[########################### ] | 69% Completed | 12.34 s
[############################ ] | 70% Completed | 12.44 s
[############################# ] | 73% Completed | 12.54 s
[############################# ] | 74% Completed | 12.64 s
[############################# ] | 74% Completed | 12.74 s
[############################# ] | 74% Completed | 12.84 s
[############################# ] | 74% Completed | 12.94 s
[############################# ] | 74% Completed | 13.04 s
[############################## ] | 75% Completed | 13.14 s
[############################## ] | 76% Completed | 13.24 s
[############################### ] | 78% Completed | 13.34 s
[############################### ] | 79% Completed | 13.45 s
[################################ ] | 80% Completed | 13.55 s
[################################ ] | 80% Completed | 13.65 s
[################################ ] | 80% Completed | 13.75 s
[################################ ] | 80% Completed | 13.85 s
[################################ ] | 81% Completed | 13.95 s
[################################ ] | 81% Completed | 14.05 s
[################################# ] | 83% Completed | 14.15 s
[################################# ] | 84% Completed | 14.26 s
[################################## ] | 85% Completed | 14.36 s
[################################## ] | 85% Completed | 14.46 s
[################################## ] | 85% Completed | 14.56 s
[################################## ] | 85% Completed | 14.66 s
[################################## ] | 85% Completed | 14.76 s
[################################## ] | 87% Completed | 14.86 s
[################################### ] | 88% Completed | 14.96 s
[################################### ] | 89% Completed | 15.06 s
[#################################### ] | 91% Completed | 15.16 s
[##################################### ] | 92% Completed | 15.26 s
[##################################### ] | 93% Completed | 15.36 s
[##################################### ] | 93% Completed | 15.47 s
[##################################### ] | 93% Completed | 15.57 s
[##################################### ] | 93% Completed | 15.67 s
[##################################### ] | 93% Completed | 15.77 s
[##################################### ] | 94% Completed | 15.87 s
[###################################### ] | 95% Completed | 15.97 s
[###################################### ] | 97% Completed | 16.07 s
[####################################### ] | 99% Completed | 16.17 s
[########################################] | 100% Completed | 16.27 s
Dask Arrays with Xarray
Often times, you won’t be directly interacting with dask.arrays
directly; odds are you will be interacting with them via Xarray
! Xarray wraps NumPy arrays similar to how we showed in the previous section, helping the user jump right into the dask array interface!
Reading data with Dask
and Xarray
Recall that a Dask’s array consists of many chunked arrays:
darr
|
To read data as Dask arrays with Xarray, we need to specify the chunks
argument to open_dataset()
function.
ds = xr.open_dataset(DATASETS.fetch('CESM2_sst_data.nc'), chunks={})
ds.tos
/usr/share/miniconda3/envs/pythia-book-dev/lib/python3.10/site-packages/xarray/conventions.py:543: SerializationWarning: variable 'tos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
<xarray.DataArray 'tos' (time: 180, lat: 180, lon: 360)> dask.array<open_dataset-b3fb23ec235c79a96ed2f4f1f892be07tos, shape=(180, 180, 360), dtype=float32, chunksize=(180, 180, 360), chunktype=numpy.ndarray> Coordinates: * time (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00 * lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5 * lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5 Attributes: (12/19) cell_measures: area: areacello cell_methods: area: mean where sea time: mean comment: Model data on the 1x1 grid includes values in all cells f... description: This may differ from "surface temperature" in regions of ... frequency: mon id: tos ... ... time_label: time-mean time_title: Temporal mean title: Sea Surface Temperature type: real units: degC variable_id: tos
Passing chunks={}
to open_dataset()
works, but since we didn’t tell dask how to split up (or chunk) the array, Dask will create a single chunk for our array.
The chunks
here indicate how values should go into each chunk - for example, chunks={'time':90}
will tell Xarray
+ Dask
to place 90 time slices into a single chunk.
Notice how tos
(sea surface temperature) is now split in the time dimension, with two chunks (since there are a total of 180 time slices in this dataset).
ds = xr.open_dataset(
DATASETS.fetch('CESM2_sst_data.nc'),
engine="netcdf4",
chunks={"time": 90, "lat": 180, "lon": 360},
)
ds.tos
/usr/share/miniconda3/envs/pythia-book-dev/lib/python3.10/site-packages/xarray/conventions.py:543: SerializationWarning: variable 'tos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
<xarray.DataArray 'tos' (time: 180, lat: 180, lon: 360)> dask.array<open_dataset-9659494e9c2d7a2fbcc7df4c9bc8a9b8tos, shape=(180, 180, 360), dtype=float32, chunksize=(90, 180, 360), chunktype=numpy.ndarray> Coordinates: * time (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00 * lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5 * lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5 Attributes: (12/19) cell_measures: area: areacello cell_methods: area: mean where sea time: mean comment: Model data on the 1x1 grid includes values in all cells f... description: This may differ from "surface temperature" in regions of ... frequency: mon id: tos ... ... time_label: time-mean time_title: Temporal mean title: Sea Surface Temperature type: real units: degC variable_id: tos
Calling .chunks
on the tos
xarray.DataArray
displays the number of slices in each dimension within each chunk, with (time
, lat
, lon
). Notice how there are now two chunks each with 90 time slices for the time dimension.
ds.tos.chunks
((90, 90), (180,), (360,))
Xarray data structures are first-class dask collections
This means you can call the following functions
dask.visualize(...)
dask.compute(...)
on both xarray.DataArray
and xarray.Dataset
objects backed by dask.array
.
Let’s visualize our dataset using Dask!
dask.visualize(ds)

Parallel and lazy computation using dask.array
with Xarray
Xarray seamlessly wraps Dask so all computation is deferred until explicitly requested.
z = ds.tos.mean(['lat', 'lon']).dot(ds.tos.T)
z
<xarray.DataArray 'tos' (lon: 360, lat: 180)> dask.array<sum-aggregate, shape=(360, 180), dtype=float32, chunksize=(360, 180), chunktype=numpy.ndarray> Coordinates: * lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5 * lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
As you can see, z
contains a Dask array. This is true for all Xarray built-in operations including subsetting
z.isel(lat=0)
<xarray.DataArray 'tos' (lon: 360)> dask.array<getitem, shape=(360,), dtype=float32, chunksize=(360,), chunktype=numpy.ndarray> Coordinates: lat float64 -89.5 * lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
We can visualize this subset too!
dask.visualize(z)

Now that we have prepared our calculation, we can go ahead and call .compute()
with ProgressBar():
computed_ds = z.compute()
[ ] | 0% Completed | 271.90 us
[########################################] | 100% Completed | 101.59 ms
Summary
We saw that we can use Xarray
to access dask.arrays
, which required passing a chunks
argument to upon opening the dataset. Once the data were loaded into Xarray, we could interact with the xarray.Datasets
and xarray.DataArrays
as we would if we were working with dask.arrays
. This can be a powerful tool for working with larger-than-memory datasets!
Dask Shortcomings
Dask Array does not implement the entire Numpy interface. Users expecting this will be disappointed. Notably Dask Array has the following failings:
Dask Array does not support some operations where the resulting shape depends on the values of the array. For those that it does support (for example, masking one Dask Array with another boolean mask), the chunk sizes will be unknown, which may cause issues with other operations that need to know the chunk sizes.
Dask Array does not attempt operations like
sort
which are notoriously difficult to do in parallel and are of somewhat diminished value on very large data (you rarely actually need a full sort). Often we include parallel-friendly alternatives liketopk
.Dask development is driven by immediate need, and so many lesser used functions, like
np.sometrue
have simply not been implemented yet. These would make excellent community contributions.
Learn More
Visit the Dask Array documentation. In particular, this array screencast will reinforce the concepts you learned here.
from IPython.display import YouTubeVideo
YouTubeVideo(id="9h_61hXCDuI", width=600, height=300)
Resources and references
Reference
Ask for help
dask
tag on Stack Overflow, for usage questionsgithub discussions: dask for general, non-bug, discussion, and usage questions
github issues: dask for bug reports and feature requests
github discussions: xarray for general, non-bug, discussion, and usage questions
github issues: xarray for bug reports and feature requests
Pieces of this notebook are adapted from the following sources