../../_images/dask_horizontal.svg

Dask Arrays with Xarray

Dask Array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed Numpy.

  • Parallel: Uses all of the cores on your computer

  • Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.

  • Blocked Algorithms: Perform large computations by performing many smaller computations

This notebook demonstrates one of Xarray’s most powerful features: the ability to wrap dask arrays and allow users to seamlessly execute analysis code in parallel.

Learning Objectives

  • Learn the distinction between eager and lazy execution, and how Xarray can work either way

  • Understand key features of dask arrays

  • Work with Dask Arrays in much the same way you would work with a NumPy array

  • Learn that xarray DataArrays and Datasets are “dask collections” i.e. you can execute top-level dask functions such as dask.visualize(xarray_object)

  • Learn that all xarray built-in operations can transparently use dask

Prerequisites

Concepts

Importance

Notes

Introduction to NumPy

Necessary

Familiarity with Data Arrays

Introduction to Xarray

Necessary

Familiarity with Xarray Data Structures

  • Time to learn: 30-40 minutes

Imports

import dask
import dask.array as da
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
from dask.utils import format_bytes
from pythia_datasets import DATASETS

Blocked algorithms

A blocked algorithm executes on a large dataset by breaking it up into many small blocks.

For example, consider taking the sum of a billion numbers, in a single computation. This would take a while. We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.

We achieve the intended result (one sum on one billion numbers) by performing many smaller results (one thousand sums on one million numbers each, followed by another sum of a thousand numbers.)

dask.array contains these algorithms

dask.array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using multiple cores. Dask coordinates these blocked algorithms using Dask graphs. Dask Arrays are also lazy, meaning that they do not evaluate until you explicitly ask for a result using the compute method.

Create a dask.array object

If we want to create a 3D NumPy array of random values, we do it like this:

shape = (600, 200, 200)
arr = np.random.random(shape)
arr
array([[[0.99508135, 0.22175797, 0.44016924, ..., 0.51859924,
         0.82781135, 0.4346433 ],
        [0.21492685, 0.19958331, 0.12155469, ..., 0.10792508,
         0.31897276, 0.12812765],
        [0.8987877 , 0.09799323, 0.41508226, ..., 0.77822943,
         0.85431631, 0.17619869],
        ...,
        [0.34078722, 0.22900818, 0.32147336, ..., 0.74517698,
         0.873438  , 0.74972791],
        [0.06487327, 0.86291262, 0.84977461, ..., 0.09678896,
         0.0892313 , 0.52207807],
        [0.44737615, 0.27610508, 0.97695002, ..., 0.9427538 ,
         0.09441916, 0.73173758]],

       [[0.77580518, 0.55518173, 0.54925921, ..., 0.14546325,
         0.08337601, 0.83639   ],
        [0.73724986, 0.76258422, 0.158717  , ..., 0.46612074,
         0.97695899, 0.14601021],
        [0.30036694, 0.46615932, 0.74979414, ..., 0.16753985,
         0.55751268, 0.56302985],
        ...,
        [0.86494074, 0.05466872, 0.12272287, ..., 0.11644158,
         0.17165432, 0.91463584],
        [0.82696462, 0.46708352, 0.6529604 , ..., 0.6933393 ,
         0.27345092, 0.45866493],
        [0.05171999, 0.31780065, 0.01830433, ..., 0.49203931,
         0.30371828, 0.45443298]],

       [[0.71845171, 0.04635316, 0.59139941, ..., 0.95579403,
         0.93763992, 0.23628609],
        [0.96813849, 0.37617541, 0.56112797, ..., 0.1828069 ,
         0.28072752, 0.10627897],
        [0.17538941, 0.48845481, 0.99540807, ..., 0.42270886,
         0.09139068, 0.91347475],
        ...,
        [0.55206258, 0.93068552, 0.02121214, ..., 0.20028049,
         0.36867222, 0.55463405],
        [0.50996981, 0.03780998, 0.22449516, ..., 0.29084853,
         0.29255471, 0.77819078],
        [0.09592503, 0.28095225, 0.03016769, ..., 0.17588537,
         0.37462336, 0.75778952]],

       ...,

       [[0.51583849, 0.46700049, 0.05855181, ..., 0.2834727 ,
         0.63218357, 0.06237733],
        [0.42247255, 0.96958554, 0.22576217, ..., 0.39246866,
         0.78072914, 0.13774569],
        [0.09862352, 0.97684248, 0.04441755, ..., 0.48608178,
         0.75899973, 0.29868631],
        ...,
        [0.09523379, 0.21418685, 0.56633924, ..., 0.14222359,
         0.32849902, 0.19974628],
        [0.36740356, 0.84115563, 0.4580744 , ..., 0.36096674,
         0.32771023, 0.56241503],
        [0.51998532, 0.47299688, 0.79652667, ..., 0.95738118,
         0.6577222 , 0.80682714]],

       [[0.69817986, 0.34326248, 0.41977565, ..., 0.75167892,
         0.80906173, 0.73010637],
        [0.8155196 , 0.35979306, 0.05416557, ..., 0.21630959,
         0.21079194, 0.00246903],
        [0.61892995, 0.17930534, 0.34425143, ..., 0.52079742,
         0.1408626 , 0.13932697],
        ...,
        [0.55957662, 0.59242124, 0.71167422, ..., 0.86397543,
         0.68944328, 0.35688402],
        [0.32719852, 0.58705698, 0.78028058, ..., 0.31564736,
         0.86715815, 0.64449163],
        [0.52745122, 0.82665065, 0.90211923, ..., 0.495561  ,
         0.29581591, 0.62738837]],

       [[0.45359721, 0.49268539, 0.97281797, ..., 0.97774981,
         0.66618505, 0.24045622],
        [0.15404265, 0.88015227, 0.62825949, ..., 0.7504875 ,
         0.40360263, 0.44151572],
        [0.19037972, 0.69336863, 0.46827396, ..., 0.56768649,
         0.0579564 , 0.5199718 ],
        ...,
        [0.5730851 , 0.99282458, 0.42381815, ..., 0.66493741,
         0.35763611, 0.62651205],
        [0.647908  , 0.23980809, 0.70427349, ..., 0.36553028,
         0.11685165, 0.7729463 ],
        [0.79896199, 0.84595986, 0.01267043, ..., 0.80424287,
         0.88969108, 0.77516427]]])
format_bytes(arr.nbytes)
'183.11 MiB'

This array contains ~183 MB of data

Now let’s create the same array using Dask’s array interface.

darr = da.random.random(shape, chunks=(300, 100, 200))

A chunk size to tell us how to block up our array, like (300, 100, 200).

Specifying Chunks

There are several ways to specify chunks. In this tutorial, we will use a block shape.

darr
Array Chunk
Bytes 183.11 MiB 45.78 MiB
Shape (600, 200, 200) (300, 100, 200)
Dask graph 4 chunks in 1 graph layer
Data type float64 numpy.ndarray
200 200 600

Notice that we just see a symbolic representation of the array, including its shape, dtype, and chunksize. No data has been generated yet. Let’s visualize the constructed task graph.

darr.visualize()
../../_images/dask-arrays-xarray_16_0.png

Our array has four chunks. To generate it, Dask calls np.random.random four times and then concatenates this together into one array.

Manipulate a dask.array object as you would a numpy array

Now that we have an Array we perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc…

The interface is familiar, but the actual work is different. dask_array.sum() does not do the same thing as numpy_array.sum().

What’s the difference?

dask_array.sum() builds an expression of the computation. It does not do the computation yet, also known as lazy execution. numpy_array.sum() computes the sum immediately (eager execution).

Why the difference?

A dask.array is split into chunks. Each chunk must have computations run on that chunk explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory.

total = darr.sum()
total
Array Chunk
Bytes 8 B 8 B
Shape () ()
Dask graph 1 chunks in 3 graph layers
Data type float64 numpy.ndarray
total.visualize()
../../_images/dask-arrays-xarray_20_0.png

Compute the result

dask.array objects are lazily evaluated. Operations like .sum build up a graph of blocked tasks to execute.

We ask for the final result with a call to .compute(). This triggers the actual computation.

%%time
total.compute()
CPU times: user 225 ms, sys: 56.3 ms, total: 282 ms
Wall time: 148 ms
12000696.610552333

Exercise with dask.arrays

Modify the chunk size (or shape) in the random dask array, call .sum() on the new array, and visualize how the task graph changes.

da.random.random(shape, chunks=(50, 200, 400)).sum().visualize()
../../_images/dask-arrays-xarray_24_0.png

Here we see Dask’s strategy for finding the sum. This simple example illustrates the beauty of Dask: it automatically designs an algorithm appropriate for custom operations with big data.

If we make our operation more complex, the graph gets more complex:

For an example, we use an arbitrarily complex calculation.

z = darr.dot(darr.T).mean(axis=0)[::2, :].std(axis=1)
z
Array Chunk
Bytes 468.75 kiB 117.19 kiB
Shape (100, 600) (50, 300)
Dask graph 4 chunks in 12 graph layers
Data type float64 numpy.ndarray
600 100
z.visualize()
../../_images/dask-arrays-xarray_28_0.png

Testing a bigger calculation

The examples above were toy examples; the data (180 MB) is probably not big enough to warrant the use of Dask.

We can make it a lot bigger! Let’s create a new, big array

darr = da.random.random((4000, 100, 4000), chunks=(1000, 100, 500)).astype('float32')
darr
Array Chunk
Bytes 5.96 GiB 190.73 MiB
Shape (4000, 100, 4000) (1000, 100, 500)
Dask graph 32 chunks in 2 graph layers
Data type float32 numpy.ndarray
4000 100 4000

This dataset is ~6 GB, rather than 32 MB! This is probably close to or greater than the amount of available RAM than you have in your computer. Nevertheless, Dask has no problem working on it.

z = (darr + darr.T)[::2, :].mean(axis=2)
z.visualize()
../../_images/dask-arrays-xarray_33_0.png
with ProgressBar():
    computed_ds = z.compute()
[                                        ] | 0% Completed | 312.90 us
[                                        ] | 0% Completed | 108.41 ms
[                                        ] | 0% Completed | 209.60 ms
[                                        ] | 0% Completed | 310.59 ms
[                                        ] | 0% Completed | 411.82 ms
[                                        ] | 0% Completed | 512.80 ms
[                                        ] | 0% Completed | 613.82 ms
[                                        ] | 0% Completed | 717.04 ms
[                                        ] | 0% Completed | 818.02 ms
[                                        ] | 0% Completed | 919.02 ms
[                                        ] | 0% Completed | 1.02 s
[                                        ] | 0% Completed | 1.12 s
[                                        ] | 1% Completed | 1.22 s
[                                        ] | 1% Completed | 1.33 s
[                                        ] | 2% Completed | 1.43 s
[#                                       ] | 2% Completed | 1.53 s
[#                                       ] | 3% Completed | 1.63 s
[#                                       ] | 3% Completed | 1.74 s
[#                                       ] | 4% Completed | 1.84 s
[##                                      ] | 5% Completed | 1.94 s
[##                                      ] | 5% Completed | 2.04 s
[##                                      ] | 5% Completed | 2.14 s
[##                                      ] | 5% Completed | 2.25 s
[##                                      ] | 5% Completed | 2.35 s
[##                                      ] | 5% Completed | 2.45 s
[##                                      ] | 5% Completed | 2.55 s
[##                                      ] | 5% Completed | 2.65 s
[##                                      ] | 6% Completed | 2.75 s
[##                                      ] | 7% Completed | 2.85 s
[###                                     ] | 8% Completed | 2.95 s
[###                                     ] | 8% Completed | 3.05 s
[###                                     ] | 9% Completed | 3.16 s
[####                                    ] | 10% Completed | 3.26 s
[####                                    ] | 11% Completed | 3.36 s
[####                                    ] | 11% Completed | 3.46 s
[####                                    ] | 11% Completed | 3.56 s
[####                                    ] | 11% Completed | 3.66 s
[####                                    ] | 11% Completed | 3.77 s
[####                                    ] | 11% Completed | 3.87 s
[####                                    ] | 11% Completed | 3.97 s
[####                                    ] | 11% Completed | 4.07 s
[####                                    ] | 11% Completed | 4.18 s
[#####                                   ] | 12% Completed | 4.28 s
[#####                                   ] | 13% Completed | 4.38 s
[#####                                   ] | 13% Completed | 4.48 s
[#####                                   ] | 14% Completed | 4.59 s
[######                                  ] | 15% Completed | 4.69 s
[######                                  ] | 15% Completed | 4.79 s
[######                                  ] | 16% Completed | 4.89 s
[######                                  ] | 16% Completed | 4.99 s
[######                                  ] | 16% Completed | 5.09 s
[######                                  ] | 16% Completed | 5.19 s
[######                                  ] | 16% Completed | 5.29 s
[######                                  ] | 16% Completed | 5.39 s
[######                                  ] | 16% Completed | 5.50 s
[######                                  ] | 16% Completed | 5.60 s
[######                                  ] | 16% Completed | 5.70 s
[######                                  ] | 16% Completed | 5.80 s
[######                                  ] | 17% Completed | 5.90 s
[#######                                 ] | 18% Completed | 6.01 s
[#######                                 ] | 19% Completed | 6.11 s
[#######                                 ] | 19% Completed | 6.21 s
[########                                ] | 20% Completed | 6.31 s
[########                                ] | 21% Completed | 6.41 s
[########                                ] | 21% Completed | 6.52 s
[########                                ] | 22% Completed | 6.62 s
[########                                ] | 22% Completed | 6.72 s
[########                                ] | 22% Completed | 6.82 s
[########                                ] | 22% Completed | 6.92 s
[########                                ] | 22% Completed | 7.03 s
[########                                ] | 22% Completed | 7.13 s
[########                                ] | 22% Completed | 7.23 s
[########                                ] | 22% Completed | 7.33 s
[#########                               ] | 23% Completed | 7.44 s
[#########                               ] | 24% Completed | 7.54 s
[##########                              ] | 25% Completed | 7.64 s
[##########                              ] | 25% Completed | 7.74 s
[##########                              ] | 25% Completed | 7.84 s
[##########                              ] | 27% Completed | 7.95 s
[###########                             ] | 27% Completed | 8.05 s
[###########                             ] | 27% Completed | 8.15 s
[###########                             ] | 27% Completed | 8.25 s
[###########                             ] | 27% Completed | 8.35 s
[###########                             ] | 27% Completed | 8.45 s
[###########                             ] | 27% Completed | 8.55 s
[###########                             ] | 27% Completed | 8.65 s
[###########                             ] | 27% Completed | 8.76 s
[###########                             ] | 27% Completed | 8.86 s
[###########                             ] | 28% Completed | 8.96 s
[###########                             ] | 29% Completed | 9.06 s
[############                            ] | 30% Completed | 9.17 s
[############                            ] | 30% Completed | 9.27 s
[#############                           ] | 32% Completed | 9.37 s
[#############                           ] | 33% Completed | 9.47 s
[#############                           ] | 33% Completed | 9.58 s
[#############                           ] | 34% Completed | 9.68 s
[#############                           ] | 34% Completed | 9.78 s
[#############                           ] | 34% Completed | 9.88 s
[#############                           ] | 34% Completed | 9.98 s
[#############                           ] | 34% Completed | 10.08 s
[#############                           ] | 34% Completed | 10.19 s
[#############                           ] | 34% Completed | 10.29 s
[##############                          ] | 35% Completed | 10.39 s
[##############                          ] | 36% Completed | 10.49 s
[##############                          ] | 37% Completed | 10.59 s
[###############                         ] | 38% Completed | 10.69 s
[###############                         ] | 39% Completed | 10.80 s
[################                        ] | 40% Completed | 10.90 s
[################                        ] | 41% Completed | 11.00 s
[################                        ] | 42% Completed | 11.10 s
[#################                       ] | 42% Completed | 11.20 s
[#################                       ] | 42% Completed | 11.30 s
[#################                       ] | 42% Completed | 11.40 s
[#################                       ] | 42% Completed | 11.51 s
[#################                       ] | 42% Completed | 11.61 s
[#################                       ] | 43% Completed | 11.71 s
[#################                       ] | 43% Completed | 11.81 s
[#################                       ] | 44% Completed | 11.91 s
[##################                      ] | 45% Completed | 12.02 s
[##################                      ] | 46% Completed | 12.12 s
[##################                      ] | 47% Completed | 12.22 s
[###################                     ] | 48% Completed | 12.32 s
[###################                     ] | 49% Completed | 12.43 s
[###################                     ] | 49% Completed | 12.53 s
[###################                     ] | 49% Completed | 12.63 s
[###################                     ] | 49% Completed | 12.73 s
[###################                     ] | 49% Completed | 12.83 s
[###################                     ] | 49% Completed | 12.94 s
[###################                     ] | 49% Completed | 13.04 s
[####################                    ] | 51% Completed | 13.14 s
[####################                    ] | 51% Completed | 13.24 s
[#####################                   ] | 52% Completed | 13.34 s
[#####################                   ] | 53% Completed | 13.44 s
[#####################                   ] | 54% Completed | 13.54 s
[#####################                   ] | 54% Completed | 13.64 s
[#####################                   ] | 54% Completed | 13.75 s
[#####################                   ] | 54% Completed | 13.85 s
[#####################                   ] | 54% Completed | 13.95 s
[#####################                   ] | 54% Completed | 14.05 s
[######################                  ] | 55% Completed | 14.15 s
[######################                  ] | 56% Completed | 14.25 s
[######################                  ] | 56% Completed | 14.35 s
[#######################                 ] | 58% Completed | 14.45 s
[#######################                 ] | 58% Completed | 14.55 s
[#######################                 ] | 59% Completed | 14.66 s
[#######################                 ] | 59% Completed | 14.76 s
[#######################                 ] | 59% Completed | 14.86 s
[#######################                 ] | 59% Completed | 14.96 s
[#######################                 ] | 59% Completed | 15.06 s
[#######################                 ] | 59% Completed | 15.16 s
[#######################                 ] | 59% Completed | 15.26 s
[########################                ] | 61% Completed | 15.36 s
[#########################               ] | 62% Completed | 15.46 s
[#########################               ] | 63% Completed | 15.56 s
[#########################               ] | 64% Completed | 15.67 s
[##########################              ] | 65% Completed | 15.77 s
[##########################              ] | 65% Completed | 15.87 s
[##########################              ] | 65% Completed | 15.97 s
[##########################              ] | 65% Completed | 16.07 s
[##########################              ] | 65% Completed | 16.17 s
[##########################              ] | 65% Completed | 16.27 s
[##########################              ] | 65% Completed | 16.37 s
[##########################              ] | 66% Completed | 16.47 s
[##########################              ] | 67% Completed | 16.58 s
[###########################             ] | 67% Completed | 16.68 s
[###########################             ] | 68% Completed | 16.78 s
[###########################             ] | 69% Completed | 16.88 s
[############################            ] | 70% Completed | 16.98 s
[############################            ] | 72% Completed | 17.08 s
[#############################           ] | 74% Completed | 17.18 s
[#############################           ] | 74% Completed | 17.28 s
[#############################           ] | 74% Completed | 17.39 s
[#############################           ] | 74% Completed | 17.49 s
[#############################           ] | 74% Completed | 17.59 s
[#############################           ] | 74% Completed | 17.69 s
[#############################           ] | 74% Completed | 17.79 s
[##############################          ] | 75% Completed | 17.89 s
[##############################          ] | 76% Completed | 17.99 s
[##############################          ] | 76% Completed | 18.09 s
[###############################         ] | 77% Completed | 18.19 s
[###############################         ] | 78% Completed | 18.30 s
[###############################         ] | 79% Completed | 18.40 s
[################################        ] | 80% Completed | 18.50 s
[################################        ] | 80% Completed | 18.60 s
[################################        ] | 80% Completed | 18.71 s
[################################        ] | 80% Completed | 18.81 s
[################################        ] | 80% Completed | 18.91 s
[################################        ] | 80% Completed | 19.01 s
[################################        ] | 81% Completed | 19.12 s
[################################        ] | 82% Completed | 19.22 s
[#################################       ] | 83% Completed | 19.32 s
[#################################       ] | 83% Completed | 19.42 s
[#################################       ] | 84% Completed | 19.52 s
[##################################      ] | 85% Completed | 19.62 s
[##################################      ] | 85% Completed | 19.72 s
[##################################      ] | 85% Completed | 19.83 s
[##################################      ] | 85% Completed | 19.93 s
[##################################      ] | 85% Completed | 20.03 s
[##################################      ] | 85% Completed | 20.13 s
[##################################      ] | 86% Completed | 20.23 s
[##################################      ] | 87% Completed | 20.33 s
[###################################     ] | 88% Completed | 20.43 s
[###################################     ] | 89% Completed | 20.53 s
[####################################    ] | 90% Completed | 20.63 s
[####################################    ] | 91% Completed | 20.74 s
[####################################    ] | 92% Completed | 20.84 s
[#####################################   ] | 93% Completed | 20.94 s
[#####################################   ] | 93% Completed | 21.04 s
[#####################################   ] | 93% Completed | 21.14 s
[#####################################   ] | 93% Completed | 21.24 s
[#####################################   ] | 93% Completed | 21.34 s
[#####################################   ] | 93% Completed | 21.44 s
[#####################################   ] | 93% Completed | 21.54 s
[#####################################   ] | 94% Completed | 21.65 s
[######################################  ] | 95% Completed | 21.75 s
[######################################  ] | 96% Completed | 21.85 s
[######################################  ] | 96% Completed | 21.95 s
[####################################### ] | 97% Completed | 22.05 s
[####################################### ] | 99% Completed | 22.15 s
[########################################] | 100% Completed | 22.25 s

Dask Arrays with Xarray

Often times, you won’t be directly interacting with dask.arrays directly; odds are you will be interacting with them via Xarray! Xarray wraps NumPy arrays similar to how we showed in the previous section, helping the user jump right into the dask array interface!

Reading data with Dask and Xarray

Recall that a Dask’s array consists of many chunked arrays:

darr
Array Chunk
Bytes 5.96 GiB 190.73 MiB
Shape (4000, 100, 4000) (1000, 100, 500)
Dask graph 32 chunks in 2 graph layers
Data type float32 numpy.ndarray
4000 100 4000

To read data as Dask arrays with Xarray, we need to specify the chunks argument to open_dataset() function.

ds = xr.open_dataset(DATASETS.fetch('CESM2_sst_data.nc'), chunks={})
ds.tos
/usr/share/miniconda/envs/pythia-book-dev/lib/python3.8/site-packages/xarray/conventions.py:523: SerializationWarning: variable 'tos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
<xarray.DataArray 'tos' (time: 180, lat: 180, lon: 360)>
dask.array<open_dataset-e5232c61bb5adae759d8664152d046bftos, shape=(180, 180, 360), dtype=float32, chunksize=(180, 180, 360), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
Attributes: (12/19)
    cell_measures:  area: areacello
    cell_methods:   area: mean where sea time: mean
    comment:        Model data on the 1x1 grid includes values in all cells f...
    description:    This may differ from "surface temperature" in regions of ...
    frequency:      mon
    id:             tos
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Sea Surface Temperature
    type:           real
    units:          degC
    variable_id:    tos

Passing chunks={} to open_dataset() works, but since we didn’t tell dask how to split up (or chunk) the array, Dask will create a single chunk for our array.

The chunks here indicate how values should go into each chunk - for example, chunks={'time':90} will tell Xarray + Dask to place 90 time slices into a single chunk.

Notice how tos (sea surface temperature) is now split in the time dimension, with two chunks (since there are a total of 180 time slices in this dataset).

ds = xr.open_dataset(
    DATASETS.fetch('CESM2_sst_data.nc'),
    engine="netcdf4",
    chunks={"time": 90, "lat": 180, "lon": 360},
)
ds.tos
<xarray.DataArray 'tos' (time: 180, lat: 180, lon: 360)>
dask.array<open_dataset-0cdd66a88027ec4fa676839b34fcf652tos, shape=(180, 180, 360), dtype=float32, chunksize=(90, 180, 360), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
Attributes: (12/19)
    cell_measures:  area: areacello
    cell_methods:   area: mean where sea time: mean
    comment:        Model data on the 1x1 grid includes values in all cells f...
    description:    This may differ from "surface temperature" in regions of ...
    frequency:      mon
    id:             tos
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Sea Surface Temperature
    type:           real
    units:          degC
    variable_id:    tos

Calling .chunks on the tos xarray.DataArray displays the number of slices in each dimension within each chunk, with (time, lat, lon). Notice how there are now two chunks each with 90 time slices for the time dimension.

ds.tos.chunks
((90, 90), (180,), (360,))

Xarray data structures are first-class dask collections

This means you can call the following functions

  • dask.visualize(...)

  • dask.compute(...)

on both xarray.DataArray and xarray.Dataset objects backed by dask.array.

Let’s visualize our dataset using Dask!

dask.visualize(ds)
../../_images/dask-arrays-xarray_45_0.png

Parallel and lazy computation using dask.array with Xarray

Xarray seamlessly wraps Dask so all computation is deferred until explicitly requested.

z = ds.tos.mean(['lat', 'lon']).dot(ds.tos.T)
z
<xarray.DataArray 'tos' (lon: 360, lat: 180)>
dask.array<sum-aggregate, shape=(360, 180), dtype=float32, chunksize=(360, 180), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5

As you can see, z contains a Dask array. This is true for all Xarray built-in operations including subsetting

z.isel(lat=0)
<xarray.DataArray 'tos' (lon: 360)>
dask.array<getitem, shape=(360,), dtype=float32, chunksize=(360,), chunktype=numpy.ndarray>
Coordinates:
    lat      float64 -89.5
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5

We can visualize this subset too!

dask.visualize(z)
../../_images/dask-arrays-xarray_51_0.png

Now that we have prepared our calculation, we can go ahead and call .compute()

with ProgressBar():
    computed_ds = z.compute()
[                                        ] | 0% Completed | 201.90 us
[######################                  ] | 57% Completed | 101.76 ms
[########################################] | 100% Completed | 203.04 ms


Summary

We saw that we can use Xarray to access dask.arrays, which required passing a chunks argument to upon opening the dataset. Once the data were loaded into Xarray, we could interact with the xarray.Datasets and xarray.DataArrays as we would if we were working with dask.arrays. This can be a powerful tool for working with larger-than-memory datasets!

Dask Shortcomings

Dask Array does not implement the entire Numpy interface. Users expecting this will be disappointed. Notably Dask Array has the following failings:

  1. Dask Array does not support some operations where the resulting shape depends on the values of the array. For those that it does support (for example, masking one Dask Array with another boolean mask), the chunk sizes will be unknown, which may cause issues with other operations that need to know the chunk sizes.

  2. Dask Array does not attempt operations like sort which are notoriously difficult to do in parallel and are of somewhat diminished value on very large data (you rarely actually need a full sort). Often we include parallel-friendly alternatives like topk.

  3. Dask development is driven by immediate need, and so many lesser used functions, like np.sometrue have simply not been implemented yet. These would make excellent community contributions.

Learn More

Visit the Dask Array documentation. In particular, this array screencast will reinforce the concepts you learned here.

from IPython.display import YouTubeVideo

YouTubeVideo(id="9h_61hXCDuI", width=600, height=300)

Resources and references