Intermediate NumPy

Overview

Working with multiple dimensions
Subsetting of irregular arrays with booleans
Sorting, or indexing with indices

Prerequisites

Concepts	Importance	Notes
NumPy Basics	Necessary

Time to learn: 20 minutes

Imports

We will be including Matplotlib to illustrate some of our examples, but you don’t need knowledge of it to complete this notebook.

import matplotlib.pyplot as plt
import numpy as np

Using axes to slice arrays

Here we introduce an important concept when working with NumPy: the axis. This indicates the particular dimension along which a function should operate (provided the function reduces multiple values to a single value).

Let’s look at a concrete example with sum:

a = np.arange(12).reshape(3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

With no axis argument, np.sum calculates the total of all values in the array.

np.sum(a)

np.int64(66)

Info

Some of NumPy’s functions can be accessed as ndarray methods!

a.sum()

np.int64(66)

Now, with a reminder about how our array is shaped,

a.shape

(3, 4)

we can specify axis=0 to sum down the rows (collapsing the first dimension), which leaves one total per column.

np.sum(a, axis=0)

array([12, 15, 18, 21])

Or use axis=1 to sum across the columns (collapsing the second dimension), which leaves one total per row:

np.sum(a, axis=1)

array([ 6, 22, 38])
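One way to keep the axis convention straight, sketched here on the same 3x4 array: the axis passed to a reduction is the dimension that disappears from the result's shape. The keepdims=True option is a standard argument of NumPy reductions; the variable names below are only illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Reducing along axis 0 collapses the 3 rows, leaving one value per column
col_totals = a.sum(axis=0)
print(col_totals.shape)  # (4,)

# Reducing along axis 1 collapses the 4 columns, leaving one value per row
row_totals = a.sum(axis=1)
print(row_totals.shape)  # (3,)

# keepdims=True keeps the reduced axis with length 1, which helps broadcasting
row_totals_2d = a.sum(axis=1, keepdims=True)
print(row_totals_2d.shape)  # (3, 1)

# Example use: each element's share of its own row total
fractions = a / row_totals_2d
print(fractions.sum(axis=1))  # each row sums to 1, up to floating-point rounding
```

Because row_totals_2d has shape (3, 1), it broadcasts against the (3, 4) array without any manual reshaping.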

After putting together some data and introducing some more advanced calculations, let’s demonstrate a multi-layered example: calculating temperature advection. If you’re not familiar with this (don’t worry!), we’ll be looking to calculate

\[\begin{equation*} \text{advection} = -\vec{v} \cdot \nabla T \end{equation*}\]

and to do so we’ll start with some random \(T\) and \(\vec{v}\) values,

temp = np.random.randn(100, 50)
u = np.random.randn(100, 50)
v = np.random.randn(100, 50)

We can calculate the np.gradient of our new \(100 \times 50\) \(T\) field as two separate component gradients, one per array axis (here we treat axis 0 as \(x\) and axis 1 as \(y\)),

gradient_x, gradient_y = np.gradient(temp)
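To see which returned array belongs to which axis, here is a small sketch on a hypothetical field whose slope along each axis is known in advance. The field and its slopes (2 per step along axis 0, 3 per step along axis 1) are made up for illustration, and the default grid spacing of 1 is assumed.

```python
import numpy as np

# Hypothetical field: rises by 2 per step along axis 0 and by 3 per step along axis 1
i, j = np.meshgrid(np.arange(5), np.arange(4), indexing="ij")
field = 2.0 * i + 3.0 * j  # shape (5, 4)

# np.gradient returns one array per axis, ordered by axis number
d_axis0, d_axis1 = np.gradient(field)

print(field.shape, d_axis0.shape, d_axis1.shape)
print(np.allclose(d_axis0, 2.0), np.allclose(d_axis1, 3.0))
```

Each returned array has the same shape as the input, so the gradients can be combined with the original field element by element.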

In order to calculate \(-\vec{v} \cdot \nabla T\), we will use np.dstack to turn our two separate component gradient fields into one multidimensional field containing \(x\) and \(y\) gradients at each of our \(100 \times 50\) points,

grad_vectors = np.dstack([gradient_x, gradient_y])
print(grad_vectors.shape)

(100, 50, 2)

and then do the same for our separate \(u\) and \(v\) wind components,

wind_vectors = np.dstack([u, v])
print(wind_vectors.shape)

(100, 50, 2)

Finally, we can calculate the dot product of these two multidimensional fields of wind and temperature gradient components by hand as an element-wise multiplication, *, and then a sum of our separate components at each point (i.e., along the last axis),

advection = (wind_vectors * -grad_vectors).sum(axis=-1)
print(advection.shape)

(100, 50)
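As a cross-check on the stacked calculation, the sketch below recomputes the same quantity two other ways: component by component, and with np.einsum as an explicit dot product over the last axis. The random generator and its seed are arbitrary choices for this sketch, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for this sketch
temp = rng.standard_normal((100, 50))
u = rng.standard_normal((100, 50))
v = rng.standard_normal((100, 50))

gradient_x, gradient_y = np.gradient(temp)
grad_vectors = np.dstack([gradient_x, gradient_y])
wind_vectors = np.dstack([u, v])

# Stacked approach from the text: multiply, then sum over the component axis
advection = (wind_vectors * -grad_vectors).sum(axis=-1)

# Same quantity written component by component
advection_components = -(u * gradient_x + v * gradient_y)

# Same quantity as an explicit dot product over the last axis
advection_einsum = -np.einsum("ijk,ijk->ij", wind_vectors, grad_vectors)

print(np.allclose(advection, advection_components))
print(np.allclose(advection, advection_einsum))
```

All three forms give a (100, 50) result; which one reads best is largely a matter of taste.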

Indexing arrays with boolean values

Array comparisons

NumPy can easily create arrays of boolean values and use those to select certain values to extract from an array.

# Create some synthetic data representing temperature and wind speed data
np.random.seed(19990503)  # Make sure we all have the same data
temp = 20 * np.cos(np.linspace(0, 2 * np.pi, 100)) + 50 + 2 * np.random.randn(100)
speed = np.abs(
    10 * np.sin(np.linspace(0, 2 * np.pi, 100)) + 10 + 5 * np.random.randn(100)
)

plt.plot(temp, 'tab:red')
plt.plot(speed, 'tab:blue');

By comparing a NumPy array with a single value, we get an array of boolean values holding the result of the comparison for each element:

temp > 45

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

This result is itself a NumPy array of boolean values, and it can be used as an index into any array of the same size. We can even use it as an index into the original temp array we used in the comparison,

temp[temp > 45]

array([69.89825854, 71.52313905, 69.90028363, 66.73828667, 66.77980233,
       72.91468564, 69.34603239, 69.09533591, 68.27350814, 64.33916721,
       67.49497791, 67.05282372, 63.51829518, 63.54034678, 65.46576463,
       62.99683836, 59.27662304, 61.29361272, 60.51641586, 57.46048995,
       55.19793004, 53.07572989, 54.47998158, 53.09552107, 54.59037269,
       47.84272747, 49.1435589 , 45.87151534, 45.11976794, 45.009292  ,
       46.36021141, 46.87557425, 47.25668992, 50.09599544, 53.77789358,
       50.24073197, 54.07629059, 51.95065202, 55.84827794, 57.56967086,
       57.19572063, 61.67658285, 56.51474577, 59.72166924, 62.99403256,
       63.57569453, 64.05984232, 60.88258643, 65.37759899, 63.94115754,
       65.53070256, 67.15175649, 66.26468701, 67.03811793, 69.17773618,
       69.83571708, 70.99586742, 66.34971928, 67.49905207, 69.83593609])

Info

This only returns the values from our original array meeting the indexing conditions, nothing more! Note the size,

temp[temp > 45].shape

(60,)
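The size of the selection always equals the number of True entries in the mask. A minimal sketch with made-up values (the array contents are illustrative only):

```python
import numpy as np

# Small made-up temperature sample
values = np.array([44.0, 51.5, 39.2, 47.8, 45.0, 60.1])
mask = values > 45

# The number of True entries in the mask equals the length of the selection
n_true = np.count_nonzero(mask)
selected = values[mask]
print(n_true, selected.shape)

# np.flatnonzero gives the positions that were kept
positions = np.flatnonzero(mask)
print(positions)
```

Note that 45.0 is excluded here because the comparison is strict; use >= to include the boundary value.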

Warning

Indexing an array with a boolean array requires the two to be the same size!

If we store this array somewhere new,

temp_45 = temp[temp > 45]

temp_45[temp < 45]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 temp_45[temp < 45]

IndexError: boolean index did not match indexed array along axis 0; size of axis is 60 but size of corresponding boolean axis is 100

We find that our original (100,) shape array is too large to subset our new (60,) array.

If their sizes do match, the boolean array can come from a totally different array!

speed > 10

array([False, False, False,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True, False,  True,  True, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False, False,
        True])

temp[speed > 10]

array([66.73828667, 66.77980233, 69.34603239, 69.09533591, 68.27350814,
       64.33916721, 67.49497791, 67.05282372, 63.51829518, 63.54034678,
       65.46576463, 62.99683836, 59.27662304, 61.29361272, 60.51641586,
       57.46048995, 55.19793004, 53.07572989, 54.47998158, 53.09552107,
       54.59037269, 47.84272747, 49.1435589 , 45.87151534, 43.95971516,
       42.72814762, 42.45316175, 39.2797517 , 40.23351938, 36.77179678,
       34.43329229, 31.42277612, 38.97505745, 34.10549575, 35.70826448,
       29.01276068, 30.31180935, 29.31602671, 32.84580454, 30.76695309,
       29.11344716, 30.16652571, 29.91513049, 39.51784389, 69.17773618,
       69.83571708, 69.83593609])

Replacing values

To extend this, we can use this conditional indexing to assign new values to certain positions within our array, somewhat like a masking operation.

# Make a copy so we don't modify the original data
temp2 = temp.copy()
speed2 = speed.copy()

# Replace all places where speed is <10 with NaN (not a number)
temp2[speed < 10] = np.nan
speed2[speed < 10] = np.nan

plt.plot(temp2, 'tab:red');

and to put this in context,

plt.plot(temp, 'r:')
plt.plot(temp2, 'r')
plt.plot(speed, 'b:')
plt.plot(speed2, 'b');
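If you would rather not copy and then modify an array, np.where offers a non-mutating alternative to masked assignment. The sketch below uses made-up values and also shows how NaN behaves in later calculations. Keep in mind that NaN can only be stored in floating-point arrays.

```python
import numpy as np

temp = np.array([50.0, 62.0, 47.0, 55.0])   # made-up values
speed = np.array([12.0, 8.0, 15.0, 3.0])    # made-up values

# np.where(condition, x, y) picks x where condition is True and y elsewhere,
# returning a new array and leaving temp untouched
temp_masked = np.where(speed < 10, np.nan, temp)
print(temp_masked)

# Ordinary reductions propagate NaN; the nan-prefixed versions skip it
print(np.mean(temp_masked))
print(np.nanmean(temp_masked))
```

Here np.mean returns nan, while np.nanmean averages only the two values that survived the mask.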

If we use parentheses to preserve the order of operations, we can combine these conditions with other bitwise operators like the & for bitwise_and,

multi_mask = (temp < 45) & (speed > 10)
multi_mask

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True,  True, False,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True, False,  True,  True, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

temp[multi_mask]

array([43.95971516, 42.72814762, 42.45316175, 39.2797517 , 40.23351938,
       36.77179678, 34.43329229, 31.42277612, 38.97505745, 34.10549575,
       35.70826448, 29.01276068, 30.31180935, 29.31602671, 32.84580454,
       30.76695309, 29.11344716, 30.16652571, 29.91513049, 39.51784389])

Heat index is only defined for temperatures >= 80F and relative humidity values >= 40%. Using the data generated below, we can use boolean indexing to extract the data where heat index has a valid value.

# Here's the "data"
np.random.seed(19990503)
temp = 20 * np.cos(np.linspace(0, 2 * np.pi, 100)) + 80 + 2 * np.random.randn(100)
relative_humidity = np.abs(
    20 * np.cos(np.linspace(0, 4 * np.pi, 100)) + 50 + 5 * np.random.randn(100)
)

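# Create a mask for the two conditions described above.
# Note: relative_humidity is generated in percent (values near 50), so a 40%
# cutoff corresponds to `relative_humidity >= 40`. The 0.4 used here is met by
# every sample in the output below, so the mask is effectively just `temp >= 80`.
good_heat_index = (temp >= 80) & (relative_humidity >= 0.4)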

# Use this mask to grab the temperature and relative humidity values that together
# will give good heat index values
print(temp[good_heat_index])

[ 99.89825854 101.52313905  99.90028363  96.73828667  96.77980233
 102.91468564  99.34603239  99.09533591  98.27350814  94.33916721
  97.49497791  97.05282372  93.51829518  93.54034678  95.46576463
  92.99683836  89.27662304  91.29361272  90.51641586  87.46048995
  85.19793004  83.07572989  84.47998158  83.09552107  84.59037269
  80.09599544  83.77789358  80.24073197  84.07629059  81.95065202
  85.84827794  87.56967086  87.19572063  91.67658285  86.51474577
  89.72166924  92.99403256  93.57569453  94.05984232  90.88258643
  95.37759899  93.94115754  95.53070256  97.15175649  96.26468701
  97.03811793  99.17773618  99.83571708 100.99586742  96.34971928
  97.49905207  99.83593609]
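When writing a threshold like this, it is worth checking that it uses the same units as the data. The sketch below applies the two heat index criteria to a handful of made-up observations, with relative humidity stored in percent, and then shows the equivalent comparison if humidity were stored as a fraction instead. All names and values here are hypothetical.

```python
import numpy as np

# Made-up observations: temperature in degrees F, relative humidity in percent
temp_f = np.array([85.0, 78.0, 92.0, 81.0, 95.0])
rh_percent = np.array([55.0, 60.0, 35.0, 40.0, 70.0])

# Criteria from the text: temperature >= 80F and relative humidity >= 40%
valid = (temp_f >= 80) & (rh_percent >= 40)
print(temp_f[valid])
print(rh_percent[valid])

# If humidity were stored as a fraction (0 to 1), the threshold would be 0.4
rh_fraction = rh_percent / 100
valid_fraction = (temp_f >= 80) & (rh_fraction >= 0.4)
print(np.array_equal(valid, valid_fraction))
```

The third observation is dropped even though it is the second warmest, because its humidity of 35% falls below the cutoff.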

Another bitwise operator we can find helpful is Python’s ~ complement operator, which inverts a boolean mask. Here it lets us assign np.nan to every position where good_heat_index is False.

plot_temp = temp.copy()
plot_temp[~good_heat_index] = np.nan
plt.plot(plot_temp, 'tab:red');

Indexing using arrays of indices

You can also use a list or array of indices to extract particular values–this is a natural extension of the regular indexing. For instance, just as we can select the first element:

temp[0]

np.float64(99.89825854468695)

We can also extract the first, fifth, and tenth elements by passing a list of indices:

temp[[0, 4, 9]]

array([99.89825854, 96.77980233, 94.33916721])
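Unlike a boolean mask, a list or array of integer indices does not have to match the size of the array being indexed. The sketch below, on a small made-up array, shows repeated and negative indices, that the result is a copy, and that the result takes the shape of the index array.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Indices may repeat and may appear in any order
picked = values[[4, 0, 0, 2]]
print(picked)

# Negative indices count from the end, as with ordinary indexing
print(values[[-1, -2]])

# The result is a copy: changing it leaves the original array untouched
picked[0] = -1.0
print(values)

# A 2-D index array gives a result with the index array's shape
idx = np.array([[0, 1], [3, 4]])
print(values[idx].shape)
```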

One of the ways this comes into play is trying to sort NumPy arrays using argsort. This function returns the indices of the array that give the items in sorted order. So for our temp,

inds = np.argsort(temp)
inds

array([52, 57, 42, 48, 54, 44, 56, 51, 49, 43, 50, 46, 58, 55, 53, 40, 37,
       61, 47, 45, 59, 39, 36, 60, 41, 34, 66, 63, 35, 38, 32, 62, 64, 33,
       31, 67, 29, 28, 68, 69, 65, 30, 27, 70, 71, 72, 25, 26, 73, 75, 77,
       21, 23, 74, 76, 22, 24, 20, 78, 82, 80, 19, 79, 16, 83, 18, 87, 17,
       81, 84, 15, 12, 13, 85, 89, 86,  9, 88, 14, 90, 92, 97,  3,  4, 93,
       11, 91, 10, 98,  8,  7, 94,  6, 95, 99,  0,  2, 96,  1,  5])

i.e., our lowest value is at index 52, next 57, and so on. We can use this array of indices as an index for temp,

temp[inds]

array([ 58.71828204,  58.85269149,  59.01276068,  59.11344716,
        59.25186164,  59.31602671,  59.42796381,  59.91513049,
        60.16652571,  60.31180935,  60.48608715,  60.76695309,
        60.93380275,  60.95814392,  61.07199963,  61.1341411 ,
        61.42277612,  62.27369636,  62.44927684,  62.84580454,
        63.37573713,  64.10549575,  64.43329229,  64.95696914,
        65.70826448,  66.77179678,  67.06954335,  67.39853293,
        67.7453367 ,  68.97505745,  69.2797517 ,  69.34620461,
        69.51784389,  70.23351938,  72.45316175,  72.69583703,
        72.72814762,  73.95971516,  74.03576453,  74.45775806,
        75.009292  ,  75.11976794,  75.87151534,  76.36021141,
        76.87557425,  77.25668992,  77.84272747,  79.1435589 ,
        80.09599544,  80.24073197,  81.95065202,  83.07572989,
        83.09552107,  83.77789358,  84.07629059,  84.47998158,
        84.59037269,  85.19793004,  85.84827794,  86.51474577,
        87.19572063,  87.46048995,  87.56967086,  89.27662304,
        89.72166924,  90.51641586,  90.88258643,  91.29361272,
        91.67658285,  92.99403256,  92.99683836,  93.51829518,
        93.54034678,  93.57569453,  93.94115754,  94.05984232,
        94.33916721,  95.37759899,  95.46576463,  95.53070256,
        96.26468701,  96.34971928,  96.73828667,  96.77980233,
        97.03811793,  97.05282372,  97.15175649,  97.49497791,
        97.49905207,  98.27350814,  99.09533591,  99.17773618,
        99.34603239,  99.83571708,  99.83593609,  99.89825854,
        99.90028363, 100.99586742, 101.52313905, 102.91468564])

to get a sorted array back!

With some clever slicing, we can pull out the last 10, or 10 highest, values of temp,

ten_highest = inds[-10:]
print(temp[ten_highest])

[ 99.09533591  99.17773618  99.34603239  99.83571708  99.83593609
  99.89825854  99.90028363 100.99586742 101.52313905 102.91468564]
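Because argsort returns positions rather than values, the same index array can reorder any companion array of the same length. A minimal sketch with made-up temperature and wind speed values:

```python
import numpy as np

temp = np.array([71.2, 65.4, 80.1, 59.9, 77.3])   # made-up values
speed = np.array([5.0, 12.0, 3.0, 20.0, 8.0])     # made-up values

inds = np.argsort(temp)  # indices that put temp in ascending order

# Reverse the index array to go from highest to lowest
descending = inds[::-1]
print(temp[descending])

# The same indices reorder a companion array so pairs stay aligned
print(speed[inds])

# Two highest temperatures and the wind speeds observed with them
top2 = inds[-2:]
print(temp[top2], speed[top2])
```

Sorting speed on its own would lose the pairing with temp; indexing both arrays with inds keeps each observation together.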

NumPy has other arg functions that return indices rather than values, such as np.argmax and np.argmin; check out the NumPy docs on sorting your arrays!
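As a brief illustration of two such functions, the sketch below uses np.argmax and np.argmin on made-up data, along with np.unravel_index to turn a flat position into a row and column.

```python
import numpy as np

temp = np.array([71.2, 65.4, 80.1, 59.9, 77.3])   # made-up values

# argmax / argmin return the position of the largest / smallest value
print(np.argmax(temp), np.argmin(temp))

# On a multidimensional array, the default result indexes the flattened array;
# np.unravel_index converts it back to a (row, column) position
grid = np.array([[1, 9, 3],
                 [7, 2, 8]])
flat_pos = np.argmax(grid)
row, col = np.unravel_index(flat_pos, grid.shape)
print(flat_pos, (int(row), int(col)))

# With axis given, the result holds one index per row
per_row = np.argmax(grid, axis=1)
print(per_row)
```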

Summary

In this notebook we saw how understanding the dimensions of our data lets us apply calculations along a chosen axis, used True and False values to subset our data according to conditions, and used arrays of positions within our array to sort our data.

What’s Next

Taking some time to practice these techniques is worthwhile: they let you quickly manipulate arrays of data in scientifically useful ways.

Resources and references

The NumPy User Guide expands further on some of these topics and suggests various tutorials, lectures, and more at this stage.