## Our starting point

Histogramming some data is simple using
`numpy.histogram`

:

```
>>> import numpy as np
>>> x = np.random.randn(10000) ## create a dataset
>>> w = np.random.normal(1, 0.2, 10000) ## create some phony weights
>>> b = np.linspace(-5, 5, 11) ## bin edges (10 bins from -5 to 5)
>>> n, bins = np.histogram(x, bins=b, weights=w)
```

This gives me two arrays
- one for the bin heights (`n`

)
- one for the bin edges (`bins`

).

Quick and simple – but what if I want to include underflow and overflow in the first and last bins, respectively? What if I want to compute the error on each bin height given a weighted dataset? These quantities are important for high energy physics, where nearly all of our analysis is done using histograms.

## Underflow and overflow

Where the elements of the data contribute to the bin height is of
course determined by the bin edges. We can make the left and right
edges infinite to be sure to count *all* of our data^{1}. Then we just
add the `[0]`

bin contents to the `[1]`

bin contents, and add the
`[-1]`

bin contents to the `[-2]`

bin contents. Finally, we polish it
off by chopping off the out-of-bounds elements:

```
>>> import numpy as np
>>> raw_bins = np.linspace(-5, 5, 11)
>>> use_bins = [np.array([-np.inf]), raw_bins, np.array([np.inf])]
>>> use_bins = np.concatenate(use_bins)
>>> x = np.random.normal(0, 2, 1000) ## phony dataset
>>> n, bins = np.histogram(x, bins=use_bins)
>>> n[1] += n[0] ## add underflow to first bin
>>> n[-2] += n[-1] ## add overflow to last bin
>>> n = n[1:-1] ## chop off the under/overflow
>>> bins = raw_bins ## use our original binning (without infinities)
```

And that’s it, now you have *all* of your data histogrammed including
under and overflow.

## Error on bin height using weights

The standard error on a bin height is simply the square-root of the
bin height, \(\sqrt{N}\) ^{2}. If a bin is constructed from weighted
data, we require the square-root of the sum of the weights squared,
\(\sqrt{\sum_i w_i^2}\).

The standard histogram in NumPy doesn’t yield any information about
which weights belong to which bin, but we have another useful NumPy
function which can generate an array of indices based on where data
falls in a particular set of bins,
`numpy.digitize`

.

First, we get an array representing which bin each data point would
fall into. We can then use the conditional function
`numpy.where`

in a loop over all bins to grab only the weights in that bin, and sum
their squares.

```
>>> import numpy as np
>>> x = np.random.normal(0, 2.0, 1000) ## a dataset
>>> b = np.linspace(-2, 2, 21) ## 20 bins
>>> w = np.random.normal(1, 0.2, 1000) ## some weights
>>> sum_w2 = np.zeros([20], dtype=np.float32) ## start with empty errors
>>> digits = np.digitize(x, b) ## bin index array for each data element
>>> for i in range(nbins):
>>> weights_in_current_bin = w[np.where(digits == i)[0]]
>>> sum_w2[i] = np.sum(np.power(weights_in_current_bin, 2))
>>> n, bins = np.histogram(x, bins=b, weights=w)
>>> err = np.sqrt(sum_w2)
```

And that’s it, now you have an array of bin heights and an array or uncertainty in each bin.

Happy NumPying :)

### Appendix, a function to combine the two methods:

```
def extended_hist(data, xmin, xmax, nbins, underflow=True, overflow=True, weights=None):
if weights is not None:
if weights.shape != data.shape:
raise ValueError(
"Unequal shapes data: {}; weights: {}".format(data.shape, weights.shape)
)
edges = np.linspace(xmin, xmax, nbins + 1)
neginf = np.array([-np.inf], dtype=np.float32)
posinf = np.array([np.inf], dtype=np.float32)
uselsp = np.concatenate([neginf, edges, posinf])
if weights is None:
hist, bin_edges = np.histogram(data, bins=uselsp)
else:
hist, bin_edges = np.histogram(data, bins=uselsp, weights=weights)
n = hist[1:-1]
if underflow:
n[0] += hist[0]
if overflow:
n[-1] += hist[-1]
if weights is None:
w = np.sqrt(n)
else:
bin_sumw2 = np.zeros(nbins + 2, dtype=np.float32)
digits = np.digitize(data, edges)
for i in range(nbins + 2):
bin_sumw2[i] = np.sum(np.power(weights[np.where(digits == i)[0]], 2))
w = bin_sumw2[1:-1]
if underflow:
w[0] += bin_sumw2[0]
if overflow:
w[-1] += bin_sumw2[-1]
w = np.sqrt(w)
centers = np.delete(edges, [0]) - (np.ediff1d(edges) / 2.0)
return n, w, centers, edges
```

In the implementation of

↩`numpy.histogram`

, elements of the input array that live outside the bounds of the binning are ignored.Bin height is related to counting, therefore the data in a bin is Poissonian.

↩