This page will hopefully store my notes on various maths courses, and will hopefully grow over time.

Maths

Part II

Mathematics of Machine Learning [link], complete.

This is a new course, and I am not aware of any complete sets of notes available other than these at present. Highlights are the much nicer references to previous equations than in existing notes, and also the elementary proof of the full form of Hoeffding’s Lemma.

Part IB

Geometry [link], near complete.

I found this course hard, and found a number of sections of the course covered at a rapid speed. My goal in these notes is to distill the important intuitions, but also fill in details in the sections I was less clear about.

All sections of the course are covered in some detail, except the final section on hyperbolic geometry.

Analysis [link], very rough.

Currently covers some more background on the differentiation section of Analysis and Topology, as well as a technical detail that allows the general Cauchy’s Integral Formula for derivatives to be deduced.

Machine Learning

AI Safety Workshop (CUAI) [link]

A workshop (with exercises!) on the ‘off switch’ problem in AI. Explores ‘safe interruptibility’: one way of thinking about this problem in the context on current Reinforcement Learning systems.

Cambridge Societies

I spent some time at Cambridge in different societies:

Founded, was president of, and put talks together with the UCCPS.
Into Effective Altruism.
Organised workshops with the CUAI.
A lot of running, too.

Drafts

Backpropagation and einsum

This post assumes familiarity with the einsum function. A great introduction can be found here which describes more than what we’ll need in this blog post.

The setup

I was recently preparing for machine learning interviews, and was told to have familiarity with Python 3.7 and numpy. It’s not surprising to need to code in Python for ML interviews, but using numpy? It seemed a good idea to write some forwards and backwards passes for common functions in neural networks. It turns out that when you use einsum this is a lot less of a headache than you might have expected!

Let’s suppose we have some linear module in a neural network with a weight W computing some example (intentionally complicated) operation

Y_{bij} = \sum_{k, l} X_{bik} W_{jkl}

This may seem scary, but of course we’ll just compute this as

Y = torch.einsum(
    "bik,jkl->bij",
    X,
    W,
)

Now, how about writing the backwards method for this linear layer?

Previously, I would have gotten out pencil and paper to work out this mess. Now, we’ll use a result (proven later in this post) to make this process much faster. Recall that the backwards pass takes some upstream gradient dL_dY as input and needs to calculate dL_dW to do backprop on this weight matrix and dL_dX the upstream gradient for further backprop (here, L is the loss we’re optimizing).

Indeed, we can immediately calculate

dL_dW = torch.einsum(
    "bik,bij->jkl",
    X,
    dL_dY,
)

and

dL_dX = torch.einsum(
    "jkl,bij->bik",
    W,
    dL_dY,
)

by simply permuting the three terms in the einsum string, and inserting the similarly shaped tensors (note that we only compute gradients from dL_dY, rather than Y).

Why is this true? Let’s start with the dL_dW expression. The key idea is to use the chain rule, so that

dL_dW = \sum dL_dY * dY_dW

where the * represents the product, and we sum this over all possible Y indices. I think the easiest way to see the result is then to imagine fixing one particular Y index and calculating the gradient of that Y value with respect to the weight tensor W. Indeed, it’s now clear that we can compute dL_dY by looking at the relevant indices in the W tensor, and since we’re summing over everything, we get the einsum expression at the end of the day. Similarly for dL_dX, here we can use dL_dX = \sum dL_dY * dY_dX and this explains the W that appears in the einsum expression for the calculation here.

Additionally, even for functions such as convolutions, by representing these as matrix multiplications, we can calculate gradients in a far easier way. Consider the following implementation of a minimal conv2D function from ARENA (this was implemented intentionally using numpy only, hence the ugly conversions!):

def conv1d_minimal(x: Float[Tensor, "b ic w"], weights: Float[Tensor, "oc ic kw"]) -> Float[Tensor, "b oc ow"]:
    '''
    Like torch's conv1d using bias=False and all other keyword arguments left at their default values.
    x: shape (batch, in_channels, width)
    weights: shape (out_channels, in_channels, kernel_width)
    Returns: shape (batch, out_channels, output_width)
    '''

    x = x.numpy()
    weights = weights.numpy()

    output_width = x.shape[2] - weights.shape[2] + 1

    xstrided = np.lib.stride_tricks.as_strided(
        x,
        shape=(x.shape[0], x.shape[1], output_width, weights.shape[-1]), # batch, in_channels, output_width, kernel_width
        strides=(x.strides[0], x.strides[1], x.strides[2], x.strides[2]),
    )

    return torch.tensor(
        einops.einsum(
            xstrided,
            weights,
            "batch in_channels output_width kernel_width, \
            out_channels in_channels kernel_width \
            -> batch out_channels output_width",
        )
    )

Then we can create a backwards pass by reconstructing the large xstrided function and then aggregating the contributions of X from this:

def conv2d_minimal_backwards(dL_dY: Float[Tensor, "b oc ow"])
    dL_dW = torch.einsum(
        strided_X,
        dL_dY,
        "batch in_channels output_width kernel_width, \
        batch out_channels output_width \
        -> out_channels in_channels kernel_width",
    )

    dL_dXstrided = torch.einsum(
        W,
        dL_dY,
        "out_channels in_channels kernel_width, \
        batch in_channels output_width \
        -> batch in_channels output_width kernel_width",
    )

    # calculate all the gradients with as_strided for the "middle" elements that contribute exactly kernel_width times to downstream convolutions
    # calculate all the gradient manually for the "edge" elements which contribute less than kernel_width elements (there are few of these so this doesn't affect time complexity)
    # tada!