Neural Nets Aren't Black Boxes


If you think neural nets are black boxes, you’re certainly not alone. While they may not be as interpretable as something like a random forest (at least not yet), we can still understand how they process inputs to arrive at their predictions. In this post we’ll do just that as we build our own network from scratch, starting with logistic regression.

Objective#

Our goal is to construct a binary logistic classifier as a neural network. The network will consist of a single linear layer followed by a sigmoid activation with binary cross entropy as the loss. We’ll begin by deriving the back-prop equations for our particular scenario and in doing so we’ll see that what we’ve done generalizes immediately to networks with arbitrary layers and activations. In other words, we’ll have developed a framework that can model any feedforward network—all by starting from ordinary logistic regression.

Actually, this isn’t all that surprising when you think about it. Logistic regression is a linear layer followed by a sigmoid, and feedforward networks are just a bunch of linear layers stacked together with non-linear activations in between.

Backpropagation#

Back-propagation is nothing more than the chain rule. We can view our logistic network as the composition of three functions

\[ x \to \mathrm{BCE} \circ \mathrm{Sigmoid} \circ \mathrm{Linear}(x) \]

While the loss function is not usually viewed as a layer of the network, treating it as the final layer makes computing the gradients easier. Let’s denote the output of the $i$-th layer by $x_i$ so that

\begin{aligned} x_1 &= \mathrm{Linear}(x) \cr x_2 &= \mathrm{Sigmoid}(x_1) \cr x_3 &= \mathrm{BCE}(x_2) \end{aligned}

Since the only weights in our network are those of the linear layer, the only gradient we need to compute is the gradient of the loss function with respect to the linear weights. To do so, we’ll need to follow the chain rule and compute a series of intermediate gradients, starting with the gradient of the loss function with respect to the activations $x_2$.

\[ \frac{\partial \mathrm{BCE}}{\partial x_2} = \frac{\partial \mathrm{BCE}}{\partial x_2}(x_2) \]

Next, we’ll need to compute the gradient with respect to the linear outputs $x_1$. The chain rule tells us

\[ \frac{\partial \mathrm{BCE}}{\partial x_1} = \frac{\partial \mathrm{BCE}}{\partial x_2} \times \frac{\partial \mathrm{Sigmoid}}{\partial x_1}(x_1) \]

We only need these two gradients to update our single-layer logistic network. However, were there another layer, we’d also need to compute the gradient with respect to $x$ (which would be the previous layer’s activations). Since that’s where we’re heading anyway, let’s go ahead and do it now.

\[ \frac{\partial \mathrm{BCE}}{\partial x} = \frac{\partial \mathrm{BCE}}{\partial x_1} \times \frac{\partial \mathrm{Linear}}{\partial x}(x) \]

Notice a pattern? The first gradient we computed—the gradient with respect to the network’s final activations—is used to compute the next gradient—the gradient with respect to the linear outputs—which is in turn used to compute the gradient with respect to the linear inputs. To compute the gradients of any network, we simply start at the last layer and successively pass the gradients backwards to the preceding layer until we arrive at the original inputs. That’s the reason it’s called back-propagation. It really is helpful to picture passing the gradients back through the network like a baton.
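In code, this hand-off is nothing more than a loop over the layers in reverse order, which is exactly what the Sequential class at the end of this post will do. As a preview, here’s a minimal sketch, assuming the forward pass has already been run and that each layer exposes a backward method that takes the incoming gradient and returns the gradient to pass further back.

def backprop(layers, criterion):
    """Pass the gradient backwards through the network, layer by layer."""
    grad = criterion.backward()       # gradient w.r.t. the final activations
    for layer in reversed(layers):    # hand the gradient back like a baton
        grad = layer.backward(grad)
    return grad                       # gradient w.r.t. the original inputs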

We’ll compute each of these gradients in turn, starting with the last layer and working our way backwards to the original inputs.

⚠️ Note: So far we’ve been treating the input $x$ as a single variable but most of the time $x$ will have more than one dimension. Computing the gradients in the multi-variate case isn’t any more difficult than what we’ve done (it involves something called the Jacobian, but we’ll pretend we didn’t hear that).

Binary Cross Entropy#

Binary cross entropy penalizes predictions by the logarithm of their confidence. Given labels $y$ which are either zero or one and probabilities $\hat{y}$ for the positive class, binary cross entropy is computed as

\[ \mathrm{BCE}(\hat{y}, y) = -[y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})] \]

After simplifying, you’ll find its derivative is

\[ \frac{\partial \mathrm{BCE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \]
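In case the simplification isn’t obvious, differentiating each term of the loss and putting everything over a common denominator gives

\[ \frac{\partial \mathrm{BCE}}{\partial \hat{y}} = -\left[ \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right] = \frac{-y(1 - \hat{y}) + (1 - y)\hat{y}}{\hat{y}(1 - \hat{y})} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \]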

To avoid division-by-zero errors, we’ll clip the probabilities so they’re not too close to zero or one.

import numpy as np

class BinaryCrossEntropy:
    """Container for the forward and backward pass of BCE."""

    def __call__(self, y_hat, y):
        return self.forward(y_hat, y)

    def forward(self, y_hat, y):
        """Return binary cross entropy given predictions and targets."""
        self.y_hat, self.y = y_hat.clip(min=1e-8, max=1-1e-8), y
        return -np.where(y==1, np.log(self.y_hat), np.log(1 - self.y_hat))

    def backward(self):
        """Backpropagate the gradient with respect to predictions."""
        return (self.y_hat - self.y) / (self.y_hat * (1 - self.y_hat))

Activations#

The easiest components of a network are the activation functions. Our activation is sigmoid, which you’ll often see defined as one of

\[ \sigma(x) = \frac{1}{1 + \mathrm{exp}(-x)} \quad \text{or} \quad \frac{\mathrm{exp}(x)}{1 + \mathrm{exp}(x)} \]

Turns out we need both versions to implement a numerically stable sigmoid. Why? Notice that when $x$ is very negative, $\mathrm{exp}(-x)$ is enormous, and when $x$ is very positive, $\mathrm{exp}(x)$ is enormous. In both cases the exponential overflows the floating-point range. The easy fix is to use the former when $x > 0$ and the latter when $x \leq 0$.

After simplifying, you’ll find the derivative of sigmoid is

\[ \sigma'(x) = \sigma(x)(1 - \sigma(x)) \]
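Here’s the simplification spelled out, starting from the first form of sigmoid:

\[ \sigma'(x) = \frac{\mathrm{exp}(-x)}{(1 + \mathrm{exp}(-x))^2} = \frac{1}{1 + \mathrm{exp}(-x)} \times \frac{\mathrm{exp}(-x)}{1 + \mathrm{exp}(-x)} = \sigma(x)(1 - \sigma(x)) \]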

This is exactly the denominator of the BCE gradient we just computed, meaning the two terms will cancel when multiplied, which is what the chain rule tells us will happen when we compute the gradient with respect to the outputs $x_1$ of our network’s last (and only) linear layer.

\[ \frac{\partial \mathrm{BCE}}{\partial x_1} = \frac{\partial \mathrm{BCE}}{\partial \hat{y}} \times \frac{\partial \mathrm{Sigmoid}}{\partial x_1}(x_1) = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \times \hat{y}(1 - \hat{y}) = \hat{y} - y \]

Nice, right? This tells us that the gradient of the loss with respect to the network’s final linear outputs is just the difference between the probabilities $\hat{y}$ and the labels $y$. The further apart they are (i.e. the worse our predictions are), the larger the gradient and the larger the update to the weights in the SGD step.

This is terrific because it means the weights of our network will change gradually as we train and won’t spike or drop suddenly, which would be the case if the gradients were a quadratic or higher-order function of the prediction error. It also nicely demonstrates how a network adjusts its weights based on the error of its predictions.

In fact, the same is true of multi-class classification, where instead of sigmoid we use softmax (or log softmax) as the final activation and cross entropy (or negative log-likelihood) as the loss.
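If you work through the analogous computation for softmax followed by cross entropy, with one-hot labels $y$ and logits $z$, the same cancellation happens and you land on the same expression:

\[ \frac{\partial \mathrm{CE}}{\partial z_i} = \hat{y}_i - y_i \]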

⚠️ Remember: Thanks to the chain rule, the above gradient appears as a factor in the gradients for all of the weights of our network, meaning they all depend linearly on the prediction error, which helps promote smoother training.

class Sigmoid:
    """Container for the forward and backward pass of sigmoid."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        """Pass a mini-batch through a sigmoid layer."""
        self.y_hat = np.where(x > 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))
        return self.y_hat

    def backward(self, grad):
        """Backpropagate the gradient given the preceding gradient."""
        return self.y_hat * (1 - self.y_hat) * grad

Linear Layers#

The last and most difficult component we need to implement is the linear layer, which contains weights and biases. Denoting the linear outputs by $z$, we have

\[ z = xw + b \]

If $x$ is a mini-batch of shape $(bs, n_\mathrm{inp})$, then $w$ has shape $(n_\mathrm{inp}, n_\mathrm{out})$ and $b$ has shape $(n_\mathrm{out},)$, with addition being done via broadcasting. In our case $n_\mathrm{out}=1$, since we’re just doing logistic regression, but we’ll leave things as is to treat the general case. To make things easier, for the moment let’s just imagine we have a batch size of one so that

\[ x = [x_1, \space \dots, \space x_{n_\mathrm{inp}}] \quad \text{and} \quad z = [z_1, \space \dots, \space z_{n_\mathrm{out}}] \]

There are two gradients to compute this time, one with respect to the weights and another with respect to the bias. To make life easier, let’s write everything out in coordinates.

\[ z_i = \sum_{k=1}^{n_\mathrm{inp}} \space x_k w_{ki} + b_i \]

Taking the derivative with respect to the bias, we have

\[ \frac{\partial \mathrm{BCE}}{\partial b_i} = \frac{\partial \mathrm{BCE}}{\partial z_i} \times \frac{\partial z_i}{\partial b_i} = \frac{\partial \mathrm{BCE}}{\partial z_i} \]

This is great because we’ll already have the gradient with respect to the linear outputs $z_i$ stored in a variable $\mathrm{grad}$ from the previous back-prop step, so we get the gradient with respect to the bias for free. That just leaves the weights.

\[ \frac{\partial \mathrm{BCE}}{\partial w_{ki}} = \frac{\partial \mathrm{BCE}}{\partial z_i} \times \frac{\partial z_i} {\partial w_{ki}} = \frac{\partial \mathrm{BCE}}{\partial z_i} \times x_k \]

Taking a closer look at these gradients, we see exactly why neural nets are so sensitive to the scale of their inputs. Because the gradients with respect to the weights $w$ are multiplied by the input features $x_k$, having features of different magnitudes will result in some gradients being larger than others. These larger gradients will dominate the learning process and prevent the network from learning from all features equally. This is why it’s important to normalize your data before training.

The last thing we have to do is to figure out how to write the above equations as a matrix product. Whenever I have to do something like this, I just focus on getting the shapes right.

\begin{aligned} &\bullet \space x \text{ has shape } (bs, n_\mathrm{inp}) \cr &\bullet \space \text{grad has shape } (bs, n_\mathrm{out}) \cr &\bullet \space \mathrm{grad_w} \text{ has shape } (n_\mathrm{inp}, n_\mathrm{out}) \end{aligned}

The only way we can multiply $x$ and $\mathrm{grad}$ and get something of shape $(n_\mathrm{inp}, n_\mathrm{out})$ is to reshape $x$ to have shape $(bs, n_\mathrm{inp}, 1)$ and $\mathrm{grad}$ to have shape $(bs, 1, n_\mathrm{out})$ so that ordinary matrix multiplication over the last two dimensions gives the shape $(bs, n_\mathrm{inp}, n_\mathrm{out})$. Then we’ll average the gradients over the batch dimension to get a single update to the weights.
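Here’s a quick shape check in numpy with made-up sizes (the names below are just for illustration):

import numpy as np

bs, n_inp, n_out = 64, 30, 1
x = np.random.randn(bs, n_inp)      # mini-batch of inputs
grad = np.random.randn(bs, n_out)   # gradient w.r.t. the linear outputs

# Batched outer product, then average over the batch dimension
grad_w = (x[:, :, None] @ grad[:, None, :]).mean(axis=0)
print(grad_w.shape)                 # (30, 1), i.e. (n_inp, n_out)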

As we mentioned earlier, were there another linear layer we’d also need to compute the gradient with respect to the inputs $x$ so we could keep back-propagating the gradients. This isn’t any more difficult than what we’ve done so far, and doing so will allow us to build a network with any number of layers, so let’s go ahead and do it. Since $x_k$ appears in each of the outputs $z_i$, the gradient with respect to $x_k$ will include all of the intermediate gradients with respect to the $z_i$.

\[ \frac{\partial \mathrm{BCE}}{\partial x_k} = \sum_{i=1}^{n_\mathrm{out}} \frac{\partial \mathrm{BCE}}{\partial z_i} \times \frac{\partial z_i}{\partial x_k} = \sum_{i=1}^{n_\mathrm{out}} \frac{\partial \mathrm{BCE}}{\partial z_i} w_{ki} \]

We can rewrite this as a matrix product using the transpose of the weight matrix.

\[ \frac{\partial \mathrm{BCE}}{\partial x} = \mathrm{grad} \times w^t \]

Let’s do a sanity check on the dimensions involved to make sure nothing has gone horribly wrong. Since $\mathrm{grad}$ has shape $(bs, n_\mathrm{out})$ and $w$ has shape $(n_\mathrm{inp}, n_\mathrm{out})$, $\mathrm{grad} \times w^t$ has shape $(bs, n_\mathrm{inp})$, which is exactly the shape of $x$—just as it should be.

class Linear:
    """Container for the forward and backward pass of a linear layer."""

    def __init__(self, n_inp, n_out):
        """Initialise layer with random weights and zero bias."""
        k = 1 / np.sqrt(n_inp)
        self.weights = np.random.uniform(-k, k, (n_inp, n_out))
        self.bias = np.zeros(n_out)

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        """Pass a mini-batch through a linear layer."""
        self.x = x
        return x @ self.weights + self.bias

    def backward(self, grad):
        """Backpropagate the gradient given the preceding gradient."""
        self.grad_w = (self.x[:,:,None] @ grad[:,None,:]).mean(axis=0)
        self.grad_b = grad.mean(axis=0)
        return grad @ self.weights.T

Putting It All Together#

It’s finally time to string together all of the work we’ve done into a complete network. Then we’ll put it to the test on the breast cancer dataset and see how we compare to sklearn’s built-in logistic model.

class Sequential:
    """Container for a feedforward neural net."""

    def __init__(self, layers, criterion):
        """Initialise layers and loss criterion."""
        self.layers = layers
        self.criterion = criterion

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        """Pass a mini-batch through the network."""
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self):
        """Backpropagate gradients to the start of the network."""
        grad = self.criterion.backward()
        for layer in self.layers[::-1]:
            grad = layer.backward(grad)

For the optimizer, we’ll stick with vanilla gradient descent.

class SGD:
    """Container for updating a model's weights via SGD."""

    def __init__(self, model, lr):
        """Initialise model parameters and learning rate."""
        self.model = model
        self.lr = lr

    def step(self):
        """Update weights and biases of all linear layers."""
        for layer in self.model.layers:
            if isinstance(layer, Linear):
                layer.weights -= self.lr * layer.grad_w
                layer.bias -= self.lr * layer.grad_b

Trainer Class#

To make life easier, let’s wrap all of the functionality we’ll need to train a network in a single class.

class Trainer:
    """Container for training a feedforward neural net."""

    def __init__(self, model, optimizer, train_dl, val_dl, metric):
        self.model = model
        self.optimizer = optimizer
        self.train_dl = train_dl
        self.val_dl = val_dl
        self.metric = metric

    def train_one_epoch(self):
        """Train for one epoch and return the loss."""
        loss, n = 0, 0
        for x, y in self.train_dl:
            y_hat = self.model(x)
            batch_loss = self.model.criterion(y_hat, y).sum()
            self.model.backward()
            self.optimizer.step()
            loss += batch_loss
            n += len(y)
        return loss / n

    def train(self, n_epochs, log_level=1):
        """Train for several epochs."""
        for epoch in range(n_epochs):
            loss = self.train_one_epoch()
            val_loss, val_metric = self.evaluate(self.val_dl)
            if (epoch + 1) % log_level == 0:
                print(f"{epoch= :2d} | {loss= :.3f} | {val_loss= :.3f} | {val_metric= :.3f}")

    def evaluate(self, dl):
        """Return loss and metric on validation or test set."""
        loss, metric, n = 0, 0, 0
        for x, y in dl:
            y_hat = self.model(x)
            batch_loss = self.model.criterion(y_hat, y).sum()
            batch_metric = self.metric(y_hat, y)
            metric += len(y) * batch_metric
            loss += batch_loss
            n += len(y)
        return loss / n, metric / n

Getting the Data#

Now let’s grab our data and split off a validation set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=9)
X_train.shape, X_val.shape   # (455, 30), (114, 30)

As we saw above, it’s important to always normalize our data before training.

def normalize(X_train, X_val):
    """Normalize training and validation data using training stats."""
    for j in range(X_train.shape[1]):
        mu, sigma = X_train[:,j].mean(), X_train[:,j].std()
        X_train[:,j] = (X_train[:,j] - mu) / sigma
        X_val[:,j] = (X_val[:,j] - mu) / sigma
    return X_train, X_val
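Applying it is then a one-liner; note that it modifies the arrays in place and uses only the training statistics, so no information leaks from the validation set.

X_train, X_val = normalize(X_train, X_val)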

Datasets & Dataloaders#

Since PyTorch’s Dataset and DataLoader classes only work with tensors, we’ll need to implement our own numpy versions in order to train in batches. We’ll omit the details for brevity, but you can take a look at the source code in the tinytorch repo.
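For a rough idea of what’s involved, here’s a minimal sketch of what such classes might look like (the actual tinytorch implementations may differ; in particular, we reshape the targets to a column so they broadcast against the network’s $(bs, 1)$ outputs):

class Dataset:
    """Minimal container pairing features with targets."""

    def __init__(self, X, y):
        # Store targets as a column so they broadcast against (bs, 1) predictions
        self.X, self.y = X, y.reshape(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idxs):
        return self.X[idxs], self.y[idxs]

class DataLoader:
    """Minimal iterator over mini-batches of a Dataset."""

    def __init__(self, dataset, batch_size, shuffle=False):
        self.dataset, self.batch_size, self.shuffle = dataset, batch_size, shuffle

    def __iter__(self):
        n = len(self.dataset)
        idxs = np.random.permutation(n) if self.shuffle else np.arange(n)
        for i in range(0, n, self.batch_size):
            yield self.dataset[idxs[i:i + self.batch_size]]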

# Prep datasets & dataloaders
train_ds = Dataset(X_train, y_train)
val_ds = Dataset(X_val, y_val)

batch_size = 64
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False)

Training Loop#

At last we’re ready to train our logistic network.
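One thing the snippet below takes for granted is an accuracy function for the metric, which isn’t defined in this post; a minimal version for binary probabilities might look like this.

def accuracy(y_hat, y):
    """Fraction of thresholded predictions that match the labels."""
    return ((y_hat > 0.5) == y).mean()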

# Number of input features
n_inp = X_train.shape[1]

# Initialise layers and criterion
criterion = BinaryCrossEntropy()
layers = [Linear(n_inp, 1), Sigmoid()]
model = Sequential(layers, criterion)

# Initialise optimizer and trainer
optimizer = SGD(model, lr=0.1)
trainer = Trainer(model, optimizer, train_dl, val_dl, metric=accuracy)
trainer.train(5)

Comparison with sklearn#

Let’s see how our logistic network stacks up against sklearn’s built-in logistic model.

from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(random_state=9)
sklearn_model.fit(X_train, y_train)
sklearn_model.score(X_val, y_val)
0.98245

Going Deeper#

Now let’s see if we can’t do a bit better using a deeper network. First, we’ll need to implement $\mathrm{ReLU}$ for the activations in between our linear layers. $\mathrm{ReLU}$ acts as a gate that admits only positive inputs and maps negative inputs to zero.

$$ \mathrm{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \cr 0 & \text{otherwise} \end{cases} $$

You might recognize its derivative as the Heaviside function, which models an electric current that’s turned on at time $t = 0$.

$$ \mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \cr 0 & \text{otherwise} \end{cases} $$

During the backward pass $\mathrm{ReLU}$ again acts as a gate that only admits gradients corresponding to positive inputs and sends all other gradients to zero.

class ReLU:
    """Container for the forward and backward pass of ReLU."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        """Pass a mini-batch through ReLU."""
        self.x = x
        return np.where(x > 0, x, 0)

    def backward(self, grad):
        """Return the gradient where x is positive, otherwise zero."""
        return np.where(self.x > 0, grad, 0)

A Two Layer Network#

Now we’re ready to train a multi-layer neural net.

# Initialise layers and criterion
criterion = BinaryCrossEntropy()
layers = [Linear(n_inp, 20), ReLU(), Linear(20, 1), Sigmoid()]
model = Sequential(layers, criterion)

# Initialise optimizer and trainer
optimizer = SGD(model, lr=0.1)
trainer = Trainer(model, optimizer, train_dl, val_dl, metric=accuracy)
trainer.train(5)


| epoch | train loss | val loss | accuracy |
|-------|------------|----------|----------|
| 0     | 0.620      | 0.547    | 0.904    |
| 1     | 0.503      | 0.428    | 0.930    |
| 2     | 0.398      | 0.329    | 0.939    |
| 3     | 0.314      | 0.261    | 0.939    |
| 4     | 0.256      | 0.217    | 0.956    |

We’re not seeing a clear win with the deeper model, which isn’t all that surprising. Neural nets generally have a hard time beating simpler models on small tabular datasets like this one.

Conclusion#

That’s about it for us. In this post we built and came to understand each piece of a feedforward neural net, starting from a single-layer logistic network and ending with a multi-layer network. We saw that one reason cross entropy is an effective loss function is that the gradient with respect to the weights depends linearly on the prediction error, which promotes smoother training. We also saw that because the original inputs appear as factors in the network’s gradients, it’s essential to normalize your data before training.