Fisher Information Matrix 🐠

I learn Fisher Information Matrix before/during the goddamn military service.

Interests in Fisher Information Matrix

Fisher Information Matrix is highly related to Hessian Matrix. Hessian matrix is a square matrix describing the second-order partial derivatives. As we learned in high school, second order information gives us one-step further information on the current curvature. This property allows efficient optimization.

Let's say we have a model parametrized by $\theta$ and we would like to optimize the likelihood $p(x|\theta)$ w.r.t. $\theta$. Usually, we aim at maximizing the log-likelihood $log \ p(x|\theta)$ instead. For the sake of convenience of the following sections, we define a score function $s(\theta)$ as follows:

$s(\theta) = \nabla_{\theta} log \ p(x|\theta)$

The score function is the graident of the log-likelihood.

The interpretation of the score function is that it measures the changes of the log-likelihood w.r.t. each parameter in $\theta$. Namely, suppose $\theta=[\theta_1, \theta_2, ..., \theta_N]$, the score function describes how the log-likelihood changes when $\theta_1$ or $\theta_2$ changes.
Now, one natural question arises, what if we are interested in the relation between $\nabla_{\theta_1} log \ p(x|\theta)$ and $\nabla_{\theta_2} log \ p(x|\theta)$. To this end, we need the covariance of the score function:

$I_{\theta} = Corr(s(\theta), s(\theta)) = E_{p(x|\theta)} [(s(\theta) - E_{p(x|\theta)}[s(\theta)]) (s(\theta) - E_{p(x|\theta)}[s(\theta)])^T]$

Before diving deeper, let's first calculate this problematic term: the mean of the scrore function $E_{p(x|\theta)}[s(\theta)]$. Don't worry. It will be surprisingly easy:

\begin{align*} E_{p(x|\theta)}[s(\theta)] &= E_{p(x|\theta)} [\nabla_\theta log \ p(x|\theta)] \\ &= \intop\nolimits_{x} \nabla_\theta log \ p(x|\theta) \ p(x|\theta) dx \\ &= \intop\nolimits_{x} \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)} p(x|\theta) dx \\ &= \intop\nolimits_{x} \nabla_\theta p(x|\theta) dx \\ &= \nabla_\theta \intop\nolimits_{x} p(x|\theta) dx \\ &= 0 \end{align*}

Now, $Corr(s(\theta), s(\theta))$ becomes much easier:

\begin{align*} I_{\theta} &= Corr(s(\theta), s(\theta)) \\ &= E_{p(x|\theta)} [(s(\theta) - 0) (s(\theta) - 0)^T] \\ &= E_{p(x|\theta)} [\nabla_{\theta} log \ p(x|\theta) \nabla_{\theta} log \ p(x|\theta)^T] \end{align*}

If the exact form of the log-likelihood w.r.t. $\theta$ is known, we might be able to directly calculate $I_\theta$. However, most of the time the likelihood is intractable. A workaround is that we can calculate empirical fisher if we have empirical samples drawn from $p(x|\theta)$.

$I_{\theta} = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta log \ p(x_i|\theta) \nabla_\theta log \ p(x_i|\theta)^T$

Note that this formulation is very common. In generative model, we optimize the likelihood of the data $x$ (can be either continuous/discrete data) w.r.t. $\theta$. In supervised learning, it's more straightforward. We optimize $p(y=\widehat{y} | x, \theta)$ w.r.t. $\theta$. Similarly, we can consider a reinforcement learning program as an inference problem that we optimize $p(a|o, \theta)$ w.r.t $\theta$ such that the total return/cost $\eta(\tau)$ is maximized/minimized (It might not be intuitive, but it's possible with some smart formulation).

The takeaway so far

The Fisher Information Matrix describes the covariance of the gradient of the log-likelihood function. Note that we call it "information" because the Fisher information measures how much the parameters tell us about the data.

🔨 Case study: Elastic weight consolidation

Figure 1. Illustration of the learning process of task B after that of task A.

tl;dr: EWC is an algorithm to avoid catastrophic forgetting in neural networks. It slows down learning on certain weights based on how important they are to previously seen tasks.

Let's say we have two tasks A and B. In continual learning, we first learn task A and then task B. When learning the second task B, the neural networks incline to forget the previously learned knowledge of the previous task (A). The learning process can be written as (corresponds to "no penalty" in figure 1):

$\theta^* = \underset{\theta}{\operatorname{argmin}} \mathcal{L_{B}(\theta)}$

To avoid forgetting the learned knowledge in task A, one simple trick is that we can minimize the distances between $\theta$ and $\theta_\mathcal{A}^*$. Thus, the learning process becomes (corresponds to "l2" in figure 1):

$\theta^* = \underset{\theta}{\operatorname{argmin}} \mathcal{L_{B}(\theta)} + \frac{1}{2}\alpha (\theta - \theta_\mathcal{A}^*)^2$

where $\alpha$ is the scalar sets how important the old task is compared to the new one.

It turns out that the l2 constraint is so strong that it could hamper the learning process of task B. Here, we have one more observation: In neural networks, we often over-parametrize the models. There might be some parameters that are less useful and others are more valuable. In the l2 constraint case, each parameter is treated equally. Here, we want to use the diagonal components in Fisher Information Matrix to identify which parameters are more important to task A and apply higher weights to them. (corresponds to "EWC" in figure 1)

$\theta^* = \underset{\theta}{\operatorname{argmin}} \mathcal{L_{B}(\theta)} + \frac{1}{2} \alpha I_{\theta^*_\mathcal{A}, i} (\theta_i - \theta^*_{\mathcal{A}, i})^2$.

where $I_i$ is the diagonal of the Fisher Information Matrix and $i$ labels each parameter

$I_i$ tells us if the parameter $\theta_i$ is important to the previous task A. To compute $I_i$, we sample the data from task A once and calculate the empirical Fisher Information Matrix as described before. If you also find it interesting, check the PyTorch implementation here @moskomule/ewc.pytorch. (tbh, I didn't run this code.)

$I_{\theta_\mathcal{A}^*} = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta log \ p(x_{\mathcal{A}, i}|\theta_\mathcal{A}^*) \nabla_\theta log \ p(x_{\mathcal{A}, i}|\theta_\mathcal{A}^*)^T$

The relation between Fisher Information Matrix and KL-divergence

This part is sort of mathness. Hang in there! 🧟

KL-divergence is widely used to measure the difference between two distributions. Here, we will prove that Fisher Information Matrix defines the local curvature in distribution space for which KL-divergence is the metric.

Relation between Fisher and Hessian

(skip to the last line of this subsection if you are not interested in it.)
We begin to briefly prove the relation between Fisher and Hessian. Hessian Matrix is a square matrix of second-order partial derivatives of a scalar-valued function, which describe the local curvature. The Hessian matrix of the log-likelihood can be written as:

\begin{align*} H_{log \ p(x|\theta)} &= J(\nabla_\theta log \ p(x|\theta)) \ \ (\text{by definition})\\ &= J(\frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}) \\ &= \frac{H_{p(x|\theta)} p(x|\theta) - \nabla_\theta p(x|\theta) \nabla_\theta p(x|\theta)^T}{p(x|\theta) p(x|\theta)} \ \ (\text{quotient rule of derivative}) \\ &= \frac{H_{p(x|\theta)}}{p(x|\theta)} - \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)} \frac{\nabla_\theta p(x|\theta)^T}{p(x|\theta)} \end{align*}

Now, let's take the expectation w.r.t. the current model $p(x|\theta)$:

\begin{align*} E_{p(x|\theta)} [H_{log \ p(x|\theta)}] &= E_{p(x|\theta)} [\frac{H_{p(x|\theta)}}{p(x|\theta)} - \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)} \frac{\nabla_\theta p(x|\theta)^T}{p(x|\theta)}] \\ &= \intop\nolimits_{x} \frac{H_{p(x|\theta)}}{p(x|\theta)} p(x|\theta) dx - E_{p(x|\theta)} [\frac{\nabla_\theta p(x|\theta)}{p(x|\theta)} \frac{\nabla_\theta p(x|\theta)^T}{p(x|\theta)}] \\ &= H_{\intop\nolimits_{x} p(x|\theta) dx} - E_{p(x|\theta)} [\nabla_\theta log p(x|\theta) \nabla_\theta log p(x|\theta)^T] \\ &= - I_{\theta} \end{align*}

Thus, we have: The negative of Fisher is the expectation of Hessian of log-likelihood.

Relation between Fisher and KL-divergence

With the conclusion above, we can move on to this interesting property: Fisher Information Matrix defines the local curvature in distribution space for which KL-divergence is the metric. Note that there are two components here: (1) local curvature (Hessian). (2) for which KL-divergence is the metric (KL between two distributions).

We start with checking the KL of two distributions $p(x|\theta)$ and $p(x|\theta')$ (assume that $\theta'$) is very close to $\theta$ since we are interested in local curvature.

\begin{align*} KL[p(x|\theta)||p(x|\theta')] &= E_{p(x|\theta)} [log \ p(x|\theta)] - E_{p(x|\theta)} [log \ p(x|\theta')] \end{align*}

and take the first derivative w.r.t. $\theta'$:

\begin{align*} \nabla_{\theta'} KL[p(x|\theta)||p(x|\theta')] &= \nabla_{\theta'} E_{p(x|\theta)} [log \ p(x|\theta)] - \nabla_{\theta'} E_{p(x|\theta)} [log \ p(x|\theta')] \\ &= - E_{p(x|\theta)} [\nabla_{\theta'} log \ p(x|\theta')] \\ &= - \intop\nolimits_{x} p(x|\theta) \nabla_{\theta'} log \ p(x|\theta')dx \end{align*}

the first line comes from the fact that KL can be decomposed as entropy and cross entropy. Then, we take the second derivative w.r.t. $\theta'$:

\begin{align*} \nabla_{\theta'}^2 KL[p(x|\theta)||p(x|\theta')] &= - \intop\nolimits_{x} p(x|\theta) \nabla_{\theta'}^2 log \ p(x|\theta')dx \end{align*}

Thus, we have Hessian w.r.t. $\theta'$ evaluate at $\theta'=\theta$ (the expectation over samples from $p(x|\theta)$. In practice, we usually only have the samples from the current parameters $\theta$, instead of future ones $\theta'$.):

\begin{align*} H_{KL[p(x|\theta)||p(x|｀\theta')} &= \nabla_{\theta'}^2 KL[p(x|\theta)||p(x|\theta')] \\ &= - \intop\nolimits_{x} p(x|\theta) \nabla_{\theta'}^2 log \ p(x|\theta')|_{\theta'=\theta} dx \\ &= - \intop\nolimits_{x} p(x|\theta) H_{log \ p(x|\theta)} dx \\ &= - E_{p(x|\theta)} [H_{log \ p(x|\theta)}] \\ &= I_{\theta} \end{align*}

Really interesting, right?

Reference:

• https://wiseodd.github.io/techblog/2018/03/11/fisher-information/