The Best Gamma-Poisson Parameterization imo.
I’ve been studying Bayesian models for count data using Gamma-Poisson distributions. The objective was simple: use a neural network to predict the parameters of a Gamma distribution from auxiliary features, then use that Gamma distribution as the prior on the Poisson rate of the count data. The issue arises when you try to actually implement it. Every naive reparameterization I tried (using torch.log, adding epsilons everywhere, normalizing the count data, etc.) didn’t work, and training a neural network always resulted in exploding gradients. I needed to understand the Gamma-Poisson distribution better.
First of all, documentation on Gamma distributions is messy: not only are there different parameterizations (scale vs. rate), but people sometimes use the same symbols for different parameterizations and don’t tell you which one they’re using! Not to mention some references just get it wrong and the formulas aren’t even correct (Look at this. Here $\alpha$ and $\beta$ should be switched. yikes).
Let’s develop a good parameterization from first principles. It turns out that, besides scale or rate, there’s a quirky third parameterization that’s far superior to all others (in my opinion). Not sure if it’s been done before, but I haven’t seen it anywhere, so I’m posting it here. To get at it, let’s first write down the pmf of the Gamma-Poisson distribution. It’s just a mixture of Poisson distributions with rate $\lambda$ drawn from a Gamma distribution with parameters (shape = $\alpha$) and (rate = $\beta$):
$f_{Poisson}(x | \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$
$f_{Gamma}(\lambda | \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\lambda^{\alpha - 1}e^{-\beta\lambda}$
$f_{GammaPoi}(x) = \int_{0}^{\infty}{f_{Poisson}(x | \lambda)f_{Gamma}(\lambda | \alpha, \beta) d\lambda}$
$\quad = \int_{0}^{\infty}{\frac{e^{-\lambda}\lambda^{x}}{x!}\frac{\beta^{\alpha}}{\Gamma(\alpha)}\lambda^{\alpha - 1}e^{-\beta\lambda} d\lambda}$
$\quad = \frac{\beta^{\alpha}}{x!\Gamma(\alpha)}\int_{0}^{\infty}{\lambda^{x + \alpha - 1}e^{-(1 + \beta)\lambda} d\lambda}$
Notice here that the integrand looks almost like a Gamma density, but with different parameters (shape = $x + \alpha$) and (rate = $1 + \beta$), and it’s missing the normalizing constant. Since a probability density must integrate to 1, the integral evaluates to exactly that constant, $\frac{\Gamma(x + \alpha)}{(1 + \beta)^{x + \alpha}}$, which gives:
$f_{GammaPoi}(x) = \frac{\beta^{\alpha}}{x!\Gamma(\alpha)}\frac{\Gamma(x + \alpha)}{(1 + \beta)^{x + \alpha}}$
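As a quick sanity check of this closed form (a throwaway snippet with made-up numbers, assuming numpy and scipy are on hand; the variable names are just mine), we can compare it against direct numerical integration of the Poisson-Gamma mixture:

import numpy as np
from scipy import integrate, special, stats

alpha, beta = 2.5, 1.3   # arbitrary example shape and rate
x = 4                    # an example count

# closed form: beta^alpha * Gamma(x + alpha) / (x! * Gamma(alpha) * (1 + beta)^(x + alpha))
closed_form = np.exp(
    alpha * np.log(beta)
    + special.gammaln(x + alpha)
    - special.gammaln(x + 1)
    - special.gammaln(alpha)
    - (x + alpha) * np.log(1.0 + beta)
)

# numerically integrate Poisson(x | lambda) * Gamma(lambda | shape=alpha, rate=beta) over lambda
mixture, _ = integrate.quad(
    lambda lam: stats.poisson.pmf(x, lam) * stats.gamma.pdf(lam, a=alpha, scale=1.0 / beta),
    0.0, np.inf,
)

print(closed_form, mixture)  # the two numbers should agree to several decimal places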
Rearranging a little bit, we get:
$f_{GammaPoi}(x) = \frac{\Gamma(x + \alpha)}{x!\Gamma(\alpha)}\left(\frac{\beta}{1 + \beta}\right)^{\alpha}\left(\frac{1}{1 + \beta}\right)^{x}$
Notice that I separated the $\beta$ and $1 + \beta$ terms into two factors. I did this to show that the Gamma-Poisson is very closely related to the so-called Negative Binomial distribution. The Negative Binomial distribution is defined as:
$f_{NegBin}(x) = \frac{\Gamma(x + r)}{x!\Gamma(r)}p^{r}(1 - p)^{x}$
We can see that the Negative Binomial and Gamma-Poisson distributions are related via $\alpha = r$ and $\frac{\beta}{1 + \beta} = p$. This is a way better parameterization than the usual scale or rate parameterization. As we will see, it allows us to use Binary Cross Entropy to compute the log-likelihood, a function which has been heavily optimized in many libraries and is very numerically stable. So, let’s parameterize $p$ in terms of a logit $z$, where $\sigma(z)$ is the sigmoid of $z$:
$\sigma(z) = p = \frac{\beta}{1 + \beta}$
Then we can rewrite the Gamma-Poisson distribution as:
$f_{GammaPoi}(x) = \frac{\Gamma(x + \alpha)}{x!\Gamma(\alpha)}\sigma(z)^{\alpha}(1 - \sigma(z))^{x}$
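Another quick sanity check (again with made-up numbers, assuming scipy is available): this logit form should match scipy’s nbinom, whose pmf is $\binom{x + r - 1}{x} p^{r} (1 - p)^{x}$, with $r = \alpha$ and $p = \sigma(z)$:

import numpy as np
from scipy import special, stats

alpha, z = 2.5, 0.7      # arbitrary example shape and logit
x = np.arange(6)         # a few example counts

p = 1.0 / (1.0 + np.exp(-z))   # sigma(z) = beta / (1 + beta)

# logit-parameterized Gamma-Poisson pmf from the equation above
gamma_poi = np.exp(
    special.gammaln(x + alpha)
    - special.gammaln(x + 1)
    - special.gammaln(alpha)
    + alpha * np.log(p)
    + x * np.log(1.0 - p)
)

# scipy's Negative Binomial with r = alpha and success probability p = sigma(z)
neg_bin = stats.nbinom.pmf(x, alpha, p)

print(np.allclose(gamma_poi, neg_bin))  # True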
The log-likelihood of this distribution is:
$\log{f_{GammaPoi}(x)} = - \log{x!} + \log{\Gamma(x + \alpha)} - \log{\Gamma(\alpha)}$
$+ \alpha\log{\sigma(z)} + x\log{(1 - \sigma(z))}$
Notice that the last two terms look like Binary Cross Entropy (BCE)!
$\alpha\log{\sigma(z)} + x\log{(1 - \sigma(z))}$
Specifically, after further reparameterizing the shape $\alpha$ as log shape $\log(\alpha)$, the input logits, targets, and weights to the BCE are:
Logits: $z$
Targets: $\frac{\alpha}{x + \alpha} = \text{Softmax}(\log(\alpha), \log(x))[0]$
Weights: $x + \alpha$
We can just use the Binary Cross Entropy loss function BCEWithLogitsLoss for half of the Gamma-Poisson likelihood computation:
$\alpha\log{\sigma(z)} + x\log{(1 - \sigma(z))} = -(x + \alpha)\,\text{BCEWithLogitsLoss}\!\left(z, \tfrac{\alpha}{x + \alpha}\right)$
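Here’s a minimal check of that identity in PyTorch, with made-up numbers:

import torch
import torch.nn.functional as F

alpha = torch.tensor([2.5])   # shape
z = torch.tensor([0.7])       # logit
x = torch.tensor([4.0])       # observed count

# left-hand side: alpha * log(sigma(z)) + x * log(1 - sigma(z)), written with logsigmoid
lhs = alpha * F.logsigmoid(z) + x * F.logsigmoid(-z)

# right-hand side: -(x + alpha) * BCEWithLogitsLoss(z, alpha / (x + alpha))
rhs = -(x + alpha) * F.binary_cross_entropy_with_logits(z, alpha / (x + alpha))

print(torch.allclose(lhs, rhs))  # True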
Meanwhile, the other half can just be computed using torch.lgamma() functions. The full Gamma-Poisson likelihood computation is:
import torch
import torch.nn.functional as F

def gamma_poisson_loss(log_shape, logits, value):
    eps = 1e-8
    concentration = torch.exp(log_shape)    # alpha = exp(log alpha) > 0
    log_value = torch.log(value + eps)      # eps keeps log(0) finite when the count is 0
    # BCE targets: alpha / (x + alpha) = Softmax([log alpha, log x])[0]
    targets = torch.softmax(
        torch.stack([log_shape, log_value], dim=-1),
        dim=-1,
    )[..., 0]
    # BCE weights: x + alpha
    weights = value + concentration
    log_likelihood = (
        torch.lgamma(concentration + value + eps)   # log Gamma(x + alpha)
        - torch.lgamma(concentration + eps)         # - log Gamma(alpha)
        - torch.lgamma(value + 1.0)                 # - log x!  (constant w.r.t. the parameters)
        # alpha * log(sigma(z)) + x * log(1 - sigma(z)), via the BCE identity above
        - weights * F.binary_cross_entropy_with_logits(
            input=logits,
            target=targets,
            reduction='none',
        )
    )
    return -log_likelihood.mean()
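For example (a made-up toy setup with a hypothetical linear head and random data, reusing gamma_poisson_loss from above), a network can output log_shape and logits directly, and gradients flow through the loss without trouble:

import torch
import torch.nn as nn

# toy head: maps 8 auxiliary features to (log_shape, logits)
head = nn.Linear(8, 2)
features = torch.randn(32, 8)                    # fake batch of auxiliary features
counts = torch.poisson(torch.full((32,), 3.0))   # fake count data

log_shape, logits = head(features).unbind(dim=-1)
loss = gamma_poisson_loss(log_shape, logits, counts)
loss.backward()

print(loss.item(), head.weight.grad.abs().max().item())  # finite loss and gradients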
It’s also nice because to get the rate, we can just exponentiate the logits:
$\beta = \frac{p}{1 - p} = \frac{\sigma(z)}{1 - \sigma(z)} = \frac{1}{1 + e^{-z}} \frac{1 + e^{-z}}{e^{-z}} = e^{z}$
In conclusion, we parameterized the Gamma-Poisson distribution using logits and a log shape. These parameters are unrestricted, so we can ask a neural network to output logits $z$ and $\log(\alpha)$ for the Gamma-Poisson distribution. We then use Binary Cross Entropy for part of the likelihood computation, and torch.lgamma() functions for the rest. Voila! No more exploding gradients!
-Randy