ELBO
Given a probability density \(p(x)\) and a latent variable \(z\), the marginal density is obtained by integrating the joint density over the latent variable,

$$p(x) = \int p(x, z)\, \mathrm{d}z.$$
Using Jensen's Inequality
In many models, we are interested in the log evidence \(\log p(x)\), which can be rewritten using an auxiliary density \(q(z)\) over the latent variable,

$$\log p(x) = \log \int p(x, z)\, \mathrm{d}z = \log \int q(z) \frac{p(x, z)}{q(z)}\, \mathrm{d}z.$$
Jensen's Inequality
Jensen's inequality shows that^{1}

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

for a concave function \(f\). In particular,

$$\log \mathbb{E}[X] \geq \mathbb{E}[\log X],$$

as \(\log\) is a concave function.
Applying Jensen's inequality to \(\log p(x) = \log \mathbb{E}_{q(z)}\left[ p(x, z)/q(z) \right]\),

$$\log p(x) \geq \int q(z) \log \frac{p(x, z)}{q(z)}\, \mathrm{d}z = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)].$$
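Jensen's inequality for the concave \(\log\) is easy to check numerically. The sketch below (an illustrative example, not from the text) draws positive samples and compares the log of the mean with the mean of the logs:

```python
import numpy as np

# Numerical sanity check of Jensen's inequality for the concave log:
# log E[X] >= E[log X] for a positive random variable X.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=100_000)

lhs = np.log(np.mean(x))   # log of the expectation
rhs = np.mean(np.log(x))   # expectation of the log
assert lhs >= rhs
```

The inequality is strict here because \(X\) is not constant; the gap only closes when the random variable is degenerate.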
Using the definitions of entropy and cross entropy, we know that

$$H[q] = -\int q(z) \log q(z)\, \mathrm{d}z = -\mathbb{E}_{q(z)}[\log q(z)]$$

is the entropy of \(q(z)\), and

$$H[q, p] = -\int q(z) \log p(x, z)\, \mathrm{d}z = -\mathbb{E}_{q(z)}[\log p(x, z)]$$

is the cross entropy. We define

$$L = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] = H[q] - H[q, p],$$

which is called the evidence lower bound (ELBO). It is a lower bound because

$$\log p(x) \geq L.$$
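The bound can be checked by Monte Carlo on a model where the evidence is known in closed form. The sketch below is an illustrative assumption (a conjugate Gaussian model and an arbitrary \(q\), not from the text): with prior \(z \sim \mathcal{N}(0, 1)\) and likelihood \(x \mid z \sim \mathcal{N}(z, 1)\), the exact evidence is \(p(x) = \mathcal{N}(x; 0, 2)\), and the ELBO estimated from samples of \(q(z)\) stays below it:

```python
import numpy as np

def log_normal(x, mean, var):
    # Log density of N(mean, var).
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

rng = np.random.default_rng(0)

# Illustrative model (an assumption, not from the text):
# prior z ~ N(0, 1), likelihood x | z ~ N(z, 1),
# so the exact evidence is p(x) = N(x; 0, 2).
x_obs = 1.0
log_evidence = log_normal(x_obs, 0.0, 2.0)

# A deliberately imperfect variational density q(z) = N(0.3, 1.0);
# the true posterior here is N(0.5, 0.5).
mu_q, var_q = 0.3, 1.0
z = rng.normal(mu_q, np.sqrt(var_q), size=200_000)

log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)  # log p(x, z)
log_q = log_normal(z, mu_q, var_q)                               # log q(z)
elbo = np.mean(log_joint - log_q)  # L = E_q[log p(x, z)] - E_q[log q(z)]

assert elbo <= log_evidence
```

Because \(q\) differs from the true posterior, the estimated \(L\) sits strictly below \(\log p(x)\); making \(q\) closer to the posterior tightens the bound.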
Using KL Divergence
In a latent variable model, we need the posterior \(p(z\mid x)\). When this is intractable, we find an approximation \(q(z\mid\theta)\), where \(\theta\) denotes the parameters, e.g., neural network parameters. To make sure we have a good approximation of the posterior, we require the KL divergence between \(q(z\mid\theta)\) and \(p(z\mid x)\) to be small. The KL divergence in this situation is^{2}

$$\begin{aligned}
\operatorname{D}_{\text{KL}}(q(z\mid\theta) \parallel p(z\mid x)) &= \mathbb{E}_{q(z\mid\theta)}[\log q(z\mid\theta)] - \mathbb{E}_{q(z\mid\theta)}[\log p(z\mid x)] \\
&= \mathbb{E}_{q(z\mid\theta)}[\log q(z\mid\theta)] - \mathbb{E}_{q(z\mid\theta)}[\log p(x, z)] + \log p(x) \\
&= \log p(x) - L,
\end{aligned}$$

where we used \(p(z\mid x) = p(x, z)/p(x)\).
Since \(\operatorname{D}_{\text{KL}}(q(z\mid\theta)\parallel p(z\mid x))\geq 0\), we have

$$\log p(x) \geq L,$$

which again shows that \(L\) is a lower bound of \(\log p(x)\).
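For a discrete latent variable, the identity \(\log p(x) = L + \operatorname{D}_{\text{KL}}(q \parallel p(z\mid x))\) can be verified exactly. The sketch below uses an illustrative three-state toy model (all numbers are assumptions for demonstration):

```python
import numpy as np

# Discrete toy model: latent z in {0, 1, 2}, one fixed observation x.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.6, 0.3])  # likelihood p(x | z) at the observed x

p_xz = p_z * p_x_given_z                 # joint p(x, z)
p_x = p_xz.sum()                         # evidence p(x), marginalizing over z
post = p_xz / p_x                        # posterior p(z | x)

q = np.array([0.4, 0.4, 0.2])            # an arbitrary approximation q(z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))   # L = E_q[log p(x,z)] - E_q[log q]
kl = np.sum(q * (np.log(q) - np.log(post)))     # D_KL(q || p(z|x))

# The decomposition log p(x) = L + D_KL(q || p(z|x)) holds exactly.
assert np.isclose(np.log(p_x), elbo + kl)
```

Here the sums replace the integrals of the continuous case, and the identity holds for any valid \(q\), not just a good one.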
Jensen gap
The difference between \(\log p(x)\) and \(L\) is the Jensen gap, i.e.,

$$\log p(x) - L = \operatorname{D}_{\text{KL}}(q(z\mid\theta) \parallel p(z\mid x)).$$

Contributors to Wikimedia projects. Jensen’s inequality. In: Wikipedia [Internet]. 27 Aug 2021 [cited 5 Sep 2021]. Available: https://en.wikipedia.org/wiki/Jensen%27s_inequality ↩

Yang X. Understanding the Variational Lower Bound. 14 Apr 2017 [cited 5 Sep 2021]. Available: https://xyang35.github.io/2017/04/14/variationallowerbound/ ↩