Kullback–Leibler Divergence (KLD)
The Kullback–Leibler divergence (KLD), also called relative entropy, measures how much one probability distribution differs from another. For two probability distributions \(P\) and \(Q\) over a set \(\mathcal{X}\):
\[\operatorname{KLD}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} = H_Q(P) - H(P)\]
Here, \(H_Q(P)\) is the cross-entropy and \(H(P)\) is the entropy of \(P\). The KL divergence can be interpreted as the extra surprise incurred by wrongly assuming distribution \(Q\) in a model when the true distribution is \(P\). Although it behaves somewhat like a distance between the two probability distributions, it is not a distance metric in the mathematical sense: it is not symmetric and does not satisfy the triangle inequality.
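The lack of symmetry can be checked numerically. The following is a minimal sketch, not the library's implementation: it sums \(P(x) \log \frac{P(x)}{Q(x)}\) over aligned probability lists using only the standard library. The helper name `kl_divergence` and the two example distributions are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KLD(P || Q) in nats for two discrete distributions given as aligned probability lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # outcomes with P(x) = 0 contribute nothing
        if qx == 0.0:
            return math.inf  # Q rules out an outcome that P can produce
        total += px * math.log(px / qx)
    return total


# Illustrative distributions over three outcomes (not taken from the text).
P = [0.5, 0.4, 0.1]
Q = [0.8, 0.1, 0.1]

forward = kl_divergence(P, Q)
reverse = kl_divergence(Q, P)

print(forward, reverse)      # the two directions give different values
print(kl_divergence(P, P))   # identical distributions give 0.0
```

Swapping the arguments changes the result, which is why the order in \(\operatorname{KLD}(P \parallel Q)\) matters.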
import infomeasure as im
p = [6, 3, 1, 3, 8, 1, 2, 9, 7, 7, 3, 7, 3, 3, 5, 7, 7, 3, 3, 5]
q = [2, 1, 6, 6, 3, 3, 6, 5, 3, 1, 7, 9, 3, 3, 1, 5, 4, 6, 6, 1]
im.kullback_leiber_divergence(p, q, approach='discrete')
np.float64(0.22055153357055035)
Understanding Kullback-Leibler Divergence
Imagine you are at a techno club, and there are two DJs:
P: The DJ playing your favorite tracks (your true preference).
Q: The DJ playing tracks they think you’ll enjoy (their estimate of your preference).
Kullback-Leibler Divergence (KLD) quantifies the “extra effort” or “surprise” caused by dancing to DJ Q’s playlist instead of your ideal playlist. It measures how much harder it is to vibe to DJ Q’s tracks compared to what you’d naturally enjoy.
This differs from cross-entropy \(H_Q(P)\), which includes both:
The baseline effort to dance to your favorite music (the inherent uncertainty of P), and
The mismatch between DJ Q’s playlist and your true taste (P).
KLD isolates just the extra effort caused by the mismatch, removing the baseline uncertainty of P. In other words, it focuses purely on how poorly DJ Q’s playlist aligns with your ideal taste. In summary:
Cross-Entropy \(H_Q(P)\): The total energy required to dance to DJ Q’s music.
\(\operatorname{KLD}(P \parallel Q)\): The additional burden of dancing to DJ Q’s playlist compared to your perfect playlist.
This is why KLD is often used in information theory and machine learning to evaluate how well an estimated distribution (Q) represents the true data distribution (P).
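The split into baseline effort and extra effort can be made concrete with a small sketch. It assumes two hypothetical playlists described as probabilities over three track styles; the numbers and the helper names `entropy` and `cross_entropy` are illustrative and are not part of the library.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0.0)


def cross_entropy(p, q):
    """Cross-entropy H_Q(P) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)


# Hypothetical playlists: share of three track styles (illustrative numbers only).
taste_p = [0.6, 0.3, 0.1]  # P: what you actually want to hear
guess_q = [0.3, 0.3, 0.4]  # Q: what the DJ thinks you want

baseline = entropy(taste_p)              # inherent uncertainty of P
total = cross_entropy(taste_p, guess_q)  # total effort under DJ Q
extra = total - baseline                 # KLD(P || Q): effort due to the mismatch only

print(f"H(P) = {baseline:.4f}, H_Q(P) = {total:.4f}, KLD = {extra:.4f}")
```

If DJ Q matched your taste exactly, the cross-entropy would equal the entropy and the extra effort would vanish.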
Supported Estimators
For all supported estimators, KLD is computed from the cross-entropy:
\[\operatorname{KLD}(P \parallel Q) = H_Q(P) - H(P)\]
where \(H_Q(P)\) is the cross-entropy and \(H(P)\) is the entropy of distribution \(P\).
The following estimators support cross-entropy and thus KLD:
Basic Estimators: discrete (MLE), miller_madow
Bayesian: bayes (with multiple priors)
Kernel: kernel (with various kernel types)
Ordinal: ordinal (for time series analysis)
Metric/KL: metric or kl (Kozachenko-Leonenko)
Generalized Entropies: renyi, tsallis
(im.kld(p, q, approach='kernel', kernel='box', bandwidth=3),
im.kld(p, q, approach='kernel', kernel='gaussian', bandwidth=2))
(np.float64(0.03340367715039161), np.float64(-0.03741375373620448))
im.kld(p, q, approach='metric') # or 'kl'
np.float64(2.1755292233102637)
(im.kld(p, q, approach='ordinal', embedding_dim=2),
im.kld(p, q, approach='ordinal', embedding_dim=3),
im.kld(p, q, approach='ordinal', embedding_dim=4))
(np.float64(4.1430365193662055e-05),
np.float64(0.0777691885832359),
np.float64(-1.4014672742902516))
(im.kld(p, q, approach='renyi', alpha=0.8),
im.kld(p, q, approach='tsallis', q=0.9))
(np.float64(0.9160952514825628), np.float64(1.191986052941051))
im.kld(p, q, approach='miller_madow') # or 'mm'
np.float64(0.22055153357055013)
im.kld(p, q, approach='bayes', alpha="laplace")
np.float64(-0.004095921294350324)