Jensen–Shannon Divergence (JSD)

Building on the Kullback–Leibler divergence (KLD), the Jensen–Shannon divergence (JSD) was proposed as a measure of similarity between probability distributions [Lin91]. It can be viewed as a symmetrized version of the K-L divergence:

\[ \mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M) \]

where \(M = \frac{1}{2} (P + Q)\) is the mixture distribution of \(P\) and \(Q\), and \(D(P \parallel M)\) is the K-L divergence of \(P\) from \(M\).
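The definition can be sketched in a few lines of plain Python. This is only an illustration for discrete distributions given as probability lists; the helper names `kl_divergence` and `jsd` are hypothetical and are not part of any library.

```python
import math

def kl_divergence(p, q, base=math.e):
    """K-L divergence D(P || Q) for discrete distributions given as lists."""
    # Terms with p_i = 0 contribute nothing (convention 0 * log 0 = 0).
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=math.e):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

P = [0.5, 0.5, 0.0]
Q = [0.0, 0.5, 0.5]

print(jsd(P, Q, base=2))  # approximately 0.5 (bits)
print(math.isclose(jsd(P, Q), jsd(Q, P)))  # symmetric in its arguments
```

Because \(M\) is positive wherever \(P\) or \(Q\) is positive, the ratio inside the logarithm is always defined, which is not true of the plain K-L divergence.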

The same quantity can be expressed in terms of the Shannon entropy \(H\) as follows:

\[ \mathrm{JSD}(P \parallel Q) = H\left(\frac{P + Q}{2} \right) - \frac{1}{2} H(P) - \frac{1}{2} H(Q). \]
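The entropy form follows from the definition in a few lines. As a sketch for the discrete case, with probabilities \(p_x\), \(q_x\) and \(m_x = \frac{1}{2}(p_x + q_x)\), expanding each K-L term gives:

\[ \begin{aligned} D(P \parallel M) &= \sum_x p_x \log \frac{p_x}{m_x} = -H(P) - \sum_x p_x \log m_x, \\ D(Q \parallel M) &= \sum_x q_x \log \frac{q_x}{m_x} = -H(Q) - \sum_x q_x \log m_x, \\ \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M) &= -\sum_x m_x \log m_x - \frac{1}{2} H(P) - \frac{1}{2} H(Q) = H(M) - \frac{1}{2} H(P) - \frac{1}{2} H(Q). \end{aligned} \]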

Let \(P_1, P_2, \dots, P_n\) be \(n\) probability distributions with weights \(\pi_1, \pi_2, \dots, \pi_n\), respectively. The generalized Jensen–Shannon divergence is defined as:

\[ \mathrm{JSD}_{\pi}(P_1, P_2, \dots, P_n) = H\left( \sum_{i=1}^{n} \pi_i P_i \right) - \sum_{i=1}^{n} \pi_i H(P_i), \]

where \(\pi = (\pi_1, \pi_2, \dots, \pi_n)\) and \(\sum_{i=1}^{n} \pi_i = 1\). For two probability distributions \(P_1 = P\), \(P_2 = Q\) with equal weights \(\pi_1 = \pi_2 = 1/2\), we recover the expression we started with. Hence, the Jensen–Shannon divergence can also be understood as the difference between the entropy of the averaged distribution and the average of the entropies.
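The weighted form translates directly into code. The following is a minimal sketch for discrete distributions on a shared support; `entropy` and `generalized_jsd` are hypothetical helper names chosen for this illustration.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def generalized_jsd(dists, weights, base=math.e):
    """Entropy of the weighted mixture minus the weighted mean entropy."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    support_size = len(dists[0])
    mixture = [
        sum(w * d[i] for w, d in zip(weights, dists))
        for i in range(support_size)
    ]
    mean_entropy = sum(w * entropy(d, base) for w, d in zip(weights, dists))
    return entropy(mixture, base) - mean_entropy

P = [0.5, 0.5, 0.0]
Q = [0.0, 0.5, 0.5]
R = [0.2, 0.3, 0.5]

# Two distributions with equal weights: the ordinary JSD.
print(generalized_jsd([P, Q], [0.5, 0.5], base=2))  # approximately 0.5

# Three distributions with unequal weights.
print(generalized_jsd([P, Q, R], [0.5, 0.25, 0.25], base=2))
```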

Although the JSD is often described as a similarity measure, it quantifies dissimilarity: it is zero for identical distributions, and the larger its value, the less similar the distributions are.

Bound of Jensen–Shannon divergence

The Jensen–Shannon divergence is bounded above by \(\log_b (n)\), where \(b\) is the base of the logarithm and \(n\) is the number of distributions:

\[ 0 \leq \mathrm{JSD}_{\pi_1, \dots, \pi_n} (P_1, P_2, \dots, P_n) \leq \log_b (n) \]
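Both ends of the bound can be checked numerically. The sketch below assumes equal weights and uses hypothetical helper names: identical distributions give zero, while \(n\) distributions with pairwise disjoint supports reach \(\log_b(n)\).

```python
import math

def entropy(p, base):
    """Shannon entropy of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def equal_weight_jsd(dists, base=2):
    """Generalized JSD with weights 1/n for every distribution."""
    n = len(dists)
    mixture = [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
    return entropy(mixture, base) - sum(entropy(d, base) for d in dists) / n

# Lower end: identical distributions have zero divergence.
same = [0.5, 0.25, 0.25]
print(equal_weight_jsd([same, same, same]))

# Upper end: n point masses on n different outcomes (disjoint supports).
for n in (2, 3, 4, 8):
    one_hot = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    print(n, equal_weight_jsd(one_hot, base=2), math.log(n, 2))
```

With base 2 and two distributions, the upper bound is 1 bit.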

The Jensen–Shannon distance is defined as the square root of the Jensen–Shannon divergence, i.e., \(\sqrt{\mathrm{JSD}(P \parallel Q)}\). Unlike the divergence itself, it satisfies the triangle inequality and is therefore a metric [ES03].
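A small numerical example illustrates why the square root matters. In this sketch (hypothetical helper names, base-2 logarithm), the divergence violates the triangle inequality for three simple distributions, while its square root does not. This is a single example, not a proof.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log(pi, 2) for pi in p if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits (entropy form)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the divergence."""
    return math.sqrt(jsd(p, q))

P = [1.0, 0.0]
Q = [0.5, 0.5]
R = [0.0, 1.0]

# Divergence: going "via Q" is shorter than the direct value.
print(jsd(P, R), jsd(P, Q) + jsd(Q, R))

# Distance: the direct value does not exceed the detour via Q.
print(js_distance(P, R), js_distance(P, Q) + js_distance(Q, R))
```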

Supported Estimators

JSD is available for the following estimators: the discrete estimators from before v0.5.0, the Bayesian and shrinkage estimators, and the continuous estimators listed below.

Discrete Estimators:

Maximum Likelihood: discrete
Bayesian: bayes (with multiple priors)
Shrinkage: shrink (James-Stein shrinkage)

Continuous Estimators:

Kernel: kernel (with various kernel types)
Ordinal: ordinal (for time series analysis)

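import infomeasure as im

# Two samples of discrete observations
p = [6, 3, 1, 3, 8, 1, 2, 9, 7, 7, 3, 7, 3, 3, 5, 7, 7, 3, 3, 5]
q = [2, 1, 6, 6, 3, 3, 6, 5, 3, 1, 7, 9, 3, 3, 1, 5, 4, 6, 6, 1]
# JSD between the two samples with the maximum likelihood (discrete) estimator
im.jensen_shannon_divergence(p, q, approach='discrete')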

np.float64(0.12011427674638497)
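To make the number above less opaque, the following sketch recomputes it by hand. It assumes that the discrete approach is a plug-in estimate based on relative frequencies and that the result is in natural-log units (nats). It is an illustration with hypothetical helper names, not the library's implementation.

```python
import math
from collections import Counter

def empirical_distribution(samples, support):
    """Relative frequency of each support value in the sample."""
    counts = Counter(samples)
    return [counts[x] / len(samples) for x in support]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [6, 3, 1, 3, 8, 1, 2, 9, 7, 7, 3, 7, 3, 3, 5, 7, 7, 3, 3, 5]
q = [2, 1, 6, 6, 3, 3, 6, 5, 3, 1, 7, 9, 3, 3, 1, 5, 4, 6, 6, 1]

# Put both samples on a common support before mixing them.
support = sorted(set(p) | set(q))
P = empirical_distribution(p, support)
Q = empirical_distribution(q, support)
M = [(pi + qi) / 2 for pi, qi in zip(P, Q)]

jsd_nats = entropy(M) - 0.5 * entropy(P) - 0.5 * entropy(Q)
print(jsd_nats)  # approximately 0.1201, in line with the output above
```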

The generalized Jensen–Shannon divergence of more than two random variables can be calculated in the same way, here with the shorter im.jsd:

s = [3, 6, 6, 8, 5, 3, 6, 7, 3, 9, 7, 7, 1, 3, 2, 3, 4, 9, 3, 7]
t = [6, 3, 3, 9, 6, 6, 3, 9, 5, 9, 4, 4, 5, 8, 9, 8, 3, 3, 6, 7]
im.jsd(p, q, s, t, approach='discrete')

np.float64(0.15678524224326962)

# Kernel estimator with a box kernel and with a Gaussian kernel
(im.jsd(p, q, approach='kernel', kernel='box', bandwidth=3),
 im.jsd(p, q, approach='kernel', kernel='gaussian', bandwidth=2))

(np.float64(0.012876322263125406), np.float64(0.0007406410081878256))

# Ordinal estimator with embedding dimensions 2, 3, and 4
(im.jsd(p, q, approach='ordinal', embedding_dim=2),
 im.jsd(p, q, approach='ordinal', embedding_dim=3),
 im.jsd(p, q, approach='ordinal', embedding_dim=4))

(np.float64(0.005967818813228387),
 np.float64(0.10597304163022847),
 np.float64(0.3107972572116049))

# Bayesian estimator with alpha=0.9
im.jsd(p, q, approach='bayes', alpha=0.9)

np.float64(0.0954217515332334)

# James-Stein shrinkage estimator
im.jsd(p, q, approach='shrink')

np.float64(0.0847767465266136)
