Kozachenko-Leonenko (KL) / Metric / kNN Entropy Estimation

Kozachenko-Leonenko (KL) / Metric / kNN Entropy Estimation#

The Shannon differential entropy formula for a continuous random variable \( X \) with density \( p(x)\) is given as [Sha48]:

\[ H(X) = -\int_{X} p(x) \log p(x) \, dx, \]

where \(p(x)\) is the probability density function (pdf).

The Kozachenko-Leonenko (KL) entropy estimator leverages a nearest-neighbor approach to estimate the Shannon entropy of a continuous random variable from a finite sample. The estimator approximates entropy as the expectation of the logarithm of the density. Given a sample \( \{x_1, x_2, \dots, x_N\} \), the density at each point \( x_i \) is estimated using the distance \( \epsilon(i) \) to its \( k \)-th nearest neighbor. Assuming local uniformity, the estimated density suffices \( \widehat{p}(x_i) \approx c_d \epsilon(i)^d \), where \( c_d \) is the volume of a unit \( d \)-dimensional ball. By leveraging order statistics, the expectation \( E(\log p_i) = \psi(k) - \psi(N) \) is obtained, where \( \psi(x) \) is the digamma function. Substituting this into the entropy definition leads to the final KL estimator [HSPVB07, KL87, KStogbauerG11]:

\[ \hat{H}(X) = - \psi(k) + \psi(N) + \log c_d + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon(i), \]

where:

\(\psi\) is the digamma function, the derivative of the logarithm of the gamma function \(\Gamma(x)\),
\(k\) is the number of nearest neighbors,
\(\epsilon_i\) is twice the distance from \(x_i\) to its \(k^{th}\) nearest neighbor, representing the diameter of the hypersphere encompassing the \(k\) neighbors,
\(c_d\) is the volume of the unit ball in \(d\)-dimensional space, where \(\log c_d = 0\) for the maximum norm and \(c_d = \pi^{d/2} / (\Gamma(1 + d/2) \cdot 2^d)\) for Euclidean spaces.

For demonstration, we generate a dataset of normally distributed values with mean \(0\) and standard deviation \(1\). We then calculate the entropy using the box kernel with a bandwidth of \(0.5\). The analytical expected values can be calculated with

\[ H(X) = \frac{1}{2} \log(2\pi e \sigma^2), \]

where \(\sigma^2\) is the variance of the data.

import infomeasure as im
import numpy as np
rng = np.random.default_rng(692475)

std = 1.0
data = rng.normal(loc=0, scale=std, size=2000)

h = im.entropy(data, approach="metric", k=4)
h_expected = (1 / 2) * np.log(2 * np.pi * np.e * std ** 2)
h, h_expected

(np.float64(1.4233444309368766), np.float64(1.4189385332046727))

To access the local values, an estimator instance is needed.

est = im.estimator(data, measure="h", approach="metric", k=4)
est.result(), est.local_vals()

(np.float64(1.4233444320506785),
 array([0.56276484, 1.6673356 , 0.18937589, ..., 1.80094728, 2.09305939,
        1.81389835], shape=(2000,)))

Higher-dimensional data also is easily processed.

im.entropy(rng.normal(loc=0, scale=1, size=(2000, 3)), approach="metric", k=4)

np.float64(4.237377089947204)

The estimator is implemented in the KozachenkoLeonenkoEntropyEstimator class, which is part of the im.measures.entropy module.

class infomeasure.estimators.entropy.kozachenko_leonenko.KozachenkoLeonenkoEntropyEstimator(*data, k: int = 4, noise_level=1e-10, minkowski_p=inf, base: int | float | str = 'e')[source]

Bases: RandomGeneratorMixin, EntropyEstimator

Kozachenko-Leonenko estimator for Shannon entropies.

Attributes:

*dataarray_like: The data used to estimate the entropy.
kint: The number of nearest neighbors to consider.
noise_levelfloat: The standard deviation of the Gaussian noise to add to the data to avoid issues with zero distances.
minkowski_pfloat, \(1 \leq p \leq \infty\): The power parameter for the Minkowski metric. Default is np.inf for maximum norm. Use 2 for Euclidean distance.

Raises:

ValueError: If the number of nearest neighbors is not a positive integer
ValueError: If the noise level is negative
ValueError: If the Minkowski power parameter is invalid

Notes

Changing the number of nearest neighbors k can change the outcome, but the default value of \(k=4\) is recommended by [KStogbauerG11].