Kozachenko-Leonenko (KL) / Metric / kNN Entropy Estimation#
The Shannon differential entropy formula for a continuous random variable \( X \) with density \( p(x)\) is given as [Sha48]:
where \(p(x)\) is the probability density function (pdf).
The Kozachenko-Leonenko (KL) entropy estimator leverages a nearest-neighbor approach to estimate the Shannon entropy of a continuous random variable from a finite sample. The estimator approximates entropy as the expectation of the logarithm of the density. Given a sample \( \{x_1, x_2, \dots, x_N\} \), the density at each point \( x_i \) is estimated using the distance \( \epsilon(i) \) to its \( k \)-th nearest neighbor. Assuming local uniformity, the estimated density suffices \( \widehat{p}(x_i) \approx c_d \epsilon(i)^d \), where \( c_d \) is the volume of a unit \( d \)-dimensional ball. By leveraging order statistics, the expectation \( E(\log p_i) = \psi(k) - \psi(N) \) is obtained, where \( \psi(x) \) is the digamma function. Substituting this into the entropy definition leads to the final KL estimator [HSPVB07, KL87, KStogbauerG11]:
where:
\(\psi\) is the digamma function, the derivative of the logarithm of the gamma function \(\Gamma(x)\),
\(k\) is the number of nearest neighbors,
\(\epsilon_i\) is twice the distance from \(x_i\) to its \(k^{th}\) nearest neighbor, representing the diameter of the hypersphere encompassing the \(k\) neighbors,
\(c_d\) is the volume of the unit ball in \(d\)-dimensional space, where \(\log c_d = 0\) for the maximum norm and \(c_d = \pi^{d/2} / (\Gamma(1 + d/2) \cdot 2^d)\) for Euclidean spaces.
For demonstration, we generate a dataset of normally distributed values with mean \(0\) and standard deviation \(1\). We then calculate the entropy using the box kernel with a bandwidth of \(0.5\). The analytical expected values can be calculated with
where \(\sigma^2\) is the variance of the data.
import infomeasure as im
import numpy as np
rng = np.random.default_rng(692475)
std = 1.0
data = rng.normal(loc=0, scale=std, size=2000)
h = im.entropy(data, approach="metric", k=4)
h_expected = (1 / 2) * np.log(2 * np.pi * np.e * std ** 2)
h, h_expected
(np.float64(1.4233444309368766), np.float64(1.4189385332046727))
To access the local values, an estimator instance is needed.
est = im.estimator(data, measure="h", approach="metric", k=4)
est.result(), est.local_vals()
(np.float64(1.4233444320506785),
array([0.56276484, 1.6673356 , 0.18937589, ..., 1.80094728, 2.09305939,
1.81389835], shape=(2000,)))
Higher-dimensional data also is easily processed.
im.entropy(rng.normal(loc=0, scale=1, size=(2000, 3)), approach="metric", k=4)
np.float64(4.237377089947204)
The estimator is implemented in the KozachenkoLeonenkoEntropyEstimator class,
which is part of the im.measures.entropy module.
- class infomeasure.estimators.entropy.kozachenko_leonenko.KozachenkoLeonenkoEntropyEstimator(*data, k: int = 4, noise_level=1e-10, minkowski_p=inf, base: int | float | str = 'e')[source]
Bases:
RandomGeneratorMixin,EntropyEstimatorKozachenko-Leonenko estimator for Shannon entropies.
- Attributes:
- *dataarray_like
The data used to estimate the entropy.
- k
int The number of nearest neighbors to consider.
- noise_level
float The standard deviation of the Gaussian noise to add to the data to avoid issues with zero distances.
- minkowski_p
float, \(1 \leq p \leq \infty\) The power parameter for the Minkowski metric. Default is np.inf for maximum norm. Use 2 for Euclidean distance.
- Raises:
ValueErrorIf the number of nearest neighbors is not a positive integer
ValueErrorIf the noise level is negative
ValueErrorIf the Minkowski power parameter is invalid
Notes
Changing the number of nearest neighbors
kcan change the outcome, but the default value of \(k=4\) is recommended by [KStogbauerG11].