Kozachenko-Leonenko (KL) / Metric / kNN Entropy Estimation#
The Shannon differential entropy formula for a continuous random variable \( X \) with density \( p(x)\) is given as [Sha48]:
where \(p(x)\) is the probability density function (pdf).
The Kozachenko-Leonenko (KL) entropy estimator leverages a nearest-neighbor approach to estimate the Shannon entropy of a continuous random variable from a finite sample. The estimator approximates entropy as the expectation of the logarithm of the density. Given a sample \( \{x_1, x_2, \dots, x_N\} \), the density at each point \( x_i \) is estimated using the distance \( \epsilon(i) \) to its \( k \)-th nearest neighbor. Assuming local uniformity, the estimated density suffices \( \widehat{p}(x_i) \approx c_d \epsilon(i)^d \), where \( c_d \) is the volume of a unit \( d \)-dimensional ball. By leveraging order statistics, the expectation \( E(\log p_i) = \psi(k) - \psi(N) \) is obtained, where \( \psi(x) \) is the digamma function. Substituting this into the entropy definition leads to the final KL estimator [HSPVB07, KL87, KSG11]:
where:
\(\psi\) is the digamma function, the derivative of the logarithm of the gamma function \(\Gamma(x)\),
\(k\) is the number of nearest neighbors,
\(\epsilon_i\) is twice the distance from \(x_i\) to its \(k^{th}\) nearest neighbor, representing the diameter of the hypersphere encompassing the \(k\) neighbors,
\(c_d\) is the volume of the unit ball in \(d\)-dimensional space, where \(\log c_d = 0\) for the maximum norm and \(c_d = \pi^{d/2} / (\Gamma(1 + d/2) \cdot 2^d)\) for Euclidean spaces.
The package supports two variants of this estimator via the ksg_id parameter:
Type I (
ksg_id=1, default): The standard Kozachenko-Leonenko formula as shown above.Type II (
ksg_id=2): Uses a modified formula where \(\psi(k)\) is replaced by \(\psi(k) - 1/k\). This variant is sometimes used for consistency with KSG Type II mutual information estimators.
For demonstration, we generate a dataset of normally distributed values with mean \(0\) and standard deviation \(1\). We then calculate the entropy using the box kernel with a bandwidth of \(0.5\). The analytical expected values can be calculated with
where \(\sigma^2\) is the variance of the data.
import infomeasure as im
import numpy as np
rng = np.random.default_rng(692475)
std = 1.0
data = rng.normal(loc=0, scale=std, size=2000)
h = im.entropy(data, approach="metric", k=4)
h_expected = (1 / 2) * np.log(2 * np.pi * np.e * std ** 2)
h, h_expected
(np.float64(1.4233444293952129), np.float64(1.4189385332046727))
To access the local values, an estimator instance is needed.
est = im.estimator(data, measure="h", approach="metric", k=4)
est.result(), est.local_vals()
(np.float64(1.4233444311384387),
array([0.56276499, 1.66733558, 0.18937569, ..., 1.80094725, 2.09305941,
1.8138984 ], shape=(2000,)))
Higher-dimensional data also is easily processed.
im.entropy(rng.normal(loc=0, scale=1, size=(2000, 3)), approach="metric", k=4)
np.float64(4.237377089839342)
The estimator is implemented in the KozachenkoLeonenkoEntropyEstimator class,
which is part of the im.measures.entropy module.
- class infomeasure.estimators.entropy.kozachenko_leonenko.KozachenkoLeonenkoEntropyEstimator(*data, k: int = 4, ksg_id: int = 1, noise_level=1e-10, minkowski_p=inf, base: int | float | str = 'e')[source]
Bases:
RandomGeneratorMixin,EntropyEstimatorKozachenko-Leonenko entropy estimator for continuous data.
The Kozachenko-Leonenko estimator computes the Shannon entropy of continuous data using nearest neighbor distances. The estimator is based on the method from [KL87] and follows the implementation approach described in [KSG11].
\[\hat{H}_{KL} = -\psi(k) + \psi(N) + \log(c_d) + \frac{d}{N} \sum_{i=1}^{N} \log(2\rho_{k,i})\]where \(\psi\) is the digamma function, \(k\) is the number of nearest neighbors, \(N\) is the number of data points, \(d\) is the dimensionality, \(c_d\) is the volume of the \(d\)-dimensional unit ball for the chosen Minkowski norm, and \(\rho_{k,i}\) is the distance to the \(k\)-th nearest neighbor of point \(i\).
This estimator is particularly suitable for continuous multivariate data and provides asymptotically unbiased estimates of differential entropy. The method works by exploiting the relationship between nearest neighbor distances and local density, making it effective for high-dimensional data where traditional histogram-based methods fail.
- Parameters:
- *dataarray_like
The continuous data used to estimate the entropy. For multivariate data, each variable should be a column.
- k
int, default=4 The number of nearest neighbors to consider. Higher values provide more stable estimates but may introduce bias. The default value of 4 is recommended by [KSG11].
- noise_level
float, default=1e-10 The standard deviation of Gaussian noise added to the data to avoid issues with zero distances between identical points. Set to 0 to disable noise addition.
- minkowski_p
float, default=inf The power parameter for the Minkowski metric used in distance calculations. Common values are 2 (Euclidean distance) and inf (maximum norm/Chebyshev distance). Must satisfy \(1 \leq p \leq \infty\).
- ksg_id
int, default=1 The KSG estimator variant to use (1 or 2). Type I uses the standard formula. Type II uses a modified formula with \(\psi(k) - 1/k\).
- base
LogBaseType, default=Config.get(“base”) The logarithm base for entropy calculation. Can be 2, 10, “e”, or any positive number.
- Attributes:
- *data
tuple[array_like] The processed data used to estimate the entropy, converted to 2D arrays.
- k
int The number of nearest neighbors to consider.
- noise_level
float The standard deviation of the Gaussian noise added to the data.
- minkowski_p
float The power parameter for the Minkowski metric.
- ksg_id
int The KSG estimator variant to use.
- *data
- Raises:
ValueErrorIf the number of nearest neighbors is not a positive integer.
ValueErrorIf the noise level is negative.
ValueErrorIf the Minkowski power parameter is invalid (not in range [1, ∞]).
Notes
The choice of the number of nearest neighbors \(k\) affects the bias-variance tradeoff of the estimator. Smaller values of \(k\) reduce bias but increase variance, while larger values have the opposite effect. The default value of \(k=4\) provides a good balance for most applications.
The noise addition helps handle datasets with repeated values or points that are exactly identical, which would otherwise result in zero distances and numerical issues. The noise level should be small enough not to significantly alter the underlying distribution.
For high-dimensional data, the curse of dimensionality may affect the estimator’s performance, as nearest neighbor distances become less informative. In such cases, dimensionality reduction or alternative entropy estimation methods may be preferable.
Examples
>>> import numpy as np >>> import infomeasure as im >>> >>> # Generate 2D Gaussian data >>> np.random.seed(176250) >>> data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 1000) >>> >>> # Estimate entropy >>> estimator = im.estimator(data, measure="h", approach="kl", k=4) >>> entropy_value = estimator.result() >>> print(f"Estimated entropy: {entropy_value:.3f}") Estimated entropy: 2.678 >>> print(f"Local values: {estimator.local_vals()}") Local values: [ 3.15330798 2.02688591 2.52250064 2.95236651 3.58801879 1.42033673 ... 2.91254223 1.92823136 3.63647704 2.05589055] >>> # Use different distance metric >>> estimator_euclidean = KozachenkoLeonenkoEntropyEstimator(data, k=4, minkowski_p=2) >>> entropy_euclidean = estimator_euclidean.entropy() np.float64(2.6772465397252208)