KozachenkoLeonenkoEntropyEstimator#
- class infomeasure.estimators.entropy.KozachenkoLeonenkoEntropyEstimator(*data, k: int = 4, ksg_id: int = 1, noise_level=1e-10, minkowski_p=inf, base: int | float | str = 'e')[source]
Bases:
RandomGeneratorMixin,EntropyEstimatorKozachenko-Leonenko entropy estimator for continuous data.
The Kozachenko-Leonenko estimator computes the Shannon entropy of continuous data using nearest neighbor distances. The estimator is based on the method from [KL87] and follows the implementation approach described in [KSG11].
\[\hat{H}_{KL} = -\psi(k) + \psi(N) + \log(c_d) + \frac{d}{N} \sum_{i=1}^{N} \log(2\rho_{k,i})\]where \(\psi\) is the digamma function, \(k\) is the number of nearest neighbors, \(N\) is the number of data points, \(d\) is the dimensionality, \(c_d\) is the volume of the \(d\)-dimensional unit ball for the chosen Minkowski norm, and \(\rho_{k,i}\) is the distance to the \(k\)-th nearest neighbor of point \(i\).
This estimator is particularly suitable for continuous multivariate data and provides asymptotically unbiased estimates of differential entropy. The method works by exploiting the relationship between nearest neighbor distances and local density, making it effective for high-dimensional data where traditional histogram-based methods fail.
- Parameters:
- *dataarray_like
The continuous data used to estimate the entropy. For multivariate data, each variable should be a column.
- k
int, default=4 The number of nearest neighbors to consider. Higher values provide more stable estimates but may introduce bias. The default value of 4 is recommended by [KSG11].
- noise_level
float, default=1e-10 The standard deviation of Gaussian noise added to the data to avoid issues with zero distances between identical points. Set to 0 to disable noise addition.
- minkowski_p
float, default=inf The power parameter for the Minkowski metric used in distance calculations. Common values are 2 (Euclidean distance) and inf (maximum norm/Chebyshev distance). Must satisfy \(1 \leq p \leq \infty\).
- ksg_id
int, default=1 The KSG estimator variant to use (1 or 2). Type I uses the standard formula. Type II uses a modified formula with \(\psi(k) - 1/k\).
- base
LogBaseType, default=Config.get(“base”) The logarithm base for entropy calculation. Can be 2, 10, “e”, or any positive number.
- Attributes:
- *data
tuple[array_like] The processed data used to estimate the entropy, converted to 2D arrays.
- k
int The number of nearest neighbors to consider.
- noise_level
float The standard deviation of the Gaussian noise added to the data.
- minkowski_p
float The power parameter for the Minkowski metric.
- ksg_id
int The KSG estimator variant to use.
- *data
- Raises:
ValueErrorIf the number of nearest neighbors is not a positive integer.
ValueErrorIf the noise level is negative.
ValueErrorIf the Minkowski power parameter is invalid (not in range [1, ∞]).
Notes
The choice of the number of nearest neighbors \(k\) affects the bias-variance tradeoff of the estimator. Smaller values of \(k\) reduce bias but increase variance, while larger values have the opposite effect. The default value of \(k=4\) provides a good balance for most applications.
The noise addition helps handle datasets with repeated values or points that are exactly identical, which would otherwise result in zero distances and numerical issues. The noise level should be small enough not to significantly alter the underlying distribution.
For high-dimensional data, the curse of dimensionality may affect the estimator’s performance, as nearest neighbor distances become less informative. In such cases, dimensionality reduction or alternative entropy estimation methods may be preferable.
Examples
>>> import numpy as np >>> import infomeasure as im >>> >>> # Generate 2D Gaussian data >>> np.random.seed(176250) >>> data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 1000) >>> >>> # Estimate entropy >>> estimator = im.estimator(data, measure="h", approach="kl", k=4) >>> entropy_value = estimator.result() >>> print(f"Estimated entropy: {entropy_value:.3f}") Estimated entropy: 2.678 >>> print(f"Local values: {estimator.local_vals()}") Local values: [ 3.15330798 2.02688591 2.52250064 2.95236651 3.58801879 1.42033673 ... 2.91254223 1.92823136 3.63647704 2.05589055] >>> # Use different distance metric >>> estimator_euclidean = KozachenkoLeonenkoEntropyEstimator(data, k=4, minkowski_p=2) >>> entropy_euclidean = estimator_euclidean.entropy() np.float64(2.6772465397252208)