KernelEntropyEstimator
- class infomeasure.estimators.entropy.KernelEntropyEstimator(*data, bandwidth: float | int, kernel: str, workers: int = 1, base: int | float | str = 'e')
Bases: WorkersMixin, EntropyEstimator
Kernel entropy estimator for continuous data using Kernel Density Estimation (KDE).
The kernel entropy estimator computes the differential Shannon entropy by estimating the probability density function using kernel density estimation:
\[\hat{H}(X) = -\int \hat{f}(x) \log \hat{f}(x) \, dx \approx -\frac{1}{N} \sum_{i=1}^{N} \log \hat{f}(x_i)\]
where \(\hat{f}(x)\) is the kernel density estimate:
\[\hat{f}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)\]
with \(K(\cdot)\) being the kernel function, \(h\) the bandwidth parameter, \(d\) the dimensionality, and \(N\) the number of data points.
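The two formulas above can be sketched in plain Python for the one-dimensional Gaussian case (d = 1). This is an illustration only, not the library's implementation: the helper names `gaussian_kernel`, `kde_density`, and `kde_entropy` are hypothetical, and letting each point contribute to its own density estimate is an assumption.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_density(x, samples, h):
    """Kernel density estimate f_hat(x) for 1-D samples (d = 1)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


def kde_entropy(samples, h):
    """Return (entropy estimate in nats, local values -log f_hat(x_i))."""
    local = [-math.log(kde_density(x, samples, h)) for x in samples]
    return sum(local) / len(local), local


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
h_est, local_vals = kde_entropy(samples, h=0.5)
# For comparison, the analytic entropy of N(0, 1) is about 1.419 nats.
print(h_est)
```

The estimate is the mean of the local values, so it depends on the bandwidth through every evaluation of the density.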
For joint entropy of multiple variables, the estimator concatenates the variables into a single multivariate space and applies the same KDE approach.
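As a rough sketch of the concatenation step, the variables can be zipped into d-dimensional points and passed through the same formula with the \(h^d\) normalization. The function name `joint_kde_entropy` is hypothetical, and the isotropic Gaussian kernel with a single shared bandwidth is an assumption made for the example.

```python
import math
import random


def joint_kde_entropy(variables, h):
    """Gaussian-KDE entropy (nats) of the joint space spanned by `variables`."""
    points = list(zip(*variables))  # N points, each of dimension d
    n, d = len(points), len(points[0])
    # Normalization N * h^d times the Gaussian constant (2*pi)^(d/2).
    norm = n * (h ** d) * (2.0 * math.pi) ** (d / 2.0)
    total = 0.0
    for p in points:
        kernel_sum = sum(
            math.exp(-0.5 * sum(((a - b) / h) ** 2 for a, b in zip(p, q)))
            for q in points
        )
        total -= math.log(kernel_sum / norm)
    return total / n


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [rng.gauss(0.0, 1.0) for _ in range(300)]
h_x = joint_kde_entropy([x], h=0.5)      # single variable, d = 1
h_xy = joint_kde_entropy([x, y], h=0.5)  # two variables, d = 2
print(h_x, h_xy)
```

For independent variables the joint estimate should come out near the sum of the marginal estimates, up to estimation error.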
The estimator supports both Gaussian and box (uniform) kernels. The choice of bandwidth is critical: small values can lead to under-smoothing and overfitting, while large values may over-smooth the data and obscure important features [GP25, Sil86].
- Parameters:
- *data
array_like The continuous data used to estimate the entropy. For univariate entropy, pass a single array. For joint entropy, pass multiple arrays.
- bandwidth
float|int The bandwidth parameter for the kernel. Controls the smoothness of the density estimate.
- kernel
str Type of kernel to use. Supported options are:
'gaussian': Gaussian (normal) kernel
'box': Box (uniform) kernel
Compatible with the KDE implementation kde_probability_density_function().
- workers
int,optional Number of workers to use for parallel processing. Default is 1 (no parallelization). If set to -1, all available CPU cores will be used.
- base
int|float|str,optional Logarithm base for entropy calculation. Default is from global configuration.
- Attributes:
- *data
array_like The data used to estimate the entropy.
- bandwidth
float|int The bandwidth for the kernel.
- kernel
str Type of kernel to use.
- workers
int Number of workers to use for parallel processing.
- Returns:
- array_like
Local entropy values for each data point when calling entropy calculation methods. The mean of these values gives the overall entropy estimate.
See also
infomeasure.estimators.utils.kde.kde_probability_density_function: Underlying KDE implementation
infomeasure.estimators.entropy.discrete.DiscreteEntropyEstimator: For discrete data entropy estimation
Notes
Bandwidth Selection: The bandwidth parameter critically affects the quality of the entropy estimate. A small bandwidth can lead to under-smoothing and high variance, while a large bandwidth may over-smooth the data, obscuring important details and introducing bias.
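Because the bandwidth must be supplied explicitly, a rule of thumb can serve as a starting value. The sketch below uses Silverman's rule for univariate data with a Gaussian kernel, \(h = 0.9 \min(\sigma, \mathrm{IQR}/1.34)\, N^{-1/5}\). The helper `silverman_bandwidth` is hypothetical and not part of infomeasure, and the rule assumes roughly unimodal data, so treat the result as a first guess rather than a recommended setting.

```python
import random
import statistics


def silverman_bandwidth(samples):
    """Silverman's rule-of-thumb bandwidth for 1-D data and a Gaussian kernel."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    spread = min(sigma, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)


rng = random.Random(2)
small = [rng.gauss(0.0, 1.0) for _ in range(100)]
large = [rng.gauss(0.0, 1.0) for _ in range(10000)]
h_small = silverman_bandwidth(small)
h_large = silverman_bandwidth(large)
print(h_small, h_large)  # the suggested bandwidth shrinks as N grows
```

The resulting float could then be passed as the `bandwidth` argument.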
Kernel Choice:
- Gaussian kernels provide smooth density estimates and are theoretically well-founded.
- Box kernels are computationally efficient and provide non-parametric estimates.
Computational Complexity: The algorithm has O(N²) complexity for box kernels using KDTree queries, and varies for Gaussian kernels depending on the implementation.
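With a box kernel, the density at a point reduces to counting neighbours inside a window, which is why range queries suit it. The sketch below does the counting by brute force in one dimension. The function names are hypothetical, and taking the kernel to be uniform on \([-1/2, 1/2]\) (a window of total width \(h\)) is an assumption; the library may define the support differently.

```python
import math
import random


def box_kde_density(x, samples, h):
    """Box-kernel density estimate in 1-D: neighbours within h/2, scaled by N*h."""
    count = sum(1 for xi in samples if abs(x - xi) <= h / 2.0)
    return count / (len(samples) * h)


def box_kde_entropy(samples, h):
    """Mean of -log f_hat(x_i); each x_i counts itself, so f_hat(x_i) > 0."""
    total = sum(math.log(box_kde_density(x, samples, h)) for x in samples)
    return -total / len(samples)


rng = random.Random(3)
samples = [rng.random() for _ in range(1000)]  # uniform on [0, 1), true entropy 0
density_mid = box_kde_density(0.5, samples, h=0.2)
entropy_est = box_kde_entropy(samples, h=0.2)
print(density_mid, entropy_est)
```

Each call to `box_kde_density` scans all samples, so evaluating it at every point costs N² comparisons; a tree-based range query would replace that inner scan.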
Cross-entropy: Supported between two distributions by evaluating the density of the second distribution at points from the first distribution.
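The cross-entropy description can be illustrated with a small sketch: build the density estimate from samples of the second distribution and average \(-\log \hat{q}(x)\) over samples of the first. The names `gaussian_kde_density` and `kde_cross_entropy` are hypothetical and the one-dimensional Gaussian kernel is an assumption; this does not show how the library exposes cross-entropy.

```python
import math
import random


def gaussian_kde_density(x, samples, h):
    """1-D Gaussian kernel density estimate built from `samples`."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm


def kde_cross_entropy(p_samples, q_samples, h):
    """Estimate H(P, Q) in nats: mean of -log q_hat(x) over x drawn from P."""
    total = sum(math.log(gaussian_kde_density(x, q_samples, h)) for x in p_samples)
    return -total / len(p_samples)


rng = random.Random(4)
p = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_shifted = [rng.gauss(3.0, 1.0) for _ in range(400)]
ce_same = kde_cross_entropy(p, q_same, h=0.5)
ce_shifted = kde_cross_entropy(p, q_shifted, h=0.5)
print(ce_same, ce_shifted)  # the mismatched pair gives the larger value
```

When both sample sets come from the same distribution, the result should be close to the entropy estimate itself.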
Examples
>>> import infomeasure as im
>>> from numpy.random import default_rng
>>> rng = default_rng(281769)
>>> # Generate sample data
>>> data = rng.normal(0, 1, 1000)
>>>
>>> # Create estimator
>>> estimator = im.estimator(data, measure="h", approach="kernel", bandwidth=0.5, kernel='gaussian')
>>>
>>> # Calculate entropy
>>> estimator.result()
np.float64(1.366015332652949)
>>> # Local values
>>> estimator.local_vals()
array([1.54017083, 1.35855839, 0.97949819, 0.97333173, 2.62084886,
       ...
       1.08174049, 0.97418054, 1.88055967, 0.99614516, 0.98548583])