KernelEntropyEstimator


class infomeasure.estimators.entropy.KernelEntropyEstimator(*data, bandwidth: float | int, kernel: str, workers: int = 1, base: int | float | str = 'e')

Bases: WorkersMixin, EntropyEstimator

Kernel entropy estimator for continuous data using Kernel Density Estimation (KDE).

The kernel entropy estimator computes the differential Shannon entropy by estimating the probability density function using kernel density estimation:

\[\hat{H}(X) = -\int \hat{f}(x) \log \hat{f}(x) \, dx \approx -\frac{1}{N} \sum_{i=1}^{N} \log \hat{f}(x_i)\]

where \(\hat{f}(x)\) is the kernel density estimate:

\[\hat{f}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)\]

with \(K(\cdot)\) being the kernel function, \(h\) the bandwidth parameter, \(d\) the dimensionality, and \(N\) the number of data points.
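The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the function name `gaussian_kde_entropy` is hypothetical, a standard Gaussian kernel is assumed, and each sample is included in its own density estimate.

```python
import numpy as np

def gaussian_kde_entropy(x, h):
    """Resubstitution entropy estimate in nats; returns (mean, local values)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as N points with d = 1
    n, d = x.shape
    # Scaled pairwise differences u_ij = (x_i - x_j) / h
    u = (x[:, None, :] - x[None, :, :]) / h
    # Standard Gaussian kernel K(u) = (2*pi)^(-d/2) * exp(-|u|^2 / 2)
    k = np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    f_hat = k.sum(axis=1) / (n * h**d)  # density estimate at every sample
    local = -np.log(f_hat)              # one local entropy value per sample
    return local.mean(), local

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 500)
h_est, local_vals = gaussian_kde_entropy(sample, h=0.5)
print(h_est)  # compare with the analytic value 0.5 * log(2 * pi * e) for N(0, 1)
```

The pairwise array makes this sketch quadratic in memory, so it is only suitable for small samples.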

For joint entropy of multiple variables, the estimator concatenates the variables into a single multivariate space and applies the same KDE approach.
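As a rough illustration of the concatenation step, the sketch below stacks two 1-D variables into an (N, 2) array and applies one KDE to the joint points. The helper `kde_entropy` is hypothetical and assumes a Gaussian kernel; for independent variables the joint estimate should land close to the sum of the marginal estimates.

```python
import numpy as np

def kde_entropy(points, h):
    """Gaussian-kernel resubstitution entropy (nats) for an (N, d) array."""
    n, d = points.shape
    u = (points[:, None, :] - points[None, :, :]) / h
    k = np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    return float(np.mean(-np.log(k.sum(axis=1) / (n * h**d))))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = rng.normal(size=500)  # drawn independently of x

joint = np.column_stack([x, y])  # shape (500, 2): one row per joint observation
h_xy = kde_entropy(joint, h=0.5)
h_x = kde_entropy(x[:, None], h=0.5)
h_y = kde_entropy(y[:, None], h=0.5)
print(h_xy, h_x + h_y)
```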

The estimator supports both Gaussian and box (uniform) kernels. The choice of bandwidth is critical: small values can lead to under-smoothing and overfitting, while large values may over-smooth the data and obscure important features [GP25, Sil86].
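The bandwidth sensitivity can be seen by running one estimator on the same data with very different bandwidths. The sketch below assumes a Gaussian kernel and a hypothetical helper `kde_entropy_1d`. With a tiny bandwidth each sample mostly sees its own kernel, so the density at the sample points is inflated and the estimate is pulled down; with a very large bandwidth the density is flattened and the estimate is pushed up.

```python
import numpy as np

def kde_entropy_1d(x, h):
    """Gaussian-kernel resubstitution entropy (nats) for a 1-D sample."""
    u = (x[:, None] - x[None, :]) / h
    f_hat = np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))
    return float(np.mean(-np.log(f_hat)))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 500)

# Same data, three bandwidths: far too small, moderate, far too large
estimates = {h: kde_entropy_1d(x, h) for h in (0.001, 0.5, 10.0)}
for h, value in estimates.items():
    print(f"h = {h:>6}: H = {value:.3f} nats")
```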

Parameters:

*data : array_like

The continuous data used to estimate the entropy. For univariate entropy, pass a single array. For joint entropy, pass multiple arrays.

bandwidth : float | int

The bandwidth parameter for the kernel. Controls the smoothness of the density estimate.

kernel : str

Type of kernel to use. Supported options are:

'gaussian': Gaussian (normal) kernel
'box': Box (uniform) kernel

Compatible with the KDE implementation kde_probability_density_function().

workers : int, optional

Number of workers to use for parallel processing. Default is 1 (no parallelization). If set to -1, all available CPU cores will be used.

base : int | float | str, optional

Logarithm base for entropy calculation. Default is from global configuration.
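Changing the logarithm base only rescales the entropy value. The sketch below shows the conversion for numeric bases; `convert_entropy` is a hypothetical helper, not part of the library, and string-valued bases are not handled here.

```python
import math

def convert_entropy(value, from_base=math.e, to_base=2.0):
    """Convert an entropy value between logarithm bases."""
    return value * math.log(from_base) / math.log(to_base)

h_nats = 0.5 * math.log(2.0 * math.pi * math.e)  # analytic entropy of N(0, 1) in nats
h_bits = convert_entropy(h_nats, to_base=2.0)
h_hartleys = convert_entropy(h_nats, to_base=10.0)
print(h_nats, h_bits, h_hartleys)
```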

Attributes:

*data (array_like): The data used to estimate the entropy.
bandwidth (float | int): The bandwidth for the kernel.
kernel (str): Type of kernel to use.
workers (int): Number of workers to use for parallel processing.

Returns:

array_like: Local entropy values for each data point when calling entropy calculation methods. The mean of these values gives the overall entropy estimate.

See also

infomeasure.estimators.utils.kde.kde_probability_density_function: Underlying KDE implementation
infomeasure.estimators.entropy.discrete.DiscreteEntropyEstimator: For discrete data entropy estimation

Notes

Bandwidth Selection: The bandwidth parameter critically affects the quality of the entropy estimate. A small bandwidth under-smooths the density and gives a high-variance estimate, while a large bandwidth over-smooths it, obscuring important details and introducing bias.

Kernel Choice:

Gaussian kernels provide smooth density estimates and are theoretically well-founded
Box kernels are computationally efficient; the resulting density estimate is piecewise constant rather than smooth

Computational Complexity: The algorithm has O(N²) complexity for box kernels using KDTree queries, and varies for Gaussian kernels depending on the implementation.
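For intuition about the box-kernel cost, the sketch below counts neighbours by brute force, which makes the quadratic pairwise work explicit. It is not the library's KDTree-based code: the helper `box_kde_entropy` is hypothetical, and the convention that the box is a cube of side h centred on each point is an assumption.

```python
import numpy as np

def box_kde_entropy(points, h):
    """Box-kernel resubstitution entropy (nats) for an (N, d) array."""
    n, d = points.shape
    # A neighbour counts if it lies inside the cube of side h centred on x_i
    # (Chebyshev distance at most h / 2). This width convention is an assumption.
    cheb = np.abs(points[:, None, :] - points[None, :, :]).max(axis=-1)
    counts = (cheb <= h / 2.0).sum(axis=1)  # includes the point itself, so counts >= 1
    f_hat = counts / (n * h**d)
    return float(np.mean(-np.log(f_hat)))

rng = np.random.default_rng(3)
u = rng.uniform(0.0, 1.0, size=(1000, 1))
h_box = box_kde_entropy(u, h=0.2)  # the true differential entropy of U(0, 1) is 0
print(h_box)
```

A spatial index would replace the pairwise distance array with per-point range queries; the counting logic stays the same.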

Cross-entropy: Supported between two distributions by evaluating the density of the second distribution at points from the first distribution.
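The cross-entropy description can be sketched as follows, assuming 1-D data and a Gaussian kernel. The helper `kde_cross_entropy` is hypothetical and does not show the library's interface: the density is built from the second sample and evaluated at the points of the first.

```python
import numpy as np

def kde_cross_entropy(p_sample, q_sample, h):
    """Estimate H(P, Q) in nats from 1-D samples of P and Q with a Gaussian kernel."""
    # Density of Q, estimated from q_sample, evaluated at each point of p_sample
    u = (p_sample[:, None] - q_sample[None, :]) / h
    q_hat = np.exp(-0.5 * u**2).sum(axis=1) / (len(q_sample) * h * np.sqrt(2.0 * np.pi))
    return float(np.mean(-np.log(q_hat)))

rng = np.random.default_rng(4)
p = rng.normal(0.0, 1.0, 1000)
q = rng.normal(1.0, 1.0, 1000)
h_pq = kde_cross_entropy(p, q, h=0.5)
# Analytic value for these two Gaussians: 0.5 * log(2 * pi) + (1 + 1) / 2
h_true = 0.5 * np.log(2.0 * np.pi) + 1.0
print(h_pq, h_true)
```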

Examples

>>> import infomeasure as im
>>> from numpy.random import default_rng
>>> rng = default_rng(281769)
>>> # Generate sample data
>>> data = rng.normal(0, 1, 1000)
>>>
>>> # Create estimator
>>> estimator = im.estimator(data, measure="h", approach="kernel", bandwidth=0.5, kernel='gaussian')
>>>
>>> # Calculate entropy
>>> estimator.result()
np.float64(1.366015332652949)
>>> # Local values
>>> estimator.local_vals()
array([1.54017083, 1.35855839, 0.97949819, 0.97333173, 2.62084886,
   ...
   1.08174049, 0.97418054, 1.88055967, 0.99614516, 0.98548583])