Kernel Entropy Estimation

Kernel Entropy Estimation#

The Shannon differential entropy formula for a continuous random variable \( X \) with density \( p(x)\) is given as [Sha48]:

\[ H(X) = -\int_{X} p(x) \log p(x) \, dx, \]

where \(p(x)\) is the probability density function (pdf).

See also

For a visual and interactive explanation of kernel density estimation (KDE), see the chapter Kernel Density Estimation of the Notes for Nonparametric Statistics of the MSc course Statistics for Data Science at Carlos III University of Madrid [GP25].

Kernel entropy estimation relies on probability density function (pdf) estimates obtained via kernel density estimation (KDE) to approximate the required probability in the above formula. Density estimation involves constructing an estimate of the pdf from the available dataset. KDE estimates density at a reference point by weighting all samples based on their distance from it, using a kernel function \((K)\) [Sil86]. Nearby points contribute more to the estimate, while distant points contribute less. The KDE estimate at a point \(x_n\) is given by:

\[ \hat{p}_r(x_n) = \frac{1}{N r^d} \sum_{n'=1}^{N} K \left( \frac{x_n - x_{n'}}{r} \right), \]

where

  • \(N\) is the number of data points,

  • \(r\) is the bandwidth or kernel radius,

  • \(d\) is the dimension of the data,

  • \(x_n\) and \(x_{n'}\) are the data points,

  • \(\hat{p}_r(x_n)\) is the estimated probability density for each data point. For multivariate kernel functions, the pdf is estimated by dividing by a factor of \(r^d\), where \(d\) is the number of dimensions. The estimated pdf is then used to compute the Shannon entropy.

This package supports two types of kernel functions:

  1. Box Kernel (Step Kernel):

    \[\begin{split} K = \begin{cases} 0 & \text{if } |u| \geq 1 \\ 1 & \text{otherwise} \end{cases} \end{split}\]

    where \(\hat{p}_r(x_n)\) is computed as the fraction of \(N\) points within a distance \(r\) from \(x_n\). In higher dimensions, the distance is calculated with the \(L_\infty\) norm. From the rectangular shape, the kernel gets its name.

  2. Gaussian Kernel:

    \[ K(r) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}r^2}, \]

    providing a smooth decline in weight with increasing distance from \(x_n\).

Tip

Kernel estimation is model-free but depends on the Kernel-width parameter \((r)\). A small \((r)\) can lead to under-sampling, while a large \((r)\) may over-smooth the data, obscuring details.

For demonstration, we generate a dataset of normally distributed values with mean \(0\) and standard deviation \(1\). We then calculate the entropy using the box kernel with a bandwidth of \(0.5\). The analytical expected values can be calculated with

\[ H(X) = \frac{1}{2} \log(2\pi e \sigma^2), \]

where \(\sigma^2\) is the variance of the data.

import infomeasure as im
import numpy as np
rng = np.random.default_rng(692475)

std = 1.0
data = rng.normal(loc=0, scale=std, size=2000)

h = im.entropy(data, approach="kernel", kernel="box", bandwidth=0.5)
h_expected = (1 / 2) * np.log(2 * np.pi * np.e * std ** 2)
h, h_expected
(np.float64(1.407495141636088), np.float64(1.4189385332046727))

Comparing the gaussian kernel:

im.entropy(data, approach="kernel", kernel="gaussian", bandwidth=0.5), h_expected
(np.float64(1.4230543908497726), np.float64(1.4189385332046727))

To access the local values, an estimator instance is needed.

est = im.estimator(data, measure="h", approach="kernel", kernel="box", bandwidth=0.5)
est.result(), est.local_vals()
(np.float64(1.407495141636088),
 array([1.2765435 , 0.92634107, 0.95972029, ..., 1.43969514, 1.88387476,
        1.2039728 ], shape=(2000,)))

For a 2D point cloud, it is as easy to calculate the entropy

im.entropy(
    rng.normal(loc=0, scale=1, size=(2000, 2)),
    approach="kernel", kernel="box", bandwidth=0.5
)
np.float64(2.805821757920304)

or the local values, as shown before.

The estimator is implemented in the KernelEntropyEstimator class, which is part of the im.measures.entropy module.

class infomeasure.estimators.entropy.kernel.KernelEntropyEstimator(*data, bandwidth: float | int, kernel: str, workers: int = 1, base: int | float | str = 'e')[source]

Bases: WorkersMixin, EntropyEstimator

Kernel entropy estimator for continuous data using Kernel Density Estimation (KDE).

The kernel entropy estimator computes the differential Shannon entropy by estimating the probability density function using kernel density estimation:

\[\hat{H}(X) = -\int \hat{f}(x) \log \hat{f}(x) \, dx \approx -\frac{1}{N} \sum_{i=1}^{N} \log \hat{f}(x_i)\]

where \(\hat{f}(x)\) is the kernel density estimate:

\[\hat{f}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)\]

with \(K(\cdot)\) being the kernel function, \(h\) the bandwidth parameter, \(d\) the dimensionality, and \(N\) the number of data points.

For joint entropy of multiple variables, the estimator concatenates the variables into a single multivariate space and applies the same KDE approach.

The estimator supports both Gaussian and box (uniform) kernels. The choice of bandwidth is critical: small values can lead to under-smoothing and overfitting, while large values may over-smooth the data and obscure important features [GP25, Sil86].

Parameters:
*dataarray_like

The continuous data used to estimate the entropy. For univariate entropy, pass a single array. For joint entropy, pass multiple arrays.

bandwidthfloat | int

The bandwidth parameter for the kernel. Controls the smoothness of the density estimate.

kernelstr

Type of kernel to use. Supported options are:

  • 'gaussian': Gaussian (normal) kernel

  • 'box': Box (uniform) kernel

Compatible with the KDE implementation kde_probability_density_function().

workersint, optional

Number of workers to use for parallel processing. Default is 1 (no parallelization). If set to -1, all available CPU cores will be used.

basefloat | str, optional

Logarithm base for entropy calculation. Default is from global configuration.

Attributes:
*dataarray_like

The data used to estimate the entropy.

bandwidthfloat | int

The bandwidth for the kernel.

kernelstr

Type of kernel to use.

workersint

Number of workers to use for parallel processing.

Returns:
array_like

Local entropy values for each data point when calling entropy calculation methods. The mean of these values gives the overall entropy estimate.

Notes

Bandwidth Selection: The bandwidth parameter critically affects the quality of the entropy estimate. A small bandwidth can lead to under-sampling and high variance, while a large bandwidth may over-smooth the data, obscuring important details and introducing bias.

Kernel Choice:

  • Gaussian kernels provide smooth density estimates and are theoretically well-founded

  • Box kernels are computationally efficient and provide non-parametric estimates

Computational Complexity: The algorithm has O(N²) complexity for box kernels using KDTree queries, and varies for Gaussian kernels depending on the implementation.

Cross-entropy: Supported between two distributions by evaluating the density of the second distribution at points from the first distribution.

Examples

>>> import infomeasure as im
>>> from numpy.random import default_rng
>>> rng = default_rng(281769)
>>> # Generate sample data
>>> data = rng.normal(0, 1, 1000)
>>>
>>> # Create estimator
>>> estimator = im.estimator(data, measure="h", approach="kernel", bandwidth=0.5, kernel='gaussian')
>>>
>>> # Calculate entropy
>>> estimator.result()
np.float64(1.366015332652949)
>>> # Local values
>>> estimator.local_vals()
array([1.54017083, 1.35855839, 0.97949819, 0.97333173, 2.62084886,
   ...
   1.08174049, 0.97418054, 1.88055967, 0.99614516, 0.98548583])