ChaoWangJostEntropyEstimator

class infomeasure.estimators.entropy.ChaoWangJostEntropyEstimator(*args, **kwargs)

Bases: DiscreteHEstimator

Advanced bias-corrected Shannon entropy estimator using coverage estimation.

The Chao-Wang-Jost estimator corrects the systematic underestimation of Shannon entropy that occurs when a finite sample does not capture every possible outcome. Standard maximum likelihood estimators assign no probability to unobserved outcomes, so they underestimate diversity, especially when sampling is incomplete.

The estimator uses the counts of rare outcomes, namely singletons (observed exactly once) and doubletons (observed exactly twice), to estimate sample coverage and to correct for unobserved species. The approach builds on species accumulation curves and Good-Turing frequency estimation.

It is most useful for small or incomplete samples of a discrete distribution, for example ecological diversity studies with incomplete species sampling, or text analysis where the vocabulary is only partly observed.

Mathematical Foundation:

The estimator combines observed entropy with a correction term based on coverage estimation:

\[\hat{H}_{\text{CWJ}} = \sum_{1 \leq n_i \leq N-1} \frac{n_i}{N} \left(\sum_{k=n_i}^{N-1} \frac{1}{k} \right) + \frac{f_1}{N} (1 - A)^{-N + 1} \left\{ - \log(A) - \sum_{r=1}^{N-1} \frac{1}{r} (1 - A)^r \right\}\]

where the coverage parameter \(A\) is estimated as:

\[\begin{split}A = \begin{cases} \frac{2 f_2}{(N-1) f_1 + 2 f_2} \, & \text{if} \, f_2 > 0 \\ \frac{2}{(N-1)(f_1 - 1) + 2} \, & \text{if} \, f_2 = 0, \; f_1 \neq 0 \\ 1, & \text{if} \, f_1 = f_2 = 0 \end{cases}\end{split}\]

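Here, \(N\) is the total number of observations, \(n_i\) is the number of times the \(i\)-th species was observed, \(f_1\) is the number of singletons (species observed exactly once), and \(f_2\) is the number of doubletons (species observed exactly twice) in the sample [CWJ13].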
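The two formulas above can be traced step by step in plain Python. The following is a minimal sketch for illustration only, not the library's implementation: the function names `coverage_parameter` and `chao_wang_jost_entropy` are hypothetical, the result is in nats, and the input is assumed to be a sequence of raw observations (one label per observation).

```python
import math
from collections import Counter


def coverage_parameter(f1, f2, n):
    """Piecewise definition of A, following the three cases above."""
    if f2 > 0:
        return 2 * f2 / ((n - 1) * f1 + 2 * f2)
    if f1 > 0:
        return 2 / ((n - 1) * (f1 - 1) + 2)
    return 1.0


def chao_wang_jost_entropy(observations):
    """Chao-Wang-Jost entropy estimate in nats (illustrative sketch)."""
    counts = list(Counter(observations).values())  # n_i for each observed species
    n = sum(counts)                                # N, total number of observations
    f1 = sum(1 for c in counts if c == 1)          # singletons
    f2 = sum(1 for c in counts if c == 2)          # doubletons

    # harmonic[k] = 1/1 + ... + 1/k, so that
    # sum_{k=a}^{N-1} 1/k = harmonic[N-1] - harmonic[a-1]
    harmonic = [0.0]
    for k in range(1, n):
        harmonic.append(harmonic[-1] + 1.0 / k)

    # First term: sum over species with 1 <= n_i <= N-1
    first = sum(
        c / n * (harmonic[n - 1] - harmonic[c - 1])
        for c in counts
        if 1 <= c <= n - 1
    )

    a = coverage_parameter(f1, f2, n)
    # The braced factor is the tail of the series -log(A) = sum_r (1-A)^r / r,
    # so the correction tends to 0 as A -> 1. Skipping it here also avoids
    # evaluating 0 ** (negative power).
    if f1 == 0 or a >= 1.0:
        return first

    q = 1.0 - a
    tail = -math.log(a) - sum(q**r / r for r in range(1, n))
    second = f1 / n * q ** (1 - n) * tail
    return first + second


h_nats = chao_wang_jost_entropy([1, 1, 2, 3, 4, 5])
print(f"{h_nats / math.log(2):.3f} bits")  # 3.635 bits
```

For the sample `[1, 1, 2, 3, 4, 5]` this gives \(N = 6\), \(f_1 = 4\), \(f_2 = 1\) and \(A = 2 / 22\). The direct evaluation of \((1 - A)^{-N + 1}\) is kept for readability; for large \(N\) a more careful numerical treatment may be needed.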

Attributes:

*data (array_like): The data used to estimate the entropy.

See also

infomeasure.estimators.functional.entropy: Functional interface for entropy calculation
infomeasure.estimators.entropy.discrete.DiscreteEntropyEstimator: Standard maximum likelihood entropy estimator

Notes

The algorithm is adapted from the entropart R library [MH15].
The correction term is scaled by \(f_1 / N\), so it contributes nothing when the sample contains no singletons, as is the case for complete samples (\(f_1 = f_2 = 0\)).

Examples

>>> import infomeasure as im
>>>
>>> # Basic usage with incomplete sampling scenario
>>> data = [1, 1, 2, 3, 4, 5]  # Many singletons suggest incomplete sampling
>>> h_cwj = im.entropy(data, approach="chao_wang_jost", base=2)
>>> h_standard = im.entropy(data, approach="discrete", base=2)
>>> print(f"Chao-Wang-Jost: {h_cwj:.3f} bits")
Chao-Wang-Jost: 3.635 bits
>>> print(f"Standard: {h_standard:.3f} bits")
Standard: 2.252 bits
>>>
>>> # Ecological diversity example
>>> observations = [1, 1, 1, 2, 2, 3, 5, 8]  # One species label per observed individual, not abundance counts
>>> diversity = im.entropy(observations, approach="cwj", base="e")
>>> print(f"Species diversity: {diversity:.3f} nats")
Species diversity: 2.054 nats