infomeasure.estimators.utils package

Submodules

infomeasure.estimators.utils.array module

Array helpers for the base estimator.

infomeasure.estimators.utils.array.assure_2d_data(data) -> ndarray | tuple[ndarray, ...]

Assure the data is 2D.

This function adds a new axis to the data if it is 1D. For tuples and generators, each element is checked and reshaped if necessary.

Parameters:
data : array_like or tuple/Generator of array_like

The data to check.

Returns:
array_like or tuple of array_like

The reshaped data.

Raises:
ValueError

If the data is not supported for 2D conversion.

infomeasure.estimators.utils.discrete_interaction_information module

Functions for interaction information, a multivariate generalization of mutual information.

infomeasure.estimators.utils.discrete_interaction_information.conditional_mutual_information_global(*data: tuple, cond: ndarray, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> float

Estimate the global conditional mutual information between multiple random variables and a conditioning variable.

Parameters:
*data : array_like, shape (n_samples,)

The data used to estimate the global mutual information. You can pass an arbitrary number of data arrays as positional arguments.

cond : array_like, shape (n_samples,)

The conditioning variable.

log_func : callable(), optional

The logarithm function to use. Default is the natural logarithm.

miller_madow_correction : str | float | int, optional

If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value. log_func and miller_madow_correction should be the same base.

Returns:
float

The global conditional mutual information between the random variables.

Raises:
ValueError

If the conditioning variable is not one-dimensional.

Notes

To condition on several random variables jointly, join them into a single one-dimensional variable beforehand; this keeps the calculation general for an arbitrary number of data arrays. Discrete random variables can be joined with reduce_joint_space(). When using CMI and CTE, this happens automatically if a tuple of RVs is passed as cond.
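
The joining step described in this note can be illustrated with a minimal pure-Python sketch. The helper name join_discrete is hypothetical and not part of infomeasure; within the library, reduce_joint_space() is the documented way to do this.

```python
def join_discrete(*variables):
    """Map each distinct tuple of simultaneous values to one integer label.

    Hypothetical helper for illustration only, not the library implementation.
    """
    rows = list(zip(*variables))
    # Sort the distinct rows so that the labelling is deterministic.
    labels = {row: idx for idx, row in enumerate(sorted(set(rows)))}
    return [labels[row] for row in rows]


z1 = [0, 0, 1, 1, 0]
z2 = [1, 2, 1, 2, 1]
cond = join_discrete(z1, z2)
print(cond)  # [0, 1, 2, 3, 0]
```

The resulting one-dimensional sequence carries the same information as the pair (z1, z2) and can serve as a single conditioning variable.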

infomeasure.estimators.utils.discrete_interaction_information.conditional_mutual_information_local(*data: tuple, cond: ndarray, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> ndarray

Estimate the local conditional mutual information between multiple random variables and a conditioning variable.

The mean of the local conditional mutual information is the global conditional mutual information. Calculating only the global value is more efficient, so evaluate the local conditional mutual information only when it is explicitly needed.

Parameters:
*data : array_like, shape (n_samples,)

The data used to estimate the local mutual information. You can pass an arbitrary number of data arrays as positional arguments.

cond : array_like, shape (n_samples,)

The conditioning variable.

log_func : callable(), optional

The logarithm function to use. Default is the natural logarithm.

miller_madow_correction : str | float | int, optional

If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value. log_func and miller_madow_correction should be the same base.

Returns:
ndarray

The local conditional mutual information between the random variables.

infomeasure.estimators.utils.discrete_interaction_information.millermadow_mi_corr(k_i, k_joint, n, base, k_cond=None)

Computes the Miller-Madow mutual information correction term.

This function calculates a correction term that adjusts for the bias in mutual information estimates, which arises from the finite sample size. The correction is based on the marginal counts and joint count in the observed distributions.

Parameters:
k_i : list[int]

A list containing the marginal cardinalities of individual variables in the dataset. Each element represents the number of unique values for the respective variable.

k_joint : int

The cardinality of the joint distribution. Represents the number of unique observations across all combined dimensions.

n : int

The sample size, representing the total number of observations in the data.

base : str | float | int

The logarithmic base used for the mutual information computation. If set to “e”, natural logarithm is used. Otherwise, log of the specified base is used.

k_cond : int, optional

The cardinality of the conditional variable. When this is used, k_i can be used as k_iZ, to calculate the correction for conditional MI.

Returns:
float

The calculated Miller-Madow correction term to adjust the mutual information value.
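
As a rough illustration of such a correction, the sketch below handles the two-variable case only. It assumes the standard Miller-Madow entropy bias term (K - 1) / (2n) in nats, applied to each entropy in I(X; Y) = H(X) + H(Y) - H(X, Y). The function name and the exact formula are assumptions for illustration; the library's implementation may differ, in particular for more than two variables or with k_cond.

```python
from math import log


def miller_madow_mi_correction_2var(k_x, k_y, k_joint, n, base="e"):
    """Illustrative Miller-Madow correction for a two-variable plug-in MI estimate.

    Assumption: each plug-in entropy is biased by -(K - 1) / (2 n) nats,
    where K is the number of observed unique values.
    """
    correction = ((k_x - 1) + (k_y - 1) - (k_joint - 1)) / (2 * n)
    # Convert from nats to the requested logarithmic base.
    if base == "e":
        return correction
    return correction / log(base)


# Two binary variables, all four joint outcomes observed, 100 samples.
print(miller_madow_mi_correction_2var(2, 2, 4, 100))
```

Under these assumptions the term is added to the plug-in estimate. It is negative whenever the joint cardinality exceeds k_x + k_y - 1, which reflects the upward bias of plug-in mutual information.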

infomeasure.estimators.utils.discrete_interaction_information.mutual_information_global(*data: tuple, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> float

Estimate the global mutual information between multiple random variables.

Parameters:
*data : array_like, shape (n_samples,)

The data used to estimate the global mutual information. You can pass an arbitrary number of data arrays as positional arguments.

log_func : callable(), optional

The logarithm function to use. Default is the natural logarithm.

miller_madow_correction : str | float | int, optional

If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value. log_func and miller_madow_correction should be the same base.

Returns:
float

The global mutual information between the random variables.

infomeasure.estimators.utils.discrete_interaction_information.mutual_information_local(*data: tuple, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> ndarray

Estimate the local mutual information between multiple random variables.

The mean of the local mutual information is the global mutual information. Calculating only the global value is more efficient, so evaluate the local mutual information only when it is explicitly needed.

Parameters:
*data : array_like, shape (n_samples,)

The data used to estimate the local mutual information. You can pass an arbitrary number of data arrays as positional arguments.

log_func : callable(), optional

The logarithm function to use. Default is the natural logarithm.

miller_madow_correction : str | float | int, optional

If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value. log_func and miller_madow_correction should be the same base.

Returns:
ndarray

The local mutual information between the random variables.
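
The relation between local and global values can be checked with a small plug-in sketch for two discrete variables. The functions below are illustrative stand-ins written with the standard library only; they are not the library functions and take no correction options.

```python
from collections import Counter
from math import log


def local_mi(x, y):
    """Local (pointwise) plug-in MI in nats, one value per sample."""
    n = len(x)
    c_x, c_y, c_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(x, y) / (p(x) p(y)) expressed with counts: c_xy * n / (c_x * c_y)
    return [log(c_xy[(a, b)] * n / (c_x[a] * c_y[b])) for a, b in zip(x, y)]


def global_mi(x, y):
    """Global plug-in MI in nats, summed over the observed joint outcomes."""
    n = len(x)
    c_x, c_y, c_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log(c * n / (c_x[a] * c_y[b])) for (a, b), c in c_xy.items())


x = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 1, 0, 0]
local_values = local_mi(x, y)
print(sum(local_values) / len(local_values), global_mi(x, y))
```

Both printed numbers agree up to floating-point error. The global version iterates over unique outcomes rather than over all samples, which is why it is the cheaper of the two.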

infomeasure.estimators.utils.discrete_transfer_entropy module

Functions for efficient computation of discrete transfer entropy.

infomeasure.estimators.utils.discrete_transfer_entropy.combined_te_form(slice_method, *data, local: bool = False, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None, **slice_kwargs) -> float | ndarray

Calculate the Transfer Entropy using the combined TE formula.

Parameters:
slice_method : function

The slicing method to use for the symbolized data.

*data : array_like

The source, destination, and if applicable, conditional data.

local : bool, optional

Whether to calculate the local transfer entropy. If False, the global transfer entropy is calculated. Default is False.

log_func : callable(), optional

The logarithm function to use. Default is the natural logarithm.

miller_madow_correction : str | float | int, optional

If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value. log_func and miller_madow_correction should be the same base.

**slice_kwargs : dict

The history lengths for the source, destination, and if applicable, conditional data.

Returns:
float | ndarray

The transfer entropy value. A float for the global value, or an array of local values if local is True.

infomeasure.estimators.utils.exponential_family module

Helper functions for exponential family distributions.

Rényi entropy and Tsallis entropy are special cases of the more general family of exponential family distributions. This module provides helper functions for these distributions.

infomeasure.estimators.utils.exponential_family.calculate_common_entropy_components(data, k, at=None)

Calculate common components for entropy estimators.

Parameters:
data : array_like

The data used to estimate the entropy.

k : int

The number of nearest neighbors used in the estimation, not including the data point itself.

at : array_like, optional

The parameter at which to evaluate the entropy components.

Returns:
tuple

Volume of the unit ball, k-th nearest neighbor distances, number of data points, and dimensionality of the data.

Raises:
ValueError

If the parameter k is selected too large.

ValueError

If data and at have different dimensions.

infomeasure.estimators.utils.exponential_family.exponential_family_i1(k, V_m, rho_k, N, m, log_base_func)

Calculate the \(I_1\) of the exponential family distribution.

When \(q = 1\), the exponential family distribution reduces to the Shannon entropy.

Parameters:
k : int

The number of nearest neighbors used in the estimation.

V_m : float

Volume of the unit ball in m-dimensional space.

rho_k : array_like

The k-th nearest neighbor distances.

N : int

Number of data points considered for the distances (subtract 1 if the point itself is not considered).

m : int

Dimensionality of the data.

log_base_func : callable()

The logarithm function to use for the calculation with the chosen base.

Returns:
float

The \(I_1\) of the exponential family distribution.

infomeasure.estimators.utils.exponential_family.exponential_family_iq(k, q, V_m, rho_k, N, m)

Calculate the \(I_q\) of the exponential family distribution.

Parameters:
k : int

The number of nearest neighbors used in the estimation.

q : float | int

The Rényi or Tsallis parameter, order or exponent. Sometimes denoted as \(\alpha\) or \(q\). Should not be 1.

V_m : float

Volume of the unit ball in m-dimensional space.

rho_k : array_like

The k-th nearest neighbor distances.

N : int

Number of data points considered for the distances (subtract 1 if the point itself is not considered).

m : int

Dimensionality of the data.

Returns:
float

The \(I_q\) of the exponential family distribution.
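
To show how a quantity like \(I_q\) feeds into Rényi and Tsallis entropies, the sketch below uses a known discrete distribution, for which \(I_q = \sum_i p_i^q\). This is an assumption made for illustration: the functions in this module estimate the corresponding integral from nearest-neighbor distances instead, and the names used here are hypothetical.

```python
from math import log


def i_q(probabilities, q):
    """Sum of p^q over the support of a discrete distribution."""
    return sum(p ** q for p in probabilities if p > 0)


def renyi_entropy(probabilities, q):
    """Renyi entropy of order q != 1, in nats."""
    return log(i_q(probabilities, q)) / (1 - q)


def tsallis_entropy(probabilities, q):
    """Tsallis entropy with exponent q != 1."""
    return (1 - i_q(probabilities, q)) / (q - 1)


def shannon_entropy(probabilities):
    """Shannon entropy in nats, the q -> 1 limit of both measures."""
    return -sum(p * log(p) for p in probabilities if p > 0)


p = [0.5, 0.25, 0.25]
print(renyi_entropy(p, 2), tsallis_entropy(p, 2), shannon_entropy(p))
# Close to q = 1 the Renyi entropy approaches the Shannon entropy.
print(renyi_entropy(p, 1.000001))
```

Both entropies divide by a factor that vanishes at q = 1, which is why that case needs separate treatment.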

infomeasure.estimators.utils.kde module

Kernel Density Estimation (KDE) utilities.

infomeasure.estimators.utils.kde.gaussian_kernel_densities(data, bandwidth, at=None, workers=1, eigen_threshold: float = 1e-10)

Calculate the KDE for a Gaussian kernel.

In case of multivariate data, checks rank of data and reduces dimensions if eigenvalues are below threshold. If already full rank, does no reprojection.

Parameters:
data : ndarray, shape (d, N)

Data points to estimate density for.

bandwidth : float

Bandwidth parameter for kernel density estimation.

at : array, optional

A numpy array of points at which to evaluate the KDE. If None, the KDE is evaluated at the data points.

workers : int, optional

Number of workers to use for parallel processing. Default is 1.

eigen_threshold : float, optional

Threshold for eigenvalues to determine rank of data. Default is 1e-10.

Returns:
densities : ndarray, shape (n,)

Estimated density values at data points.

infomeasure.estimators.utils.kde.kde_probability_density_function(data, bandwidth, at=None, kernel='box', workers=-1)

Estimate the probability density function for a given data set using Kernel Density Estimation (KDE).

Parameters:
data : array

A numpy array of data points, where each column represents a dimension.

bandwidth : float

The bandwidth for the kernel.

at : array, optional

A numpy array of points at which to evaluate the KDE. If None, the KDE is evaluated at the data points.

kernel : str

Type of kernel to use ('gaussian' or 'box').

workers : int

Number of parallel processes to use. -1: Use all available CPU cores. Default is -1.

Returns:
ndarray[float]

KDE at the given point(s).

Raises:
ValueError

If the kernel type is not supported.

ValueError

If the bandwidth is not a positive number.
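
For intuition, a box-kernel density estimate in one dimension can be sketched with the standard library alone. This sketch assumes that the bandwidth is the full width of the box, so a sample contributes when it lies within bandwidth / 2 of the evaluation point. The library's convention, its multivariate handling, and its parallel evaluation may differ, and the function name is hypothetical.

```python
def box_kde_1d(data, bandwidth, at=None):
    """Illustrative 1D box-kernel density estimate (not the library function)."""
    if bandwidth <= 0:
        raise ValueError("The bandwidth must be a positive number.")
    # Evaluate at the data points themselves when no points are given.
    points = data if at is None else at
    n = len(data)
    half_width = bandwidth / 2
    densities = []
    for point in points:
        inside = sum(1 for x in data if abs(x - point) <= half_width)
        # Each sample inside the box contributes 1 / (n * bandwidth).
        densities.append(inside / (n * bandwidth))
    return densities


samples = [0.0, 0.1, 0.2, 1.0]
print(box_kde_1d(samples, 0.5))  # [1.5, 1.5, 1.5, 0.5]
```

As in the documented function, omitting at evaluates the density at the data points themselves.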

infomeasure.estimators.utils.kde.parallel_kde_evaluate(data, at, bandwidth, workers)

Evaluate KDE on a set of data in parallel.

Parameters:
data : array_like

The data to evaluate the KDE on.

at : array_like

The points at which to evaluate the KDE.

bandwidth : float or str

The bandwidth to use for the KDE.

workers : int

The number of worker processes to use for evaluation.

Notes

If the data has fewer than 100000 samples or the number of workers is 1, the KDE is evaluated on a single worker.

infomeasure.estimators.utils.kde.query_chunk(params)

Evaluate KDE on a chunk of data.

infomeasure.estimators.utils.normalize module

Normalize a time series.

infomeasure.estimators.utils.normalize.normalize_data_0_1(data: ndarray) -> ndarray

Normalize the given data to scale each dimension to the range [0, 1].

Parameters:
data : 2D array_like

List of data points where each row is a data point in d-dimensional space.

Returns:
2D array

Data scaled to the range [0, 1].

Examples

>>> import numpy as np
>>> from infomeasure.estimators.utils.normalize import normalize_data_0_1
>>> data = np.array([[1, 2], [3, 4], [5, 6]])
>>> normalize_data_0_1(data)
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])
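
The scaling itself is ordinary min-max normalization per column. The following pure-Python sketch reproduces the example above. It assumes every column contains at least two distinct values, and how the library treats constant columns is not covered here. The function name is hypothetical.

```python
def min_max_columns(rows):
    """Scale each column of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    # Subtract the column minimum, then divide by the column range.
    return [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


print(min_max_columns([[1, 2], [3, 4], [5, 6]]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```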

infomeasure.estimators.utils.ordinal module

Ordinal / Permutation utility functions.

infomeasure.estimators.utils.ordinal.permutation_to_integer(perm: ndarray, dtype: type = <class 'numpy.uint64'>) -> int

Convert a permutation pattern to a unique integer. The Lehmer code is used to convert the permutation to an integer.

Parameters:
perm : ndarray

A permutation pattern.

dtype : type, optional

The data type to use for the output. Default is uint64. Using this, the maximal number will be set and the smallest possible dtype will be used.

Returns:
int : int, uint8, uint16, uint32, uint64

A unique integer representing the permutation pattern.

Raises:
ValueError

If the embedding_dim is too large to convert to a uint64 (maximum 20).

Notes

This approach has at least been known since 1888 [Lai88]. It is named after Derrick Henry Lehmer [Leh60].

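Examples

>>> from numpy import array
>>> from infomeasure.estimators.utils.ordinal import permutation_to_integer
>>> permutation_to_integer(array([0, 1]))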
0
>>> permutation_to_integer(array([1, 0]))
1
>>> permutation_to_integer(array([0, 1, 2]))
0
>>> permutation_to_integer(array([2, 1, 0]))
5
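
One common way to compute a Lehmer-code index is sketched below in pure Python. It reproduces the four documented examples, but whether the library uses exactly this digit convention for every input is an assumption. The function name is hypothetical.

```python
from math import factorial


def lehmer_index(perm):
    """Rank a permutation via its Lehmer code (factorial number system)."""
    n = len(perm)
    index = 0
    for i, value in enumerate(perm):
        # Lehmer digit: how many later entries are smaller than this one.
        smaller_after = sum(1 for later in perm[i + 1:] if later < value)
        index += smaller_after * factorial(n - 1 - i)
    return index


print(lehmer_index([0, 1]), lehmer_index([1, 0]))        # 0 1
print(lehmer_index([0, 1, 2]), lehmer_index([2, 1, 0]))  # 0 5
```

The largest index for a pattern of length n is n! - 1. Since 20! still fits into an unsigned 64-bit integer and 21! does not, this is consistent with the documented maximum of 20.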
infomeasure.estimators.utils.ordinal.reduce_joint_space(data: ndarray | tuple[ndarray] | Generator) -> ndarray

Reduce the data to the joint space.

Reduce features: Assigns each unique feature vector to a unique integer. This is equivalent to the unique indices of the unique rows. If data is 1D, returns the data as is.

Parameters:
data : ndarray, shape (n_samples,) or (n_samples, n_features), or tuple of arrays

The data to reduce.

Returns:
ndarray, shape (n_samples,)

The data in the joint space.

Raises:
ValueError

If the data array is not 1D or 2D.

ValueError

If the data is a list.

Notes

The order of the unique values is not guaranteed, only the unique indices. This is because numpy sorts the values internally.

Examples

>>> from numpy import array
>>> data = array([[1, 2], [2, 3], [1, 2], [2, 3], [3, 4]])
>>> data.shape
(5, 2)
>>> reduce_joint_space(data)
array([0, 1, 0, 1, 2])
>>> reduce_joint_space(array([4, 5, 4, 5, 6]))
array([4, 5, 4, 5, 6])
>>> reduce_joint_space(array([[True, True, False], [False, True, False]]))
array([0, 1])
>>> reduce_joint_space(array([[3, 3], [2, 2], [1, 1]]))
array([2, 1, 0])
>>> reduce_joint_space((array([3, 2, 1]), array([3, 2, 1])))
array([2, 1, 0])
infomeasure.estimators.utils.ordinal.symbolize_series(series: ndarray, embedding_dim: int, step_size: int = 1, to_int=False, stable=False) -> ndarray

Convert a time series into a sequence of symbols (permutation patterns).

Parameters:
series : ndarray, shape (n,)

A numpy array of data points.

embedding_dim : int

The size of the permutation patterns.

step_size : int

The step size for the sliding windows. Takes every step_size-th element.

to_int : bool, optional

Whether to convert the permutation patterns to integers. Default is False.

stable : bool, optional

If True, the order of equal elements is preserved when sorting the data. This can be useful for reproducibility and testing, but might be slower.

Returns:
patterns : ndarray, shape (n - (embedding_dim - 1) * step_size, embedding_dim)

A list of tuples representing the symbolized series.

Raises:
ValueError

If the embedding_dim is less than 1.

ValueError

If the step_size is less than 1.

Examples

>>> import numpy as np
>>> from infomeasure.estimators.utils.ordinal import symbolize_series
>>> series = np.array([1, 2, 3, 2, 1])
>>> symbolize_series(series, 2, 1)
array([[0, 1],
       [0, 1],
       [1, 0],
       [1, 0]])
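
The symbolization can be pictured as taking the argsort of each sliding window. The sketch below is a pure-Python stand-in with a hypothetical name. It uses Python's stable sort, which corresponds to the stable=True behavior, and it omits the integer conversion.

```python
def ordinal_patterns(series, embedding_dim, step_size=1):
    """Return the argsort pattern of every sliding window of the series."""
    if embedding_dim < 1:
        raise ValueError("embedding_dim must be at least 1")
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    span = (embedding_dim - 1) * step_size
    patterns = []
    for start in range(len(series) - span):
        # Take every step_size-th element of the window.
        window = series[start:start + span + 1:step_size]
        # Indices that would sort the window; ties keep their original order.
        patterns.append(sorted(range(embedding_dim), key=window.__getitem__))
    return patterns


print(ordinal_patterns([1, 2, 3, 2, 1], 2, 1))
# [[0, 1], [0, 1], [1, 0], [1, 0]]
```

The number of patterns is n - (embedding_dim - 1) * step_size, which matches the documented output shape.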

infomeasure.estimators.utils.te_slicing module

Generalized data slicing method for transfer entropy estimators.

This module provides a method to slice the data arrays to prepare for transfer entropy (TE) calculation. The TE measures the information flow from a source variable (X) to a target/destination variable (Y). In this context, the future state is always associated with the target/destination variable.

Conventions:

  • X: Source variable

  • Y: Destination/target variable

  • dest_future: Future state of the destination variable (Y)

  • dest_history: Past states of the destination variable (Y)

  • src_history: Past states of the source variable (X)

The TE is calculated as:

\[\hat{T}(Y_{t+1}|Y^{(k)}, X^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g(\hat{y}_{i+1}, y_i^{(k)}, x_i^{(l)}) g(\hat y_i^{(k)})}{g(y_i^{(k)}, x_i^{(l)}) g(\hat{y}_{i+1}, y_i^{(k)})}\]
infomeasure.estimators.utils.te_slicing.cte_observations(source, destination, condition, src_hist_len=1, dest_hist_len=1, cond_hist_len=1, step_size=1, permute_src=False, resample_src=False, construct_joint_spaces: bool = True) -> tuple[ndarray, ndarray, ndarray, ndarray] | Iterable | tuple

Slice the data arrays to prepare for CTE calculation.

Calculating the conditional transfer entropy requires four observations.

\[\hat{T}(Y_{t+1}|Y^{(k)}, X^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g(\hat{y}_{i+1}, y_i^{(k)}, z_i^{(m)}, x_i^{(l)}) g(\hat y_i^{(k)}, z_i^{(m)})}{g(y_i^{(k)}, z_i^{(m)}, x_i^{(l)}) g(\hat{y}_{i+1}, y_i^{(k)}, z_i^{(m)})}\]
Parameters:
source : array, shape (n,)

A numpy array of data points for the source variable (X).

destination : array, shape (n,)

A numpy array of data points for the destination variable (Y).

condition : array, shape (n,)

A numpy array of data points for the conditioning variable (Z).

src_hist_len : int, optional

Number of past observations (l) to consider for the source data (X). Default is 1, only one current observation, no further history. One future observation is always considered for the destination data.

dest_hist_len : int, optional

Number of past observations (k) to consider for the destination data (Y). Default is 1, only one current observation, no further history.

cond_hist_len : int, optional

Number of past observations (m) to consider for the conditioning data (Z). Default is 1, only one current observation, no further history.

step_size : int, optional

Step size for the time delay in the embedding. Default is 1, so every observation is considered. If step_size is greater than 1, the history is subsampled. This applies to both the source and destination data.

permute_src : bool | Generator, optional

Whether to shuffle the sliced source history data. Default is False. This is used for the permutation TE. Rows are permuted, keeping the history intact. If a random number generator is provided, it will be used for shuffling. If True, a new random number generator will be created.

resample_src : bool | Generator, optional

Whether to resample the sliced source history data. Default is False. This is used for the permutation TE using bootstrapping. Rows are resampled with replacement, keeping the history intact. If a random number generator is provided, it will be used for resampling. If True, a new random number generator will be created.

construct_joint_spaces : bool, optional

Whether to construct the joint spaces. Default is True. If False, the sliced source, destination, and conditional data are returned instead.

Returns:
joint_space_data : array, shape (max_len, src_hist_len + dest_hist_len + cond_hist_len + 1)

\(g(x_i^{(l)}, z_i^{(m)}, y_i^{(k)}, \hat{y}_{i+1})\): Conditional joint space data.

dest_past_embedded : array, shape (max_len, dest_hist_len)

\(g(\hat y_i^{(k)}, z_i^{(m)})\) : Conditional embedded past destination data.

marginal_1_space_data : array, shape (max_len, dest_hist_len + src_hist_len)

\(g(x_i^{(l)}, z_i^{(m)}, y_i^{(k)})\) : Conditional marginal space data for destination and source.

marginal_2_space_data : array, shape (max_len, dest_hist_len + 1)

\(g(z_i^{(m)}, y_i^{(k)}, \hat{y}_{i+1})\) : Conditional marginal space data for destination.

sliced data : tuple of arrays

If construct_joint_spaces is False, the sliced source, destination and conditional data are returned instead. Namely, the tuple contains:

  • src_history : array, shape (max_len, src_hist_len) \(x_i^{(l)}\) : Source history.

  • dest_history : array, shape (max_len, dest_hist_len) \(y_i^{(k)}\) : Destination history.

  • dest_future : array, shape (max_len,) \(\hat{y}_{i+1}\) : Destination future.

  • cond_history : array, shape (max_len, cond_hist_len) \(z_i^{(m)}\) : Condition history.

Raises:
TypeError

If the arguments are wrong types.

ValueError

If the history (src_hist_len or dest_hist_len or cond_hist_len times step_size) is greater than the length of the data.

ValueError

If src_hist_len, dest_hist_len, cond_hist_len, or step_size are not positive integers.

Notes

Here, max_len = data_len - (max(src_hist_len, dest_hist_len, cond_hist_len) - 1) * step_size.

infomeasure.estimators.utils.te_slicing.te_observations(source, destination, src_hist_len=1, dest_hist_len=1, step_size=1, permute_src=False, resample_src=False, construct_joint_spaces: bool = True) -> tuple[ndarray, ndarray, ndarray, ndarray] | Iterable | tuple

Slice the data arrays to prepare for TE calculation.

Calculating the transfer entropy requires four observations.

\[\hat{T}(Y_{t+1}|Y^{(k)}, X^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g(\hat{y}_{i+1}, y_i^{(k)}, x_i^{(l)}) g(\hat y_i^{(k)})}{g(y_i^{(k)}, x_i^{(l)}) g(\hat{y}_{i+1}, y_i^{(k)})}\]
Parameters:
source : array, shape (n,)

A numpy array of data points for the source variable (X).

destination : array, shape (n,)

A numpy array of data points for the destination variable (Y).

src_hist_len : int, optional

Number of past observations (l) to consider for the source data (X). Default is 1, only one current observation, no further history. One future observation is always considered for the destination data.

dest_hist_len : int, optional

Number of past observations (k) to consider for the destination data (Y). Default is 1, only one current observation, no further history.

step_size : int, optional

Step size for the time delay in the embedding. Default is 1, so every observation is considered. If step_size is greater than 1, the history is subsampled. This applies to both the source and destination data.

permute_src : bool | Generator, optional

Whether to shuffle the sliced source history data. Default is False. This is used for the permutation TE. Rows are permuted, keeping the history intact. If a random number generator is provided, it will be used for shuffling. If True, a new random number generator will be created.

resample_src : bool | Generator, optional

Whether to resample the sliced source history data. Default is False. This is used for the permutation TE using bootstrapping. Rows are resampled with replacement, keeping the history intact. If a random number generator is provided, it will be used for resampling. If True, a new random number generator will be created.

construct_joint_spaces : bool, optional

Whether to construct the joint spaces. Default is True. If False, the sliced source and destination data are returned instead.

Returns:
joint_space_data : array, shape (max_len, src_hist_len + dest_hist_len + 1)

\(g(x_i^{(l)}, y_i^{(k)}, \hat{y}_{i+1})\): Joint space data.

dest_past_embedded : array, shape (max_len, dest_hist_len)

\(g(\hat y_i^{(k)})\) : Embedded past destination data.

marginal_1_space_data : array, shape (max_len, dest_hist_len + src_hist_len)

\(g(x_i^{(l)}, y_i^{(k)})\) : Marginal space data for destination and source.

marginal_2_space_data : array, shape (max_len, dest_hist_len + 1)

\(g(y_i^{(k)}, \hat{y}_{i+1})\) : Marginal space data for destination.

sliced data : tuple of arrays

If construct_joint_spaces is False, the sliced source and destination data are returned instead. Namely, the tuple contains:

  • src_history : array, shape (max_len, src_hist_len) \(x_i^{(l)}\) : Source history.

  • dest_history : array, shape (max_len, dest_hist_len) \(y_i^{(k)}\) : Destination history.

  • dest_future : array, shape (max_len,) \(\hat{y}_{i+1}\) : Destination future.

Raises:
ValueError

If the history (src_hist_len or dest_hist_len times step_size) is greater than the length of the data.

ValueError

If src_hist_len, dest_hist_len, or step_size are not positive integers.

ValueError

If both permute_src and resample_src are provided.

Notes

  • For permutation TE, either permute_src or resample_src can be used, but not both.

  • Here, max_len = data_len - (max(src_hist_len, dest_hist_len) - 1) * step_size.
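
To connect the slicing with the TE formula, the sketch below slices two discrete sequences with step_size fixed to 1 and evaluates the formula with relative frequencies standing in for g. It is a simplified standard-library illustration with hypothetical function names. The library's row count, column ordering, and estimators may differ from this sketch.

```python
from collections import Counter
from math import log


def slice_te(source, destination, src_hist_len=1, dest_hist_len=1):
    """Return (src_history, dest_history, dest_future) rows for step_size = 1."""
    start = max(src_hist_len, dest_hist_len)
    rows = []
    for t in range(start, len(destination)):
        src_history = tuple(source[t - src_hist_len:t])
        dest_history = tuple(destination[t - dest_hist_len:t])
        rows.append((src_history, dest_history, destination[t]))
    return rows


def plugin_te(source, destination, src_hist_len=1, dest_hist_len=1):
    """Plug-in transfer entropy in nats from the four observation counts."""
    rows = slice_te(source, destination, src_hist_len, dest_hist_len)
    joint = Counter(rows)
    dest_past = Counter(d for _, d, _ in rows)
    marginal_1 = Counter((s, d) for s, d, _ in rows)
    marginal_2 = Counter((d, f) for _, d, f in rows)
    total = 0.0
    for s, d, f in rows:
        # The sample size cancels in the ratio, so raw counts are sufficient.
        total += log(
            joint[(s, d, f)] * dest_past[d]
            / (marginal_1[(s, d)] * marginal_2[(d, f)])
        )
    return total / len(rows)


x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
y = [0] + x[:-1]  # y copies x with a delay of one step
print(plugin_te(x, y))          # positive: x helps to predict y
print(plugin_te([1] * 12, y))   # 0.0: a constant source carries no information
```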

infomeasure.estimators.utils.test_unit_ball_volume module

infomeasure.estimators.utils.unit_ball_volume module

Helper function for the unit ball volume.

infomeasure.estimators.utils.unit_ball_volume.unit_ball_volume(d, r=1, p=2)

Calculate the volume of the d-dimensional ball with radius r in \(L^p\) norm.

\[\begin{split}V_d = \begin{cases} 2r & \text{if } d = 1 \text{ and } p = 2, \\ \frac{4\pi r^3}{3} & \text{if } d = 3 \text{ and } p = 2, \\ (2r)^d & \text{if } p = \infty, \\ \frac{(\pi r^2)^{d/2}}{\Gamma(1 + d/2)} & \text{if } p = 2, \\ \frac{(2r)^d \Gamma(1 + \frac{1}{p})^d}{\Gamma(1 + \frac{d}{p})} & \text{otherwise}. \end{cases}\end{split}\]
Parameters:
d : int

The dimensionality of the space.

p : float

The \(L^p\) norm.

r : float, optional

The radius of the ball (default is 1).

Returns:
float

The volume of the d-dimensional ball with radius r in \(L^p\) norm.
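
The case distinction in the formula can be written out directly with math.gamma. The sketch below is an independent illustration with a hypothetical function name. The special cases for d = 1 and d = 3 with p = 2 are covered by the general p = 2 branch, so they are not listed separately.

```python
from math import gamma, inf, pi


def ball_volume(d, r=1.0, p=2.0):
    """Volume of the d-dimensional ball of radius r in the L^p norm."""
    if p == inf:
        # Chebyshev norm: the ball is a hypercube with side length 2r.
        return (2 * r) ** d
    if p == 2:
        # Euclidean norm.
        return (pi * r ** 2) ** (d / 2) / gamma(1 + d / 2)
    # General L^p norm.
    return (2 * r) ** d * gamma(1 + 1 / p) ** d / gamma(1 + d / p)


print(ball_volume(1))          # about 2, the interval [-1, 1]
print(ball_volume(3))          # about 4.18879, i.e. 4 * pi / 3
print(ball_volume(2, p=inf))   # 4.0, a square with side length 2
print(ball_volume(2, p=1))     # 2.0, a diamond with diagonals of length 2
```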

Module contents

Utilities for the estimators module of the infomeasure package.