infomeasure.estimators.utils package
Submodules
infomeasure.estimators.utils.array module
Array helpers for the base estimator.
- infomeasure.estimators.utils.array.assure_2d_data(data) -> ndarray | tuple[ndarray, ...]
Assure the data is 2D.
This function adds a new axis to the data if it is 1D. For tuples and generators, each element is checked and reshaped if necessary.
- Parameters:
- data
array_like or tuple/Generator of array_like The data to check.
- Returns:
array_like or tuple of array_like The reshaped data.
- Raises:
ValueError If the data is not supported for 2D conversion.
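As a rough illustration of the reshaping described above, the following sketch promotes 1D input to 2D with NumPy. The helper name `to_2d` is hypothetical, and appending the new axis as a trailing feature axis (shape `(n,)` becomes `(n, 1)`) is an assumption rather than a statement about the library's internals.

```python
import numpy as np


def to_2d(data):
    """Hypothetical helper: promote 1D arrays to 2D column arrays."""
    if isinstance(data, tuple):
        # Handle each element of a tuple separately.
        return tuple(to_2d(element) for element in data)
    arr = np.asarray(data)
    if arr.ndim == 1:
        # Assumption: the new axis is added last, giving shape (n, 1).
        return arr[:, np.newaxis]
    if arr.ndim == 2:
        return arr
    raise ValueError(f"Cannot convert {arr.ndim}D data to 2D.")


print(to_2d([1, 2, 3]).shape)
```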
infomeasure.estimators.utils.discrete_interaction_information module
Functions for interaction information, a multivariate generalization of mutual information.
- infomeasure.estimators.utils.discrete_interaction_information.conditional_mutual_information_global(*data: tuple, cond: numpy.ndarray, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> float
Estimate the global conditional mutual information between multiple random variables and a conditioning variable.
- Parameters:
- *data
array_like, shape (n_samples,) The data used to estimate the global conditional mutual information. You can pass an arbitrary number of data arrays as positional arguments.
- cond
array_like, shape (n_samples,) The conditioning variable.
- log_func
callable, optional The logarithm function to use. Default is the natural logarithm.
- miller_madow_correction
str | float | int, optional If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value.
log_func and miller_madow_correction should use the same base.
- Returns:
float The global conditional mutual information between the random variables.
- Raises:
ValueError If the conditioning variable is not one-dimensional.
- infomeasure.estimators.utils.discrete_interaction_information.conditional_mutual_information_local(*data: tuple, cond: numpy.ndarray, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> ndarray
Estimate the local conditional mutual information between multiple random variables and a conditioning variable.
The mean of the local conditional mutual information is the global conditional mutual information. Calculating only the global value is more efficient, so the local conditional mutual information should be evaluated only when explicitly needed.
- Parameters:
- *data
array_like, shape (n_samples,) The data used to estimate the local conditional mutual information. You can pass an arbitrary number of data arrays as positional arguments.
- cond
array_like, shape (n_samples,) The conditioning variable.
- log_func
callable, optional The logarithm function to use. Default is the natural logarithm.
- miller_madow_correction
str | float | int, optional If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value.
log_func and miller_madow_correction should use the same base.
- Returns:
ndarray The local conditional mutual information between the random variables.
- infomeasure.estimators.utils.discrete_interaction_information.millermadow_mi_corr(k_i, k_joint, n, base, k_cond=None)
Compute the Miller-Madow mutual information correction term.
This function calculates a correction term for the bias in mutual information estimates that arises from finite sample sizes. The correction is based on the marginal and joint cardinalities of the observed distributions.
- Parameters:
- k_i
list[int] A list containing the marginal cardinalities of individual variables in the dataset. Each element represents the number of unique values for the respective variable.
- k_joint
int The cardinality of the joint distribution. Represents the number of unique observations across all combined dimensions.
- n
int The sample size, representing the total number of observations in the data.
- base
str | float | int The logarithmic base used for the mutual information computation. If set to "e", the natural logarithm is used. Otherwise, the logarithm of the specified base is used.
- k_cond
int, optional The cardinality of the conditional variable. When this is used, k_i can be used as k_iZ, to calculate the correction for conditional MI.
- Returns:
float The calculated Miller-Madow correction term to adjust the mutual information value.
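To make the idea of such a correction term concrete, here is a minimal sketch for the two-variable case only. It assumes the standard Miller-Madow entropy correction of (K - 1) / (2n) nats per plug-in entropy and the decomposition I(X;Y) = H(X) + H(Y) - H(X,Y). The function name `mm_mi_correction_two_vars` is hypothetical, and the library's own formula (including the multivariate and conditional cases) may differ.

```python
from math import log


def mm_mi_correction_two_vars(k_x, k_y, k_joint, n, base="e"):
    """Hypothetical two-variable Miller-Madow style correction for MI."""
    # Assumption: each plug-in entropy is corrected by +(K - 1) / (2 n) nats.
    # I(X;Y) = H(X) + H(Y) - H(X,Y), so the terms combine with the same signs.
    corr = ((k_x - 1) + (k_y - 1) - (k_joint - 1)) / (2 * n)
    if base != "e":
        # Convert from nats to the requested logarithm base.
        corr /= log(base)
    return corr


# Two binary variables with all four joint states observed in 100 samples.
print(mm_mi_correction_two_vars(2, 2, 4, 100))
```

Under these assumptions the term is added to the plug-in mutual information estimate; it is negative whenever the joint alphabet is larger than the marginal alphabets combined.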
- infomeasure.estimators.utils.discrete_interaction_information.mutual_information_global(*data: tuple, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> float
Estimate the global mutual information between multiple random variables.
- Parameters:
- *data
array_like, shape (n_samples,) The data used to estimate the global mutual information. You can pass an arbitrary number of data arrays as positional arguments.
- log_func
callable, optional The logarithm function to use. Default is the natural logarithm.
- miller_madow_correction
str | float | int, optional If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value.
log_func and miller_madow_correction should use the same base.
- Returns:
float The global mutual information between the random variables.
- infomeasure.estimators.utils.discrete_interaction_information.mutual_information_local(*data: tuple, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None) -> ndarray
Estimate the local mutual information between multiple random variables.
The mean of the local mutual information is the global mutual information. Calculating only the global value is more efficient, so the local mutual information should be evaluated only when explicitly needed.
- Parameters:
- *data
array_like, shape (n_samples,) The data used to estimate the local mutual information. You can pass an arbitrary number of data arrays as positional arguments.
- log_func
callable, optional The logarithm function to use. Default is the natural logarithm.
- miller_madow_correction
str | float | int, optional If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value.
log_func and miller_madow_correction should use the same base.
- Returns:
ndarray The local mutual information between the random variables.
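The relationship between local and global values can be checked with a small plug-in estimator for two discrete variables. This is a standard-library sketch, not the library's implementation; the function names `local_mi` and `global_mi` are hypothetical, and no Miller-Madow correction is applied.

```python
from collections import Counter
from math import isclose, log


def local_mi(x, y):
    """Hypothetical plug-in local MI (in nats), one value per sample."""
    n = len(x)
    c_x, c_y, c_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # log p(x, y) / (p(x) p(y)) expressed with raw counts.
    return [log(c_xy[(a, b)] * n / (c_x[a] * c_y[b])) for a, b in zip(x, y)]


def global_mi(x, y):
    """Hypothetical plug-in global MI (in nats) computed from counts only."""
    n = len(x)
    c_x, c_y, c_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        count / n * log(count * n / (c_x[a] * c_y[b]))
        for (a, b), count in c_xy.items()
    )


x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 1, 1, 0, 0]
local_values = local_mi(x, y)
# The mean of the local values equals the global value.
print(isclose(sum(local_values) / len(local_values), global_mi(x, y)))
```

The global version loops over unique joint states only, which is why it is cheaper than computing one value per sample.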
infomeasure.estimators.utils.discrete_transfer_entropy module
Functions for efficient computation of discrete transfer entropy.
- infomeasure.estimators.utils.discrete_transfer_entropy.combined_te_form(slice_method, *data, local: bool = False, log_func: callable = <ufunc 'log'>, miller_madow_correction: str | float | int = None, **slice_kwargs) -> float | ndarray
Calculate the transfer entropy using the combined TE formula.
- Parameters:
- slice_method
function The slicing method to use for the symbolized data.
- *data
array_like The source, destination, and if applicable, conditional data.
- local
bool, optional Whether to calculate the local transfer entropy. If False, the global transfer entropy is calculated. Default is False.
- log_func
callable, optional The logarithm function to use. Default is the natural logarithm.
- miller_madow_correction
str | float | int, optional If not None, apply the Miller-Madow correction to the global mutual information in the information unit of the passed value.
log_func and miller_madow_correction should use the same base.
- **slice_kwargs
dict The history lengths for the source, destination, and if applicable, conditional data.
- Returns:
float | ndarray The transfer entropy value.
infomeasure.estimators.utils.exponential_family module
Helper functions for exponential family distributions.
Rényi entropy and Tsallis entropy are special cases of the more general exponential family of distributions. This module provides helper functions for these distributions.
- infomeasure.estimators.utils.exponential_family.calculate_common_entropy_components(data, k, at=None)
Calculate common components for entropy estimators.
- Parameters:
- data
array_like The data used to estimate the entropy.
- k
int The number of nearest neighbors used in the estimation, not including the data point itself.
- at
array_like, optional The parameter at which to evaluate the entropy components.
- Returns:
tuple Volume of the unit ball, k-th nearest neighbor distances, number of data points, and dimensionality of the data.
- Raises:
ValueError If the parameter k is selected too large.
ValueError If data and at have different dimensions.
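One of the returned components, the k-th nearest neighbor distances, can be illustrated with a brute-force NumPy sketch. The function name `kth_neighbor_distances` is hypothetical, the Euclidean norm is an assumption, and an efficient implementation would normally use a spatial tree rather than the full distance matrix built here.

```python
import numpy as np


def kth_neighbor_distances(data, k):
    """Hypothetical helper: distance from each point to its k-th neighbor."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, np.newaxis]
    n = data.shape[0]
    if k >= n:
        raise ValueError("k must be smaller than the number of data points.")
    # Full pairwise Euclidean distance matrix, O(n^2) memory.
    diffs = data[:, np.newaxis, :] - data[np.newaxis, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))
    dists.sort(axis=1)
    # Column 0 holds the zero self-distance, so column k is the k-th neighbor.
    return dists[:, k]


print(kth_neighbor_distances([0.0, 1.0, 3.0, 7.0], k=1))
```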
- infomeasure.estimators.utils.exponential_family.exponential_family_i1(k, V_m, rho_k, N, m, log_base_func)
Calculate the \(I_1\) of the exponential family distribution.
When \(q = 1\), the exponential family distribution reduces to the Shannon entropy.
- Parameters:
- k
int The number of nearest neighbors used in the estimation.
- V_m
float Volume of the unit ball in m-dimensional space.
- rho_k
array_like The k-th nearest neighbor distances.
- N
int Number of data points considered for the distances (subtract 1 if the point itself is not considered).
- m
int Dimensionality of the data.
- log_base_func
callable The logarithm function to use for the calculation with the chosen base.
- Returns:
float The \(I_1\) of the exponential family distribution.
- infomeasure.estimators.utils.exponential_family.exponential_family_iq(k, q, V_m, rho_k, N, m)
Calculate the \(I_q\) of the exponential family distribution.
- Parameters:
- k
int The number of nearest neighbors used in the estimation.
- q
float | int The Rényi or Tsallis parameter, order or exponent. Sometimes denoted as \(\alpha\) or \(q\). Should not be 1.
- V_m
float Volume of the unit ball in m-dimensional space.
- rho_k
array_like The k-th nearest neighbor distances.
- N
int Number of data points considered for the distances (subtract 1 if the point itself is not considered).
- m
int Dimensionality of the data.
- Returns:
float The \(I_q\) of the exponential family distribution.
infomeasure.estimators.utils.kde module
Kernel Density Estimation (KDE) utilities.
- infomeasure.estimators.utils.kde.gaussian_kernel_densities(data, bandwidth, at=None, workers=1, eigen_threshold: float = 1e-10)
Calculate the kernel density estimate (KDE) for a Gaussian kernel.
For multivariate data, the rank of the data is checked and the dimensionality is reduced if eigenvalues fall below the threshold. If the data already has full rank, no reprojection is done.
- Parameters:
- data
ndarray, shape (d, N) Data points to estimate density for.
- bandwidth
float Bandwidth parameter for kernel density estimation.
- at
array, optional A numpy array of points at which to evaluate the KDE. If None, the KDE is evaluated at the data points.
- workers
int, optional Number of workers to use for parallel processing. Default is 1.
- eigen_threshold
float, optional Threshold for eigenvalues to determine rank of data. Default is 1e-10.
- Returns:
- densities
ndarray, shape (n,) Estimated density values at data points.
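For intuition, a one-dimensional Gaussian KDE can be written in a few lines of standard-library Python. This sketch assumes that `bandwidth` is used directly as the kernel standard deviation, which may not match how the library scales it, and it omits the rank check and reprojection described above. The name `gaussian_kde_1d` is hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kde_1d(data, bandwidth, at=None):
    """Hypothetical 1D Gaussian KDE evaluated at `at` (or at the data)."""
    if bandwidth <= 0:
        raise ValueError("The bandwidth must be a positive number.")
    points = data if at is None else at
    # Normalization of a sum of n Gaussian kernels with std `bandwidth`.
    norm = 1.0 / (len(data) * bandwidth * sqrt(2 * pi))
    return [
        norm * sum(exp(-0.5 * ((x - x_i) / bandwidth) ** 2) for x_i in data)
        for x in points
    ]


print(gaussian_kde_1d([0.0, 1.0, 2.0], bandwidth=0.5, at=[1.0]))
```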
- infomeasure.estimators.utils.kde.kde_probability_density_function(data, bandwidth, at=None, kernel='box', workers=-1)
Estimate the probability density function for a given data set using Kernel Density Estimation (KDE).
- Parameters:
- data
array A numpy array of data points, where each column represents a dimension.
- bandwidth
float The bandwidth for the kernel.
- at
array, optional A numpy array of points at which to evaluate the KDE. If None, the KDE is evaluated at the data points.
- kernel
str Type of kernel to use ('gaussian' or 'box').
- workers
int Number of parallel processes to use. -1: Use all available CPU cores. Default is -1.
- Returns:
- Raises:
ValueError If the kernel type is not supported.
ValueError If the bandwidth is not a positive number.
- infomeasure.estimators.utils.kde.parallel_kde_evaluate(data, at, bandwidth, workers)
Evaluate KDE on a set of data in parallel.
- Parameters:
- data
array_like The data to evaluate the KDE on.
- at
array_like The points at which to evaluate the KDE.
- bandwidth
float or str The bandwidth to use for the KDE.
- workers
int The number of worker processes to use for evaluation.
Notes
If the data has fewer than 100000 samples or the number of workers is 1, the KDE is evaluated on a single worker.
infomeasure.estimators.utils.normalize module
Normalize a time series.
- infomeasure.estimators.utils.normalize.normalize_data_0_1(data: ndarray) -> ndarray
Normalize the given data to scale each dimension to the range [0, 1].
- Parameters:
- data
2D array_like List of data points where each row is a data point in d-dimensional space.
- Returns:
2D array Data scaled to the range [0, 1].
Examples
>>> import numpy as np
>>> from infomeasure.estimators.utils.normalize import normalize_data_0_1
>>> data = np.array([[1, 2], [3, 4], [5, 6]])
>>> normalize_data_0_1(data)
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])
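The scaling in the example above corresponds to per-column min-max normalization. A minimal NumPy sketch follows; the name `min_max_scale` is hypothetical, and mapping constant columns to 0 is an assumption made here to avoid division by zero, not documented library behavior.

```python
import numpy as np


def min_max_scale(data):
    """Hypothetical helper: scale each column of a 2D array to [0, 1]."""
    data = np.asarray(data, dtype=float)
    mins = data.min(axis=0)
    spans = data.max(axis=0) - mins
    # Assumption: constant columns are mapped to 0 instead of dividing by zero.
    spans[spans == 0] = 1.0
    return (data - mins) / spans


print(min_max_scale([[1, 2], [3, 4], [5, 6]]))
```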
infomeasure.estimators.utils.ordinal module
Ordinal / Permutation utility functions.
- infomeasure.estimators.utils.ordinal.permutation_to_integer(perm: numpy.ndarray, dtype: type = <class 'numpy.uint64'>) -> int
Convert a permutation pattern to a unique integer. The Lehmer code is used to convert the permutation to an integer.
- Parameters:
- Returns:
int, uint8, uint16, uint32, uint64 A unique integer representing the permutation pattern.
- Raises:
ValueError If the embedding_dim is too large to convert to a uint64 (maximum 20).
Notes
This approach has at least been known since 1888 [Lai88]. It is named after Derrick Henry Lehmer [Leh60].
Examples
>>> from numpy import array
>>> permutation_to_integer(array([0, 1]))
0
>>> permutation_to_integer(array([1, 0]))
1
>>> permutation_to_integer(array([0, 1, 2]))
0
>>> permutation_to_integer(array([2, 1, 0]))
5
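A Lehmer-code conversion can be sketched in pure Python as follows. The function name `lehmer_index` is hypothetical. The sketch reproduces the four example values above, but the digit convention is an assumption and the library may order other permutations differently.

```python
from math import factorial


def lehmer_index(perm):
    """Hypothetical helper: map a permutation of 0..n-1 to an integer."""
    n = len(perm)
    index = 0
    for i, value in enumerate(perm):
        # Lehmer digit: how many later entries are smaller than this one.
        smaller_to_right = sum(1 for later in perm[i + 1:] if later < value)
        # Weight the digit by its place value in the factorial number system.
        index += smaller_to_right * factorial(n - 1 - i)
    return index


print([lehmer_index(p) for p in ([0, 1], [1, 0], [0, 1, 2], [2, 1, 0])])
```

Since 20! still fits into an unsigned 64-bit integer and 21! does not, an index built this way overflows uint64 beyond 20 elements, which is consistent with the limit stated under Raises.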
- infomeasure.estimators.utils.ordinal.reduce_joint_space(data: ndarray | tuple[ndarray] | Generator) -> ndarray
Reduce the data to the joint space.
Reduce features: Assigns each unique feature vector to a unique integer. This is equivalent to the unique indices of the unique rows. If data is 1D, returns the data as is.
- Parameters:
- Returns:
ndarray, shape (n_samples,) The data in the joint space.
- Raises:
ValueError If the data array is not 1D or 2D.
ValueError If the data is a list.
Notes
The order of the unique values is not guaranteed, only the unique indices. This is because numpy sorts the values internally.
Examples
>>> from numpy import array
>>> data = array([[1, 2], [2, 3], [1, 2], [2, 3], [3, 4]])
>>> data.shape
(5, 2)
>>> reduce_joint_space(data)
array([0, 1, 0, 1, 2])
>>> reduce_joint_space(array([4, 5, 4, 5, 6]))
array([4, 5, 4, 5, 6])
>>> reduce_joint_space(array([[True, True, False], [False, True, False]]))
array([0, 1])
>>> reduce_joint_space(array([[3, 3], [2, 2], [1, 1]]))
array([2, 1, 0])
>>> reduce_joint_space((array([3, 2, 1]), array([3, 2, 1])))
array([2, 1, 0])
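The idea of labelling unique rows can be sketched with `numpy.unique`. The helper name `joint_labels` is hypothetical, and this is not the library's implementation: it reproduces the integer examples above, but label numbering may differ in other cases (for instance the boolean example).

```python
import numpy as np


def joint_labels(data):
    """Hypothetical helper: label each unique row with an integer."""
    data = np.asarray(data)
    if data.ndim == 1:
        # 1D data is already a single variable and is returned unchanged.
        return data
    # The inverse index maps every row to the position of its unique row.
    _, inverse = np.unique(data, axis=0, return_inverse=True)
    # Flatten, because the inverse shape has varied between NumPy versions.
    return inverse.reshape(-1)


print(joint_labels([[1, 2], [2, 3], [1, 2], [2, 3], [3, 4]]))
```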
- infomeasure.estimators.utils.ordinal.symbolize_series(series: ndarray, embedding_dim: int, step_size: int = 1, to_int=False, stable=False) -> ndarray
Convert a time series into a sequence of symbols (permutation patterns).
- Parameters:
- series
ndarray, shape (n,) A numpy array of data points.
- embedding_dim
int The size of the permutation patterns.
- step_size
int The step size for the sliding windows. Takes every step_size-th element.
- to_int
bool, optional Whether to convert the permutation patterns to integers. Default is False.
- stable
bool, optional If True, when sorting the data, the order of equal elements is preserved. This can be useful for reproducibility and testing, but might be slower.
- Returns:
- patterns
ndarray, shape (n - (embedding_dim - 1) * step_size, embedding_dim) An array of permutation patterns representing the symbolized series.
- Raises:
ValueError If the embedding_dim is less than 1.
ValueError If the step_size is less than 1.
Examples
>>> import numpy as np
>>> series = np.array([1, 2, 3, 2, 1])
>>> symbolize_series(series, 2, 1)
array([[0, 1],
       [0, 1],
       [1, 0],
       [1, 0]])
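The symbolization can be sketched in pure Python as a sliding window followed by an argsort of each window. The function name `ordinal_patterns` is hypothetical, and using the argsort order as the pattern is an assumption that happens to match the example above. Python's `sorted` is stable, so ties keep their original order.

```python
def ordinal_patterns(series, embedding_dim, step_size=1):
    """Hypothetical helper: argsort pattern of each delayed window."""
    if embedding_dim < 1 or step_size < 1:
        raise ValueError("embedding_dim and step_size must be at least 1.")
    span = (embedding_dim - 1) * step_size
    patterns = []
    for start in range(len(series) - span):
        # Take embedding_dim values, each step_size apart.
        window = series[start : start + span + 1 : step_size]
        # Indices that would sort the window (stable for equal values).
        patterns.append(sorted(range(embedding_dim), key=window.__getitem__))
    return patterns


print(ordinal_patterns([1, 2, 3, 2, 1], embedding_dim=2))
```

The number of patterns is `n - (embedding_dim - 1) * step_size`, in line with the shape given under Returns.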
infomeasure.estimators.utils.te_slicing module
Generalized data slicing method for transfer entropy estimators.
This module provides a method to slice the data arrays to prepare for transfer entropy (TE) calculation. The TE measures the information flow from a source variable (X) to a target/destination variable (Y). In this context, the future state is always associated with the target/destination variable.
Conventions:
- X: Source variable
- Y: Destination/target variable
- dest_future: Future state of the destination variable (Y)
- dest_history: Past states of the destination variable (Y)
- src_history: Past states of the source variable (X)
The TE is calculated as:
- infomeasure.estimators.utils.te_slicing.cte_observations(source, destination, condition, src_hist_len=1, dest_hist_len=1, cond_hist_len=1, step_size=1, permute_src=False, resample_src=False, construct_joint_spaces: bool = True) -> tuple[ndarray, ndarray, ndarray, ndarray] | Iterable | tuple
Slice the data arrays to prepare for CTE calculation.
Four observations are required to calculate the conditional transfer entropy (CTE).
\[\hat{T}(Y_{t+1}|Y^{(k)}, X^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g(\hat{y}_{i+1}, y_i^{(k)}, z_i^{(m)}, x_i^{(l)}) g(\hat y_i^{(k)}, z_i^{(m)})}{g(y_i^{(k)}, z_i^{(m)}, x_i^{(l)}) g(\hat{y}_{i+1}, y_i^{(k)}, z_i^{(m)})}\]
- Parameters:
- source
array, shape (n,) A numpy array of data points for the source variable (X).
- destination
array, shape (n,) A numpy array of data points for the destination variable (Y).
- condition
array, shape (n,) A numpy array of data points for the conditioning variable (Z).
- src_hist_len
int, optional Number of past observations (l) to consider for the source data (X). Default is 1, only one current observation, no further history. One future observation is always considered for the destination data.
- dest_hist_len
int, optional Number of past observations (k) to consider for the destination data (Y). Default is 1, only one current observation, no further history.
- cond_hist_len
int, optional Number of past observations (m) to consider for the conditioning data (Z). Default is 1, only one current observation, no further history.
- step_size
int, optional Step size for the time delay in the embedding. Default is 1, every observation is considered. If step_size is greater than 1, the history is subsampled. This applies to both the source and destination data.
- permute_src
bool | Generator, optional Whether to shuffle the sliced source history data. Default is False. This is used for the permutation TE. Rows are permuted, keeping the history intact. If a random number generator is provided, it will be used for shuffling. If True, a new random number generator will be created.
- resample_src
bool | Generator, optional Whether to resample the sliced source history data. Default is False. This is used for the permutation TE using bootstrapping. Rows are resampled with replacement, keeping the history intact. If a random number generator is provided, it will be used for resampling. If True, a new random number generator will be created.
- construct_joint_spaces
bool, optional Whether to construct the joint spaces. Default is True. If False, the sliced source, destination and conditional data are returned instead.
- Returns:
- joint_space_data
array, shape (max_len, src_hist_len + dest_hist_len + cond_hist_len + 1) \(g(x_i^{(l)}, z_i^{(m)}, y_i^{(k)}, \hat{y}_{i+1})\): Conditional joint space data.
- dest_past_embedded
array, shape (max_len, dest_hist_len) \(g(\hat y_i^{(k)}, z_i^{(m)})\): Conditional embedded past destination data.
- marginal_1_space_data
array, shape (max_len, dest_hist_len + src_hist_len) \(g(x_i^{(l)}, z_i^{(m)}, y_i^{(k)})\): Conditional marginal space data for destination and source.
- marginal_2_space_data
array, shape (max_len, dest_hist_len + 1) \(g(z_i^{(m)}, y_i^{(k)}, \hat{y}_{i+1})\): Conditional marginal space data for destination.
- sliced data
tuple of arrays If construct_joint_spaces is False, the sliced source, destination and conditional data are returned instead. Namely, the tuple contains:
- src_history: array, shape (max_len, src_hist_len) \(x_i^{(l)}\): Source history.
- dest_history: array, shape (max_len, dest_hist_len) \(y_i^{(k)}\): Destination history.
- dest_future: array, shape (max_len,) \(\hat{y}_{i+1}\): Destination future.
- cond_history: array, shape (max_len, cond_hist_len) \(z_i^{(m)}\): Condition history.
- Raises:
TypeError If the arguments have the wrong types.
ValueError If the history (src_hist_len, dest_hist_len, or cond_hist_len times step_size) is greater than the length of the data.
ValueError If src_hist_len, dest_hist_len, cond_hist_len, or step_size are not positive integers.
Notes
In the shapes above, max_len = data_len - (max(src_hist_len, dest_hist_len, cond_hist_len) - 1) * step_size.
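As a worked example of the length formula in the note, the following sketch evaluates it for a few settings. The helper name `sliced_length` is hypothetical; it only restates the arithmetic and does not perform any slicing.

```python
def sliced_length(data_len, *hist_lens, step_size=1):
    """Hypothetical helper: evaluate the max_len formula from the note."""
    # Only the longest history determines how many rows remain.
    return data_len - (max(hist_lens) - 1) * step_size


# 100 samples, histories l=3, k=2, m=1, step size 2: 100 - (3 - 1) * 2 = 96.
print(sliced_length(100, 3, 2, 1, step_size=2))
```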
- infomeasure.estimators.utils.te_slicing.te_observations(source, destination, src_hist_len=1, dest_hist_len=1, step_size=1, permute_src=False, resample_src=False, construct_joint_spaces: bool = True) -> tuple[ndarray, ndarray, ndarray, ndarray] | Iterable | tuple
Slice the data arrays to prepare for TE calculation.
Four observations are required to calculate the transfer entropy (TE).
\[\hat{T}(Y_{t+1}|Y^{(k)}, X^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g(\hat{y}_{i+1}, y_i^{(k)}, x_i^{(l)}) g(\hat y_i^{(k)})}{g(y_i^{(k)}, x_i^{(l)}) g(\hat{y}_{i+1}, y_i^{(k)})}\]
- Parameters:
- source
array, shape (n,) A numpy array of data points for the source variable (X).
- destination
array, shape (n,) A numpy array of data points for the destination variable (Y).
- src_hist_len
int, optional Number of past observations (l) to consider for the source data (X). Default is 1, only one current observation, no further history. One future observation is always considered for the destination data.
- dest_hist_len
int, optional Number of past observations (k) to consider for the destination data (Y). Default is 1, only one current observation, no further history.
- step_size
int, optional Step size for the time delay in the embedding. Default is 1, every observation is considered. If step_size is greater than 1, the history is subsampled. This applies to both the source and destination data.
- permute_src
bool | Generator, optional Whether to shuffle the sliced source history data. Default is False. This is used for the permutation TE. Rows are permuted, keeping the history intact. If a random number generator is provided, it will be used for shuffling. If True, a new random number generator will be created.
- resample_src
bool | Generator, optional Whether to resample the sliced source history data. Default is False. This is used for the permutation TE using bootstrapping. Rows are resampled with replacement, keeping the history intact. If a random number generator is provided, it will be used for resampling. If True, a new random number generator will be created.
- construct_joint_spaces
bool, optional Whether to construct the joint spaces. Default is True. If False, the sliced source and destination data are returned instead.
- Returns:
- joint_space_data
array, shape (max_len, src_hist_len + dest_hist_len + 1) \(g(x_i^{(l)}, y_i^{(k)}, \hat{y}_{i+1})\): Joint space data.
- dest_past_embedded
array, shape (max_len, dest_hist_len) \(g(\hat y_i^{(k)})\): Embedded past destination data.
- marginal_1_space_data
array, shape (max_len, dest_hist_len + src_hist_len) \(g(x_i^{(l)}, y_i^{(k)})\): Marginal space data for destination and source.
- marginal_2_space_data
array, shape (max_len, dest_hist_len + 1) \(g(y_i^{(k)}, \hat{y}_{i+1})\): Marginal space data for destination.
- sliced data
tuple of arrays If construct_joint_spaces is False, the sliced source and destination data are returned instead. Namely, the tuple contains:
- src_history: array, shape (max_len, src_hist_len) \(x_i^{(l)}\): Source history.
- dest_history: array, shape (max_len, dest_hist_len) \(y_i^{(k)}\): Destination history.
- dest_future: array, shape (max_len,) \(\hat{y}_{i+1}\): Destination future.
- Raises:
ValueError If the history (src_hist_len or dest_hist_len times step_size) is greater than the length of the data.
ValueError If src_hist_len, dest_hist_len, or step_size are not positive integers.
ValueError If both permute_src and resample_src are provided.
Notes
For permutation TE, either permute_src or resample_src can be used, but not both.
In the shapes above, max_len = data_len - (max(src_hist_len, dest_hist_len) - 1) * step_size.
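The difference between permuting and resampling the source history can be sketched with a NumPy random generator. The array `src_history` below is made-up illustration data and the seed is arbitrary; the sketch only shows row-wise shuffling versus row-wise bootstrapping, not how the library wires these options internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up sliced source history: 4 rows, each a history of length 2.
src_history = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Permutation: every row appears exactly once, in shuffled order.
permuted = rng.permutation(src_history, axis=0)

# Bootstrap: rows are drawn with replacement, so repeats are possible.
resampled = rng.choice(src_history, size=len(src_history), replace=True, axis=0)

print(permuted.shape, resampled.shape)
```

In both cases whole rows are moved, so the values inside each history vector stay together.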
infomeasure.estimators.utils.test_unit_ball_volume module
infomeasure.estimators.utils.unit_ball_volume module
Helper function for the unit ball volume.
- infomeasure.estimators.utils.unit_ball_volume.unit_ball_volume(d, r=1, p=2)
Calculate the volume of the d-dimensional ball with radius r in \(L^p\) norm.
\[\begin{split}V_d = \begin{cases} 2r & \text{if } d = 1 \text{ and } p = 2, \\ \frac{4\pi r^3}{3} & \text{if } d = 3 \text{ and } p = 2, \\ (2r)^d & \text{if } p = \infty, \\ \frac{(\pi r^2)^{d/2}}{\Gamma(1 + d/2)} & \text{if } p = 2, \\ \frac{(2r)^d \Gamma(1 + \frac{1}{p})^d}{\Gamma(1 + \frac{d}{p})} & \text{otherwise}. \end{cases}\end{split}\]
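The cases of the formula can be transcribed into a short standard-library sketch. The function name `ball_volume` is hypothetical; the special cases for d = 1 and d = 3 are omitted because the general p = 2 expression already covers them.

```python
from math import gamma, inf, isclose, pi


def ball_volume(d, r=1.0, p=2):
    """Hypothetical transcription of the L^p ball volume formula."""
    if p == inf:
        # Chebyshev norm: the ball is a cube with side length 2r.
        return (2 * r) ** d
    if p == 2:
        # Euclidean norm.
        return (pi * r**2) ** (d / 2) / gamma(1 + d / 2)
    # General L^p norm.
    return (2 * r) ** d * gamma(1 + 1 / p) ** d / gamma(1 + d / p)


# The general branch agrees with the Euclidean branch when p is close to 2.
print(isclose(ball_volume(3, p=2), ball_volume(3, p=2.0000001), rel_tol=1e-6))
```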
Module contents
Utilities for the estimators module of the infomeasure package.