PECUZAL embedding module

Created on Mon Aug 10 09:48:16 2020

@author: hkraemer

pecuzal_embedding.all_neighbors(vtree, vs, ns, K, theiler, k_max)[source]

Compute K nearest neighbours for the points vs (having indices ns) from the tree vtree, while respecting the theiler-window.

Returns

  • indices (numpy.ndarray (len(vs),K)) – The indices of the K-nearest neighbours of all points vs (having indices ns)

  • dists (numpy.ndarray (len(vs),K)) – The distances to the K-nearest neighbours of all points vs (having indices ns)

pecuzal_embedding.all_neighbors_1dim(vs, ns, K, theiler)[source]

Compute K nearest neighbours for the points vs (having indices ns), while respecting the theiler-window.

Returns

  • indices (numpy.ndarray (len(vs),K)) – The indices of the K-nearest neighbours of all points vs (having indices ns)

  • dists (numpy.ndarray (len(vs),K)) – The distances to the K-nearest neighbours of all points vs (having indices ns)

pecuzal_embedding.choose_right_embedding_params(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]

Choose the right embedding parameters of the estar-statistic in any embedding cycle, but the first one, on the basis of minimal L, i.e. maximal L_decrease.

Parameters
  • estar (numpy.ndarray (len(taus), M)) – The M continuity statistic(s) for delays taus.

  • Y_act (numpy.ndarray) – The actual phase space trajectory.

  • s (numpy.ndarray` (N, M)) – The M time series of length N.

  • taus (iterable) – Possible delay values in sampling time units.

  • KNN (int) – The number of nearest neighbors to be considered in the L-statistic.

  • theiler (int) – Theiler window for excluding serial correlated points from neighbourhood, in sampling time units.

  • sample_size (float) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.].

  • norm (str) – The norm used for distance computations.

  • econ (bool) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.

Returns

  • L (float) – L-value of chosen peak in estar

  • tau_idx (int) – The corresponding index value of the chosen peak

  • ts_idx (int) – The number of the chosen time series to start with

pecuzal_embedding.choose_right_embedding_params_first(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]

Choose the right embedding parameters of the estar-statistic in the first embedding cycle. Return the L-decrease-value, the corresponding index value of the chosen peak tau_idx and the number of the chosen time series ts_idx to start with.

Parameters
  • estar (numpy.ndarray (len(taus), M)) – The M continuity statistic(s) for delays taus.

  • Y_act (numpy.ndarray) – The actual phase space trajectory.

  • s (numpy.ndarray` (N, M)) – The M time series of length N.

  • taus (iterable) – Possible delay values in sampling time units.

  • KNN (int) – The number of nearest neighbors to be considered in the L-statistic.

  • theiler (int) – Theiler window for excluding serial correlated points from neighbourhood, in sampling time units.

  • sample_size (float) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.].

  • norm (str) – The norm used for distance computations.

  • econ (bool) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.

Returns

  • L (float) – L-value of chosen peak in estar

  • tau_idx (int) – The corresponding index value of the chosen peak

  • ts_idx (int) – The number of the chosen time series to start with

pecuzal_embedding.comp_Ek2(Y, ns, NNidxs, T, K, norm)[source]

Compute the approximated conditional variance for a specific point in Y (with index ns) for time horizon T.

Returns

E2 – The approximated conditional variance for a specific point in state space ns (index value) with its K-nearest neighbors, which indices are stored in NNidxs, for a time horizon T. This corresponds to Eqs. 13 & 14 in [uzal2011].

Return type

float

pecuzal_embedding.compute_L_decrease(E2, E2_trial, eps2, eps2_trial, T, NN)[source]
pecuzal_embedding.continuity_per_timeseries(x, ns, allNNidxs, delays, K, alpha, p)[source]
pecuzal_embedding.continuity_statistic(s, taus, js, delays=range(0, 50), sample_size=1.0, K=13, theiler=1, norm='euclidean', alpha=0.05, p=0.5)[source]

Compute the continuity statistic for a trajectory defined by s, taus and js and all time series stored in s.

Parameters
  • s (numpy.ndarray (N, M)) – Input time series of length N. This can be a multivariate set, consisting of M time series, which are stored in the columns.

  • taus (list or numpy.ndarray) – Denotes what delay times will be used for constructing the trajectory for which the continuity statistic to all time series in s will be computed.

  • js (list or numpy.ndarray) – Denotes which of the timeseries contained in s will be used for constructing the trajectory using the delay values stored in taus. js can contain duplicate indices.

  • delays (iterable, optional) – Possible delay values in sampling time units (Default is delays = range(50)). The continuity statistic avrg_eps_star is a function of these delay values.

  • sample_size (float, optional) – Number of considered fiducial points as a fraction of input time series length, i.e. a float \(\in (0,1.]\) (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).

  • K (int, optional) – The amount of nearest neighbors in the \(\delta\)-ball. Must be at least 8 (in order to guarantee a valid statistic) and the Default is K = 13. The continuity statistic avrg_eps_star is computed by taking the minimum result over all \(k \in K\).

  • theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.

  • norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’

  • alpha (float, optional) – Significance level for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is alpha = 0.05).

  • p (float, optional) – Binominal p for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is p = 0.5).

Returns

avrg_eps_star – The continuity statistic avrg_eps_star for each of the M time series in s for the given trajectory specified by the general embedding parameters taus and js and for all delay values specified in delays.

Return type

numpy.ndarray (len(delays), M)

Notes

The full algorithm is too large to discuss here and written in detail in Ref. [pecora2007] and in summary in Ref. [kraemer2020] .

pecuzal_embedding.embedding_cycle_pecuzal(Y_act, Ys, counter, M, taus, theiler, sample_size, K, norm, alpha, p, KNN, tau_vals, ts_vals, Ls, eps, econ)[source]

Perform an embedding cycle of the multivariate embedding, but the first one.

pecuzal_embedding.entropy1d(x, binrule)[source]

Returns the Shannon entropy according to the bin rule specified.

pecuzal_embedding.entropy2d(x, y, bin_edges)[source]

Returns the Shannon entropy according to the bin rule specified.

pecuzal_embedding.eps_star(x, n, tau, NNidxs, delta_to_epsilon_amount, Ks)[source]
pecuzal_embedding.first_embedding_cycle_pecuzal(Ys, M, taus, theiler, sample_size, K, norm, alpha, p, KNN, tau_vals, ts_vals, Ls, eps, econ)[source]

Perform the first embedding cycle of the multivariate embedding.

pecuzal_embedding.genembed(s, taus, js)[source]

Perform an embedding with delays taus and time series stored in s, specified by their indices js

Parameters
  • s (numpy.ndarray (N, M)) – Input time series of length N. This can be a multivariate set, consisting of `M`time series, which are stored in the columns.

  • taus (list or numpy.ndarray) – Denotes what delay times will be used for constructing the trajectory for which the continuity statistic to all time series in s will be computed.

  • js (list or numpy.ndarray) – Denotes which of the timeseries contained in s will be used for constructing the trajectory using the delay values stored in taus. js can contain duplicate indices.

Returns

Y – The trajectory from the embedding of length N’ = N-sum(taus) of dimension d = len(taus).

Return type

numpy.ndarray (N’, d)

Notes

The generalized embedding works as follows: taus, js are list’s (or numpy.ndarray’s) of length d, which also coincides with the embedding dimension. For example, imagine input trajectory \(s = [x, y, z]\) where \(x, y, z\) are timeseries (the columns of s). If js = (0, 2, 1) and taus = (0, 2, 7) the created delay vector at each step t will be

\[(x(t), z(t+2), y(t+7))\]
pecuzal_embedding.get_binomial_table(p=0.5, alpha=0.05, trial_range=8)[source]

Compute the numbers of points from the \(\delta\)-neighborhood, which need to fall outside the \(\varepsilon\)-neighborhood, in order to reject the Null Hypothesis at a significance level \(\alpha\).

Parameters
  • p (float, optional) – Binominal p (Default is p = 0.5).

  • alpha (float, optional) – Significance level in order to be able to reject the Null on the basis of the binomial distribution (Default is alpha = 0.05).

  • trial_range (int, optional) – Number of considered delta-neighborhood-points (Default is trial_range = 8).

Returns

delta_to_epsilon_amount – A dictionary with delta_points as keys and the corresponding number of points in order to reject the Null, epsilon_points, constitute the values.

Return type

dict

Notes

One parameter of the binomial distribution is p, the other one would be the number of trials, i.e. the considered number of points of the \(\delta\)-neighborhood. trial_range determines the number of considered \(\delta\)-neighborhood-points, always starting from 8. For instance, if trial_range = 8, then \(\delta\)-neighborhood sizes from 8 up to 15 are considered.

pecuzal_embedding.get_maxima(s)[source]

Return the maxima of the given time series s and its indices

pecuzal_embedding.hcat_lagged_values(Y, s, tau)[source]

Add the tau lagged values of the timeseries s as additional component to Y (np.ndarray), in order to form a higher embedded dataset Z. The dimensionality of Z is thus equal to that of Y + 1.

pecuzal_embedding.local_L_statistics(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]

Return the maximum decrease of the L-statistic L_decrease and corresponding delay-indices max_idx for all local maxima in estar.

pecuzal_embedding.mi(x, maxlag=50)[source]

Compute the auto mutual information of a time series x up to a lag maxlag.

Parameters
  • x (numpy.ndarray) – Numpy array storing the time series values.

  • maxlag (int, optional) – The maximum lag in sampling units, i.e. an integer value

Returns

  • mi (numpy.ndarray (len(range(maxlag)))) – The auto mutual information of the given time series at each considered lag.

  • lags (numpy.ndarray (len(range(maxlag)))) – The considered lags

pecuzal_embedding.pecuzal_break_criterion(Ls, counter, max_num_of_cycles, threshold)[source]

Checks whether some break criteria are fulfilled

pecuzal_embedding.pecuzal_embedding(s, taus=range(0, 50), theiler=1, sample_size=1.0, K=13, KNN=3, L_threshold=0.0, alpha=0.05, p=0.5, max_cycles=10, econ=False)[source]

Performs an embedding of time series using the PECUZAL method

Parameters
  • s ('numpy.ndarray' (N, M)) – Input time series of length N as numpy array. This can be a multivariate set, where the M timeseries are stored in the columns.

  • taus (iterable, optional) – Possible delay values in sampling time units (Default is taus=range(50)). For each of the tau’s in taus the continuity statistic avrg_eps_star gets computed and further processed in order to find optimal delays for each embedding cycle.

  • theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.

  • sample_size (float, optional) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.] (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).

  • K (int, optional) – The amount of nearest neighbors in the Delta-ball. Must be at least 8 (in order to guarantee a valid statistic) and the Default is K = 13. The continuity statistic avrg_eps_star is computed in each embedding cycle, taking the minimum result over all k in K.

  • KNN (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is KNN = 3.

  • L_threshold (float, optional) – The algorithm breaks, when this threshold is exceeded by ΔL in an embedding cycle (set as a positive number, i.e. an absolute value of ΔL).

  • alpha (float, optional) – Significance level for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is alpha = 0.05).

  • p (float, optional) – Binominal p for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is p = 0.5).

  • max_cycles (int, optional) – The algorithm will stop after that many cycles no matter what. Default is max_cycles = 10.

  • econ (bool, optional) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.

Returns

  • Y (‘numpy.ndarray’ (N’, m)) – The trajectory from the embedding of length N’ = N-sum(tau_vals) of dimension m (embedding dimension)

  • tau_vals (‘list’ [int]) – The chosen delay values for each embedding cycle, len(tau_vals) = m.

  • ts_vals (‘list’ [int]) – The according time series number (index) chosen for each delay value in tau_vals, len(ts_vals) = m. For univariate embedding ts_vals is a vector of zeros of length tau_vals, because there is simply just one time series to choose from, i.e. index 0.

  • Ls (‘list’) – The \(\Delta L\)-statistic for each embedding cycle, including the very last encountered cycle, which will not enter the final trajectory Y. The total decrease of \(\Delta L\) is, thus, \(\Delta L_t = np.sum(Ls[:-1])\).

  • avrg_eps_stars (‘list’ [list]) – The continuity statistics for each embedding cycle. Contains avrg_eps_star of each embedding cycle.

Notes

The method works iteratively and gradually builds the final embedding vectors Y, as proposed in [kraemer2020] . Based on the continuity statistic avrg_eps_star [pecora2007] the algorithm picks an optimal delay value tau_i for each embedding cycle i. For achieving that, we take the inpute time series s, denoted as the actual phase space trajectory Y_actual and compute the continuity statistic avrg_eps_star. 1. Each local maxima in avrg_eps_star is used for constructing a candidate embedding trajectory Y_trial with a delay corresponding to that specific peak in avrg_eps_star. 2. We then compute the L-statistic [uzal2011] for Y_trial (L-trial) and Y_actual (L_actual) for increasing prediction time horizons (free parameter in the L-statistic) and save the maximum difference max(L-trial - L_actual) as \(\Delta L\) (Note that this is a negative number, since the L-statistic decreases with better reconstructions). 3. We pick the peak/tau-value, for which \(\Delta L\) is minimal (=maximum decrease of the overall L-value) and construct the actual embedding trajectory Y_actual (steps 1.-3. correspond to an embedding cycle). 4. We repeat steps 1.-3. with Y_actual as input and stop the algorithm when \(\Delta L\) is > 0, i.e. when and additional embedding component would not lead to a lower overall L-value. Y_actual -> Y.

In case of multivariate embedding, i.e. when embedding a set of M time series, in each embedding cycle the continuity statistic avrg_eps_star gets computed for all M time series available. The optimal delay value tau_i in each embedding cycle i is chosen as the peak/tau-value for which \(\Delta L\) is minimal under all available peaks and under all M avrg_eps_star’s. In the first embedding cycle there will be \(M^2\) different avrg_eps_star’s to consider, since it is not clear a priori which time series of the input should consitute the first component of the embedding vector and form Y_actual.

The range of considered delay values is determined in taus and for the nearest neighbor search we respect the Theiler window theiler. The final embedding vector is stored in Y (numpy.ndarray). The chosen delay values for each embedding cycle are stored in tau_vals (list of int) and the according time series numbers chosen for each delay value in tau_vals are stored in ts_vals. For univariate embedding ts_vals is a vector of ones of length len(tau_vals), because there is simply just one time series to choose from. The function also returns the \(\Delta Ls\)-values Ls for each embedding cycle and the continuity statistic avrg_eps_stars as list of `list`s.

For distance computations the Euclidean norm is used.

References

pecora2007(1,2)

Pecora et al., “A unified approach to attractor reconstruction”, Chaos, vol. 17, 013110, 2007. https://doi.org/10.1063/1.2430294

uzal2011(1,2,3,4)

Uzal et al., “Optimal reconstruction of dynamical systems: A noise amplification approach”, Physical Review E, vol. 84, 016223, 2011. https://doi.org/10.1103/PhysRevE.84.016223

pecuzal_embedding.pecuzal_multivariate_embedding_cycle(Y_act, flag, Ys, taus, theiler, counter, eps, tau_vals, norm, Ls, ts_vals, sample_size, K, alpha, p, KNN, econ)[source]

Perform one embedding cycle on Y_act with a multivariate set Ys

pecuzal_embedding.uzal_cost(Y, K=3, Tw=40, theiler=1, sample_size=1.0, norm='euclidean')[source]

Compute the L-statistic for the trajectory Y.

Parameters
  • Y (numpy.ndarray (N, d)) – State space vectors of length N and dimensionality d. This is usually an ouput from an embedding of some time series.

  • K (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is K = 3.

  • Tw (int, optional) – The maximal considered time horizon for obtaining the L-statistic. Default is Tw = 1.

  • theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.

  • sample_size (float, optional) – Number of considered fiducial points as a fraction of input trajectory length, i.e. a float from interval (0,1.] (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).

  • norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’

Returns

  • L (float) – The value of the proposed cost-function.

  • L_local (numpy.ndarray (N’,)) – The local value of the proposed cost-function. Note that this output is only meaningful, if you have set sample_size = 1.0, i.e. considering all points of the trajectory. The length of L_local is N’ = len(Y)-Tw.

Notes

The L-statistic is based on theoretical arguments on noise amplification, the complexity of the reconstructed attractor and a direct measure of local stretch, which constitutes an irrelevance measure [uzal2011]. Technically, it is the logarithm of the product of the \(\sigma\)-statistic and a normalization statistic \(\alpha\) :

\[L = log_{10}(\sigma\alpha)\]

The \(\sigma\)-statistic is computed as follows. \(\sigma = \sqrt{\sigma^2} = \sqrt{E^2/\varepsilon^2}\). \(E^2\) approximates the conditional variance at each point in state space and for a time horizon \(T \in Tw\), using \(K\) nearest neighbors. For each reference point of the state space trajectory Y, the neighborhood consists of the reference point itself and its K+1 nearest neighbors. \(E^2\) measures how strong a neighborhood expands during T time steps. \(E^2\) is averaged over many time horizons \(T = 1:Tw\). Consequently, \(\varepsilon^2\) is the size of the neighborhood at the reference point itself and is defined as the mean pairwise distance of the neighborhood. Finally, \(\sigma^2\) gets averaged over a range of reference points on the attractor, which is controlled by sample_size. This is just for performance reasons and the most accurate result will obviously be gained when setting sample_size=1.0.

The \(\alpha\)-statistic is a normalization factor, such that \(\sigma\)’s from different embeddings can be compared. \(\alpha^2\) is defined as the inverse of the sum of the inverse of all \(\varepsilon^2\)’s for all considered reference points.

pecuzal_embedding.uzal_cost_pecuzal(Y, Y_trial, Tw, K=3, theiler=1, norm='euclidean', econ=False)[source]

Compute Uzal etg al.’s L-statistic for the trajectories Y (L) and Y_trial (L_trial) up to the maximum given prediction horizon Tw.

Parameters
  • Y (numpy.ndarray (N, d)) – State space vectors of length N and dimensionality d.

  • Y_trial (numpy.ndarray (N’, d+1)) – State space vectors of length N’ and dimensionality d+1. This is usually result of Y undergoing a potential embedding cycle.

  • Tw (int) – The maximal considered time horizon for obtaining the \(\Delta L\).

  • K (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is K = 3.

  • theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.

  • norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’

  • econ (bool, optional) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.

Returns

:math:`Delta L` – The first minimal value of L_trial - L with respect to T \(\in\) 2:Tw (2:2:Tw, when econ is true).

Return type

float

Notes

This function makes use of the idea of the L-statistic proposed in [uzal2011] and specifically tailored for the needs of the PECUZAL algorithm to work, cf. Ref. [kraemer2020] . For each T \(\in\) 2:Tw, we compute \(\Delta L\) = L_trial - L. As soon as \(\Delta L\) increases with increasing T (since it is a negative value for a valuable embedding Y_trial with respect to Y) and \(\Delta L>0\), the algorithm breaks.

See also

uzal_cost()