PECUZAL embedding module¶
Created on Mon Aug 10 09:48:16 2020
@author: hkraemer
-
pecuzal_embedding.
all_neighbors
(vtree, vs, ns, K, theiler, k_max)[source]¶ Compute K nearest neighbours for the points vs (having indices ns) from the tree vtree, while respecting the theiler-window.
- Returns
indices (numpy.ndarray (len(vs),K)) – The indices of the K-nearest neighbours of all points vs (having indices ns)
dists (numpy.ndarray (len(vs),K)) – The distances to the K-nearest neighbours of all points vs (having indices ns)
-
pecuzal_embedding.
all_neighbors_1dim
(vs, ns, K, theiler)[source]¶ Compute K nearest neighbours for the points vs (having indices ns), while respecting the theiler-window.
- Returns
indices (numpy.ndarray (len(vs),K)) – The indices of the K-nearest neighbours of all points vs (having indices ns)
dists (numpy.ndarray (len(vs),K)) – The distances to the K-nearest neighbours of all points vs (having indices ns)
-
pecuzal_embedding.
choose_right_embedding_params
(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]¶ Choose the right embedding parameters of the estar-statistic in any embedding cycle, but the first one, on the basis of minimal L, i.e. maximal L_decrease.
- Parameters
estar (numpy.ndarray (len(taus), M)) – The M continuity statistic(s) for delays taus.
Y_act (numpy.ndarray) – The actual phase space trajectory.
s (numpy.ndarray` (N, M)) – The M time series of length N.
taus (iterable) – Possible delay values in sampling time units.
KNN (int) – The number of nearest neighbors to be considered in the L-statistic.
theiler (int) – Theiler window for excluding serial correlated points from neighbourhood, in sampling time units.
sample_size (float) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.].
norm (str) – The norm used for distance computations.
econ (bool) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.
- Returns
L (float) – L-value of chosen peak in estar
tau_idx (int) – The corresponding index value of the chosen peak
ts_idx (int) – The number of the chosen time series to start with
See also
-
pecuzal_embedding.
choose_right_embedding_params_first
(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]¶ Choose the right embedding parameters of the estar-statistic in the first embedding cycle. Return the L-decrease-value, the corresponding index value of the chosen peak tau_idx and the number of the chosen time series ts_idx to start with.
- Parameters
estar (numpy.ndarray (len(taus), M)) – The M continuity statistic(s) for delays taus.
Y_act (numpy.ndarray) – The actual phase space trajectory.
s (numpy.ndarray` (N, M)) – The M time series of length N.
taus (iterable) – Possible delay values in sampling time units.
KNN (int) – The number of nearest neighbors to be considered in the L-statistic.
theiler (int) – Theiler window for excluding serial correlated points from neighbourhood, in sampling time units.
sample_size (float) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.].
norm (str) – The norm used for distance computations.
econ (bool) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.
- Returns
L (float) – L-value of chosen peak in estar
tau_idx (int) – The corresponding index value of the chosen peak
ts_idx (int) – The number of the chosen time series to start with
See also
-
pecuzal_embedding.
comp_Ek2
(Y, ns, NNidxs, T, K, norm)[source]¶ Compute the approximated conditional variance for a specific point in Y (with index ns) for time horizon T.
- Returns
E2 – The approximated conditional variance for a specific point in state space ns (index value) with its K-nearest neighbors, which indices are stored in NNidxs, for a time horizon T. This corresponds to Eqs. 13 & 14 in [uzal2011].
- Return type
float
-
pecuzal_embedding.
continuity_statistic
(s, taus, js, delays=range(0, 50), sample_size=1.0, K=13, theiler=1, norm='euclidean', alpha=0.05, p=0.5)[source]¶ Compute the continuity statistic for a trajectory defined by s, taus and js and all time series stored in s.
- Parameters
s (numpy.ndarray (N, M)) – Input time series of length N. This can be a multivariate set, consisting of M time series, which are stored in the columns.
taus (list or numpy.ndarray) – Denotes what delay times will be used for constructing the trajectory for which the continuity statistic to all time series in s will be computed.
js (list or numpy.ndarray) – Denotes which of the timeseries contained in s will be used for constructing the trajectory using the delay values stored in taus. js can contain duplicate indices.
delays (iterable, optional) – Possible delay values in sampling time units (Default is delays = range(50)). The continuity statistic avrg_eps_star is a function of these delay values.
sample_size (float, optional) – Number of considered fiducial points as a fraction of input time series length, i.e. a float \(\in (0,1.]\) (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).
K (int, optional) – The amount of nearest neighbors in the \(\delta\)-ball. Must be at least 8 (in order to guarantee a valid statistic) and the Default is K = 13. The continuity statistic avrg_eps_star is computed by taking the minimum result over all \(k \in K\).
theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.
norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’
alpha (float, optional) – Significance level for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is alpha = 0.05).
p (float, optional) – Binominal p for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is p = 0.5).
- Returns
avrg_eps_star – The continuity statistic avrg_eps_star for each of the M time series in s for the given trajectory specified by the general embedding parameters taus and js and for all delay values specified in delays.
- Return type
numpy.ndarray (len(delays), M)
See also
Notes
The full algorithm is too large to discuss here and written in detail in Ref. [pecora2007] and in summary in Ref. [kraemer2020] .
-
pecuzal_embedding.
embedding_cycle_pecuzal
(Y_act, Ys, counter, M, taus, theiler, sample_size, K, norm, alpha, p, KNN, tau_vals, ts_vals, Ls, eps, econ)[source]¶ Perform an embedding cycle of the multivariate embedding, but the first one.
-
pecuzal_embedding.
entropy1d
(x, binrule)[source]¶ Returns the Shannon entropy according to the bin rule specified.
-
pecuzal_embedding.
entropy2d
(x, y, bin_edges)[source]¶ Returns the Shannon entropy according to the bin rule specified.
-
pecuzal_embedding.
first_embedding_cycle_pecuzal
(Ys, M, taus, theiler, sample_size, K, norm, alpha, p, KNN, tau_vals, ts_vals, Ls, eps, econ)[source]¶ Perform the first embedding cycle of the multivariate embedding.
-
pecuzal_embedding.
genembed
(s, taus, js)[source]¶ Perform an embedding with delays taus and time series stored in s, specified by their indices js
- Parameters
s (numpy.ndarray (N, M)) – Input time series of length N. This can be a multivariate set, consisting of `M`time series, which are stored in the columns.
taus (list or numpy.ndarray) – Denotes what delay times will be used for constructing the trajectory for which the continuity statistic to all time series in s will be computed.
js (list or numpy.ndarray) – Denotes which of the timeseries contained in s will be used for constructing the trajectory using the delay values stored in taus. js can contain duplicate indices.
- Returns
Y – The trajectory from the embedding of length N’ = N-sum(taus) of dimension d = len(taus).
- Return type
numpy.ndarray (N’, d)
Notes
The generalized embedding works as follows: taus, js are list’s (or numpy.ndarray’s) of length d, which also coincides with the embedding dimension. For example, imagine input trajectory \(s = [x, y, z]\) where \(x, y, z\) are timeseries (the columns of s). If js = (0, 2, 1) and taus = (0, 2, 7) the created delay vector at each step t will be
\[(x(t), z(t+2), y(t+7))\]
-
pecuzal_embedding.
get_binomial_table
(p=0.5, alpha=0.05, trial_range=8)[source]¶ Compute the numbers of points from the \(\delta\)-neighborhood, which need to fall outside the \(\varepsilon\)-neighborhood, in order to reject the Null Hypothesis at a significance level \(\alpha\).
- Parameters
p (float, optional) – Binominal p (Default is p = 0.5).
alpha (float, optional) – Significance level in order to be able to reject the Null on the basis of the binomial distribution (Default is alpha = 0.05).
trial_range (int, optional) – Number of considered delta-neighborhood-points (Default is trial_range = 8).
- Returns
delta_to_epsilon_amount – A dictionary with delta_points as keys and the corresponding number of points in order to reject the Null, epsilon_points, constitute the values.
- Return type
dict
Notes
One parameter of the binomial distribution is p, the other one would be the number of trials, i.e. the considered number of points of the \(\delta\)-neighborhood. trial_range determines the number of considered \(\delta\)-neighborhood-points, always starting from 8. For instance, if trial_range = 8, then \(\delta\)-neighborhood sizes from 8 up to 15 are considered.
-
pecuzal_embedding.
get_maxima
(s)[source]¶ Return the maxima of the given time series s and its indices
-
pecuzal_embedding.
hcat_lagged_values
(Y, s, tau)[source]¶ Add the tau lagged values of the timeseries s as additional component to Y (np.ndarray), in order to form a higher embedded dataset Z. The dimensionality of Z is thus equal to that of Y + 1.
-
pecuzal_embedding.
local_L_statistics
(estar, Y_act, s, taus, KNN, theiler, sample_size, norm, econ)[source]¶ Return the maximum decrease of the L-statistic L_decrease and corresponding delay-indices max_idx for all local maxima in estar.
-
pecuzal_embedding.
mi
(x, maxlag=50)[source]¶ Compute the auto mutual information of a time series x up to a lag maxlag.
- Parameters
x (numpy.ndarray) – Numpy array storing the time series values.
maxlag (int, optional) – The maximum lag in sampling units, i.e. an integer value
- Returns
mi (numpy.ndarray (len(range(maxlag)))) – The auto mutual information of the given time series at each considered lag.
lags (numpy.ndarray (len(range(maxlag)))) – The considered lags
-
pecuzal_embedding.
pecuzal_break_criterion
(Ls, counter, max_num_of_cycles, threshold)[source]¶ Checks whether some break criteria are fulfilled
-
pecuzal_embedding.
pecuzal_embedding
(s, taus=range(0, 50), theiler=1, sample_size=1.0, K=13, KNN=3, L_threshold=0.0, alpha=0.05, p=0.5, max_cycles=10, econ=False)[source]¶ Performs an embedding of time series using the PECUZAL method
- Parameters
s ('numpy.ndarray' (N, M)) – Input time series of length N as numpy array. This can be a multivariate set, where the M timeseries are stored in the columns.
taus (iterable, optional) – Possible delay values in sampling time units (Default is taus=range(50)). For each of the tau’s in taus the continuity statistic avrg_eps_star gets computed and further processed in order to find optimal delays for each embedding cycle.
theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.
sample_size (float, optional) – Number of considered fiducial points as a fraction of input time series length, i.e. a float from interval (0,1.] (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).
K (int, optional) – The amount of nearest neighbors in the Delta-ball. Must be at least 8 (in order to guarantee a valid statistic) and the Default is K = 13. The continuity statistic avrg_eps_star is computed in each embedding cycle, taking the minimum result over all k in K.
KNN (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is KNN = 3.
L_threshold (float, optional) – The algorithm breaks, when this threshold is exceeded by ΔL in an embedding cycle (set as a positive number, i.e. an absolute value of ΔL).
alpha (float, optional) – Significance level for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is alpha = 0.05).
p (float, optional) – Binominal p for obtaining the continuity statistic avrg_eps_star in each embedding cycle (Default is p = 0.5).
max_cycles (int, optional) – The algorithm will stop after that many cycles no matter what. Default is max_cycles = 10.
econ (bool, optional) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.
- Returns
Y (‘numpy.ndarray’ (N’, m)) – The trajectory from the embedding of length N’ = N-sum(tau_vals) of dimension m (embedding dimension)
tau_vals (‘list’ [int]) – The chosen delay values for each embedding cycle, len(tau_vals) = m.
ts_vals (‘list’ [int]) – The according time series number (index) chosen for each delay value in tau_vals, len(ts_vals) = m. For univariate embedding ts_vals is a vector of zeros of length tau_vals, because there is simply just one time series to choose from, i.e. index 0.
Ls (‘list’) – The \(\Delta L\)-statistic for each embedding cycle, including the very last encountered cycle, which will not enter the final trajectory Y. The total decrease of \(\Delta L\) is, thus, \(\Delta L_t = np.sum(Ls[:-1])\).
avrg_eps_stars (‘list’ [list]) – The continuity statistics for each embedding cycle. Contains avrg_eps_star of each embedding cycle.
See also
Notes
The method works iteratively and gradually builds the final embedding vectors Y, as proposed in [kraemer2020] . Based on the continuity statistic avrg_eps_star [pecora2007] the algorithm picks an optimal delay value tau_i for each embedding cycle i. For achieving that, we take the inpute time series s, denoted as the actual phase space trajectory Y_actual and compute the continuity statistic avrg_eps_star. 1. Each local maxima in avrg_eps_star is used for constructing a candidate embedding trajectory Y_trial with a delay corresponding to that specific peak in avrg_eps_star. 2. We then compute the L-statistic [uzal2011] for Y_trial (L-trial) and Y_actual (L_actual) for increasing prediction time horizons (free parameter in the L-statistic) and save the maximum difference max(L-trial - L_actual) as \(\Delta L\) (Note that this is a negative number, since the L-statistic decreases with better reconstructions). 3. We pick the peak/tau-value, for which \(\Delta L\) is minimal (=maximum decrease of the overall L-value) and construct the actual embedding trajectory Y_actual (steps 1.-3. correspond to an embedding cycle). 4. We repeat steps 1.-3. with Y_actual as input and stop the algorithm when \(\Delta L\) is > 0, i.e. when and additional embedding component would not lead to a lower overall L-value. Y_actual -> Y.
In case of multivariate embedding, i.e. when embedding a set of M time series, in each embedding cycle the continuity statistic avrg_eps_star gets computed for all M time series available. The optimal delay value tau_i in each embedding cycle i is chosen as the peak/tau-value for which \(\Delta L\) is minimal under all available peaks and under all M avrg_eps_star’s. In the first embedding cycle there will be \(M^2\) different avrg_eps_star’s to consider, since it is not clear a priori which time series of the input should consitute the first component of the embedding vector and form Y_actual.
The range of considered delay values is determined in taus and for the nearest neighbor search we respect the Theiler window theiler. The final embedding vector is stored in Y (numpy.ndarray). The chosen delay values for each embedding cycle are stored in tau_vals (list of int) and the according time series numbers chosen for each delay value in tau_vals are stored in ts_vals. For univariate embedding ts_vals is a vector of ones of length len(tau_vals), because there is simply just one time series to choose from. The function also returns the \(\Delta Ls\)-values Ls for each embedding cycle and the continuity statistic avrg_eps_stars as list of `list`s.
For distance computations the Euclidean norm is used.
References
- pecora2007(1,2)
Pecora et al., “A unified approach to attractor reconstruction”, Chaos, vol. 17, 013110, 2007. https://doi.org/10.1063/1.2430294
- uzal2011(1,2,3,4)
Uzal et al., “Optimal reconstruction of dynamical systems: A noise amplification approach”, Physical Review E, vol. 84, 016223, 2011. https://doi.org/10.1103/PhysRevE.84.016223
-
pecuzal_embedding.
pecuzal_multivariate_embedding_cycle
(Y_act, flag, Ys, taus, theiler, counter, eps, tau_vals, norm, Ls, ts_vals, sample_size, K, alpha, p, KNN, econ)[source]¶ Perform one embedding cycle on Y_act with a multivariate set Ys
-
pecuzal_embedding.
uzal_cost
(Y, K=3, Tw=40, theiler=1, sample_size=1.0, norm='euclidean')[source]¶ Compute the L-statistic for the trajectory Y.
- Parameters
Y (numpy.ndarray (N, d)) – State space vectors of length N and dimensionality d. This is usually an ouput from an embedding of some time series.
K (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is K = 3.
Tw (int, optional) – The maximal considered time horizon for obtaining the L-statistic. Default is Tw = 1.
theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.
sample_size (float, optional) – Number of considered fiducial points as a fraction of input trajectory length, i.e. a float from interval (0,1.] (Default is sample_size = 1.0, i.e., all points of the acutal trajectory get considered).
norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’
- Returns
L (float) – The value of the proposed cost-function.
L_local (numpy.ndarray (N’,)) – The local value of the proposed cost-function. Note that this output is only meaningful, if you have set sample_size = 1.0, i.e. considering all points of the trajectory. The length of L_local is N’ = len(Y)-Tw.
Notes
The L-statistic is based on theoretical arguments on noise amplification, the complexity of the reconstructed attractor and a direct measure of local stretch, which constitutes an irrelevance measure [uzal2011]. Technically, it is the logarithm of the product of the \(\sigma\)-statistic and a normalization statistic \(\alpha\) :
\[L = log_{10}(\sigma\alpha)\]The \(\sigma\)-statistic is computed as follows. \(\sigma = \sqrt{\sigma^2} = \sqrt{E^2/\varepsilon^2}\). \(E^2\) approximates the conditional variance at each point in state space and for a time horizon \(T \in Tw\), using \(K\) nearest neighbors. For each reference point of the state space trajectory Y, the neighborhood consists of the reference point itself and its K+1 nearest neighbors. \(E^2\) measures how strong a neighborhood expands during T time steps. \(E^2\) is averaged over many time horizons \(T = 1:Tw\). Consequently, \(\varepsilon^2\) is the size of the neighborhood at the reference point itself and is defined as the mean pairwise distance of the neighborhood. Finally, \(\sigma^2\) gets averaged over a range of reference points on the attractor, which is controlled by sample_size. This is just for performance reasons and the most accurate result will obviously be gained when setting sample_size=1.0.
The \(\alpha\)-statistic is a normalization factor, such that \(\sigma\)’s from different embeddings can be compared. \(\alpha^2\) is defined as the inverse of the sum of the inverse of all \(\varepsilon^2\)’s for all considered reference points.
-
pecuzal_embedding.
uzal_cost_pecuzal
(Y, Y_trial, Tw, K=3, theiler=1, norm='euclidean', econ=False)[source]¶ Compute Uzal etg al.’s L-statistic for the trajectories Y (L) and Y_trial (L_trial) up to the maximum given prediction horizon Tw.
- Parameters
Y (numpy.ndarray (N, d)) – State space vectors of length N and dimensionality d.
Y_trial (numpy.ndarray (N’, d+1)) – State space vectors of length N’ and dimensionality d+1. This is usually result of Y undergoing a potential embedding cycle.
Tw (int) – The maximal considered time horizon for obtaining the \(\Delta L\).
K (int, optional) – The number of nearest neighbors to be considered in the L-statistic, Default is K = 3.
theiler (int, optional) – Theiler window for excluding serial correlated points from neighbourhood. In sampling time units, Default is theiler = 1.
norm (str, optional) – The norm used for distance computations. Must be either ‘euclidean’ (Default) or ‘chebyshev’
econ (bool, optional) – Economy-mode for L-statistic computation. Instead of computing L-statistics for time horizons 2:Tw, here we only compute them for 2:2:Tw.
- Returns
:math:`Delta L` – The first minimal value of L_trial - L with respect to T \(\in\) 2:Tw (2:2:Tw, when econ is true).
- Return type
float
Notes
This function makes use of the idea of the L-statistic proposed in [uzal2011] and specifically tailored for the needs of the PECUZAL algorithm to work, cf. Ref. [kraemer2020] . For each T \(\in\) 2:Tw, we compute \(\Delta L\) = L_trial - L. As soon as \(\Delta L\) increases with increasing T (since it is a negative value for a valuable embedding Y_trial with respect to Y) and \(\Delta L>0\), the algorithm breaks.
See also