Data Augmentation for Time Series¶
In deep learning, our dataset should help the optimization mechanism locate a good spot in the parameter space. However, real-world data is not necessarily diverse enough to cover the required situations with enough records. For example, some datasets may have extremely imbalanced class labels, which leads to poor performance in classification tasks ^{3}. Another problem with a limited dataset is that the trained model may not generalize well ^{4}^{5}.
We will cover two topics in this section: augmenting the dataset and applying the augmented data to model training.
Augmenting the Dataset¶
There are many different ways of augmenting time series data ^{4}^{6}. We categorize the methods into the following groups:
 Random transformations, e.g., jittering;
 Pattern mixing, e.g., DBA ^{7};
 Generative models, e.g.,
   phenomenological generative models such as AR ^{8},
   first-principles models such as economic models ^{9},
   deep generative models such as TimeGAN or TS GAN ^{10}^{11}.
We refer to the first two groups, random transformations and pattern mixing, as basic methods.
Basic Methods¶
In the following table, we group some of the data augmentation methods along two dimensions: the category of the method and the domain in which the method is applied.
| | Projected Domain | Time Scale | Magnitude |
|---|---|---|---|
| Random Transformation | Frequency Masking, Frequency Warping, Fourier Transform, STFT | Permutation, Slicing, Time Warping, Time Masking, Cropping | Jittering, Flipping, Scaling, Magnitude Warping |
| Pattern Mixing | EMDA^{12}, SFM^{13} | Guided Warping^{14} | DFM^{9}, Interpolation, DBA^{7} |
For completeness, we explain some of these methods in more detail below.
Perturbation in Fourier Domain¶
In the Fourier domain, for the amplitude \(A_f\) and phase \(\phi_f\) at a specific frequency, we can perform^{15}
 magnitude replacement using a Gaussian distribution, and
 phase shift by adding Gaussian noise.
We apply such perturbations only at some chosen frequencies.
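The two perturbations above can be sketched with NumPy's real FFT. This is a minimal illustration, not the method of the cited work: the function name, the number of perturbed frequencies, and the noise scales `amp_sigma` and `phase_sigma` are assumptions chosen for the example.

```python
import numpy as np


def perturb_in_fourier_domain(series, n_freqs=3, amp_sigma=0.1,
                              phase_sigma=0.1, rng=None):
    """Perturb amplitude and phase at a few randomly chosen frequencies."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    spectrum = np.fft.rfft(series)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Skip the zero-frequency bin (the mean) and, for even lengths, the
    # Nyquist bin: both must stay real for a real-valued series.
    last = len(spectrum) - 1 if len(series) % 2 == 0 else len(spectrum)
    candidates = np.arange(1, last)
    chosen = rng.choice(candidates, size=min(n_freqs, len(candidates)),
                        replace=False)

    # Magnitude replacement: draw a new amplitude from a Gaussian centred
    # on the old amplitude (assumed relative width amp_sigma).
    amplitude[chosen] = np.abs(
        rng.normal(amplitude[chosen], amp_sigma * amplitude[chosen])
    )
    # Phase shift: add zero-mean Gaussian noise to the phase.
    phase[chosen] += rng.normal(0.0, phase_sigma, size=len(chosen))

    # Back to the time domain with the original length.
    return np.fft.irfft(amplitude * np.exp(1j * phase), n=len(series))
```

Because the zero-frequency bin is left untouched, the augmented series keeps the mean of the original one.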
Slicing, Permutation, and Bootstrapping¶
We can slice a series into small segments. With the slices, we can perform different operations to create new series.
 Window Slicing (WS): In a classification task, we can take the slices from the original series and assign the same class label to the slice ^{16}. The slices can also be interpolated to match the length of the original series ^{4}.
 Permutation: We take the slices and permute them to form a new series ^{17}.
 Moving Block Bootstrapping (MBB): First, we remove the trend and seasonality. Then we draw blocks of fixed length from the residual of the series until the desired length of the series is reached. Finally, we combine the newly formed residual with the trend and seasonality to form a new series ^{18}.
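The three operations above can be sketched in a few lines of NumPy. All function names and parameters here are hypothetical, and the bootstrap function assumes that the trend and seasonality have already been removed by a decomposition of your choice.

```python
import numpy as np


def window_slice(series, window, rng):
    """Return a random contiguous slice of length `window`."""
    start = rng.integers(0, len(series) - window + 1)
    return series[start:start + window]


def stretch_to_length(slice_, length):
    """Linearly interpolate a slice back to `length` points."""
    old_grid = np.linspace(0.0, 1.0, len(slice_))
    new_grid = np.linspace(0.0, 1.0, length)
    return np.interp(new_grid, old_grid, slice_)


def permute_segments(series, n_segments, rng):
    """Split into `n_segments` slices and concatenate them in random order."""
    segments = np.array_split(series, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])


def moving_block_bootstrap(residual, block_length, rng):
    """Draw overlapping blocks of the residual until the original length is reached."""
    n = len(residual)
    n_blocks = -(-n // block_length)  # ceiling division
    starts = rng.integers(0, n - block_length + 1, size=n_blocks)
    blocks = [residual[s:s + block_length] for s in starts]
    return np.concatenate(blocks)[:n]


# Usage, assuming `trend`, `seasonal` and `residual` come from a decomposition:
# new_series = trend + seasonal + moving_block_bootstrap(residual, 24, rng)
```

In a classification setting, each slice returned by `window_slice` would inherit the class label of the series it was taken from.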
Warping¶
Both the time scale and magnitude can be warped. For example,
 Time Warping: We distort time intervals by taking a range of data points and upsampling or downsampling it ^{6}.
 Magnitude Warping: We rescale the magnitude of the time series, for example by multiplying it with a smoothly varying curve.
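Both kinds of warping can be sketched with a random smooth curve. The version below is an illustration under simplifying assumptions: the curve is built by linear interpolation between a few random knots (the literature often uses cubic splines), and the names `n_knots` and `sigma` are hypothetical.

```python
import numpy as np


def smooth_curve(length, n_knots, sigma, rng):
    """Random curve around 1.0, linearly interpolated between a few knots."""
    knot_positions = np.linspace(0, length - 1, n_knots)
    knot_values = rng.normal(1.0, sigma, size=n_knots)
    return np.interp(np.arange(length), knot_positions, knot_values)


def magnitude_warp(series, n_knots=4, sigma=0.2, rng=None):
    """Multiply the series by a smooth random curve."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    return series * smooth_curve(len(series), n_knots, sigma, rng)


def time_warp(series, n_knots=4, sigma=0.2, rng=None):
    """Locally speed up or slow down the series, keeping its length."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Positive local speeds; their cumulative sum is a monotone warped clock.
    speed = np.abs(smooth_curve(n, n_knots, sigma, rng)) + 1e-6
    warped_time = np.cumsum(speed)
    # Rescale the warped clock so that it still spans [0, n - 1].
    warped_time = (warped_time - warped_time[0]) / (warped_time[-1] - warped_time[0]) * (n - 1)
    # Read the original series at the warped time stamps.
    return np.interp(warped_time, np.arange(n), series)
```

Regions where the local speed is above average are effectively downsampled, and regions where it is below average are upsampled.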
Dynamic Time Warping (DTW)
Given two sequences, \(S^{(1)}\) and \(S^{(2)}\), the Dynamic Time Warping (DTW) algorithm finds the best way to align the two sequences. During this alignment process, we quantify the misalignment using a distance similar to the Levenshtein distance, where the distance between two series \(S^{(1)}_{1:i}\) (with \(i\) elements) and \(S^{(2)}_{1:j}\) (with \(j\) elements) is^{7}
\[
D(S^{(1)}_{1:i}, S^{(2)}_{1:j}) = d(S^{(1)}_i, S^{(2)}_j) + \min\left[ D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j-1}),\; D(S^{(1)}_{1:i}, S^{(2)}_{1:j-1}),\; D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j}) \right],
\]
where \(D\) denotes the DTW distance, \(S^{(1)}_i\) is the \(i\)th element of the series \(S^{(1)}\), and \(d(x,y)\) is a predetermined distance, e.g., the Euclidean distance. This definition reveals the recursive nature of the DTW distance.
Notations in the Definition: \(S_{1:i}\) and \(S_{i}\)
The notation \(S_{1:i}\) stands for a series that contains the elements starting from the first to the \(i\)th in series \(S\). For example, we have a series
The notation \(S^1_{1:4}\) represents
The notation \(S_i\) indicates the \(i\)th element in \(S\). For example,
If we map these two notations to Python,
 \(S_{1:i}\) is equivalent to `S[0:i]`, and
 \(S_i\) is equivalent to `S[i-1]`.
Note that Python indices start at zero and slices exclude the end index, so they are shifted relative to our notation. This is why we use subscripts instead of square brackets in our definition.
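As a quick check of this mapping, the snippet below uses a hypothetical six-element series. The values are made up for illustration only.

```python
S = [3, 1, 4, 1, 5, 9]  # hypothetical series with six elements
i = 4

prefix = S[0:i]     # S_{1:4}: the first four elements
element = S[i - 1]  # S_4: the fourth element

print(prefix)   # [3, 1, 4, 1]
print(element)  # 1
```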
Levenshtein Distance
Consider two words, e.g., \(w^{a} = \mathrm{cats}\) and \(w^{b} = \mathrm{katz}\). Suppose we can only use three operations: insertions, deletions, and substitutions. The Levenshtein distance is the minimum number of such single-character edits needed to change the first word \(w^a\) into the second one \(w^b\). In this example, we need two substitutions, i.e., "c" -> "k" and "s" -> "z".
The Levenshtein distance can be computed using recursive algorithms ^{1}.
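One possible recursive formulation is sketched below. It is a textbook version with memoization, written here for illustration rather than taken from the cited repository.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""

    @lru_cache(maxsize=None)
    def distance(i: int, j: int) -> int:
        # Distance between the first i characters of a and the first j of b.
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            distance(i - 1, j) + 1,                      # deletion
            distance(i, j - 1) + 1,                      # insertion
            distance(i - 1, j - 1) + substitution_cost,  # substitution
        )

    return distance(len(a), len(b))


print(levenshtein("cats", "katz"))  # 2
```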
DTW is very useful when comparing series with different lengths. For example, most error metrics require the actual time series and the predicted series to have the same length. When the lengths differ, we can align the two series with DTW before calculating these metrics^{2}.
The forecasting package darts provides a demo of DTW.
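To make the recursive nature of the DTW distance concrete, here is a minimal pure-Python sketch that fills a table of prefix distances. The function name and the default local distance (absolute difference) are assumptions for this example. It is not the darts implementation.

```python
def dtw_distance(s1, s2, d=lambda x, y: abs(x - y)):
    """DTW distance between two sequences, using local distance `d`."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    # cost[i][j] holds the DTW distance between s1[0:i] and s2[0:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = d(s1[i - 1], s2[j - 1]) + min(
                cost[i - 1][j - 1],  # advance in both series
                cost[i][j - 1],      # advance in s2 only
                cost[i - 1][j],      # advance in s1 only
            )
    return cost[n][m]


# The two series have different lengths but align perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The table has \((n+1)(m+1)\) entries, so the cost grows with the product of the two series lengths.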
DTW Barycenter Averaging
DTW Barycenter Averaging (DBA) constructs a series \(\bar{\mathcal S}\) out of a set of series \(\{\mathcal S^{(\alpha)}\}\) so that \(\bar{\mathcal S}\) is the barycenter of \(\{\mathcal S^{(\alpha)}\}\) measured by Dynamic Time Warping (DTW) distance ^{7}.
Barycenter Averaging Based on DTW Distance¶
Petitjean et al. proposed a time series averaging algorithm based on the DTW distance, dubbed DTW Barycenter Averaging (DBA) ^{7}.
DBA Implementation
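A minimal sketch of the DBA iteration is given below. It assumes a squared difference as the local distance, initializes the average with the first series, and runs a fixed number of iterations. These choices and all names are illustrative assumptions, not a reference implementation.

```python
def dtw_path(s1, s2):
    """Optimal DTW alignment between s1 and s2 as a list of index pairs."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = (s1[i - 1] - s2[j - 1]) ** 2 + min(
                cost[i - 1][j - 1], cost[i][j - 1], cost[i - 1][j]
            )
    # Backtrack from the last cell to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]


def dba(series_list, n_iterations=10):
    """Approximate DTW barycenter of a list of numeric sequences."""
    # Assumption: start from the first series; other initializations work too.
    average = list(series_list[0])
    for _ in range(n_iterations):
        # associated[k] collects every point aligned to the k-th average point.
        associated = [[] for _ in average]
        for series in series_list:
            for k, j in dtw_path(average, series):
                associated[k].append(series[j])
        # Update each average point as the mean of its associated points.
        average = [sum(points) / len(points) for points in associated]
    return average
```

The resulting average has the same length as the series used for initialization, even when the input series have different lengths.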
Series Mixing¶
Another class of data augmentation methods is mixing the series. For example, we take two randomly drawn series and average them using DTW Barycenter Averaging (DBA) ^{7}. (DTW, dynamic time warping, is an algorithm to calculate the distance between sequential datasets by matching the data points on each of the series ^{7}^{19}.) To augment a dataset, we can choose from a list of strategies ^{20}^{21}:
 Average All: average all series using different sets of weights to create new synthetic series.
 Average Selected: average a subset of series selected by some strategy. For example, Forestier et al. proposed choosing an initial series and combining it with its nearest neighbors ^{21}.
 Average Selected with Distance: the same as Average Selected, but neighbors that are far from the initial series are downweighted ^{21}.
Some other similar methods are
 Equalized Mixture Data Augmentation (EMDA) calculates the weighted average of spectrograms of the same class label^{12}.
 Stochastic Feature Mapping (SFM) is a data augmentation method for audio data^{13}.
Data Generating Process¶
Time series data can also be augmented using an assumed data generating process (DGP). Some methods, such as GRATIS ^{8}, use simple generic models such as AR/MAR. Other methods, such as Gaussian Trees ^{22}, use more complicated hidden structures based on graphs, which can approximate more complicated data generating processes. These methods do not necessarily reflect the actual data generating process; instead, the data is generated using parsimonious phenomenological models. Some other methods are tuned more closely to detailed mechanisms. There are also methods that use generative deep neural networks such as GANs.
Dynamic Factor Model (DFM)¶
For example, we have a series \(X(t)\) which depends on a latent variable \(f(t)\)^{9},
where \(f(t)\) is determined by a differential equation
In the above equations, \(\eta(t)\) and \(\xi(t)\) are the irreducible noise.
The above two equations can be combined into one first-order differential equation.
Once the model is fit, it can be used to generate new data points. However, we first need to check whether the data is plausibly generated by such a process.
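To illustrate how such a model generates new data, the sketch below simulates a one-factor model. It assumes the simple linear forms \(X(t) = a f(t) + \eta(t)\) and \(\mathrm{d}f/\mathrm{d}t = -b f(t) + \xi(t)\), discretized with an Euler-Maruyama step. These forms, the parameter names, and the default values are assumptions for illustration; in practice the parameters would come from the fitted model.

```python
import numpy as np


def simulate_dfm(n_steps, loading=1.0, decay=0.5, dt=0.1,
                 factor_noise=0.3, obs_noise=0.1, rng=None):
    """Simulate an observed series X driven by one latent factor f."""
    rng = np.random.default_rng() if rng is None else rng
    f = np.zeros(n_steps)
    for t in range(1, n_steps):
        # Euler-Maruyama step of df/dt = -decay * f + xi(t).
        f[t] = (f[t - 1]
                - decay * f[t - 1] * dt
                + factor_noise * np.sqrt(dt) * rng.normal())
    # Observation equation: X(t) = loading * f(t) + eta(t).
    x = loading * f + obs_noise * rng.normal(size=n_steps)
    return x, f


# Each call with a fresh random state yields a new synthetic series.
x_new, f_new = simulate_dfm(200, rng=np.random.default_rng(0))
```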
Applying the Synthetic Data to Model Training¶
Once we have prepared the synthetic dataset, there are two strategies for including it in our model training ^{20}.
| Strategy | Description |
|---|---|
| Pooled Strategy | Synthetic data + original data -> model |
| Transfer Strategy | Synthetic data -> pretrained model; pretrained model + original data -> model |
The pooled strategy combines the synthetic data with the original data and feeds them together into the training pipeline. The transfer strategy uses the synthetic data to pretrain the model, then uses transfer learning methods (e.g., freezing the weights of some layers) to train the model on the original data.
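The difference between the two strategies can be sketched with a toy linear model trained by gradient descent. Everything here is an assumption for illustration: the data is simulated, the "synthetic" set is a jittered copy of the original, and, since a linear model has no layers to freeze, the transfer step is a short fine-tuning run with a smaller learning rate.

```python
import numpy as np


def train(X, y, weights=None, lr=0.1, n_steps=200):
    """Fit y ~ X @ w by gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1]) if weights is None else weights.copy()
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
X_orig = rng.normal(size=(20, 2))
y_orig = X_orig @ true_w + 0.1 * rng.normal(size=20)

# Stand-in for augmented data: jittered copies of the original inputs.
X_syn = np.tile(X_orig, (5, 1)) + 0.05 * rng.normal(size=(100, 2))
y_syn = np.tile(y_orig, 5)

# Pooled strategy: train once on synthetic and original data together.
w_pooled = train(np.vstack([X_syn, X_orig]), np.concatenate([y_syn, y_orig]))

# Transfer strategy: pretrain on synthetic data, then fine-tune on the original.
w_pretrained = train(X_syn, y_syn)
w_transfer = train(X_orig, y_orig, weights=w_pretrained, lr=0.01, n_steps=50)
```

With a deep network, the fine-tuning step would typically freeze some of the pretrained layers instead of only lowering the learning rate.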

1. trekhleb. javascript-algorithms/src/algorithms/string/levenshtein-distance at master · trekhleb/javascript-algorithms. In: GitHub [Internet]. [cited 27 Jul 2022]. Available: https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/string/levenshtein-distance
2. Unit8. Metrics - darts documentation. In: Darts [Internet]. [cited 7 Mar 2023]. Available: https://unit8co.github.io/darts/generated_api/darts.metrics.metrics.html?highlight=dtw#darts.metrics.metrics.dtw_metric
3. Hasibi R, Shokri M, Dehghan M. Augmentation scheme for dealing with imbalanced network traffic classification using deep learning. 2019. http://arxiv.org/abs/1901.00204
4. Iwana BK, Uchida S. An empirical survey of data augmentation for time series classification with neural networks. 2020. http://arxiv.org/abs/2007.15951
5. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data 2019; 6: 1–48.
6. Wen Q, Sun L, Yang F, Song X, Gao J, Wang X et al. Time series data augmentation for deep learning: A survey. 2020. http://arxiv.org/abs/2002.12478
7. Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition 2011; 44: 678–693.
8. Kang Y, Hyndman RJ, Li F. GRATIS: GeneRAting TIme series with diverse and controllable characteristics. 2019. http://arxiv.org/abs/1903.02787
9. Stock JH, Watson MW. Chapter 8 - Dynamic factor models, factor-augmented vector autoregressions, and structural vector autoregressions in macroeconomics. In: Taylor JB, Uhlig H (eds). Handbook of macroeconomics. Elsevier, 2016, pp 415–525.
10. Yoon J, Jarrett D, Schaar M van der. Time-series generative adversarial networks. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds). Advances in neural information processing systems. Curran Associates, Inc., 2019. https://papers.nips.cc/paper/2019/hash/c9efe5f26cd17ba6216bbe2a7d26d490-Abstract.html
11. Brophy E, Wang Z, She Q, Ward T. Generative adversarial networks in time series: A survey and taxonomy. 2021. http://arxiv.org/abs/2107.11098
12. Takahashi N, Gygli M, Van Gool L. AENet: Learning deep audio features for video analysis. 2017. http://arxiv.org/abs/1701.00599
13. Cui X, Goel V, Kingsbury B. Data augmentation for deep neural network acoustic modeling. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). 2014, pp 5582–5586.
14. Iwana BK, Uchida S. Time series data augmentation for neural networks by time warping with a discriminative teacher. 2020. http://arxiv.org/abs/2004.08780
15. Gao J, Song X, Wen Q, Wang P, Sun L, Xu H. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. 2020. http://arxiv.org/abs/2002.09545
16. Le Guennec A, Malinowski S, Tavenard R. Data augmentation for time series classification using convolutional neural networks. In: ECML/PKDD workshop on advanced analytics and learning on temporal data. 2016. https://halshs.archives-ouvertes.fr/halshs-01357973/document
17. Um TT, Pfister FMJ, Pichler D, Endo S, Lang M, Hirche S et al. Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. 2017. http://arxiv.org/abs/1706.00527
18. Bergmeir C, Hyndman RJ, Benítez JM. Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International Journal of Forecasting 2016; 32: 303–312.
19. Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: Current status and future directions. 2019. http://arxiv.org/abs/1909.00590
20. Bandara K, Hewamalage H, Liu YH, Kang Y, Bergmeir C. Improving the accuracy of global forecasting models using time series data augmentation. 2020. http://arxiv.org/abs/2008.02663
21. Forestier G, Petitjean F, Dau HA, Webb GI, Keogh E. Generating synthetic time series to augment sparse datasets. In: 2017 IEEE international conference on data mining (ICDM). 2017, pp 865–870.
22. Cao H, Tan VYF, Pang JZF. A parsimonious mixture of Gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE Transactions on Neural Networks and Learning Systems 2014; 25: 2226–2239.