# Data Augmentation for Time Series

In deep learning, our dataset should help the optimization mechanism locate a good spot in the parameter space. However, real-world data is not necessarily diverse enough to cover the required situations with enough records. For example, some datasets have extremely imbalanced class labels, which leads to poor performance in classification tasks [3]. Another problem with a limited dataset is that the trained model may not generalize well [4, 5].

We will cover two topics in this section: augmenting the dataset and applying the augmented data to model training.

## Augmenting the Dataset

There are many different ways of augmenting time series data [4, 6]. We categorize the methods into the following groups:

• Random transformations, e.g., jittering;
• Pattern mixing, e.g., DBA [7];
• Generative models, e.g.,
    • phenomenological generative models such as AR [8],
    • first-principles models such as economic models [9],
    • deep generative models such as TimeGAN or TS GAN [10, 11].

We refer to the first two groups, random transformations and pattern mixing, as basic methods.

### Basic Methods

In the following table, we group some of the data augmentation methods along two dimensions: the category of the method and the domain in which the method is applied.

| | Projected Domain | Time Scale | Magnitude |
|---|---|---|---|
| Random Transformation | Frequency Masking, Frequency Warping, Fourier Transform, STFT | Permutation, Slicing, Time Warping, Time Masking, Cropping | Jittering, Flipping, Scaling, Magnitude Warping |
| Pattern Mixing | EMDA [12], SFM [13] | Guided Warping [14] | DFM [9], Interpolation, DBA [7] |

For completeness, we explain some of these methods in more detail below.

#### Perturbation in Fourier Domain

In the Fourier domain, for the amplitude $$A_f$$ and phase $$\phi_f$$ at a specific frequency $$f$$, we can perform [15]

• magnitude replacement using a Gaussian distribution, and
• phase shift by adding Gaussian noise.

We perform such perturbations only at some chosen frequencies.
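The two perturbations above can be sketched with NumPy's real FFT. This is a minimal illustration, not the implementation from [15]: the function name `perturb_fourier`, its parameters, and the choice to draw the new amplitude from a Gaussian centered on the original amplitude are all assumptions made for this example.

```python
import numpy as np


def perturb_fourier(series, freq_idx, amp_sigma=0.1, phase_sigma=0.1, rng=None):
    """Perturb amplitude and phase at the chosen frequency bins (hypothetical helper).

    Assumptions of this sketch:
    - the new amplitude is drawn from N(A_f, (amp_sigma * A_f)^2),
    - the phase shift is drawn from N(0, phase_sigma^2),
    - freq_idx should avoid bin 0 (the mean of the series).
    """
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    idx = np.asarray(freq_idx)

    spectrum = np.fft.rfft(series)
    amp = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Magnitude replacement using a Gaussian distribution.
    amp[idx] = np.abs(rng.normal(loc=amp[idx], scale=amp_sigma * amp[idx]))
    # Phase shift by adding Gaussian noise.
    phase[idx] = phase[idx] + rng.normal(0.0, phase_sigma, size=idx.shape)

    # Transform back to the time domain with the original length.
    return np.fft.irfft(amp * np.exp(1j * phase), n=len(series))
```

For a series `x`, a call such as `perturb_fourier(x, freq_idx=[3, 5])` returns a series of the same length in which only the selected frequency components are changed.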

#### Slicing, Permutation, and Bootstrapping

We can slice a series into small segments. With the slices, we can perform different operations to create new series.

• Window Slicing (WS): In a classification task, we can take slices from the original series and assign the class label of the original series to each slice [16]. The slices can also be interpolated to match the length of the original series [4].
• Permutation: We take the slices and permute them to form a new series [17].
• Moving Block Bootstrapping (MBB): First, we remove the trend and seasonality. Then we draw blocks of fixed length from the residual of the series until the desired length of the series is met. Finally, we combine the newly formed residual with the trend and seasonality to form a new series [18].
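The three operations above can be sketched in a few lines of NumPy. All function names here are hypothetical. For MBB, the sketch only covers the block-drawing step on a residual that is assumed to be already available, e.g., from an STL decomposition done elsewhere.

```python
import numpy as np


def window_slice(series, window, rng):
    """Return a random contiguous slice of length `window`."""
    start = rng.integers(0, len(series) - window + 1)
    return series[start:start + window]


def stretch_to_length(segment, length):
    """Linearly interpolate a slice so that it has `length` points."""
    old_grid = np.linspace(0.0, 1.0, num=len(segment))
    new_grid = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_grid, old_grid, segment)


def permute_segments(series, n_segments, rng):
    """Split the series into segments and concatenate them in random order."""
    segments = np.array_split(series, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])


def moving_block_bootstrap(residual, block_length, rng):
    """Draw overlapping blocks from the residual until the original length is met."""
    n = len(residual)
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n - block_length + 1, size=n_blocks)
    blocks = [residual[s:s + block_length] for s in starts]
    return np.concatenate(blocks)[:n]
```

In a full MBB pipeline, the bootstrapped residual would be added back to the trend and seasonal components, e.g., `new_series = trend + seasonal + moving_block_bootstrap(residual, 24, rng)`, where `trend`, `seasonal`, and `residual` are assumed outputs of a decomposition step.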

#### Warping

Both the time scale and magnitude can be warped. For example,

• Time Warping: We distort time intervals by taking a range of data points and upsampling or downsampling it [6].
• Magnitude Warping: The magnitude of the time series is rescaled, e.g., by multiplying the series with a smooth curve that varies around one.
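Both kinds of warping can be sketched with linear interpolation. The function names and parameters below are hypothetical, and the piecewise-linear warping curve is a simplifying assumption of this example (smoother curves such as splines are a common alternative).

```python
import numpy as np


def _resample(values, length):
    """Linearly resample `values` to `length` points."""
    old_grid = np.linspace(0.0, 1.0, num=len(values))
    new_grid = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_grid, old_grid, values)


def time_warp_window(series, start, stop, scale):
    """Upsample (scale > 1) or downsample (scale < 1) series[start:stop].

    The result is resampled back to the original length so that it can be
    used as a drop-in replacement for the original series.
    """
    series = np.asarray(series, dtype=float)
    segment = series[start:stop]
    new_len = max(2, int(round(scale * len(segment))))
    warped = np.concatenate(
        [series[:start], _resample(segment, new_len), series[stop:]]
    )
    return _resample(warped, len(series))


def magnitude_warp(series, sigma=0.2, n_knots=4, rng=None):
    """Multiply the series by a random curve that varies around one."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    knot_values = rng.normal(loc=1.0, scale=sigma, size=n_knots)
    knot_positions = np.linspace(0, len(series) - 1, num=n_knots)
    curve = np.interp(np.arange(len(series)), knot_positions, knot_values)
    return series * curve


# Example usage on a toy series.
x = np.sin(np.linspace(0, 6, 50))
x_time_warped = time_warp_window(x, start=10, stop=30, scale=2.0)
x_magnitude_warped = magnitude_warp(x, sigma=0.2, rng=np.random.default_rng(0))
```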
Dynamic Time Warping (DTW)

Given two sequences, $$S^{(1)}$$ and $$S^{(2)}$$, the Dynamic Time Warping (DTW) algorithm finds the best way to align the two sequences. During this alignment process, we quantify the misalignment using a distance similar to the Levenshtein distance, where the distance between two series $$S^{(1)}_{1:i}$$ (with $$i$$ elements) and $$S^{(2)}_{1:j}$$ (with $$j$$ elements) is [7]

\begin{align} D(S^{(1)}_{1:i}, S^{(2)}_{1:j}) =& d(S^{(1)}_i, S^{(2)}_j)\\ & + \operatorname{min}\left[ D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j-1}), D(S^{(1)}_{1:i}, S^{(2)}_{1:j-1}), D(S^{(1)}_{1:i-1}, S^{(2)}_{1:j}) \right], \end{align}

where $$S^{(1)}_i$$ is the $$i$$th element of the series $$S^{(1)}$$ and $$d(x,y)$$ is a predetermined distance, e.g., the Euclidean distance. This definition reveals the recursive nature of the DTW distance.
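The recursion can be evaluated bottom-up with a table of partial distances. The sketch below is a minimal illustration: the function name `dtw_distance` is hypothetical, and the boundary conditions (zero distance between two empty prefixes, infinite distance between an empty and a non-empty prefix) are an assumption that the definition above leaves implicit.

```python
import numpy as np


def dtw_distance(s1, s2, d=lambda x, y: abs(x - y)):
    """DTW distance between two 1D sequences using the recursion above.

    D[i, j] holds the distance between the prefixes S1_{1:i} and S2_{1:j}.
    """
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # assumed boundary condition: two empty prefixes match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(s1[i - 1], s2[j - 1])  # d(S1_i, S2_j) in 1-based notation
            D[i, j] = cost + min(D[i - 1, j - 1], D[i, j - 1], D[i - 1, j])
    return D[n, m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated `2` in the second sequence can be aligned to the single `2` in the first one, even though the two sequences have different lengths.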

Notations in the Definition: $$S_{1:i}$$ and $$S_{i}$$

The notation $$S_{1:i}$$ stands for a series that contains the elements from the first to the $$i$$th of the series $$S$$. For example, we have a series

$S^{(1)} = [s^{(1)}_1, s^{(1)}_2, s^{(1)}_3, s^{(1)}_4, s^{(1)}_5, s^{(1)}_6].$

The notation $$S^{(1)}_{1:4}$$ represents

$S^{(1)}_{1:4} = [s^{(1)}_1, s^{(1)}_2, s^{(1)}_3, s^{(1)}_4].$

The notation $$S_i$$ indicates the $$i$$th element in $$S$$. For example,

$S^{(1)}_4 = s^{(1)}_4.$

If we map these two notations to Python,

• $$S_{1:i}$$ is equivalent to `S[0:i]`, and
• $$S_i$$ is equivalent to `S[i-1]`.

Note that Python uses zero-based indexing and slices that exclude the end index, so the indices are shifted by one. This is also the reason we use subscripts rather than square brackets in our definition.

Levenshtein Distance

Given two words, e.g., $$w^{a} = \mathrm{cats}$$ and $$w^{b} = \mathrm{katz}$$, suppose we can only use three single-character operations: insertion, deletion, and substitution. The Levenshtein distance is the minimum number of such operations needed to change the first word $$w^a$$ into the second word $$w^b$$. In this example, we need two substitutions, i.e., "c" -> "k" and "s" -> "z", so the distance is 2.

The Levenshtein distance can be computed using recursive algorithms [1].
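As an illustration, the recursion can be unrolled into a row-by-row table. The sketch below is one common formulation, not the implementation referenced above, and the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    # prev[j] is the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + substitution_cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("cats", "katz"))  # prints 2
```

The structure mirrors the DTW recursion: each cell depends on its left, upper, and upper-left neighbors.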

DTW is very useful when comparing series of different lengths. For example, most error metrics require the actual time series and the predicted series to have the same length. When the lengths differ, we can use DTW to align the series when calculating these metrics [2].

The forecasting package darts provides a demo of DTW.

DTW Barycenter Averaging

DTW Barycenter Averaging (DBA) constructs a series $$\bar{\mathcal S}$$ out of a set of series $$\{\mathcal S^{(\alpha)}\}$$ so that $$\bar{\mathcal S}$$ is the barycenter of $$\{\mathcal S^{(\alpha)}\}$$ as measured by the Dynamic Time Warping (DTW) distance [7].

## Barycenter Averaging Based on DTW Distance

Petitjean et al. proposed a time series averaging algorithm based on the DTW distance, dubbed DTW Barycenter Averaging (DBA) [7].
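To make the idea concrete, the following is a simplified sketch of the iterative averaging scheme, not a faithful reproduction of the algorithm in [7]. The function names are hypothetical, and several choices are assumptions of this example: the squared difference as the pointwise cost, the first series as the initial average, and a fixed number of iterations.

```python
import numpy as np


def dtw_path(a, b):
    """Return the DTW alignment path between a and b as (i, j) index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the end of both series to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def dba(series_list, n_iter=10):
    """Iteratively refine an average series under DTW alignment."""
    average = np.array(series_list[0], dtype=float)  # assumed initialization
    for _ in range(n_iter):
        # Collect, for every point of the average, the points aligned to it.
        aligned = [[] for _ in range(len(average))]
        for series in series_list:
            for i, j in dtw_path(average, series):
                aligned[i].append(series[j])
        # Each point of the new average is the mean of its aligned points.
        average = np.array([np.mean(values) for values in aligned])
    return average
```

The series in `series_list` may have different lengths; in this sketch the result has the length of the first series.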

### Series Mixing

Another class of data augmentation methods mixes the series. For example, we take two randomly drawn series and average them using DTW Barycenter Averaging (DBA) [7]. (DTW, dynamic time warping, is an algorithm that calculates the distance between two sequences by matching the data points of each series [7, 19].) To augment a dataset, we can choose from a list of strategies [20, 21]:

• Average All: average all series using different sets of weights to create new synthetic series.
• Average Selected: average a subset of series selected by some strategy. For example, Forestier et al. proposed choosing an initial series and combining it with its nearest neighbors [21].
• Average Selected with Distance: the same as Average Selected, but neighbors that are far from the initial series are down-weighted [21].

Some other similar methods are

• Equalized Mixture Data Augmentation (EMDA) calculates the weighted average of spectrograms that share the same class label [12].
• Stochastic Feature Mapping (SFM) is a data augmentation method for audio data [13].

### Data Generating Process

Time series data can also be augmented using an assumed data generating process (DGP). Some methods, such as GRATIS [8], utilize simple generic models such as AR/MAR. Other methods, such as Gaussian Trees [22], utilize more complicated hidden structures based on graphs, which can approximate more complicated data generating processes. These methods do not necessarily reflect the actual data generating process; instead, the data is generated using parsimonious phenomenological models. Some other methods are tuned more toward detailed mechanisms. There are also methods using generative deep neural networks such as GANs.
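As a small illustration of a phenomenological DGP, the sketch below fits a plain AR($$p$$) model by least squares and simulates new series from it. This is not the GRATIS method; the function names are hypothetical, and the model assumes a zero-mean series without an intercept.

```python
import numpy as np


def fit_ar(series, p):
    """Estimate AR(p) coefficients and the noise scale by least squares."""
    x = np.asarray(series, dtype=float)
    # Column k holds the lag-(k + 1) values x[t - k - 1] for t = p, ..., len(x) - 1.
    lags = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    target = x[p:]
    coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
    residual = target - lags @ coef
    return coef, residual.std()


def simulate_ar(coef, sigma, n, init, rng):
    """Generate n new points from an AR(p) model, starting from the values in init."""
    p = len(coef)
    values = list(init[-p:])
    for _ in range(n):
        recent = values[-1:-p - 1:-1]  # most recent value first
        values.append(float(np.dot(coef, recent)) + rng.normal(0.0, sigma))
    return np.array(values[p:])
```

A synthetic series could then be generated with `coef, sigma = fit_ar(x, p=2)` followed by `simulate_ar(coef, sigma, n=len(x), init=x[:2], rng=np.random.default_rng(0))`, where `x` is an observed series.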

#### Dynamic Factor Model (DFM)

For example, we have a series $$X(t)$$ which depends on a latent variable $$f(t)$$ [9],

$X(t) = \mathbf A f(t) + \eta(t),$

where $$f(t)$$ is determined by a differential equation

$\frac{d f(t)}{dt} = \mathbf B f(t) + \xi(t).$

In the above equations, $$\eta(t)$$ and $$\xi(t)$$ are irreducible noise terms.

The above two equations can be combined into one first-order differential equation.

Once the model is fit, it can be used to generate new data points. However, we first have to verify that the data is plausibly generated by such a process.
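To illustrate how new data points could be generated once $$\mathbf A$$ and $$\mathbf B$$ are known, the sketch below integrates the latent equation with a simple Euler scheme. The function name, the step size `dt`, and the treatment of both noise terms as independent Gaussian noise are assumptions of this example; fitting the model is not covered.

```python
import numpy as np


def simulate_dfm(A, B, n_steps, dt=0.01, eta_sigma=0.1, xi_sigma=0.1, f0=None, rng=None):
    """Simulate X(t) = A f(t) + eta(t) with df/dt = B f(t) + xi(t).

    Uses an Euler step for the latent factor f; both noise terms are
    assumed to be Gaussian in this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    n_obs, n_factors = A.shape
    f = np.zeros(n_factors) if f0 is None else np.asarray(f0, dtype=float)

    X = np.empty((n_steps, n_obs))
    for t in range(n_steps):
        # Observation equation.
        X[t] = A @ f + rng.normal(0.0, eta_sigma, size=n_obs)
        # Latent dynamics, discretized with step size dt.
        f = f + dt * (B @ f) + np.sqrt(dt) * rng.normal(0.0, xi_sigma, size=n_factors)
    return X
```

Each call with a different random generator state yields a different synthetic multivariate series that shares the same latent dynamics.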

## Applying the Synthetic Data to Model Training

Once we have prepared the synthetic dataset, there are two strategies to include it in our model training [20].

| Strategy | Description |
|---|---|
| Pooled Strategy | Synthetic data + original data -> model |
| Transfer Strategy | Synthetic data -> pre-trained model; pre-trained model + original data -> model |

The pooled strategy takes the synthetic data and the original data and feeds them together into the training pipeline. The transfer strategy uses the synthetic data to pre-train the model, then uses transfer learning methods (e.g., freezing the weights of some layers) to train the model on the original data.
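The difference between the two strategies can be sketched with a toy linear model trained by gradient descent. Everything here is illustrative: the function names are hypothetical, the linear model stands in for a neural network, and the transfer step simply continues training from the pre-trained weights rather than freezing layers.

```python
import numpy as np


def train_linear(X, y, w=None, lr=0.1, n_steps=200):
    """Fit a linear model with gradient descent, optionally from initial weights."""
    w = np.zeros(X.shape[1]) if w is None else np.array(w, dtype=float)
    for _ in range(n_steps):
        gradient = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * gradient
    return w


def pooled_strategy(X_syn, y_syn, X_orig, y_orig):
    """Train one model on synthetic and original data together."""
    X = np.vstack([X_syn, X_orig])
    y = np.concatenate([y_syn, y_orig])
    return train_linear(X, y)


def transfer_strategy(X_syn, y_syn, X_orig, y_orig, fine_tune_steps=50):
    """Pre-train on synthetic data, then continue training on original data."""
    w_pretrained = train_linear(X_syn, y_syn)
    return train_linear(X_orig, y_orig, w=w_pretrained, n_steps=fine_tune_steps)
```

With a deep learning framework, the same structure applies: the pooled strategy changes only the training set, while the transfer strategy changes the initialization of the second training run.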

1. trekhleb. javascript-algorithms/src/algorithms/string/levenshtein-distance at master · trekhleb/javascript-algorithms. In: GitHub [Internet]. [cited 27 Jul 2022]. Available: https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/string/levenshtein-distance

2. Unit8. Metrics — darts  documentation. In: Darts [Internet]. [cited 7 Mar 2023]. Available: https://unit8co.github.io/darts/generated_api/darts.metrics.metrics.html?highlight=dtw#darts.metrics.metrics.dtw_metric

3. Hasibi R, Shokri M, Dehghan M. Augmentation scheme for dealing with imbalanced network traffic classification using deep learning. 2019. http://arxiv.org/abs/1901.00204

4. Iwana BK, Uchida S. An empirical survey of data augmentation for time series classification with neural networks. 2020. http://arxiv.org/abs/2007.15951

5. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data 2019; 6: 1–48.

6. Wen Q, Sun L, Yang F, Song X, Gao J, Wang X et al. Time series data augmentation for deep learning: A survey. 2020. http://arxiv.org/abs/2002.12478

7. Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition 2011; 44: 678–693.

8. Kang Y, Hyndman RJ, Li F. GRATIS: GeneRAting TIme series with diverse and controllable characteristics. 2019. http://arxiv.org/abs/1903.02787

9. Stock JH, Watson MW. Chapter 8 - dynamic factor models, factor-augmented vector autoregressions, and structural vector autoregressions in macroeconomics. In: Taylor JB, Uhlig H (eds). Handbook of macroeconomics. Elsevier, 2016, pp 415–525.

10. Yoon J, Jarrett D, Schaar M van der. Time-series generative adversarial networks. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds). Advances in neural information processing systems. Curran Associates, Inc., 2019. https://papers.nips.cc/paper/2019/hash/c9efe5f26cd17ba6216bbe2a7d26d490-Abstract.html

11. Brophy E, Wang Z, She Q, Ward T. Generative adversarial networks in time series: A survey and taxonomy. 2021. http://arxiv.org/abs/2107.11098

12. Takahashi N, Gygli M, Van Gool L. AENet: Learning deep audio features for video analysis. 2017. http://arxiv.org/abs/1701.00599

13. Cui X, Goel V, Kingsbury B. Data augmentation for deep neural network acoustic modeling. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). 2014, pp 5582–5586.

14. Iwana BK, Uchida S. Time series data augmentation for neural networks by time warping with a discriminative teacher. 2020. http://arxiv.org/abs/2004.08780

15. Gao J, Song X, Wen Q, Wang P, Sun L, Xu H. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. 2020. http://arxiv.org/abs/2002.09545

16. Le Guennec A, Malinowski S, Tavenard R. Data augmentation for time series classification using convolutional neural networks. In: ECML/PKDD workshop on advanced analytics and learning on temporal data. 2016. https://halshs.archives-ouvertes.fr/halshs-01357973/document

17. Um TT, Pfister FMJ, Pichler D, Endo S, Lang M, Hirche S et al. Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. 2017. http://arxiv.org/abs/1706.00527

18. Bergmeir C, Hyndman RJ, Benítez JM. Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International journal of forecasting 2016; 32: 303–312.

19. Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: Current status and future directions. 2019. http://arxiv.org/abs/1909.00590

20. Bandara K, Hewamalage H, Liu Y-H, Kang Y, Bergmeir C. Improving the accuracy of global forecasting models using time series data augmentation. 2020. http://arxiv.org/abs/2008.02663

21. Forestier G, Petitjean F, Dau HA, Webb GI, Keogh E. Generating synthetic time series to augment sparse datasets. In: 2017 IEEE international conference on data mining (ICDM). 2017, pp 865–870.

22. Cao H, Tan VYF, Pang JZF. A parsimonious mixture of gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE transactions on neural networks and learning systems 2014; 25: 2226–2239.

Contributors: LM