

## Time series articles from across Nature Portfolio

A time series is a sequence of measurements performed at successive points in time. Time series can be used to derive models to predict future values based on previously observed values.

## Latest Research and Reviews

## Time-integrated BMP signaling determines fate in a stem cell model for early human development

The interpretation of the key developmental signal BMP remains poorly understood. Here, the authors show that the total time-integrated signaling controls differentiation in a stem cell embryo model and provide a possible mechanism.

- Seth Teague
- Gillian Primavera
- Idse Heemskerk

## Quantifying changes in individual-specific template-based representations of center-of-mass dynamics during walking with ankle exoskeletons using Hybrid-SINDy

- Michael C. Rosenberg
- Joshua L. Proctor
- Katherine M. Steele

## Forecasting local hospital bed demand for COVID-19 using on-request simulations

- Raisa Kociurzynski
- Angelo D’Ambrosio
- Tjibbe Donker

## The impact of healthy pregnancy on features of heart rate variability and pulse wave morphology derived from wrist-worn photoplethysmography

- M. J. Almario Escorcia

## Time-resolved proteomic profiling reveals compositional and functional transitions across the stress granule life cycle

Stress granules (SGs) are dynamic compartments with a poorly characterized transition in composition and function during prolonged stress. In this study, the authors investigated the dynamic changes in SG constituents during acute to prolonged heat shock using time-resolved proteomic profiling.

- Yufeng Zhang

## A semiconductor 96-microplate platform for electrical-imaging based high-throughput phenotypic screening

Cell-based phenotypic assays link in vitro discovery to disease pathology. Here, the authors report a semiconductor-based microplate platform to perform high-throughput, high-dimensional “electrical imaging” for label-free assessment of live cell morphology and function.

- Shalaka Chitale
- Jeffrey Abbott


## Computer Science > Machine Learning

Title: Foundation Models for Time Series Analysis: A Tutorial and Survey

Abstract: Time series analysis stands as a focal point within the data mining community, serving as a cornerstone for extracting valuable insights crucial to a myriad of real-world applications. Recent advancements in Foundation Models (FMs) have fundamentally reshaped the paradigm of model design for time series analysis, boosting various downstream tasks in practice. These innovative approaches often leverage pre-trained or fine-tuned FMs to harness generalized knowledge tailored specifically for time series analysis. In this survey, we aim to furnish a comprehensive and up-to-date overview of FMs for time series analysis. While prior surveys have predominantly focused on either the application or the pipeline aspects of FMs in time series analysis, they have often lacked an in-depth understanding of the underlying mechanisms that elucidate why and how FMs benefit time series analysis. To address this gap, our survey adopts a model-centric classification, delineating various pivotal elements of time-series FMs, including model architectures, pre-training techniques, adaptation methods, and data modalities. Overall, this survey serves to consolidate the latest advancements in FMs pertinent to time series analysis, accentuating their theoretical underpinnings, recent strides in development, and avenues for future research exploration.


## Student Notebook

Time-series methods in experimental research.

- Behavioral Science
- Experimental Psychology
- Methodology
- Quantitative
- Statistical Analysis

For many experimental psychologists, the go-to methodological designs are cross-sectional. Cross-sectional studies involve measuring the relationship between some variable(s) of interest at one point in time; common examples include single-session lab studies and online surveys (e.g., via MTurk). These designs can be useful for isolating relationships between variables, establishing conditions of convergent and discriminant validity, and utilizing samples that are statistically representative of larger populations. Nevertheless, quantitative researchers have noted that attempts to measure and analyze interindividual variation are incomplete without an accompanying account of the underlying temporal dynamics that define these processes (e.g., Molenaar, 2008; Molenaar, Huizenga, & Nesselroade, 2003). This claim follows from the idea that cross-sectional designs, while potentially well-suited for large samples, are often underpowered, overgeneralized, and poorly matched to the statistical assumptions of general linear methods. For these reasons, psychological scientists should consider supplementing their methodological toolkits with time-series techniques to explicitly investigate the time-dependent variation that can be observed within individual subjects.

The purpose of this article is to briefly discuss the importance of time-series methods in experimental research and to acquaint the reader with some statistical techniques that are easily accessible and can be employed when testing hypotheses with time-series data.

## Measuring Behavior as a Time Series

According to Daniel T. Kaplan and Leon Glass (1995), there are two critical features of a time series that differentiate it from cross-sectional data-collection procedures:

- Repeated measurements of a given behavior are taken across time at equally spaced intervals. Taking multiple measurements is essential for understanding how any given behavior unfolds over time, and doing so at equal intervals affords a clear investigation of how the dynamics of that behavior manifest at distinct time scales.
- The temporal ordering of measurements is preserved. Doing so is the only way to fully examine the dynamics governing a particular process. If we expect that a given stimulus will influence the development of a behavior in a particular way, utilizing summary statistics will completely ignore the temporal ordering of the data and likely occlude one’s view of important behavioral dynamics.

Linear computations such as mean and variance merely describe global properties of a data set and thus may fail to capture meaningful patterns that can only be identified by examining the sequential dependency between time points. Consequently, time-series techniques provide a valuable approach to studying psychological processes, which are, by their very nature, fundamentally embedded in time. (For a more detailed treatment of this subject, see Deboeck, Montpetit, Bergeman, & Boker, 2009.)
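To make this concrete, here is a minimal sketch in Python (synthetic data, standard library only): two series containing exactly the same values, so their means and variances are identical, yet a sequential measure such as the lag-1 autocorrelation separates them immediately.

```python
import random
import statistics

random.seed(0)
trend = [0.1 * t for t in range(100)]   # strongly ordered series
shuffled = trend[:]
random.shuffle(shuffled)                # same values, temporal order destroyed

# Mean and variance cannot tell the two series apart...
assert statistics.mean(trend) == statistics.mean(shuffled)
assert statistics.pvariance(trend) == statistics.pvariance(shuffled)

# ...but a sequential measure can.
def lag1_autocorr(x):
    """Correlation between consecutive observations x[t] and x[t+1]."""
    m = statistics.mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

print(lag1_autocorr(trend))     # near 1: highly ordered
print(lag1_autocorr(shuffled))  # near 0: order destroyed
```

Because the shuffled series is a permutation of the trend, every summary statistic that ignores order is identical; only the sequential dependency differs.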

## Analyzing Time-Series Data

Once you’ve collected a series of behavioral measurements on your variable(s) of interest, there are a variety of ways to explore and quantify the observed dynamics. Here are a few techniques that can be used to investigate patterns within time-series data:

Autocorrelation/Cross-correlation. An autocorrelation reflects the magnitude of time dependency between observations within a time series. An autocorrelation plot depicts correlations between measurements X_t and X_{t+n}, such that each value represents the extent to which any given behavior is related to previous behaviors within the series. A cross-correlation involves relating two time series that are shifted in time at lag n (i.e., X_t and Y_{t+n}), and can reveal, for example, whether one process tends to "lead" the other's behavior or whether they oscillate together.
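As a sketch of the cross-correlation idea, the following Python snippet (synthetic data; the three-step delay is an assumption of the toy setup) recovers the lag at which one series leads another:

```python
import math
import random
import statistics

random.seed(1)
n, true_lag = 200, 3
base = [math.sin(0.5 * t) + random.gauss(0, 0.1) for t in range(n + true_lag)]
x = base[true_lag:]   # X "leads": x[t] == base[t + 3]
y = base[:n]          # Y is X delayed by three steps: y[t] == x[t - 3]

def xcorr(a, b, lag):
    """Correlation between a[t] and b[t + lag]."""
    a_s, b_s = a[:len(a) - lag], b[lag:]
    ma, mb = statistics.mean(a_s), statistics.mean(b_s)
    num = sum((u - ma) * (v - mb) for u, v in zip(a_s, b_s))
    den = math.sqrt(sum((u - ma) ** 2 for u in a_s) *
                    sum((v - mb) ** 2 for v in b_s))
    return num / den

# The lag that maximizes the cross-correlation recovers the delay.
best = max(range(10), key=lambda k: xcorr(x, y, k))
print(best)  # → 3
```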

Recurrence quantification analysis (RQA). RQA begins by simply plotting a time series against itself (i.e., X_t against X_t) and then quantifies whether certain states of the behavior remain stable or recur in time, as well as what percentage of the series is constituted by deterministic patterns. Cross-RQA also can be used to analyze the degree of recurrence and deterministic patterning between two processes, and it has been applied to the study of interpersonal coordination and postural control (e.g., Shockley, Santana, & Fowler, 2003) as well as to the quantification of emotional synchrony in dyadic conflict discussions (Main, Paxton, & Dale, 2016).
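The first step of RQA can be sketched in a few lines: build the recurrence matrix by thresholding pairwise distances, then compute the recurrence rate (%REC). This deliberately omits the line-based measures (e.g., percent determinism) that a full RQA would add, and the radius threshold is an illustrative choice:

```python
import math

def recurrence_rate(series, radius):
    """Fraction of (i, j) pairs whose states fall within `radius` of each other."""
    n = len(series)
    recurrent = sum(1 for i in range(n) for j in range(n)
                    if abs(series[i] - series[j]) <= radius)
    return recurrent / (n * n)

periodic = [math.sin(0.5 * t) for t in range(100)]
# A periodic signal revisits its states often, so %REC is substantial.
print(recurrence_rate(periodic, radius=0.1))
```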

Phase space reconstruction (PSR). When obtaining a behavioral time-series, one of your goals could be to determine what variables are involved in producing particular patterns of behavior and what the possible structure of the underlying dynamics may be. One way to accomplish this is to reconstruct the phase space, which is a multidimensional plot that represents all possible states within the process and can be used to approximate the number of variables involved in producing the observed behavioral changes. For example, we may interpret high trait self-esteem as representing a strong tendency for an individual to adopt and maintain positive self-evaluations. Collecting repeated measurements of state self-esteem and then performing a PSR could help describe the strength of that individual’s tendency to retain a positive image of herself as well as reveal the compensatory dynamics that follow from a negative self-evaluative state.
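PSR is commonly implemented via time-delay embedding, in which each reconstructed point collects m successive lagged values of the series. A minimal sketch (the dimension and delay below are illustrative choices, not prescriptions):

```python
import math

def delay_embed(series, dim, tau):
    """Return the list of dim-dimensional delay vectors
    (x[t], x[t + tau], ..., x[t + (dim - 1) * tau])."""
    n = len(series) - (dim - 1) * tau
    return [tuple(series[t + k * tau] for k in range(dim))
            for t in range(n)]

x = [math.sin(0.2 * t) for t in range(200)]
points = delay_embed(x, dim=2, tau=8)   # 2-D reconstruction of a sine
print(len(points), points[0])
```

For a sine wave, plotting these 2-D points traces out a closed loop, the reconstructed analogue of the circular orbit in the true phase space.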

Spectral analysis. Mathematically, any time series can be transformed into a linear composition of sine and cosine waves with varying frequencies. One goal in analyzing time-series data is often to find out what deterministic cycles (i.e., which of the component waves) account for the most variance within the series. Performing a spectral decomposition transforms a time series into a set of constituent sine and cosine waves that then are used to calculate the series' power spectral density function (PSD). Plotting the series' PSD reveals the squared correlations between each component frequency and the series as a whole, yielding a similarly intuitive interpretation to R² in multiple regression. In this vein, Gottschalk, Bauer, and Whybrow (1995) applied spectral analysis toward studying the changes in self-reported mood among bipolar patients and control subjects, finding that bipolar individuals tended to exhibit cyclical patterns of mood change that were significantly more chaotic and deterministic than the comparatively random fluctuations observed in control subjects.
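A minimal version of this analysis computes the periodogram directly from the discrete Fourier transform and reads off the dominant frequency; in practice one would use an FFT routine, but a direct DFT keeps the sketch self-contained:

```python
import cmath
import math

def periodogram(x):
    """power[k] is the strength of the cycle completing k periods in the series."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        power.append(abs(coef) ** 2 / n)
    return power

n = 128
x = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # exactly 8 cycles
psd = periodogram(x)
print(max(range(len(psd)), key=psd.__getitem__))  # → 8: the dominant cycle
```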

Differential equation modeling. Essentially, differential equations allow one to study how different variables change with one another as well as how the state of one variable can be influenced by how it is changing (Deboeck & Bergeman, 2013). Derivative estimates of a single time series can be calculated by a number of different techniques from which differential equations then are constructed and tested based on the researcher’s predictions about how those variables are related. An intuitive example of this might be in considering a committed romantic relationship, in which changes in one person’s level of emotional satisfaction conceivably lead to changes in their partner’s level of satisfaction and vice versa. Each partner’s feelings might be coupled with the other’s in a complex manner, such that differential equations could be used to model their emotional relationship and show how changes in one person’s mood are inextricably linked with changes in the other’s mood.
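The relationship example can be caricatured with a pair of coupled linear differential equations integrated by a simple Euler scheme; the coupling constants here are invented purely for illustration:

```python
def simulate(p0, q0, a=-0.2, b=0.15, steps=500, dt=0.1):
    """Toy coupled model: dp/dt = a*p + b*q, dq/dt = a*q + b*p.
    Each partner's satisfaction decays on its own (a < 0) but is
    pulled toward the other's state (b > 0)."""
    p, q = p0, q0
    history = [(p, q)]
    for _ in range(steps):
        dp = a * p + b * q
        dq = a * q + b * p
        p, q = p + dp * dt, q + dq * dt
        history.append((p, q))
    return history

traj = simulate(p0=1.0, q0=-1.0)
print(traj[-1])  # both states decay toward equilibrium as they pull on each other
```

With these (assumed) constants both eigenvalues of the system are negative, so any initial disagreement between the partners is damped out over time.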

## Applying These Techniques to Your Research

Though these methods may appear foreign and somewhat challenging at first, they quickly become more intuitive once seen in an applied context. The above list represents only some of the more common techniques used in time-series analysis, especially those that have been applied successfully within the psychological sciences.

## References and Further Reading

Deboeck, P. R., & Bergeman, C. S. (2013). The reservoir model: A differential equation model of psychological regulation. Psychological Methods, 18 , 237–256.

Deboeck, P. R., Montpetit, M. A., Bergeman, C. S., & Boker, S. M. (2009). Using derivative estimates to describe intraindividual variability at multiple time scales. Psychological Methods, 14 , 367–386.

Gottschalk, A., Bauer, M. S., & Whybrow, P. C. (1995). Evidence of chaotic mood variation in bipolar disorder. Archives of General Psychiatry, 52 , 947–959.

Holden, J. G., Riley, M. A., Gao, J., & Torre, K. (2013). Fractal analyses: Statistical and methodological innovations and best practices. Retrieved September 13, 2016, from http://www.frontiersin.org/books/Fractal_Analyses_Statistical_And_Methodological_Innovations_And_Best_Practices/179

Kaplan, D., & Glass, L. (1995). Understanding nonlinear dynamics . New York, NY: Springer.

Main, A., Paxton, A., & Dale, R. (2016). An exploratory analysis of emotion dynamics between mothers and adolescents during conflict discussions. Emotion . Advance online publication. doi:10.1037/emo0000180

Molenaar, P. C. M. (2008). Consequences of the ergodic theorems for classical test theory, factor analysis, and the analysis of developmental processes. In S. M. Hofer & D. F. Alwin (Eds.), Handbook of cognitive aging: Interdisciplinary perspectives (pp. 90–104). Thousand Oaks, CA: SAGE Publications.

Molenaar, P. C. M., Huizenga, H. M., & Nesselroade, J. R. (2003). The relationship between the structure of interindividual and intraindividual variability: A theoretical and empirical vindication of Developmental Systems Theory. In U. M. Staudinger & U. Lindenberger (Eds.), Understanding human development (pp. 339–360). Dordrecht, the Netherlands: Kluwer.

Riley, M. A., & Van Orden, G. C. (2005). Tutorials in contemporary nonlinear methods for the behavioral sciences . Retrieved March 1, 2005, from http://www.nsf.gov/sbe/bcs/pac/nmbs/nmbs.jsp

Shockley, K., Santana, M. V., & Fowler, C. A. (2003). Mutual interpersonal postural constraints are involved in cooperative conversation. Journal of Experimental Psychology: Human Perception and Performance, 29 , 326–332.

## About the Author

Trevor Swanson is a third-year PhD student studying experimental psychology at the University of Kansas. His research focuses on the temporal dynamics of self-evaluation, and his broader interests include the application of dynamical systems theory in studying psychopathology. He can be contacted at [email protected].


## Evaluating time series forecasting models: an empirical study on performance estimation methods

- Published: 13 October 2020
- Volume 109, pages 1997–2028 (2020)

## Cite this article

- Vitor Cerqueira (ORCID: orcid.org/0000-0002-9694-8423)
- Luis Torgo
- Igor Mozetič


Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. This process is a fundamental stage in any machine learning project. In this paper we study the application of performance estimation methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in this type of data. Currently, there is no consensual approach. We contribute to the literature by presenting an extensive empirical study which compares different performance estimation methods for time series forecasting tasks. These methods include variants of cross-validation, out-of-sample (holdout), and prequential approaches. Two case studies are analysed: one with 174 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that blocked cross-validation can be applied to stationary time series. However, when the time series are non-stationary, the most accurate estimates are produced by out-of-sample methods, particularly the holdout approach repeated over multiple testing periods.


## 1 Introduction

Performance estimation denotes the process of using the available data to estimate the loss that a predictive model will incur in new, yet unseen, observations. Estimating the performance of a predictive model is a fundamental stage in any machine learning project. Practitioners carry out performance estimation to select the most appropriate model and its parameters. Crucially, the process of performance estimation is one of the most reliable approaches to analyse the generalisation ability of predictive models. Such analysis is important not only to select the best model, but also to verify that the respective model solves the underlying predictive task.

Choosing an appropriate performance estimation method usually depends on the characteristics of the data set. When observations are independent and identically distributed (i.i.d.), cross-validation is one of the most widely used approaches (Geisser 1975). One of the reasons for its popularity is its efficient use of data (Arlot and Celisse 2010). However, many data sets in real-world applications are not i.i.d., for example, time series. Time series forecasting is an important machine learning problem with high practical utility in organizations across many domains of application.

When the observations in a data set are not i.i.d., the standard cross-validation approach is not directly applicable. Cross-validation breaks the temporal order of time series observations, which may lead to unrealistic estimates of predictive performance. In effect, when dealing with this type of data set, practitioners typically apply an out-of-sample (also known as holdout) approach to estimate the performance of predictive models. Essentially, the predictive model is built on the initial part of the data, and the subsequent observations are used for testing. Notwithstanding, there are particular scenarios in which cross-validation may be beneficial, for example, when the time series is stationary or when the sample size is small and data efficiency becomes important (Bergmeir et al. 2018).

Several approaches have been developed in recent decades to estimate the performance of forecasting models. However, there is no consensual approach. In this context, we contribute to the literature by carrying out an extensive empirical study which compares several approaches which are often used in practice.

We compare a set of estimation methods which can be broadly split into three categories: out-of-sample (OOS), prequential, and cross-validation (CVAL). OOS approaches are commonly used to estimate the performance of models when the data comprises some degree of temporal dependency. The core idea behind these approaches is to leave the last part of the data for testing. Although they do not make complete use of the available data, they preserve the temporal order of observations. This aspect may be important to cope with the temporal correlation among consecutive observations and to mimic a realistic deployment scenario. Prequential approaches (Dawid 1984) are also common in incremental or high-frequency data sets such as data streams (Gama et al. 2014). Prequential denotes an evaluation procedure in which an observation (or a set of observations) is first used for testing, and then to re-train or update the model.

CVAL approaches make a more efficient use of the available data, as each observation is used to both train and test a model over the different iterations of the procedure. This property may be beneficial in scenarios in which the sample size is small (Bergmeir et al. 2018). Although classical K-fold cross-validation assumes the data to be i.i.d., some variants have been developed that mitigate this problem. In effect, some of these variants have been shown to provide better estimates of performance relative to OOS methods in time series tasks (Bergmeir and Benítez 2012; Bergmeir et al. 2014, 2018).

The key factor that distinguishes OOS and prequential approaches from CVAL ones is that the former always preserve the temporal order of observations: a model is never tested on data preceding its training data. The central research question we address in this paper is the following: how do estimation methods compare with each other in terms of performance estimation ability for different types of time series data? To answer it, we applied different estimation methods in two case studies. One comprises 174 real-world time series with potential non-stationarities, and the other is a stationary synthetic environment (Bergmeir and Benítez 2012; Bergmeir et al. 2014, 2018).

The results suggest that, as Bergmeir et al. point out (Bergmeir et al. 2018 ), cross-validation approaches can be applied to stationary time series. However, many real-world phenomena comprise complex non-stationary sources of variation. In these cases, applying holdout in several testing periods shows the best performance.

This paper is an extension to an article already published (Cerqueira et al. 2017 ). In this work, we substantially increase the experimental setup both in methods and data sets used; we provide additional analyses such as the impact of stationarity; and a more in-depth and critical discussion of the results.

This paper is structured as follows. The literature on performance estimation for time series forecasting tasks is reviewed in Sect. 2. Materials and methods are described in Sect. 3, including the predictive task, time series data sets, performance estimation methodology, and experimental design. The results of the experiments are reported in Sect. 4. A discussion of our results is carried out in Sect. 5. Finally, the conclusions of our empirical study are provided in Sect. 6.

In the interest of reproducibility, the methods and data sets are publicly available. Footnote 1

## 2 Background

In this section we provide a background to this paper. We review the typical estimation methods used in time series forecasting and explain the motivation for this study.

In general, performance estimation methods for time series forecasting tasks are designed to cope with the dependence between observations. This is typically accomplished by having a model tested on observations future to the ones used for training.

## 2.1 Out-of-sample (OOS) approaches

When using OOS performance estimation procedures, a time series is split into two parts: an initial fit period in which a model is trained, and a temporally subsequent testing period held out for estimating the loss of that model. This simple approach (Holdout) is depicted in Fig. 1. Within this type of procedure, however, one can adopt different strategies regarding the training/testing split point, growing or sliding window settings, and eventual updating of the models. In order to produce a robust estimate of predictive performance, Tashman (2000) recommends employing these strategies in multiple test periods. One might create different sub-samples according to, for example, business cycles (Fildes 1989). For a more general setting one can also adopt a randomized approach. This is similar to random sub-sampling (or repeated holdout) in the sense that it consists of repeating a learning-plus-testing cycle several times using different, but possibly overlapping, data samples (Rep-Holdout). This idea is illustrated in Fig. 2, where one iteration of a repeated holdout is shown. A point a is randomly chosen from the available sampling window (constrained by the training and testing sizes) of a time series Y. This point marks the end of the training set and the start of the testing set.
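One way to generate Rep-Holdout splits is to draw each cut point a at random within the admissible window, as sketched below (Python; the sizes are illustrative):

```python
import random

def rep_holdout_splits(n, n_reps, train_size, test_size, seed=0):
    """Generate (train_idx, test_idx) pairs for repeated holdout on a
    series of length n. Each repetition picks a random cut point a and
    trains on the window before it, tests on the window after it."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_reps):
        a = rng.randint(train_size, n - test_size)  # random cut point
        train_idx = list(range(a - train_size, a))
        test_idx = list(range(a, a + test_size))
        splits.append((train_idx, test_idx))
    return splits

for train_idx, test_idx in rep_holdout_splits(n=100, n_reps=3,
                                              train_size=60, test_size=20):
    # Temporal order preserved: every test index follows every train index.
    assert max(train_idx) < min(test_idx)
```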

Simple out-of-sample procedure: an initial part of the available observations are used for fitting a predictive model. The last part of the data is held out, where the predictive model is tested

Example of one iteration of the repeated holdout procedure. A point a is chosen from the available window. Then, a previous part of observations are used for training, while a subsequent part of observations are used for testing

## 2.2 Prequential

OOS approaches are similar to prequential or interleaved-test-then-train evaluation (Dawid 1984). Prequential evaluation is typically used in data stream mining. The idea is that each observation is first used to test the model, and then to train or update it. This can be applied in blocks of sequential instances (Modha and Masry 1998). In the initial iteration, only the first two blocks are used: the first for training and the second for testing. In the next iteration, the second block is merged with the first, and the third block is used for testing. This procedure continues until all blocks are tested (Preq-Bls). It is exemplified in the left side of Fig. 3, in which the data is split into 5 blocks.
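The growing-window block logic of Preq-Bls can be sketched as an index generator:

```python
def preq_blocks(n, n_blocks):
    """Prequential evaluation in blocks with a growing window (Preq-Bls):
    block k is tested on, then merged into the training data for block k+1."""
    size = n // n_blocks
    blocks = [list(range(i * size, (i + 1) * size)) for i in range(n_blocks)]
    splits = []
    for k in range(1, n_blocks):
        train_idx = [i for b in blocks[:k] for i in b]  # all earlier blocks
        test_idx = blocks[k]                            # the next block
        splits.append((train_idx, test_idx))
    return splits

for train_idx, test_idx in preq_blocks(n=100, n_blocks=5):
    assert max(train_idx) < min(test_idx)  # temporal order is never broken
print(len(preq_blocks(100, 5)))  # → 4 train/test iterations for 5 blocks
```

The sliding-window (Preq-Sld-Bls) and gap (Preq-Bls-Gap) variants differ only in dropping the oldest blocks from `train_idx` or skipping the block adjacent to the test set.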

Variants of prequential approach applied in blocks for performance estimation. This strategy can be applied using a growing window (left, right), or a sliding window (middle). One can also introduce a gap between the training and test sets

A variant of this idea is illustrated in the middle scheme of Fig. 3. Instead of merging the blocks after each iteration (growing window), one can forget the older blocks in a sliding-window fashion (Preq-Sld-Bls). This idea is typically adopted when past data becomes deprecated, which is common in non-stationary environments. Another variant of the prequential approach is represented in the right side of Fig. 3. This illustrates a prequential approach applied in blocks, where a gap block is introduced (Preq-Bls-Gap). The rationale behind this idea is to increase the independence between training and test sets.

The prequential approaches can be regarded as variations of the holdout procedure applied in multiple testing periods (Rep-Holdout). The core distinction is that the prequential train and test splits are not randomized, but pre-determined by the number of blocks or repetitions.

## 2.3 Cross-validation approaches

The typical approach when using K-fold cross-validation is to randomly shuffle the data and split it into K equally-sized folds or blocks. Each fold is a subset of the data comprising t/K randomly assigned observations, where t is the number of observations. After splitting the data into K folds, each fold is iteratively picked for testing. A model is trained on the other K-1 folds and its loss is estimated on the left-out fold (CV). In fact, the initial random shuffle of observations before splitting into different blocks is not intrinsic to cross-validation (Geisser 1975). It is, however, a common practice among data science professionals. This approach to cross-validation is illustrated in the left side of Fig. 4.
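For reference, the standard (shuffled) K-fold split just described can be sketched as:

```python
import random

def kfold_splits(n, k, seed=0):
    """Standard K-fold CV: shuffle the indices, split them into K folds,
    and hold each fold out for testing in turn."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # the (optional) initial random shuffle
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, f in enumerate(folds) if j != held_out for i in f]
        yield train_idx, test_idx

splits = list(kfold_splits(n=20, k=5))
print(len(splits))  # → 5; every observation appears in exactly one test fold
```

Note that after the shuffle, a "training" fold routinely contains observations that come later in time than the test fold, which is exactly the property the blocked variants below are designed to control.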

Variants of cross-validation estimation procedures

## 2.3.1 Variants designed for time-dependent data

Some variants of K-fold cross-validation have been proposed that are specifically designed for dependent data, such as time series (Arlot and Celisse 2010). Theoretical problems arise when applying the standard technique directly to this type of data: the dependency among observations is not taken into account, since cross-validation assumes the observations to be i.i.d. This may lead to overly optimistic estimates and, consequently, poor generalisation of predictive models on new observations. For example, prior work has shown that cross-validation yields poor estimates for the task of choosing the bandwidth of a kernel estimator in correlated data (Hart and Wehrly 1986). To overcome this issue and approximate independence between the training and test sets, several variants of the procedure have been proposed. We focus on variants designed to cope with temporal dependency among observations.

The Blocked Cross-Validation (Snijders 1988) (CV-Bl) procedure is similar to the standard form described above. The difference is that there is no initial random shuffling of observations. In time series, this renders K blocks of contiguous observations. The natural order of observations is kept within each block, but broken across them. This approach is also illustrated in the left side of Fig. 4; since the random shuffle of observations is not depicted, the figure for CV-Bl is identical to the one shown for CV.

The Modified CV procedure (McQuarrie and Tsai 1998) (CV-Mod) works by removing from the training set those observations that are correlated with the test set. The data is initially randomly shuffled and split into K equally-sized folds, as in K-fold cross-validation. Afterwards, observations from the training set within a certain temporal range of the test-set observations are removed. This ensures independence between the training and test sets. However, when a significant number of observations is removed from training, the model may under-fit. This approach is also described as non-dependent cross-validation (Bergmeir and Benítez 2012). The graph in the middle of Fig. 4 illustrates this approach.

The hv-Blocked Cross-Validation (CV-hvBl) proposed by Racine (2000) extends blocked cross-validation to further increase the independence among observations. Specifically, besides blocking the observations in each fold (i.e., no initial random shuffle of observations), it also removes observations adjacent to the boundary between the training and test sets. Effectively, this creates a gap between both sets. This idea is depicted in the right side of Fig. 4.
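Blocked CV and its hv variant differ from the standard procedure only in how the indices are assigned; a single sketch covers both, with `gap=0` yielding CV-Bl and `gap>0` yielding CV-hvBl (the gap size is a user choice):

```python
def blocked_cv_splits(n, k, gap=0):
    """Blocked CV: contiguous folds with no shuffling. A positive `gap`
    additionally removes that many observations on each side of the test
    fold from the training set (the hv-blocked variant)."""
    size = n // k
    for fold in range(k):
        lo, hi = fold * size, (fold + 1) * size
        test_idx = list(range(lo, hi))
        train_idx = [i for i in range(n)
                     if i < lo - gap or i >= hi + gap]  # gap on both sides
        yield train_idx, test_idx

for train_idx, test_idx in blocked_cv_splits(n=100, k=5, gap=3):
    assert not set(train_idx) & set(test_idx)
```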

## 2.3.2 Usefulness of cross-validation approaches

Recently there has been some work on the usefulness of cross-validation procedures for time series forecasting tasks. Bergmeir and Benítez ( 2012 ) present a comparative study of estimation procedures using stationary time series. Their empirical results show evidence that, in such conditions, cross-validation procedures yield more accurate estimates than an OOS approach. Despite the theoretical issues of applying standard cross-validation, they found no practical problem in their experiments. Notwithstanding, they suggest blocked cross-validation for performance estimation with stationary time series.

Bergmeir et al. ( 2014 ) extended their previous work for directional time series forecasting tasks. These tasks are related to predicting the direction (upward or downward) of the observable. The results from their experiments suggest that the hv-Blocked CV procedure provides more accurate estimates than the standard out-of-sample approach. These were obtained by applying the methods on stationary time series.

Finally, Bergmeir et al. ( 2018 ) present a simulation study comparing standard cross-validation to out-of-sample evaluation. They used three data generating processes and performed 1000 Monte Carlo trials in each of them. For each trial and generating process, a stationary time series with 200 values was created. The results from the simulation suggest that cross-validation systematically yields more accurate estimates, provided that the model is correctly specified.

In a related empirical study (Mozetič et al. 2018 ), Mozetič et al. compare estimation procedures on several large time-ordered Twitter datasets. They find no significant difference between the best cross-validation and out-of-sample evaluation procedures. However, they do find that standard, randomized cross-validation is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.

Despite the results provided by these previous works, we argue that they are limited in two ways. First, the experimental procedure used is biased towards cross-validation approaches. While these produce several error estimates (one per fold), the OOS approach is evaluated in a one-shot estimation, where the last part of the time series is withheld for testing. OOS methods can be applied in several windows for more robust estimates, as recommended by Tashman ( 2000 ). By using a single origin, one is prone to particular issues related to that origin.

Second, the results are based on stationary time series, most of them artificial. Time series stationarity is equivalent to identical distribution in the terminology of more traditional predictive tasks. Hence, the synthetic data generation processes, and especially the stationarity assumption, limit the interesting patterns that can occur in real-world time series. Our working hypothesis is that in more realistic scenarios one is likely to find time series with complex sources of non-stationary variation.

In this context, this paper provides an extensive comparative study using a wide set of methods for evaluating the performance of univariate time series forecasting models. The analysis is carried out using a real-world scenario as well as a synthetic case study used in the works described previously (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

## 2.4 Related work on performance estimation with dependent data

The problem of performance estimation has also been under research in different scenarios. While we focus on time series forecasting problems, the following works study performance estimation methods in different predictive tasks.

## 2.4.1 Spatio-temporal dependencies

Geo-referenced time series are becoming more prevalent due to the increase of data collection from sensor networks. In these scenarios, the most appropriate estimation procedure is not obvious as spatio-temporal dependencies are at play. Oliveira et al. ( 2018 ) presented an extensive empirical study of performance estimation for forecasting problems with spatio-temporal time series. The results reported by the authors suggest that both cross-validation and out-of-sample methods are applicable in these scenarios. Like previous work in time-dependent domains (Bergmeir and Benítez 2012 ; Mozetič et al. 2018 ), Oliveira et al. suggest the use of blocking when using a cross-validation estimation procedure.

## 2.4.2 Data streams mining

Data streams mining is concerned with predictive models that evolve continuously over time in response to concept drift (Gama et al. 2014 ). Gama et al. ( 2013 ) provide a thorough overview of the evaluation of predictive models for data streams mining. The authors defend the usage of the prequential estimator with a forgetting mechanism, such as a fading factor or a sliding window.

This work is related to ours in the sense that it deals with performance estimation using time-dependent data. Notwithstanding, the paradigm of data streams mining is in line with sequential analysis (Wald 1973 ). As such, the assumption is that the sample size is not fixed in advance, and predictive models are evaluated as observations are collected. In our setting, given a time series data set, we want to estimate the loss that a predictive model will incur on unseen observations future to that data set.

## 3 Materials and methods

In this section we present the materials and methods used in this work. First, we define the prediction task. Second, the time series data sets are described. We then formalize the methodology employed for performance estimation. Finally, we overview the experimental design.

## 3.1 Predictive task definition

A time series is a temporal sequence of values \(Y = \{y_1, y_2, \dots , y_t\}\) , where \(y_i\) is the value of Y at time i and t is the length of Y . We remark that we use the term time series assuming that Y is a numeric variable, i.e., \(y_i \in \mathbb {R}, \forall y_i \in Y\) .

Time series forecasting denotes the task of predicting the next value of the time series, \(y_{t+1}\) , given the previous observations of Y . We focus on a purely auto-regressive modelling approach, predicting future values of time series using its past lags.

To be more precise, we use time delay embedding (Takens 1981 ) to represent Y in a Euclidean space with embedding dimension p . Effectively, we construct a set of observations which are based on the past p lags of the time series. Each observation is composed of a feature vector \(x_i \in \mathbb {X} \subset \mathbb {R}^p\) , which denotes the previous p values, and a target value \(y_i \in \mathbb {Y} \subset \mathbb {R}\) , which represents the value we want to predict. The objective is to construct a model \(f : \mathbb {X} \rightarrow \mathbb {Y}\) , where f denotes the regression function.

Summarizing, we generate the following matrix:

\[
\begin{bmatrix}
y_{1} & y_{2} & \cdots & y_{p} & y_{p+1}\\
y_{2} & y_{3} & \cdots & y_{p+1} & y_{p+2}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
y_{t-p} & y_{t-p+1} & \cdots & y_{t-1} & y_{t}
\end{bmatrix}
\]

Taking the first row of the matrix as an example, the target value is \(y_{p+1}\) , while the attributes (predictors) are the previous p values \(\{y_1, \dots , y_{p}\}\) .
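The construction of this embedding can be sketched as follows. The paper's experiments are in R, so this plain-Python helper (name is ours) is only an illustration:

```python
def time_delay_embed(y, p):
    """Turn a time series into (X, targets) pairs using the past p lags.

    Row i of X holds the p values preceding targets[i], so the first
    target is y[p] (i.e. y_{p+1} in the paper's 1-based notation).
    """
    X, targets = [], []
    for i in range(p, len(y)):
        X.append(y[i - p:i])
        targets.append(y[i])
    return X, targets

X, t = time_delay_embed([1, 2, 3, 4, 5], p=3)
# X == [[1, 2, 3], [2, 3, 4]], t == [4, 5]
```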

## 3.2 Time series data

Two different case studies are used to analyse the performance estimation methods: a scenario comprised of real-world time series and a synthetic setting used in prior works (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ) for addressing the issue of performance estimation for time series forecasting tasks.

## 3.2.1 Real-world time series

Regarding real-world time series, we use a set of time series from the benchmark database tsdl (Hyndman and Yang 2019 ). From this database, we selected all the univariate time series with at least 500 observations and no missing values. This query returned 149 time series. We also included 25 time series used in previous work by Cerqueira et al. ( 2019 ). From the set of 62 time series used by the authors, we selected those with at least 500 observations which were not originally from the tsdl database (those are already retrieved as described above). We refer to the work by Cerqueira et al. ( 2019 ) for a description of these time series. In summary, our database of real-world time series comprises 174 time series. The threshold of 500 observations is imposed so that learning algorithms have enough data to build a good predictive model. The 174 time series represent phenomena from different application domains, including finance, physics, economy, energy, and meteorology. They also cover distinct sampling frequencies, such as hourly or daily. In terms of sample size, the distribution ranges from 506 to 23741 observations. However, we truncated the time series to a maximum of 4000 observations to speed up computations. The database is available online (c.f. footnote 1). We refer to the sources for further information on these time series (Hyndman and Yang 2019 ; Cerqueira et al. 2019 ).

**Stationarity**

We analysed the stationarity of the time series comprising the real-world case study. Essentially, a time series is said to be stationary if its characteristics do not depend on the time at which the data is observed (Hyndman and Athanasopoulos 2018 ). In this work we consider stationarity of order 2. This means that a time series is considered stationary if it has constant mean, constant variance, and an auto-covariance that does not depend on time. Henceforth we will refer to a time series as stationary if it is stationary of order 2.

In order to test whether a given time series is stationary we follow the wavelet spectrum test described by Nason ( 2013 ). This test starts by computing an evolutionary wavelet spectral approximation. Then, for each scale of this approximation, the coefficients of the Haar wavelet are computed. Any large Haar coefficient is evidence of non-stationarity. A hypothesis test is carried out to assess whether a coefficient is large enough to reject the null hypothesis of stationarity. In particular, we apply a multiple hypothesis test with a Bonferroni correction and a false discovery rate (Nason 2013 ).

Application of the wavelet spectrum test to a non-stationary time series. Each red horizontal arrow denotes a non-stationarity identified by the test

Figure 5 shows an example of the application of the wavelet spectrum test to a non-stationary time series. In the graphic, each red horizontal arrow denotes a non-stationarity found by the test. The left-hand axis denotes the scale of the time series. The right-hand axis represents the scale of the wavelet periodogram, where the non-stationarities are found. Finally, the length of each arrow denotes the scale of the Haar wavelet coefficient whose null hypothesis was rejected. For a thorough description of this method we refer to the work by Nason ( 2013 ). Out of the 174 time series used in this work, 97 are stationary, while the remaining 77 are non-stationary.

## 3.2.2 Synthetic time series

We use three synthetic use cases defined in previous works by Bergmeir et al. ( 2014 , 2018 ). The data generating processes are all stationary and are designed as follows:

S1: A stable auto-regressive process with lag 3, i.e., the next value of the time series depends on the past 3 observations;

S2: An invertible moving average process with lag 1;

S3: A seasonal auto-regressive process with lag 12 and seasonal lag 1.

For the first two cases, S1 and S2, real-valued roots of the characteristic polynomial are sampled from the uniform distribution on \([-r,-1.1]\cup [1.1,r]\) , where r is set to 5 (Bergmeir and Benítez 2012 ). Afterwards, the roots are used to estimate the models and create the time series. The data is then processed by making all values positive, which is accomplished by subtracting the minimum value and adding 1. The third case, S3, is created by fitting a seasonal auto-regressive model to a time series of monthly total accidental deaths in the USA (Brockwell and Davis 2013 ). For a complete description of the data generating processes we refer to the works by Bergmeir and Benítez ( 2012 ) and Bergmeir et al. ( 2018 ). Similarly to Bergmeir et al., for each use case we performed 1000 Monte Carlo simulations; in each repetition a time series with 200 values was generated.

## 3.3 Performance estimation methodology

Performance estimation addresses the issue of estimating the predictive performance of models. Frequently, the objective behind these tasks is to compare different solutions for solving a predictive task. This includes selecting among different learning algorithms and tuning the hyper-parameters of a particular one.

Training a learning model and evaluating its predictive ability on the same data has been proven to produce biased results due to overfitting (Arlot and Celisse 2010 ). Accordingly, several methods for performance estimation have been proposed in the literature, which use new data to estimate the performance of models. Usually, new data is simulated by splitting the available data: part of the data is used for training the learning algorithm, and the remaining data is used to test and estimate the performance of the model.

For many predictive tasks, the most widely used of these methods is K-fold cross-validation (Stone 1974 ) (c.f. Sect. 2 for a description). The main advantages of this method are its universal splitting criterion and efficient use of all the data. However, cross-validation is based on the assumption that observations in the underlying data are independent. When this assumption is violated, for example in time series data, theoretical problems arise that prevent the proper use of this method in such scenarios. As we described in Sect. 2 , several methods have been developed to cope with this issue, from out-of-sample approaches (Tashman 2000 ) to variants of the standard cross-validation, e.g., blocked cross-validation (Snijders 1988 ).

Our goal in this paper is to compare a wide set of estimation procedures and test their suitability for different types of time series forecasting tasks. In order to emulate a realistic scenario we split each time series into two parts. The first part is used to estimate the loss that a given learning model will incur on unseen future observations. This part is further split into training and test sets as described before. The second part is used to compute the true loss that the model incurs. This strategy allows the computation of unbiased estimates of error since the model is always tested on unseen observations.

The workflow described above is summarised in Fig. 6 . A time series Y is split into an estimation set \(Y^{est}\) and a subsequent validation set \(Y^{val}\) . First, \(Y^{est}\) is used to calculate \(\hat{g}\) , which represents the estimate of the loss that a predictive model m will incur on future new observations. This is accomplished by further splitting \(Y^{est}\) into training and test sets according to the respective estimation procedure \(g_i\) , \(i \in \{1,\dots ,u\}\) . The model m is built on the training set and \(\hat{g}_i\) is computed on the test set.

Second, in order to evaluate the estimates \(\hat{g}_i\) produced by the methods \(g_i\) , \(i \in \{1,\dots ,u\}\) , the model m is re-trained using the complete set \(Y^{est}\) and tested on the validation set \(Y^{val}\) . Effectively, we obtain \(L^m\) , the ground truth loss that m incurs on new data.

In summary, the goal of an estimation method \(g_i\) is to approximate \(L^m\) by \(\hat{g}_i\) as well as possible. In Sect. 3.4.3 we describe how to quantify this approximation.

Experimental comparison procedure (Cerqueira et al. 2017 ): a time series is split into an estimation set \(Y^{est}\) and a subsequent validation set \(Y^{val}\) . The first is used to estimate the error \(\hat{g}\) that the model m will incur on unseen data, using u different estimation methods. The second is used to compute the actual error \(L^m\) incurred by m . The objective is to approximate \(L^m\) by \(\hat{g}\) as well as possible

## 3.4 Experimental design

The experimental design was devised to address the following research question: How do the predictive performance estimates of cross-validation methods compare to the estimates of out-of-sample approaches for time series forecasting tasks?

Existing empirical evidence suggests that cross-validation methods provide more accurate estimations than traditionally used OOS approaches in stationary time series forecasting (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ) (see Sect. 2 ). However, many real-world time series comprise complex structures. These include cues from the future that may not have been revealed in the past. In such cases, preserving the temporal order of observations when estimating the predictive ability of models may be an important component.

## 3.4.1 Embedding dimension and estimation set size

We estimate the optimal embedding dimension ( p ) (the value which minimises generalisation error) using the method of False Nearest Neighbours (Kennel et al. 1992 ). This method analyses the behaviour of the nearest neighbours as we increase p . According to Kennel et al. ( 1992 ), with a low sub-optimal p many of the nearest neighbours will be false. Then, as we increase p and approach an optimal embedding dimension those false neighbours disappear. We set the tolerance of false nearest neighbours to 1%. Regarding the synthetic case study, we fixed the embedding dimension to 5. The reason for this setup is to try to follow the experimental setup by Bergmeir et al. ( 2018 ).

The estimation set ( \(Y^{est}\) ) in each time series comprises the first 70% of observations of the time series – see Fig. 6 . The validation period comprises the subsequent 30% of observations ( \(Y^{val}\) ). These values are typically used when partitioning data sets for performance estimation.

## 3.4.2 Estimation methods

In the experiments we apply a total of 11 performance estimation methods, which are divided into cross-validation, out-of-sample, and prequential approaches. The cross-validation methods are the following:

CV Standard, randomized K-fold cross-validation;

CV-Bl Blocked K-fold cross-validation;

CV-Mod Modified K-fold cross-validation;

CV-hvBl hv-Blocked K-fold cross-validation;

The out-of-sample approaches are the following:

Holdout A simple OOS approach–the first 70% of \(Y^{est}\) is used for training and the subsequent 30% is used for testing;

Rep-Holdout OOS applied in nreps testing periods, selected with a Monte Carlo simulation that uses 70% of the total t observations of the time series in each repetition. For each repetition, a random point is picked from the time series. The preceding window comprising 60% of t is used for training and the following window of 10% of t is used for testing.

Finally, we include the following prequential approaches:

Preq-Bls Prequential evaluation in blocks in a growing fashion;

Preq-Sld-Bls Prequential evaluation in blocks in a sliding fashion–the oldest block of data is discarded after each iteration;

Preq-Bls-Gap Prequential evaluation in blocks in a growing fashion with a gap block–this is similar to the method above, but comprises a block separating the training and testing blocks in order to increase the independence between the two parts of the data;

Preq-Grow and Preq-Slide As baselines we also include the exhaustive prequential methods in which an observation is first used to test the predictive model and then to train it. We use both a growing/landmark window ( Preq-Grow ) and a sliding window ( Preq-Slide ).

We refer to Sect. 2 for a complete description of the methods. The number of folds K or repetitions nreps in these methods is 10, which is a commonly used setting in the literature. The number of observations removed in CV-Mod and CV-hvBl (c.f. Sect. 2 ) is the embedding dimension p in each time series.
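As a concrete illustration of one of these methods, the Rep-Holdout sampling scheme described above might be sketched as follows (helper name, seed, and defaults are ours; the paper's implementation is in R):

```python
import random

def rep_holdout_splits(t, nreps, train_frac=0.6, test_frac=0.1, seed=1):
    """Monte Carlo repeated-holdout splits.

    For each repetition, a cut point is sampled so that a training
    window of train_frac * t observations precedes it and a testing
    window of test_frac * t observations follows it.
    """
    rng = random.Random(seed)
    n_train = int(train_frac * t)
    n_test = int(test_frac * t)
    splits = []
    for _ in range(nreps):
        cut = rng.randint(n_train, t - n_test)
        splits.append((list(range(cut - n_train, cut)),
                       list(range(cut, cut + n_test))))
    return splits

# With t=1000 and nreps=10, each repetition trains on 600 consecutive
# observations and tests on the 100 that immediately follow them.
```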

## 3.4.3 Evaluation metrics

Our goal is to study which estimation method provides a \(\hat{g}\) that best approximates \(L^m\) . Let \(\hat{g}^m_i\) denote the loss estimated for the learning model m using the estimation method \(g_i\) on the estimation set, and \(L^m\) denote the ground truth loss of learning model m on the validation set. The objective is to analyze how well \(\hat{g}^m_i\) approximates \(L^m\) . This is quantified by the absolute predictive accuracy error (APAE) metric and the predictive accuracy error (PAE) (Bergmeir et al. 2018 ):

\[ \text{APAE} = |\hat{g}^m_i - L^m| \qquad \text{PAE} = \hat{g}^m_i - L^m \]

The APAE metric evaluates the error size of a given estimation method. On the other hand, PAE measures the error bias, i.e., whether a given estimation method is under-estimating or over-estimating the true error.

Another question is how a given learning model is evaluated regarding its forecasting accuracy. In this work we evaluate models according to the root mean squared error (RMSE). This metric is traditionally used for measuring the differences between estimated and actual values.
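These three metrics are straightforward to compute; a minimal sketch (function names are ours):

```python
import math

def pae(estimated_loss, true_loss):
    """Predictive accuracy error: the signed bias of the estimate.
    Negative values mean the method under-estimates the true error."""
    return estimated_loss - true_loss

def apae(estimated_loss, true_loss):
    """Absolute predictive accuracy error: the error size."""
    return abs(estimated_loss - true_loss)

def rmse(y_true, y_pred):
    """Root mean squared error of a forecast."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(y_true, y_pred)) / len(y_true))

# An optimistic estimator: estimated loss 2.0 against a true loss of 3.0
# gives PAE = -1.0 (under-estimation) and APAE = 1.0.
```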

## 3.4.4 Learning algorithms

We applied the following learning algorithms:

RBR A rule-based regression algorithm from the Cubist R package (Kuhn et al. 2014 ), which is a variant of the model tree by Quinlan ( 1993 ). The main parameter, the number of committees (c.f. Kuhn et al. 2014 ), was set to 5;

RF A Random Forest algorithm, which is an ensemble of decision trees (Breiman 2001 ). We resort to the implementation from the ranger R package (Wright 2015 ). The number of trees in the ensemble was set to 100;

GLM A generalized linear model (McCullagh 2019 ) regression with a Gaussian distribution and a Ridge penalty. We used the implementation of the glmnet R package (Friedman et al. 2010 ).

The remaining parameters of each method were set to their defaults. These three learning algorithms are widely used in regression tasks. For time series forecasting in particular, Cerqueira et al. ( 2019 ) showed their usefulness when applied as part of a dynamic heterogeneous ensemble. In particular, the RBR method showed the best performance among 50 other approaches.

## 4 Empirical experiments

## 4.1 Results with synthetic case study

In this section we start by analysing the average rank, and respective standard deviation, of each estimation method and for each synthetic scenario (S1, S2, and S3), according to the metric APAE. For example, a rank of 1 in a given Monte Carlo repetition means that the respective method was the best estimator in that repetition. These analyses are reported in Figs. 7 , 8 , 9 . This initial experiment is devised to reproduce the results by Bergmeir et al. ( 2018 ). Later, we will analyse how these results compare when using real-world time series.

The results shown by the average ranks corroborate those presented by Bergmeir and Benítez ( 2012 ), Bergmeir et al. ( 2014 ), Bergmeir et al. ( 2018 ). Cross-validation approaches, blocked ones in particular, perform better (i.e., show a lower average rank) relative to the simple out-of-sample procedure Holdout . This can be concluded from all three scenarios: S1, S2, and S3.

Average rank and respective standard deviation of each estimation method in case study S1 using the RBR learning method

Focusing on scenario S1, the estimation method with the best average rank is Preq-Bls-Gap , followed by the other two prequential variants ( Preq-Sld-Bls and Preq-Bls ). Although the Holdout procedure is clearly a relatively poor estimator, the repeated holdout in multiple testing periods ( Rep-Holdout ) shows a better average rank than the standard cross-validation approach ( CV ). Among cross-validation procedures, CV-hvBl presents the best average rank.

Scenario S2 tells a seemingly different story relative to S1. In this problem, the blocked cross-validation procedures show the best estimation ability. Among all, CV-hvBl shows the best average rank.

Average rank and respective standard deviation of each estimation method in case study S2 using the RBR learning method

Regarding the scenario S3, the outcome is less clear than the previous two scenarios. The methods show a closer average rank among them, with large standard deviations. Preq-Sld-Bls shows the best estimation ability, followed by the two blocked cross-validation approaches, CV-Bl and CV-hvBl .

Average rank and respective standard deviation of each estimation method in case study S3 using the RBR learning method

In summary, this first experiment corroborates the one carried out by Bergmeir et al. ( 2018 ). Notwithstanding, other methods that the authors did not test, namely the prequential variants, show an interesting estimation ability in these particular scenarios. In this section, we focused on the RBR learning algorithm. In Sect. 1 of the appendix, we include the results for the GLM and RF learning methods. Overall, the results are similar across the learning algorithms.

The synthetic scenarios comprise time series that are stationary. However, real-world time series often comprise complex dynamics that break stationarity. When choosing a performance estimation method one should take this issue into consideration. To account for time series stationarity, in the next section we analyze the estimation methods using real-world time series.

## 4.2 Results with real-world data

In this section we analyze the performance estimation ability of each method using a case study which includes 174 real-world time series from different domains.

Rank distribution of each estimation method across the 174 real-world time series

First, in Fig. 10 , we show the rank distribution of each performance estimation method across the 174 time series. This figure shows a large dispersion of rank across the methods. This outcome indicates that there is no particular performance estimation method which is the most appropriate for all time series. Crucially, this result also motivates the need to study which time series characteristics (e.g. stationarity) most influence which method is more adequate for a given task.

## 4.2.1 Stationary time series

In Fig. 11 , we start by analyzing the average rank, and respective standard deviation, of each estimation method using the APAE metric. We focus on the 97 stationary time series in the database.

Similarly to the synthetic case study, the blocked cross-validation approaches CV-Bl and CV-hvBl show a good estimation ability in terms of average rank. Conversely, the other cross-validation approaches are the worst estimators. This outcome highlights the importance of blocking when using cross-validation. Rep-Holdout is the best estimator among the OOS approaches, while Preq-Bls shows the best score among the prequential methods.

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the RBR learning method

We also study the statistical significance of the obtained results in terms of error size (APAE) according to a Bayesian analysis (Benavoli et al. 2017 ). Particularly, we applied the Bayes signed-rank test to compare pairs of methods across multiple problems. We arbitrarily define the region of practical equivalence (Benavoli et al. 2017 ) (ROPE) to be the interval [-2.5%, 2.5%] in terms of APAE. Essentially, this means that two methods show indistinguishable performance if the difference in performance between them falls within this interval. For a thorough description of the Bayesian analysis for comparing predictive models we refer to the work by Benavoli et al. ( 2017 ). In this analysis, it is necessary to use a scale invariant measure of performance. Therefore, we transform the metric APAE into the percentage difference of APAE relative to a baseline. In this experiment we fix the method Rep-Holdout as the baseline.

According to the illustration in Fig. 12 , the probability of Rep-Holdout winning (i.e., showing a significantly better estimation ability) is generally larger than the opposite. The exception is when it is compared with the blocked cross-validation approaches CV-Bl and CV-hvBl .

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RBR learning method

For stationary time series, the blocked cross-validation approach CV-Bl seems to be the best estimation method among those analysed. The average rank analysis suggests that, on average, the relative position of this method is better than the other approaches. A more rigorous statistical analysis (using the Bayes signed-rank method) suggests that both CV-Bl and CV-hvBl are significantly better choices relative to Rep-Holdout .

## 4.2.2 Non-stationary time series

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the RBR learning method

In Fig. 13 we present a similar analysis for the 77 non-stationary time series, whose results are considerably different relative to the stationary ones. In this scenario, Rep-Holdout shows the best average rank, followed by the blocked cross-validation approaches. Again, the standard cross-validation approach CV shows the worst score.

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test using the RBR learning method

Figure 14 shows the results of the Bayes signed-rank test. This analysis suggests that Rep-Holdout is a significantly better estimator relative to the other approaches when we are dealing with non-stationary time series.

To summarise, we compared the performance of estimation methods in a large set of real-world time series and controlled for stationarity. The results suggest that, for stationary time series, the blocked cross-validation approach CV-Bl is the best option. However, when the time series are non-stationary, the OOS approach Rep-Holdout is significantly better than the others.

On top of this, the results from the experiments also suggest the following outcomes:

As Tashman ( 2000 ) pointed out, applying the holdout approach in multiple testing periods leads to better performance relative to a single partition of the data set. Specifically, Rep-Holdout shows better performance relative to Holdout in the two scenarios;

Prequential applied in blocks and in a growing window fashion ( Preq-Bls ) is the best prequential approach. Specifically, its average rank is better than the online approaches Preq-Slide and Preq-Grow ; and the blocked approaches in which the window is sliding (as opposed to growing – Preq-Sld-Bls ) or when a gap is introduced between the training and testing sets ( Preq-Bls-Gap ).

In Sect. 1 of the appendix, we show the results for the GLM and RF learning methods. The conclusions drawn from these algorithms are similar to what was reported above.

## 4.2.3 Error bias

In order to study the direction of the estimation error, in Fig. 15 we present, for each method, the (log-scaled) percentage difference between the estimation error and the true error according to the PAE metric. In this graphic, values below the zero line denote under-estimations of error, while values above the zero line represent over-estimations. In general, the estimation methods tend to over-estimate the error (i.e. are pessimistic estimators). The online prequential approaches Preq-Slide and Preq-Grow , and the non-blocked versions of cross-validation CV and CV-Mod , tend to under-estimate the error (i.e. are optimistic estimators).

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the real-world case study, and using the RBR learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error

## 4.2.4 Impact of sample size

Preserving the temporal order of observations, albeit more realistic, comes at a cost, since less data is available for estimating predictive performance. As Bergmeir et al. (2018) argue, this may be important for small data sets, where a more efficient use of the data (e.g. CV ) may be beneficial.

Previous work on time series forecasting has shown that sample size matters when selecting the predictive model (Cerqueira et al. 2019 ). Learning algorithms with a flexible functional form, e.g. decision trees, tend to work better for larger sample sizes relative to traditional forecasting approaches, for example, exponential smoothing.

We carried out an experiment to test whether the sample size of a time series has any effect on the relative performance of the estimation methods. In this analysis, we focus on a subset of 90 time series (out of the 174) with at least 1000 observations. Further, we truncated these to 1000 observations so that all time series have equal size. Then, we proceeded as follows. We repeated the performance estimation methodology described in Sect. 3.3 for increasing sample sizes. In the first iteration, each time series comprises an estimation set of 100 observations and a validation set of the subsequent 100 observations. The methodology is applied under these conditions. In each subsequent iteration, the estimation set grows by 100 observations (to 200, then 300, and so on), and the validation set is the subsequent 100 points. The process is repeated until the time series is fully processed.
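The iterative procedure described above can be sketched as follows; the names and structure are illustrative, not the authors' implementation:

```python
import numpy as np

def growing_window_evaluation(series, step=100, horizon=100):
    """Sketch of the sample-size experiment: the estimation set grows by
    `step` observations per iteration and the following `horizon` points
    form the validation set."""
    results = []
    train_end = step
    while train_end + horizon <= len(series):
        estimation = series[:train_end]
        validation = series[train_end:train_end + horizon]
        # ...here each performance estimation method would be applied on
        # `estimation` and compared against the loss on `validation`...
        results.append((len(estimation), len(validation)))
        train_end += step
    return results

# A 1000-point series yields estimation sets of 100, 200, ..., 900 points
splits = growing_window_evaluation(np.arange(1000))
print(splits[0], splits[-1])  # (100, 100) (900, 100)
```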

Average rank of each performance estimation method with an increasing training sample size

We measure the average rank of each estimation method across the 90 time series after each iteration according to the APAE metric. These scores are illustrated in Fig. 16 . The estimation set size is shown on the x-axis, and the average rank score on the y-axis. Overall, the relative positions of the methods remain stable as the sample size grows. This outcome suggests that sample size is not crucial when selecting the estimation method.

We remark that this analysis is restricted to the sample sizes tested (up to 1000 observations). For large scale data sets the recommendation by Dietterich ( 1998 ), and usually adopted in practice, is to apply a simple out-of-sample estimation procedure ( Holdout ).

## 4.2.5 Descriptive model

What makes an estimation method appropriate for a given time series is related to the characteristics of the data. For example, in the previous section we analyzed the impact that stationarity has on which estimation method performs best.

The real-world time series case study comprises a set of time series from different domains. In this section we present, as a descriptive analysis, a tree-based model that relates characteristics of a time series to the most appropriate estimation method for that series. Essentially, we create a predictive task in which the attributes are characteristics of a time series, and the categorical target variable is the estimation method that best approximates the true loss in that time series. We use the CART (classification and regression tree) algorithm (Breiman 2017) to obtain the model for this task. The characteristics used as predictor variables are the following summary statistics:

Trend , estimated as the ratio between the standard deviation of the time series and the standard deviation of the differenced time series;

Skewness, for measuring the symmetry of the distribution of the time series;

Kurtosis , as a measure of flatness of the distribution of the time series relative to a normal distribution;

5-th and 95-th Percentiles (Perc05, Perc95) of the standardized time series;

Inter-quartile range ( IQR ), as a measure of the spread of the standardized time series;

Serial correlation, estimated using a Box-Pierce test statistic;

Long-range dependence, using a Hurst exponent estimation with wavelet transform;

Maximum Lyapunov Exponent, as a measure of the level of chaos in the time series;

A boolean variable, indicating whether or not the respective time series is stationary according to the wavelet spectrum test (Nason 2013).

These characteristics have been shown to be useful in other problems concerning time series forecasting (Wang et al. 2006 ). The features used in the final decision tree are written in boldface. The decision tree is shown in Fig. 17 . The numbers below the name of the method in each node denote the number of times the respective method is best over the number of time series covered in that node.
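As a hedged sketch of this descriptive analysis, the snippet below computes a subset of the listed summary statistics and fits a CART model via scikit-learn's DecisionTreeClassifier. The feature subset, labels, and random data are placeholders, not the study's actual pipeline:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.tree import DecisionTreeClassifier

def ts_features(x):
    """Illustrative subset of the summary statistics listed above;
    serial correlation, Hurst exponent, etc. are omitted."""
    z = (x - x.mean()) / x.std()  # standardized series
    return {
        "trend": x.std() / np.diff(x).std(),  # sd ratio vs differenced series
        "skewness": skew(x),
        "kurtosis": kurtosis(x),   # excess kurtosis relative to a normal
        "perc05": np.percentile(z, 5),
        "perc95": np.percentile(z, 95),
        "iqr": np.percentile(z, 75) - np.percentile(z, 25),
    }

# Fit a CART model mapping features to the best estimation method
# (X: one feature row per series; y: placeholder "best method" labels)
rng = np.random.default_rng(0)
X = np.array([list(ts_features(rng.normal(size=500)).values()) for _ in range(20)])
y = rng.choice(["CV-Bl", "Rep-Holdout"], size=20)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
```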

Decision tree that maps the characteristics of time series to the most appropriate estimation method. Graphic created using the rpart.plot framework (Milborrow 2018 )

Some of the estimation methods do not appear in the tree model. The tree leaves, which represent a decision, include the estimation methods CV-Bl , Rep-Holdout , Preq-Slide , and Preq-Bls-Gap .

The estimation method in the root node is CV-Bl , the method that is best most often across the 174 time series. The first split is performed according to the kurtosis of the time series. If the kurtosis is not above 2, the tree leads to a leaf node with Preq-Slide as the most appropriate estimation method. Otherwise, the tree continues with further tests to find the most suitable estimation method for each particular scenario.

## 5 Discussion

## 5.1 Impact of the results

In the experimental evaluation we compare several performance estimation methods in two distinct scenarios: (1) a synthetic case study in which artificial data generating processes are used to create stationary time series; and (2) a real-world case study comprising 174 time series from different domains. The synthetic case study is based on the experimental setup used in previous studies by Bergmeir et al. for the same purpose of evaluating performance estimation methods for time series forecasting tasks (Bergmeir and Benítez 2012 ; Bergmeir et al. 2014 , 2018 ).

Bergmeir et al. show in previous studies (Bergmeir and Benitez 2011 ; Bergmeir and Benítez 2012 ) that the blocked form of cross-validation, denoted here as CV-Bl , yields more accurate estimates than a simple out-of-sample evaluation ( Holdout ) for stationary time series forecasting tasks. The method CV is also suggested to be “a better choice than OOS [ Holdout ] evaluation” as long as the data are well fitted by the model (Bergmeir et al. 2018 ). Our experiments partly corroborate these conclusions, as verified by the APAE incurred by the estimation procedures in the synthetic case studies. We also found that, in the synthetic case study, prequential variants provide good estimation ability, often better than that of the cross-validation variants. However, the results in the synthetic stationary case studies do not carry over to real-world time series. On the one hand, we corroborate the conclusion of previous work (Bergmeir and Benítez 2012 ) that blocked cross-validation ( CV-Bl ) is applicable to stationary time series. On the other hand, when dealing with non-stationary data sets, holdout applied with multiple randomized testing periods ( Rep-Holdout ) provides the most accurate performance estimates.

In a real-world environment we are prone to deal with time series with complex structures and different sources of non-stationary variations. These comprise nuances of the future that may not have revealed themselves in the past (Tashman 2000 ). Consequently, we believe that in these scenarios, Rep-Holdout is a better option as performance estimation method relative to cross-validation approaches.

## 5.2 Scope of the real-world case study

In this work we center our study on univariate numeric time series. Nevertheless, we believe that the conclusions of our study are independent of this assumption and should extend to other types of time series. The objective is to predict the next value of the time series, assuming immediate feedback from the environment. Moreover, we focus on time series with a high sampling frequency, for example, hourly or daily data. The main reason is that a high sampling frequency is typically associated with more data, which is important for fitting predictive models from a machine learning point of view. Standard forecasting benchmarks are typically centered around low sampling frequency time series, for example the M competition data (Makridakis et al. 1982 ).

## 5.3 Future work

We showed that stationarity is a crucial time series property to take into account when selecting the performance estimation method. On the other hand, sample size appears not to have a significant effect, though the analysis is restricted to time series of up to 1000 data points.

We also explored whether other time series characteristics may be relevant for performance estimation. We built a descriptive model, which partitions the best estimation method according to different time series characteristics, such as kurtosis or trend. We believe this approach may be interesting for further scientific enquiry. For example, one could leverage this type of model to automatically select the most appropriate performance estimation method. Such a model could be embedded into an automated machine learning framework.

Our conclusions were drawn from a database of 174 time series from distinct domains of application. In future work, it would be interesting to carry out a similar analysis on time series from specific domains, for example, finance. The stock market contains rich financial data which attracts a lot of attention. Studying the most appropriate approach to evaluate predictive models embedded within trading systems is an interesting research direction.

## 6 Final remarks

In this paper we analyse the ability of different methods to approximate the loss that a given predictive model will incur on unseen data. We focus on performance estimation for time series forecasting tasks. Since there is currently no settled approach in these problems, our objective is to compare different available methods and test their suitability.

We analyse several methods that can be generally split into out-of-sample approaches and cross-validation methods. These were applied to two case studies: a synthetic environment with stationary time series and a real-world scenario with potential non-stationarities.

In stationary time series, the blocked cross-validation method ( CV-Bl ) is shown to have a competitive estimation ability. However, when non-stationarities are present, the out-of-sample holdout procedure applied in multiple testing periods ( Rep-Holdout ) is a significantly better choice.

https://github.com/vcerqueira/performance_estimation.

Arlot, S., Celisse, A., et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys , 4 , 40–79.


Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through bayesian analysis. The Journal of Machine Learning Research , 18 (1), 2653–2688.


Bergmeir, C., & Benitez, J.M. (2011) Forecaster performance evaluation with cross-validation and variants. In: 2011 11th international conference on intelligent systems design and applications (ISDA), pp. 849–854. IEEE.

Bergmeir, C., & Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences , 191 , 192–213.


Bergmeir, C., Costantini, M., & Benítez, J. M. (2014). On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis , 76 , 132–143.

Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis , 120 , 70–83.

Breiman, L. (2001). Random forests. Machine Learning , 45 (1), 5–32.


Breiman, L. (2017). Classification and Regression Trees . New York: Routledge.


Brockwell, P.J., & Davis, R.A. (2013). Time series: theory and methods. Springer Science & Business Media, Berlin

Cerqueira, V., Torgo, L., Pinto, F., & Soares, C. (2019). Arbitrage of forecasting experts. Machine Learning , 108 (6), 913–944.

Cerqueira, V., Torgo, L., Smailović, J., & Mozetič, I. (2017) A comparative study of performance estimation methods for time series forecasting. In 2017 IEEE international conference on data science and advanced analytics (DSAA) (pp. 529–538). IEEE.

Cerqueira, V., Torgo, L., & Soares, C. (2019). Machine learning vs statistical methods for time series forecasting: Size matters. arXiv preprint arXiv:1909.13316 .

Dawid, A. P. (1984). Present position and potential developments: Some personal views statistical theory the prequential approach. Journal of the Royal Statistical Society: Series A (General) , 147 (2), 278–290.


Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation , 10 (7), 1895–1923.

Fildes, R. (1989). Evaluation of aggregate and individual forecast method selection rules. Management Science , 35 (9), 1056–1065.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software , 33 (1), 1–22.

Gama, J., Sebastião, R., & Rodrigues, P. P. (2013). On evaluating stream learning algorithms. Machine Learning , 90 (3), 317–346.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR) , 46 (4), 44.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American statistical Association , 70 (350), 320–328.

Hart, J. D., & Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. Journal of the American Statistical Association , 81 (396), 1080–1088.

Hyndman, R., & Yang, Y. (2019). tsdl: Time series data library. https://github.com/FinYang/tsdl.

Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.

Kennel, M. B., Brown, R., & Abarbanel, H. D. (1992). Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A , 45 (6), 3403.

Kuhn, M., Weston, S., & Keefer, C. (2014). Cubist: Rule- and instance-based regression modeling. R package version 0.0.18. (C code for Cubist by Ross Quinlan.)

Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting , 1 (2), 111–153.

McCullagh, P. (2019). Generalized linear models . New York: Routledge.


McQuarrie, A. D., & Tsai, C. L. (1998). Regression and time series model selection . Singapore: World Scientific.

Milborrow, S. (2018). rpart.plot: Plot ’rpart’ Models: An Enhanced Version of ’plot.rpart’. https://CRAN.R-project.org/package=rpart.plot . R package version 3.0.6.

Modha, D. S., & Masry, E. (1998). Prequential and cross-validated regression estimation. Machine Learning , 33 (1), 5–39.

Mozetič, I., Torgo, L., Cerqueira, V., & Smailović, J. (2018). How to evaluate sentiment classifiers for Twitter time-ordered data? PLoS ONE , 13 (3), e0194317.

Nason, G. (2013). A test for second-order stationarity and approximate confidence intervals for localized autocovariances for locally stationary time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 75 (5), 879–904.

Oliveira, M., Torgo, L., & Costa, V.S. (2018) Evaluation procedures for forecasting with spatio-temporal data. In Joint European conference on machine learning and knowledge discovery in databases (pp. 703–718). Berlin: Springer.

Quinlan, J.R. (1993). Combining instance-based and model-based learning. In Proceedings of the tenth international conference on machine learning (pp. 236–243).

Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics , 99 (1), 39–61.

Snijders, T.A. (1988). On cross-validation for predictor evaluation in time series. In On model uncertainty and its statistical implications (pp. 56–69). Berlin: Springer.

Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika (pp. 509–515).

Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980 (pp. 366–381). Berlin, Heidelberg: Springer. https://doi.org/10.1007/BFb0091924.

Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: An analysis and review. International Journal of Forecasting , 16 (4), 437–450.

Wald, A. (1973). Sequential analysis . Philadelphia: Courier Corporation.


Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery , 13 (3), 335–364.

Wright, M. N. (2015). Ranger: A fast implementation of random forests. R package version 0.3.0.


## Acknowledgements

The authors would like to acknowledge the valuable input of anonymous reviewers. The work of V. Cerqueira was financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020. The work of L. Torgo was undertaken, in part, thanks to funding from the Canada Research Chairs program. The work of I. Mozetič was supported by the Slovene Research Agency through research core funding no. P2-103, and by the EC REC project IMSyPP (Grant No. 875263). The results of this publication reflect only the author’s view and the Commission is not responsible for any use that may be made of the information it contains.

## Author information

Authors and Affiliations

LIAAD-INESC TEC, Porto, Portugal

Vitor Cerqueira & Luis Torgo

University of Porto, Porto, Portugal

Dalhousie University, Halifax, Canada

Jozef Stefan Institute, Ljubljana, Slovenia

Igor Mozetič


## Corresponding author

Correspondence to Vitor Cerqueira .

## Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Larisa Soldatova, Joaquin Vanschoren.

## 1.1 Synthetic time series results with GLM and RF

See Figs. 18 , 19 , 20 , 21 , 22 , 23 , 24 .

Average rank and respective standard deviation of each estimation method in case study S1, using the GLM learning method (complement to Fig. 7 )

Average rank and respective standard deviation of each estimation method in case study S1, using the RF learning method (complement to Fig. 7 )

Average rank and respective standard deviation of each estimation method in case study S2, using the GLM learning method (complement to Fig. 8 )

Average rank and respective standard deviation of each estimation method in case study S2, using the RF learning method (complement to Fig. 8 )

Average rank and respective standard deviation of each estimation method in case study S3, using the GLM learning method (complement to Fig. 9 )

Average rank and respective standard deviation of each estimation method in case study S3, using the RF learning method (complement to Fig. 9 )

## 1.2 Real-world time series results with GLM and RF

## 1.2.1 Stationary time series

See Figs. 24 , 25 , 26 , 27 .

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the GLM learning method (complement to Fig. 11 )

Average rank and respective standard deviation of each estimation method in stationary real-world time series using the RF learning method (complement to Fig. 11 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the GLM learning method (complement to Fig. 12 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RF learning method (complement to Fig. 12 )

## 1.2.2 Non-stationary time series

See Figs. 28 , 29 , 30 , 31 .

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the GLM learning method (complement to Fig. 13 )

Average rank and respective standard deviation of each estimation method in non-stationary real-world time series using the RF learning method (complement to Fig. 13 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the GLM learning method (complement to Fig. 14 )

Proportion of probability of the outcome when comparing the performance estimation ability of the respective estimation method with the Rep-Holdout method with non-stationary real-world time series. The probabilities are computed using the Bayes signed-rank test and using the RF learning method (complement to Fig. 14 )

## 1.2.3 Error bias using the GLM and RF methods

See Figs. 32 , 33 .

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the RWTS case study, and using the GLM learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error (complement to Fig. 15 )

Log percentage difference of the estimated loss relative to the true loss for each estimation method in the RWTS case study, and using the RF learning method. Values below the zero line represent under-estimations of error. Conversely, values above the zero line represent over-estimations of error (complement to Fig. 15 )


## About this article

Cerqueira, V., Torgo, L. & Mozetič, I. Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn 109 , 1997–2028 (2020). https://doi.org/10.1007/s10994-020-05910-7


Received : 30 May 2019

Revised : 01 June 2020

Accepted : 25 August 2020

Published : 13 October 2020

Issue Date : November 2020

DOI : https://doi.org/10.1007/s10994-020-05910-7


- Performance estimation
- Model selection
- Cross validation
- Time series
- Forecasting




## Time Series: Recently Published Documents


## Time-Series Prediction in Nodal Networks Using Recurrent Neural Networks and a Pairwise-Gated Recurrent Unit Approach

## Comparison of Active COVID-19 Cases per Population Using Time-Series Models

This research explored the precision of diverse time-series models for COVID-19 epidemic detection in all thirty-six states and the Federal Capital Territory (FCT) in Nigeria, using the maximum daily cumulative counts of confirmed, recovered and death cases of COVID-19 as of 4 November 2020, together with the populace of each state. A 14-step-ahead forecast system for active coronavirus cases was built, analyzed and compared for six different deep learning-based and statistical time-series models using two openly accessible datasets. The results showed that, based on the RMSE metric, the ARIMA model obtained the best values for four of the states (0.002537, 0.001969.12E-058, 5.36E-05 values for Lagos, FCT, Edo and Delta states respectively). While no method is all-encompassing for predicting daily active coronavirus cases for different states in Nigeria, the ARIMA model obtains the highest-ranking prediction performance and attained good results in the other states.

## A Kalman filter based ARX time series modeling for force identification on flexible manipulators

## Long-Term Prediction of Blood Pressure Time Series Using ANFIS System Based on DKFCM Clustering

## Time-Series Forecasting and Analysis of COVID-19 Outbreak in Highly Populated Countries: A Data-Driven Approach

A novel coronavirus, COVID-19, is spreading across different countries in an alarming proportion and has become a major threat to the existence of the human community. With more than eight lakh deaths within a very short span of seven months, this deadly virus has affected more than 24 million people across 213 countries and territories around the world. Time-series analysis, modeling and forecasting is an important research area that explores the hidden insights of large sets of time-bound data for arriving at better decisions. In this work, data analysis on a COVID-19 dataset is performed by comparing the top six populated countries in the world. The data used for the evaluation cover the period from 22nd January 2020 to 23rd August 2020. A novel time-series forecasting approach based on the auto-regressive integrated moving average (ARIMA) model is also proposed. The results will help researchers from the medical and scientific community to gauge the trend of the disease spread and improvise containment strategies accordingly.


## What Will Be the Economic Impact of the New Medical Device Regulation? An Interrupted Time-Series Analysis of Foreign Trade Data

## Measurement of Out-of-Plane Displacement in Time Series Using Laser Lines and a Camera with a Diffraction Grating

## Univariate and Multivariate Time Series Modeling Using a Harmonic Decomposition Methodology

## Multi-Year Monitoring of Land Surface Phenology in Relation to Climatic Variables Using MODIS-NDVI Time-Series in Mediterranean Forest, Northeast Tunisia

## AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks

Time series problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. Bayesian approaches start with an assumption about the data's patterns (a prior probability), collect evidence (e.g., new time series data), and continuously update that assumption to form a posterior probability distribution. Traditional Bayesian approaches like Gaussian processes (GPs) and Structural Time Series are extensively used for modeling time series data, e.g., the commonly used Mauna Loa CO2 dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don't produce reliable confidence intervals.

To that end, we introduce AutoBNN , a new open-source package written in JAX . AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks.

AutoBNN is based on a line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned kernel structures. The kernel function of a GP encodes assumptions about the function being modeled, such as the presence of trends, periodicity or noise. With learned GP kernels, the kernel function is defined compositionally: it is either a base kernel (such as Linear , Quadratic , Periodic , Matérn or ExponentiatedQuadratic ) or a composite that combines two or more kernel functions using operators such as Addition , Multiplication , or ChangePoint . This compositional kernel structure serves two related purposes. First, it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series. Second, techniques like Sequential Monte Carlo can be used for discrete searches over small structures and can output interpretable results.
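The compositional kernel idea can be sketched with scikit-learn's GP kernels (the blog's own implementation is in JAX; here `DotProduct` stands in for the Linear base kernel and `ExpSineSquared` for the Periodic one):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, ExpSineSquared

# Linear trend combined with a period-12 Periodic component via the
# Addition operator; Multiplication would use `*` instead of `+`.
kernel = DotProduct() + ExpSineSquared(periodicity=12.0)

t = np.arange(48, dtype=float).reshape(-1, 1)
y = 0.05 * t.ravel() + np.sin(2 * np.pi * t.ravel() / 12.0)

# optimizer=None keeps the sketch fast by skipping hyperparameter tuning
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None,
                              normalize_y=True).fit(t, y)
mean, std = gp.predict(t, return_std=True)  # forecasts with uncertainty
```

The composed kernel object is itself inspectable, which is the interpretability property the search over structures relies on.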

AutoBNN improves upon these ideas, replacing the GP with Bayesian neural networks (BNNs) while retaining the compositional kernel structure. A BNN is a neural network with a probability distribution over weights rather than a fixed set of weights. This induces a distribution over outputs, capturing uncertainty in the predictions. BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and TPU hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with traditional deep BNNs , which have the ability to do feature discovery. One could imagine "hybrid" architectures, in which users specify a top-level structure of Add ( Linear , Periodic , Deep ), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information.

How, then, does one translate a GP with compositional kernels into a BNN? A single-layer neural network with random weights converges to a GP as the number of neurons (its "width") goes to infinity. More recently, researchers have discovered a correspondence in the other direction: many popular GP kernels (such as Matern, ExponentiatedQuadratic, Polynomial or Periodic) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even at widths far from infinite. For example, the figures below show the difference in the covariance between pairs of observations, and regression results of the true GPs and their corresponding width-10 neural network versions.
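
This correspondence is easy to check numerically in the simplest case. The sketch below (an illustration, not AutoBNN code) samples many finite-width one-layer ReLU networks with standard normal weights and compares their output covariance with the known infinite-width limit, which for scalar inputs of the same sign works out to x·x'/2 (a special case of the first-order arc-cosine kernel):

```python
import numpy as np

rng = np.random.default_rng(0)

width, n_nets = 50, 20000
x1, x2 = 1.0, 2.0

# Sample many finite-width one-layer ReLU BNNs:
#   f(x) = (1/sqrt(width)) * sum_j v_j * relu(w_j * x),  w_j, v_j ~ N(0, 1)
W = rng.standard_normal((n_nets, width))
V = rng.standard_normal((n_nets, width))
f1 = (V * np.maximum(W * x1, 0.0)).sum(axis=1) / np.sqrt(width)
f2 = (V * np.maximum(W * x2, 0.0)).sum(axis=1) / np.sqrt(width)

empirical_cov = np.mean(f1 * f2)
# Infinite-width limit: E[relu(w*x1) * relu(w*x2)] = x1*x2/2 for scalar
# inputs of the same sign.
limit_cov = x1 * x2 / 2.0
print(abs(empirical_cov - limit_cov))  # small even at width 50
```
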

Finally, the translation is completed with BNN analogues of the Addition and Multiplication operators over GPs, and input warping to produce periodic kernels. BNN addition is straightforwardly given by adding the outputs of the component BNNs. BNN multiplication is achieved by multiplying the activations of the hidden layers of the BNNs and then applying a shared dense layer. We are therefore limited to only multiplying BNNs with the same hidden width.
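
A minimal sketch of these two combinators, assuming one-hidden-layer ReLU components (the names and shapes are illustrative, not AutoBNN's API):

```python
import numpy as np

rng = np.random.default_rng(1)
width = 8

def hidden(params, x):
    W, b = params
    return np.maximum(W @ x + b, 0.0)  # one hidden ReLU layer

def make_params():
    return (rng.standard_normal((width, 1)), rng.standard_normal(width))

p_a, p_b = make_params(), make_params()
out_a, out_b = rng.standard_normal(width), rng.standard_normal(width)
shared_out = rng.standard_normal(width)  # shared dense layer for products

x = np.array([0.7])

# Addition: simply add the component BNN outputs.
f_add = out_a @ hidden(p_a, x) + out_b @ hidden(p_b, x)

# Multiplication: multiply hidden activations elementwise, then apply a
# shared dense layer; this is why both BNNs need the same hidden width.
f_mul = shared_out @ (hidden(p_a, x) * hidden(p_b, x))

print(f_add, f_mul)
```
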

## Using AutoBNN

The AutoBNN package is available within TensorFlow Probability. It is implemented in JAX and uses the flax.linen neural network library. It implements all of the base kernels and operators discussed so far (Linear, Quadratic, Matern, ExponentiatedQuadratic, Periodic, Addition, Multiplication) plus one new kernel and three new operators:

- a OneLayer kernel, a single-hidden-layer ReLU BNN,
- a ChangePoint operator that allows smoothly switching between two kernels,
- a LearnableChangePoint operator, which is the same as ChangePoint except that the position and slope are given prior distributions and can be learnt from the data, and
- a WeightedSum operator.

WeightedSum combines two or more BNNs with learnable mixing weights, where the learnable weights follow a Dirichlet prior. By default, a flat Dirichlet distribution with concentration 1.0 is used.

WeightedSums allow a "soft" version of structure discovery, i.e., training a linear combination of many possible models at once. In contrast to structure discovery with discrete structures, such as in AutoGP, this allows us to use standard gradient methods to learn structures, rather than expensive discrete optimization. Instead of evaluating potential combinatorial structures in series, WeightedSum allows us to evaluate them in parallel.
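
To illustrate the "soft" alternative to discrete search, the toy sketch below learns mixture weights over two candidate components with ordinary gradient descent. It is a simplification: it uses a softmax parameterization rather than AutoBNN's Dirichlet prior, and it mixes fixed component predictions rather than full BNNs:

```python
import numpy as np

# Toy soft structure discovery: learn mixture weights over candidate
# component predictions by gradient descent instead of a discrete search.
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * t)                         # ground truth is periodic
F = np.stack([t, np.sin(2 * np.pi * t)], axis=1)  # candidates: linear, periodic

theta = np.zeros(2)  # unconstrained; softmax keeps weights on the simplex
for _ in range(500):
    w = np.exp(theta) / np.exp(theta).sum()
    resid = F @ w - y
    grad_w = 2.0 * F.T @ resid / len(t)
    grad_theta = (np.diag(w) - np.outer(w, w)) @ grad_w  # softmax Jacobian
    theta -= 0.5 * grad_theta

w = np.exp(theta) / np.exp(theta).sum()
print(w)  # nearly all weight lands on the periodic component
```
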

To enable easy exploration, AutoBNN defines a number of model structures that contain either top-level or internal WeightedSums. The names of these models can be used as the first parameter in any of the estimator constructors; they include sum_of_stumps (the WeightedSum over all the base kernels) and sum_of_shallow (which adds all possible combinations of base kernels with all operators).

The figure below demonstrates the technique of structure discovery on series N374 (yearly financial data starting from 1949) from the M3 dataset. The six base structures were the ExponentiatedQuadratic (which is the same as the Radial Basis Function kernel, or RBF for short), Matern, Linear, Quadratic, OneLayer and Periodic kernels. The figure shows the MAP estimates of their weights over an ensemble of 32 particles. All of the high-likelihood particles gave a large weight to the Periodic component, low weights to Linear, Quadratic and OneLayer, and a large weight to either RBF or Matern.

By using WeightedSums as the inputs to other operators, it is possible to express rich combinatorial structures while keeping models compact and the number of learnable weights small. As an example, we include the sum_of_products model (illustrated in the figure below), which first creates a pairwise product of two WeightedSums, and then a sum of the two products. By setting some of the weights to zero, we can create many different discrete structures. The total number of possible structures in this model is 2^16, since there are 16 base kernels that can be turned on or off. All these structures are explored implicitly by training just this one model.
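
The count follows directly from treating each of the 16 base kernels as an independent on/off switch:

```python
from itertools import product

base_kernels = 16  # kernels that can each be zeroed out or kept
# Every on/off assignment is a distinct discrete structure.
structures = sum(1 for _ in product([0, 1], repeat=base_kernels))
print(structures)  # 65536 == 2**16
```
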

We have found, however, that certain combinations of kernels (e.g., the product of Periodic and either Matern or ExponentiatedQuadratic) lead to overfitting on many datasets. To prevent this, we have defined model classes like sum_of_safe_shallow that exclude such products when performing structure discovery with WeightedSums.

For training, AutoBNN provides AutoBnnMapEstimator and AutoBnnMCMCEstimator to perform MAP and MCMC inference, respectively. Either estimator can be combined with any of the six likelihood functions, including four based on normal distributions with different noise characteristics for continuous data and two based on the negative binomial distribution for count data.

To fit a model like the one in the figure above, all it takes is the following 10 lines of code, using the scikit-learn-inspired estimator interface:

AutoBNN provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data. We invite the community to try the colab and leverage this library to innovate and solve real-world challenges.

## Acknowledgements

AutoBNN was written by Colin Carroll, Thomas Colthurst, Urs Köster and Srinivas Vasudevan. We would like to thank Kevin Murphy, Brian Patton and Feras Saad for their advice and feedback.

## ORIGINAL RESEARCH article

Effectiveness of the acute stroke care map program in reducing in-hospital delay for acute ischemic stroke in a Chinese urban area: an interrupted time series analysis.

- 1 Shenyang Tenth People's Hospital, Shenyang, China
- 2 Central Hospital Affiliated to Shenyang Medical College, Shenyang, Liaoning Province, China
- 3 Shenyang First People’s Hospital, Shenyang Brain Institute, Shenyang, Liaoning Province, China

The final, formatted version of the article will be published soon.


Abstract Background: Timely intravenous thrombolysis (IVT) is crucial for improving outcomes in acute ischemic stroke (AIS) patients. This study evaluates the effectiveness of the Acute Stroke Care Map (ASCaM) initiative in Shenyang, aimed at reducing door-to-needle times (DNT) and thus improving the timeliness of care for AIS patients. Methods: A retrospective cohort study was conducted from April 2019 to December 2021 in 30 hospitals participating in the ASCaM initiative in Shenyang. The ASCaM bundle included strategies such as EMS prenotification, rapid stroke triage, on-call stroke neurologists, immediate neuroimaging interpretation, and the innovative Pre-hospital Emergency Call and Location Identification feature. An interrupted time series analysis (ITSA) was used to assess the impact of ASCaM on DNT, comparing nine months pre-intervention with 24 months post-intervention. Results: Data from 9680 IVT-treated ischemic stroke patients were analyzed, including 2401 in the pre-intervention phase and 7279 post-intervention. The ITSA revealed a significant reduction in DNT, with a slope change of −1.12 minutes per month and a level change of −5.727 minutes after ASCaM implementation. Conclusion: The ASCaM initiative significantly reduced in-hospital delays for AIS patients, demonstrating its effectiveness as a comprehensive stroke care improvement strategy in urban settings. These findings highlight the potential of coordinated care interventions to enhance timely access to reperfusion therapies and overall stroke prognosis.

Keywords: acute stroke care map, In-hospital delay, interrupted time-series analysis, acute ischemic stroke, Program

Received: 23 Jan 2024; Accepted: 25 Mar 2024.

Copyright: © 2024 Wen, Wang, Bian, Zhu, Xiao, He, 王, Liu, Shi, Hong and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Miaoran Wang, Central Hospital Affiliated to Shenyang Medical College, Shenyang, Liaoning Province, China Wei Bian, Shenyang First People’s Hospital, Shenyang Brain Institute, Shenyang, 110041, Liaoning Province, China Haoyue Zhu, Shenyang First People’s Hospital, Shenyang Brain Institute, Shenyang, 110041, Liaoning Province, China Ying Xiao, Shenyang First People’s Hospital, Shenyang Brain Institute, Shenyang, 110041, Liaoning Province, China Qian He, Shenyang Tenth People's Hospital, Shenyang, China Xiaoqing Liu, Shenyang Tenth People's Hospital, Shenyang, China Yangdi Shi, Shenyang Tenth People's Hospital, Shenyang, China Zhe Hong, Shenyang First People’s Hospital, Shenyang Brain Institute, Shenyang, 110041, Liaoning Province, China Bing Xu, Shenyang Tenth People's Hospital, Shenyang, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

BMJ Open, v.7(11); 2017

## Application of time series analysis in modelling and forecasting emergency department visits in a medical centre in Southern Taiwan

Wang-Chuan Juang

1 Department of Emergency, Kaohsiung Veterans General Hospital, Kaohsiung, Taiwan

2 Department of Information Management, Shu-Zen Junior College of Medicine and Management, Kaohsiung, Taiwan

## Sin-Jhih Huang

Fong-Dee Huang, Pei-Wen Cheng

3 Department of Medical Education and Research, Kaohsiung Veterans General Hospital, Kaohsiung, Taiwan

4 Department of Physical Therapy, Shu-Zen Junior College of Medicine and Management, Kaohsiung, Taiwan

## Shue-Ren Wann

## Associated data

bmjopen-2017-018628supp001.pdf

Emergency department (ED) overcrowding is acknowledged as an increasingly important issue worldwide. Hospital managers are increasingly paying attention to ED crowding in order to provide higher quality medical services to patients. One of the crucial elements for a good management strategy is demand forecasting. Our study sought to construct an adequate model and to forecast monthly ED visits.

We retrospectively gathered monthly ED visits from January 2009 to December 2016 to carry out a time series autoregressive integrated moving average (ARIMA) analysis. Initial development of the model was based on ED visits from 2009 to 2015. The best-fit model was then employed to forecast monthly ED visits for the following year (2016). Finally, we evaluated the predictive accuracy of the identified model with the mean absolute percentage error (MAPE). The software packages SAS/ETS V.9.4 and Office Excel 2016 were used for all statistical analyses.

A series of statistical tests showed that six models, including ARIMA (0, 0, 1), ARIMA (1, 0, 0), ARIMA (1, 0, 1), ARIMA (2, 0, 1), ARIMA (3, 0, 1) and ARIMA (5, 0, 1), were candidate models. The model that gave the minimum Akaike information criterion and Schwartz Bayesian criterion and followed the assumptions of residual independence was selected as the adequate model. Finally, a suitable ARIMA (0, 0, 1) structure, yielding a MAPE of 8.91%, was identified and obtained as Visit t = 7111.161 + (a t + 0.37462 a t−1 ).

The ARIMA (0, 0, 1) model can be considered adequate for predicting future ED visits, and its forecast results can be used to aid decision-making processes.

## Strengths and limitations of this study

- Fewer than three studies have used the autoregressive integrated moving average (ARIMA) model to forecast emergency department (ED) visits in Taiwan. Our results revealed that ARIMA (0, 0, 1) could predict the upcoming months of 2017 with high accuracy.
- Compared to determining the model composition via the plot pattern, we also applied other algorithms to determine the model (eg, minimum information criterion, smallest canonical correlation and extended sample autocorrelation function).
- The model did not incorporate possible influencing factors related to ED visits, such as atmosphere changes (eg, humidity and temperature), epidemiological information (eg, epidemics) and political issues, including medical policy, resource allocation and healthcare funding strategies.

## Introduction

Emergency department (ED) crowding continues to be a major public health issue. Several studies have demonstrated that ED crowding is harmful to patients, as it leads to delays in treatment, a potential increase in medical errors 1–4 and, most importantly, nosocomial infections. Thus, hospital managers are increasingly paying greater attention to ED crowding in order to provide better quality service and treatment environments for patients.

Crowding in the ED is defined as having more patients than treatment rooms or more patients than staff should ideally care for, and overcrowding is defined as being dangerously crowded, with an extreme volume of patients in ED treatment areas that forces the ED to operate beyond its capacity. 5–7 The ED literature describes several ED crowding metrics in use, such as the rapid emergency admission to destination initiative, the work score, the emergency department work index and the national emergency department overcrowding scale. These metrics involve many factors (ie, the total number of patients in the ED occupying beds, the total number of patients on respirators, the total number of patients awaiting admission, the number of beds in the ED and so on). However, these metrics reflect only the real-time situation or the past degree of crowding; they cannot forecast future ED visits on the basis of statistical theory. Therefore, in order to adjust ED capacity appropriately and to arrange ED resource deployment, we must forecast ED visits for the next month or year.

One of the most important and widely used time series models is the autoregressive integrated moving average (ARIMA) model. This model has been adopted in different research fields (ie, epidemiology, 8 economics 9 and earth science 10 ). With respect to forecasting ED visits, Milner evaluated 10-year follow-up ARIMA forecasts of ED attendances in 1986. 11 Several years later, researchers tried to estimate the association of ED visitor volume with other variables, such as climatic, economic, environmental or demographic factors. For instance, Chen et al applied an ARIMA model to forecast monthly patient visits and to investigate the association of ED visits with temperature and stock index fluctuation. 12 Sun et al applied a multivariate ARIMA method to forecast ED attendances and to examine the relationship between ED attendances and ambient air quality. 13 Furthermore, Mai et al used a multivariate vector-ARIMA technique to show that ED demand is predicted to exceed population growth, especially among patients with complex care needs. 14 However, the application of ARIMA to ED visit forecasting in Taiwan's medical institutions is quite limited. We therefore decided to construct an adequate model to forecast monthly ED visits using ARIMA methodology.

## Materials and methods

We retrospectively collected ED visit data from the hospital information system of a medical centre in Southern Taiwan via the WebFOCUS (Information Builders, New York, New York, USA). We gathered monthly ED visit data for the period of January 2009 through December 2016 (96 months) to carry out the ARIMA model analysis. This modelling consisted of two important procedures: (1) identification of the model, in which the initial development of the model was based on the period from January 2009 to December 2015 and (2) validation of the fit model, in which the best-fit model was used to forecast ED visits during the period from January 2016 to December 2016. Forecasting accuracy was evaluated by comparison of the forecasted visits with the actual visits.

It was important to evaluate the stationarity of the series prior to any estimation. The augmented Dickey-Fuller (ADF) unit root test, carried out by the IDENTIFY statement of the PROC ARIMA procedure, was used to evaluate the stationarity of the series data (ie, whether the mean and covariance remained constant over the period of 96 months). The Ljung-Box test was used to evaluate whether a series of observations over time was random and independent (also called white noise). The next step in the ARIMA methodology is to examine the patterns of the plots of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to determine the components of ARIMA (p, d, q), where p represents the order of the AR part, d represents the order of regular differencing performed and q represents the order of the MA part. 15 Mathematically, the pure ARIMA model was written as follows 16 : W t = μ + (θ(B)/ψ(B)) a t , where t represents the time index; W t represents the response series Y t or a difference of the response series (W t = (1−B)^d Y t ); μ represents the mean term; B represents the backshift operator, that is, B X t = X t−1 ; θ(B) is the moving-average operator, a polynomial in the backshift operator, θ(B) = 1 − θ 1 B − … − θ q B^q ; ψ(B) is the autoregressive operator, a polynomial in the backshift operator, ψ(B) = 1 − ψ 1 B − … − ψ p B^p ; and a t represents the independent disturbance, also called the random error.
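
The ACF-based identification logic used later in the paper can be illustrated by simulating an MA(1) process with the paper's fitted coefficients (unit-variance errors are assumed here for simplicity; the ACF is invariant to the error scale):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate an MA(1) process with the paper's fitted coefficients:
#   Visit_t = 7111.161 + a_t + 0.37462 * a_{t-1}
n, mu, theta = 5000, 7111.161, 0.37462
a = rng.standard_normal(n + 1)
y = mu + a[1:] + theta * a[:-1]

def acf(series, lag):
    d = series - series.mean()
    return np.dot(d[:-lag], d[lag:]) / np.dot(d, d)

# An MA(1) series has a single significant ACF spike at lag 1 and
# approximately zero autocorrelation at all higher lags -- the pattern
# used to pick the moving-average order q = 1.
print([round(acf(y, k), 3) for k in (1, 2, 3)])
```
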

The ACF is a statistical tool that evaluates whether earlier values in the series are associated with later values. PACF captures the amount of correlation between a variable and a lag of the said variable that is not explained by correlation at all low-order lags. 17 In addition, we also used tentative order selection algorithms, including the minimum information criterion (MINIC), smallest canonical correlation (SCAN) and the extended sample autocorrelation function (ESACF), to assist us in filtration of the other tentative ARIMA models. Once completing the filtration, we compared the Akaike information criterion (AIC) and Schwartz Bayesian criterion (SBC) values 18 among the remaining candidate ARIMA models. The adequate forecast model was accompanied by the minimum values for AIC and SBC. After identification of the fitted model, a normality test and independence for model residuals were evaluated (also called autocorrelation to check for white noise). Subsequently, we estimated the parameters to build the ARIMA formula using the maximum likelihood method algorithm. Finally, the prediction accuracy of the model was verified based on the mean absolute percentage error (MAPE), which was defined as follows 19 :

MAPE = (1/n) Σ I=1..n |(X I − X̄ I )/X I | × 100%, where X I denotes the observed ED visits at month I, X̄ I denotes the predicted ED visits at month I and n is the number of forecasted months. All statistical analyses were conducted using the PROC ARIMA procedure in SAS/ETS V.9.4 and Office Excel 2016.
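
As a concrete illustration of the MAPE criterion (the visit counts below are made up for the example, not the study's data):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Hypothetical monthly actual vs forecasted ED visits:
actual = [7200, 6900, 7400, 7100]
predicted = [7000, 7100, 7050, 7300]
print(round(mape(actual, predicted), 2))  # 3.31
```
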

## Temporal analysis

A total of 84 months (from January 2009 to December 2015) of data were imported into the SAS library. An ED visit plot of the original data is shown in figure 1 . The mean, SD and number of observations for the undifferenced series were 7108, 547 and 84, respectively. The observational plot showed a severe oscillation period followed by a mild oscillation period. Subsequently, a moderate oscillation period appeared. In addition, a dramatic elevation in ED visits occurred in January 2012.

Time series plot of emergency department (ED) visits from January 2009 to December 2015. The observational plot showed a severe oscillation period followed by a mild oscillation period. Subsequently, a moderate oscillation period appeared. In addition, a dramatic elevation in ED visits occurred in January 2012.

The ADF unit root test revealed that there were no unit roots in the characteristic polynomial of the model fitted to the ED visit data at lag 0, 1 or 2 (all P<0.001 for the single mean test and the trend test, see online supplementary table 1 ). In addition, the white noise hypothesis was supported (P>0.05), as shown in online supplementary table 2 . Since we assumed that the series data were stationary (ie, the mean, variance and autocorrelation were constant through time), the procedure of difference was ignorable, and the d order of ARIMA was equal to zero.

## Supplementary file 1

## Model fitting

The ACF plot in figure 2A shows a significantly positive spike at lag 1, revealing a non-seasonal MA component of one order (q=1). The PACF plot shown in figure 2B reveals a significantly positive spike at lag 1, suggesting a non-seasonal AR component of one order (p=1). In addition, we found no seasonal lag in the ED visit series data within the period examined. Therefore, the proposed model was ARIMA (1, 0, 1). In addition to relying on the pattern of the ACF and the PACF, we also used the following identification techniques to determine the order of MA and AR. (1) MINIC method: the minimum of this matrix was equal to 12.41575, as highlighted in online supplementary table 3 , corresponding to an AR of 0 and an MA of 1 location (0, 1). (2) SCAN method: there were two (maximal) rectangular regions in which all the elements were insignificant with a 95% CI, as shown in online supplementary table 4 ; this method revealed two vertexes at (0, 1) and (1, 0). (3) ESACF method: there were five triangular regions in which all elements were insignificant at the 5% level; the vertices of the triangles, including (0, 1), (1,1), (2,1), (3,1) and (5,1) are shown in online supplementary table 5 . The P values for the Ljung-Box test all exceeded 5% for all lag orders, implying that there was no significant departure from white noise for the residuals in the six candidate models.

Autocorrelation function (A) and partial autocorrelation function (B) for ED visit time series. The former shows a significantly positive spike at lag 1, revealing a non-seasonal MA component of one order. The latter shown in (B) reveals a significantly positive spike at lag 1, suggesting a non-seasonal AR component of one order (p=1). In addition, we found no seasonal lag in the ED visit series data within the period examined. ACF, autocorrelation function; AR, autoregressive; ED, emergency department; MA, moving average; PACF, partial autocorrelation function.

We then calculated the values of the relative quality measures AIC and SBC for the six candidate ARIMA models, as shown in online supplementary table 6. Hence, the final model selected for the ED visit data was ARIMA (0, 0, 1). Thus, we next sought to verify the adequacy of this model. Figure 3 shows the residual correlation and white noise test plots; we accepted the null hypothesis that the residuals were uncorrelated. The normality plots also showed no departure from normality. Thus, we concluded that the ARIMA (0, 0, 1) was adequate for the ED visit series. The next step was parameter estimation, which made use of the ESTIMATE statement of the PROC ARIMA procedure. The results are reported in table 1. We proceeded to check whether all parameters contained in the model were significant. There were two parameters in the model. The mean term was labelled MU; its estimated value was 7111.2, and the P value for MU indicated that this term contributed significantly to the model (P<0.0001). The moving-average parameter was labelled MA1,1; this was the coefficient of the lagged value of the change in ED visits, and its estimate was −0.37462 (P=0.0003). The ARIMA (0, 0, 1) model was further selected to forecast ED visits, where the autoregressive term p=0, the difference term d=0 and the moving-average term q=1. According to the dataset in table 1, the mathematical formula of the fitted ARIMA (0, 0, 1) model was given by the following: Visit t = 7111.161 + (1 + 0.37462B) a t .

Graphical check of the residuals for the model (0, 0, 1). We accepted the null hypothesis that the residuals were uncorrelated. The normality plots also showed no departure from normality. ACF, autocorrelation function; IACF, interaural cross-correlation function; PACF, partial autocorrelation function.

Parameter estimates for the ARIMA (0, 0, 1) model using the algorithm of maximum likelihood.

ARIMA, autoregressive integrated moving average.

After expanding the backshift operator, the equation became Visit t = 7111.161 + a t + 0.37462 a t−1 .

We used this equation to produce the forecast values and plot from January 2009 to December 2017 using the FORECAST statement of the PROC ARIMA procedure. The trend plot of the forecast performance is shown in figure 4. By comparing the actual visits with the forecasted visits, the MAPE, a measure of the prediction accuracy of forecasting methods in statistics, 20 was approximately 8.91%. Thus, the model's forecasts were highly accurate. The model forecasted a slight increase in ED visits in the upcoming months of 2017, with otherwise constant visits based on the fitted model.

The trend plot of the forecast using the ARIMA (0, 0, 1) model. The solid vertical line in the forecast plot corresponds to January 2016, where the model ends and the forecasts begin. The model forecasted a slight increase in ED visits in the upcoming month in 2017, with constant visits observed based on the fitted model. ARIMA, autoregressive integrated moving average; ED, emergency department.

This study focused on the process and the results of a time series analysis of ED visits. Our results revealed that an ARIMA (0, 0, 1) model was appropriate for forecasting ED visits in the upcoming years 2016 and 2017. In addition, the model’s fitted amounts and the normality of the model residuals’ distribution confirmed the model’s fitness.

Although this was not the first study to forecast monthly ED visits via ARIMA methodology, it was the first such forecast to show the entire statistical process. Our results clearly describe the entire process involved in constructing the model and predicting the data; the process and the results of the ARIMA methodology play equally important roles. In addition to determining the model structure via the patterns of the ACF and PACF, we employed other identification techniques, such as the MINIC, SCAN and ESACF methods. To date, most studies have determined the p and q terms of an ARIMA model only via the plot patterns of the ACF and PACF.

Forecasting future ED visits, as an important issue in ED surveillance systems, plays a key role in providing a comfortable medical environment, preventing ED crowding, helping with ED resource deployment and even avoiding nosocomial infections. In spite of highly accurate forecasting, we found that there were still two ED visit months (February 2011 and January 2012) that fell outside the 95% CI. There are two reasons for this result. The major reason is that an influenza-like illness was prevalent at that time; the Taiwan National Infectious Disease Statistics System showed that there were two dramatic increases during the fifth week of 2011 and the fourth week of 2012 (see online supplementary figure 1 ). The peaks located at these time points are in accordance with the peaks shown in figure 1 . The minor reason was that these peaks occurred during the traditional Chinese New Year period. Since most clinics and hospital outpatient departments are closed during this time, ED visits showed an upward trend.

To our knowledge, only two research teams have applied the ARIMA model to study the variability of monthly ED visitors. 12 13 However, our study suffered from a few limitations. First, we employed a univariate ARIMA model to forecast ED visits. However, the Taiwan National Infectious Disease Statistics System demonstrated that epidemics clearly influence the quantity of ED visits. Additionally, the transmission of influenza virus depends on relative humidity and temperature, 21 and our model did not incorporate such factors related to ED visits, including atmospheric changes (eg, humidity and temperature) and epidemiological information (eg, epidemics). Second, several policy issues, including medical policy, resource allocation and healthcare funding strategies, also merit consideration; our ARIMA model could not reflect their influence. Third, this best-fit model, despite its high forecasting accuracy, is site-specific: it maximises the percentage of variation explained but limits the generalisability of the results. 22 This should be considered when generalising the findings of this study to other hospitals. Finally, regardless of the statistical prediction method, unexpected emergencies (eg, food poisoning in a community or a burn disaster) cannot be forecasted. Once such an accident occurs, medical personnel can only rely on the corresponding emergency recall guidelines. In the future, we will incorporate the abovementioned factors to achieve more accurate predictions via multivariate ARIMA models.

Monthly ED visits were shown to follow an ARIMA (0, 0, 1) model. This model can be considered adequate for predicting future ED visits at a hospital and can be employed to aid in decision-making processes. Our findings provide new insights into the reasons for overcrowding in the ED and may be helpful in the further development of the ARIMA (0, 0, 1) model to forecast ED visits.

## Supplementary Material

## Acknowledgments

The authors gratefully acknowledge the technical assistance and the invaluable input and support of S-JH.

Contributors: WC-J and SR-W conceived and designed the study. FD-H conducted most of the studies with assistance from PW-C. SJ-H wrote the paper with contributions from PW-C. All authors were involved in the revision of the manuscript for important intellectual content.

Funding: This work was supported by grants from the Kaohsiung Veterans General Hospital (VGHKS106-198, VGHKS106-031 and VGHKS106-168) provided to SR-W and WC-J.

Competing interests: None declared.

Ethics approval: The medical ethics committee of the Institutional Review Board of the Kaohsiung Veterans General Hospital agreed (Kaohsiung, Taiwan; IRB no VGHKS17-CT5-15) that the use of emergency department visits from the Disease Surveillance System did not involve personal private information, and the present study was retrospective without any biological experiments related to humans or animals. The committee therefore waived the need for ethics approval for utilisation of the data.

Provenance and peer review: Not commissioned; externally peer reviewed.

Data sharing statement: No additional data are available.


## THIS MEMORY-DEPENDENT PREFETCHER HAS TEETH —

Unpatchable vulnerability in Apple chips leaks secret encryption keys; fixing the newly discovered side channel will likely take a major toll on performance.

Dan Goodin - Mar 21, 2024 2:40 pm UTC

A newly discovered vulnerability baked into Apple’s M-series of chips allows attackers to extract secret keys from Macs when they perform widely used cryptographic operations, academic researchers have revealed in a paper published Thursday.

The flaw—a side channel allowing end-to-end key extractions when Apple chips run implementations of widely used cryptographic protocols—can’t be patched directly because it stems from the microarchitectural design of the silicon itself. Instead, it can only be mitigated by building defenses into third-party cryptographic software that could drastically degrade M-series performance when executing cryptographic operations, particularly on the earlier M1 and M2 generations. The vulnerability can be exploited when the targeted cryptographic operation and the malicious application with normal user system privileges run on the same CPU cluster.

## Beware of hardware optimizations

The threat resides in the chips’ data memory-dependent prefetcher, a hardware optimization that predicts the memory addresses of data that running code is likely to access in the near future. By loading the contents into the CPU cache before it’s actually needed, the DMP, as the feature is abbreviated, reduces latency between the main memory and the CPU, a common bottleneck in modern computing. DMPs are a relatively new phenomenon found only in M-series chips and Intel's 13th-generation Raptor Lake microarchitecture, although older forms of prefetchers have been common for years.

Security experts have long known that classical prefetchers open a side channel that malicious processes can probe to obtain secret key material from cryptographic operations. This vulnerability is the result of the prefetchers making predictions based on previous access patterns, which can create changes in state that attackers can exploit to leak information. In response, cryptographic engineers have devised constant-time programming, an approach that ensures that all operations take the same amount of time to complete, regardless of their operands. It does this by keeping code free of secret-dependent memory accesses or structures.
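The constant-time discipline can be illustrated with a branch-free select and swap. These are sketches of the general pattern, not code from any particular library, and Python integers are used only to show the bit trick (real implementations operate on fixed-width machine words):

```python
def ct_select(bit, a, b):
    """Branch-free select: returns a if bit == 1, b if bit == 0.

    mask is all-ones when bit is 1 (Python's -1 behaves like an unbounded
    two's-complement string of 1s) and all-zeros when bit is 0, so the same
    operations execute for either value of bit: no secret-dependent branch
    or memory access.
    """
    mask = -bit
    return b ^ ((a ^ b) & mask)

def ct_swap(bit, a, b):
    """Branch-free conditional swap, the kind of primitive used in
    constant-time cryptographic code (e.g. Montgomery ladders)."""
    diff = (a ^ b) & -bit
    return a ^ diff, b ^ diff

assert ct_select(1, 0xAA, 0x55) == 0xAA
assert ct_select(0, 0xAA, 0x55) == 0x55
assert ct_swap(1, 7, 9) == (9, 7)
assert ct_swap(0, 7, 9) == (7, 9)
```

The point of the new research is that even code written in this style can leak on Apple silicon, because the DMP inspects the *values* being moved, not just the access pattern.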

The breakthrough of the new research is that it exposes a previously overlooked behavior of DMPs in Apple silicon: Sometimes they confuse memory content, such as key material, with the pointer value that is used to load other data. As a result, the DMP often reads the data and attempts to treat it as an address to perform memory access. This “dereferencing” of “pointers”—meaning the reading of data and leaking it through a side channel—is a flagrant violation of the constant-time paradigm.

The team of researchers consists of:

- Boru Chen, University of Illinois Urbana-Champaign
- Yingchen Wang, University of Texas at Austin
- Pradyumna Shome, Georgia Institute of Technology
- Christopher W. Fletcher, University of California, Berkeley
- David Kohlbrenner, University of Washington
- Riccardo Paccagnella, Carnegie Mellon University
- Daniel Genkin, Georgia Institute of Technology

In an email, they explained:

Prefetchers usually look at addresses of accessed data (ignoring values of accessed data) and try to guess future addresses that might be useful. The DMP is different in this sense as in addition to addresses it also uses the data values in order to make predictions (predict addresses to go to and prefetch). In particular, if a data value “looks like” a pointer, it will be treated as an “address” (where in fact it's actually not!) and the data from this “address” will be brought to the cache. The arrival of this address into the cache is visible, leaking over cache side channels. Our attack exploits this fact. We cannot leak encryption keys directly, but what we can do is manipulate intermediate data inside the encryption algorithm to look like a pointer via a chosen input attack. The DMP then sees that the data value “looks like” an address, and brings the data from this “address” into the cache, which leaks the “address.” We don’t care about the data value being prefetched, but the fact that the intermediate data looked like an address is visible via a cache channel and is sufficient to reveal the secret key over time.

In Thursday’s paper, the team explained it slightly differently:

Our key insight is that while the DMP only dereferences pointers, an attacker can craft program inputs so that when those inputs mix with cryptographic secrets, the resulting intermediate state can be engineered to look like a pointer if and only if the secret satisfies an attacker-chosen predicate. For example, imagine that a program has secret s, takes x as input, and computes and then stores y = s ⊕ x to its program memory. The attacker can craft different x and infer partial (or even complete) information about s by observing whether the DMP is able to dereference y. We first use this observation to break the guarantees of a standard constant-time swap primitive recommended for use in cryptographic implementations. We then show how to break complete cryptographic implementations designed to be secure against chosen-input attacks.



## Five Tips from NASA for Photographing a Total Solar Eclipse

Mara Johnson-Groh

A total solar eclipse creates stunning celestial views for people within the path of the Moon’s shadow. This astronomical event is a unique opportunity for scientists to study the Sun and its influence on Earth, but it’s also a perfect opportunity to capture unforgettable images. Whether you’re an amateur photographer or a selfie master, try out these tips for photographing the eclipse.

#1 – Safety First

Looking directly at the Sun is dangerous to your eyes and your camera. To take images when the Sun is partially eclipsed, you’ll need to use a special solar filter to protect your camera, just as you’ll need a pair of solar viewing glasses (also called eclipse glasses) to protect your eyes. However, at totality, when the Moon completely blocks the Sun, make sure to remove the filter so you can see the Sun’s outer atmosphere – the corona.

#2 – Any Camera Is a Good Camera

Taking a stunning photo has more to do with the photographer than the camera. Whether you have a high-end DSLR or a camera phone, you can take great photos during the eclipse; after all, the best piece of equipment you can have is a good eye and a vision for the image you want to create. If you don’t have a telephoto zoom lens, focus on taking landscape shots and capture the changing environment.

Having a few other pieces of equipment can also come in handy during the eclipse. Using a tripod can help you stabilize the camera and avoid taking blurry images when there is low lighting. Additionally, using a delayed shutter release timer will allow you to snap shots without jiggling the camera.

#3 – Look Up, Down, All Around

While the Sun is the most commanding element of a solar eclipse, remember to look around you. As the Moon slips in front of the Sun, the landscape will be bathed in eerie lighting and shadows. As light filters through the overlapping leaves of trees, it creates natural pinholes that project miniature eclipse replicas on the ground. Anywhere you can point your camera can yield exceptional imagery, so be sure to compose some wide-angle photos that can capture your eclipse experience.

NASA photographer Bill Ingalls recommends focusing on the human experience of watching the eclipse. “The real pictures are going to be of the people around you pointing, gawking, and watching it,” Ingalls noted. “Those are going to be some great moments to capture to show the emotion of the whole thing.”

#4 – Practice

Be sure you know the capabilities of your camera before eclipse day. Most cameras, and even some camera phones, have adjustable exposures, which can help you darken or lighten your image during the tricky eclipse lighting. Make sure you know how to manually focus the camera for crisp shots.

For DSLR cameras, the best way to determine the correct exposure is to test settings on the uneclipsed Sun beforehand. Using a fixed aperture of f/8 to f/16, try shutter speeds from 1/1000 to 1/4 second to find the optimal setting, which you can then use to take images during the partial stages of the eclipse. During totality, the corona has a wide range of brightness, so it’s best to use a fixed aperture and a range of exposures from approximately 1/1000 to 1 second.

#5 – Share!

Share your eclipse experience with friends and family afterwards. Tag @NASA to connect your photos on social media to those taken around the country and share them with NASA.

While you’re snapping those eclipse photos, don’t forget to stop and look at the eclipse with your own eyes. Just remember to wear your solar viewing glasses (also called eclipse glasses) for all stages of the eclipse before and after totality!

Related Links

- Learn more about the 2024 total solar eclipse
- Eclipse Photographers Will Help Study Sun During Its Disappearing Act

By Mara Johnson-Groh, NASA’s Goddard Space Flight Center, Greenbelt, Md.


