
TimeGPT and the Quest for Foundation Forecasting

Time-series foundation models like TimeGPT and TimesFM deliver powerful zero-shot forecasting, but in finance—with heavy tails, volatility clustering, and regime shifts—they excel after fine-tuning. Key to reliable use: strong evaluation, contamination checks, and synthetic data for better tail modeling.

A robot beating the market

Foundation Models

Time-series foundation models (TSFMs) are trained once on very large, heterogeneous collections of sequences and then reused as strong general-purpose forecasters.

They are “foundational” in the sense that the pre-training phase learns broad regularities — seasonality, trend breaks, cross-scale patterns — that transfer to new series and even new domains with little or no task-specific data.

This mirrors what we learned in language and vision: large pre-trained models provide robust zero- or few-shot baselines that can later be adapted to a particular use case.

In time series, Google’s TimesFM is a decoder-only Transformer pre-trained on roughly 100 billion real-world time points and released with checkpoints for fine-tuning; TimeGPT-1 is a commercial TSFM positioned for out-of-the-box forecasting and anomaly detection.

In practice, the appeal is short time-to-value across thousands of series, flexible horizons and frequencies, and a clean path from baseline to fine-tuned model when needed.


Why Finance Is Different

Financial series are unlike most industrial or retail datasets.

Returns are heavy-tailed, show clustered volatility, and undergo frequent regime changes; the predictable component of returns at daily horizons is tiny, while their variance is strongly time-varying.

As a result, volatility models often outperform return-level forecasters when the target is risk, and methods must be tested for stability through turbulent periods.

A useful violation of intuition: lagged values usually help forecasting in demand or energy, but the linear autocorrelations of liquid asset returns are small or near zero, so the naive "use yesterday to predict today" approach is weak. What persists is volatility, not the mean.

These stylized facts shape both modelling targets (quantiles and tail metrics rather than means) and evaluation protocols (rolling origins, stress windows).
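A quick simulation makes this stylized fact concrete: in a GARCH(1,1) world, returns themselves are serially uncorrelated while their magnitudes are strongly persistent. This is a minimal sketch with illustrative (uncalibrated) parameters:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D array."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

# Simulate a GARCH(1,1) return series: zero predictable mean,
# but persistent (clustered) volatility. Parameters are illustrative.
rng = np.random.default_rng(0)
n, omega, alpha, beta = 100_000, 0.05, 0.08, 0.90
r = np.empty(n)
sigma2 = omega / (1 - alpha - beta)  # start at the unconditional variance
for t in range(n):
    r[t] = np.sqrt(sigma2) * rng.standard_normal()
    sigma2 = omega + alpha * r[t] ** 2 + beta * sigma2

print(f"lag-1 autocorr of returns:   {lag1_autocorr(r):+.3f}")         # near zero
print(f"lag-1 autocorr of |returns|: {lag1_autocorr(np.abs(r)):+.3f}") # clearly positive
```

Yesterday's return tells you almost nothing about today's return, but yesterday's *magnitude* tells you a lot about today's magnitude, which is exactly why risk targets beat mean targets.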


Two Finance Use Cases

1. Value-at-Risk (VaR)

Goel, Pasricha, and Kanniainen evaluate TimesFM on S&P 100 daily returns over roughly 19 years, with more than 8.5 years out-of-sample.

After fine-tuning, TimesFM delivers better coverage (the ratio of actual to expected exceedances) than classical Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models and the one-factor Generalized Autoregressive Score (GAS) benchmark, and achieves quantile-score performance comparable to the best econometric alternative.

The paper is careful to note that zero-shot use is not optimal for tails — adaptation matters — so the gain is not merely from scale but from targeted fitting.

Critiques apply: the study is on one market and frequency, expected shortfall is not the central focus, and GAS/GARCH remain competitive baselines that risk teams already trust.
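To make the two evaluation criteria concrete, here is a rough illustration (not the paper's code) of exceedance coverage and the quantile (pinball) score for a 1% VaR, checked on simulated Gaussian returns where the true quantile is known:

```python
import numpy as np

def exceedance_coverage(returns, var_forecasts):
    """Fraction of days the realized return breaches the VaR forecast.
    For a 1% VaR, good coverage means this ratio is close to 0.01."""
    return float(np.mean(returns < var_forecasts))

def quantile_score(returns, var_forecasts, q=0.01):
    """Pinball (quantile) loss at level q; lower is better."""
    diff = returns - var_forecasts
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Toy sanity check: N(0,1) returns against the true 1% quantile.
rng = np.random.default_rng(1)
r = rng.standard_normal(50_000)
var_1pct = np.full_like(r, -2.326)  # 1% quantile of N(0,1)
print(f"coverage: {exceedance_coverage(r, var_1pct):.4f}")  # close to 0.01
```

Coverage answers "does the model breach as often as promised?"; the quantile score additionally rewards forecasts that sit close to the true conditional quantile.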

2. Operations at Scale

Operations at scale are another crucial application of new analytic technologies, especially in sectors like banking, where the sheer number of assets to monitor means viable models must remain computationally lean.

TradeSmith reports replacing legacy tree-based models with TimeGPT to forecast 22,000+ financial series daily and 100,000+ forecasts per month, citing higher accuracy and lower latency with minimal tuning.

As impressive as this sounds, from a machine-learning-operations point of view reality asks us to be a tad more conservative.

Experience from forecasting benchmarks cautions that simple or hybrid statistical methods are hard to beat, and claims for Transformers should be checked against strong baselines and careful data handling.


Practical Considerations

The pattern across studies is clear: pre-training helps, but most of the lift arrives after fine-tuning.

That raises two practitioner concerns:

  1. Data Contamination: If a pretrained TSFM has seen your evaluation period (or close proxies) during pretraining, backtests can be subtly biased upward. The Large Language Model (LLM) literature shows how benchmark overlap inflates scores; the same risk exists for TSFMs unless you enforce a pretraining cut-off date and test on instruments or windows demonstrably unseen by the model.

  2. Evaluation Hygiene: Prefer rolling-origin backtests, long out-of-sample spans that include stress, and risk-first diagnostics (coverage tests for VaR/ES). To guard against research debt, apply backtest-overfitting checks such as the probability of backtest overfitting or deflated Sharpe-style adjustments when strategies depend on model forecasts.
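The rolling-origin protocol in point 2 can be sketched as a simple index generator; the window sizes here are arbitrary choices for illustration:

```python
import numpy as np

def rolling_origin_splits(n, initial_train, step, horizon):
    """Yield (train_end, test_start, test_end) index triples for a
    rolling-origin backtest: the forecast origin advances by `step`,
    and each fold forecasts the next `horizon` points. No test point
    ever precedes the end of its training window (no look-ahead)."""
    origin = initial_train
    while origin + horizon <= n:
        yield (origin, origin, origin + horizon)
        origin += step

# Example: 1,000 daily observations, 500-day initial window,
# re-forecast every 20 days with a 20-day horizon.
splits = list(rolling_origin_splits(1_000, 500, 20, 20))
print(len(splits), splits[0], splits[-1])
```

Each fold's model is fit only on data up to `train_end`, so the out-of-sample span naturally sweeps through calm and stressed periods alike.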

These steps do not negate the value of TSFMs; they make their evaluation comparable to robust econometric baselines and keep governance tight when models are used in risk.


Where Synthetic Data Fits in the Picture

Foundation models excel at generalisation, but specialisation requires fine-tuning.

For instance, financial time-series often display:

  • Heavy tails: Extreme returns that standard models underpredict

  • Volatility clustering: Calm periods followed by wild swings

  • Asymmetry and skewness: Deviations from normal distributions

Using high-quality synthetic data that embeds realistic models of market dynamics, TimeGPT and TimesFM can be fine-tuned to learn the unique statistical signature of your domain on a virtually unlimited number of examples and scenarios.
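As a minimal sketch of what "synthetic data with realistic market dynamics" can mean (a classical illustration with uncalibrated parameters, not a diffusion-based production generator), a GARCH(1,1) process with Student-t shocks reproduces both volatility clustering and heavy tails:

```python
import numpy as np

def simulate_heavy_tailed_series(n, df=5.0, omega=0.05, alpha=0.08,
                                 beta=0.85, seed=0):
    """Synthetic return series: GARCH(1,1) variance dynamics plus
    Student-t innovations yield volatility clustering and heavy tails.
    Parameters are illustrative, not calibrated to any market."""
    rng = np.random.default_rng(seed)
    r = np.empty(n)
    sigma2 = omega / (1 - alpha - beta)   # unconditional variance
    scale = np.sqrt((df - 2) / df)        # rescale t shocks to unit variance
    for t in range(n):
        z = rng.standard_t(df) * scale
        r[t] = np.sqrt(sigma2) * z
        sigma2 = omega + alpha * r[t] ** 2 + beta * sigma2
    return r

r = simulate_heavy_tailed_series(200_000)
excess_kurtosis = np.mean(r**4) / np.mean(r**2) ** 2 - 3
print(f"excess kurtosis: {excess_kurtosis:.1f}")  # well above the Gaussian value of 0
```

Fine-tuning on draws like these exposes the model to far more tail events than any single historical sample contains, which is precisely where zero-shot TSFMs are weakest.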


Suggested Reads

This post is based on the following papers and industry studies, which we strongly recommend to the reader:

  • A Decoder-Only Foundation Model for Time-Series Forecasting. Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou. ICML 2024 (arXiv:2310.10688).

  • A Decoder-Only Foundation Model for Time-Series Forecasting. Google Research blog, 2024.

  • TimeGPT-1. Azul Garza, Cristian Challu, Max Mergenthaler-Canseco. arXiv:2310.03589, 2023.

  • Time-Series Foundation AI Model for Value-at-Risk Forecasting. Anubha Goel, Puneet Pasricha, Juho Kanniainen. arXiv:2410.11773, 2024–2025.

  • Generalized Autoregressive Score Models with Applications. Drew Creal, Siem Jan Koopman, André Lucas. Journal of Applied Econometrics, 2013.

  • The M4 Competition: 100,000 Time Series and 61 Forecasting Methods. Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos. International Journal of Forecasting, 2020.

  • M5 Accuracy Competition: Results, Findings, and Conclusions. Spyros Makridakis et al. International Journal of Forecasting, 2022.

  • Are Transformers Effective for Time Series Forecasting? Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu. arXiv/AAAI, 2022–2023.

  • AI-Powered Investment Forecasting for Data-Driven Decisions (TradeSmith use case). Nixtla success story, 2025.

  • Unveiling the Spectrum of Data Contamination in Large Language Models. Chengpeng Deng et al. Findings of ACL, 2024.

  • Investigating Data Contamination for Pre-training Language Models. Minhao Jiang et al. arXiv:2401.06059, 2024.

  • Assessing Look-Ahead Bias in Stock Return Predictions with Large Language Models. Paul Glasserman et al. arXiv:2309.17322, 2023.


Research Infrastructure for Markets Beyond Historical Data

Diffusion-based generative models that simulate realistic cross-asset market environments, enabling robust strategy validation beyond the limits of history.


Institutional research infrastructure for robust strategy validation beyond historical data.

LinkedIn

Copyright © 2026 Ahead Innovation Laboratories GmbH. All Rights Reserved
