
TimeGPT and the Quest for Foundation Forecasting

Time-series foundation models like TimeGPT and TimesFM deliver powerful zero-shot forecasting, but in finance—with heavy tails, volatility clustering, and regime shifts—they excel after fine-tuning. Key to reliable use: strong evaluation, contamination checks, and synthetic data for better tail modeling.

A robot beating the market

Foundation Models

Time-series foundation models (TSFMs) are trained once on very large, heterogeneous collections of sequences and then reused as strong general-purpose forecasters.

They are “foundational” in the sense that the pre-training phase learns broad regularities — seasonality, trend breaks, cross-scale patterns — that transfer to new series and even new domains with little or no task-specific data.

This mirrors what we learned in language and vision: large pre-trained models provide robust zero- or few-shot baselines that can later be adapted to a particular use case.

In time series, Google’s TimesFM is a decoder-only Transformer pre-trained on roughly 100 billion real-world time points and released with checkpoints for fine-tuning; TimeGPT-1 is a commercial TSFM positioned for out-of-the-box forecasting and anomaly detection.

In practice, the appeal is short time-to-value across thousands of series, flexible horizons and frequencies, and a clean path from baseline to fine-tuned model when needed.


Why Finance Is Different

Financial series are unlike most industrial or retail datasets.

Returns are heavy-tailed, show clustered volatility, and undergo frequent regime changes; the predictable component of returns at daily horizons is tiny, while their variance is strongly time-varying.

As a result, volatility models often outperform return-level forecasters when the target is risk, and methods must be tested for stability through turbulent periods.

A useful violation of intuition: lagged values usually help forecasting in demand or energy, but the linear autocorrelations of liquid asset returns are small or near zero, so the naive "use yesterday to predict today" approach is weak. What persists is volatility, not the mean.

These stylized facts shape both modelling targets (quantiles and tail metrics rather than means) and evaluation protocols (rolling origins, stress windows).
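A quick simulation makes this stylized fact concrete: in a GARCH(1,1) world, returns themselves are serially uncorrelated while their magnitudes are strongly persistent. This is a minimal sketch with illustrative (uncalibrated) parameters:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D array."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

# Simulate a GARCH(1,1) return series: zero predictable mean,
# but persistent (clustered) volatility. Parameters are illustrative.
rng = np.random.default_rng(0)
n, omega, alpha, beta = 100_000, 0.05, 0.08, 0.90
r = np.empty(n)
sigma2 = omega / (1 - alpha - beta)  # start at the unconditional variance
for t in range(n):
    r[t] = np.sqrt(sigma2) * rng.standard_normal()
    sigma2 = omega + alpha * r[t] ** 2 + beta * sigma2

print(f"lag-1 autocorr of returns:   {lag1_autocorr(r):+.3f}")         # near zero
print(f"lag-1 autocorr of |returns|: {lag1_autocorr(np.abs(r)):+.3f}") # clearly positive
```

Yesterday's return tells you almost nothing about today's return, but yesterday's *magnitude* tells you a lot about today's magnitude, which is exactly why risk targets beat mean targets.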


Two Finance Use Cases

1. Value-at-Risk (VaR)

Goel, Pasricha, and Kanniainen evaluate TimesFM on S&P 100 daily returns over roughly 19 years, with more than 8.5 years out-of-sample.

After fine-tuning, TimesFM delivers better coverage (the ratio of actual to expected exceedances) than classical Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models and the one-factor Generalized Autoregressive Score (GAS) benchmark, and achieves quantile-score performance comparable to the best econometric alternative.

The paper is careful to note that zero-shot use is not optimal for tails — adaptation matters — so the gain is not merely from scale but from targeted fitting.

Critiques apply: the study is on one market and frequency, expected shortfall is not the central focus, and GAS/GARCH remain competitive baselines that risk teams already trust.
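To make the two evaluation criteria concrete, here is a rough illustration (not the paper's code) of exceedance coverage and the quantile (pinball) score for a 1% VaR, checked on simulated Gaussian returns where the true quantile is known:

```python
import numpy as np

def exceedance_coverage(returns, var_forecasts):
    """Fraction of days the realized return breaches the VaR forecast.
    For a 1% VaR, good coverage means this ratio is close to 0.01."""
    return float(np.mean(returns < var_forecasts))

def quantile_score(returns, var_forecasts, q=0.01):
    """Pinball (quantile) loss at level q; lower is better."""
    diff = returns - var_forecasts
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Toy sanity check: N(0,1) returns against the true 1% quantile.
rng = np.random.default_rng(1)
r = rng.standard_normal(50_000)
var_1pct = np.full_like(r, -2.326)  # 1% quantile of N(0,1)
print(f"coverage: {exceedance_coverage(r, var_1pct):.4f}")  # close to 0.01
```

Coverage answers "does the model breach as often as promised?"; the quantile score additionally rewards forecasts that sit close to the true conditional quantile.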

2. Operations at Scale

Operations at scale are another crucial application of new analytic technologies, especially in sectors like banking, where the sheer number of assets to monitor means viable models must remain computationally lean.

TradeSmith reports replacing legacy tree-based models with TimeGPT to forecast 22,000+ financial series daily and 100,000+ forecasts per month, citing higher accuracy and lower latency with minimal tuning.

As impressive as this sounds, from a machine-learning-operations point of view reality asks us to be a tad more conservative.

Experience from forecasting benchmarks cautions that simple or hybrid statistical methods are hard to beat, and claims for Transformers should be checked against strong baselines and careful data handling.


Practical Considerations

The pattern across studies is clear: pre-training helps, but most of the lift arrives after fine-tuning.

That raises two practitioner concerns:

  1. Data Contamination: If a pretrained TSFM has seen your evaluation period (or close proxies) during pretraining, backtests can be subtly biased upward. The Large Language Model (LLM) literature shows how benchmark overlap inflates scores; the same risk exists for TSFMs unless you enforce a pretraining cut-off date and test on instruments or windows demonstrably unseen by the model.

  2. Evaluation Hygiene: Prefer rolling-origin backtests, long out-of-sample spans that include stress, and risk-first diagnostics (coverage tests for VaR/ES). To guard against research debt, apply backtest-overfitting checks such as the probability of backtest overfitting or deflated Sharpe-style adjustments when strategies depend on model forecasts.
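The rolling-origin protocol in point 2 can be sketched as a simple index generator; the window sizes here are arbitrary choices for illustration:

```python
import numpy as np

def rolling_origin_splits(n, initial_train, step, horizon):
    """Yield (train_end, test_start, test_end) index triples for a
    rolling-origin backtest: the forecast origin advances by `step`,
    and each fold forecasts the next `horizon` points. No test point
    ever precedes the end of its training window (no look-ahead)."""
    origin = initial_train
    while origin + horizon <= n:
        yield (origin, origin, origin + horizon)
        origin += step

# Example: 1,000 daily observations, 500-day initial window,
# re-forecast every 20 days with a 20-day horizon.
splits = list(rolling_origin_splits(1_000, 500, 20, 20))
print(len(splits), splits[0], splits[-1])
```

Each fold's model is fit only on data up to `train_end`, so the out-of-sample span naturally sweeps through calm and stressed periods alike.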

These steps do not negate the value of TSFMs; they make their evaluation comparable to robust econometric baselines and keep governance tight when models are used in risk.


Where Synthetic Data Fits in the Picture

Foundation models excel at generalisation, but specialisation requires fine-tuning.

For instance, financial time-series often display:

  • Heavy tails: Extreme returns that standard models underpredict

  • Volatility clustering: Calm periods followed by wild swings

  • Asymmetry and skewness: Deviations from normal distributions

Using high-quality synthetic data that embeds realistic models of market dynamics, TimeGPT and TimesFM can be fine-tuned to learn the unique statistical signature of your domain on a virtually unlimited number of examples and scenarios.
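As a minimal sketch of what "synthetic data with realistic market dynamics" can mean (a classical illustration with uncalibrated parameters, not a diffusion-based production generator), a GARCH(1,1) process with Student-t shocks reproduces both volatility clustering and heavy tails:

```python
import numpy as np

def simulate_heavy_tailed_series(n, df=5.0, omega=0.05, alpha=0.08,
                                 beta=0.85, seed=0):
    """Synthetic return series: GARCH(1,1) variance dynamics plus
    Student-t innovations yield volatility clustering and heavy tails.
    Parameters are illustrative, not calibrated to any market."""
    rng = np.random.default_rng(seed)
    r = np.empty(n)
    sigma2 = omega / (1 - alpha - beta)   # unconditional variance
    scale = np.sqrt((df - 2) / df)        # rescale t shocks to unit variance
    for t in range(n):
        z = rng.standard_t(df) * scale
        r[t] = np.sqrt(sigma2) * z
        sigma2 = omega + alpha * r[t] ** 2 + beta * sigma2
    return r

r = simulate_heavy_tailed_series(200_000)
excess_kurtosis = np.mean(r**4) / np.mean(r**2) ** 2 - 3
print(f"excess kurtosis: {excess_kurtosis:.1f}")  # well above the Gaussian value of 0
```

Fine-tuning on draws like these exposes the model to far more tail events than any single historical sample contains, which is precisely where zero-shot TSFMs are weakest.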


Suggested Reads

This post is based on the following papers and industry studies, which we strongly recommend to the reader:

  • A Decoder-Only Foundation Model for Time-Series Forecasting. Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou. ICML 2024 (arXiv:2310.10688).

  • A Decoder-Only Foundation Model for Time-Series Forecasting. Google Research blog, 2024.

  • TimeGPT-1. Azul Garza, Cristian Challu, Max Mergenthaler-Canseco. arXiv:2310.03589, 2023.

  • Time-Series Foundation AI Model for Value-at-Risk Forecasting. Anubha Goel, Puneet Pasricha, Juho Kanniainen. arXiv:2410.11773, 2024–2025.

  • Generalized Autoregressive Score Models with Applications. Drew Creal, Siem Jan Koopman, André Lucas. Journal of Applied Econometrics, 2013.

  • The M4 Competition: 100,000 Time Series and 61 Forecasting Methods. Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos. International Journal of Forecasting, 2020.

  • M5 Accuracy Competition: Results, Findings, and Conclusions. Spyros Makridakis et al. International Journal of Forecasting, 2022.

  • Are Transformers Effective for Time Series Forecasting? Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu. arXiv/AAAI, 2022–2023.

  • AI-Powered Investment Forecasting for Data-Driven Decisions (TradeSmith use case). Nixtla success story, 2025.

  • Unveiling the Spectrum of Data Contamination in Large Language Models. Chengpeng Deng et al. Findings of ACL, 2024.

  • Investigating Data Contamination for Pre-training Language Models. Minhao Jiang et al. arXiv:2401.06059, 2024.

  • Assessing Look-Ahead Bias in Stock Return Predictions with Large Language Models. Paul Glasserman et al. arXiv:2309.17322, 2023.


Research Infrastructure for Markets Beyond Historical Data

Diffusion-based generative models that simulate realistic cross-asset market environments, enabling robust strategy validation beyond the limits of history.


Institutional research infrastructure for robust strategy validation beyond historical data.

LinkedIn

Copyright © 2026 Ahead Innovation Laboratories GmbH. All Rights Reserved
