Time Series Modelling Paradigms
Multiple neural architecture families have been explored for encoding physiological time series, with a particular focus on EEG. The table below spans a wide range of models, from small supervised baselines to large foundation models pre-trained on thousands of hours of unlabelled EEG.
Overview
| Model | Year | Paradigm | Pre-training Objective | Scale | Key Contribution |
|---|---|---|---|---|---|
| EEGNet | 2018 | CNN | Supervised | - | Compact depthwise-separable CNN baseline |
| GREEN | 2024 | CNN + Riemannian | Supervised | - | Learnable Gabor wavelets; interpretable frequency bands |
| EEGNeX | 2023 | CNN | Supervised | - | Strong reproducible ConvNet baseline |
| EEG-NeXt | 2022 | CNN | Supervised | - | ConvNeXt design principles applied to EEG |
| LMU | 2019 | Recurrent | Supervised | - | Legendre polynomial long-range memory |
| xLSTM | 2024 | Recurrent | Supervised | - | Extended LSTM with linear complexity |
| BENDR | 2021 | Conv + Transformer | Contrastive SSL | Moderate | First EEG contrastive SSL; cross-dataset transfer |
| LEAD | 2025 | Transformer | Contrastive SSL | Large | Dual-level contrastive; 813-subject Alzheimer's corpus |
| LaBraM | 2024 | Transformer | Masked spectrum prediction | Large (~2500 h) | First large-scale EEG FM; patch tokenisation |
| CBraMod | 2024 | Transformer | Masked reconstruction | Large | Criss-cross attention; 10 downstream tasks |
| EEGPT | 2024 | Transformer | Autoregressive | - | GPT-style; universal BCI representations |
| BIOT | 2023 | Transformer | Cross-data pre-training | Moderate | Multi-modality (EEG, ECG, EMG) in one model |
| GEFM | 2024 | GNN + Masked AE | Masked reconstruction | Moderate | First GNN-MAE hybrid for EEG |
| FoME | 2024 | Transformer | Masked reconstruction | Very large (1.7 TB) | 745M params; scalp + intracranial EEG |
| Beatrix | 2024 | Transformer | Spectral tokenisation | Moderate | OoD generalisation via invariant contrastive fine-tuning |
| CEReBrO | 2025 | Transformer | Masked reconstruction | Large (20k+ h) | Alternating attention; 3.6M-85M params |
| CodeBrain | 2025 | SSM + Transformer | Masked reconstruction | Large | TFDual-Tokenizer; brain small-world topology |
| UniEEG | 2025 | Transformer | Electrode-wise TF masking | Large (~20 datasets) | Electrode-wise time-frequency masking |
| LUNA | 2025 | Transformer | Masked reconstruction | Very large (21k+ h) | Topology-agnostic; 300x fewer FLOPs |
| DIVER-0 | 2025 | Transformer | RoPE + binary attention | Moderate | Fully channel-equivariant; STCPE |
| S-JEPA | 2024 | JEPA | Joint-embedding prediction | Moderate | First JEPA for EEG; spatial block masking |
| FEMBA | 2025 | Mamba (SSM) | Self-supervised SSM | Large (21k+ h) | Linear complexity; 7.8M edge variant |
| PhysioWave | 2025 | Wavelet-Transformer | Multi-scale wavelet SSL | Large | Multi-modal (EMG, ECG, EEG); learnable fusion |
| Brant-X | 2024 | EEG FM + alignment | Two-level alignment | Large | Cross-modal alignment: EEG to ECG/EMG/eye |
| Fractal-SNN | - | Spiking | Supervised | - | Fractal dynamics; emotion recognition |
| SLRC | 2023 | Spiking | Supervised | - | Legendre reservoir computing + spiking |
| SGLNet | 2023 | Spiking + GCN | Supervised | - | Graph + spiking dynamics for BCI |
| CTM | 2025 | Recurrent | Supervised | - | Internal deliberation steps |
Convolutional Architectures
Convolutional networks apply learnable filters directly to raw time series, making them efficient at extracting local temporal and spectral features. They tend to be fast, interpretable, and strong supervised baselines, though they do not scale as easily to large self-supervised pre-training.
EEGNet
Lawhern et al. (2018) - Paper ·
aliasvishnu/EEGNet · braindecode.models.EEGNet
A compact CNN designed specifically for BCI classification. Uses depthwise separable convolutions to reduce parameter count while maintaining strong performance across diverse EEG paradigms. EEGNet's small footprint makes it a reliable baseline and a practical choice for edge deployment.
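The parameter saving behind depthwise-separable convolutions is easy to quantify. The channel and kernel sizes below are illustrative, not EEGNet's actual configuration:

```python
# Parameter-count comparison: standard vs depthwise-separable convolution.
# Illustrative layer sizes only, not EEGNet's exact architecture.

def standard_conv_params(c_in, c_out, k):
    """Full convolution: every output channel mixes all input channels."""
    return c_in * c_out * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise (one k-tap filter per input channel) + pointwise 1x1 mix."""
    return c_in * k + c_in * c_out

# e.g. 64 -> 64 channels with a 15-tap temporal kernel
full = standard_conv_params(64, 64, 15)   # 61,440 parameters
sep = separable_conv_params(64, 64, 15)   # 5,056 parameters
print(f"standard: {full}, separable: {sep}, ratio: {full / sep:.1f}x")
```

The same factorisation underlies EEGNet's small footprint: the depthwise stage learns per-channel temporal filters while the pointwise stage learns how to combine them.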

GREEN (Gabor Riemann EEG Net)
Combines learnable Gabor wavelets with Riemannian geometry. The front-end applies parametrised Gabor filters whose carrier frequency \(f\) and Gaussian width \(\sigma_t\) are both learned end-to-end.
GREEN accepts windowed input of shape \((\lfloor T/t \rfloor, C, t)\) and learns sparse, interpretable frequency representations. Its interpretability makes it attractive for understanding which frequency bands drive predictions.
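For reference, a standard complex Gabor parametrisation with these two learnable quantities (GREEN's exact normalisation may differ) is:

\[
\psi_{f,\sigma_t}(t) \;\propto\; \exp\!\left(-\frac{t^2}{2\sigma_t^2}\right)\exp\!\left(i\,2\pi f t\right)
\]

The Gaussian envelope sets the temporal (and hence spectral) resolution via \(\sigma_t\), while the complex carrier selects the centre frequency \(f\).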

EEGNeX
Paper ·
chenxiachan/EEGNeX · braindecode.models.EEGNeX
A benchmark-oriented ConvNet for reliable EEG signal decoding, designed to provide a strong and reproducible baseline across multiple tasks.

EEG-NeXt
A modernised ConvNet bringing ConvNeXt design principles (large kernels, LayerNorm, GELU activations, inverted bottleneck blocks) to EEG classification.
Recurrent & State-Space Architectures
Recurrent networks maintain a hidden state over time, making them well-suited to capturing long-range temporal dependencies. State-space models (SSMs) generalise this with continuous-time dynamics and, in the case of Mamba, achieve linear rather than quadratic complexity. Both families are inherently sequence-length agnostic.
Legendre Memory Units (LMUs)
LMUs compress long-range history into a fixed-size state using Legendre polynomial projections, enabling theoretically infinite memory with bounded computation. The state update is governed by a continuous-time ODE, making LMUs naturally suited to irregularly sampled signals.
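The closed-form Legendre state matrices and an Euler-discretised update fit in a few lines. The matrix formulas follow Voelker et al. (2019); the variable names are ours, not from a specific library:

```python
import numpy as np

# Closed-form LMU state matrices (Voelker et al., 2019): the memory state m
# approximates a sliding window of length theta via Legendre polynomials.

def lmu_matrices(d, theta):
    """Continuous-time (A, B) for a d-dimensional Legendre memory of span theta."""
    Q = np.arange(d)
    R = (2 * Q + 1)[:, None] / theta
    i, j = np.meshgrid(Q, Q, indexing="ij")
    A = np.where(i < j, -1.0, (-1.0) ** (i - j + 1)) * R
    B = ((-1.0) ** Q)[:, None] * R
    return A, B

def lmu_step(m, u, A, B, dt=1.0):
    """Forward-Euler update of the memory state for one input sample u."""
    return m + dt * (A @ m + B * u)

A, B = lmu_matrices(d=4, theta=100.0)
m = np.zeros((4, 1))
for u in [1.0, 0.5, -0.2]:
    m = lmu_step(m, u, A, B)
print(m.ravel())
```

In practice the ODE is discretised with a more accurate scheme (e.g. zero-order hold), but the structure of the update is the same.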
Fourier Recurrent Units (FRUs)
FRUs learn long-term dependencies by parameterising the recurrent transition matrix in the Fourier domain, improving gradient flow over very long sequences by keeping eigenvalues near the unit circle.
Continuous Recurrent Units (CRUs)
CRUs treat the hidden state as a continuous dynamical system, making them naturally suited to irregular time series where observations are not uniformly spaced.
Light Recurrent Units (LRUs)
A lightweight, interpretable RNN designed for long-range dependency modelling with a minimal parameterisation.
xLSTM
Beck et al. (NeurIPS 2024) - Paper ·
NX-AI/xlstm
Extended Long Short-Term Memory extends traditional LSTMs with exponential gating and matrix memory cells, achieving transformer-competitive performance on long sequences with linear-time complexity and no attention mechanism.
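The exponential-gating update (with the log-space stabiliser described in the paper) can be sketched for a single scalar cell. Pre-activations that would come from learned projections are plain numbers here:

```python
import math

# Scalar sketch of the sLSTM exponential-gating update from the xLSTM paper,
# with the stabiliser state m that keeps exp() from overflowing.

def slstm_step(c, n, m, i_pre, f_pre, z, o):
    m_new = max(f_pre + m, i_pre)        # stabiliser state (log-space max)
    i = math.exp(i_pre - m_new)          # exponential input gate
    f = math.exp(f_pre + m - m_new)      # exponential forget gate
    c_new = f * c + i * z                # cell state
    n_new = f * n + i                    # normaliser state
    h = o * (c_new / n_new)              # normalised hidden output
    return c_new, n_new, m_new, h

c, n, m = 0.0, 0.0, 0.0
for (i_pre, f_pre, z, o) in [(0.5, 1.0, 0.3, 0.9), (1.5, -0.5, -0.1, 0.8)]:
    c, n, m, h = slstm_step(c, n, m, i_pre, f_pre, z, o)
print(h)
```

Because both gates are exponentials of the (shifted) pre-activations, the ratio c/n stays well-scaled regardless of how large the pre-activations get.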

FEMBA
FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model. The first Mamba-based (state-space model) EEG foundation model. The bidirectional SSM processes EEG in both forward and backward temporal directions with linear time and memory complexity — directly addressing the quadratic scaling of attention-based transformers on long recordings.
Pre-trained on 21,000+ hours of unlabelled EEG. A tiny 7.8M-parameter variant is specifically designed for wearable and resource-constrained deployment. Achieves 81.82% balanced accuracy (0.8921 AUROC) on TUAB abnormality detection and 0.949 AUROC on TUAR artefact detection.
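The linear-time scan at the heart of an SSM can be sketched as follows. The matrices here are fixed random stand-ins, whereas Mamba (and hence FEMBA) learn input-dependent parameters:

```python
import numpy as np

# Toy bidirectional linear state-space scan: h_t = A h_{t-1} + B x_t,
# y_t = C h_t, run forward and backward over time and summed.
# Cost is O(T) in sequence length, unlike O(T^2) self-attention.

def ssm_scan(x, A, B, C):
    """x: (T, d_in) -> y: (T, d_out), linear in T."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 8, 4, 4, 100
A = 0.9 * np.eye(d_state)                  # stable dynamics
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
x = rng.normal(size=(T, d_in))

y_fwd = ssm_scan(x, A, B, C)
y_bwd = ssm_scan(x[::-1], A, B, C)[::-1]   # backward temporal direction
y = y_fwd + y_bwd                          # bidirectional combination
print(y.shape)
```

Real implementations replace the Python loop with a parallel scan, keeping the O(T) work but recovering GPU parallelism.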
Contrastive Self-Supervised Learning
Contrastive objectives train the encoder to produce similar representations for different augmented views of the same recording and dissimilar representations for different recordings. No reconstruction decoder is needed; the objective directly shapes the geometry of the latent space.
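A minimal InfoNCE-style loss over a batch of paired views might look like this. It is a sketch: real pipelines add projection heads, temperature tuning, and much larger batches:

```python
import numpy as np

# InfoNCE over a batch: row i of z1 and row i of z2 are two views of the
# same recording (positives); all other pairs in the batch are negatives.

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, d) view embeddings; returns scalar loss."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # diagonal = positives

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 32))
positive = anchor + 0.05 * rng.normal(size=(16, 32))  # perturbed same view
unrelated = rng.normal(size=(16, 32))                 # different recordings
print(info_nce(anchor, positive), info_nce(anchor, unrelated))
```

The loss is low when matching views sit close in the latent space and high otherwise, which is exactly the geometry the paragraph above describes.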
BENDR
Kostas, Aroca-Ouellette & Rudzicz (2021) - arXiv:2101.12037
BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. The first model to adapt language-modelling techniques — specifically the wav2vec 2.0 contrastive SSL approach from automatic speech recognition — to EEG. A convolutional feature encoder compresses raw EEG into latent representations, and a transformer contextualises these. Pre-training uses a contrastive objective: the model must identify the true future representation from a set of distractors.
A single pre-trained BENDR model was shown to generalise to novel EEG sequences recorded with different hardware and different subjects performing different tasks — the first demonstration of cross-dataset EEG transfer.
LEAD
LEAD: Large Foundation Model for EEG-Based Alzheimer's Disease Detection. Pre-trained on 11 EEG datasets using dual-level contrastive pre-training (sample-level and subject-level), with unified channel-aligned fine-tuning. The largest EEG Alzheimer's corpus to date, comprising 813 subjects, was curated for evaluation. Achieves F1 improvements of 9.86% (sample-level) and 9.31% (subject-level) over the previous state of the art.
Masked Transformer Foundation Models
These large transformer-based models are the primary candidates for Brain FM pre-training. The dominant approach is the masked autoencoder paradigm: randomly mask patches of the input signal and train the encoder-decoder to reconstruct them, forcing the encoder to learn semantically rich representations from unmasked context alone.
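The objective can be sketched end to end in a few lines. The "model" here is a trivial per-channel mean predictor standing in for the real encoder-decoder:

```python
import numpy as np

# Skeleton of the masked-autoencoding objective: hide random patches of a
# multichannel signal, predict them from the visible context, and score
# the loss on masked positions only.

rng = np.random.default_rng(0)
C, n_patch, patch_len = 4, 16, 32
x = rng.normal(size=(C, n_patch, patch_len))    # patched EEG

mask = rng.random(n_patch) < 0.5                # ~50% of patches masked
visible = x[:, ~mask, :]

# Placeholder "model": predict each masked patch as the mean visible patch.
pred = np.broadcast_to(
    visible.mean(axis=1, keepdims=True), (C, mask.sum(), patch_len)
)

# Reconstruction loss computed only where the input was hidden.
loss = np.mean((pred - x[:, mask, :]) ** 2)
print(loss)
```

The key design point is that the loss touches only masked positions, so the encoder cannot solve the task by copying; it must infer the hidden content from context.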
LaBraM
Jiang et al. (2024) - arXiv:2405.18765 ·
935963004/LaBraM · braindecode.models.Labram
Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI. The first large-scale EEG foundation model. It segments EEG into channel patches and pre-trains on approximately 2,500 hours from ~20 public datasets using vector-quantized neural spectrum prediction — the model predicts frequency-domain tokens of masked patches rather than raw signal values, directly encouraging band-power-relevant representations.

Achieves state-of-the-art across abnormal detection, event type classification, emotion recognition, and gait prediction.
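How frequency-domain targets differ from raw-signal targets can be sketched as follows. LaBraM additionally vector-quantises the spectra into discrete tokens, which this omits:

```python
import numpy as np

# Sketch of frequency-domain prediction targets in the spirit of neural
# spectrum prediction: for each patch, the target is the amplitude (and
# phase) of its Fourier transform rather than the raw samples.

def spectral_targets(patch, fs=256):
    """patch: (n,) samples -> (freqs, amplitude, phase) of its rFFT."""
    spec = np.fft.rfft(patch)
    freqs = np.fft.rfftfreq(patch.size, d=1.0 / fs)
    return freqs, np.abs(spec), np.angle(spec)

fs = 256
t = np.arange(fs) / fs                 # one second of signal
patch = np.sin(2 * np.pi * 10 * t)     # 10 Hz alpha-band sinusoid
freqs, amp, phase = spectral_targets(patch, fs)
print(freqs[np.argmax(amp)])           # peak at 10 Hz
```

Predicting spectra rather than samples means the model is rewarded for getting band power right, which is the quantity most EEG analyses care about.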
CBraMod
Wang et al. (ICLR 2025) - arXiv:2412.07236
CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding. Introduces a criss-cross transformer with separate spatial attention heads (across channels at each time step) and temporal attention heads (across time steps for each channel), run in parallel and combined. Asymmetric conditional positional encoding handles variable channel counts and sequence lengths via separate per-axis encodings. Evaluated on 10 downstream tasks across 12 public datasets with state-of-the-art performance.
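The axis-factorised attention pattern can be sketched at the shape level. This single-head, projection-free version is our simplification, not CBraMod's actual layers; it only shows how attention runs along one axis of a (channels, time, features) grid at a time:

```python
import numpy as np

# Criss-cross-style attention: one pass attends across channels at each
# time step (spatial), one across time steps for each channel (temporal),
# and the two results are combined.

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """x: (C, T, d). Self-attention along `axis` (0=channels, 1=time)."""
    xm = np.moveaxis(x, axis, 0)           # (L, other, d): attend over L
    scores = np.einsum("i...d,j...d->ij...", xm, xm) / np.sqrt(x.shape[-1])
    w = softmax(scores, axis=1)
    out = np.einsum("ij...,j...d->i...d", w, xm)
    return np.moveaxis(out, 0, axis)

rng = np.random.default_rng(0)
x = rng.normal(size=(19, 50, 16))          # (channels, time, features)
y = axis_attention(x, axis=0) + axis_attention(x, axis=1)
print(y.shape)
```

Factorising attention this way costs O(C²T + CT²) instead of O(C²T²) for full spatio-temporal attention over the flattened grid.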
EEGPT
Wang et al. (NeurIPS 2024) - Paper
A GPT-style autoregressive pre-trained transformer for EEG, producing universal representations that transfer well across diverse BCI paradigms. Unlike masked models, EEGPT predicts the next patch rather than a randomly masked patch — aligning the pre-training objective with the causal structure of temporal data.
BIOT
Yang et al. (NeurIPS 2023) - Paper ·
ycq091044/BIOT · braindecode.models.BIOT
BIOT: Cross-data Biosignal Learning in the Wild. Tokenises biosignal channels into fixed-length segments forming biosignal "sentences", enabling joint pre-training across datasets with different modalities, channel counts, and sequence lengths. Supports EEG, ECG, and human activity sensors within a single architecture, achieving 3-4% improvement over baselines on CHB-MIT seizure detection.

BIOT is the primary reference for the Multimodal Learning design goal of handling heterogeneous biosignal modalities.
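The "biosignal sentence" idea can be sketched as a simple tokeniser. The segment length and the channel/duration figures are illustrative:

```python
import numpy as np

# Channel-wise tokenisation: each channel is sliced into fixed-length
# segments, and every segment becomes one token tagged with its
# (channel, position). Recordings with different channel counts or
# durations all map to variable-length token "sentences".

def tokenise(signal, seg_len):
    """signal: (C, T) -> list of (channel_id, segment_id, segment) tokens."""
    C, T = signal.shape
    n_seg = T // seg_len                   # drop the ragged tail
    tokens = []
    for c in range(C):
        for s in range(n_seg):
            tokens.append((c, s, signal[c, s * seg_len:(s + 1) * seg_len]))
    return tokens

rng = np.random.default_rng(0)
eeg = rng.normal(size=(19, 1000))   # 19-channel EEG
ecg = rng.normal(size=(2, 2500))    # 2-lead ECG, entirely different shape
sentence = tokenise(eeg, 200) + tokenise(ecg, 200)
print(len(sentence))                # 19*5 + 2*12 = 119 tokens
```

Because every token carries the same fixed segment length, heterogeneous signals can be concatenated into one transformer input regardless of their original shapes.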
GEFM
GEFM: Graph-Enhanced EEG Foundation Model. Integrates Graph Neural Networks (GNNs) with a masked autoencoder to capture inter-channel relational structures alongside temporal dynamics. The GNN models the functional connectivity graph of EEG electrodes, while the masked autoencoder provides the pre-training objective. GCN with optimised configurations performs best across three downstream tasks.
FoME
FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention Scaling. One of the largest EEG foundation models at 745M parameters, trained on 1.7 TB of EEG data over 1,096k training steps. Introduces the ATLAS (Adaptive Temporal-Lateral Attention Scaling) mechanism for robust multi-channel modelling that adapts attention patterns to varying signal characteristics. Handles both scalp and intracranial EEG recordings.
Beatrix
Beatrix: Out-of-Distribution Generalisation of Large EEG Model via Invariant Contrastive Fine-Tuning.
A spectral EEG foundation model with a multi-view transformer integrating spectral and temporal information. Pre-training uses analytic wavelet spectral tokenisation — non-stationary dynamics are captured by decomposing signals via analytic wavelets before tokenisation.
The key contribution is Contrastive Invariant Fine-Tuning (CIFT): a fine-tuning procedure that enforces representation invariance across environments without explicit environment labels, substantially improving out-of-distribution generalisation for seizure detection, auditory neural decoding, and motor imagery.
CEReBrO
CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention. Addresses efficiency through alternating attention: alternating between intra-channel temporal attention (within each electrode across time) and inter-channel spatial attention (across electrodes at each time step). This achieves 2x speed improvement and 6x memory reduction versus standard multi-head attention. Available in multiple sizes from 3.6M to 85M parameters, pre-trained on 20,000+ hours of public scalp EEG.
CodeBrain
CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model. Uses a two-stage architecture: a TFDual-Tokenizer that independently tokenises temporal and frequency components of EEG signals (quadratically expanding the discrete representation space), followed by EEGSSM — a state-space model combining global convolution and sliding window attention designed to reflect the brain's small-world topology. Demonstrated on 10 public EEG datasets.
UniEEG
UniEEG: Advancing Universal EEG Representation with Electrode-Wise Time-Frequency Pretraining.
Introduces electrode-wise time-frequency masking: each individual electrode is processed independently with time-frequency transform masking. This makes the model naturally compatible with diverse electrode configurations — any number of electrodes can be passed independently without alignment. Trained on ~20 public EEG datasets and evaluated on 6 distinct EEG task types.
LUNA
Döner et al. - ICML Workshop on Foundation Models for Structured Data
LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis. Achieves topology agnosticism by reconciling disparate electrode geometries through a linear-scaling attention mechanism (avoiding quadratic complexity). The model handles variable montages and channel configurations without any preprocessing for channel alignment. Pre-trained on TUEG + Siena (>21,000 hours).
Key efficiency figures: 300x fewer FLOPs and 10x less GPU memory than attention-based baselines, with 0.921 AUROC on TUAR artefact detection.
DIVER-0
Han et al. (ICML 2025 Workshop on GenBio) - arXiv:2507.14141
DIVER-0: A Fully Channel Equivariant EEG Foundation Model. The first fully channel-equivariant EEG foundation model, maintaining both temporal translation equivariance and channel permutation equivariance. Key contributions:
- Sliding Temporal Conditional Positional Encoding (STCPE): enables arbitrary electrode configurations without position lookup tables.
- Rotary Position Embedding (RoPE) combined with binary attention biases for full spatio-temporal attention.
- Achieves competitive performance with only 10% of pre-training data by leveraging its strong inductive biases.
Output representations are consistent across all channel permutation conditions, making DIVER-0 the reference architecture for the Channel Topology Agnosticism design goal.
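RoPE itself is compact enough to sketch in full. This standalone version is not DIVER-0's implementation (which combines RoPE with binary attention biases inside STCPE); it only demonstrates the relative-position property that motivates it:

```python
import numpy as np

# Minimal rotary position embedding (RoPE): feature pairs are rotated by a
# position-dependent angle, so dot products between rotated queries and
# keys depend only on their relative offset, not absolute positions.

def rope(x, pos, base=10000.0):
    """x: (d,) with even d; returns x rotated for scalar position pos."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq               # one angle per feature pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Relative-position property: score depends on the offset, not absolutes.
s1 = rope(q, 3) @ rope(k, 5)
s2 = rope(q, 103) @ rope(k, 105)
print(np.isclose(s1, s2))              # True: both offsets equal 2
```

This translation equivariance in time is exactly the temporal half of the equivariance pair DIVER-0 targets; channel permutation equivariance is handled separately.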
JEPA (Joint-Embedding Predictive Architecture)
Rather than reconstructing raw signal values or frequency coefficients, JEPA trains the encoder to predict the latent representation of masked patches. A separate target encoder (exponential moving average of the context encoder) provides the prediction targets. This avoids reconstructing high-frequency noise and encourages semantic rather than low-level representations.
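The training signal can be sketched with linear maps standing in for the encoders and an identity map for the predictor:

```python
import numpy as np

# Skeleton of the JEPA training signal: a context encoder embeds visible
# patches, a predictor guesses the latent of masked patches, and targets
# come from an EMA copy of the encoder (no raw-signal reconstruction).

rng = np.random.default_rng(0)
d_in, d_lat = 32, 16
W_ctx = rng.normal(size=(d_lat, d_in)) * 0.1   # context encoder weights
W_tgt = W_ctx.copy()                           # target encoder (EMA copy)

def ema_update(w_tgt, w_ctx, decay=0.996):
    """Target encoder trails the context encoder; no gradients reach it."""
    return decay * w_tgt + (1 - decay) * w_ctx

patch_visible = rng.normal(size=d_in)
patch_masked = rng.normal(size=d_in)

z_ctx = W_ctx @ patch_visible          # context representation
z_pred = z_ctx                         # identity stand-in for the predictor
z_tgt = W_tgt @ patch_masked           # target latent, not raw signal

loss = np.mean((z_pred - z_tgt) ** 2)  # latent-space prediction loss
W_tgt = ema_update(W_tgt, W_ctx)
print(loss)
```

Since the loss lives entirely in latent space, there is no pressure to reproduce sample-level noise, which is the paragraph's central point.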
S-JEPA
Guetschel, Moreau & Tangermann (2024) - arXiv:2403.11772 ·
braindecode.models.SignalJEPA
S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention. Applies the JEPA self-supervised objective to EEG for the first time. Instead of reconstructing masked signal values, the model predicts the latent representation of masked patches from unmasked context. The domain-specific spatial block masking strategy improves cross-dataset transfer by forcing the model to learn representations that generalise across electrode subsets. Evaluated on motor imagery, ERP, and SSVEP paradigms.
Multimodal Foundation Models
These models extend the Brain FM paradigm beyond EEG to multiple physiological modalities, either by training a single model on heterogeneous signals or by aligning modality-specific encoders into a shared latent space.
PhysioWave
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation. Extends the BFM paradigm to EMG and ECG in addition to EEG, providing the first large-scale pre-trained models for EMG and ECG. Each modality is processed by a dedicated branch using multi-scale wavelet decomposition for time-frequency feature extraction; a learnable weighted fusion mechanism combines modality-specific representations. Addresses low SNR and device mismatch through wavelet-based preprocessing.
Brant-X
Zhang et al. (SIGKDD 2024) - arXiv:2409.00122
Brant-X: A Unified Physiological Signal Alignment Framework. Uses a pre-trained EEG foundation model as a backbone and aligns other physiological signals (ECG, EMG, eye movements, etc.) into the same latent space via a two-level semantic alignment framework:
- Sample-level alignment - paired windows recorded simultaneously are aligned to produce similar representations.
- Semantic-level alignment - windows labelled with the same cognitive state (even if not recorded simultaneously) are aligned across modalities.
This addresses the scarcity of simultaneously collected paired data. Downstream tasks include sleep stage classification, emotion recognition, freezing of gait detection, eye movement communication, and arrhythmia detection — all benefiting from EEG knowledge transfer.
Brant-X is the primary reference for the Cross-Modal Alignment research direction.
Spiking (Neuromorphic) Architectures
Spiking Neural Networks (SNNs) are biologically inspired models where neurons communicate via discrete binary spikes rather than continuous activations, offering potential energy efficiency advantages for edge deployment on neuromorphic hardware.
Temporal Recurrent SNNs
Recurrent SNNs with improved temporal learning rules for neuromorphic hardware applications.
Fractal-SNN
An SNN with fractal dynamics applied to EEG-based emotion recognition.

Spiking Legendre Reservoir Computing (SLRC)
Combines reservoir computing with Legendre memory units in a spiking framework for time series classification.

Legendre Spiking Neural Network (LSNN)

SGLNet
An SNN with adaptive graph convolution and LSTM components for BCI tasks, modelling both spiking dynamics and spatial inter-electrode relationships.

Continuous Thought Machines (CTMs)
Darlow et al. (2025) - Paper ·
SakanaAI/continuous-thought-machines
CTMs model internal recurrent "thinking" steps where the network processes the same input over multiple internal time steps before producing an output, analogous to internal deliberation in biological neurons.

Alternative Approaches
Energy-Based Models
Song & Kingma (2021) - How to Train EBMs
Energy-based models define a scalar energy function over the input-label joint space; low energy corresponds to plausible configurations. They are attractive for EEG because they sidestep computing a normalising constant over the input space and can express complex multimodal distributions.
Flow Matching
Lipman et al. (2022) - Paper
A generative modelling approach that learns continuous normalising flows by regressing vector fields rather than maximising likelihood. Applicable to EEG data augmentation and synthesis, expanding labelled training sets.
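The regression objective for the simple linear (optimal-transport) conditional path can be sketched directly; the "model" below is a constant stand-in for a trained vector-field network:

```python
import numpy as np

# Conditional flow matching with the linear path: sample noise x0, data x1,
# and a time t; form x_t = (1 - t) x0 + t x1 and regress the model output
# v(x_t, t) onto the constant conditional target u = x1 - x0.

rng = np.random.default_rng(0)
x1 = rng.normal(loc=3.0, size=(256, 2))   # "data" samples
x0 = rng.normal(size=(256, 2))            # noise samples
t = rng.random((256, 1))                  # uniform times in [0, 1]

xt = (1 - t) * x0 + t * x1                # point on the probability path
u = x1 - x0                               # conditional vector-field target

v = np.full_like(u, 3.0)                  # dummy model: constant guess
loss = np.mean((v - u) ** 2)              # flow-matching regression loss
print(loss)
```

Sampling then amounts to integrating the learned vector field from noise to data, which is what makes the approach usable for EEG synthesis and augmentation.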
Architecture Comparison
| Architecture | Parallelisable | Long-range Deps | SSL-friendly | Edge-efficient |
|---|---|---|---|---|
| CNN (EEGNet, GREEN) | Yes | Limited | Moderate | Yes |
| Recurrent (xLSTM, LMU) | Partial | Strong | Moderate | Yes |
| Mamba SSM (FEMBA) | Yes | Strong | Good | Yes |
| Contrastive Transformer (BENDR, LEAD) | Yes | Strong | Excellent | No |
| Masked Transformer (LaBraM, CEReBrO) | Yes | Strong | Excellent | No |
| JEPA (S-JEPA) | Yes | Strong | Excellent | No |
| Graph + MAE (GEFM) | Partial | Moderate | Good | No |
| Spiking (SLRC, SGLNet) | Partial | Moderate | Emerging | Excellent |