Time Series Modelling Paradigms
Multiple neural architecture families have been explored for encoding physiological time series, with a particular focus on EEG. The table below spans a wide range of models, from small supervised baselines to large foundation models pre-trained on thousands of hours of unlabelled EEG.
Overview
| Model | Year | Paradigm | Pre-training Objective | Scale | Key Contribution |
|---|---|---|---|---|---|
| EEGNet | 2018 | CNN | Supervised | - | Compact depthwise-separable CNN baseline |
| GREEN | 2024 | CNN + Riemannian | Supervised | - | Learnable Gabor wavelets; interpretable frequency bands |
| EEGNeX | 2023 | CNN | Supervised | - | Strong reproducible ConvNet baseline |
| EEG-NeXt | 2022 | CNN | Supervised | - | ConvNeXt design principles applied to EEG |
| LMU | 2019 | Recurrent | Supervised | - | Legendre polynomial long-range memory |
| xLSTM | 2024 | Recurrent | Supervised | - | Extended LSTM with linear complexity |
| BENDR | 2021 | Conv + Transformer | Contrastive SSL | Moderate | First EEG contrastive SSL; cross-dataset transfer |
| LEAD | 2025 | Transformer | Contrastive SSL | Large | Dual-level contrastive; 813-subject Alzheimer's corpus |
| LaBraM | 2024 | Transformer | Masked spectrum prediction | Large (~2500 h) | First large-scale EEG FM; patch tokenisation |
| CBraMod | 2024 | Transformer | Masked reconstruction | Large | Criss-cross attention; 10 downstream tasks |
| EEGPT | 2024 | Transformer | Autoregressive | - | GPT-style; universal BCI representations |
| BIOT | 2023 | Transformer | Cross-data pre-training | Moderate | Multi-modality (EEG, ECG, EMG) in one model |
| GEFM | 2024 | GNN + Masked AE | Masked reconstruction | Moderate | First GNN-MAE hybrid for EEG |
| FoME | 2024 | Transformer | Masked reconstruction | Very large (1.7 TB) | 745M params; scalp + intracranial EEG |
| Beatrix | 2024 | Transformer | Spectral tokenisation | Moderate | OoD generalisation via invariant contrastive fine-tuning |
| CEReBrO | 2025 | Transformer | Masked reconstruction | Large (20k+ h) | Alternating attention; 3.6M-85M params |
| CodeBrain | 2025 | SSM + Transformer | Masked reconstruction | Large | TFDual-Tokenizer; brain small-world topology |
| UniEEG | 2025 | Transformer | Electrode-wise TF masking | Large (~20 datasets) | Electrode-wise time-frequency masking |
| LUNA | 2025 | Transformer | Masked reconstruction | Very large (21k+ h) | Topology-agnostic; 300x fewer FLOPs |
| DIVER-0 | 2025 | Transformer | RoPE + binary attention | Moderate | Fully channel-equivariant; STCPE |
| S-JEPA | 2024 | JEPA | Joint-embedding prediction | Moderate | First JEPA for EEG; spatial block masking |
| FEMBA | 2025 | Mamba (SSM) | Self-supervised SSM | Large (21k+ h) | Linear complexity; 7.8M edge variant |
| PhysioWave | 2025 | Wavelet-Transformer | Multi-scale wavelet SSL | Large | Multi-modal (EMG, ECG, EEG); learnable fusion |
| Brant-X | 2024 | EEG FM + alignment | Two-level alignment | Large | Cross-modal alignment: EEG to ECG/EMG/eye |
| Fractal-SNN | - | Spiking | Supervised | - | Fractal dynamics; emotion recognition |
| SLRC | 2023 | Spiking | Supervised | - | Legendre reservoir computing + spiking |
| SGLNet | 2023 | Spiking + GCN | Supervised | - | Graph + spiking dynamics for BCI |
| CTM | 2025 | Recurrent | Supervised | - | Internal deliberation steps |
Convolutional Architectures
Convolutional networks apply learnable filters directly to raw time series, making them efficient at extracting local temporal and spectral features. They tend to be fast, interpretable, and strong supervised baselines, though they do not scale as easily to large self-supervised pre-training.
EEGNet
Lawhern et al. (2018) - Paper ·
aliasvishnu/EEGNet · braindecode.models.EEGNet
A compact CNN designed specifically for BCI classification. Uses depthwise separable convolutions to reduce parameter count while maintaining strong performance across diverse EEG paradigms. EEGNet's small footprint makes it a reliable baseline and a practical choice for edge deployment.
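The parameter saving behind depthwise-separable convolutions is easy to quantify. The channel and kernel sizes below are illustrative, not EEGNet's actual configuration:

```python
# Parameter-count comparison: standard vs depthwise-separable convolution.
# Illustrative layer sizes only, not EEGNet's exact architecture.

def standard_conv_params(c_in, c_out, k):
    """Full convolution: every output channel mixes all input channels."""
    return c_in * c_out * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise (one k-tap filter per input channel) + pointwise 1x1 mix."""
    return c_in * k + c_in * c_out

# e.g. 64 -> 64 channels with a 15-tap temporal kernel
full = standard_conv_params(64, 64, 15)   # 61,440 parameters
sep = separable_conv_params(64, 64, 15)   # 5,056 parameters
print(f"standard: {full}, separable: {sep}, ratio: {full / sep:.1f}x")
```

The same factorisation underlies EEGNet's small footprint: the depthwise stage learns per-channel temporal filters while the pointwise stage learns how to combine them.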

GREEN (Gabor Riemann EEG Net)
Combines learnable Gabor wavelets with Riemannian geometry. The front-end applies parametrised Gabor filters whose carrier frequency \(f\) and Gaussian width \(\sigma_t\) are both learned end-to-end.
GREEN accepts windowed input of shape \((\lfloor T/t \rfloor, C, t)\) and learns sparse, interpretable frequency representations. Its interpretability makes it attractive for understanding which frequency bands drive predictions.
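For reference, a standard complex Gabor parametrisation with these two learnable quantities (GREEN's exact normalisation may differ) is:

\[
\psi_{f,\sigma_t}(t) \;\propto\; \exp\!\left(-\frac{t^2}{2\sigma_t^2}\right)\exp\!\left(i\,2\pi f t\right)
\]

The Gaussian envelope sets the temporal (and hence spectral) resolution via \(\sigma_t\), while the complex carrier selects the centre frequency \(f\).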

EEGNeX
Paper ·
chenxiachan/EEGNeX · braindecode.models.EEGNeX
A benchmark-oriented ConvNet for reliable EEG signal decoding, designed to provide a strong and reproducible baseline across multiple tasks.

EEG-NeXt
A modernised ConvNet bringing ConvNeXt design principles (large kernels, LayerNorm, GELU activations, inverted bottleneck blocks) to EEG classification.
Recurrent & State-Space Architectures
Recurrent networks maintain a hidden state over time, making them well-suited to capturing long-range temporal dependencies. State-space models (SSMs) generalise this with continuous-time dynamics and, in the case of Mamba, achieve linear rather than quadratic complexity. Both families are inherently sequence-length agnostic.
Legendre Memory Units (LMUs)
LMUs compress long-range history into a fixed-size state using Legendre polynomial projections, enabling theoretically infinite memory with bounded computation. The state update is governed by a continuous-time ODE, making LMUs naturally suited to irregularly sampled signals.
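The closed-form Legendre state matrices and an Euler-discretised update fit in a few lines. The matrix formulas follow Voelker et al. (2019); the variable names are ours, not from a specific library:

```python
import numpy as np

# Closed-form LMU state matrices (Voelker et al., 2019): the memory state m
# approximates a sliding window of length theta via Legendre polynomials.

def lmu_matrices(d, theta):
    """Continuous-time (A, B) for a d-dimensional Legendre memory of span theta."""
    Q = np.arange(d)
    R = (2 * Q + 1)[:, None] / theta
    i, j = np.meshgrid(Q, Q, indexing="ij")
    A = np.where(i < j, -1.0, (-1.0) ** (i - j + 1)) * R
    B = ((-1.0) ** Q)[:, None] * R
    return A, B

def lmu_step(m, u, A, B, dt=1.0):
    """Forward-Euler update of the memory state for one input sample u."""
    return m + dt * (A @ m + B * u)

A, B = lmu_matrices(d=4, theta=100.0)
m = np.zeros((4, 1))
for u in [1.0, 0.5, -0.2]:
    m = lmu_step(m, u, A, B)
print(m.ravel())
```

In practice the ODE is discretised with a more accurate scheme (e.g. zero-order hold), but the structure of the update is the same.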
Fourier Recurrent Units (FRUs)
FRUs learn long-term dependencies by parameterising the recurrent transition matrix in the Fourier domain, improving gradient flow over very long sequences by keeping eigenvalues near the unit circle.
Continuous Recurrent Units (CRUs)
CRUs treat the hidden state as a continuous dynamical system, making them naturally suited to irregular time series where observations are not uniformly spaced.
Light Recurrent Units (LRUs)
A lightweight, interpretable RNN designed for long-range dependency modelling with a minimal parameterisation.
xLSTM
Beck et al. (NeurIPS 2024) - Paper ·
NX-AI/xlstm
Extended Long Short-Term Memory extends traditional LSTMs with exponential gating and matrix memory cells, achieving transformer-competitive performance on long sequences with linear-time complexity and no attention mechanism.
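The exponential-gating update (with the log-space stabiliser described in the paper) can be sketched for a single scalar cell. Pre-activations that would come from learned projections are plain numbers here:

```python
import math

# Scalar sketch of the sLSTM exponential-gating update from the xLSTM paper,
# with the stabiliser state m that keeps exp() from overflowing.

def slstm_step(c, n, m, i_pre, f_pre, z, o):
    m_new = max(f_pre + m, i_pre)        # stabiliser state (log-space max)
    i = math.exp(i_pre - m_new)          # exponential input gate
    f = math.exp(f_pre + m - m_new)      # exponential forget gate
    c_new = f * c + i * z                # cell state
    n_new = f * n + i                    # normaliser state
    h = o * (c_new / n_new)              # normalised hidden output
    return c_new, n_new, m_new, h

c, n, m = 0.0, 0.0, 0.0
for (i_pre, f_pre, z, o) in [(0.5, 1.0, 0.3, 0.9), (1.5, -0.5, -0.1, 0.8)]:
    c, n, m, h = slstm_step(c, n, m, i_pre, f_pre, z, o)
print(h)
```

Because both gates are exponentials of the (shifted) pre-activations, the ratio c/n stays well-scaled regardless of how large the pre-activations get.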

FEMBA
FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model. The first Mamba-based (state-space model) EEG foundation model. The bidirectional SSM processes EEG in both forward and backward temporal directions with linear time and memory complexity — directly addressing the quadratic scaling of attention-based transformers on long recordings.
Pre-trained on 21,000+ hours of unlabelled EEG. A tiny 7.8M-parameter variant is specifically designed for wearable and resource-constrained deployment. Achieves 81.82% balanced accuracy (0.8921 AUROC) on TUAB abnormality detection and 0.949 AUROC on TUAR artefact detection.
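The linear-time scan at the heart of an SSM can be sketched as follows. The matrices here are fixed random stand-ins, whereas Mamba (and hence FEMBA) learn input-dependent parameters:

```python
import numpy as np

# Toy bidirectional linear state-space scan: h_t = A h_{t-1} + B x_t,
# y_t = C h_t, run forward and backward over time and summed.
# Cost is O(T) in sequence length, unlike O(T^2) self-attention.

def ssm_scan(x, A, B, C):
    """x: (T, d_in) -> y: (T, d_out), linear in T."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 8, 4, 4, 100
A = 0.9 * np.eye(d_state)                  # stable dynamics
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
x = rng.normal(size=(T, d_in))

y_fwd = ssm_scan(x, A, B, C)
y_bwd = ssm_scan(x[::-1], A, B, C)[::-1]   # backward temporal direction
y = y_fwd + y_bwd                          # bidirectional combination
print(y.shape)
```

Real implementations replace the Python loop with a parallel scan, keeping the O(T) work but recovering GPU parallelism.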
Contrastive Self-Supervised Learning
Contrastive objectives train the encoder to produce similar representations for different augmented views of the same recording and dissimilar representations for different recordings. No reconstruction decoder is needed; the objective directly shapes the geometry of the latent space.
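A minimal InfoNCE-style loss over a batch of paired views might look like this. It is a sketch: real pipelines add projection heads, temperature tuning, and much larger batches:

```python
import numpy as np

# InfoNCE over a batch: row i of z1 and row i of z2 are two views of the
# same recording (positives); all other pairs in the batch are negatives.

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, d) view embeddings; returns scalar loss."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # diagonal = positives

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 32))
positive = anchor + 0.05 * rng.normal(size=(16, 32))  # perturbed same view
unrelated = rng.normal(size=(16, 32))                 # different recordings
print(info_nce(anchor, positive), info_nce(anchor, unrelated))
```

The loss is low when matching views sit close in the latent space and high otherwise, which is exactly the geometry the paragraph above describes.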
BENDR
Kostas, Aroca-Ouellette & Rudzicz (2021) - arXiv:2101.12037
BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. The first model to adapt language-modelling techniques — specifically the wav2vec 2.0 contrastive SSL approach from automatic speech recognition — to EEG. A convolutional feature encoder compresses raw EEG into latent representations, and a transformer contextualises these. Pre-training uses a contrastive objective: the model must identify the true future representation from a set of distractors.
A single pre-trained BENDR model was shown to generalise to novel EEG sequences recorded with different hardware and different subjects performing different tasks — the first demonstration of cross-dataset EEG transfer.
LEAD
LEAD: Large Foundation Model for EEG-Based Alzheimer's Disease Detection. Pre-trained on 11 EEG datasets using dual-level contrastive pre-training (sample-level and subject-level), with unified channel-aligned fine-tuning. The largest EEG Alzheimer's corpus to date, comprising 813 subjects, was curated for evaluation. Achieves F1 improvements of 9.86% (sample-level) and 9.31% (subject-level) over the previous state of the art.
Masked Transformer Foundation Models
These large transformer-based models are the primary candidates for Brain FM pre-training. The dominant approach is the masked autoencoder paradigm: randomly mask patches of the input signal and train the encoder-decoder to reconstruct them, forcing the encoder to learn semantically rich representations from unmasked context alone.
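The objective can be sketched end to end in a few lines. The "model" here is a trivial per-channel mean predictor standing in for the real encoder-decoder:

```python
import numpy as np

# Skeleton of the masked-autoencoding objective: hide random patches of a
# multichannel signal, predict them from the visible context, and score
# the loss on masked positions only.

rng = np.random.default_rng(0)
C, n_patch, patch_len = 4, 16, 32
x = rng.normal(size=(C, n_patch, patch_len))    # patched EEG

mask = rng.random(n_patch) < 0.5                # ~50% of patches masked
visible = x[:, ~mask, :]

# Placeholder "model": predict each masked patch as the mean visible patch.
pred = np.broadcast_to(
    visible.mean(axis=1, keepdims=True), (C, mask.sum(), patch_len)
)

# Reconstruction loss computed only where the input was hidden.
loss = np.mean((pred - x[:, mask, :]) ** 2)
print(loss)
```

The key design point is that the loss touches only masked positions, so the encoder cannot solve the task by copying; it must infer the hidden content from context.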
LaBraM
Jiang et al. (2024) - arXiv:2405.18765 ·
935963004/LaBraM · braindecode.models.Labram
Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI. The first large-scale EEG foundation model. It segments EEG into channel patches and pre-trains on approximately 2,500 hours from ~20 public datasets using vector-quantized neural spectrum prediction — the model predicts frequency-domain tokens of masked patches rather than raw signal values, directly encouraging band-power-relevant representations.

Achieves state-of-the-art across abnormal detection, event type classification, emotion recognition, and gait prediction.
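How frequency-domain targets differ from raw-signal targets can be sketched as follows. LaBraM additionally vector-quantises the spectra into discrete tokens, which this omits:

```python
import numpy as np

# Sketch of frequency-domain prediction targets in the spirit of neural
# spectrum prediction: for each patch, the target is the amplitude (and
# phase) of its Fourier transform rather than the raw samples.

def spectral_targets(patch, fs=256):
    """patch: (n,) samples -> (freqs, amplitude, phase) of its rFFT."""
    spec = np.fft.rfft(patch)
    freqs = np.fft.rfftfreq(patch.size, d=1.0 / fs)
    return freqs, np.abs(spec), np.angle(spec)

fs = 256
t = np.arange(fs) / fs                 # one second of signal
patch = np.sin(2 * np.pi * 10 * t)     # 10 Hz alpha-band sinusoid
freqs, amp, phase = spectral_targets(patch, fs)
print(freqs[np.argmax(amp)])           # peak at 10 Hz
```

Predicting spectra rather than samples means the model is rewarded for getting band power right, which is the quantity most EEG analyses care about.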
CBraMod
Wang et al. (ICLR 2025) - arXiv:2412.07236
CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding. Introduces a criss-cross transformer with separate spatial attention heads (across channels at each time step) and temporal attention heads (across time steps for each channel), run in parallel and combined. Asymmetric conditional positional encoding handles variable channel counts and sequence lengths via separate per-axis encodings. Evaluated on 10 downstream tasks across 12 public datasets with state-of-the-art performance.
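The axis-factorised attention pattern can be sketched at the shape level. This single-head, projection-free version is our simplification, not CBraMod's actual layers; it only shows how attention runs along one axis of a (channels, time, features) grid at a time:

```python
import numpy as np

# Criss-cross-style attention: one pass attends across channels at each
# time step (spatial), one across time steps for each channel (temporal),
# and the two results are combined.

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """x: (C, T, d). Self-attention along `axis` (0=channels, 1=time)."""
    xm = np.moveaxis(x, axis, 0)           # (L, other, d): attend over L
    scores = np.einsum("i...d,j...d->ij...", xm, xm) / np.sqrt(x.shape[-1])
    w = softmax(scores, axis=1)
    out = np.einsum("ij...,j...d->i...d", w, xm)
    return np.moveaxis(out, 0, axis)

rng = np.random.default_rng(0)
x = rng.normal(size=(19, 50, 16))          # (channels, time, features)
y = axis_attention(x, axis=0) + axis_attention(x, axis=1)
print(y.shape)
```

Factorising attention this way costs O(C²T + CT²) instead of O(C²T²) for full spatio-temporal attention over the flattened grid.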
EEGPT
Wang et al. (NeurIPS 2024) - Paper
A GPT-style autoregressive pre-trained transformer for EEG, producing universal representations that transfer well across diverse BCI paradigms. Unlike masked models, EEGPT predicts the next patch rather than a randomly masked patch — aligning the pre-training objective with the causal structure of temporal data.
BIOT
Yang et al. (NeurIPS 2023) - Paper ·
ycq091044/BIOT · braindecode.models.BIOT
BIOT: Cross-data Biosignal Learning in the Wild. Tokenises biosignal channels into fixed-length segments forming biosignal "sentences", enabling joint pre-training across datasets with different modalities, channel counts, and sequence lengths. Supports EEG, ECG, and human activity sensors within a single architecture, achieving 3-4% improvement over baselines on CHB-MIT seizure detection.

BIOT is the primary reference for the Multimodal Learning design goal of handling heterogeneous biosignal modalities.
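The "biosignal sentence" idea can be sketched as a simple tokeniser. The segment length and the channel/duration figures are illustrative:

```python
import numpy as np

# Channel-wise tokenisation: each channel is sliced into fixed-length
# segments, and every segment becomes one token tagged with its
# (channel, position). Recordings with different channel counts or
# durations all map to variable-length token "sentences".

def tokenise(signal, seg_len):
    """signal: (C, T) -> list of (channel_id, segment_id, segment) tokens."""
    C, T = signal.shape
    n_seg = T // seg_len                   # drop the ragged tail
    tokens = []
    for c in range(C):
        for s in range(n_seg):
            tokens.append((c, s, signal[c, s * seg_len:(s + 1) * seg_len]))
    return tokens

rng = np.random.default_rng(0)
eeg = rng.normal(size=(19, 1000))   # 19-channel EEG
ecg = rng.normal(size=(2, 2500))    # 2-lead ECG, entirely different shape
sentence = tokenise(eeg, 200) + tokenise(ecg, 200)
print(len(sentence))                # 19*5 + 2*12 = 119 tokens
```

Because every token carries the same fixed segment length, heterogeneous signals can be concatenated into one transformer input regardless of their original shapes.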
GEFM
GEFM: Graph-Enhanced EEG Foundation Model. Integrates Graph Neural Networks (GNNs) with a masked autoencoder to capture inter-channel relational structures alongside temporal dynamics. The GNN models the functional connectivity graph of EEG electrodes, while the masked autoencoder provides the pre-training objective. GCN with optimised configurations performs best across three downstream tasks.
FoME
FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention Scaling. One of the largest EEG foundation models at 745M parameters, trained on 1.7 TB of EEG data over 1,096k training steps. Introduces the ATLAS (Adaptive Temporal-Lateral Attention Scaling) mechanism for robust multi-channel modelling that adapts attention patterns to varying signal characteristics. Handles both scalp and intracranial EEG recordings.
Beatrix
Beatrix: Out-of-Distribution Generalisation of Large EEG Model via Invariant Contrastive Fine-Tuning.
A spectral EEG foundation model with a multi-view transformer integrating spectral and temporal information. Pre-training uses analytic wavelet spectral tokenisation — non-stationary dynamics are captured by decomposing signals via analytic wavelets before tokenisation.
The key contribution is Contrastive Invariant Fine-Tuning (CIFT): a fine-tuning procedure that enforces representation invariance across environments without explicit environment labels, substantially improving out-of-distribution generalisation for seizure detection, auditory neural decoding, and motor imagery.
CEReBrO
CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention. Addresses efficiency through alternating attention: alternating between intra-channel temporal attention (within each electrode across time) and inter-channel spatial attention (across electrodes at each time step). This achieves 2x speed improvement and 6x memory reduction versus standard multi-head attention. Available in multiple sizes from 3.6M to 85M parameters, pre-trained on 20,000+ hours of public scalp EEG.
CodeBrain
CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model. Uses a two-stage architecture: a TFDual-Tokenizer that independently tokenises temporal and frequency components of EEG signals (quadratically expanding the discrete representation space), followed by EEGSSM — a state-space model combining global convolution and sliding window attention designed to reflect the brain's small-world topology. Demonstrated on 10 public EEG datasets.
UniEEG
UniEEG: Advancing Universal EEG Representation with Electrode-Wise Time-Frequency Pretraining.
Introduces electrode-wise time-frequency masking: each individual electrode is processed independently with time-frequency transform masking. This makes the model naturally compatible with diverse electrode configurations — any number of electrodes can be passed independently without alignment. Trained on ~20 public EEG datasets and evaluated on 6 distinct EEG task types.
LUNA
Döner et al. - ICML Workshop on Foundation Models for Structured Data
LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis. Achieves topology agnosticism by reconciling disparate electrode geometries through a linear-scaling attention mechanism (avoiding quadratic complexity). The model handles variable montages and channel configurations without any preprocessing for channel alignment. Pre-trained on TUEG + Siena (>21,000 hours).
Key efficiency figures: 300x fewer FLOPs and 10x less GPU memory than attention-based baselines, with 0.921 AUROC on TUAR artefact detection.
DIVER-0
Han et al. (ICML 2025 Workshop on GenBio) - arXiv:2507.14141
DIVER-0: A Fully Channel Equivariant EEG Foundation Model. The first fully channel-equivariant EEG foundation model, maintaining both temporal translation equivariance and channel permutation equivariance. Key contributions:
- Sliding Temporal Conditional Positional Encoding (STCPE): enables arbitrary electrode configurations without position lookup tables.
- Rotary Position Embedding (RoPE) combined with binary attention biases for full spatio-temporal attention.
- Achieves competitive performance with only 10% of pre-training data by leveraging its strong inductive biases.
Output representations are consistent across all channel permutation conditions, making DIVER-0 the reference architecture for the Channel Topology Agnosticism design goal.
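RoPE itself is compact enough to sketch in full. This standalone version is not DIVER-0's implementation (which combines RoPE with binary attention biases inside STCPE); it only demonstrates the relative-position property that motivates it:

```python
import numpy as np

# Minimal rotary position embedding (RoPE): feature pairs are rotated by a
# position-dependent angle, so dot products between rotated queries and
# keys depend only on their relative offset, not absolute positions.

def rope(x, pos, base=10000.0):
    """x: (d,) with even d; returns x rotated for scalar position pos."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq               # one angle per feature pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Relative-position property: score depends on the offset, not absolutes.
s1 = rope(q, 3) @ rope(k, 5)
s2 = rope(q, 103) @ rope(k, 105)
print(np.isclose(s1, s2))              # True: both offsets equal 2
```

This translation equivariance in time is exactly the temporal half of the equivariance pair DIVER-0 targets; channel permutation equivariance is handled separately.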
JEPA (Joint-Embedding Predictive Architecture)
Rather than reconstructing raw signal values or frequency coefficients, JEPA trains the encoder to predict the latent representation of masked patches. A separate target encoder (exponential moving average of the context encoder) provides the prediction targets. This avoids reconstructing high-frequency noise and encourages semantic rather than low-level representations.
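The training signal can be sketched with linear maps standing in for the encoders and an identity map for the predictor:

```python
import numpy as np

# Skeleton of the JEPA training signal: a context encoder embeds visible
# patches, a predictor guesses the latent of masked patches, and targets
# come from an EMA copy of the encoder (no raw-signal reconstruction).

rng = np.random.default_rng(0)
d_in, d_lat = 32, 16
W_ctx = rng.normal(size=(d_lat, d_in)) * 0.1   # context encoder weights
W_tgt = W_ctx.copy()                           # target encoder (EMA copy)

def ema_update(w_tgt, w_ctx, decay=0.996):
    """Target encoder trails the context encoder; no gradients reach it."""
    return decay * w_tgt + (1 - decay) * w_ctx

patch_visible = rng.normal(size=d_in)
patch_masked = rng.normal(size=d_in)

z_ctx = W_ctx @ patch_visible          # context representation
z_pred = z_ctx                         # identity stand-in for the predictor
z_tgt = W_tgt @ patch_masked           # target latent, not raw signal

loss = np.mean((z_pred - z_tgt) ** 2)  # latent-space prediction loss
W_tgt = ema_update(W_tgt, W_ctx)
print(loss)
```

Since the loss lives entirely in latent space, there is no pressure to reproduce sample-level noise, which is the paragraph's central point.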
S-JEPA
Guetschel, Moreau & Tangermann (2024) - arXiv:2403.11772 ·
braindecode.models.SignalJEPA
S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention. Applies the JEPA self-supervised objective to EEG for the first time. Instead of reconstructing masked signal values, the model predicts the latent representation of masked patches from unmasked context. The domain-specific spatial block masking strategy improves cross-dataset transfer by forcing the model to learn representations that generalise across electrode subsets. Evaluated on motor imagery, ERP, and SSVEP paradigms.
Multimodal Foundation Models
These models extend the Brain FM paradigm beyond EEG to multiple physiological modalities, either by training a single model on heterogeneous signals or by aligning modality-specific encoders into a shared latent space.
PhysioWave
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation. Extends the BFM paradigm to EMG and ECG in addition to EEG, providing the first large-scale pre-trained models for EMG and ECG. Each modality is processed by a dedicated branch using multi-scale wavelet decomposition for time-frequency feature extraction; a learnable weighted fusion mechanism combines modality-specific representations. Addresses low SNR and device mismatch through wavelet-based preprocessing.
Brant-X
Zhang et al. (SIGKDD 2024) - arXiv:2409.00122
Brant-X: A Unified Physiological Signal Alignment Framework. Uses a pre-trained EEG foundation model as a backbone and aligns other physiological signals (ECG, EMG, eye movements, etc.) into the same latent space via a two-level semantic alignment framework:
- Sample-level alignment - paired windows recorded simultaneously are aligned to produce similar representations.
- Semantic-level alignment - windows labelled with the same cognitive state (even if not recorded simultaneously) are aligned across modalities.
This addresses the scarcity of simultaneously collected paired data. Downstream tasks include sleep stage classification, emotion recognition, freezing of gait detection, eye movement communication, and arrhythmia detection — all benefiting from EEG knowledge transfer.
Brant-X is the primary reference for the Cross-Modal Alignment research direction.
Spiking (Neuromorphic) Architectures
Spiking Neural Networks (SNNs) are biologically inspired models where neurons communicate via discrete binary spikes rather than continuous activations, offering potential energy efficiency advantages for edge deployment on neuromorphic hardware.
Temporal Recurrent SNNs
Recurrent SNNs with improved temporal learning rules for neuromorphic hardware applications.
Fractal-SNN
An SNN with fractal dynamics applied to EEG-based emotion recognition.

Spiking Legendre Reservoir Computing (SLRC)
Combines reservoir computing with Legendre memory units in a spiking framework for time series classification.

Legendre Spiking Neural Network (LSNN)

SGLNet
An SNN with adaptive graph convolution and LSTM components for BCI tasks, modelling both spiking dynamics and spatial inter-electrode relationships.

Continuous Thought Machines (CTMs)
Darlow et al. (2025) - Paper ·
SakanaAI/continuous-thought-machines
CTMs model internal recurrent "thinking" steps where the network processes the same input over multiple internal time steps before producing an output, analogous to internal deliberation in biological neurons.

Alternative Approaches
Energy-Based Models
Song & Kingma (2021) - How to Train EBMs
Energy-based models define a scalar energy function over the input-label joint space; low energy corresponds to plausible configurations. They are attractive for EEG because they sidestep computing a normalising constant over the input space and can express complex multimodal distributions.
Flow Matching
Lipman et al. (2022) - Paper
A generative modelling approach that learns continuous normalising flows by regressing vector fields rather than maximising likelihood. Applicable to EEG data augmentation and synthesis, expanding labelled training sets.
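The regression objective for the simple linear (optimal-transport) conditional path can be sketched directly; the "model" below is a constant stand-in for a trained vector-field network:

```python
import numpy as np

# Conditional flow matching with the linear path: sample noise x0, data x1,
# and a time t; form x_t = (1 - t) x0 + t x1 and regress the model output
# v(x_t, t) onto the constant conditional target u = x1 - x0.

rng = np.random.default_rng(0)
x1 = rng.normal(loc=3.0, size=(256, 2))   # "data" samples
x0 = rng.normal(size=(256, 2))            # noise samples
t = rng.random((256, 1))                  # uniform times in [0, 1]

xt = (1 - t) * x0 + t * x1                # point on the probability path
u = x1 - x0                               # conditional vector-field target

v = np.full_like(u, 3.0)                  # dummy model: constant guess
loss = np.mean((v - u) ** 2)              # flow-matching regression loss
print(loss)
```

Sampling then amounts to integrating the learned vector field from noise to data, which is what makes the approach usable for EEG synthesis and augmentation.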
Architecture Comparison
| Architecture | Parallelisable | Long-range Deps | SSL-friendly | Edge-efficient |
|---|---|---|---|---|
| CNN (EEGNet, GREEN) | Yes | Limited | Moderate | Yes |
| Recurrent (xLSTM, LMU) | Partial | Strong | Moderate | Yes |
| Mamba SSM (FEMBA) | Yes | Strong | Good | Yes |
| Contrastive Transformer (BENDR, LEAD) | Yes | Strong | Excellent | No |
| Masked Transformer (LaBraM, CEReBrO) | Yes | Strong | Excellent | No |
| JEPA (S-JEPA) | Yes | Strong | Excellent | No |
| Graph + MAE (GEFM) | Partial | Moderate | Good | No |
| Spiking (SLRC, SGLNet) | Partial | Moderate | Emerging | Excellent |