Sequence Length Agnosticism

The Problem

Much of the physiological data collected in cognitive monitoring spans long, continuous recording durations. The relationship between signal duration and cognitive state is unclear:

  • Does the cognitive state at time \(t\) reflect the last 2 seconds of EEG, the last 10 seconds, or the last minute?
  • Slow cognitive drift (fatigue building over hours) requires long context windows; rapid state changes (a sudden high-workload event) may require short windows.
  • Fixed-length window approaches force a choice that may be suboptimal across different cognitive phenomena.

Additionally, different datasets use different recording durations and windowing conventions. A model with a fixed input length cannot be applied directly to datasets that follow other conventions without additional preprocessing.

What Sequence Length Agnosticism Means

A sequence length-agnostic model accepts input windows of any temporal length without architectural modifications or interpolation of positional embeddings. The encoder produces a fixed-size latent vector regardless of whether the input is a 2-second or 60-second window.

This requires:

  1. Position embeddings that generalise to unseen lengths - fixed sinusoidal or rotary position embeddings (RoPE) can extrapolate beyond the training length range, whereas learned absolute embeddings are defined only up to the maximum training length.
  2. Architectures without fixed-length inductive biases - attention mechanisms that do not assume a fixed sequence length, and pooling operations that aggregate variable-length sequences.
  3. Evaluation across window lengths - the model should be benchmarked at multiple window lengths to characterise performance as a function of temporal context.
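The second requirement - pooling that aggregates variable-length sequences into a fixed-size latent - can be sketched minimally. The function names and dimensions below are illustrative assumptions, not part of any specific model:

```python
import numpy as np

def masked_mean_pool(tokens, mask):
    """Aggregate a variable-length token sequence (T, D) into one (D,) vector.

    `mask` is a boolean array of shape (T,) marking valid (non-padded) tokens,
    so the same pooling applies unchanged to any window length T.
    """
    valid = tokens[mask]           # keep only real (non-padded) tokens
    return valid.mean(axis=0)      # fixed-size latent regardless of T

# The same head accepts, e.g., a 2 s and a 60 s window (here 128 Hz, 64-dim tokens):
short = np.random.randn(256, 64)
long_ = np.random.randn(7680, 64)
z1 = masked_mean_pool(short, np.ones(256, dtype=bool))
z2 = masked_mean_pool(long_, np.ones(7680, dtype=bool))
```

Both `z1` and `z2` are 64-dimensional, so the downstream decoder never sees the window length.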

Approaches

Rotary Position Embedding (RoPE)

RoPE encodes position as a rotation applied to the query and key vectors in self-attention, rather than as an additive positional vector. The rotation extrapolates naturally to positions not seen during training, enabling generalisation to longer sequences.
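A minimal numpy sketch of the standard RoPE rotation (plain RoPE, not DIVER-0's STCPE generalisation) makes the mechanism concrete; shapes and the `base` constant follow the common convention:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to (T, D) queries or keys (D even).

    Each pair of dimensions is rotated by an angle proportional to the token's
    position, so any position - including ones beyond the training length -
    maps to a well-defined rotation, and q·k depends only on relative offset.
    """
    T, D = x.shape
    pos = np.arange(T)[:, None]                   # (T, 1) position index
    freqs = base ** (-np.arange(0, D, 2) / D)     # (D/2,) per-pair frequencies
    angles = pos * freqs                          # (T, D/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, vector norms are preserved and attention scores between two rotated tokens depend only on their positional distance.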

DIVER-0 uses a generalisation of RoPE called Sliding Temporal Conditional Positional Encoding (STCPE) to handle both temporal translation equivariance and variable sequence lengths simultaneously.

Relative Position Encodings

Rather than encoding absolute position in the sequence, relative position encodings encode the distance between two tokens. Because distances can be computed for any pair of positions, relative encodings naturally extend to any sequence length without modification.
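A toy sketch of the idea, assuming a simple distance-based bias added to attention logits (the linear penalty below is illustrative, not a specific published parameterisation):

```python
import numpy as np

def relative_bias_matrix(T, bias_fn):
    """Build a (T, T) attention-bias matrix from pairwise token distances.

    Because the bias depends only on j - i, the same `bias_fn` works for any
    sequence length T - nothing is re-learned or interpolated at a new length.
    """
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]   # signed distance j - i for every pair
    return bias_fn(rel)

# Illustrative linear penalty on absolute distance:
penalty = lambda rel: -0.5 * np.abs(rel)
```

The same pair-distance entry appears identically in matrices of any size, which is exactly the length-extension property the text describes.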

Learnable Sinusoidal Encodings

Sinusoidal position embeddings (Vaswani et al., 2017) are defined for any integer position and can be evaluated at positions beyond the training range. Making the frequency components learnable adapts them to the frequency structure of EEG signals.
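The fixed (non-learnable) form can be evaluated at any position, which is the property the learnable-frequency variant preserves. A sketch following the standard Vaswani et al. (2017) definition:

```python
import numpy as np

def sinusoidal_encoding(positions, d_model, base=10000.0):
    """Standard sinusoidal position encoding, defined for any position value.

    Even dimensions carry sin, odd dimensions carry cos, with geometrically
    spaced frequencies; making `freqs` learnable gives the adaptive variant.
    """
    positions = np.asarray(positions, dtype=float)[:, None]   # (T, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)     # (d_model/2,)
    angles = positions * freqs                                # (T, d_model/2)
    enc = np.zeros((positions.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Evaluable at positions far beyond any training range:
pe = sinusoidal_encoding(np.arange(5000), 64)
```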

State Space Models (Mamba / FEMBA)

State-space models process sequences recurrently: the hidden state evolves as a function of each new input token. Because the recurrent computation does not reference a global position index, it is inherently length-agnostic. FEMBA's bidirectional Mamba architecture processes EEG sequences of arbitrary length with linear time and memory complexity.
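The length-agnosticism of the recurrence can be seen in a minimal linear state-space scan. This is a toy sketch only - real Mamba uses input-dependent (selective) parameters and a hardware-efficient parallel scan, and the matrices below are arbitrary assumptions:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discretised linear state-space recurrence.

    h_t = A h_{t-1} + B u_t,   y_t = C h_t.
    The update never references the global position t, so the same parameters
    process a sequence of any length in O(T) time with constant state memory.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                # one step per input sample
        h = A @ h + B * u_t      # state update (no position index involved)
        ys.append(C @ h)         # per-step readout
    return np.array(ys)
```

The same `(A, B, C)` triple runs unchanged on a 10-sample or a 10,000-sample EEG segment, which is the property FEMBA's bidirectional Mamba blocks inherit.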

Window Aggregation Strategies

At inference time, a long recording can be processed in overlapping windows, and the window-level predictions can be aggregated:

  • Mean pooling across window representations before decoding.
  • Attention-weighted pooling where a learned attention mechanism determines which windows are most informative.
  • Temporal hierarchical encoding where a second-level encoder aggregates window representations across time.
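The second strategy, attention-weighted pooling, can be sketched with a single learned score vector; the parameter `w` below is a hypothetical stand-in for whatever attention parameterisation a given model uses:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # shift for numerical stability
    return e / e.sum()

def attention_pool(windows, w):
    """Aggregate (N, D) window representations into one (D,) recording vector.

    `w` is a learned (here: hypothetical) score vector; the mechanism works
    for any number of windows N, so recording length is unconstrained.
    """
    scores = windows @ w          # (N,) relevance score per window
    alpha = softmax(scores)       # normalised attention weights, sum to 1
    return alpha @ windows        # weighted sum -> fixed-size (D,) vector
```

Mean pooling is the special case where every window receives weight 1/N; the attention weights let the model emphasise the most informative windows instead.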

Relevance to Cognitive State Monitoring

For cognitive workload monitoring, research suggests that the relevant timescale is on the order of 10–30 seconds - long enough to capture a cognitive event but short enough that the label is approximately stationary within the window. However, this varies across:

  • Workload - tends to evolve relatively slowly; longer windows may capture more stable features.
  • Stress - can change rapidly in response to specific events; shorter windows may be more appropriate.
  • Fatigue - accumulates over hours; very long windows or explicit temporal trend modelling may be required.

A sequence length-agnostic architecture allows the window length to be treated as a hyperparameter tuned for each cognitive state target, rather than a fixed architectural constraint.