# Multimodal Representation Learning
The Brain Foundation Model is designed to work not just with EEG, but with the full range of physiological signals available in cognitive monitoring: PPG, ECG, eye gaze, pupillometry, and speech. Treating these modalities jointly requires strategies for learning both modality-specific and modality-invariant representations.
## The Multimodal Challenge
Each physiological modality captures a different facet of the underlying cognitive state:
| Modality | What It Reflects | Temporal Scale | Data Availability |
|---|---|---|---|
| EEG | Direct neural activity | Milliseconds | Large corpora available |
| ECG | Autonomic cardiac response | Seconds | Moderate |
| PPG | Peripheral blood volume | Seconds | Small |
| Eye gaze | Attentional allocation | Sub-second | Small |
| Pupillometry | Arousal and cognitive effort | Seconds | Small |
| Speech | Vocal stress and load | Continuous | Varies |
A key challenge is modality imbalance: large pre-training corpora (tens of thousands of hours) exist primarily for EEG, while PPG and eye tracking datasets contain far less data. This makes it difficult to pre-train strong encoders for underrepresented modalities from scratch.
## Two Multimodal Goals
### 1. Cross-Modal Alignment
Learn a shared latent space in which representations of the same cognitive state, captured by different modalities, lie close together, regardless of which sensor produced the signal.
This is useful for:
- Multimodal fusion - combining EEG + PPG representations for more robust cognitive state estimation.
- Missing modality handling - if one sensor fails or is unavailable, the representation from another modality can be used as a substitute.
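Both uses follow directly from having a shared latent space. A minimal sketch, using NumPy with hypothetical linear stand-ins for the trained encoders (the dimensions and the averaging fusion rule are illustrative assumptions, not the project's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (illustrative)

# Stand-in encoders: fixed linear maps into the shared space.
W_eeg = rng.standard_normal((32, D))
W_ppg = rng.standard_normal((16, D))

def encode_eeg(x):
    return x @ W_eeg

def encode_ppg(x):
    return x @ W_ppg

def fuse(z_eeg=None, z_ppg=None):
    """Average whichever embeddings are available; because the encoders map
    into the same space, a missing modality is simply dropped and the
    remaining one stands in for it."""
    present = [z for z in (z_eeg, z_ppg) if z is not None]
    return np.mean(present, axis=0)

eeg_window = rng.standard_normal(32)
ppg_window = rng.standard_normal(16)

z_full = fuse(encode_eeg(eeg_window), encode_ppg(ppg_window))  # fusion
z_ppg_only = fuse(z_ppg=encode_ppg(ppg_window))                # EEG sensor failed
```

Downstream heads consume a `D`-dimensional vector either way, which is what makes the substitution transparent.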
### 2. Bootstrapping Underrepresented Modalities
Use the large, well-trained EEG encoder to bootstrap encoders for data-scarce modalities via cross-modal knowledge transfer. A PPG encoder trained on small PPG datasets can be improved by aligning it with the EEG encoder in the shared latent space, effectively transferring knowledge from the EEG-rich domain.
See Underrepresented Modalities.
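The bootstrapping idea can be sketched as cross-modal distillation: a frozen EEG encoder produces target embeddings on paired recordings, and the small PPG encoder is trained to match them in the shared space. Everything below is an illustrative assumption (linear encoders, synthetic paired data, plain gradient descent on an MSE loss), not the actual training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (illustrative)

# Synthetic paired windows: both sensors observe the same latent state.
state = rng.standard_normal((256, 4))
eeg = state @ rng.standard_normal((4, 32)) + 0.1 * rng.standard_normal((256, 32))
ppg = state @ rng.standard_normal((4, 16)) + 0.1 * rng.standard_normal((256, 16))

# Frozen, well-trained EEG encoder (stand-in: a fixed linear map).
W_eeg = rng.standard_normal((32, D))
targets = eeg @ W_eeg          # teacher embeddings, never updated

# Small PPG encoder ("student") trained to match the teacher.
W_ppg = np.zeros((16, D))
lr = 1e-2
initial_mse = np.mean((ppg @ W_ppg - targets) ** 2)
for _ in range(500):
    residual = ppg @ W_ppg - targets
    W_ppg -= lr * ppg.T @ residual / len(ppg)   # gradient of the MSE loss

final_mse = np.mean((ppg @ W_ppg - targets) ** 2)
```

The PPG encoder never sees EEG data directly; the knowledge transfer happens entirely through the shared-space targets, which is why a small paired dataset can suffice.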
## Unified vs. Modality-Specific Encoders
Two architectural philosophies are under investigation:
### Unified Encoder (BIOT / PhysioWave style)
A single transformer encoder ingests all modalities. Signals are tokenised into a common format (fixed-length segments) and prefixed with modality identity tokens. The same attention mechanism processes EEG tokens and PPG tokens in the same forward pass.
Advantages: Maximum weight sharing; implicit alignment through shared parameters; a single model to maintain.
Disadvantages: Training signals must be balanced across modalities; otherwise the abundant EEG data can dominate and crowd out lower-resource modalities.
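The tokenisation step for the unified approach can be sketched as follows. The segment length, embedding dimension, and the use of a single prefix token per modality are assumptions for illustration; a real implementation would use learned projections inside the model rather than fixed random matrices:

```python
import numpy as np

SEGMENT = 64                       # samples per fixed-length segment (assumed)
D = 12                             # token embedding dimension (assumed)
MODALITY_ID = {"eeg": 0, "ppg": 1}

rng = np.random.default_rng(0)
proj = {m: rng.standard_normal((SEGMENT, D)) for m in MODALITY_ID}
modality_embed = rng.standard_normal((len(MODALITY_ID), D))

def tokenise(signal, modality):
    """Split a 1-D signal into fixed-length segments, embed each segment,
    and prefix the sequence with a modality identity token."""
    n = len(signal) // SEGMENT
    segments = signal[: n * SEGMENT].reshape(n, SEGMENT)
    tokens = segments @ proj[modality]               # (n, D) content tokens
    prefix = modality_embed[MODALITY_ID[modality]]   # (D,) identity token
    return np.vstack([prefix, tokens])

eeg_tokens = tokenise(rng.standard_normal(640), "eeg")   # 10 segments + prefix
ppg_tokens = tokenise(rng.standard_normal(256), "ppg")   # 4 segments + prefix

# Both modalities enter the same transformer as one token sequence,
# so the same attention mechanism processes them in one forward pass.
joint = np.vstack([eeg_tokens, ppg_tokens])
```

The identity tokens are what let the shared attention layers tell EEG tokens from PPG tokens despite the common format.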
### Modality-Specific Encoders + Alignment
Separate encoders are trained per modality (one large EEG encoder, smaller encoders for PPG, ECG, etc.), then aligned into a shared latent space via contrastive or reconstruction-based losses.
Advantages: Each encoder is optimised for its modality's specific properties (sampling rate, signal statistics, noise structure); the EEG encoder can be pre-trained independently at scale.
Disadvantages: Requires paired cross-modal data for alignment training; more components to maintain.
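A common choice for the contrastive alignment loss is InfoNCE over paired windows: embeddings of the same window from two modalities are pulled together, while mismatched pairs within the batch are pushed apart. A minimal NumPy sketch (the temperature, batch size, and synthetic embeddings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: row i of z_a should match row i of z_b,
    against all other rows in the batch as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal

# Stand-ins for EEG and PPG encoder outputs on paired windows.
z_eeg = rng.standard_normal((64, 8))
z_ppg = z_eeg + 0.05 * rng.standard_normal((64, 8))   # nearly aligned pairs

loss_aligned = info_nce(z_eeg, z_ppg)                 # low: pairs match
loss_random = info_nce(z_eeg, rng.standard_normal((64, 8)))  # high: no pairing
```

Minimising this loss over paired data is what drives the separately trained encoders into a common latent space; it only requires that paired windows exist, which is exactly the disadvantage noted above.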
## Modality-Invariant and Modality-Specific Representations
For tasks that involve multiple modalities simultaneously, it is useful to decompose representations into:
- Modality-invariant features - shared cognitive state information that is common across all modalities.
- Modality-specific features - information unique to each sensor (e.g. cardiac rhythm features from ECG that are not present in EEG).
The MISA framework (Hazarika et al., 2020) formalises this decomposition for sentiment analysis; the same principle extends to physiological multimodal learning.
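In MISA-style decompositions, each modality's embedding is passed through two projection heads, and the losses pull the invariant parts of different modalities together while keeping each modality's invariant and specific parts decorrelated. A sketch under assumed shapes (the random projections stand in for learned heads, and the 0.1 weight is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# Stand-ins for per-modality embeddings from already-trained encoders.
z_eeg = rng.standard_normal((128, D))
z_ecg = rng.standard_normal((128, D))

# Two heads per modality: one into the shared (invariant) subspace,
# one into the private (modality-specific) subspace.
P_inv = {m: rng.standard_normal((D, D)) for m in ("eeg", "ecg")}
P_spec = {m: rng.standard_normal((D, D)) for m in ("eeg", "ecg")}

def decompose(z, m):
    return z @ P_inv[m], z @ P_spec[m]

h_inv_eeg, h_spec_eeg = decompose(z_eeg, "eeg")
h_inv_ecg, h_spec_ecg = decompose(z_ecg, "ecg")

# Similarity loss: invariant representations of the two modalities
# should agree on the same windows.
similarity_loss = np.mean((h_inv_eeg - h_inv_ecg) ** 2)

# Orthogonality penalty: each modality's invariant and specific parts
# should carry non-overlapping information.
ortho_loss = np.linalg.norm(h_inv_eeg.T @ h_spec_eeg, "fro") ** 2 / len(z_eeg)

total = similarity_loss + 0.1 * ortho_loss
```

The specific subspace is where information like ECG rhythm features, absent from EEG, can survive without being averaged away by the alignment objective.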
See Cross-Modal Alignment for the formal objective.