# Multimodal Representation Learning
The Brain Foundation Model is designed to work not just with EEG, but with the full range of physiological signals available in cognitive monitoring: PPG, ECG, eye gaze, pupillometry, and speech. Treating these modalities jointly requires strategies for learning both modality-specific and modality-invariant representations.
## The Multimodal Challenge
Each physiological modality captures a different facet of the underlying cognitive state:
| Modality | What It Reflects | Temporal Scale | Data Availability |
|---|---|---|---|
| EEG | Direct neural activity | Milliseconds | Large corpora available |
| ECG | Autonomic cardiac response | Seconds | Moderate |
| PPG | Peripheral blood volume | Seconds | Small |
| Eye gaze | Attentional allocation | Sub-second | Small |
| Pupillometry | Arousal and cognitive effort | Seconds | Small |
| Speech | Vocal stress and load | Continuous | Varies |
A key challenge is modality imbalance: large pre-training corpora (tens of thousands of hours) exist primarily for EEG, while PPG and eye tracking datasets contain far less data. This makes it difficult to pre-train strong encoders for underrepresented modalities from scratch.
## Two Multimodal Goals
### 1. Cross-Modal Alignment
Learn a shared latent space in which representations of the same cognitive state, captured by different modalities, lie close together, regardless of which sensor produced the signal.
This is useful for:
- Multimodal fusion - combining EEG + PPG representations for more robust cognitive state estimation.
- Missing modality handling - if one sensor fails or is unavailable, the representation from another modality can be used as a substitute.
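Both uses follow directly from having a shared latent space. A minimal sketch, using NumPy with hypothetical linear stand-ins for the trained encoders (the dimensions and the averaging fusion rule are illustrative assumptions, not the project's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (illustrative)

# Stand-in encoders: fixed linear maps into the shared space.
W_eeg = rng.standard_normal((32, D))
W_ppg = rng.standard_normal((16, D))

def encode_eeg(x):
    return x @ W_eeg

def encode_ppg(x):
    return x @ W_ppg

def fuse(z_eeg=None, z_ppg=None):
    """Average whichever embeddings are available; because the encoders map
    into the same space, a missing modality is simply dropped and the
    remaining one stands in for it."""
    present = [z for z in (z_eeg, z_ppg) if z is not None]
    return np.mean(present, axis=0)

eeg_window = rng.standard_normal(32)
ppg_window = rng.standard_normal(16)

z_full = fuse(encode_eeg(eeg_window), encode_ppg(ppg_window))  # fusion
z_ppg_only = fuse(z_ppg=encode_ppg(ppg_window))                # EEG sensor failed
```

Downstream heads consume a `D`-dimensional vector either way, which is what makes the substitution transparent.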
### 2. Bootstrapping Underrepresented Modalities
Use the large, well-trained EEG encoder to bootstrap encoders for data-scarce modalities via cross-modal knowledge transfer. A PPG encoder trained on small PPG datasets can be improved by aligning it with the EEG encoder in the shared latent space, effectively transferring knowledge from the EEG-rich domain.
See Underrepresented Modalities.
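The bootstrapping idea can be sketched as cross-modal distillation: a frozen EEG encoder produces target embeddings on paired recordings, and the small PPG encoder is trained to match them in the shared space. Everything below is an illustrative assumption (linear encoders, synthetic paired data, plain gradient descent on an MSE loss), not the actual training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (illustrative)

# Synthetic paired windows: both sensors observe the same latent state.
state = rng.standard_normal((256, 4))
eeg = state @ rng.standard_normal((4, 32)) + 0.1 * rng.standard_normal((256, 32))
ppg = state @ rng.standard_normal((4, 16)) + 0.1 * rng.standard_normal((256, 16))

# Frozen, well-trained EEG encoder (stand-in: a fixed linear map).
W_eeg = rng.standard_normal((32, D))
targets = eeg @ W_eeg          # teacher embeddings, never updated

# Small PPG encoder ("student") trained to match the teacher.
W_ppg = np.zeros((16, D))
lr = 1e-2
initial_mse = np.mean((ppg @ W_ppg - targets) ** 2)
for _ in range(500):
    residual = ppg @ W_ppg - targets
    W_ppg -= lr * ppg.T @ residual / len(ppg)   # gradient of the MSE loss

final_mse = np.mean((ppg @ W_ppg - targets) ** 2)
```

The PPG encoder never sees EEG data directly; the knowledge transfer happens entirely through the shared-space targets, which is why a small paired dataset can suffice.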
## Unified vs. Modality-Specific Encoders
Two architectural philosophies are under investigation:
### Unified Encoder (BIOT / PhysioWave style)
A single transformer encoder ingests all modalities. Signals are tokenised into a common format (fixed-length segments) and prefixed with modality identity tokens. The same attention mechanism processes EEG tokens and PPG tokens in the same forward pass.
Advantages: Maximum weight sharing; implicit alignment through shared parameters; a single model to maintain.
Disadvantages: Training signals must be balanced across modalities; otherwise the abundant EEG data can dominate and crowd out lower-resource modalities.
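The tokenisation step for the unified approach can be sketched as follows. The segment length, embedding dimension, and the use of a single prefix token per modality are assumptions for illustration; a real implementation would use learned projections inside the model rather than fixed random matrices:

```python
import numpy as np

SEGMENT = 64                       # samples per fixed-length segment (assumed)
D = 12                             # token embedding dimension (assumed)
MODALITY_ID = {"eeg": 0, "ppg": 1}

rng = np.random.default_rng(0)
proj = {m: rng.standard_normal((SEGMENT, D)) for m in MODALITY_ID}
modality_embed = rng.standard_normal((len(MODALITY_ID), D))

def tokenise(signal, modality):
    """Split a 1-D signal into fixed-length segments, embed each segment,
    and prefix the sequence with a modality identity token."""
    n = len(signal) // SEGMENT
    segments = signal[: n * SEGMENT].reshape(n, SEGMENT)
    tokens = segments @ proj[modality]               # (n, D) content tokens
    prefix = modality_embed[MODALITY_ID[modality]]   # (D,) identity token
    return np.vstack([prefix, tokens])

eeg_tokens = tokenise(rng.standard_normal(640), "eeg")   # 10 segments + prefix
ppg_tokens = tokenise(rng.standard_normal(256), "ppg")   # 4 segments + prefix

# Both modalities enter the same transformer as one token sequence,
# so the same attention mechanism processes them in one forward pass.
joint = np.vstack([eeg_tokens, ppg_tokens])
```

The identity tokens are what let the shared attention layers tell EEG tokens from PPG tokens despite the common format.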
### Modality-Specific Encoders + Alignment
Separate encoders are trained per modality (one large EEG encoder, smaller encoders for PPG, ECG, etc.), then aligned into a shared latent space via contrastive or reconstruction-based losses.
Advantages: Each encoder is optimised for its modality's specific properties (sampling rate, signal statistics, noise structure); the EEG encoder can be pre-trained independently at scale.
Disadvantages: Requires paired cross-modal data for alignment training; more components to maintain.
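A common choice for the contrastive alignment loss is InfoNCE over paired windows: embeddings of the same window from two modalities are pulled together, while mismatched pairs within the batch are pushed apart. A minimal NumPy sketch (the temperature, batch size, and synthetic embeddings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: row i of z_a should match row i of z_b,
    against all other rows in the batch as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal

# Stand-ins for EEG and PPG encoder outputs on paired windows.
z_eeg = rng.standard_normal((64, 8))
z_ppg = z_eeg + 0.05 * rng.standard_normal((64, 8))   # nearly aligned pairs

loss_aligned = info_nce(z_eeg, z_ppg)                 # low: pairs match
loss_random = info_nce(z_eeg, rng.standard_normal((64, 8)))  # high: no pairing
```

Minimising this loss over paired data is what drives the separately trained encoders into a common latent space; it only requires that paired windows exist, which is exactly the disadvantage noted above.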
## Modality-Invariant and Modality-Specific Representations
For tasks that involve multiple modalities simultaneously, it is useful to decompose representations into:
- Modality-invariant features - shared cognitive state information that is common across all modalities.
- Modality-specific features - information unique to each sensor (e.g. cardiac rhythm features from ECG that are not present in EEG).
The MISA framework (Hazarika et al., 2020) formalises this decomposition for sentiment analysis; the same principle extends to physiological multimodal learning.
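In MISA-style decompositions, each modality's embedding is passed through two projection heads, and the losses pull the invariant parts of different modalities together while keeping each modality's invariant and specific parts decorrelated. A sketch under assumed shapes (the random projections stand in for learned heads, and the 0.1 weight is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# Stand-ins for per-modality embeddings from already-trained encoders.
z_eeg = rng.standard_normal((128, D))
z_ecg = rng.standard_normal((128, D))

# Two heads per modality: one into the shared (invariant) subspace,
# one into the private (modality-specific) subspace.
P_inv = {m: rng.standard_normal((D, D)) for m in ("eeg", "ecg")}
P_spec = {m: rng.standard_normal((D, D)) for m in ("eeg", "ecg")}

def decompose(z, m):
    return z @ P_inv[m], z @ P_spec[m]

h_inv_eeg, h_spec_eeg = decompose(z_eeg, "eeg")
h_inv_ecg, h_spec_ecg = decompose(z_ecg, "ecg")

# Similarity loss: invariant representations of the two modalities
# should agree on the same windows.
similarity_loss = np.mean((h_inv_eeg - h_inv_ecg) ** 2)

# Orthogonality penalty: each modality's invariant and specific parts
# should carry non-overlapping information.
ortho_loss = np.linalg.norm(h_inv_eeg.T @ h_spec_eeg, "fro") ** 2 / len(z_eeg)

total = similarity_loss + 0.1 * ortho_loss
```

The specific subspace is where information like ECG rhythm features, absent from EEG, can survive without being averaged away by the alignment objective.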
See Cross-Modal Alignment for the formal objective.