Cross-Modal Alignment
Cross-modal alignment learns a shared latent space in which representations of the same cognitive state, derived from different physiological modalities, are geometrically close - regardless of which sensor produced them.
Motivation
A supervisor monitoring an operator may have access to EEG, PPG, and speech simultaneously. Rather than making three independent predictions from three separate models and combining them heuristically, cross-modal alignment enables the system to:
- Project all modalities into the same latent space.
- Fuse them at the representation level before decoding.
- Fall back gracefully if one modality is unavailable (sensor failure, consent withdrawal).
Additionally, paired multi-modal recordings are rare - most datasets record either EEG or PPG, not both simultaneously. Cross-modal alignment can leverage separately collected datasets by aligning the latent spaces of independently trained encoders.
Modality-Invariant and Modality-Specific Decomposition
MISA Framework
Hazarika, Zimmermann & Poria (ACM MM 2020) - Paper
MISA (Modality-Invariant and -Specific Representations) decomposes each modality's encoding into two orthogonal subspaces:

\[\mathbf{z}_m = \mathbf{z}_m^{\text{inv}} \oplus \mathbf{z}_m^{\text{spec}}\]

where:
- \(\mathbf{z}_m^{\text{inv}}\) captures the cognitive state information shared across modalities (invariant representation).
- \(\mathbf{z}_m^{\text{spec}}\) captures modality-unique characteristics (specific representation).
The invariant representations from all modalities are pulled together in the shared latent space via a distribution-matching similarity loss (the original paper uses central moment discrepancy; adversarial or contrastive objectives are alternatives), while each modality's specific representation is encouraged to be orthogonal to its invariant counterpart.
Advantage: Fusion uses only the invariant representations; modality-specific information is preserved separately and can be used as auxiliary features when available.
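The decomposition and orthogonality penalty can be sketched in a few lines of NumPy. The linear projectors, dimensions, and batch size below are illustrative stand-ins for MISA's learned layers, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-d encoder output, 32-d subspaces.
d_enc, d_sub = 64, 32
W_inv = rng.standard_normal((d_enc, d_sub)) * 0.1   # invariant projector (one per modality)
W_spec = rng.standard_normal((d_enc, d_sub)) * 0.1  # modality-specific projector

z_eeg = rng.standard_normal((8, d_enc))  # batch of 8 EEG encoder outputs

z_inv = z_eeg @ W_inv    # shared cognitive-state subspace
z_spec = z_eeg @ W_spec  # modality-unique subspace

# Orthogonality ("difference") loss: squared Frobenius norm of the
# cross-correlation between the two subspaces. Training drives this
# toward zero so z_inv and z_spec carry non-overlapping information.
ortho_loss = np.linalg.norm(z_inv.T @ z_spec, "fro") ** 2 / z_eeg.shape[0]
```

In a real model the projectors are trained jointly with the alignment and task losses; here the penalty is simply evaluated on random weights to show its form.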
Towards Robust Multimodal Physiological Foundation Models
Jiang et al. (2025) - arXiv:2504.19596
This recent work extends the MISA approach to handle arbitrary missing modalities at test time. The model learns a joint representation that is robust when any subset of modalities is absent - critical for real-world deployment where sensor availability is unpredictable.
Two-Level Semantic Alignment (Brant-X)
Zhang et al. (SIGKDD 2024) - arXiv:2409.00122
Brant-X provides a concrete implementation of cross-modal alignment for physiological signals. Using a pre-trained EEG foundation model as the anchor, it aligns other biosignals (ECG, EMG, eye movements, etc.) via a two-level semantic alignment framework:
- Sample-level alignment - paired windows of different modalities recorded simultaneously are aligned to produce similar representations.
- Semantic-level alignment - windows labelled with the same cognitive state (even if not simultaneously recorded) are aligned across modalities.
This two-level approach addresses the scarcity of simultaneously collected paired data by also using separately labelled datasets for alignment.
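The key data structure behind semantic-level alignment is a positive-pair mask built from labels rather than from simultaneity. A small sketch, with entirely hypothetical labels:

```python
import numpy as np

def semantic_positive_mask(labels_m1, labels_m2):
    """Positive-pair mask for semantic-level alignment: window i of
    modality 1 and window j of modality 2 count as a positive pair
    whenever they share a cognitive-state label, even if they were
    recorded in different sessions or datasets."""
    return labels_m1[:, None] == labels_m2[None, :]

labels_eeg = np.array([0, 0, 1, 2])  # hypothetical state labels per EEG window
labels_ecg = np.array([1, 0, 2])     # hypothetical state labels per ECG window

mask = semantic_positive_mask(labels_eeg, labels_ecg)
print(mask.astype(int))
# [[0 1 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```

Sample-level alignment is the special case where the mask is the identity over simultaneously recorded window pairs; a contrastive loss then treats masked entries as positives and the rest as negatives.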
Downstream tasks where Brant-X demonstrated cross-modal transfer:
- Sleep stage classification
- Emotion recognition
- Freezing of gait detection
- Eye movement-based communication
- Arrhythmia detection (ECG)
Contrastive Cross-Modal Objectives
For any pair of modalities \((m_1, m_2)\), contrastive alignment minimises the distance between representations of matched pairs and maximises the distance between unmatched pairs:

\[\mathcal{L}_{\text{align}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\left(\text{sim}(\mathbf{z}_{m_1}^{(i)}, \mathbf{z}_{m_2}^{(i)})/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(\text{sim}(\mathbf{z}_{m_1}^{(i)}, \mathbf{z}_{m_2}^{(j)})/\tau\right)}\]

where \(\text{sim}(\cdot, \cdot)\) is cosine similarity and \(\tau\) is a temperature hyperparameter. This is the InfoNCE / NT-Xent contrastive loss used in CLIP and SimCLR, adapted to physiological modality pairs.
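A minimal NumPy sketch of this objective, symmetrised over both directions as in CLIP. Batch size, embedding dimension, and temperature are illustrative; a real implementation would use an autodiff framework:

```python
import numpy as np

def log_softmax(x):
    """Row-wise log-softmax with max subtraction for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(z1, z2, tau=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    z1, z2: (N, d) embeddings of the same N windows from two modalities.
    Row i of z1 and row i of z2 form the positive pair; every other row
    in the batch serves as a negative.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # unit vectors, so
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # dot = cosine sim
    logits = z1 @ z2.T / tau
    loss_12 = -np.diag(log_softmax(logits)).mean()    # m1 -> m2 direction
    loss_21 = -np.diag(log_softmax(logits.T)).mean()  # m2 -> m1 direction
    return (loss_12 + loss_21) / 2

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))
aligned = info_nce(z, z)                              # matched pairs
random_pairs = info_nce(z, rng.standard_normal((16, 8)))
```

Perfectly matched pairs yield a loss near zero, while random pairings sit near \(\log N\), which is a quick sanity check when wiring the loss into a training loop.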
Missing Modality Handling
At test time, one or more modalities may be unavailable. Strategies:
| Strategy | Description | Limitation |
|---|---|---|
| Zero imputation | Set missing modality embedding to zero | May shift decision boundary |
| Mean imputation | Replace with training-set mean embedding | Assumes mean is a plausible representation |
| Cross-modal reconstruction | Predict the missing modality's embedding from available modalities | Requires reconstruction head per modality pair |
| Learned masking | Train with randomly masked modalities; model learns to use available subset | Most general; requires training with masking |
The approach in Jiang et al. (2025) trains with random modality masking during pre-training, making the decoder robust to any missing subset at inference.
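A toy sketch of the learned-masking strategy: during training, modalities are dropped at random (always keeping at least one) so the fusion step sees every availability pattern it will face at inference. The mean-pooling fusion here is a deliberately simple stand-in for the learned fusion in Jiang et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_modalities(embeddings, p_drop=0.3, rng=rng):
    """Randomly drop modalities from one training sample, guaranteeing
    that at least one modality survives."""
    names = list(embeddings)
    keep = [n for n in names if rng.random() > p_drop]
    if not keep:
        keep = [rng.choice(names)]  # never drop everything
    return {n: embeddings[n] for n in keep}

def fuse(embeddings):
    """Availability-aware fusion: mean over whichever modality
    embeddings are present."""
    return np.mean(list(embeddings.values()), axis=0)

# Hypothetical per-modality embeddings for a single sample.
sample = {"eeg": np.ones(4), "ppg": 2 * np.ones(4), "speech": 3 * np.ones(4)}
fused = fuse(mask_modalities(sample))
```

Because fusion averages only over present modalities, the same forward pass handles any subset at test time; zero and mean imputation, by contrast, feed the decoder a placeholder it never saw during training unless masking was also used.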
Relationship to Underrepresented Modalities
When a modality has insufficient data for independent pre-training, cross-modal alignment from EEG to that modality bootstraps a useful encoder without requiring a separate large pre-training corpus. See Underrepresented Modalities.