Cross-Modal Alignment

Cross-modal alignment learns a shared latent space in which representations of the same cognitive state, derived from different physiological modalities, are geometrically close - regardless of which sensor produced them.

Motivation

A supervisor monitoring an operator may have access to EEG, PPG, and speech simultaneously. Rather than making three independent predictions from three separate models and combining them heuristically, cross-modal alignment enables the system to:

  1. Project all modalities into the same latent space.
  2. Fuse them at the representation level before decoding.
  3. Fall back gracefully if one modality is unavailable (sensor failure, consent withdrawal).

Additionally, paired multi-modal recordings are rare - most datasets record either EEG or PPG, not both simultaneously. Cross-modal alignment can leverage separately collected datasets by aligning the latent spaces of independently trained encoders.

Modality-Invariant and Modality-Specific Decomposition

MISA Framework

Hazarika, Zimmermann & Poria (ACM MM 2020) - Paper

MISA (Modality-Invariant and -Specific Representations) decomposes each modality's encoding into two orthogonal subspaces:

\[\mathbf{z}_m = [\mathbf{z}_m^{\text{inv}}, \mathbf{z}_m^{\text{spec}}]\]

where:

  • \(\mathbf{z}_m^{\text{inv}}\) captures the cognitive state information shared across modalities (invariant representation).
  • \(\mathbf{z}_m^{\text{spec}}\) captures modality-unique characteristics (specific representation).

The invariant representations from all modalities are aligned to be close in the shared latent space (via a domain adversarial or contrastive loss), while the specific representations are encouraged to be orthogonal to the invariant ones.

Advantage: Fusion uses only the invariant representations; modality-specific information is preserved separately and can be used as auxiliary features when available.
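The two MISA constraints above can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: `orthogonality_loss` penalises overlap between the invariant and specific subspaces via the squared Frobenius norm of their cross-correlation, and `invariant_alignment_loss` is a simple MSE stand-in for the similarity loss MISA applies to the invariant representations (the paper uses CMD; the function names here are hypothetical).

```python
import numpy as np

def orthogonality_loss(z_inv: np.ndarray, z_spec: np.ndarray) -> float:
    """||Z_inv^T Z_spec||_F^2 over a batch (rows = samples).

    Zero when every invariant dimension is uncorrelated with every
    specific dimension, pushing the two subspaces apart.
    """
    gram = z_inv.T @ z_spec          # (d_inv, d_spec) cross-correlation
    return float(np.sum(gram ** 2))

def invariant_alignment_loss(z_inv_a: np.ndarray, z_inv_b: np.ndarray) -> float:
    """MSE surrogate for aligning invariant representations of two
    modalities of the same sample (MISA itself uses CMD)."""
    return float(np.mean((z_inv_a - z_inv_b) ** 2))
```

In training, both terms are added (with weighting hyperparameters) to the task loss, so the encoder is rewarded for routing shared cognitive-state information into the invariant half and everything else into the specific half.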

Towards Robust Multimodal Physiological Foundation Models

Jiang et al. (2025) - arXiv:2504.19596

This recent work extends the MISA approach to handle arbitrary missing modalities at test time. The model learns a joint representation that is robust when any subset of modalities is absent - critical for real-world deployment where sensor availability is unpredictable.

Two-Level Semantic Alignment (Brant-X)

Zhang et al. (SIGKDD 2024) - arXiv:2409.00122

Brant-X provides a concrete implementation of cross-modal alignment for physiological signals. Using a pre-trained EEG foundation model as the anchor, it aligns other biosignals (ECG, EMG, eye movements, etc.) via a two-level semantic alignment framework:

  1. Sample-level alignment - paired windows of different modalities recorded simultaneously are aligned to produce similar representations.
  2. Semantic-level alignment - windows labelled with the same cognitive state (even if not simultaneously recorded) are aligned across modalities.

This two-level approach addresses the scarcity of simultaneously collected paired data by also using separately labelled datasets for alignment.
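The two levels differ only in how positive pairs are constructed. A hypothetical helper (this is a sketch of the idea, not Brant-X's actual code) makes the distinction concrete: sample-level positives come from simultaneous recordings, while semantic-level positives pair any two windows that share a label across corpora.

```python
def alignment_pairs(paired_ids, labels_eeg, labels_other):
    """Build positive index pairs for the two alignment levels.

    paired_ids   : list of (i, j) window indices recorded simultaneously
    labels_eeg   : per-window cognitive-state label in the EEG corpus
    labels_other : per-window label in the other modality's corpus
    """
    # Level 1: simultaneously recorded windows are positives as-is.
    sample_level = list(paired_ids)
    # Level 2: any cross-corpus window pair with matching labels.
    semantic_level = [(i, j)
                      for i, la in enumerate(labels_eeg)
                      for j, lb in enumerate(labels_other)
                      if la == lb]
    return sample_level, semantic_level
```

Both pair sets then feed the same contrastive objective; the semantic level is what lets separately collected, label-only datasets contribute to alignment.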

Downstream tasks where Brant-X demonstrated cross-modal transfer:

  • Sleep stage classification
  • Emotion recognition
  • Freezing of gait detection
  • Eye movement-based communication
  • Arrhythmia detection (ECG)

Contrastive Cross-Modal Objectives

For any pair of modalities \((m_1, m_2)\), contrastive alignment minimises the distance between representations of matched pairs and maximises distance between unmatched pairs:

\[\mathcal{L}_{\text{align}} = -\log \frac{\exp(\text{sim}(\mathbf{z}_{m_1}^{(i)}, \mathbf{z}_{m_2}^{(i)}) / \tau)}{\sum_{j} \exp(\text{sim}(\mathbf{z}_{m_1}^{(i)}, \mathbf{z}_{m_2}^{(j)}) / \tau)}\]

where \(\text{sim}(\cdot, \cdot)\) is cosine similarity and \(\tau\) is a temperature hyperparameter. This is the InfoNCE / NT-Xent contrastive loss used in CLIP and SimCLR, adapted to physiological modality pairs.
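The loss above can be written in a few lines of NumPy. This is a minimal single-direction sketch (matched pairs share the same row index; practical implementations typically symmetrise over both modalities and batch in a framework such as PyTorch):

```python
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss for a batch of matched embeddings.

    z1, z2 : (N, d) arrays; row i of z1 and row i of z2 are a matched pair.
    tau    : temperature hyperparameter.
    """
    # Row-normalise so the dot product equals cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau                 # (N, N) similarity matrix
    # Log-softmax over each row; the positives sit on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Every other sample in the batch serves as an in-batch negative, so no explicit negative mining is needed.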

Missing Modality Handling

At test time, one or more modalities may be unavailable. Strategies:

  • Zero imputation - set the missing modality's embedding to zero. Limitation: may shift the decision boundary.
  • Mean imputation - replace it with the training-set mean embedding. Limitation: assumes the mean is a plausible representation.
  • Cross-modal reconstruction - predict the missing modality's embedding from the available modalities. Limitation: requires a reconstruction head per modality pair.
  • Learned masking - train with randomly masked modalities so the model learns to use whatever subset is available. Limitation: most general, but requires training with masking.

The approach in Jiang et al. (2025) trains with random modality masking during pre-training, making the decoder robust to any missing subset at inference.
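A minimal sketch of the learned-masking strategy, under the simplifying assumptions that fusion is mean-pooling over available modality embeddings and that each modality is dropped independently during training (the helper names are illustrative, not from Jiang et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(n_modalities: int, p_drop: float = 0.3) -> np.ndarray:
    """Training-time mask: drop each modality with prob p_drop,
    but always keep at least one modality available."""
    mask = rng.random(n_modalities) > p_drop
    if not mask.any():
        mask[rng.integers(n_modalities)] = True
    return mask

def fuse(embeddings: list[np.ndarray], mask) -> np.ndarray:
    """Mean-pool only the available modality embeddings, so the same
    fusion works for any subset at inference."""
    available = [e for e, m in zip(embeddings, mask) if m]
    return np.mean(available, axis=0)
```

Because the decoder only ever sees the pooled vector, training under `random_mask` exposes it to every modality subset, which is what makes inference with an arbitrary missing subset well-behaved.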

Relationship to Underrepresented Modalities

When a modality has insufficient data for independent pre-training, cross-modal alignment from EEG to that modality bootstraps a useful encoder without requiring a separate large pre-training corpus. See Underrepresented Modalities.