Speech

Speech is a uniquely convenient physiological modality for cognitive monitoring: it requires no body-worn sensor, only a microphone.

Cognitive Relevance

Stress and cognitive load manifest in measurable acoustic characteristics of speech - changes that occur involuntarily in the speaker's voice regardless of the verbal content:

| Acoustic Feature | Mechanism | Cognitive / Physiological Association |
| --- | --- | --- |
| Fundamental frequency (F0 / pitch) | Increased vocal fold tension under sympathetic arousal | Rises with stress and emotional arousal |
| Speech rate | Accelerates under time pressure (temporal demand); slows with cognitive fatigue | Faster rate → high temporal demand; slower rate → fatigue or confusion |
| Energy / loudness | Increased muscular effort under arousal | Increases with stress and urgency |
| MFCCs | Mel-frequency cepstral coefficients capture vocal tract shape | Sensitive to vocal tract tension patterns under stress |
| Jitter | Cycle-to-cycle micro-perturbations in pitch (frequency variation) | Increases with physiological stress and fatigue |
| Shimmer | Cycle-to-cycle micro-perturbations in amplitude | Increases with physiological stress |
| Vocal tremor | Low-frequency oscillations in pitch and amplitude | Indicates stress, anxiety, and extreme cognitive load |
| Pause duration and frequency | Hesitations reflect working memory load and planning difficulty | Longer, more frequent pauses → higher cognitive demand |

These paralinguistic features are distinct from the content of speech - what matters is how the words are spoken, not what is said. This separation allows cognitive monitoring from speech even in contexts where the content cannot be recorded for privacy reasons.
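As a concrete illustration, the following is a minimal sketch of extracting several of these features with the open-source `librosa` library. The file path, pitch range, jitter proxy, and pause threshold are illustrative assumptions, not a prescribed pipeline; Praat-style cycle-to-cycle jitter and shimmer would require pitch-period extraction (e.g. via `parselmouth`).

```python
import numpy as np
import librosa

# Load a mono speech segment at 16 kHz (path is illustrative).
y, sr = librosa.load("segment.wav", sr=16000)

# Fundamental frequency (F0) via probabilistic YIN; unvoiced frames are NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[~np.isnan(f0)]

# Energy / loudness proxy: per-frame root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

# MFCCs: 13 coefficients summarising vocal tract shape per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Crude jitter proxy: mean absolute frame-to-frame F0 change relative
# to mean F0 (an assumption; true jitter is defined cycle-to-cycle).
jitter_proxy = np.mean(np.abs(np.diff(f0_voiced))) / np.mean(f0_voiced)

# Crude pause proxy: fraction of frames below an energy threshold
# (the 10%-of-peak threshold is an assumption).
pause_ratio = np.mean(rms < 0.1 * rms.max())

features = {
    "f0_mean": float(np.mean(f0_voiced)),
    "f0_std": float(np.std(f0_voiced)),
    "rms_mean": float(np.mean(rms)),
    "mfcc_mean": mfcc.mean(axis=1),
    "jitter_proxy": float(jitter_proxy),
    "pause_ratio": float(pause_ratio),
}
```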

Role in Brain Foundation Models

Speech is a hybrid modality for the Brain FM: large general-purpose speech pre-training corpora exist (LibriSpeech, Common Voice), but these contain no cognitive state labels and may differ acoustically from domain-specific operational communications. Direct SSL pre-training on in-domain speech is feasible but requires collecting a large volume of domain-specific audio.

The most tractable approach is:

  1. Pre-train a speech encoder (wav2vec 2.0, HuBERT) on large general speech corpora to extract robust acoustic representations.
  2. Fine-tune the encoder on labelled in-domain speech segments with NASA-TLX or stress labels (see the fine-tuning sketch after this list).
  3. Optionally align the speech encoder with the EEG encoder in the shared latent space for cross-modal fusion (see the alignment sketch after this list).
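A minimal sketch of steps 1-2, assuming a published wav2vec 2.0 checkpoint loaded via HuggingFace `transformers` and a hypothetical mean-pooled regression head predicting a scalar NASA-TLX workload score (the head architecture, learning rate, and data are illustrative):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechWorkloadModel(nn.Module):
    """wav2vec 2.0 encoder + regression head for a scalar NASA-TLX score."""

    def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        # Step 1: start from an encoder pre-trained on general speech.
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # Hypothetical head: mean-pool frame embeddings, predict workload.
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                        # (batch, hidden)
        return self.head(pooled).squeeze(-1)               # (batch,) TLX estimate

# Step 2: fine-tune on labelled in-domain segments (real data loader not shown).
model = SpeechWorkloadModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

waveform = torch.randn(2, 16000)   # stand-in for two 1-second segments
tlx = torch.tensor([42.0, 71.0])   # stand-in NASA-TLX labels
loss = loss_fn(model(waveform), tlx)
loss.backward()
optimizer.step()
```

When labelled in-domain segments are scarce, a common variant is to freeze the encoder for the first epochs and train only the head before unfreezing.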
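And a sketch of step 3, under the assumption that alignment uses a CLIP-style symmetric InfoNCE objective on paired, time-aligned speech/EEG embeddings; the EEG encoder, pairing procedure, and temperature are assumptions, not specified here:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(
    speech_emb: torch.Tensor,  # (batch, d) from the speech encoder
    eeg_emb: torch.Tensor,     # (batch, d) from the EEG encoder, time-aligned
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE: matched speech/EEG pairs attract, mismatches repel."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    logits = speech_emb @ eeg_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits))               # diagonal = true pairs
    return 0.5 * (
        F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
    )
```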