Speech
Speech is a uniquely convenient physiological modality for cognitive monitoring: it requires no body-worn sensor, only a microphone.
Cognitive Relevance
Stress and cognitive load manifest in measurable acoustic characteristics of speech - changes that occur involuntarily in the speaker's voice regardless of the verbal content:
| Acoustic Feature | Mechanism | Cognitive / Physiological Association |
|---|---|---|
| Fundamental frequency (F0 / pitch) | Increased vocal fold tension under sympathetic arousal | Rises with stress and emotional arousal |
| Speech rate | Accelerates under time pressure (temporal demand); slows with cognitive fatigue | Faster rate → high temporal demand; slower rate → fatigue or confusion |
| Energy / loudness | Increased muscular effort under arousal | Increases with stress and urgency |
| MFCCs (mel-frequency cepstral coefficients) | Spectral envelope reflects vocal tract shape | Sensitive to vocal tract tension patterns under stress |
| Jitter | Micro-perturbations in pitch (cycle-to-cycle frequency variation) | Increases with physiological stress and fatigue |
| Shimmer | Micro-perturbations in amplitude | Increases with physiological stress |
| Vocal tremor | Low-frequency oscillations in pitch and amplitude | Stress, anxiety, and extreme cognitive load |
| Pause duration and frequency | Hesitations reflect working memory load and planning difficulty | Longer, more frequent pauses → higher cognitive demand |
These paralinguistic features are distinct from the content of speech - what matters is how the words are spoken, not what is said. This separation allows cognitive monitoring from speech even in contexts where content cannot be recorded for privacy reasons.
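Several of the features in the table above reduce to simple statistics over the signal. A minimal sketch in plain NumPy (all thresholds and frame sizes are illustrative assumptions, not calibrated values; production systems would use a dedicated toolkit such as Praat or openSMILE):

```python
import numpy as np

def frame_energy(x, frame_len=400, hop=160):
    """Short-time energy per frame (assumes 16 kHz audio: 25 ms frames, 10 ms hop)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n)])

def pause_ratio(x, frame_len=400, hop=160, rel_thresh=0.01):
    """Fraction of frames whose energy falls below 1% of the peak frame energy:
    a crude proxy for pause duration/frequency."""
    e = frame_energy(x, frame_len, hop)
    return float((e < rel_thresh * e.max()).mean())

def jitter_local(periods):
    """Local jitter: mean absolute cycle-to-cycle pitch-period difference,
    normalised by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(periods))) / periods.mean())

def shimmer_local(amps):
    """Local shimmer: same statistic applied to per-cycle peak amplitudes."""
    amps = np.asarray(amps, dtype=float)
    return float(np.mean(np.abs(np.diff(amps))) / amps.mean())

# Perfectly periodic voicing has zero jitter; any cycle-to-cycle
# perturbation makes it positive.
assert jitter_local([0.008, 0.008, 0.008]) == 0.0
assert jitter_local([0.008, 0.0082, 0.0079]) > 0.0
```

The per-cycle pitch periods and amplitudes fed to `jitter_local` and `shimmer_local` would come from a pitch tracker; extracting them reliably is the hard part in practice.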
Role in Brain Foundation Models
Speech is a hybrid modality for the Brain FM: large general-purpose speech pre-training corpora exist (LibriSpeech, Common Voice), but these contain no cognitive state labels and may differ acoustically from domain-specific operational communications. Direct SSL pre-training on in-domain speech is feasible but requires collecting large volumes of unlabelled in-domain audio.
The most tractable approach is:
- Pre-train a speech encoder on large general speech corpora (e.g. wav2vec 2.0, HuBERT) to extract robust acoustic representations.
- Fine-tune the encoder on labelled in-domain speech segments with NASA-TLX or stress labels.
- Optionally align the speech encoder with the EEG encoder in the shared latent space for cross-modal fusion.
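The fine-tuning step above can be sketched in miniature. The sketch below stands in for the real pipeline under loud assumptions: a frozen random projection plays the role of the pre-trained wav2vec 2.0/HuBERT encoder, the "labelled segments" are synthetic, and the head is plain linear regression on NASA-TLX-style scores; in practice one would fine-tune the actual encoder with a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 40 acoustic features per segment -> 16-dim embedding.
D_IN, D_EMB = 40, 16
W_enc = rng.normal(size=(D_IN, D_EMB)) / np.sqrt(D_IN)  # frozen "encoder"

def encode(x):
    """Stand-in for a frozen pre-trained speech encoder."""
    return np.tanh(x @ W_enc)

# Synthetic "labelled in-domain segments": for the demo, NASA-TLX-style
# scores (centred near 50) are generated from the frozen embedding itself,
# so a linear head can in principle recover them.
X = rng.normal(size=(256, D_IN))
v_true = rng.normal(size=D_EMB)
y = 50.0 + encode(X) @ v_true

# Fine-tuning: train only a linear regression head on the frozen embeddings.
Z = encode(X)
w, b = np.zeros(D_EMB), 0.0
lr = 0.05
for _ in range(2000):
    err = Z @ w + b - y
    w -= lr * Z.T @ err / len(y)
    b -= lr * err.mean()

rmse = float(np.sqrt(np.mean((Z @ w + b - y) ** 2)))
```

Keeping the encoder frozen and training only the head is the cheapest variant; with enough labelled segments, unfreezing the top encoder layers typically helps. Cross-modal alignment with the EEG encoder (the optional third step) would add a contrastive objective between paired speech and EEG embeddings, which is omitted here.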