Speech

Speech is a uniquely convenient physiological modality for cognitive monitoring: it requires no body-worn sensor, only a microphone.

Cognitive Relevance

Stress and cognitive load manifest in measurable acoustic characteristics of speech - changes that occur involuntarily in the speaker's voice regardless of the verbal content:

| Acoustic Feature | Mechanism | Cognitive / Physiological Association |
| --- | --- | --- |
| Fundamental frequency (F0 / pitch) | Increased vocal fold tension under sympathetic arousal | Rises with stress and emotional arousal |
| Speech rate | Accelerates under time pressure (temporal demand); slows with cognitive fatigue | Faster rate → high temporal demand; slower rate → fatigue or confusion |
| Energy / loudness | Increased muscular effort under arousal | Increases with stress and urgency |
| MFCCs | Mel-frequency cepstral coefficients capture vocal tract shape | Sensitive to vocal tract tension patterns under stress |
| Jitter | Cycle-to-cycle micro-perturbations in pitch (frequency variation) | Increases with physiological stress and fatigue |
| Shimmer | Cycle-to-cycle micro-perturbations in amplitude | Increases with physiological stress |
| Vocal tremor | Low-frequency oscillations in pitch and amplitude | Indicates stress, anxiety, and extreme cognitive load |
| Pause duration and frequency | Hesitations reflect working memory load and planning difficulty | Longer, more frequent pauses → higher cognitive demand |

These paralinguistic features are distinct from the content of speech - what matters is how the words are spoken, not what is said. This separation allows cognitive monitoring from speech even in contexts where the content cannot be recorded for privacy reasons.
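As a concrete illustration, the following is a minimal sketch of extracting several of these features with the open-source `librosa` library. The file path, pitch range, jitter proxy, and pause threshold are illustrative assumptions, not a prescribed pipeline; Praat-style cycle-to-cycle jitter and shimmer would require pitch-period extraction (e.g. via `parselmouth`).

```python
import numpy as np
import librosa

# Load a mono speech segment at 16 kHz (path is illustrative).
y, sr = librosa.load("segment.wav", sr=16000)

# Fundamental frequency (F0) via probabilistic YIN; unvoiced frames are NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[~np.isnan(f0)]

# Energy / loudness proxy: per-frame root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

# MFCCs: 13 coefficients summarising vocal tract shape per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Crude jitter proxy: mean absolute frame-to-frame F0 change relative
# to mean F0 (an assumption; true jitter is defined cycle-to-cycle).
jitter_proxy = np.mean(np.abs(np.diff(f0_voiced))) / np.mean(f0_voiced)

# Crude pause proxy: fraction of frames below an energy threshold
# (the 10%-of-peak threshold is an assumption).
pause_ratio = np.mean(rms < 0.1 * rms.max())

features = {
    "f0_mean": float(np.mean(f0_voiced)),
    "f0_std": float(np.std(f0_voiced)),
    "rms_mean": float(np.mean(rms)),
    "mfcc_mean": mfcc.mean(axis=1),
    "jitter_proxy": float(jitter_proxy),
    "pause_ratio": float(pause_ratio),
}
```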

Role in Brain Foundation Models

Speech is a hybrid modality for the Brain FM: large general-purpose speech pre-training corpora exist (LibriSpeech, Common Voice), but these contain no cognitive state labels and may differ acoustically from domain-specific operational communications. Direct SSL pre-training on in-domain speech is feasible but requires collecting a large volume of domain-specific audio.

The most tractable approach is:

  1. Pre-train a speech encoder (wav2vec 2.0, HuBERT) on large general speech corpora to extract robust acoustic representations.
  2. Fine-tune the encoder on labelled in-domain speech segments with NASA-TLX or stress labels (see the fine-tuning sketch after this list).
  3. Optionally align the speech encoder with the EEG encoder in the shared latent space for cross-modal fusion (see the alignment sketch after this list).
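A minimal sketch of steps 1-2, assuming a published wav2vec 2.0 checkpoint loaded via HuggingFace `transformers` and a hypothetical mean-pooled regression head predicting a scalar NASA-TLX workload score (the head architecture, learning rate, and data are illustrative):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechWorkloadModel(nn.Module):
    """wav2vec 2.0 encoder + regression head for a scalar NASA-TLX score."""

    def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        # Step 1: start from an encoder pre-trained on general speech.
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # Hypothetical head: mean-pool frame embeddings, predict workload.
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                        # (batch, hidden)
        return self.head(pooled).squeeze(-1)               # (batch,) TLX estimate

# Step 2: fine-tune on labelled in-domain segments (real data loader not shown).
model = SpeechWorkloadModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

waveform = torch.randn(2, 16000)   # stand-in for two 1-second segments
tlx = torch.tensor([42.0, 71.0])   # stand-in NASA-TLX labels
loss = loss_fn(model(waveform), tlx)
loss.backward()
optimizer.step()
```

When labelled in-domain segments are scarce, a common variant is to freeze the encoder for the first epochs and train only the head before unfreezing.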
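And a sketch of step 3, under the assumption that alignment uses a CLIP-style symmetric InfoNCE objective on paired, time-aligned speech/EEG embeddings; the EEG encoder, pairing procedure, and temperature are assumptions, not specified here:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(
    speech_emb: torch.Tensor,  # (batch, d) from the speech encoder
    eeg_emb: torch.Tensor,     # (batch, d) from the EEG encoder, time-aligned
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE: matched speech/EEG pairs attract, mismatches repel."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    logits = speech_emb @ eeg_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits))               # diagonal = true pairs
    return 0.5 * (
        F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
    )
```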