
The Foundation Model Paradigm

Foundation models (FMs) have seen widespread use across domains such as language (BERT, GPT) and vision (MAE, DINOv2). The paradigm is simple: take a large encoder network (typically a transformer), pre-train it on vast unlabelled data using self-supervised objectives, then fine-tune it for specific tasks with small, labelled datasets.

Foundation models aim to solve the data-efficiency problem: how can we build accurate end-to-end models when labelled data is scarce? They do so by separating the representation-learning problem from the task-specific prediction problem, solving the former with large, abundant sets of unlabelled data.

Hence, foundation models break up the training into two stages:

  • Step 1 - Pre-training: A large encoder is trained on a large corpus of unlabelled data using a self-supervised objective.

  • Step 2 - Fine-tuning: The pre-trained encoder is adapted to a specific downstream task using a small labelled dataset; a lightweight decoder head and (optionally) a small fraction of the encoder weights are updated.
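The two stages can be sketched end to end. This is a toy illustration, not a real foundation model: a fixed random projection stands in for a pre-trained transformer encoder, and the "fine-tuning" step fits only a linear head on a small labelled set (all names and dimensions are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder: a fixed random projection.
# Step 1 (pre-training) would fit these weights with a self-supervised
# objective; here they are simply frozen.
W_enc = rng.normal(size=(16, 8))        # input dim 16 -> feature dim 8

def encode(x):
    return np.tanh(x @ W_enc)           # frozen pre-trained features

# Step 2 - fine-tuning: fit only a lightweight linear head on a small
# labelled dataset, leaving the encoder untouched.
X_small = rng.normal(size=(32, 16))     # 32 labelled examples
y_small = rng.normal(size=(32, 1))
feats = encode(X_small)
head, *_ = np.linalg.lstsq(feats, y_small, rcond=None)

preds = encode(X_small) @ head          # downstream task predictions
```

The key point is the asymmetry: the encoder sees no labels, and the labelled data only ever trains the small head.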

Pre-training

Foundation models use a variety of self-supervised pre-training objectives, all of which serve to optimize the encoder module.

The dominant pre-training objective is masked signal reconstruction: randomly mask a high fraction (typically 50-75%) of patches of the input signal and train the encoder-decoder to reconstruct them from unmasked context. This is directly analogous to BERT's masked language modelling and MAE's masked image modelling. The high masking ratio forces the encoder to learn long-range temporal dependencies and contextual representations, rather than simply copying local signal features.
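The masking step itself is straightforward. A minimal sketch of patch masking at a 75% ratio, assuming the signal has already been split into fixed-size patches (the placeholder zero reconstruction stands in for a real decoder's output):

```python
import numpy as np

rng = np.random.default_rng(0)

# A signal split into 20 patches of 8 samples each.
patches = rng.normal(size=(20, 8))

# Mask a high fraction (here 75%) of patch indices at random.
mask_ratio = 0.75
n_masked = int(round(mask_ratio * len(patches)))
masked_idx = rng.choice(len(patches), size=n_masked, replace=False)
visible_idx = np.setdiff1d(np.arange(len(patches)), masked_idx)

# The encoder sees only the visible patches; the decoder must
# reconstruct the masked ones, scored with mean-squared error.
visible = patches[visible_idx]
target = patches[masked_idx]
reconstruction = np.zeros_like(target)  # placeholder decoder output
loss = np.mean((reconstruction - target) ** 2)
```

Because only 25% of patches are visible, the encoder cannot rely on copying neighbouring values and must model longer-range structure.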

Variants include:

  • Masked autoencoder (MAE-style): reconstruct raw signal values at masked positions.
  • Masked spectrum prediction: reconstruct frequency-domain representations of masked patches. This is especially useful for signals whose spectral features carry more meaning than the raw waveform, since frequency-domain objectives guide the encoder towards representations that are more directly useful for downstream tasks.
  • Contrastive SSL: use contrastive objectives (e.g. SimCLR, BYOL, MoCo) to learn similar representations across augmented views of the same signal and dissimilar representations across views from different signals. This approach is sensitive to the choice of augmentations: naïve augmentations that preserve artefacts can lead the model to memorize sample-identifying features rather than learn meaningful representations.
  • JEPA-style prediction: predict the latent representation of masked patches rather than raw signal values, while a separate target encoder (EMA of context encoder) provides prediction targets. This approach avoids trivial solutions and the need to reconstruct high-frequency signal noise while encouraging semantic representations and being more robust to stochastic signal variations.
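The EMA target encoder in the JEPA-style setup is a simple recurrence. A minimal sketch (parameter names and the single-weight setup are illustrative):

```python
import numpy as np

# EMA update used by JEPA-style methods: the target encoder is a slowly
# moving average of the context encoder, so prediction targets stay
# stable while the context encoder trains.
def ema_update(target_params, context_params, decay=0.99):
    return {k: decay * target_params[k] + (1 - decay) * context_params[k]
            for k in target_params}

target = {"w": np.zeros(4)}
context = {"w": np.ones(4)}
for _ in range(10):
    target = ema_update(target, context, decay=0.9)
# target["w"] drifts toward context["w"] without ever receiving gradients
```

Only the context encoder is trained by backpropagation; the target encoder lags behind it, which is what prevents the trivial solution of both encoders collapsing to a constant.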

Fine-tuning

When fine-tuning a foundation model, careful consideration is given to the strategy by which the pre-trained encoder is adapted to the task, since strategies differ in their labelled-data requirements, compute cost, and risk of degrading the pre-trained representations.

The standard strategies for fine-tuning are:

  • Linear probing: The encoder is frozen and a single linear layer is trained on top of the extracted features using a labelled dataset. This strategy is often used to test how well the pre-training objective has organised the latent space.
  • Full fine-tuning: All encoder weights are updated alongside a task head. This strategy generally achieves better performance given sufficient labelled data, but risks catastrophic forgetting of the pre-trained representations if the labelled dataset is small.
  • LoRA / Adapter fine-tuning: The encoder weights are kept frozen, and small trainable modules are inserted into the network (e.g., low-rank update matrices in LoRA or bottleneck adapter layers between transformer blocks). Only these additional parameters are trained, enabling task adaptation with far fewer updated weights while largely preserving the pretrained representations.
  • Prompt tuning: The encoder is frozen and task-specific learnable prompt embeddings (“soft prompts”) are prepended or injected into the input sequence. Only these prompt vectors are optimized, steering the model’s behaviour toward the downstream task without modifying the core model parameters.
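The LoRA idea in particular reduces to a low-rank additive update on a frozen weight matrix. A minimal sketch, assuming a single linear layer (dimensions and scale factors are illustrative, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64      # layer width
r = 4       # LoRA rank (r << d)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d))                 # B starts at zero, so the adapted
                                     # layer initially equals the original

def adapted_layer(x):
    # Effective weight is W + A @ B; only A and B would receive gradients.
    return x @ W + x @ A @ B

x = rng.normal(size=(1, d))
assert np.allclose(adapted_layer(x), x @ W)  # identical at initialization

# Trainable parameters: 2 * d * r, versus d * d for full fine-tuning.
trainable = A.size + B.size
```

Here only 512 of the 4096 layer weights are trainable; the ratio improves further as `d` grows, which is why this family of methods scales well to very large encoders.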

These costs and recommended usage are summarised below:

Strategy                     When to use                                                  Compute cost
Linear probing               Very small labelled dataset; encoder should remain general   Lowest
Full fine-tuning             Sufficient labelled data; maximum performance desired        High
LoRA / Adapter fine-tuning   Small labelled dataset; memory-efficient adaptation          Low
Prompt tuning                Encoder is very large; only soft prompts are updated         Very low