Ph.D. thesis defense held on February 13, 2026, at EPITA Paris, for the doctoral degree from Sorbonne University
Committee: Jean-François Bonastre (AMIAD), Benjamin Lecouteux (LIG), Driss Matrouf (LIA), Irina Illina (LORIA/Inria), Anthony Larcher (LIUM), and Douglas Reynolds (MIT LL) • Advisors: Réda DEHAK (LRE) and Thierry Géraud (LRE)
Advances in Artificial Intelligence, driven by developments in Deep Learning, have led to tremendous progress in Speech Processing. In the context of Speaker Recognition (SR), the training objective is to associate an audio sample with the corresponding speaker identity. However, the performance of such supervised systems is highly dependent on the amount of labeled data available. This inherent reliance on human supervision is a major limitation since annotations are expensive and time-consuming to obtain, prone to bias, and often limited in scope, all of which can hinder scalability and generalization. This poses a particular challenge in speech domains, where collecting labeled audio across all languages (with over 7,000 languages spoken worldwide), speaker profiles (e.g., age, gender), and conditions (e.g., recording device, environmental noise) is not feasible. Self-Supervised Learning (SSL) has recently emerged as a promising approach for learning relevant representations without human annotations, drawing inspiration from how humans learn through patterns and context rather than explicit labels. While SSL has proven effective across many downstream tasks, several applications remain underexplored. This thesis contributes to this fast-evolving paradigm for SR, toward greater generalization and reduced reliance on labeled data.
Benchmark and study SSL frameworks (e.g., SimCLR, MoCo, DINO) on Speaker Verification (SV) under controlled conditions
→ Identify the role and limitations of positive sampling in modeling intra-speaker variability
Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (PDF)
Self-Supervised Learning for Speaker Recognition: A study and review (PDF)
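To make the benchmarked contrastive objective concrete, here is a minimal NumPy sketch of a SimCLR-style NT-Xent loss over two augmented views of the same utterances. This is an illustration, not the thesis code: the batch shape, temperature, and use of NumPy (rather than PyTorch) are assumptions for readability.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.07):
    """SimCLR-style NT-Xent loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N utterances.
    Positives are (i, i+N) pairs; all other in-batch pairs act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, D)
    sim = (z @ z.T) / temperature                 # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    n = z1.shape[0]
    # the positive of view i is the other view of the same utterance
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # numerically stable log-softmax over each row
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Identical views yield a near-zero loss, while unrelated views sit near log(2N-1), which is what makes the choice of positive pairs (studied above) so central to what the model learns.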
Integrate CosFace, ArcFace, AdaFace, and other margin-based constraints into SimCLR and MoCo
→ Improve speaker separability in fully self-supervised settings
Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification (PDF)
Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (PDF)
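The additive-margin idea transplanted from supervised losses (CosFace/AM-Softmax) into a contrastive setting can be sketched as follows: the positive cosine similarity is penalized by a margin before the softmax, forcing tighter speaker clusters. A minimal NumPy illustration, with only cross-view negatives for brevity; the margin and temperature values are illustrative assumptions, not the thesis hyperparameters.

```python
import numpy as np

def am_nt_xent(z1, z2, margin=0.1, temperature=0.07):
    """Contrastive loss with an additive margin on the positive pair.

    cos(i, i) pairs on the diagonal are positives; subtracting `margin`
    from them makes the objective strictly harder, pushing positives
    to exceed negatives by at least the margin.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    cos = z1 @ z2.T                                    # (N, N)
    logits = cos.copy()
    np.fill_diagonal(logits, np.diag(cos) - margin)    # penalize positives
    logits /= temperature
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.diag(log_prob).mean()
```

With margin = 0 this reduces to the plain contrastive loss; any positive margin strictly increases the loss for the same embeddings, which is the source of the extra discriminative pressure.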
Exploit latent-space proximity to sample cross-recording pseudo-positives
→ Reduce intra-speaker variability and improve SV performance across frameworks (-58% EER for SimCLR)
Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (PDF)
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification (PDF)
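The cross-recording positive sampling above can be sketched as a nearest-neighbor search in the latent space, restricted to candidates from a different recording than the anchor. This is a simplified NumPy illustration of the idea, not the SSPS implementation: the memory-bank layout and the use of a single nearest neighbor are assumptions.

```python
import numpy as np

def sample_pseudo_positives(anchors, anchor_recordings, bank, bank_recordings):
    """For each anchor embedding, return the index of its most similar
    bank embedding that comes from a DIFFERENT recording.

    anchors: (A, D), bank: (B, D); *_recordings are integer recording ids.
    Sampling positives across recordings exposes the model to channel and
    session variability that same-utterance augmentations cannot provide.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = a @ b.T                                           # (A, B) cosines
    same = anchor_recordings[:, None] == bank_recordings[None, :]
    sim[same] = -np.inf                # forbid same-recording candidates
    return np.argmax(sim, axis=1)      # index of the pseudo-positive
```

The returned indices would then replace (or complement) augmentation-based positives in the contrastive or self-distillation objective.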
Develop an iterative pseudo-labeling approach to enable WavLM fine-tuning from a DINO-based model
→ 1.06% EER on VoxCeleb1-O, setting a new SOTA and approaching supervised performance
Towards Supervised Performance on Speaker Verification with SSL by Leveraging Large-Scale ASR Models (PDF)
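One round of the iterative pseudo-labeling loop can be sketched as clustering the current model's utterance embeddings and treating cluster ids as speaker pseudo-labels for the next fine-tuning pass. A minimal NumPy k-means sketch under illustrative assumptions (the thesis pipeline's clustering algorithm, number of clusters, and stopping criterion are not reproduced here).

```python
import numpy as np

def kmeans_pseudo_labels(emb, k, iters=20, seed=0):
    """One pseudo-labeling round: k-means over (N, D) utterance embeddings.

    The returned cluster ids act as speaker pseudo-labels; in an iterative
    scheme, the model is fine-tuned on them, embeddings are re-extracted,
    and the process repeats with (hopefully) cleaner labels each round.
    """
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]  # init from data
    for _ in range(iters):
        # assign each embedding to its nearest center
        d = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for c in range(k):
            pts = emb[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels
```

In practice the label quality (not the clustering mechanics) is the limiting factor, which is why iterating with a stronger backbone such as WavLM closes the gap to supervised training.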
Release a PyTorch toolkit to support reproducibility and future research in the field
→ https://github.com/theolepage/sslsv