Ph.D. thesis defense held on February 13, 2026, at EPITA Paris, for the doctoral degree from Sorbonne University
Committee: Jean-François Bonastre (AMIAD), Benjamin Lecouteux (LIG), Driss Matrouf (LIA), Irina Illina (LORIA/Inria), Anthony Larcher (LIUM), and Douglas Reynolds (MIT LL) • Advisors: Réda DEHAK (LRE) and Thierry Géraud (LRE)
Advances in Artificial Intelligence, driven by developments in Deep Learning, have led to tremendous progress in Speech Processing. In the context of Speaker Recognition (SR), the training objective is to associate an audio sample with the corresponding speaker identity. However, the performance of such supervised systems is highly dependent on the amount of labeled data available. This inherent reliance on human supervision is a major limitation since annotations are expensive and time-consuming to obtain, prone to bias, and often limited in scope, all of which can hinder scalability and generalization. This poses a particular challenge in speech domains, where collecting labeled audio across all languages (with over 7,000 languages spoken worldwide), speaker profiles (e.g., age, gender), and conditions (e.g., recording device, environmental noise) is not feasible. Self-Supervised Learning (SSL) has recently emerged as a promising approach for learning relevant representations without human annotations, drawing inspiration from how humans learn through patterns and context rather than explicit labels. While SSL has proven effective across many downstream tasks, several applications remain underexplored. This thesis contributes to this fast-evolving paradigm for SR, toward greater generalization and reduced reliance on labeled data.
Benchmark and study SSL frameworks (e.g., SimCLR, MoCo, DINO) on Speaker Verification (SV) under controlled conditions
→ Identify the role and limitations of positive sampling in modeling intra-speaker variability
Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (PDF)
Self-Supervised Learning for Speaker Recognition: A study and review (PDF)
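To make the benchmarked contrastive objective concrete, here is a minimal NumPy sketch of a SimCLR-style NT-Xent loss over two augmented views of the same utterances. This is an illustration, not the thesis code: the batch shape, temperature, and use of NumPy (rather than PyTorch) are assumptions for readability.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.07):
    """SimCLR-style NT-Xent loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N utterances.
    Positives are (i, i+N) pairs; all other in-batch pairs act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, D)
    sim = (z @ z.T) / temperature                 # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    n = z1.shape[0]
    # the positive of view i is the other view of the same utterance
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # numerically stable log-softmax over each row
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Identical views yield a near-zero loss, while unrelated views sit near log(2N-1), which is what makes the choice of positive pairs (studied above) so central to what the model learns.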
Integrate CosFace, ArcFace, AdaFace, and other margin-based constraints into SimCLR and MoCo
→ Improve speaker separability in fully self-supervised settings
Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification (PDF)
Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (PDF)
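The additive-margin idea transplanted from supervised losses (CosFace/AM-Softmax) into a contrastive setting can be sketched as follows: the positive cosine similarity is penalized by a margin before the softmax, forcing tighter speaker clusters. A minimal NumPy illustration, with only cross-view negatives for brevity; the margin and temperature values are illustrative assumptions, not the thesis hyperparameters.

```python
import numpy as np

def am_nt_xent(z1, z2, margin=0.1, temperature=0.07):
    """Contrastive loss with an additive margin on the positive pair.

    cos(i, i) pairs on the diagonal are positives; subtracting `margin`
    from them makes the objective strictly harder, pushing positives
    to exceed negatives by at least the margin.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    cos = z1 @ z2.T                                    # (N, N)
    logits = cos.copy()
    np.fill_diagonal(logits, np.diag(cos) - margin)    # penalize positives
    logits /= temperature
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.diag(log_prob).mean()
```

With margin = 0 this reduces to the plain contrastive loss; any positive margin strictly increases the loss for the same embeddings, which is the source of the extra discriminative pressure.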
Exploit latent-space proximity to sample cross-recording pseudo-positives
→ Reduce intra-speaker variability and improve SV performance across frameworks (-58% EER for SimCLR)
Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (PDF)
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification (PDF)
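The cross-recording positive sampling above can be sketched as a nearest-neighbor search in the latent space, restricted to candidates from a different recording than the anchor. This is a simplified NumPy illustration of the idea, not the SSPS implementation: the memory-bank layout and the use of a single nearest neighbor are assumptions.

```python
import numpy as np

def sample_pseudo_positives(anchors, anchor_recordings, bank, bank_recordings):
    """For each anchor embedding, return the index of its most similar
    bank embedding that comes from a DIFFERENT recording.

    anchors: (A, D), bank: (B, D); *_recordings are integer recording ids.
    Sampling positives across recordings exposes the model to channel and
    session variability that same-utterance augmentations cannot provide.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = a @ b.T                                           # (A, B) cosines
    same = anchor_recordings[:, None] == bank_recordings[None, :]
    sim[same] = -np.inf                # forbid same-recording candidates
    return np.argmax(sim, axis=1)      # index of the pseudo-positive
```

The returned indices would then replace (or complement) augmentation-based positives in the contrastive or self-distillation objective.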
Develop an iterative pseudo-labeling approach to enable WavLM fine-tuning from a DINO-based model
→ 1.06% EER on VoxCeleb1-O, setting a new SOTA and approaching supervised performance
Towards Supervised Performance on Speaker Verification with SSL by Leveraging Large-Scale ASR Models (PDF)
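One round of the iterative pseudo-labeling loop can be sketched as clustering the current model's utterance embeddings and treating cluster ids as speaker pseudo-labels for the next fine-tuning pass. A minimal NumPy k-means sketch under illustrative assumptions (the thesis pipeline's clustering algorithm, number of clusters, and stopping criterion are not reproduced here).

```python
import numpy as np

def kmeans_pseudo_labels(emb, k, iters=20, seed=0):
    """One pseudo-labeling round: k-means over (N, D) utterance embeddings.

    The returned cluster ids act as speaker pseudo-labels; in an iterative
    scheme, the model is fine-tuned on them, embeddings are re-extracted,
    and the process repeats with (hopefully) cleaner labels each round.
    """
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]  # init from data
    for _ in range(iters):
        # assign each embedding to its nearest center
        d = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for c in range(k):
            pts = emb[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels
```

In practice the label quality (not the clustering mechanics) is the limiting factor, which is why iterating with a stronger backbone such as WavLM closes the gap to supervised training.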
Release a PyTorch toolkit to support reproducibility and future research in the field
→ https://github.com/theolepage/sslsv