Current research in end-to-end speech deepfake detection predominantly centers on feeding “raw” waveforms into a deep architecture, such as RawNet2, and training the network to predict whether the waveforms are fake. However, processing waveforms directly can over-parameterize the network and reduce its generalizability. To overcome this limitation, we propose a multi-level variational regularization framework that integrates a modified Variational Autoencoder (VAE) with discriminative constraints. Specifically, we adopt a VAE with a deepfake discrimination constraint to regularize a RawNet2-based high-level feature map (HFM) extractor. Experimental results show that the proposed variational regularization yields HFM features that improve the performance of AASIST, SERawformer, and RawBMamba by 36.01%, 10.07%, and 6.35%, respectively.
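As a hypothetical illustration of how a VAE with a deepfake discrimination constraint could regularize an HFM extractor, consider the minimal PyTorch sketch below. It is not the paper's implementation: the module name VariationalRegularizer, the pooled-HFM input, the latent dimension, and the equal loss weighting are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalRegularizer(nn.Module):
    """Hypothetical VAE regularizer with a deepfake discrimination constraint.

    The encoder maps a pooled high-level feature map (HFM) to a Gaussian
    posterior q(z|h); the decoder reconstructs the HFM from z; an auxiliary
    head classifies z as bona fide vs. spoofed (the discriminative constraint).
    """

    def __init__(self, feat_dim: int, latent_dim: int = 64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * latent_dim)  # outputs (mu, logvar)
        self.dec = nn.Linear(latent_dim, feat_dim)      # reconstructs the HFM
        self.cls = nn.Linear(latent_dim, 1)             # bona fide / spoof logit

    def forward(self, hfm: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # hfm: (batch, feat_dim) pooled HFM; labels: (batch,) 0 = bona fide, 1 = spoof
        mu, logvar = self.enc(hfm).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization

        recon_loss = F.mse_loss(self.dec(z), hfm)                           # reconstruction
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        disc_loss = F.binary_cross_entropy_with_logits(                     # discrimination
            self.cls(z).squeeze(-1), labels.float())

        # Equal weighting is an assumption; the paper's loss weights are not given here.
        return recon_loss + kl_loss + disc_loss
```

Because this loss is differentiated end to end, its gradients flow back into the RawNet2-based extractor that produced the HFM; under this sketch, that back-propagated signal is what regularizes the extractor, and the regularizer can be discarded at inference time.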
Variational regularization for end-to-end speech deepfake detection
APSIPA 2025, 17th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 22-24 October 2025, Shangri-La, Singapore
Type: Conference
City: Singapore
Date: 2025-10-22
Department: Digital Security
Eurecom Ref: 8514
Copyright: © 2025 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
See also: https://www.eurecom.fr/publication/8514