Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as “noisy-TTS training” has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available and was created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve UTMOS scores above 3.0 with TITW-Easy, while TITW-Hard remains difficult, with UTMOS scores below 2.8. Beyond TTS, TITW’s unique design, which leverages an automatic speaker recognition dataset, strengthens ethical efforts to counteract malicious use of TTS models by supporting tasks such as speech deepfake detection.
The text-to-speech in the wild (TITW) database
Submitted to arXiv, 1 June 2025
      Type:
        Report
      Date:
        2025-06-01
      Department:
        Digital Security
      Eurecom Ref:
        8326
      Copyright:
        © 2025 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
      PERMALINK : https://www.eurecom.fr/publication/8326