Mimicking Humanity: a Synthetic Data-Based Approach to Voice Cloning in Text-to-Speech Systems

EasyChair Preprint 15780
6 pages • Date: January 29, 2025

Abstract

Voice-based training data was often costly and challenging to obtain, leading to significant barriers in building high-performing models. Insufficient training data frequently resulted in overfitting and compromised model quality. An alternative approach involved generating synthetic data using publicly available tools, which provided a scalable and cost-effective solution to address these challenges. This study compared the performance of two models with identical architectures: one trained exclusively on human speech data and another trained entirely on synthetic audio. The evaluation demonstrated that the model trained with synthetic data outperformed the one trained with human data, primarily due to the availability of a substantially larger synthetic dataset. The findings highlighted the potential of high-quality synthetic data to serve as a viable replacement for real-world datasets, particularly in applications where data collection posed logistical, ethical, or financial challenges. The results underscored the effectiveness of synthetic data in training multimedia models, paving the way for broader adoption in diverse applications, including text-to-speech systems and beyond.

Keyphrases: Natural Language Processing (NLP), Privacy, Speaker Generalization, Text-to-Speech (TTS), Voice Cloning, data scarcity, ethical concerns, linguistic diversity, speaker embeddings, speech synthesis, synthetic data