
Image by Vecstoc, from Freepik
New AI Model Stops Voice Cloning with “Machine Unlearning”
South Korean researchers have developed a new way to make AI voice generators “forget” how to imitate specific people’s voices.
In a rush? Here are the quick facts:
- The method reduces voice-mimicry accuracy by over 75%.
- Allowed voices still work, with only 2.8% performance loss.
- The system needs 5 minutes of audio to forget a speaker.
The “machine unlearning” system aims to stop the misuse of voice-cloning technology by scammers and deepfake creators.
Current zero-shot text-to-speech (ZS-TTS) models require only a few seconds of audio to create realistic voice imitations of any person. “Anyone’s voice can be reproduced or copied with just a few seconds of their voice,” said Jong Hwan Ko, a professor at Sungkyunkwan University, as reported by MIT Technology Review.
This opens the door to serious privacy and security concerns, such as impersonation and fraud.
Ko’s research team developed Teacher-Guided Unlearning (TGU), the first system that trains AI models to forget how to produce specific people’s voices. They explain in their paper that instead of blocking requests with filters (called “guardrails”), the technique modifies the model itself so that the voice data becomes inaccessible to the system.
When prompted to generate speech in a forgotten voice, the updated AI model returns a random voice instead. This randomness, the researchers argue, proves that the original voice has been successfully erased. In tests, the AI was 75% less accurate at mimicking the removed voice, yet performance for allowed voices dropped only slightly (by 2.8%).
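The core training idea described above can be sketched as a simple objective: for speakers the model is allowed to keep, the student model matches a frozen copy of the original (the “teacher”); for speakers being forgotten, it instead matches the teacher driven by a randomly sampled speaker, so the forgotten voice maps to a random one. The snippet below is a minimal illustrative sketch, not the paper’s actual code; the function names, the use of mean-squared error, and the feature vectors are all assumptions.

```python
import numpy as np

def mse(a, b):
    # Mean-squared error between two acoustic-feature vectors (hypothetical stand-in
    # for whatever distance the real TGU objective uses).
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def tgu_loss(student_out, teacher_out, teacher_random_out, is_forget):
    """Toy Teacher-Guided Unlearning loss.

    student_out        -- student model's output for the requested speaker
    teacher_out        -- frozen teacher's output for that same speaker
    teacher_random_out -- frozen teacher's output for a randomly sampled speaker
    is_forget          -- True if the requested speaker is on the forget list
    """
    # Retained speakers: stay close to the original model's behavior.
    # Forgotten speakers: match a random voice instead, erasing the target identity.
    target = teacher_random_out if is_forget else teacher_out
    return mse(student_out, target)

# Tiny demo with fake 4-dimensional "acoustic features".
student = np.zeros(4)
teacher = np.ones(4)
teacher_random = np.full(4, 2.0)

retain_loss = tgu_loss(student, teacher, teacher_random, is_forget=False)
forget_loss = tgu_loss(student, teacher, teacher_random, is_forget=True)
```

Minimizing `forget_loss` over many randomly drawn speakers is what pushes the model toward emitting an arbitrary voice, rather than silence or an error, when a forgotten speaker is requested.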
The method requires only five minutes of audio recordings from each speaker to complete its process. Though still at an early stage, the work shows promise, according to experts. “This is one of the first works I’ve seen for speech,” said Vaidehi Patil, a PhD student at UNC-Chapel Hill, as reported by MIT Technology Review.