New Tools Allow Voice Patterns To Be Cloned To Produce Realistic But Fake Sounds Of Anyone Saying Anything

from the shopped-images-are-so-yesterday dept

Fake images, often produced using sophisticated software like Photoshop or the GIMP, were around long before so-called “fake news” became an issue. They are part and parcel of the Internet’s fast-moving creative culture, and a trap for anyone who passes on striking images without checking their provenance or plausibility. Until now, this kind of artful manipulation has been limited to the visual sphere. But a new generation of tools will soon allow entire voice patterns to be cloned from relatively small samples, with fidelity high enough that the fakes can be hard to spot. For example, in November last year, the Verge wrote about Adobe’s Project VoCo:

“When recording voiceovers, dialog, and narration, people would often like to change or insert a word or a few words due to either a mistake they made or simply because they would like to change part of the narrative,” reads an official Adobe statement. “We have developed a technology called Project VoCo in which you can simply type in the word or words that you would like to change or insert into the voiceover. The algorithm does the rest and makes it sound like the original speaker said those words.”

Since then, things have moved on apace. Last week, the Economist wrote about the French company CandyVoice:

Utter 160 or so French or English phrases into a phone app developed by CandyVoice, a new Parisian company, and the app’s software will reassemble tiny slices of those sounds to enunciate, in a plausible simulacrum of your own dulcet tones, whatever typed words it is subsequently fed. In effect, the app has cloned your voice.

The Montreal company Lyrebird has a page full of fascinating demos of its own voice cloning technology, which requires even less in the way of samples:

Lyrebird will offer an API to copy the voice of anyone. It will need as little as one minute of audio recording of a speaker to compute a unique key defining her/his voice. This key will then allow to generate anything from its corresponding voice. The API will be robust enough to learn from noisy recordings. The following sample illustrates this feature, the samples are not cherry-picked.

Please note that those are artificial voices and they do not convey the opinions of Donald Trump, Barack Obama and Hillary Clinton.

As Techdirt readers will have spotted, this technical development raises big ethical questions, articulated here by Lyrebird:

Voice recordings are currently considered as strong pieces of evidence in our societies and in particular in jurisdictions of many countries. Our technology questions the validity of such evidence as it allows to easily manipulate audio recordings. This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.

The Economist quantifies the problem. According to its article, voice-biometrics software similar to the kind deployed by many banks to block unauthorized access to accounts was fooled 80% of the time in tests using the new technology. Humans didn’t do much better, only spotting that a voice had been cloned 50% of the time. And remember, these figures are for today’s technologies. As algorithms improve, and Moore’s Law kicks in, it’s not unreasonable to think that it will become almost impossible to tell by ear whether the voice you hear is the real thing, or a version generated using the latest cloning technology.

Follow me @glynmoody on Twitter, and +glynmoody on Google+


Comments on “New Tools Allow Voice Patterns To Be Cloned To Produce Realistic But Fake Sounds Of Anyone Saying Anything”

Anonymous Coward says:

Reminds me of that episode of “The Clone Wars” where Obi-Wan had to infiltrate a group of bounty hunters by posing as one named Rako Hardeen.

In that episode he had to ingest a robot that had copied Hardeen’s voice so he could sound like him.

Makes me wonder if Star Wars tech isn’t as science fantasy as people thought.

Anonymous Coward says:

While I personally am really excited about the possibilities, I can’t help but think cops are, too. Coupled with so-called exonerating phrases like “stop resisting,” they may just try to make a situation look like they were justified in deploying even an RPG (perhaps by making it sound as if an arrestee hurled abusive/menacing words at the cops?). After all, they’ve never really shied away from deceit and manipulation.

Anonymous Coward says:

As Techdirt readers will have spotted, this technical development raises big ethical questions

There are probably some ethical questions to ask here, but that statement doesn’t talk about any of them. It, at most, introduces a question of the efficacy of police/criminal justice best practices. That’s not ethical, it’s procedural. The only vaguely ethical conundrum it alludes to is whether or not we should execute anyone who displays talent in any scientific or engineering field to prevent technology driven change.

Peter says:

Maybe someday...

Let’s be honest, the current state of the art isn’t fooling anyone. Just listen to the Lyrebird samples: while you can tell who the famous person is supposed to be, it’s still got the drunken Swedish robot quality that has plagued text-to-speech engines forever.

The very best TTS these days is very very good, but it still requires carefully collecting phonemes and a lot of work to make it sound realistic. Even then, it’s generally distinguishable from a real voice within a sentence or two.

Color me skeptical, but I think the nightmare scenario of creating forensically-realistic fake audio from just a few minutes of voice sample is a long way away. The old-fashioned way of splicing together words and phrases is still better.

PaulT (profile) says:

Re: Maybe someday...

Three thoughts come to mind:

1. So what if it’s a long way away, should the implications of the tech be ignored until someone perfects it?

2. These things do tend to improve exponentially, so it could be a lot sooner than you think.

“Even then, its generally distinguishable from a real voice within a sentence or two.”

3. Given the tendency for political debate to be driven by soundbites and for people to jump to conclusions based on a couple of seconds of video, that might be all that’s needed.

Anonymous Coward says:

Re: Maybe someday...

As someone who’s done graduate-level research in this area, let me comment on that. The algorithms in use here are being driven by a limited number of voice samples; clearly, if the size of the training set increases, so will the accuracy of the output. We’ve already seen similar rapid progress in image and video manipulation, so there’s no reason not to expect the same here.

The reason that you can, currently, readily detect that the output isn’t real is that you’re a human being who’s evolved an extraordinary auditory sense over millennia. Of all our senses, it’s arguably the most highly developed, which is why, for example, we can detect a musical note that’s only a tiny fraction off or recognize each other with a sample size of one word. In other words, our ability to detect ersatz speech is much better than our ability to detect ersatz pictures.

But this technology, or one like it, will eventually confound that too. Whether it takes a year or twenty, it’s coming. So just as “pictures don’t lie” is now obsolete, we’ll have to change our standards for evidence to cope.

Anonymous Coward says:

Re: Re: Maybe someday...

One thing that I have observed is that humans are very good at detecting changes in background noises. This has been a way of detecting edits to sound for a very long time, and it is part of the reason that films and videos use so much background music: it masks the glitches in background noise across edits.

Conversely, if you want to protect a recording from alteration, play some songs at low level in the background, as that will make changing the recording hugely more difficult, in both separating your words from the background, and in syncing up the replacement background.

John says:

This could make future voice recordings invalid as evidence

With this tech it makes you wonder how future wire taps and similar voice recordings could be used as evidence in court. It would be “easy” to fabricate voice recordings or have the audio play in an environment in which the voice would be recorded, which could result in charges being brought against a person.

Se Habla Espol says:

Re: This could make future voice recordings invalid as evidence

I think you’ll find that chain-of-evidence rules require that any such evidence be attested by a human, under oath, claiming that he performed the recording being offered. Other possibilities exist, but they amount to swearing as to knowledge of the authenticity of the offered evidence.

Ninja (profile) says:

Re: This could make future voice recordings invalid as evidence

That’s one of the most important aspects of this issue. I’d go even further though. Countries like China could use the technology to rewrite history as they please. Technologies that allow the production of full videos, with voice and all, that are very hard to distinguish from reality can have very real and devastating consequences. One more reason to doubt everything unless there’s a way to trace the ‘supply chain’ of the thing. Maybe we are entering an era of zero trust. Which may be a good thing since people will try to develop systems that don’t rely on trust to operate and produce reliable, trustworthy results (CAs came to mind instantly because they are already living that trust crisis).

Thad (user link) says:

Re: Re: This could make future voice recordings invalid as evidence

Countries like China could use the technology to rewrite history as they please.

Nah, not China, not anymore. Maybe North Korea. In China, and even countries like Iran, despite the government’s best efforts the public can still get access to the open internet.

That’s where Orwell was wrong: he lived in an era where the government could control the public’s access to mass communications media, and he assumed that would still be the case in the future. It’s not, except in nations with crippling poverty like NK.

China still does just fine with its disinformation campaigns, of course. And anyone, even Alex Jones, can make outlandish claims and convince some people that they’re true. But China doesn’t have the propaganda stranglehold on its public that it used to, and the way I see it, improvements to technology will benefit the public’s ability to see through bullshit more than the governments’ ability to create it.

(Whether or not people actually see through the bullshit is, to my mind, a separate issue. There are plenty of people who will believe what they want to believe regardless of evidence; more realistic fakes will color that issue but I don’t think they’ll fundamentally change it.)

PaulT (profile) says:

Re: Re:

That raises an interesting idea in my mind. If people are conditioned to ignore audio & video evidence because it’s often faked, how much are people going to get away with because people distrust the evidence? You can literally film someone red handed, and they just have to raise the idea that footage has been tampered with to introduce reasonable doubt and get away with it.

In fact, how would news reporting work, given that nobody trusts first hand accounts any more even when accurate audio & video evidence is gathered?

Anonymous Coward says:

Yep it’s pretty scary in terms of potential abuse by government, LEO, etc.
But I can’t help think this would work great in games:
Being able to generate dynamic NPC dialog without having to record hundreds of hours.
Calling the player by their actual chosen full name.
Imagine something like DA:O only fully voiced this time. Yes, even your character’s lines.

Anonymous Coward says:

Re: Re:

You could even have the player’s own voice for his or her character – during character creation, put a blurb of text on the screen and ask the player to read it out loud.

And the modding scene would take off. New quests would only need text typed into a database if the modder is happy with the existing library of in-game voices.

Anonymous Coward says:

Re: Re: Re:

This could potentially destroy the market for voice actors in video games and animated movies/videos. I wonder how they will protect the use of their voice in games or other media, especially as the samples to seed the algorithm could likely be taken from someone who sounds like the actor.

Anonymous Coward says:

Re: Re: Re: Re:

Actors (or their agencies) could license their own sample libraries. Or they could just refuse to record & license samples.

Western gamers don’t generally buy games based on voice actor casting. The studio’s name matters more than who’s voicing.

For example Lara Croft was voiced by at least 5 voice actors. And Cole MacGrath was voiced by 2 actors.

Anyway the VA market is a lot bigger in East Asia (S. Korea, Japan and maybe China) than it is here.

Joe P says:

scary possibilities

When terrorists use fake video with voice to convince the masses that their legitimate leaders are corrupt (e.g. Pope saying kill all the ..) then we are really screwed. A war could be started or just one lone wolf converted to the cause.

We need to teach everyone to be skeptical, inquisitive, and knowledgeable of the many ways people can be manipulated.

Stephen says:

With all due respect to Lyrebird, the examples of Obama, Trump, and Hillary talking on the demo page referenced in the article are all too obviously fake. The stilted, machine-like monotone gives them away as computer-generated voices.

Real people don’t talk like that.

If Lyrebird want authenticity they need to try harder to get rid of those qualities.

Anonymous Coward says:

could this be the savior of Reality TV?

The so-called “reality” TV shows will love this because it will be much quicker and easier for them to create fake dialog than their current method of painstakingly splicing a person’s spoken words together. And presumably much less fake sounding than the often sloppy splicing of multi-toned speech snippets.

The next logical innovation for “reality” shows may well be the ability to “photoshop” these synthesized words into people’s mouths so the camera won’t be forced to cut away whenever they “speak” spliced words.

… but on the other hand, wouldn’t it be so much easier to just give these “reality” actors an actual script instead of creating dialog in the editing room?

Anonymous Coward says:

Fake voice, ID Theft and Wall Street/401K

Kiss your 401K retirement money goodbye. If Wall Street itself doesn’t steal your retirement money then the ID thieves (who get the tool and a voice sample) will!

Remember, the scum on Wall Street use your recorded voice (on the phone) and the personal identity information (routinely exposed by Wall Street’s own butt kissing firms) to “establish” your identity. You/We are truly screwed.

The next time Al-Qaeda or ISIS or the Mob or the Drug Lords attack Wall Street (and Washington), I think I’m not likely to care too much. Wall Street/Washington is looting us so viciously, that their enemies attacking them is very low on my list of concerns!

Rekrul says:

If I didn’t know it was fake, the Trump speech samples might have fooled me, but Obama doesn’t sound right and Hillary sounds just like one of those robotic, female text-to-speech apps.

The problem isn’t so much matching the pitch and such of a specific voice, it’s writing the software to properly pronounce words. People have been putting up videos on YouTube with artificial voices for years. Many of them are very good and sound almost perfect, but then they mispronounce a word and you realize that it’s a machine.

Griffdog (profile) says:

Benefits for communications

In these days of broadband communications, it’s hard to remember that there are still some very low data rate channels in use. Meteor burst, VLF, and others offer some unique propagation benefits, but at speeds that were already eclipsed by 1980’s era telephone modems. So, imagine that your communications set already has the voice parameters of the people you’re most likely to talk with. Now, by simply exchanging text at a low data rate, your comm gear can convert the words into realistic voices that actually sound like the people with whom you’re talking. Real-time conversations on channels that are running 75 bits per second, or less. Just add some encryption and authentication protocols, and Bob’s your uncle.

Hats off to science fiction author David Drake and his Hammer’s Slammers series, where hovercraft tank commanders use this approach to hold voice conversations via radio waves bounced off of the ionized trails left by the small meteors that constantly burn up in the atmosphere; a very robust but low data rate communications channel.
