
Human- vs Artificially-Generated Content

[Image: Mel Blanc]

Hank Azaria has a fun piece in this morning's Times. Azaria of course has been the voice actor behind the roles of Moe Szyslak (the bartender), Chief Wiggum, Superintendent Chalmers, and a host of other characters on The Simpsons for the past 36 years.

In his op-ed – which is really creatively done, I encourage you to click over and read it for yourself – Mr. Azaria stresses that while artificial intelligence is getting close to being able to mimic the speech patterns of real people, there is still a very human component to voice acting that AI hasn't quite caught up with. That component involves actions that go beyond the script.

A stage, film, or TV actor uses movement, body language, facial expression, and other visual cues to complete their character. Saeed Jones (@theferocity.bsky.social) posted a live thread yesterday afternoon while he was watching Ridley Scott's "Gladiator II." Jones specifically called out Denzel Washington's movements in one part of that thread.

I tell you what tho, Denzel Washington knows how to work a flowing garment, bitch!!! He must have studied with drag queens before this movie. He is WORKING these capes and drapes and sashes and robes.


— Saeed Jones (@theferocity.bsky.social) February 3, 2025 at 3:59 PM

Mr. Azaria and others like him, such as the late Mel Blanc (Bugs Bunny, Daffy Duck, Elmer Fudd), also use movement and other forms of physical and visual expression when voicing their characters. The audience never sees these movements except as they come through in the animated characters themselves. So, why does the voice actor include purely visual effects as part of their performance?

Because, as demonstrated so aptly in the op-ed by Mr. Azaria, it brings "life" to the characters.

If a character is chopping wood or running, then the voice of the actor will convey the breathlessness associated with those activities by mimicking the actions – pantomiming chopping wood, running in place – in a way that perhaps an AI-generated voice reading the same script could not. Similarly, if Chief Wiggum is chomping on a cigar and talking (even more) like Edward G. Robinson in a gangster movie, it helps to convey that sound and activity if Mr. Azaria is holding a highlighter in his mouth while voicing Chief Wiggum.

Making computer-generated content more human-like is nothing new. As early as 1950, Alan Turing, a British mathematician, proposed a test of a machine's ability to think that he called the "imitation game," later renamed the Turing Test. In this test, an evaluator judges a transcript of a conversation between a machine and a human and tries to detect which is which. The machine passes the test if the evaluator cannot distinguish it from the human.
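For the curious, the structure of the test fits in a few lines of code. This is only a toy sketch – the `run_trial` function and the coin-flip judge below are illustrative inventions of mine, not anything from Turing's paper:

```python
import random

# A toy sketch of the imitation game's structure. `judge` stands in for
# the human evaluator: it receives two unlabeled transcripts and returns
# the index of the one it believes came from the machine.
def run_trial(human_transcript: str, machine_transcript: str, judge) -> bool:
    """Return True if the machine passes, i.e., the judge fails to spot it."""
    entries = [("human", human_transcript), ("machine", machine_transcript)]
    random.shuffle(entries)                       # present them unlabeled, in random order
    guess = judge([text for _, text in entries])  # the judge sees only the text
    return entries[guess][0] != "machine"         # pass if the guess lands on the human

if __name__ == "__main__":
    # A judge that can only flip a coin is fooled half the time, which is
    # exactly the "cannot distinguish" threshold the test describes.
    coin_flip = lambda transcripts: random.randrange(len(transcripts))
    passes = sum(run_trial("Hi there.", "Hello there.", coin_flip) for _ in range(10_000))
    print(f"machine passed {passes / 10_000:.0%} of trials")  # ~50%
```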

Many current AI programs are simply attempts to get a computer to pass the Turing Test. A very early attempt at this involved getting a machine to repeat back to the human key elements of what the human had said – mimicking conversation – and even to apologize and make "excuses" if it was prompted that it had made a "mistake."
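That description matches keyword matching plus pronoun reflection, the technique behind Joseph Weizenbaum's ELIZA from the mid-1960s. Here is a minimal sketch of the idea in Python – the patterns and canned replies are illustrative, not Weizenbaum's actual script:

```python
import re

# A minimal sketch of keyword matching with pronoun reflection, in the
# spirit of Weizenbaum's ELIZA. The rules below are made up for
# illustration, not taken from the original program.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"you made a mistake", re.I),
     "I apologize. I must have misunderstood."),   # the "excuse" behavior
    (re.compile(r"i (.*)", re.I), "Tell me more about why you {0}."),
]

def reflect(fragment: str) -> str:
    """Swap first- and second-person words so the reply mirrors the speaker."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def respond(line: str) -> str:
    """Return the first matching rule's reply, else a stock prompt."""
    for pattern, reply in RULES:
        match = pattern.search(line)
        if match:
            return reply.format(*(reflect(g) for g in match.groups()))
    return "Please, go on."

if __name__ == "__main__":
    print(respond("I feel like my work is never done"))
    # -> Why do you feel like your work is never done?
    print(respond("You made a mistake"))
    # -> I apologize. I must have misunderstood.
```

Note how shallow the "apology" rule is: the program has no idea whether it actually made a mistake; it just pattern-matches the accusation.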

This still occurs today with AI agents. ChatGPT will "apologize" if the human interacting with it prompts it to do so by indicating that a piece of information provided by the bot is incorrect, even if that information is perfectly valid.

So, while the reasoning ability of chatbots and other AI agents has come a long way in the seventy-five years or so that people have been working on the problem, it is still far from perfect.

That is not to say that there haven't been vast improvements. Today's chatbots are far more advanced and human-like than those of just a few years ago. Not long ago, calling an airline or a bank meant facing a voice-prompted menu system that was a source of extreme frustration. Such systems frequently drove folks to just smash zero to reach a real person as soon as they entered the menu. Today, when dialing into these systems, it can often be difficult to detect, at least in the first minute or so, that you are not talking to a real person. It is a practical demonstration of the Turing Test, if ever there was one.

One advance in the last year or so that has helped these systems seem more human has been in voice quality. I've noticed, when using the Apple Maps navigation app while driving, that the voice directing me has a much more conversational quality than it once did. Vocal tone is, of course, one way that we express emotion.

This starts to get at what Mr. Azaria demonstrates in his op-ed this morning. If a machine can mimic a conversational tone, we may not be far from a point where one can mimic breathlessness when we want it to sound like it's running.

I recently re-read Matt Haig's 2013 novel, The Humans. An alien from a civilization far more mathematically advanced than ours is sent to Earth. No spoilers, but in the first few pages, the alien doesn't know that humans are supposed to wear clothes, and he learns about love by reading an issue of Cosmopolitan in a gas station. As the novel progresses, the alien, who is never named, becomes more and more human-like.

The alien can learn to be more human-like, enjoy poetry and music and peanut butter, and can even aspire to be human. Can machines do the same?

I feel like that is what we've been trying to do with AI agents for quite a long time. We want them to seem, act, and react more as humans do. We are, in fact, getting there. Mr. Azaria notes that this trend is pushing into territory that was once the stuff of science fiction. Some aspects of this are exciting. Mr. Azaria sees potential upsides, such as the ability to re-create some of Mel Blanc's Bugs Bunny performances.

Other aspects demand caution. In Mr. Azaria's words, "AI can make the sound, but it will still need people to make the performance. Will the computer ever understand emotion on its own, what's moving and what's funny?"

Mary Shelley, writing more than 200 years ago, may still have something to teach us about this.