Artificial intelligence is in an interesting place in 2021. Developments in the field will likely represent the single largest technological advance in the coming decades, yet today we’ve seen only a minuscule portion of that revolution unfold. Whether or not you consider machine learning to strictly be AI, the acceleration in this field has game-changing ramifications.
We're already in the middle of this revolution. Videogames are uniquely positioned to take advantage of AI at almost every stage of their production. AI is already behind world-building tools for artists, voice filtering in multiplayer, more intelligent bots, image upscaling, and generating photorealistic graphics.
Of all the recent applications, one of the most interesting is the employment of publicly available AI tools in modding communities. These help modders tackle core constraints: image quality through AI image enhancement, and voice acting through AI speech synthesis. Jonx0r, creator of the Wyrmstooth mod for Skyrim, thinks these abilities will not only help modders achieve technical parity with triple-A games, but also unleash a new wave of creativity.
Voice acting has always been challenging for modders, who often work solo or in very small teams, and yet demand has soared as RPGs trend toward fully voiced casts. “In most cases modders are limited to repurposing existing voice lines for player interactions with new NPCs,” Jonx0r explains. “Therefore, the responsibility for driving the narrative forward in a quest mod rests mostly on the shoulders of the NPCs.”
While community voice actors for NPCs remain an option, having one consistent, distinctive voice for a protagonist like Geralt draws a sharp line between official and community content. Nikich340, creator of The Witcher 3 mod A Night to Remember, opted to use CyberVoice’s impressively replicated Geralt voice for this very reason. “It was apparent that I couldn’t just slice some lines together to get the proper phrases required for the story,” he says.
Without having to resort to chopping up existing lines to portray a main character, creators are free to script as normal. “There are a lot of talented writers in the modding community,” Jonx0r says. “I think [voice synthesis] could lead to quest or companion mods with richer player character development, narratives, and choices.”
Best of all, such tools seem numerous. Jonx0r compared the output of several different speech synthesis repositories on GitHub such as Fast Tacotron and Real Time Voice Cloning before settling on the Nvidia Tacotron 2 repository as the closest emulation of Skyrim’s voice acting. “From there I’d do what I could to upsample the audio in Audacity so it matched the quality of the real voice acting.”
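That Audacity upsampling step is worth unpacking: synthesis models like Tacotron 2 typically emit audio at around 22,050 Hz, while game voice assets often ship at 44,100 Hz. As an illustration only (not Jonx0r’s actual workflow, and far cruder than the band-limited resamplers Audacity uses), here is a minimal linear-interpolation upsampler:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal (list of floats) by linear interpolation.

    Illustrative sketch only: production tools like Audacity use
    higher-quality band-limited filters to avoid aliasing artefacts.
    """
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * ratio          # fractional position in the source signal
        j = int(pos)
        frac = pos - j
        if j + 1 < len(samples):
            # Blend the two neighbouring source samples.
            out.append(samples[j] * (1 - frac) + samples[j + 1] * frac)
        else:
            out.append(samples[min(j, len(samples) - 1)])
    return out

# Doubling the rate roughly doubles the sample count.
doubled = resample_linear([0.0, 1.0, 0.0, -1.0], 22050, 44100)
```

Upsampling can’t invent frequency content the model never produced, which is why Jonx0r describes doing “what I could” to match the real recordings rather than recovering them exactly.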
Nikich340, on the other hand, contacted CyberMind for specific voice lines from their existing Geralt voice. Geralt was the perfect specimen, Nikich340 explains: “an emotionless voice suits Geralt’s character.”
So where did these tools come from? “Videogames have played a role since the start,” Dan Ruta, PhD student and developer of the open source xVASynth machine learning-based speech synthesis app, tells me. Spurred in 2018 by Bethesda’s decision to use a voiced protagonist in Fallout 4, Ruta knew a solution was needed for modding communities. The tool’s description page now boasts quests, voiced companions, dialogue expansions and alterations, and translations.
Similarly, Mind Simulation Lab’s CyberVoice saw its own videogame-inspired genesis as a component of the much more ambitious CyberMind project. “Our project aims to make NPCs feel alive,” Mind Simulation’s CEO Leonid Derikyants explains. “We want to give them intelligence and create digital personalities, so that it’s interesting to communicate with them even if you are outside of a quest.”
In the pursuit of that debatably unreachable goal of general artificial intelligence, Derikyants’ lab found an immediately useful and more grounded tool in CyberVoice. “There is a lot of content that cannot be physically voiced,” Derikyants says, pointing out that only 4.5% of books are available to listen to. Likewise, CyberVoice also has real-time applications, such as reading out articles from websites or donation messages for streamers. “All of this is available in the short term,” Derikyants says.
xVASynth and CyberVoice both demonstrate the diversity of currently available AI tools. The former is an open source, downloadable application, while the latter is centralised and runs in the cloud as a paid product from a self-funded lab. Both believe there are benefits to their approach.
With xVASynth, users can adjust the pitch and duration of every letter in a sentence. “This allows for very fine control on exactly how a line is spoken,” Ruta explains, “especially in terms of emphasis and emotion.” While Ruta accepts that tools like CyberVoice and 15.ai running in the cloud offer the benefit of being accessible on any device, he prefers the community ownership of an open-source tool. “The community has full, unrestricted access at all times,” Ruta says. “Mods can be created for the actual tool via the recent plugins system. Therefore the tool can continue existing and evolving even in my absence.”
Derikyants, on the other hand, takes pride in both the quality of audio produced by CyberVoice and their potential solution to the ethical and legal issues produced by other AI tools. “We believe that copyright holders, especially actors, deserve to receive royalties,” Derikyants says. “And we believe that CyberVoice will be able to form the right platform and legal precedent so that in the future we will not drown in deepfakes.”
What of the current limitations in this space? Despite impressive results, Nikich340 remains skeptical about machine learning’s readiness to take over. “Voiced mods are always better for getting attention, but artificial voices can’t compete with real ones, and they won’t be able to for many years,” Nikich340 argues.
“I expect the believability of synthesised speech to improve over the next few years as new developments are made,” Jonx0r says. “There are already repositories available that allow you to easily change the prosody of synthesised speech.” The existence and quality of a Neural Parametric Singing Synthesizer also has the creator excited: “I might try synthesising a singing voice for a new bard song or two for Wyrmstooth with the help of our talented composer León van der Stadt.”
Machine learning is also not a simple or fast process. “It can be slow,” Jonx0r says. “Usually it takes between a few days and a week to train a new Tacotron and WaveGlow model off of a specific voice. You also need at least several hours’ worth of audio in your dataset to train a coherent model with minimal distortions. I expect to see more utilities being released that make these new AI-based technologies more accessible. xVASynth is a great example of this. It supports a growing list of voice types from different games such as Skyrim, Fallout 4, and The Witcher 3, and supports pitch editing at a very granular level.”
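Jonx0r’s rule of thumb — at least several hours of clean audio before a Tacotron and WaveGlow model trains with minimal distortion — is the sort of thing a dataset-prep script can sanity-check before committing days of training time. A minimal sketch (the helpers are hypothetical, and the three-hour threshold is an assumed stand-in for “several hours”):

```python
def dataset_hours(clip_seconds):
    """Total hours of audio across a list of clip durations in seconds."""
    return sum(clip_seconds) / 3600.0

def ready_to_train(clip_seconds, min_hours=3.0):
    """True if the dataset meets a minimum-hours threshold.

    The 3.0-hour default is an illustrative stand-in for Jonx0r's
    "at least several hours" guideline, not a hard requirement.
    """
    return dataset_hours(clip_seconds) >= min_hours

# e.g. 3,000 voice lines averaging ~4.2 seconds each
clips = [4.2] * 3000
hours = dataset_hours(clips)  # 3.5 hours of audio
```

A check like this is cheap insurance given that, as Jonx0r notes, a single training run can take anywhere from a few days to a week.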
“The to-do list for xVASynth is still quite long,” Ruta says. Recent additions like a batch synthesis mode to generate a large number of lines at once – useful for characters with hundreds of lines – and a speech-to-speech mode that generates pitch and duration variations from your own voice, make up just a small portion of that list. A companion tool for modding communities to train new voices is also on the cards. “The app currently supports almost 300 voices across over 25 games, but once the community training tool is out, I’m sure these numbers will skyrocket,” Ruta enthuses.
CyberVoice also promises more of everything, including more voices, more integrations with other services, and more supported languages. Embedding a digital signature in audio files to protect voice rights, for example, and accent-free speech synthesis across languages are priorities for Derikyants.
AI’s limited and targeted use in the current modding scene is helping creators achieve great results. The Witcher 3’s Geralt, with his monotone delivery, allowed CyberVoice to replicate most of the performance, but few other characters could be rendered so seamlessly with current AI voice models. “More tools will definitely appear, and machine learning will definitely become a larger part of game development, which includes the modding community,” Ruta says.
“The really interesting stuff will be real-time use of machine learning models within games. You can already see a real-time gaming application for machine learning in Nvidia’s DLSS and ray tracing de-noising models,” Ruta explains. This potential AI-infused future is CyberMind’s bread and butter. Lipsync, visuals, character animations and movements, and locations might all be generated by machine learning approaches. “That’s going to be a tectonic shift in the industry,” Derikyants posits.
“If a game is carefully designed around it, we may even have some real-time, image-to-image translation models making the game frames look like real life,” Ruta says.
“But later down the line, we may even have things like NPC conversations without pre-written player responses, or situations where NPCs can react to the world around them without being explicitly programmed to do so,” Ruta says.
The best intersection of mods and AI is yet to come according to Ruta. “If you’re playing as Thomas the Tank Engine in Skyrim, you’d expect NPCs to have something to say about that.”