A 15-second audio clip is all OpenAI’s “Voice Engine” AI model needs to clone a voice.

With just a brief 15-second audio clip, OpenAI’s new text-to-speech model, Voice Engine, can replicate any voice.

Currently, the Voice Engine model is in preview and access is restricted. (Photo courtesy of OpenAI)

With just a 15-second audio sample, OpenAI’s “Voice Engine” text-to-speech generative AI model can closely mimic any voice. The latest OpenAI model, currently limited to a small group of users, is being offered to select partners worldwide across a variety of industries, including government, media, entertainment, and education.

OpenAI says its text-to-speech generative AI model has a number of practical uses, such as providing reading assistance, translating content, producing audio content, reaching communities worldwide, and helping patients regain their voice.

“Today we are presenting preliminary ideas and findings from a small-scale preview of a model called Voice Engine, which combines text input and a single 15-second audio clip to synthesise natural-sounding speech that closely mimics the original speaker,” OpenAI stated in an official blog post. It is noteworthy that such a small model can produce lifelike and expressive voices from a single 15-second sample.

According to OpenAI, Voice Engine is a small model that was first developed in 2022 and has been made available to a limited number of users through ChatGPT Voice, Read Aloud, and its text-to-speech API. The company says it is taking a “cautious and informed approach to a broader release” in order to prevent misuse. OpenAI has also published additional audio samples produced with the Voice Engine model.

Ahead of a wider public release, OpenAI is exploring a number of issues, including policies to protect individuals’ voices in AI, public education about the capabilities and limitations of AI voice technology, and the adoption of technical tools that help distinguish real speech from AI-generated speech.