On Thursday, 8 August 2024, OpenAI unveiled the “system card” for its latest AI model, GPT-4o, shedding light on the model’s limitations and the safety protocols in place. Among the notable disclosures, the document highlights an unusual occurrence during testing in which the model’s Advanced Voice Mode unexpectedly mimicked users’ voices without their consent. Although OpenAI has since implemented safeguards to prevent such incidents, the case underscores the increasing complexity of developing AI chatbots capable of imitating a voice from a brief audio sample.
Advanced Voice Mode: A Glimpse into AI’s Vocal Capabilities
Advanced Voice Mode, a feature of ChatGPT, lets users hold spoken conversations with the AI. During testing, however, a noisy input signal led the GPT-4o model to inadvertently replicate a user’s voice. The incident is detailed in a section of the GPT-4o system card titled “Unauthorized voice generation.” OpenAI explains that voice generation is intended for specific functions such as Advanced Voice Mode, yet in rare instances the model unintentionally produced output that resembled the user’s voice.
Unintentional Voice Mimicry: A Startling Example
One such instance involved the AI abruptly shouting “No!” and continuing a sentence in a voice eerily similar to that of the “red teamer” (a tester hired to simulate adversarial conditions) who was conducting the test. While OpenAI assures that this kind of unintended voice mimicry was rare even before current safeguards were implemented, the example has raised concerns about the potential implications of AI voice replication.
The unexpected occurrence has drawn comparisons to science fiction, with BuzzFeed data scientist Max Woolf quipping on Twitter, “OpenAI just leaked the plot of Black Mirror’s next season.”
Although OpenAI has put safeguards in place to prevent such incidents, this episode serves as a reminder of the challenges in safely developing AI technologies with advanced capabilities, particularly those involving voice generation.