OpenAI Sora text-to-video generator debuts – results can be amazing, but bugs admittedly remain
Generative tool currently restricted to a select set of users.
Artificial intelligence pioneer OpenAI announced a new generative tool on Thursday. Dubbed Sora (the Japanese word for sky), it is OpenAI’s most ambitious development to date, able to generate complex, high-definition videos up to a minute long from a mere text prompt; image prompts can also be used. Sora isn’t open to the general public yet, as OpenAI has decided to restrict access to a select group of researchers and visual professionals while it refines its offering. Importantly, this pre-release period will also be used to implement safety measures so the tool isn’t used to generate misinformation, hateful content, and the like.
Embedded tweet (February 16, 2024): “OpenAI’s new text-to-video tool, Sora. The text prompt here, which (alone) created the video, was: ‘A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots…’ pic.twitter.com/fK3ca9VcxI”
Let’s dive straight into some examples. If a picture paints a thousand words, a video can do the same at tens of frames per second. The first example is a full-minute clip from a relatively complex prompt. Here, Sora flexes its muscles, rendering the neon-lit, recently rain-dampened streets of Tokyo and the movements of the elegant central character.
In its blog post about Sora, OpenAI explains that this prompt-to-video tool has been designed to generate complex scenes featuring multiple characters and accurate, true-to-life details. “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world,” it asserts. By way of illustration, OpenAI leads with the video below. However, this depiction of a Jeep speeding along a dry, dusty mountainside road through a forest looks distinctly like video game footage.
Embedded tweet (February 15, 2024): “OpenAI just left a bunch of TxT2Video companies in their dust with Sora. Prompt: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the… pic.twitter.com/Tl5lSKZlS4”
On the topic of video game techniques, seasoned tech industry analyst Patrick Moorhead reckons that most of the generated videos have both characters and camera moving simultaneously to “trick the brain into not noticing the details that would point out the uncanny valley.” Some believe Sora was at least partially trained on synthetic video sourced from Unreal Engine.
To its credit, OpenAI isn’t shy about admitting its model still has weaknesses. The blog explains that Sora-generated videos “may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect.” Again, OpenAI provides video examples to illustrate. Of the five it highlights, probably the most jarring to our eyes are the gray wolf pups clip and the video embedded below, which was generated from the prompt “Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care.”
Embedded tweet (February 15, 2024): “This Sora breaks my brain. What even is reality anymore tbh. Prompt: Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care. pic.twitter.com/CuvvF2ro7I”
Beneath this video in its blog, OpenAI explains Sora’s gaffe as a result of the model not understanding that a chair is a rigid object. That shouldn't take long to fix...
Safety – why we can’t have nice things
We mentioned safety briefly in the intro, and it is clear that a generative AI tool like Sora could be used for all kinds of mischief by the general public. That is why OpenAI is understandably keen to implement safety measures in Sora before it goes prime time, to stem the tide of unsavory content some folks will want to generate.
Specifically, the OpenAI blog says it will work with its first testers to prevent the generation of “misinformation, hateful content, and bias.” Additionally, it is taking steps to both prevent and detect such content in generated videos. Other verboten prompt topics include “extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others.”
Fake audio and video recordings have hit the headlines before, with all sorts of repercussions, so keeping a lid on the scope of Sora’s output should be a priority for a responsible developer.
Sora isn’t the first text-to-video generator we have seen, but it is the most advanced, complex, and realistic so far. Many have commented that Sora’s impact will be significant and felt far beyond the computer and tech news sphere.
Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.