Thoughts on Generative AI in Light of Recent Text-to-video Advances 

by Katsumi Ibaraki (MS student, Michigan AI)

Imagine a tool that can write breathtaking poetry, compose a musical melody, or even generate a deepfake of you delivering a speech in perfect Minionese or Klingon. That tool isn’t a distant promise of the future; it’s already here in the form of generative AI—and it could either elevate creative expression to new heights or spell the beginning of the end for authentic human artistry.

OpenAI, known for ChatGPT and DALL·E, recently released a new model called Sora, which generates videos from text prompts. Below is an example video generated by Sora.

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

This might look like a short clip from a music video, but nothing in the video is real: the woman, the crowd, the street, and the buildings are all generated by Sora and don’t actually exist. In the world of video, Sora’s prowess can be particularly game-changing. For instance, filmmakers can use Sora to prototype scenes before they go into production, or marketing teams can produce customized video content for different audiences. 

However, this power brings with it a suite of ethical questions. The potential to create misleading or harmful content through deepfakes is just one of the many challenges society will have to confront as tools like Sora become more widespread. Another concern is whether these tools will replace humans’ creative jobs. As with all great tools of innovation, the pendulum swings both ways, and it is up to us to wield these capabilities with responsibility and foresight, ensuring that the future of generative AI, like Sora, remains bright and beneficial for all.

The generated videos look eerily real, but a closer look reveals hints that they may be fake. Going back to the example above: after the first few seconds, everyone in the background crowd walks in the same direction as the woman. Another obvious irregularity (if you know anything about Japanese) is that the signage doesn’t make sense. Some of the signs contain actual Japanese characters but read as gibberish, and the rest are just shapes that vaguely resemble Japanese.

And this is an example that was probably cherry-picked to show off Sora’s capabilities. So even in the good examples, a careful enough look turns up anomalies.

OpenAI acknowledges Sora isn’t perfect and even provides bad examples where the model fails to understand cause and effect or physics.

Here are two bad examples:

Prompt: Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care.

Prompt: Five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing.

These contain even more obvious signs that the videos are fake.

One may wonder: if the videos contain so many telltale clues, why hasn’t Sora been released to the public? Well, the clues we found earlier will probably become harder to spot after a couple more iterations (perhaps within a year or so), and then it will be almost impossible for a human to tell whether a video is real or AI-generated just by looking at it.

OpenAI is concerned about how its products could be misused for misinformation, hateful content, and bias, and it blocks prompts involving extreme violence, sexual content, and hateful imagery. Without such restrictions, many fear the technology could be misused to produce deepfakes, for instance for political propaganda. Even before Sora, there were reports of celebrity deepfakes showing support for a candidate and of faked robocalls imitating a candidate. With Sora or similar video-generation models, these deepfakes could become even harder to detect.

OpenAI notes that it is building tools to detect videos generated by Sora and plans to include C2PA metadata. But that’s likely not enough. Even with today’s technology, fake or edited images and videos (increasingly AI-generated) already circulate on our social media feeds. As a user, when do you stop and think, “Could this be an AI-generated image or video?” Personally, I don’t go through that process for every post while browsing. If you’re using social media for a break or for entertainment, that isn’t really an issue, but if you’re using it to catch up on the news and learn about something, that’s where political propaganda can creep in. Users who are unaware of the kinds of deepfakes that exist are vulnerable and easily manipulated.
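To make the provenance idea concrete: C2PA attaches a cryptographically signed manifest to a file so that a verifier can check who produced the content and whether it has been altered since. Below is a toy sketch of that idea in Python — not the actual C2PA format, which uses public-key signatures over a structured manifest embedded in the file; this sketch substitutes a shared-secret HMAC, and the key and function names are hypothetical.

```python
import hashlib
import hmac

# Toy stand-in for a provenance manifest: the generator "signs" the
# manifest plus the content bytes; a verifier holding the same key can
# confirm the claimed origin and detect tampering. Real C2PA uses
# public-key signatures, not a shared-secret HMAC.
SIGNING_KEY = b"generator-secret-key"  # hypothetical key

def attach_provenance(content: bytes, tool_name: str) -> dict:
    """Return a record bundling content with a toy provenance signature."""
    manifest = f"generated-by:{tool_name}".encode()
    tag = hmac.new(SIGNING_KEY, manifest + content, hashlib.sha256).hexdigest()
    return {"content": content, "manifest": manifest, "signature": tag}

def verify_provenance(record: dict) -> bool:
    """Check that manifest and content still match the signature."""
    expected = hmac.new(
        SIGNING_KEY, record["manifest"] + record["content"], hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

record = attach_provenance(b"\x00fake video bytes", "Sora")
print(verify_provenance(record))   # True: content is untampered
record["content"] = b"\x00edited bytes"
print(verify_provenance(record))   # False: edit breaks the signature
```

The catch, of course, is that provenance metadata only helps when platforms check it and users see the result; a video stripped of its manifest or re-encoded by a third party carries no such signal.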

So what can we do about this? As of now, the technology is still very new and there’s a lot to learn about it. To make it safe to use, we need more regulations restricting how these models are created, trained, and deployed. Specifically, regulations could enforce risk assessment and mitigation, higher-quality training datasets to minimize risk and bias, and appropriate human oversight. For instance, the EU just adopted the Artificial Intelligence Act, and other countries are expected to follow. While we wait for society and legislation to catch up, individual users can educate themselves about the technology here and now, develop a better understanding of its potential risks and dangers, and use it responsibly (a kind of AI literacy, similar to digital/internet literacy).

Is generative AI going to take away jobs that require creative skills, like those of designers, authors, and animators? Seeing the images and videos these models produce, I believe it’s a reasonable fear to have. The amount of previous work (data) that an AI model can consume and train on is far larger than what a human could possibly study. However, one important distinction between current generative AI models and humans is how new content is created. AI models generate content based on what they’ve learned from existing data; while humans also look at existing works for inspiration, human creativity draws on the creator’s personal experiences and emotions, as well as a thought process unique to each individual.

Instead of replacing humans with AI, a better approach may be to use generative AI to augment human creativity. With AI tools, people without domain expertise can pitch ideas and designs, and the tools may also help combine or evaluate incomplete ideas. Another potential benefit is combating expertise bias: while domain experts know what is feasible to build, their ideas may over-rely on standard designs and not always be novel. AI-generated designs can help designers think outside the box about what is achievable or appealing in a product. Designing this way can lead to ideas that people might not come up with if they focused only on traditional approaches. (If you’re interested in learning more about augmenting human creativity, check out this article.)

With generative AI on the rise, we may soon have content that is indistinguishable from human-created images and videos. There’s no need to be skeptical or suspicious of everything you see on the internet, but when factuality matters (like with news), we should be careful not to blindly accept the content we view. There’s also no need to fear and stay away from the technology. Generative AI can empower those who otherwise wouldn’t be able to create this content, and can assist designers and illustrators in their current work.

Even if you don’t plan to use generative AI models like Sora, knowing about it (and staying up-to-date) will help you navigate through the tides of innovations yet to come.

About the author:

Katsumi Ibaraki is a Master’s student in CSE, conducting research in the Language and Information Technologies group led by Dr. Rada Mihalcea. His research interests lie at the intersection of machine learning and natural language processing; in particular, Katsumi is interested in multilingual NLP, fairness/ethics, explainability/interpretability, and animal communication.