One article to help you understand Sora, the model that has recently gone viral


We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction. Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.

Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals. We’re sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon.

Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.

The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.

The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.

Safety

We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.

We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.

In addition to us developing new techniques to prepare for deployment, we’re leveraging the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well. For example, once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.

We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
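The passage above describes a two-stage moderation flow: a text classifier screens prompts before generation, and image classifiers review every frame of the generated video before it is shown to the user. OpenAI has not published this code; the Python sketch below is a purely hypothetical illustration of that flow, and the names classify_prompt, classify_frame, and generate_video are placeholders rather than real APIs.

```python
# Hypothetical sketch of the two-stage moderation flow described above.
# classify_prompt, classify_frame, and generate_video are placeholders,
# not real OpenAI APIs.
from dataclasses import dataclass

BLOCKED_CATEGORIES = {
    "extreme violence", "sexual content", "hateful imagery",
    "celebrity likeness", "third-party ip",
}

@dataclass
class ModerationResult:
    allowed: bool
    reason: str = ""

def classify_prompt(prompt: str) -> set:
    """Stand-in text classifier: returns the policy categories a prompt triggers."""
    return {c for c in BLOCKED_CATEGORIES if c in prompt.lower()}

def classify_frame(frame) -> set:
    """Stand-in image classifier: returns policy categories found in one frame."""
    return set()  # stub for illustration

def moderate_request(prompt: str, generate_video) -> ModerationResult:
    # Stage 1: reject policy-violating prompts before any video is generated.
    violations = classify_prompt(prompt)
    if violations:
        return ModerationResult(False, f"prompt rejected: {sorted(violations)}")
    # Stage 2: review every frame of the generated video before showing it.
    for i, frame in enumerate(generate_video(prompt)):
        violations = classify_frame(frame)
        if violations:
            return ModerationResult(False, f"frame {i} rejected: {sorted(violations)}")
    return ModerationResult(True)

# Toy usage with a fake generator that yields three empty "frames".
print(moderate_request("a calm beach at sunset", lambda p: [None] * 3))
```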

Research techniques

Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps.

Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.

Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance. We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.

Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report.

Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.
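As a rough illustration of the two ideas above, representing video as spacetime patches analogous to tokens and generating by repeatedly removing noise, here is a minimal NumPy sketch. The patch size, step count, and the toy denoiser are invented for illustration only; the real system is a trained diffusion transformer operating on learned patch representations, not the trivial shrink-the-noise function used here.

```python
import numpy as np

def to_patches(video, patch=(2, 16, 16)):
    """Cut a video of shape (T, H, W, C) into spacetime patches, each flattened
    to a vector, analogous to tokens in a GPT-style model."""
    T, H, W, C = video.shape
    pt, ph, pw = patch
    patches = (video
               .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
               .transpose(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, pt * ph * pw * C))
    return patches  # shape: (num_patches, patch_dim)

def generate(denoise_step, shape=(8, 64, 64, 3), steps=50, seed=0):
    """Start from pure noise and repeatedly apply a denoising step.
    denoise_step is a stand-in for the trained diffusion transformer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)   # the "static noise" starting video
    for t in reversed(range(steps)):
        x = denoise_step(x, t)       # gradually remove noise over many steps
    return x

# Toy usage: a fake denoiser that simply shrinks the noise each step.
video = generate(lambda x, t: 0.95 * x)
tokens = to_patches(video)
print(tokens.shape)  # (64, 1536): 64 spacetime patches, each a 1536-dim vector
```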
