I came across a very interesting article that confirms a long-held conjecture: the performance of an AI model is determined mainly by its training data, not by architecture or hyperparameter choices. The original article follows:
--------------------
The “it” in AI models is the dataset.
Posted on June 10, 2023 by jbetker
I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarities between all the training runs.
It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but the interstitial frequencies between distributions that don’t matter, like what photos humans are likely to take or words humans commonly write down.
What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.
This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivering compute to approximating that dataset.
Then, when you refer to “Lambda”, “ChatGPT”, “Bard”, or “Claude”, it’s not the model weights that you are referring to. It’s the dataset.
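
To make the convergence claim concrete, here is a minimal sketch (mine, not the author's) that trains two deliberately different PyTorch architectures on the same toy dataset; with enough training, their predictions nearly coincide, since both are approximating the same data. The dataset, architectures, and hyperparameters here are all illustrative assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dataset: the only thing both models ever see.
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(2 * x) + 0.05 * torch.randn_like(x)

def train(model, steps=5000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model

# Two deliberately different architectures trained on the same data.
wide = train(nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1)))
deep = train(nn.Sequential(
    nn.Linear(1, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1)))

# Both models approximate the same underlying function (the dataset),
# so the gap between them ends up small relative to the noise level.
with torch.no_grad():
    gap = (wide(x) - deep(x)).abs().mean().item()
print(f"mean prediction gap between architectures: {gap:.4f}")

This is only a toy regression analogue of the article's point about generative models: the architectures differ, but what they converge to is fixed by the data.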

All comments

04-25 20:08

So domestic large models can also do very well, and may even come from behind and overtake the leaders.

04-25 18:46

Realized this only in hindsight.