The Bitter Lesson | Du Mingzhe

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

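Looking back over 70 years of AI research, the biggest lesson we can draw is that general methods that leverage computation eventually prove the most effective. The underlying reason is Moore’s law, or more precisely its generalization: the cost per unit of computation keeps falling exponentially. Most AI research has been carried out under the assumption that available computation is constant (in which case drawing on human knowledge would be one of the only ways to improve performance), yet over a span only slightly longer than a typical research project, far more computation inevitably becomes available. When researchers look for improvements that pay off in the short term, they turn to their human knowledge of the domain, but in the long run the only thing that matters is making use of computation. The two need not be opposed, but in practice they tend to be. People become psychologically committed to one approach or the other, and effort spent in one direction leaves little time for the other. Methods built on human knowledge also tend to complicate solutions, making them a poorer fit for general methods that leverage computation. There are many examples of AI researchers learning this bitter lesson belatedly; the most prominent and instructive are reviewed below.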
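To put the exponential decline in cost into numbers, here is a minimal sketch. The two-year halving period is an illustrative assumption, not a figure from the essay, and `compute_multiplier` is a hypothetical helper name.

```python
def compute_multiplier(years, halving_period_years=2.0):
    """Factor by which a fixed budget buys more computation after `years`.

    Assumes (for illustration only) that the cost per unit of computation
    halves every `halving_period_years`.
    """
    return 2.0 ** (years / halving_period_years)


for years in (5, 10, 20):
    factor = compute_multiplier(years)
    print(f"after {years:>2} years: about {factor:,.0f}x the computation for the same cost")
```

Under this assumed rate, a fixed budget buys roughly a thousand times more computation after two decades, which is the kind of growth that a method limited by hand-coded knowledge cannot absorb.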

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that “brute force” search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

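In computer chess, the method that defeated world champion Kasparov in 1997 was based on massive, deep search, which dismayed most computer-chess researchers, who had pursued methods built on human understanding of the special structure of chess. When a simple search-based approach, combined with special-purpose hardware and software, proved far more effective than the more elaborate alternatives, these human-knowledge-based chess researchers were sore losers. They said that “brute force” search may have won this time, but that it was not a general strategy and not how people play chess. They had hoped that methods based on human knowledge would win, and were disappointed when they did not.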
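The idea that deeper search turns extra computation into better play can be sketched on a toy game. This is not a reconstruction of the 1997 chess program. The game (players alternately take 1 to 3 stones, and whoever takes the last stone wins) and the function name `negamax` are assumptions chosen for illustration.

```python
def negamax(stones, depth):
    """Depth-limited game-tree search for a toy subtraction game.

    Returns (value, nodes_visited), where value is from the point of view
    of the player to move: +1 win, -1 loss, 0 unknown at this depth.
    """
    if stones == 0:
        return -1, 1  # the opponent took the last stone: player to move has lost
    if depth == 0:
        return 0, 1   # search budget exhausted: outcome unknown
    best, nodes = -1, 1
    for take in (1, 2, 3):
        if take <= stones:
            child_value, child_nodes = negamax(stones - take, depth - 1)
            best = max(best, -child_value)  # the opponent's gain is our loss
            nodes += child_nodes
    return best, nodes


for depth in (2, 4, 6, 10):
    value, nodes = negamax(10, depth)
    print(f"depth {depth:>2}: value {value:+d}, nodes visited {nodes}")
```

With a shallow budget the search cannot tell who is winning; with enough depth it resolves the position exactly, at the price of visiting many more nodes. No knowledge of the game beyond its rules is built in.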

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.

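Computer Go saw a similar course of research, only delayed by a further 20 years. Enormous early effort went into avoiding search by exploiting human knowledge or the special features of the game, but once search was applied effectively at scale, all of that effort proved irrelevant, or worse. Equally important was learning a value function through self-play (as in many other games and even chess, although learning played no major role in the 1997 program that first beat a world champion). Learning by self-play, like search, allows massive computation to be brought to bear. Search and learning are the two most important classes of techniques for applying massive computation in AI research. In computer Go, as in computer chess, researchers first tried to exploit human understanding (so that less search was needed), and only much later achieved far greater success by embracing search and learning.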
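Learning a value function by self-play can also be sketched on a toy game. This is a minimal tabular illustration, not the method used by any Go program. The game (take 1 to 3 stones, last stone wins), the one-step lookahead update, and all names and hyperparameters are assumptions.

```python
import random


def train_by_self_play(max_stones=12, episodes=5000, alpha=0.5, epsilon=0.2, seed=0):
    """Learn value[n], the outcome for the player to move with n stones left."""
    rng = random.Random(seed)
    value = [0.0] * (max_stones + 1)
    value[0] = -1.0  # no stones left: the player to move has already lost
    for _ in range(episodes):
        stones = rng.randint(1, max_stones)  # random starting position
        while stones > 0:
            moves = [k for k in (1, 2, 3) if k <= stones]
            # Bootstrapped target: the best value reachable in one move,
            # negated because the opponent moves next.
            target = max(-value[stones - k] for k in moves)
            value[stones] += alpha * (target - value[stones])
            if rng.random() < epsilon:
                take = rng.choice(moves)  # occasional exploratory move
            else:
                take = max(moves, key=lambda k: -value[stones - k])  # greedy move
            stones -= take
    return value


learned = train_by_self_play()
for n in range(1, 13):
    print(f"{n:>2} stones: value {learned[n]:+.2f}")
```

The table is given only the rules and the final outcome, yet the learned values settle into the pattern that positions with a multiple of four stones are losses for the player to move. More episodes simply mean more computation spent on learning.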

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge: knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the latest step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked (they tried to put that knowledge in their systems), but it proved ultimately counterproductive, and a colossal waste of researchers’ time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.

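In speech recognition, there was a DARPA-sponsored competition as early as the 1970s. On one side, the entrants included many special methods that exploited human knowledge of words, phonemes, the human vocal tract, and so on; on the other were newer methods, based on hidden Markov models (HMMs), that relied more on statistics and required more computation. Once again the statistical methods beat the human-knowledge-based ones, which led to a major shift across natural language processing over the following decades, with statistics and computation coming to dominate the field. The recent rise of deep learning in speech recognition is the latest step in this same direction. Deep learning methods depend even less on human knowledge and use even more computation, together with learning on huge training sets, to produce far better speech recognition systems. As in the games, researchers kept trying to build systems that worked the way they believed their own minds worked, and to put that knowledge into their systems. This ultimately backfired: once Moore’s law made massive computation available and ways were found to put it to good use, the effort turned out to be a colossal waste of research time.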
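For readers unfamiliar with hidden Markov models, the sketch below shows the forward algorithm, the standard way to compute how likely an observation sequence is under an HMM. It does not reconstruct any system from the DARPA competition. The two-state model and its probabilities are made up for illustration.

```python
from itertools import product


def forward_likelihood(observations, start, trans, emit):
    """P(observations) under an HMM, computed with the forward algorithm.

    start[s]    : probability of starting in hidden state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        ]
    return sum(alpha)


# Toy parameters (made up for illustration): 2 hidden states, 2 symbols.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

print(forward_likelihood([0, 0, 1], START, TRANS, EMIT))

# Sanity check: the likelihoods of all length-3 sequences sum to 1 (up to rounding).
total = sum(
    forward_likelihood(seq, START, TRANS, EMIT) for seq in product((0, 1), repeat=3)
)
print(total)
```

Nothing in the algorithm encodes phonemes or vocal tracts. In a real recognizer the probabilities would be estimated from data, which is where the additional computation goes.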

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

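Computer vision has followed a similar pattern. Early methods framed vision as a search for edges, generalized cylinders, or SIFT features. Today all of that has been discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariance, and they perform much better.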
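The notions of convolution and invariance mentioned above can be illustrated in one dimension. This is a minimal sketch in plain Python. The kernel is hand-picked for the example (an assumption), whereas a real network learns its kernels from data.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation, the operation deep-learning libraries call convolution."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


EDGE_KERNEL = [-1, 0, 1]             # hypothetical hand-picked kernel
signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]          # the same pattern moved one step to the right

out = conv1d(signal, EDGE_KERNEL)
out_shifted = conv1d(shifted, EDGE_KERNEL)
print(out)
print(out_shifted)

# Equivariance: shifting the input shifts the feature map by the same amount.
print(out_shifted[1:] == out[:-1])
# Invariance: a global max over the feature map ignores the shift.
print(max(out) == max(out_shifted))
```

The only structure built in is that the same kernel is applied at every position. What the kernel should detect is left to learning.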

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

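This is a big lesson. As a field, we have still not fully learned it, and we keep making the same kind of mistakes. To see this and resist it effectively, we have to understand what makes these mistakes appealing. We have to learn the bitter lesson that, in the long run, building systems that think the way we believe we think does not work. The bitter lesson rests on historical observations: 1) AI researchers have often tried to build knowledge into their agents; 2) this always helps in the short term and is personally satisfying to the researcher; 3) in the long run it leads to a plateau and even inhibits further progress; and 4) breakthrough progress eventually arrives through an opposing approach that scales computation through search and learning. The eventual success is tinged with bitterness, and often not fully digested, because it is success over a favored, human-centric approach.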

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

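The first thing to learn from the bitter lesson is the great power of general-purpose methods: methods that keep scaling as the available computation grows very large. The two methods that seem to scale arbitrarily in this way are search and learning.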

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

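The second general point to learn from the bitter lesson is that the actual contents of minds are irredeemably complex; we should stop looking for simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All of these belong to the arbitrary, intrinsically complex outside world. They should not be built in by hand, because their complexity is endless; instead, we should build in only the meta-methods that can find and capture this arbitrary complexity. The key is that these meta-methods find good approximations, with the search carried out by our methods rather than by us. We want AI that can discover as we do, not AI that merely contains what we have discovered. Building in our existing discoveries only makes it harder to see how the process of discovery can be done.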

Takeaways

Embrace general-purpose methods such as search and learning that effectively scale with increased computational resources, transcending limits inherent in specifically designed approaches.
Focus on developing meta-methods capable of understanding and capturing complexity instead of hard-coding our existing knowledge about the world, which is intrinsically complex and ever-evolving.