“Data is the new oil,” the saying goes, but here we have a company with very little proprietary data (OpenAI) creating a model that powers a product that beats Stack Exchange, a company with a large amount of proprietary data. More than that, the Stack Exchange data seems a perfect fit for the RLHF layer over these models: the site has been collecting human feedback on answers for well over a decade.
A few thoughts.
They gave the data away for free. Stack Exchange (the overarching brand covering Stack Overflow, MathOverflow, etc.) makes up 5.13% (64GB) of The Pile, a dataset used to train many large language models. Stack Exchange has been publishing this data since 2014 (archive).
Stack Exchange was already in decline. The company has struggled to monetize its engaged user base for years, resulting in a sale to Prosus in 2021. Increasingly, question answering and knowledge sharing happen in GitHub repositories, around issues and pull requests.