The Commoditization of Large Language Models: Part 2
Since I wrote Commoditization of Large Language Models almost two months ago, things have progressed faster than I expected. There are now two serious competitors to DALL-E: Midjourney and Stable Diffusion. DALL-E pricing hasn't changed, but GPT-3 is now cheaper.
Two observations:
The scale moat is eroding – new models are trained with fewer parameters, and the power-law relationship between parameter count and quality doesn't hold as reliably.
The cost moat is nearly gone – new models are trained more cheaply, on non-proprietary data, on commodity hardware. Access to these models is trending open rather than closed.
On scale, DALL-E (v1) had 12 billion parameters. DALL-E 2 has only 3.5 billion parameters (combined with another 1.5 billion parameter model for enhancing the resolution).
Stable Diffusion has only 890 million parameters.
On cost, training the Stable Diffusion model from scratch costs less than $600k (tweet 1, tweet 2).
Cost will only be a differentiator for a short time. The Stable Diffusion code is open-sourced, and you'll soon be able to run it on your laptop. Comparing the models, there are still noticeable differences in how well they respond to prompts (so far, DALL-E is still the best). But since none of these models has access to a proprietary dataset, they will probably converge in quality.
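The "run it on your laptop" claim checks out with some back-of-envelope arithmetic. A rough sketch (weights only – real memory use also includes the text encoder, VAE, and activations):

```python
# Back-of-envelope: do Stable Diffusion's 890M parameters fit in
# laptop memory? Weights-only estimate; actual usage is higher.

def model_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes."""
    return num_params * bytes_per_param / 1e9

params = 890_000_000  # Stable Diffusion's reported parameter count

print(f"fp32 weights: ~{model_size_gb(params, 4):.1f} GB")  # ~3.6 GB
print(f"fp16 weights: ~{model_size_gb(params, 2):.1f} GB")  # ~1.8 GB
```

At half precision the weights fit comfortably in the RAM of a typical consumer machine – compare that with GPT-3's 175 billion parameters, which need hundreds of gigabytes.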
While the business model is still TBD, different ways are emerging to differentiate and go to market.
Stable Diffusion and Midjourney used Discord as their go-to-market channel. The interface (text in -> rich media out) is a great fit for Discord, and so are the network effects: users learn from each other's prompts, and new waitlisted users watch and feel FOMO.
It's becoming more apparent that these models can be integrated into the photo-editing toolchain. I wouldn't be surprised to see them appear in Figma, Photoshop, or GIMP. There are already rudimentary ways to manipulate a model's output iteratively – e.g., choosing regions to redraw (inpainting) and refining prompts.