Foundational Models Are Not Enough
LLMs are easier than ever to get started with. You don’t need cleaned data. You don’t need a data pipeline. The payload is often just plain text. Application developers are empowered to get initial results without help. But foundational models won’t be enough. (see: more reasons why application developers can use LLMs directly)
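To make that concrete, here's a minimal sketch of a call whose payload is just plain text, using the OpenAI Python client; the model name is illustrative and the API key is read from the environment.

```python
# A plain-text payload: no cleaned data, no pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any chat-capable model works
    messages=[{"role": "user", "content": "Summarize LLM tooling in one sentence."}],
)
print(response.choices[0].message.content)
```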
Need for deterministic runtimes: LLMs are good reasoners but tend to hallucinate. They fail spectacularly when asked to do calculations, execute code, or run algorithms. Deferring execution to deterministic runtimes for a subset of problems makes LLMs much more useful.
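As a sketch of what deferring looks like: instead of letting the model do arithmetic in its head, prompt it to emit an expression and execute that expression deterministically. The tool protocol is hypothetical; the evaluator is ordinary Python.

```python
# The LLM is prompted to emit e.g. {"tool": "calculate", "expression": "17 * 43"};
# we execute the expression instead of trusting the model's arithmetic.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("unsupported expression")

def calculate(expression: str) -> float:
    return evaluate(ast.parse(expression, mode="eval").body)

print(calculate("17 * 43"))  # 731, computed deterministically
```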
Unable to keep track of state: Context length is constantly improving (see the napkin math on information retrieval), but we have systems that are purpose-built for data storage: relational databases for normalized data, in-memory key-value stores for caches, and vector databases for sentence embeddings.
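A toy sketch of state the model can't hold itself: an in-memory vector store. A production system would use one of the purpose-built databases above, and the embeddings here stand in for a real embedding model.

```python
# Toy vector store: keeps (text, embedding) pairs outside the model's
# context window and retrieves the closest match by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class VectorStore:
    def __init__(self):
        self.rows = []  # (text, embedding) pairs

    def add(self, text, embedding):
        self.rows.append((text, embedding))

    def nearest(self, query_embedding):
        return max(self.rows, key=lambda row: cosine(row[1], query_embedding))[0]
```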
Real-time data pipelines: Fine-tuning is slow and expensive. There’s potentially an online training pipeline to be built, but for the most part, it’s easier to let the LLM fetch real-time data using tools.
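A sketch of the tool-based alternative: the model calls a function that fetches live data at query time, and the result is injected back into the prompt. The endpoint URL and response shape are hypothetical.

```python
import requests

def get_current_price(ticker: str) -> float:
    """Tool the LLM can call for live data it was never trained on."""
    resp = requests.get(f"https://api.example.com/quotes/{ticker}", timeout=5)  # hypothetical endpoint
    resp.raise_for_status()
    return resp.json()["price"]

# The result is fed back into the prompt, e.g.:
# "Tool result: AAPL is trading at {price}. Now answer the user's question."
```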
Privacy, security, and moderation: Many systems must take special care that LLMs do not exfiltrate private data, do not access data or APIs they aren’t authorized for, and do not return offensive or undesirable responses. (see why ChatGPT needs authorization primitives)
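One sketch of an authorization primitive: a check that sits between the model and every tool it can invoke, so the model can only reach APIs the end user is entitled to. The roles and tool names are illustrative.

```python
# Map of user roles to the tools the LLM may invoke on their behalf.
ALLOWED_TOOLS = {
    "analyst": {"read_report"},
    "admin": {"read_report", "delete_report"},
}

def dispatch(user_role: str, tool: str, tool_fn, *args):
    """Gate every LLM-initiated tool call on the caller's permissions."""
    if tool not in ALLOWED_TOOLS.get(user_role, set()):
        raise PermissionError(f"{user_role!r} may not call {tool!r}")
    return tool_fn(*args)
```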
Schema-checking: To be useful, most LLM calls instruct the model to respond in a schema so the output can be automatically parsed.
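A sketch with `jsonschema`: validate the model's reply against the schema before trusting it, and re-prompt on failure. The schema itself is illustrative.

```python
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

def parse_llm_reply(raw: str) -> dict:
    reply = json.loads(raw)
    validate(instance=reply, schema=SCHEMA)  # raises on any mismatch
    return reply

try:
    parse_llm_reply('{"sentiment": "great"}')  # violates the enum
except ValidationError:
    pass  # typically: re-prompt the model with the validation error
```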
Task-specific APIs: Natural language prompts aren’t the final UX for LLMs. ChatML shows us that task-specific models might have corresponding APIs that provide more context (in the chat case, the speaker). (see why ChatML is interesting).
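ChatML's format itself is simple: each message is wrapped in special tokens that carry the speaker's role. A sketch of serializing messages into it:

```python
# Serialize role-tagged messages into ChatML, per OpenAI's description
# of the format: each message is delimited by <|im_start|>/<|im_end|>.
def to_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is ChatML?"},
]))
```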
Ensembles and other routing infrastructure: Instead of running one extremely large model, what if it were more efficient to intelligently route queries to smaller, more specific models? Imagine a set of fine-tuned LLaMA-based models for a constrained set of tasks. History shows us that ensemble models tend to be unreasonably effective.
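A sketch of such a router: a cheap classifier picks which specialized model serves each query. The registry and keyword heuristic are illustrative stand-ins for fine-tuned models and a learned classifier.

```python
# Registry of task-specific models; each entry stands in for a call to a
# smaller fine-tuned model.
MODELS = {
    "code": lambda q: f"[code model] {q}",
    "summarize": lambda q: f"[summarization model] {q}",
    "general": lambda q: f"[general model] {q}",
}

def classify(query: str) -> str:
    """Toy routing heuristic; a real router might be a small classifier."""
    if "def " in query or "import " in query:
        return "code"
    if query.lower().startswith(("summarize", "tl;dr")):
        return "summarize"
    return "general"

def route(query: str) -> str:
    return MODELS[classify(query)](query)
```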
Chaining, workflows, and orchestration: Complex LLM queries might require multiple steps of reasoning. How do you connect steps together? How do you debug? Traditional orchestration and workflow engines won’t work out of the box.
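A sketch of a two-step chain that records intermediate results so a failed run can be debugged; `llm()` is a stand-in for any completion call.

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return f"<completion of: {prompt!r}>"

def run_chain(question: str):
    trace = []  # intermediate outputs, kept for debugging
    plan = llm(f"Break this into steps: {question}")
    trace.append(("plan", plan))
    answer = llm(f"Using this plan, answer the question.\nPlan: {plan}\nQuestion: {question}")
    trace.append(("answer", answer))
    return answer, trace
```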
RLHF and other fine-tuning: Sometimes, it’s hard to get LLMs to do precisely what you want. Fine-tuning or RLHF can help guide a model to desired output.
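The raw material for supervised fine-tuning is just labeled examples of the behavior you want. A sketch of prompt/completion pairs in JSONL, one common format (e.g. OpenAI's legacy fine-tuning API):

```python
import json

# Illustrative training examples of the desired output behavior.
examples = [
    {"prompt": "Classify sentiment: I loved it ->", "completion": " positive"},
    {"prompt": "Classify sentiment: Waste of money ->", "completion": " negative"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```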