Product teams and application engineers will be the buyers of the foundational model stack, not data teams. Why?
Direct value without a data pipeline. Application engineers can get direct value out of LLMs without involving a data team. For a proof-of-concept or demo, all they have to do is build a little infrastructure around a hosted foundational model. They don’t need access to the data warehouse, since a tiny bit of copy-pasted data can validate an idea. And because the interface is just an API call, Python won’t be the only language of LLMs.
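As a rough sketch of how little infrastructure such a proof-of-concept needs: paste a handful of rows straight into the prompt and build one request to a hosted model. The endpoint URL, model name, and payload shape below are placeholders modeled on typical chat-completion APIs, not any specific vendor's interface.

```python
import json

HOSTED_MODEL_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

def build_poc_request(pasted_rows: list[str], question: str) -> dict:
    """Build a request payload that inlines copy-pasted data as context."""
    context = "\n".join(pasted_rows)
    return {
        "model": "some-hosted-foundational-model",  # placeholder model name
        "messages": [
            {"role": "system", "content": "Answer using only the data provided."},
            {"role": "user", "content": f"Data:\n{context}\n\nQuestion: {question}"},
        ],
    }

payload = build_poc_request(
    ["user_id,plan,churned", "1,pro,false", "2,free,true"],
    "Which plan churns more?",
)
print(json.dumps(payload, indent=2))
```

No warehouse credentials, no pipeline, no Python-specific tooling: the same dict could be assembled in TypeScript or Go and POSTed with any HTTP client.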
Shared infrastructure often means shared responsibility. MLOps and the data stack are beginning to converge toward DevOps primitives. Kubernetes has already infiltrated the data stack as the substrate for orchestration, data ingestion, and ETL. OpenAI uses Kubernetes to run distributed training and inference.
Products, not insights. The early use cases for generative AI have been product-centric, not insight-centric. Just look at the number of companies refreshing old products with new generative AI features. For now, data scientists are safe: insights are hard to extract from data automatically, requiring a mix of technical expertise, domain knowledge, and exploration that is tough for models to emulate.
Of course, organizations will rush to fine-tune their own foundational models eventually. This will require data pipelines and data expertise. Prompt engineering is important, and data scientists are in the best position to figure it out.
Will the data teams and engineering teams merge? Will they coexist in some new configuration?
> For now, data scientists are safe: insights are hard to extract from data automatically, requiring a mix of technical expertise, domain knowledge, and exploration that is tough for models to emulate.
Solving for really valuable insights is often hard, but I still see this problem being solved to a great degree. Drawing out key insights algorithmically is quite simple: most insights amount to plotting a distribution and finding a significant segment of it with respect to a north-star business metric like OTR, or simply launching the initiative as an experiment and tracking the metrics. I'm sure there are other techniques for drawing out insights that require domain knowledge, but most product initiatives follow a common playbook in terms of which insights they want to measure, and a company's funnel stays the same for a long time.
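The "playbook" step I'm describing can be sketched mechanically: group records by a segment column and flag the segment whose north-star metric deviates most from the overall rate. The column names and the generic `converted` flag below are illustrative stand-ins for whatever metric (like OTR) a team actually tracks.

```python
from collections import defaultdict

def top_insight(rows, segment_key, metric_key):
    """Return (segment, segment_rate, overall_rate) for the segment whose
    metric rate deviates most from the overall rate."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [metric_sum, count]
    overall_sum = overall_count = 0
    for row in rows:
        totals[row[segment_key]][0] += row[metric_key]
        totals[row[segment_key]][1] += 1
        overall_sum += row[metric_key]
        overall_count += 1
    overall_rate = overall_sum / overall_count
    # Pick the segment with the largest absolute deviation from the overall rate.
    seg, (s, n) = max(
        totals.items(), key=lambda kv: abs(kv[1][0] / kv[1][1] - overall_rate)
    )
    return seg, s / n, overall_rate

rows = [
    {"plan": "free", "converted": 0},
    {"plan": "free", "converted": 0},
    {"plan": "free", "converted": 1},
    {"plan": "pro", "converted": 1},
    {"plan": "pro", "converted": 1},
]
print(top_insight(rows, "plan", "converted"))  # → ('pro', 1.0, 0.6)
```

A real deployment would add significance testing before surfacing a segment, but the skeleton is this small.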
Given enough sample data from previous product initiatives (docs), generative AI should easily be able to solve "recommend the top 5 insight statements given the schema of the data and previous product initiatives" and "given an insight statement, generate the SQL query for the schema of the table". I could imagine engineers getting most of their insights readily, without data scientists, provided the data they export is sufficiently annotated.
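For the second task, the engineering work is mostly prompt assembly. Here's a hedged sketch of turning an insight statement plus a table schema into a prompt for a generative model; the schema, the prior-initiative examples, and the prompt wording are all illustrative, and the actual model call is omitted.

```python
def sql_generation_prompt(schema: dict, insight: str, prior_insights: list[str]) -> str:
    """Assemble a prompt asking a model to write SQL for one insight statement."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in schema.items())
    prior = "\n".join(f"- {p}" for p in prior_insights)
    return (
        f"Table schema: orders({cols})\n"
        f"Insight statements from previous product initiatives:\n{prior}\n"
        f"Write a SQL query against this table that measures: {insight}\n"
        "Return only SQL."
    )

prompt = sql_generation_prompt(
    {"order_id": "int", "region": "text", "delivered_on_time": "bool"},
    "on-time rate by region",
    ["free-plan users churn at twice the overall rate"],
)
print(prompt)
```

The better the exported data is annotated (column descriptions, prior initiative docs), the more of this context can be filled in automatically rather than by a data scientist.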