Discover more from Matt Rickard
Data-Local Machine Learning
Data is slow and expensive to move around. What if we moved the compute to the data instead, running functions, containers, and other jobs right next to where the data is stored? Here's what's been tried, and where things go from here.
Integrated compute over a distributed object store (Manta). The earliest cloud-native version of this that I've seen is Manta from Joyent, which was started back in 2011. The insight came from Bryan Cantrill (Sun, DTrace, Joyent, Oxide): Solaris Zones (a precursor to modern containers) could provide isolation for compute running directly on top of an object store. Unfortunately, the idea was probably ahead of its time. Docker containers were built on Linux containers (not Solaris Zones), and Kubernetes and the public clouds took the lead on object storage.
Another strategy is implementing machine learning at the database layer. BigQuery ML, MindsDB, and, more recently, PostgresML are all examples of this. Data analysts and data scientists can call models directly from SQL. Usually, that means lower latency and less boilerplate for moving data around. The downside is that SQL isn't great for procedural logic. For example, cleaning data, experimenting, and visualizing data are often hard or impossible directly in SQL.
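To make "calling a model from SQL" concrete, here is a toy sketch using SQLite's user-defined functions from Python's standard library. It is not how BigQuery ML, MindsDB, or PostgresML work internally. The `features` table, the `predict` function, and the fixed weights are all hypothetical stand-ins for a real table and a trained model. The point is only that inference runs inside the query, so rows never leave the database to be scored.

```python
import sqlite3

# Hypothetical "model": fixed linear weights standing in for a trained model.
WEIGHTS = (0.5, -0.25)
BIAS = 1.0


def predict(x1, x2):
    """Score one row; a stand-in for real model inference."""
    return WEIGHTS[0] * x1 + WEIGHTS[1] * x2 + BIAS


def build_db():
    """Create an in-memory database with a small, made-up feature table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)"
    )
    conn.executemany(
        "INSERT INTO features (id, x1, x2) VALUES (?, ?, ?)",
        [(1, 2.0, 4.0), (2, 0.0, 0.0), (3, -2.0, 8.0)],
    )
    # Register the Python function so SQL can call it by name.
    conn.create_function("predict", 2, predict)
    return conn


def score_all(conn):
    """Run inference inside the query instead of exporting rows first."""
    query = "SELECT id, predict(x1, x2) FROM features ORDER BY id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_db()
    for row_id, score in score_all(conn):
        print(row_id, score)
```

The same sketch hints at the downside: anything beyond a single scoring call, such as cleaning or plotting, still ends up back in Python rather than in the SQL itself.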