Robots.txt is a file that gives search engine crawlers a polite hint about which pages shouldn’t be crawled. It’s not legally binding (I’m not a lawyer). It used to benefit both webmasters and search engine crawlers: Google used to take sites down by accident by sending them too much traffic. (Obviously, that’s not a concern anymore.)
How can sites tell LLMs which data shouldn’t be included in a training corpus? And are the incentives there for both data creators and consumers?
Avoid Copyrighted Data: Distributors and creators of LLMs would like to know with more certainty that their models haven’t been trained on copyrighted data. A robots.txt could hint at which files are under copyright, but a better solution might be something more integrated with the license itself.
Keep Content Quality High: Some content hosted on websites might not be relevant for LLMs, just as it wasn’t for search engines (admin pages, etc.). On the flip side, the file might steer LLMs toward content that creators do want included.
Allow Privacy and Control: Some content creators might not want their data used by an LLM. A robots.txt file wouldn’t prevent this, but I believe that most LLM companies would respect it (just like you can opt out of many of the ad-tracking policies on Google and Meta if you dig deep enough).
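To make the idea concrete, here is a minimal sketch of how a well-behaved training crawler might honor such a file, using Python’s standard urllib.robotparser. The user agent name ExampleLLMBot, the example.com URLs, and the helper may_use_for_training are all made up for illustration. This is just the existing robots.txt mechanism pointed at a hypothetical LLM crawler, not an established standard for training data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) LLM crawler everywhere,
# and keep all other crawlers out of the admin pages only.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def may_use_for_training(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url.

    Hypothetical helper: it treats "allowed to crawl" as a stand-in for
    "allowed to include in a training corpus", which is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleLLMBot", "SearchBot"):
        for url in ("https://example.com/blog/post",
                    "https://example.com/admin/settings"):
            print(agent, url, may_use_for_training(ROBOTS_TXT, agent, url))
```

Nothing here is enforced, of course. The check only matters if the crawler chooses to run it, which is exactly the “polite hint” nature of robots.txt.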
The other question: where should it go? Should it just be limited to web servers? Should it sit in public code repositories? Should it be embedded in the markup itself?
This is definitely a complex area, and very likely to be THE legal story of the year, either this year or next. This is kind of everything.
In principle, I like the idea, but I can also see ways in which it could lead to some very weird, subpar outcomes. One easy example: a news provider (NYT, WSJ, etc.) asks LLMs not to crawl any of its content, while the subjects of some of its journalists' corruption investigations launch their own made-for-LLM pages with positive coverage of themselves.