Matt Rickard

Robots.txt For LLMs

blog.matt-rickard.com

Jul 20, 2023

Robots.txt is a file that gives search engine crawlers a polite hint about which pages shouldn’t be crawled. It’s not legally binding (though I’m not a lawyer). It used to benefit both webmasters and search engines: Google would sometimes take sites down by accident by sending them too much traffic. (Obviously, that’s not a concern anymore.)
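
To make the "polite hint" concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for illustration; nothing forces a crawler to run this check.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/posts/robots-txt-for-llms",
        "https://example.com/admin/settings",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(f"{url}: {verdict}")
```

The check is purely advisory: a crawler that skips it can still fetch every page.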

How can sites tell LLMs what data shouldn’t be included in a training corpus? And are the incentives there for both data creators and consumers?

  • Avoid Copyrighted Data: Distributors and creators of LLMs would like to know with more certainty that their models haven’t been trained on copyrighted data. A robots.txt could hint at which files are under copyright, but a better solution might be something more integrated with the license itself.

  • Keep Content Quality High: Some content hosted on websites might not be relevant for LLMs, just as it wasn’t for search engines (admin pages, etc.). On the flip side, the file might steer LLMs toward content that creators do want indexed.

  • Allow Privacy and Control: Some content creators might not want their data indexed in an LLM. A robots.txt file wouldn’t technically prevent this, but I believe that most LLM companies would respect it (just like you can opt out of many of the ad-tracking policies on Google and Meta if you dig deep enough).
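
One way the ideas in the list above could play out is a robots.txt with a separate rule group for an LLM crawler, which a training pipeline checks before adding a page to its corpus. This is only a sketch: `ExampleLLMBot`, `ExampleSearchBot`, the paths, and the `allowed_for_training` helper are all hypothetical names, and it assumes LLM crawlers identify themselves with a distinct user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the LLM crawler is kept out of paid content,
# while every other crawler is only kept out of admin pages.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /paid/
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

URLS = [
    "https://example.com/posts/robots-txt-for-llms",
    "https://example.com/paid/essay",
    "https://example.com/admin/settings",
]


def allowed_for_training(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the rules allow user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    print("LLM crawler:   ", allowed_for_training(ROBOTS_TXT, "ExampleLLMBot", URLS))
    print("Search crawler:", allowed_for_training(ROBOTS_TXT, "ExampleSearchBot", URLS))
```

The same file gives the search crawler access to the paid essay while excluding it from the LLM crawler's corpus, which is the kind of per-consumer control the privacy bullet describes.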

The other question: where should it go? Should it be limited to web servers? Should it sit in public code repositories? Should it be embedded in the markup itself?
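
As a rough sketch of the "embedded in the markup" option, a page could carry its own opt-out signal that a corpus builder checks while parsing the HTML. The `llm-training` meta tag name and its `disallow` value are invented for this example and are not an existing standard; the parsing uses Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TrainingOptOutParser(HTMLParser):
    """Looks for a hypothetical <meta name="llm-training" content="disallow"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.opted_out = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. bare attributes), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if (
            attr.get("name", "").lower() == "llm-training"
            and attr.get("content", "").lower() == "disallow"
        ):
            self.opted_out = True


def page_opts_out(html: str) -> bool:
    """Return True if the page carries the hypothetical opt-out meta tag."""
    parser = TrainingOptOutParser()
    parser.feed(html)
    parser.close()
    return parser.opted_out


if __name__ == "__main__":
    page = '<html><head><meta name="llm-training" content="disallow"></head></html>'
    print(page_opts_out(page))
```

A per-page tag like this would travel with the content when it is mirrored or scraped from somewhere other than the original server, which a site-level robots.txt does not.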

2 Comments
Andrew Smith
Writes Goatfury Writes
Jul 20

This is definitely a complex area, and very likely to be THE legal story of the year, either this year or next. This is kind of everything.

Jay Pinho
Writes networked
Jul 20

In principle I like the idea, but I can also see some ways in which this could have some very weird, subpar outcomes. One easy example: a news provider (NYT, WSJ, etc.) asks LLMs not to crawl any of their content, while the subjects of some of the journalists' corruption investigations launch their own made-for-LLM pages with positive coverage of themselves.

© 2023 Matt Rickard