The rise and fall of robots.txt

alex [they, il]@jlai.lu · 11 months ago

AutoTL;DR@lemmings.world · 11 months ago

This is the best summary I could come up with:

If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.

AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.

In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, has made high-quality training data one of the internet’s most valuable commodities.

You might build a totally innocent robot to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.

The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.
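For readers wondering what "blocking GPTBot in robots.txt" looks like in practice, here is a rough sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are all made up for illustration; they are not taken from any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turns GPTBot away from the whole site
# and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/article")
        print(agent, "allowed" if verdict else "blocked")
```

Keep in mind that this only reports what the file asks for. Nothing in robots.txt forces a crawler to comply, which is the tension the article is about.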

“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.

The original article contains 2,912 words; the summary contains 239 words. Saved 92%. I’m a bot and I’m open source!

raoulraoul@midwest.social · 11 months ago

Screw, tin man. 🖕