Oh hell yeah.
Months ago I was brainstorming something almost identical to this concept: use the reverse proxy to serve pre-generated AI slop to AI crawler user agents while serving the real content to everyone else. Looks like someone did exactly that, and now I can just deploy it. Fantastic.
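For anyone wondering what the user-agent split looks like in practice, here is a minimal sketch of the routing decision only. It is not how iocaine itself does it. The token list and the `pick_upstream` name are assumptions made for illustration, and a real deployment would put this logic in the reverse proxy config rather than in Python.

```python
# Illustrative, non-exhaustive list of substrings seen in AI crawler user agents.
# Which crawlers to include is an assumption; maintain your own list.
AI_CRAWLER_TOKENS = ("gptbot", "ccbot", "claudebot")

def pick_upstream(user_agent):
    """Return "garbage" for suspected AI crawlers, "real" for everyone else."""
    ua = (user_agent or "").lower()  # a missing header is treated as a normal visitor
    if any(token in ua for token in AI_CRAWLER_TOKENS):
        return "garbage"
    return "real"

print(pick_upstream("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(pick_upstream("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

Matching on the user agent only catches crawlers that identify themselves honestly, so treat this as a first filter, not a guarantee.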
AI slop is actually better than random data because it creates a feedback loop, which is more destructive.
If you use natural text to train model A, and then use model A’s output, a, to train model B, then model B’s output will be worse than model A’s output. The quality degrades with each generation, but only gradually, over several generations of models. So, random data is worse than AI slop, because random data is already of the lowest possible quality for AI training.
Yes, but random data might be easier to detect in the first place, and could then be filtered.
Poison the AI. I’m all for it.
Why is no one talking about the fact that the demo is clearly using the Bee Movie script to power the Markov chain generation?
This thing spits out some gold:
Honey, it changes people.
I’m taking aim at the baby.
Would this interfere with legitimate crawlers as well, the Internet Archive for instance?
Could you list specific crawlers to be automatically blocked by the iocaine site?
So it’s like nightshade for LLMs?
Better, actually. This feeds the crawler a potentially infinite amount of nonsense data. If not caught, this will fill up whatever storage medium is used. Since the data is generated using Markov chains, any LLM trained on it will learn to disregard context that goes farther back than one word, which would be disastrous for the quality of any output the LLM produces.
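To make the "one word of context" point concrete, here is a rough sketch of a first-order Markov chain text generator. It is not iocaine's actual implementation. The tiny corpus and the names `build_chain` and `generate` are made up for the example.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain, start, length, rng):
    """Walk the chain: each next word depends only on the current word."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            # Dead end (word never had a successor): restart from any known word.
            word = rng.choice(sorted(chain))
        output.append(word)
    return " ".join(output)

# Hypothetical toy corpus; a real generator would be seeded with a long text.
corpus = "the bee flies to the flower and the flower feeds the bee"
chain = build_chain(corpus)
print(generate(chain, "the", 12, random.Random(0)))
```

Every pair of adjacent words in the output appeared somewhere in the corpus, but nothing ties a word to anything earlier than its immediate predecessor, which is why the text looks locally plausible and globally meaningless.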
Technically, it would be possible for a single page using iocaine to completely ruin an LLM. With nightshade you’d have to poison quite a number of images. On the other hand, iocaine text can be easily detected by a human, while nightshade is designed to not be noticeable by humans.
I’m not sure I fully understand.
This generates garbage if it thinks the client making the request is an AI crawler. That much I get.
What I don’t understand is when it talks about trapping the crawler. What does that mean?
Simply put, a crawler reads a page, takes note of all the links on it, then reads all of the linked pages, notes all the links there, reads those, and so on. This website only ever links to internal pages that are randomly generated, and those in turn only link to other randomly generated pages, trapping the crawler if it has no properly configured exit condition.
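A small simulation shows why that traps a naive crawler. This is only a sketch under assumptions: the `/maze/` URL scheme, the hash-based link generation, and both function names are invented here and are not taken from iocaine.

```python
import hashlib
from collections import deque

def fake_page_links(path, n_links=3):
    """Return internal links for a generated page, derived from its path.

    Hypothetical scheme: hashing the path means the same URL always yields
    the same links, and every link points at another generated page.
    """
    links = []
    for i in range(n_links):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"/maze/{digest}")
    return links

def crawl(start, max_pages):
    """Naive breadth-first crawler; max_pages is its only exit condition."""
    seen = {start}
    queue = deque([start])
    fetched = 0
    while queue and fetched < max_pages:
        path = queue.popleft()
        fetched += 1
        for link in fake_page_links(path):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, len(queue)

fetched, pending = crawl("/maze/start", max_pages=1000)
print(f"fetched {fetched} pages, {pending} still queued")
```

The queue grows faster than the crawler can drain it, so without a page budget, depth limit, or per-domain cap, the crawl never finishes.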
How hard would this be for a sophisticated enough bot to detect the intention here, and blacklist the domain on a shared blacklist set? I would imagine not too difficult. Good idea, though. The start of something potentially great.
Don’t these crawlers save some kind of metadata before fully committing it to their databases? It’d surely be able to see that a specific domain served just garbage (and/or that it’s so “basic”), and then blacklist/purge the data? Or are the AI crawlers even dumber than I’d imagine?
I’d be surprised if anything crawled from a site using iocaine actually made it into an LLM training set. GPT-3’s initial set of 45 terabytes was filtered down to the 570 GB it was actually trained on. So yeah, there’s a lot of filtering/processing that takes place between crawl and train. Then again, they seem to have failed entirely to clean the Reddit data they fed into Gemini, so /shrug