• Admiral Patrick@dubvee.org
    English
    arrow-up
    46
    ·
    5 days ago

    Oh hell yeah.

    Months ago I was brainstorming something almost identical to this concept: use the reverse proxy to serve pre-generated AI slop to AI crawler user agents while serving the real content to everyone else. Looks like someone did exactly that, and now I can just deploy it. Fantastic.
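A minimal sketch of the routing decision described above, assuming the proxy can call a small function per request. The pattern list and the `choose_upstream` name are illustrative assumptions, not part of any tool mentioned in this thread; a real deployment would need a maintained list and should expect crawlers that spoof their user agent.

```python
import re

# Illustrative, non-exhaustive list of crawler user-agent tokens (assumption).
AI_CRAWLER_PATTERNS = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)


def choose_upstream(user_agent: str) -> str:
    """Return which backend a reverse proxy should forward the request to.

    "garbage" stands for the pre-generated nonsense, "real" for the actual site.
    Both labels are hypothetical names used only in this sketch.
    """
    if user_agent and AI_CRAWLER_PATTERNS.search(user_agent):
        return "garbage"
    return "real"
```

Matching on the user agent alone only catches crawlers that identify themselves honestly, which is why this is a sketch of the idea rather than a complete defence.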

    • AItoothbrush
      English
      arrow-up
      8
      ·
      4 days ago

      AI slop is actually better than random data because it creates a feedback loop, which is more destructive.

      • Saledovil@sh.itjust.works
        arrow-up
        4
        ·
        4 days ago

        If you train model A on natural text and then train model B on model A’s output, model B’s output will be worse than model A’s. Quality degrades with each generation, but it happens gradually, over generations of models. So random data is worse than AI slop, because random data is already of the lowest possible quality for AI training.

  • wizardbeard@lemmy.dbzer0.com
    English
    arrow-up
    13
    ·
    4 days ago

    Why is no one talking about the fact that the demo is clearly using the Bee Movie script to power the Markov chain generation?

    This thing spits out some gold:

    Honey, it changes people.

    I’m taking aim at the baby.

    • Saledovil@sh.itjust.works
      arrow-up
      14
      ·
      4 days ago

      Better, actually. This feeds the crawler a potentially infinite amount of nonsense data. If not caught, this will fill up whatever storage medium is used. Since the data is generated using Markov chains, any LLM trained on it will learn to disregard context that goes farther back than one word, which would be disastrous for the quality of any output the LLM produces.

      Technically, it would be possible for a single page using iocaine to completely ruin an LLM. With Nightshade you’d have to poison quite a number of images. On the other hand, iocaine text can be easily detected by a human, while Nightshade is designed to not be noticeable by humans.
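The one-word context point above can be illustrated with a first-order Markov chain, sketched here under the assumption that the generator picks each word based only on the previous one. This is not iocaine's actual implementation; `build_chain` and `generate` are hypothetical names for this example.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Emit up to `length` words; each choice depends only on the previous word."""
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)
```

Calling `generate(build_chain(script), "Honey,", 20, random.Random())` on any source text would give locally plausible word pairs with no longer-range coherence, which is the property the comment describes.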

  • anytimesoon@feddit.uk
    arrow-up
    5
    ·
    4 days ago

    I’m not sure I fully understand.

    This generates garbage if it thinks the client making the request is an AI crawler. That much I get.

    What I don’t understand is when it talks about trapping the crawler. What does that mean?

    • the_strange@feddit.org
      English
      arrow-up
      23
      ·
      4 days ago

      Simply put, a crawler reads a page, takes note of all the links on it, then reads each of those pages, notes all the links there, reads those, and so on. This website always and only links to internal pages that were randomly generated, and those in turn only link to other randomly generated pages, trapping the crawler if it has no properly configured exit condition.
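A toy simulation of the trap described above, assuming a site where every page links only to more generated pages. `maze_links` and `crawl` are hypothetical names, and the hash-based link scheme is just one way to fake an endless site; the `max_pages` budget plays the role of the exit condition.

```python
import hashlib
from collections import deque


def maze_links(path, fanout=3):
    """Derive links deterministically from any path, so every page has new links."""
    digest = hashlib.sha256(path.encode()).hexdigest()
    return [f"/{digest[i * 8:(i + 1) * 8]}" for i in range(fanout)]


def crawl(start, max_pages=None):
    """Breadth-first crawl; returns how many pages were fetched.

    With max_pages=None there is no exit condition, and against this maze
    the loop would keep running until memory runs out.
    """
    seen = {start}
    queue = deque([start])
    fetched = 0
    while queue:
        if max_pages is not None and fetched >= max_pages:
            break  # exit condition: page budget exhausted
        page = queue.popleft()
        fetched += 1
        for link in maze_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

The queue grows faster than it drains because each fetched page adds several unseen links, so only the budget check ever stops the crawl.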

  • 𝕸𝖔𝖘𝖘@infosec.pub
    English
    arrow-up
    3
    ·
    4 days ago

    How hard would it be for a sufficiently sophisticated bot to detect the intention here and add the domain to a shared blacklist? I would imagine not too difficult. Good idea, though. The start of something potentially great.

  • kungen@feddit.nu
    arrow-up
    2
    ·
    4 days ago

    Don’t these crawlers save some kind of metadata before fully committing data to their databases? They’d surely be able to see that a specific domain served just garbage (and/or that it’s so “basic”), and then blacklist the domain or purge the data? Or are the AI crawlers even dumber than I’d imagine?

    • Hackworth@lemmy.world
      English
      arrow-up
      5
      ·
      edit-2
      4 days ago

      I’d be surprised if anything crawled from a site using iocaine actually made it into an LLM training set. GPT-3’s initial set of 45 terabytes was reduced to the 570 GB it was actually trained on. So yeah, there’s a lot of filtering/processing that takes place between crawl and train. Then again, they seem to have failed entirely to clean the Reddit data they fed into Gemini, so /shrug