Two-faced AI language models learn to hide deception - ‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working.

Lugh@futurology.today · 10 months ago

Two-faced AI language models learn to hide deception - ‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working.

sbv@sh.itjust.works · 10 months ago

So they’re saying ai is software?

Maybe Volkswagen will start using it in their emissions control systems.

Possibly linux · 10 months ago

Great, we are all going to die

mateomaui@reddthat.com · 10 months ago

Just… don’t hook it up to the defense grid.

Possibly linux · 10 months ago

Sorry, to late for that

mateomaui@reddthat.com · 10 months ago

Alright, I’ll be out back digging the bomb shelter.

Possibly linux · edit-2 10 months ago

Its too late for that honestly

mateomaui@reddthat.com · 10 months ago

Alright, I’ll switch to digging holes for the family burial ground.

Daxtron2@startrek.website · 10 months ago

LLM trained on adversarial data, behaves in an adversarial way. Shocking

CanadaPlus@futurology.today · 10 months ago

Yeah. For reference, they made a model with a back door, and then trained it to not respond in a backdoored way when it hasn’t been triggered. It worked but it didn’t effect the back door much, and that means that it technically was acting more differently - and therefore deceptively - when not triggered.

Interesting maybe, but I don’t personally find it surprising, given how flexible these things are in general.