AI models collapse when trained on recursively generated data - Nature

Stopthatgirl7@lemmy.world · 1 month ago

Admiral Patrick@dubvee.org · 1 month ago

Good. Those models flooded the internet with shit, so they can eat it.

“Don’t shit where you eat” is solid advice no matter the venue.

_sideffect@lemmy.world · 1 month ago

Haha, well said

Rentlar@lemmy.ca · 1 month ago

Repeating lossy compression on a dataset produces lossier data, which seems pretty intuitive. Glad it is spelled out in a paper.
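
For anyone who wants to see the "lossier every pass" effect in miniature, here is a toy sketch. It is not the paper's experimental setup, just an assumed stand-in: the "model" is a Gaussian fitted to a small sample, and each generation is trained only on samples drawn from the previous generation's fit. The function name, sample size, and generation count are all made up for illustration.

```python
import random
import statistics


def recursive_fit(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation after each generation,
    starting with the "real data" value of 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate" a dataset from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then "train" the next model on that generated data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spread = recursive_fit()
    for gen in (0, 50, 100, 200, 300):
        print(f"generation {gen:3d}: fitted std = {spread[gen]:.3g}")
```

Each refit on a small sample loses a little of the spread on average, and with no fresh data coming in nothing ever restores it, so the fitted distribution narrows toward a single point over enough generations.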

Nougat@fedia.io · 1 month ago

Moar jpg?

sensiblepuffin@lemmy.world · 1 month ago

Deep-fried LLMs.

Nougat@fedia.io · 1 month ago

Large Lunch Models

TootSweet@lemmy.world · 1 month ago

So one potentially viable way to destroy AI would be to repeatedly train LLMs and image generators on their own (or rather previous generations’) output to get garbage/junk/bad training data and then publish the text/images in places where bots trawling for training data are likely to find them.

Probably bonus points if the images still look “sensical” to the human eye, so that humans eyeballing the data don’t realize it’s the digital equivalent of a sabot. (Apparently the story about sabots being thrown into machinery is not true, but you know what I mean.)

Admiral Patrick@dubvee.org · 1 month ago

I already block all the LLM scraper bots via user agent.

I’ve been toying with the idea of, instead of returning 404 for those requests, returning LLM-generated drivel to poison the well.
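
A minimal sketch of what user-agent blocking could look like, not the commenter's actual setup. The token list is an illustrative assumption (check each crawler's own documentation for the strings it really sends), and `is_llm_scraper` and `status_for` are hypothetical helper names.

```python
# Illustrative token list only (an assumption, not an authoritative registry).
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")


def is_llm_scraper(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


def status_for(user_agent: str) -> int:
    """Pick the HTTP status to send: 404 for listed scrapers, 200 for everyone else."""
    return 404 if is_llm_scraper(user_agent) else 200


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))            # 404
    print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))  # 200
```

The User-Agent header is self-reported, so this only catches bots that identify themselves honestly.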

Amanda@aggregatet.org · 1 month ago

This is a really good idea, actually.

snooggums@midwest.social · 1 month ago

“train LLMs and image generators on their own (or rather previous generations’)”

AIncest!

lemmyng@lemmy.ca · 1 month ago

Deep fried AI.

Leate_Wonceslace@lemmy.dbzer0.com · 1 month ago

Honestly, that’s pretty much what I expected. It’s just an incestuous mashup of pre-existing data. The only way I could see it working is by expanding specific key terms to help an AI identify what something is or isn’t. For example, I have a local instance generate Van Gogh paintings that he never made because I love his style. Unfortunately, there are a bunch of quirks that go along with that. For instance, I get lots of pictures of bearded men, flowers, and photos of paintings. Selecting specific images to train the model on “Van Gogh” might make sense because of the quality of the initial training data. Doing it recursively and automatically? That’s bad mojo.

Fedizen@lemmy.world · 1 month ago

Apparently this is a major problem with both AI models and circular human centipedes.

fishpen0@lemmy.world · 1 month ago

This will drive billions into refining the surveillance state. They now know they need genuine original human interaction and will do everything possible to capture everything from texts to CCTV footage.

BKXcY86CHs2k8Coz@sh.itjust.works · 1 month ago

So, should we update the Glaze toolset or should we all just host pages and pages of AI-generated images? Maybe someone can write something that allows iframes/blocks that just load sheets of Midjourney images and reload nightly. Maybe add some text that is written in prompt-style commands just to really fuck with these. Anyway, obligatory “we have the tools”.

obbeel@lemmy.eco.br · 1 month ago

While it’s good to be cautious about future scenarios, it’s hard to believe AI won’t help greatly with innovation. The AI will become more biased, OK. But what about all the prompts people make? If there is a solid fact basis in the AI model, why bother? Especially when the output works.

TheOneCurly@lemm.ee · 1 month ago

That’s what this is about… Continual training of new models is becoming difficult because there’s so much generated content flooding data sets. The models don’t become biased or overly refined; they stop producing output that resembles human text.