Internet Archive played crucial role in tracking shady CDC data removals

ForgottenFlux@lemmy.world · 1 month ago

Omgboom · 1 month ago

Don’t worry they are going to go after that too, I guarantee it

Arghblarg@lemmy.ca · 1 month ago

archive.is or its mirrors should also be used, as archive.org has proven vulnerable to takedown requests from corporations. It wouldn’t surprise me if they could be coerced into removing their data by a US government request as well.

cygnus@lemmy.ca · 1 month ago

It wouldn’t take much; they had multiple breaches and other problems last fall, seemingly due to very avoidable reasons.

dan@upvote.au · edit-2 1 month ago

> very avoidable reasons.

They’re understaffed for the amount of work they do, and their staff are probably even more busy fighting lawsuits at the moment. Things are going to slip through the cracks, unfortunately.

GreenKnight23@lemmy.world · edit-2 1 month ago

deleted by creator

Bogasse@lemmy.ml · 1 month ago

When the internet archive was attacked a few months ago we were like “who would be dumb and mean enough to do that?”. We have new suspects! 🎉

fishos@lemmy.world · edit-2 1 month ago

I literally posted a comment back then saying “sure is odd that this is happening right before the US election. Not saying it means anything, but maybe it’s not a coincidence?” and got downvoted to hell lmao.

BassTurd@lemmy.world · 1 month ago

Any idea the size of IA? Could it be packaged in some torrents and distributed to the masses for decentralized archiving? I’m guessing it’s way more than I could store.

CitricBase@lemmy.world · 1 month ago

As of five years ago, 70 petabytes: https://blog.adafruit.com/2020/12/01/donate-to-the-internet-archive-digital-library-of-free-borrowable-books-movies-music-wayback-machine-internetarchive/

In 2012 it was 10 petabytes. Now, it’s probably well over 100 petabytes. I think it’s well beyond the scope of torrents by now.

BassTurd@lemmy.world · 1 month ago

That’s a bit more than my home server can handle. I could maybe take some CDC data, but definitely not the full shebang. It would be neat if someone could segment the data so we could save some more critical things.

xektop@lemmy.world · 1 month ago

A couple years ago I read that Filecoin has teamed up with the internet archive to synchronize the data on the Blockchain. I’m not sure how far they are yet, but it’s something that could work if it doesn’t turn out to be just crypto hype in the end.

dan@upvote.au · 1 month ago

This comment from 8 months ago says 152PB: https://www.reddit.com/r/DataHoarder/comments/1cu79ke/the_archiveteam_has_a_cost_shameboard_of_the_top/l4om4m6/

TimeSquirrel@kbin.melroy.org · 1 month ago

Is there a way to distribute it so everyone just has parts of it? Aren’t there p2p cloud storage solutions that exist?

9point6@lemmy.world · edit-2 1 month ago

The problem is you’d need to split it down to an amount that people would be happy hosting and then host it multiple times in case any node goes offline.

Another comment in the thread says it’s likely over 100 PB today (100,000 terabytes). I’d say 4 copies (spread over different time zones) is a relatively minimal level of redundancy (people may host on machines that aren’t powered all the time), and I reckon you’d get a network with the most participants, whilst still getting enough storage, at around the 150 GB per node mark.

That comes to nearly 3 million participants needed just to cover today’s archive, and new people would obviously need to join every day as it grows. Also, given that I imagine it would need to be open to all, the redundancy level could do with increasing, to stop malicious actors with a lot of resources from taking on a large share of the network and forcing it all offline at once in an effort to cause data loss.

None of this is insurmountable, but it’s also not remotely easy.
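
For anyone who wants to play with the numbers, here is a rough sketch of the arithmetic above. It assumes decimal units (1 PB = 1,000,000 GB) and that every node contributes exactly the same amount of storage; the function name `nodes_needed` is made up for illustration, and the inputs are just the guesses from this thread, not measured figures.

```python
import math


def nodes_needed(archive_pb: float, copies: int, per_node_gb: float) -> int:
    """Estimate how many equal-sized nodes are needed to hold `copies` full replicas."""
    # Assumption: decimal units, so 1 PB = 1,000,000 GB.
    total_gb = archive_pb * 1_000_000 * copies
    # Round up, since a fraction of a node still means one more participant.
    return math.ceil(total_gb / per_node_gb)


if __name__ == "__main__":
    # Thread estimates: ~100 PB archive, 4 copies, ~150 GB per participant.
    print(f"{nodes_needed(100, 4, 150):,} nodes")
    # Same redundancy using the 152 PB figure quoted elsewhere in the thread.
    print(f"{nodes_needed(152, 4, 150):,} nodes")
```

With the 100 PB guess this lands at roughly 2.7 million nodes, which is where the "nearly 3 million" comes from; the 152 PB figure pushes it past 4 million. Raising `copies` to guard against bad actors scales the result linearly.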

LucidNightmare@lemm.ee · 1 month ago

It doesn’t help that people put silly things onto the IA. I’ve seen some things like YouTube videos that really didn’t need to be there (they contain, objectively, nothing valuable enough to warrant taking up space on servers that could be used for more important materials…).

BassTurd@lemmy.world · 1 month ago

If they added download options on different taxonomies, I’d try to grab some things to archive.

LucidNightmare@lemm.ee · 1 month ago

Yeah, stuff that can be taught is vital to have archived, but some Twitch streamer playing some MMO/shooter/scary game isn’t what I would consider very imperative to get backed up. :P

fossilesque@mander.xyz · 1 month ago

Here’s how to help them: https://github.com/ArchiveTeam/warrior-dockerfile

AlecSadler@sh.itjust.works · 1 month ago

Oh, cool, didn’t know about this, throwing it on my home lab now.

dan@upvote.au · edit-2 1 month ago

That’s not the Internet Archive; that’s a separate group (ArchiveTeam). They use the Internet Archive for storage but are otherwise completely unrelated. ~~The data archived by Archive Team Warrior does not go into the Wayback Machine~~ (edit: sounds like this is no longer true)

Their data is 2.8% of the data in the Internet Archive: https://www.reddit.com/r/DataHoarder/comments/1cu79ke/the_archiveteam_has_a_cost_shameboard_of_the_top/l4om4m6/

PassingThrough@lemm.ee · 1 month ago

As I understand it, their data does in fact enter into the Wayback Machine. It is just also available as direct WARC archive files (which, IMO, sounds beneficial to the idea of exporting in bulk to another backup host). At least that’s how their FAQ reads.

And given that they focus on web crawling, and not other arbitrary data formats that IA accepts, 2.8% of over 100 petabytes is still a respectable amount of data.

That said, help is help. If another archival project team wants me to run a worker node so they can distribute load and dodge crawler blocks, let me know, I’ve got space.

dan@upvote.au · 1 month ago

> As I understand it, their data does in fact enter into the Wayback Machine

Thanks for the info! It never used to, so I guess that changed at some point.

fossilesque@mander.xyz · 1 month ago

It does go into the Wayback Machine AFAIK.

aeshna_cyanea@lemm.ee · 1 month ago

These guys seem cool but they’re not the archive.org from the op article

fossilesque@mander.xyz · 1 month ago

It’s a team of volunteers who help scrape and upload things to archive.org.

T00l_shed@lemmy.world · 1 month ago

Need an archive of the archive