

Even more reason why it should be released as a dump, and not walled in behind a website.
This is assuming aggressively cached, yes.
Also, “just text files” is what every website is, sans media. And you can still EASILY get 10+ MB pages this way between HTML, CSS, JS, and JSON, which are all text files.
A gitea repo page, for example, is 400-500KB transferred (1.5-2.5MB decompressed) of almost all text.
A file page is heavier, coming in around 800-1000KB (additional JS and CSS).
If you have a repo with 150 files, and the scraper isn’t caching assets (many don’t), then at roughly 900KB per file page you just served up 135MB of HTML/CSS/JS alongside the actual repository assets.
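For anyone who wants to plug in their own numbers, here's a rough sketch of that arithmetic. The function name is made up for illustration, and it assumes decimal units (1MB = 1000KB) and that the scraper re-downloads every asset on every page.

```python
def scrape_overhead_mb(file_pages: int, kb_per_page: float) -> float:
    """Page-chrome transfer in MB for a scraper that caches nothing.

    Assumes decimal units (1 MB = 1000 KB) and one full page load per file.
    """
    return file_pages * kb_per_page / 1000


# 150 file pages at the low, middle, and high end of the 800-1000KB range.
for kb in (800, 900, 1000):
    print(f"{kb} KB/page -> {scrape_overhead_mb(150, kb):.0f} MB")
```

The 135MB figure is the middle of that range; the low and high ends come out to 120MB and 150MB.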


Fair, fair. I missed that.
I can get a 50Gb/s residential link where I am, and have a whole rack of servers.
Sounds like a good opportunity to crowdfund thousands and thousands of common scrapable instances that have random poisoning.


Low key win for kink communities.


Yeah but that was before you had billionaires of this size able to manipulate entire markets in this capacity.


Why do they make this so difficult by not providing a raw dump…


As usual, information leaks that are tied to a single source are incredibly easy to squash before distribution…
A raw dump needs to be available for things like this. Uploaded as a torrent, and/or to usenet.


There’s like… 8 people on this list. And there is no way to actually just download the information either for offline processing and analysis.
Yet all reports are saying 4500?
Seems off.


It doesn’t need just a website. It needs a torrent so it’s not centralized.


Their download doesn’t work. It just opens a blank page.
Have that PDF handy?


ICE isn’t law enforcement, they don’t have the legal authority to enforce laws.
Being an illegal immigrant is also not a crime, it’s a civil offense. Their legally granted powers stop there.
Essentially everything they are doing right now is blatantly illegal. It’s no different than an armed militia in that sense.


This is also a strategic way to prevent resistance from Americans against an authoritarian regime.
Drones would be a significant part of that.


Yeah, it should inflate to 15TB or more I think


It literally says so in the link. Go to the link and it’s the title.


We found a cool laser mod for our Minecraft server, thermal expansion compatible.


The end for this country really got accelerated about 10 years ago, it’s all momentum since then.
The US has a LOT of inertia, but dammit if its corruption isn’t doing everything it can to find new brakes.


Really wish they would take security vulnerabilities seriously 😞
Because they are significant, and broad-reaching.
I assume that the gitea instance itself was being hit directly, which would make sense. It has a whole rendering stack that has to reach out to a database, get data, render the actual webpage through a template…etc
It’s a massive amount of work compared to serving up static files from, say, Nginx or Caddy. You can stick one of these in front of your servers and cache HTTP responses (to some degree anyway, that depends on gitea).
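To make the difference concrete, here's a toy sketch of what a response cache buys you. This is not how Nginx, Caddy, or gitea actually implement anything; `TTLCache` and `render_page` are hypothetical names, and the "render" is just a stand-in for the database plus template work.

```python
import time


class TTLCache:
    """Tiny in-memory response cache keyed by path (illustration only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # path -> (expires_at, body)

    def get(self, path, render):
        now = self.clock()
        hit = self.store.get(path)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the expensive render
        body = render(path)  # miss or expired: do the real work once
        self.store[path] = (now + self.ttl, body)
        return body


calls = []


def render_page(path):
    # Stand-in for the database lookup and template rendering.
    calls.append(path)
    return f"<html>rendered {path}</html>"


cache = TTLCache(ttl_seconds=60)
for _ in range(1000):
    cache.get("/repo/file.txt", render_page)

print(f"1000 requests, {len(calls)} render(s)")
```

A thousand requests for the same page hit the backend once per TTL window; everything else is served from memory.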
Benchmarks like this show what kind of throughput you can expect on say a 4 core VM just serving up cached files: https://blog.tjll.net/reverse-proxy-hot-dog-eating-contest-caddy-vs-nginx/#10-000-clients
90-400MB/s derived from the stats here on 4 cores. At the top end that’s enough to saturate a 3Gb/s connection. And caching intentionally polluted sites is crazy easy since you don’t care if it’s stale or not. Put a Cloudflare cache in front of it and it’s even easier.
You could dedicate an old Ryzen CPU box (say a 2700X) to a proxy, and another RAM-heavy device to the servers, and saturate 6Gb/s with thousands and thousands of various software instances that feed polluted data.
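Quick sanity check on the unit conversion, as a sketch assuming decimal units (1 byte = 8 bits, 1Gb = 1000Mb). The helper name is made up.

```python
def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes/second to gigabits/second (decimal units)."""
    return mb_per_s * 8 / 1000


for rate in (90, 400):
    print(f"{rate} MB/s = {mb_per_s_to_gbit_per_s(rate):.2f} Gb/s")
```

So the range works out to roughly 0.7 to 3.2Gb/s, and only the top end covers a 3Gb/s link.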
Hell, if someone made it a deployable utility… Oof just have self hosters dedicate a VM to shitting on LLM crawlers, make it a party.