Stopping a badly behaved bot the wrong way.

DigitalDilemma@lemmy.ml · edit-2 7 months ago

Deebster@programming.dev · 7 months ago

I was kinda hoping for another story about some clever compression bomb or similar to slow up the bot - after all, if it’s hammering this little site it’s surely doing the same to others, even if they haven’t noticed yet. After the robots.txt was ignored I was sure, but I guess this mature, restrained response is probably the correct one *discontentedly kicks can down sepia street*

DigitalDilemma@lemmy.ml · 7 months ago

Some nice evil ideas there!

Daniel Quinn@lemmy.ca · edit-2 7 months ago

Not throwing any shade, just some advice for the future: try to always consider the problem in the context of the OSI model. Specifically, “Layer 3” (network) is always a better place for routing/blocking than “Layer 7” (application) if you can do it.

Blocking traffic at the application layer means that the traffic has to be routed through (bandwidth consumption), then assembled and processed (CPU cost), before a decision can be made. You should always try to limit the stuff that makes it to layer 7 if you’re sure you won’t want it.

The trouble with blocking at layer 3 (and layer 4, where the ports live) of course is that you don’t have application data there. No host name, no HTTP headers, etc., just packets with a few bits of information:

source IP and port
destination IP and port
A few other firewall-specific bits of information, like TCP flags (e.g. SYN) or whether this packet is part of an established connection, etc.

In your case though, you already knew what you didn’t want: traffic from a particular IP, and you have that at the network layer.
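A minimal sketch of how little information that decision needs, assuming a hypothetical blocklist of documentation-range addresses. A real firewall does this in the kernel or at the edge rather than in Python; the point is only that the source IP alone is enough.

```python
import ipaddress

# Hypothetical blocklist; these are reserved documentation addresses,
# not the bot's real ones.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]

def is_blocked(source_ip: str) -> bool:
    """Return True if source_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

print(is_blocked("203.0.113.42"))  # True
print(is_blocked("192.0.2.1"))     # False
```

No host name, header, or request body is consulted, which is why the same rule can be pushed as far upstream as you have control.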

At that point, you know you can block at layer 3, so the next question is how far up the chain can you block it?

Most self-hosters will just have their machines on the open internet, so their personal firewall is all they’ve got to work with. It’s still better than letting the packets all the way through to your application, but you still have to suffer the cost of dropping each packet. Still, it’s good enough™ for most.

In your case though, you had set up the added benefit of Cloudflare standing between the internet and your server, so you could move that decision making step even further away from you, which is pretty great.

xthexder@l.sw0.com · 7 months ago

I learned this in high school when I discovered sending ping floods from a 1 Gbit VPS to a slow residential Internet connection can take down your Internet even if the router doesn’t respond to pings. The bandwidth still all needs to make it to the router in your house to be dropped.

Possibly linux · 7 months ago

Now that’s interesting. I know that i2p can crash some cheap routers because they run out of RAM. I wonder if you could do that from the outside.

DigitalDilemma@lemmy.ml · 7 months ago

Yep - agree with all of that. It’s a fault of mine that I don’t always step back and look at the bigger picture first.

ArcticDagger@feddit.dk · 7 months ago

Could it be this fella who’s hitting you up: https://claude.ai/login

Deebster@programming.dev · 7 months ago

I feel a company that big would write a more competent bot, but I also wouldn’t be too astonished.

mvirts@lemmy.world · 7 months ago

Bureaucracy is a potent evil

DigitalDilemma@lemmy.ml · 7 months ago

Maybe? It feels like the kind of stupid that you really need a human to half-ass it to achieve this thoroughly though.

DigitalDilemma@lemmy.ml · 7 months ago

It’s back today with a new user-agent, this time containing an email address at anthropic.com - so it looks like it’s Claude3, a scraper for an AI bot.

ArcticDagger@feddit.dk · 7 months ago

Interesting that they have such a greedy/stupid bot

Nibodhika@lemmy.world · 7 months ago

I know you couldn’t do that because you have data limits in the US, but my first instinct would have been to put an Ubuntu iso as the robots.txt (or better yet, point it to /dev/urandom) let that bot download GB of data to fuck with his connection/disk.

Probably shouldn’t do that though, and blocking it on Cloudflare is the correct approach.

toastal@lemmy.ml · 7 months ago

Letting Cloudflare centralize the internet isn’t always the solution. I’m sick of hCAPTCHAs just for living in a non-Western country.

a Kendrick fan@lemmy.ml · 7 months ago

same here, i can’t seem to do anything on cuckflared sites because i’m on a vpn…

just_another_person@lemmy.world · edit-2 7 months ago

Good tips for beginners who don’t stare at this stuff all day.

One extra tip for you: just script blocking these things after they act up, but before they cost real money. You know your expected traffic patterns, so setting thresholds should be easy.
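As a rough sketch of what such a threshold could look like, here is a sliding-window counter per IP. The class name, the limit of 3 hits, and the 10-second window are all made-up values for illustration; real numbers depend on your own traffic patterns.

```python
from collections import defaultdict, deque

class RateThreshold:
    """Flag an IP once it makes more than max_hits requests within window_seconds."""

    def __init__(self, max_hits: int, window_seconds: float):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if ip is over the threshold."""
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits

limiter = RateThreshold(max_hits=3, window_seconds=10)
flags = [limiter.record("203.0.113.42", t) for t in (0, 1, 2, 3)]
print(flags)  # [False, False, False, True]
```

Whatever gets flagged here would then be handed to the blocking step (firewall, proxy, or edge), which is not shown.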

Fail2ban is tried and true, and dead simple, or you could use something a bit fancier like crowdsec to set up chains that send IPs to block directly at the WAF. Getting some of the more popular blacklists at the edge would be a good idea as well.

DigitalDilemma@lemmy.ml · 7 months ago

Fail2ban is something I’ve used for years - in fact it was working on these very sites before I decided to dockerise them, but find it a lot less simple in this application for a couple of reasons:

The logs are in the docker containers. Yes, I could get them squirting to a central logging server but that’s a chunk of overhead for a home system. (I’ve done that before, so it is possible, just extra time)

And getting the real IP through from Cloudflare. Yes, CF passes headers with it in, and haproxy can forward that as well with a bit of tweaking. But not every docker container for serving webpages (notably the phpbb one) will correctly log the source IP even when passed through from Haproxy as the forwarded-ip, instead showing the IP of the proxy. I’ve other containers that do display it, and it can obviously be done, but I’m not clear yet why it’s inconsistent. Without that, there’s no blocking.
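A hedged sketch of the logic an application needs to get this right: trust forwarded headers only when the connection really comes from your own proxy. `CF-Connecting-IP` and `X-Forwarded-For` are the usual header names; the trusted range below is an assumption (a private range that happens to cover Docker's default bridge network) and would need adjusting for a real setup.

```python
import ipaddress

# Assumption: the reverse proxy (e.g. HAProxy) reaches the app from this private range.
TRUSTED_PROXIES = [ipaddress.ip_network("172.16.0.0/12")]

def client_ip(peer_ip: str, headers: dict[str, str]) -> str:
    """Pick the real client address, trusting forwarded headers only from our proxy."""
    peer = ipaddress.ip_address(peer_ip)
    if not any(peer in net for net in TRUSTED_PROXIES):
        # Direct connection: forwarded headers could be forged, so ignore them.
        return peer_ip
    lowered = {k.lower(): v for k, v in headers.items()}
    if "cf-connecting-ip" in lowered:
        return lowered["cf-connecting-ip"].strip()
    if "x-forwarded-for" in lowered:
        # Take the first entry of the list, conventionally the original client.
        return lowered["x-forwarded-for"].split(",")[0].strip()
    return peer_ip

print(client_ip("172.17.0.2", {"CF-Connecting-IP": "203.0.113.42"}))   # 203.0.113.42
print(client_ip("198.51.100.7", {"CF-Connecting-IP": "203.0.113.42"})) # 198.51.100.7
```

A container that logs the proxy's address is effectively skipping this step and recording `peer_ip` every time.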

And… You can use the Cloudflare API to block IPs, but there’s a fixed limit on the free accounts. When I set this up before with native webservers, using the API to block malicious URL-scanning bots, I reached that limit within a couple of days. I don’t think there’s automatic expiry, so I’d need to find or build a tool that manages the blocklist remotely. (Or use haproxy to block and accept the overhead)
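One way such a tool could keep the list under a fixed limit is sketched below: every entry gets an expiry time, and when the list is full the entry closest to expiring is evicted. The class, capacity, and TTL are hypothetical, and the actual calls to remove entries remotely are deliberately left out; the method only returns which IPs would need unblocking.

```python
class ExpiringBlocklist:
    """Local mirror of a size-limited remote blocklist, with per-entry expiry."""

    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.expires_at: dict[str, float] = {}  # ip -> time the block ends

    def purge(self, now: float) -> list[str]:
        """Remove and return every IP whose block has expired."""
        expired = [ip for ip, t in self.expires_at.items() if t <= now]
        for ip in expired:
            del self.expires_at[ip]
        return expired

    def block(self, ip: str, now: float) -> list[str]:
        """Block ip; return the IPs that were removed to stay within capacity."""
        removed = self.purge(now)
        self.expires_at.pop(ip, None)  # re-blocking refreshes the expiry
        while len(self.expires_at) >= self.capacity:
            # Evict the entry that would have expired soonest.
            oldest = min(self.expires_at, key=self.expires_at.get)
            del self.expires_at[oldest]
            removed.append(oldest)
        self.expires_at[ip] = now + self.ttl
        return removed

bl = ExpiringBlocklist(capacity=2, ttl_seconds=100)
bl.block("203.0.113.1", now=0)
bl.block("203.0.113.2", now=10)
print(bl.block("203.0.113.3", now=20))  # ['203.0.113.1']
```

The returned list is what a wrapper would pass to whatever remote delete call the provider offers.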

It’s probably where I should go next.

And yes - you’re right about scripting. Automation is absolutely how I like to do things. But so many problems only become clear retrospectively.

DigitalDilemma@lemmy.ml · 7 months ago

Doh - another example of my muddled thinking.

Fail2ban will work directly on haproxy’s log, no need to read the web logs from containers at all. Much simpler and better.
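For a feel of what a filter on that log has to do, here is a small sketch that counts 404 responses per client address. The sample lines are invented and only modelled on HAProxy's HTTP log format, the regex handles IPv4 only, and it assumes the first address on the line is the one you want to ban (behind Cloudflare that may be a Cloudflare address unless the log format captures the forwarded header). Check it against your own log lines before relying on it.

```python
import re
from collections import Counter

# Assumed layout: "haproxy[pid]: ip:port [date] frontend backend/server timers status ..."
LINE_RE = re.compile(
    r"haproxy\[\d+\]: (?P<ip>[\d.]+):\d+ \[[^\]]+\] \S+ \S+ \S+ (?P<status>\d{3}) "
)

def count_404s(lines):
    """Count 404 responses per client IP across the given log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("status") == "404":
            counts[m.group("ip")] += 1
    return counts

# Invented sample lines using documentation addresses.
SAMPLE = [
    'Feb  6 12:14:14 localhost haproxy[14389]: 203.0.113.42:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 404 2750 - - ---- 1/1/1/1/0 0/0 "GET /wp-login.php HTTP/1.1"',
    'Feb  6 12:14:15 localhost haproxy[14389]: 203.0.113.42:33318 [06/Feb/2009:12:14:15.101] http-in static/srv1 10/0/30/69/109 404 2750 - - ---- 1/1/1/1/0 0/0 "GET /admin.php HTTP/1.1"',
    'Feb  6 12:14:16 localhost haproxy[14389]: 198.51.100.7:40000 [06/Feb/2009:12:14:16.200] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"',
]

print(dict(count_404s(SAMPLE)))  # {'203.0.113.42': 2}
```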

Para_lyzed@lemmy.world · 7 months ago

Yeah, I’ve always used fail2ban on the main server with my dockerized services. Works great, and requires very little work.

lettruthout@lemmy.world · 7 months ago

Thanks for writing this up, it’s very interesting!

StarDreamer@lemmy.blahaj.zone · edit-2 7 months ago

I’ve recently moved from fail2ban to crowdsec. It’s nice and modular and seems to fit your use case: set up an HTTP 404/rate-limit filter and a cloudflare bouncer to ban the IP address at the cloudflare level (instead of iptables). Though I’m not sure if the cloudflare tunnel would complicate things.

Another good thing about it is it has a crowd sourced IP reputation list. Too many blocks from other users = preemptive ban.

DigitalDilemma@lemmy.ml · 7 months ago

Thanks, I’ve not heard of that, it sounds like it’s worth a look.

I don’t think the tunnel would complicate blocking via the cloudflare api, but there is a limit on the number of IPs you can ban that way, so some expiry rules are necessary.

StarDreamer@lemmy.blahaj.zone · edit-2 7 months ago

Pretty sure expiry is handled by the local crowdsec daemon, so it should automatically revoke rules once a set time is reached.

At least that’s the case with the iptables and nginx bouncers (4 hour ban for probing). I would assume that it’s the same for the cloudflare one.

Alternatively, maybe look into running two bouncers (1 local, 1 CF)? The CF one filters out most bot traffic, and if some still get through then you block them locally?

DigitalDilemma@lemmy.ml · 7 months ago

I’ve just installed crowdsec and its haproxy plugin. Documentation is pretty good. I need to look into getting it to ban the ip at cloudflare - that would be neat.

Annoyingly, the claudebot spammer is back again today with a new UA. I’ve emailed the address within it politely asking them to desist - be interesting to see if there’s a reply. And yes, it is Claudebot 3 - AI.

UA:like Gecko; compatible; ClaudeBot/1.0; [email protected])

StarDreamer@lemmy.blahaj.zone · 7 months ago

iirc the bad UA filter is bundled with either base-http-scenarios or nginx. That might help assuming they aren’t trying to mask that UA.
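For illustration only, the core of a bad-UA filter is a case-insensitive substring match. This is not how crowdsec implements it; the list and the example user-agent strings below are made up.

```python
# Hypothetical list of unwanted user-agent fragments, lowercase.
BAD_UA_SUBSTRINGS = ("claudebot",)

def is_bad_user_agent(user_agent: str) -> bool:
    """Return True if the user-agent contains any listed fragment (case-insensitive)."""
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BAD_UA_SUBSTRINGS)

print(is_bad_user_agent("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))  # True
print(is_bad_user_agent("Mozilla/5.0 (X11; Linux x86_64)"))          # False
```

Since the user-agent is whatever the client chooses to send, this only catches bots that identify themselves.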

Deebster@programming.dev · 7 months ago

“Thinking there must be another way, I switched to Haproxy.”

Hang on, weren’t you on Haproxy already? Or do you mean you switched your attention to Haproxy? (If not, what were you on before?)

As others have said, blocking incoming stuff as high up as possible is definitely the right way, and Cloudflare is the right place for you. It’s interesting that this bot wasn’t caught by Cloudflare, I wonder who runs it.

DigitalDilemma@lemmy.ml · 7 months ago

I mean - I switched my attention to Haproxy. And yes, no argument there.

atzanteol@sh.itjust.works · 7 months ago

I love a good post mortem. Thanks for sharing!

boredsquirrel@slrpnk.net · 7 months ago

Wow ClaudeAI, good job!

Atemu@lemmy.ml · 7 months ago

While I wouldn’t put it past tech bros to use such unethical measures for their latest grift, it’s not a given that it’s actually claudebot. Anyone can claim to be claudebot, googlebot, boredsquirrelbot or anything else. In fact it could very well be a competitor aiming to harm Claude’s reputation.

boredsquirrel@slrpnk.net · 7 months ago

OpenAI wanting to shit on competitors?