Bots and scrapers

Comments & questions about this site.
User avatar
dima
Posts: 1885
Joined: Wed Feb 12, 2014 1:35 am
Location: Los Angeles

Post by dima »

We're having more overloading issues with people's poorly-behaved scripts, and a dumb geoblocker (like I had before) is no longer enough. I just installed fail2ban to kill the worst offenders, and it seems to be doing the job right now. It's possible I tuned it too aggressively: if you see any issues (browser says "site cannot be reached", or something along those lines), please tell me.
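For anyone unfamiliar with fail2ban on a web server, a jail for this sort of thing looks roughly like the sketch below. Everything here is illustrative, not the live config on this server: the jail name, filter name, log path, and thresholds are all assumptions for the example.

```ini
# /etc/fail2ban/jail.d/scrapers.local -- hypothetical example, not the actual settings
[apache-scrapers]
enabled  = true
port     = http,https
filter   = apache-scrapers            ; expects a matching filter.d/apache-scrapers.conf
logpath  = /var/log/apache2/access.log
maxretry = 20        ; log matches within findtime before an IP gets banned
findtime = 20        ; sliding window, in seconds
bantime  = 3600      ; ban length, in seconds
```

The filter file supplies the regex that decides which log lines count as "matches"; the jail file only sets where to look and how trigger-happy to be.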
User avatar
tekewin
Posts: 1399
Joined: Thu Apr 11, 2013 5:07 pm

Post by tekewin »

Unfortunately, we're in the Age of Ultron Agents.

I am one of the offenders (not on this site). I was sending agents out to scour the world for information and getting blocked with 429 errors everywhere. I've stopped doing that, with limited exceptions and with my own throttles in place. There will soon be far more agents on the Internet than people. That may already be the case.
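A client-side throttle of the kind mentioned above can be as simple as a token bucket that caps the sustained request rate. A minimal sketch (the class name, rate, and injectable clock are my own choices for illustration, not anyone's actual code):

```python
import time


class Throttle:
    """Token bucket: allow at most `rate` requests/second, with small bursts."""

    def __init__(self, rate, burst=5, clock=time.monotonic):
        self.rate = rate          # sustained requests per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst       # start with a full bucket
        self.clock = clock        # injectable clock, handy for testing
        self.last = clock()

    def allow(self):
        """Return True if a request may be sent now, consuming one token."""
        now = self.clock()
        # refill tokens for the elapsed time, capped at the burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

An agent would call `allow()` before every fetch and sleep briefly when it returns False; that alone keeps it clear of most 429s and fail2ban-style rate triggers.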

Peakbagger.com and Bob Burd's site are now gated with Cloudflare. We might need to do something similar if it is not cost prohibitive. They have a free plan with DDoS protection, and a DDoS is effectively what these agents are unintentionally mounting.
User avatar
dima
Posts: 1885
Joined: Wed Feb 12, 2014 1:35 am
Location: Los Angeles

Post by dima »

Yeah, Cloudflare or something like it would solve it, but I REALLY don't want to go there yet. We're a location-specific, niche, old-school forum about the mountains. We shouldn't NEED such big hammers to be able to operate. I'm wondering if the recent influx was related to the thread about Monica receiving a lot of outside attention, which brought with it lots of additional traffic (both human and robot). In any case, the storm seems to have died down for now (maybe because I blocked everybody and they went home, or maybe not :) ). The current blocking settings are probably close enough now. Look at

/etc/fail2ban/jail.d/defaults-debian.conf
and

/etc/fail2ban/filter.d/apache-eispiraten.conf
to see the current settings. To see who's banned right now:

fail2ban-client status apache-eispiraten-hammer
and the same for the -misc jail.
User avatar
tekewin
Posts: 1399
Joined: Thu Apr 11, 2013 5:07 pm

Post by tekewin »

Wow. TIL that fail2ban can secure more than SSH.

To try to understand the config, I fed the .conf files into a friendly AI, which gave me this unsolicited comment. Do with it what you will; your current config seems to be working.
A maxretry of 20 combined with a findtime of 20 is quite "loose." This configuration allows a bot to make 1 request per second indefinitely without ever getting banned.

Tip: Usually, for aggressive scrapers, you want a longer findtime (like 600 for 10 minutes) or a much lower maxretry (like 5) to catch bots that pace their requests to stay under the radar.
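The arithmetic behind that comment is straightforward: a bot evades the ban as long as its sustained rate stays below maxretry/findtime matches per second. A quick sketch using the numbers quoted above (the function name is mine, for illustration):

```python
def max_evasion_rate(maxretry, findtime):
    """Highest sustained request rate (requests/second) that stays under a
    fail2ban threshold of `maxretry` matches per `findtime`-second window."""
    return maxretry / findtime


# current "loose" settings: 20 matches in 20 s -> 1 request/second slips through
print(max_evasion_rate(20, 20))     # 1.0

# the suggested tightening: 5 matches in 600 s -> roughly 1 request every 2 minutes
print(max_evasion_rate(5, 600))     # ~0.0083
```

In other words, lengthening findtime or lowering maxretry both shrink the rate a patient scraper can sustain without tripping the ban.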
User avatar
dima
Posts: 1885
Joined: Wed Feb 12, 2014 1:35 am
Location: Los Angeles

Post by dima »

Oh man. It's totally right. Previously I had problems with it being too aggressive and banning confirmed humans, so I detuned it, but I also made the filter regex more specific. With the tighter regex I could probably raise the aggressiveness again, but I haven't bothered yet. Feel free to play with it. For what it's worth, the onslaught seems to have subsided for now, so maybe we can leave it alone.
User avatar
Nate U
Posts: 642
Joined: Wed Apr 05, 2023 7:38 pm

Post by Nate U »

off-trail Los Angeles Mtn explorers and true crime enthusiasts are 2 WILDLY different-sized demographics... this site is not designed to handle the latter.