fuss:security [2018/08/31 16:17] – [Blocking SemrushBot] office
 +====== Blocking SemrushBot ======
 +
 +SemrushBot is an annoying web crawler that has proven to completely disregard robots policies and to hammer web servers hard: it recursively follows all the links on a website without delay and outright ignores repeated ''403 Forbidden'' error messages.
 +
 +{{:wizardale.png?nolink | Oh no, not this shit again! }} Folklore claims that SemrushBot helps your site generate revenue from ads, but the question is whether that revenue outweighs the money spent accommodating SemrushBot's rampant behaviour and the heavily increased server load it causes.
 +===== IP Layer =====
 +
 +On the IP layer:
 +<code bash>
 +# The string match requires an algorithm (--algo); bm selects Boyer-Moore
 +iptables -t mangle -A INPUT -p tcp --dport 80 -m string --algo bm --string 'SemrushBot' -j DROP
 +</code>
 +
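 +This is a blunt way to get rid of this pest without the request ever reaching the application layer. Note that the rule only inspects plain-text traffic on port 80; the user agent string is not visible inside encrypted HTTPS packets.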
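 +
 +Appending the same rule twice leaves a duplicate entry in the chain, so a small wrapper that checks first can be handy. The following is a minimal sketch, assuming an ''iptables'' version that supports the ''-C'' (check) option and root privileges; the function name ''ensure_semrush_drop'' is made up for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: append the DROP rule only when it is missing.
ensure_semrush_drop() {
    # Rule specification shared by the check (-C) and the append (-A).
    # The string match needs an algorithm; bm is Boyer-Moore.
    set -- -p tcp --dport 80 -m string --algo bm --string 'SemrushBot' -j DROP
    if iptables -t mangle -C INPUT "$@" 2>/dev/null; then
        echo "rule already present"
    else
        iptables -t mangle -A INPUT "$@" && echo "rule added"
    fi
}
```

 +Calling ''ensure_semrush_drop'' repeatedly, for example from a firewall start-up script, should then leave a single copy of the rule in place.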
 +
 +===== Apache2 =====
 +
 +If you are okay with your frontend being hammered by this total garbage, then the ''SemrushBot'' user agent can be blocked in Apache2.
 +
 +Enable the ''rewrite'' module:
 +<code bash>
 +a2enmod rewrite
 +</code>
 +and include the following in the virtual hosts:
 +<code apache2>
 +       <IfModule mod_rewrite.c>
 +                RewriteEngine on
 +                # Match the user agents to block (case-insensitive)
 +                RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC,OR]
 +                RewriteCond %{HTTP_USER_AGENT} sosospider [NC,OR]
 +                RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC]
 +                # Allow access to robots.txt and to the 403 error page,
 +                # otherwise serving the error page would loop
 +                RewriteCond %{REQUEST_URI} !^/robots\.txt$
 +                RewriteCond %{REQUEST_URI} !^/403\.shtml$
 +                RewriteRule ^.* - [F,L]
 +       </IfModule>
 +</code>
 +
 +This is a bad solution because ''Forbidden'' is meaningless to the greatness that is ''SemrushBot'': it keeps requesting pages regardless, and Apache2 still has to answer every one of them.
 +
 +===== Varnish =====
 +
 +Blocking with Varnish may be a good compromise between having your Apache2 hammered and blocking the string ''SemrushBot'' on the IP layer:
 +<code varnish>
 +sub vcl_recv {
 +    # Block user agents.
 +    if (req.http.User-Agent ~ "SemrushBot") {
 +        return (synth(403, "Forbidden"));
 +    }
 +    
 +    # ...
 +
 +}
 +</code>
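 +
 +Whether the block is active can be checked from any machine by sending a request with a spoofed user agent. A minimal sketch, assuming ''curl'' is installed; the function name ''ua_status'' and the host in the usage line are placeholders:

```bash
#!/usr/bin/env bash
# Print the HTTP status code returned for a given URL and user agent.
ua_status() {
    local url="$1" agent="$2"
    # -s: silent, -o /dev/null: discard the body,
    # -w '%{http_code}': print only the status code, -A: set User-Agent.
    curl -s -o /dev/null -w '%{http_code}' -A "$agent" "$url"
}

# Usage (placeholder host):
#   ua_status http://www.example.com/ 'SemrushBot'
```

 +A ''403'' for ''SemrushBot'' together with a normal response for an ordinary browser string suggests that the rule matches as intended.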
 +
 +An even better method is to use fail2ban to block ''SemrushBot'' by reading the Varnish logs on the frontend or the Apache2 log files on the backend, which prevents either of them from getting hammered with requests.
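 +
 +Before handing the job to fail2ban, it can be useful to see which client addresses are responsible for the requests. A minimal sketch, assuming an NCSA-style access log with the client address in the first field; the function name ''semrush_ips'' is hypothetical:

```bash
#!/usr/bin/env bash
# Read an NCSA-style access log on stdin and print, per client address,
# the number of requests whose log line mentions SemrushBot, busiest first.
semrush_ips() {
    awk '/SemrushBot/ { hits[$1]++ }
         END { for (ip in hits) print hits[ip], ip }' | sort -rn
}

# Usage:
#   semrush_ips < /var/log/varnish/varnishncsa.log
```

 +Each output line holds a request count followed by the address, which gives a rough idea of what a ban would cut off.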
 +
 +===== Varnish and Fail2Ban =====
 +
 +For Varnish, copy ''/etc/fail2ban/filter.d/apache-badbots.conf'' to ''/etc/fail2ban/filter.d/varnish-badbots.conf'', thereby duplicating the Apache2 configuration (this works because the Varnish log is written in the same NCSA log format). Then edit ''/etc/fail2ban/filter.d/varnish-badbots.conf'' and add ''SemrushBot'' to the list of custom bad bots:
 +<code>
 +badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|SemrushBot
 +</code>
 +
 +then correct the ''failregex'' line to:
 +<code>
 +failregex = ^<HOST> -.*(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s).*?$
 +</code>
 +
 +
 +and finally add the following to the jail configuration:
 +<code>
 +[varnish-badbots]
 +enabled  = true
 +port     = http,https
 +filter   = varnish-badbots
 +logpath  = /var/log/varnish/varnishncsa.log
 +maxretry = 1
 +</code>
 +
 +and restart ''fail2ban''.
 +
 +To check that the bots are being banned, tail ''/var/log/syslog'' and look for:
 +<code>
 +fail2ban.jail[18168]: INFO Jail 'varnish-badbots' started
 +</code>
 +indicating that the ''varnish-badbots'' jail has started.
 +
 +Hopefully followed by lines similar to:
 +<code>
 +NOTICE [varnish-badbots] Ban 46.229.168.68
 +</code>
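 +
 +The ban lines can also be condensed into a list of addresses. A minimal sketch, assuming the log lines have the form shown above; the function name ''banned_ips'' is hypothetical:

```bash
#!/usr/bin/env bash
# Read log lines on stdin and print each address banned by the
# varnish-badbots jail once, sorted.
banned_ips() {
    sed -n 's/.*\[varnish-badbots] Ban \([0-9a-fA-F.:]*\).*/\1/p' | sort -u
}

# Usage:
#   banned_ips < /var/log/syslog
```

 +On a live system, ''fail2ban-client status varnish-badbots'' reports the currently banned addresses for the jail directly.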
  

fuss/security.txt · Last modified: 2022/09/27 14:15 by office
