SemrushBot is an annoying web crawler that has proven to completely disregard robots.txt policies and to hammer web servers by recursively following every link on a site without any crawl delay.
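For a crawler that actually honoured robots policies, the polite fix would be a robots.txt rule, shown here for completeness even though this bot is reported to ignore it:

```
User-agent: SemrushBot
Disallow: /
```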
On the IP layer:
iptables -t mangle -A INPUT -p tcp --dport 80 -m string --algo bm --string 'SemrushBot' -j DROP
This is a great solution: the pest is dropped before it ever reaches the application layer!
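One caveat: the string match scans the raw packet payload, so it fires on any occurrence of the byte sequence, not just in the User-Agent header. A request body that merely mentions the bot would be dropped too. A minimal sketch of the matching semantics (the payloads below are illustrative):

```python
# Sketch of what '-m string --algo bm --string SemrushBot' matches:
# a plain substring search over the whole packet payload.
NEEDLE = b"SemrushBot"

def packet_matches(payload: bytes) -> bool:
    # True means the iptables rule would DROP this packet.
    return NEEDLE in payload

bot_request = b"GET / HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (compatible; SemrushBot/7~bl)\r\n\r\n"
innocent_post = b"POST /wiki HTTP/1.1\r\n\r\ntext=how+to+block+SemrushBot"
normal_request = b"GET / HTTP/1.1\r\nUser-Agent: Mozilla/5.0\r\n\r\n"

print(packet_matches(bot_request))    # the bot's own request: dropped
print(packet_matches(innocent_post))  # collateral damage: dropped too
print(packet_matches(normal_request)) # passes
```

So a wiki page about blocking SemrushBot could itself trip the rule when submitted in a POST body; keep that in mind before deploying.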
If you are okay with your frontend being hammered by this total garbage, then the SemrushBot user agent can instead be blocked in Apache2.
Enable the rewrite module:
a2enmod rewrite
and include in virtual hosts:
<IfModule mod_rewrite.c>
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC]
    # Allow access to robots.txt and the 403 error document,
    # or else the rule will loop
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteCond %{REQUEST_URI} !^/403\.shtml$
    RewriteRule ^.* - [F,L]
</IfModule>
This is a poor solution, however: 403 Forbidden is apparently meaningless to the greatness that is SemrushBot, which keeps hammering away regardless.
Perhaps blocking with Varnish is a good compromise between having your Apache2 hammered and dropping the string SemrushBot on the IP layer:
sub vcl_recv {
    # Block user agents.
    if (req.http.User-Agent ~ "SemrushBot") {
        return (synth(403, "Forbidden"));
    }
    # ...
}
</sub>
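The ~ operator in VCL is a regular-expression match against the header value, so any User-Agent containing the substring is caught. The check can be sketched in Python (the User-Agent strings are illustrative):

```python
import re

# Sketch of the VCL condition: req.http.User-Agent ~ "SemrushBot"
def vcl_blocks(user_agent: str) -> bool:
    return re.search(r"SemrushBot", user_agent) is not None

print(vcl_blocks("Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"))  # blocked
print(vcl_blocks("Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"))       # passes
```

Unlike the iptables string match, this only inspects the User-Agent header, so ordinary requests that merely mention the bot in their body are unaffected.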