====== Blocking SemrushBot ======

SemrushBot is an annoying web crawler that has proven to completely disregard robots policies and to hammer webservers hard by recursively following all the links on a website without delay.
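
For reference, this is the ''robots.txt'' policy that a well-behaved crawler identifying itself as ''SemrushBot'' would be expected to honor (and which this bot reportedly ignores):
<code>
User-agent: SemrushBot
Disallow: /
</code>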
+ | |||
+ | On the IP layer: | ||
+ | <code bash> | ||
+ | iptables -t mangle -A INPUT -p tcp --dport 80 -m string --string ' | ||
+ | </ | ||
+ | |||
+ | Which is a great solution to get rid of this pest without even hitting the application layer! | ||
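
Whether the rule is actually firing can be checked via the packet counters of the mangle table:
<code bash>
# Show the mangle-table INPUT rules with packet/byte counters;
# a growing pkts count on the string-match rule means it is dropping traffic.
iptables -t mangle -L INPUT -n -v
</code>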

If you are okay with your frontend being hammered by this total garbage, then the ''rewrite'' module of ''apache2'' can be used to reject these bots at the application layer instead.

Enable the ''rewrite'' module and restart Apache:
<code bash>
a2enmod rewrite
systemctl restart apache2
</code>
and include in the virtual hosts:
<code apache2>
<IfModule mod_rewrite.c>
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} sosospider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC]
    # Allow access to robots.txt and to the forbidden message
    # (at least the 403 error document, or else it will loop)
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    # adjust to the path of your own 403 error document
    RewriteCond %{REQUEST_URI} !^/forbidden\.html$
    RewriteRule ^.* - [F,L]
</IfModule>
</code>
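
Assuming one of the user agents matched above, the rewrite rules can be verified from the outside by spoofing the ''User-Agent'' header with ''curl'' (''www.example.com'' is a placeholder for your own host):
<code bash>
# should print an HTTP 403 status line for a blocked user agent
curl -sI -A 'BaiduSpider' http://www.example.com/ | head -n 1
# robots.txt should still answer 200, since it is excluded from the rule
curl -sI -A 'BaiduSpider' http://www.example.com/robots.txt | head -n 1
</code>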

which is a bad solution because ''apache2'' still has to accept and process every single request before rejecting it with a ''403''.

Perhaps blocking with Varnish may be a good compromise between having your Apache2 hammered and blocking the string ''SemrushBot'' on the IP layer with iptables:
<code varnish>
sub vcl_recv {
    # Block user agents.
    if (req.http.User-Agent ~ "SemrushBot") {
        return (synth(403, "Forbidden"));
    }

    # ...
}
</code>
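
Before restarting Varnish, the edited VCL can be sanity-checked by asking ''varnishd'' to compile it (the path shown is the usual Debian default and may differ on your system):
<code bash>
# compile the VCL to C and exit; any syntax error aborts with a diagnostic
varnishd -C -f /etc/varnish/default.vcl > /dev/null
</code>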