One of the problems with blocking IP addresses with Apache is that conventionally the IP addresses are listed as an enumeration, such that the Apache process ends up going through numerous entries until it reaches a matching entry. This implies a time complexity of O(1) at best, when the entry being looked for is the first in the list, and a time complexity of O(n), with n being the size of the list, in case the entry to be found is the last on the list. This makes dealing with large lists very difficult and computationally expensive due to the stateless nature of the HTTP protocol, which results in the need to check the list upon every single request. Large blacklists such as the Apache ultimate bad bot blocker are ambitious lists, but they suffer from the same time complexity problem as any other list in Apache that ends up being scanned linearly.
Ultimately, a better solution is to use IP sets via ipset, which provide network hashes by default, with matches performed by the kernel itself, along with iptables in order to conditionally ban or allow connections from various IP addresses. However, there are times when ipset and iptables cannot be used, such as when a website is heavily reverse-proxied, for example via Cloudflare, where the real IP address is only revealed by decapsulating the protocol (in this case HTTP). In such cases, Apache must apply the filters itself and cannot rely on the incoming connecting IP address being the real connecting IP address.
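For completeness, when the real client address is visible at the packet level, the kernel-side approach mentioned above would look roughly as follows; the set name blacklist and the example network are placeholders:
# create a kernel-side set that hashes whole networks
ipset create blacklist hash:net
# add an offending network (placeholder) to the set
ipset add blacklist 192.0.2.0/24
# drop any incoming packet whose source address matches the set
iptables -I INPUT -m set --match-set blacklist src -j DROP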
The fastest way would be to hash all the IP addresses (or domains) into a hashmap such that an O(1) lookup will determine instantaneously whether the IP address is on the list. This alleviates having to run through the entire list upon every request and expend a terrible amount of computational time.
Some leviathan blogpost from 2017 details the efforts to get a blocklist working using an external program. However, the blog post does not seem to solve the issue. The author describes using an external Perl script that seems to have the same linear time complexity in looking up the IP address.
The GNU dbm is a good candidate due to being designed with hashmaps in mind, such that lookups on keys have an O(1) time complexity. This is not mentioned explicitly in the gdbm documentation, yet looking at the source code reveals that gdbm_fetch calls _gdbm_findkey, which computes a hash and then locates the key using buckets.
The gdbm command-line tools can be installed on Debian via:
aptitude install gdbmtool
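As a quick sanity check that the hashed lookups work as expected, a key can be stored and fetched by hand; the database path /tmp/test.dbm is merely a scratch file for illustration:
# create a new database and store an IP address as both key and value
gdbmtool --newdb /tmp/test.dbm store 192.0.2.1 192.0.2.1
# fetch the key back; the stored value is printed if the key exists
gdbmtool --read-only /tmp/test.dbm fetch 192.0.2.1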
As an example, a script can be written to generate a database containing the Amazon IP addresses that are publicly listed and then apply some policy to all connections from the Amazon IP range. One could, for instance, block rented machines on Amazon that are used by spammers such as Semrush to access content whilst seeming legitimate.
The script can be scheduled with cron in order to run periodically and update the database of IP addresses every time it runs:
#!/usr/bin/env sh
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This script is meant to run from cron and it will generate a database  #
# of IP addresses belonging to the Amazon IP address range that can then #
# be used with Apache rules in order to block Amazon machines.           #
###########################################################################

EXTRACT_NETWORKS=`curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | .ip_prefix' | sed -r 's/^"(.+?)"$/\1/g'`

truncate -s 0 /etc/apache2/aws.dbm
gdbmtool --quiet --newdb /etc/apache2/aws.dbm reorganize >/dev/null 2>&1

for NETWORK in $EXTRACT_NETWORKS; do
    for ADDRESS in $(prips $NETWORK); do
        gdbmtool /etc/apache2/aws.dbm store "$ADDRESS" "$ADDRESS"
    done
done
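Assuming the script above is saved as, say, /usr/local/bin/aws-generate-dbm (the path and the schedule are arbitrary), a cron entry along the following lines will regenerate the database once a week:
# /etc/cron.d/aws-generate-dbm - rebuild the Amazon IP database every Sunday at 04:00
0 4 * * 0 root /usr/local/bin/aws-generate-dbm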
The script additionally uses prips, a software tool that expands and lists all IP addresses from the IP and netmask provided by Amazon, because the resulting database should contain all the IPs spelled out in order to achieve O(1) lookups.
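For instance, expanding a small documentation network shows the form in which addresses end up as keys in the database; the network below is just an example:
# expand a /30 network into its four individual addresses
prips 192.0.2.0/30
which prints 192.0.2.0 through 192.0.2.3, one address per line.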
The database file is placed at /etc/apache2/aws.dbm and is then read in by Apache. For example, a blacklist can be constructed with Apache that will load the database and then drop the connection from any Amazon IP address previously generated, whilst spending O(1) time in order to accomplish the task.
The following configuration follows the idea of using RewriteMap to match IP addresses from a database and relies on a script placed at /usr/local/bin/aws2apache:
# based on https://www.ispcolohost.com/2017/02/03/keeping-amazon-ec2-crap-off-your-website/
<IfModule mod_rewrite.c>
    RewriteMap aws prg:/usr/local/bin/aws2apache
    RewriteCond ${aws:%{REMOTE_ADDR}} ^1$
    RewriteRule .* - [F,L]
</IfModule>
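Note that RewriteMap can only be declared in the server or virtual host configuration, not in .htaccess files, and the helper script must be executable by Apache; on Debian, something along these lines should suffice:
# enable mod_rewrite, mark the lookup helper executable and restart Apache
a2enmod rewrite
chmod +x /usr/local/bin/aws2apache
systemctl restart apache2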
However, unlike the Perl script from the aforementioned blog post, the script placed at /usr/local/bin/aws2apache uses gdbmtool in order to perform the lookup of the IP address, thereby achieving the O(1) time complexity that was sought after. Since Apache starts prg: rewrite maps once and then feeds them one lookup key per line on standard input, expecting one answer per line on standard output, the script reads lookup keys in a loop from standard input:
#!/usr/bin/env sh
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This is a short script that will look up entries in a GDBM database    #
# and print 1 in case the entry is found or 0 otherwise. Apache starts   #
# RewriteMap prg: programs once and then writes one lookup key per line  #
# to their standard input, expecting one result line on standard output, #
# which is why the script loops over standard input instead of taking    #
# the address as an argument. The script is meant to work with the       #
# apache configuration listed at:                                        #
# * https://grimore.org/fuss/apache                                      #
#   #black_or_whitelist_ip_addresses_without_slowdown                    #
# in order to batch block a large number of IP addresses without         #
# slowdown.                                                              #
###########################################################################

while read -r ADDRESS; do
    if [ -z "$ADDRESS" ]; then
        echo 0
        continue
    fi

    FETCH=`gdbmtool --read-only /etc/apache2/aws.dbm fetch "$ADDRESS" 2>/dev/null`

    # the address is a key in the database, signal a match
    if [ "$FETCH" = "$ADDRESS" ]; then
        echo 1
        continue
    fi

    # explicit
    echo 0
done
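The behaviour can be verified from the command line by feeding the script a lookup key the same way Apache would; the address below is a placeholder and 1 is only printed if the address has actually been stored in the database:
printf '203.0.113.7\n' | /usr/local/bin/aws2apache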
The generator script in the previous section needs to list the entire IP range for every network, such that most of the time taken is not algorithmic but is lost on starting and stopping gdbmtool. However, the purpose of the script is to generate the IP addresses seldom, perhaps once every week, yet use the list of IP addresses frequently, with IP addresses having to be checked against the list upon every request, such that the waiting time needed to generate the list of IPs is negligible relative to its usage.
Here is a script that expands on the very same idea by adding multiple sources:
#!/usr/bin/env bash
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This script is meant to run from cron and it will generate a database  #
# of IP addresses based on various generators defined within the array   #
# meant for network address generation.                                  #
###########################################################################

DATABASE=/etc/apache2/ip-block.dbm

NETWORK_GENERATORS=(
    # Amazon AWS
    "curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | .ip_prefix' | sed -r 's/^\"(.+?)\"$/\1/g'"
    # Peerblock Level 1
    "curl -s -L 'http://list.iblocklist.com/?list=bt_level1&fileformat=p2p&archiveformat=gz' -o - | gunzip | cut -d: -f2 | grep -E '^[-0-9.]+$' | xargs ipcalc -rnb | grep '/'"
)

# Acquire a lock.
LOCK_FILE='/var/lock/apacheipblockdatabase'
if mkdir $LOCK_FILE >/dev/null 2>&1; then
    trap '{ rm -rf $LOCK_FILE; }' KILL QUIT TERM EXIT INT HUP
else
    exit 0
fi

if [ ! -f "$DATABASE" ]; then
    gdbmtool --quiet --newdb "$DATABASE" reorganize >/dev/null 2>&1
fi

printf "Collecting networks and adding to database.\n"

IFS=$'\n'
COUNT=1
for COMMAND in "${NETWORK_GENERATORS[@]}"; do
    printf "Using generator: $COUNT/${#NETWORK_GENERATORS[@]}\n"
    COUNT=$((COUNT+1))
    eval $COMMAND | while read NETWORK; do
        printf "Processing network $NETWORK\n"
        prips $NETWORK | while read ADDRESS; do
            # skip networks whose first address is already in the database
            RESULT=$(gdbmtool --read-only "$DATABASE" fetch "$ADDRESS" 2>/dev/null)
            if [ "$RESULT" = "$ADDRESS" ]; then
                break
            fi
            gdbmtool "$DATABASE" store "$ADDRESS" "$ADDRESS" 2>/dev/null >/dev/null
            if [ "$?" -eq 1 ]; then
                continue
            fi
            printf "\rAdded $ADDRESS "
        done
        printf "\rDone. \n"
    done
done

printf "All networks collected.\n"
where NETWORK_GENERATORS is an array that contains commands producing IP addresses and netmasks that will be processed within a loop, expanded to a full list of IP addresses via prips and then added to a database via gdbmtool.
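Any command that prints networks in CIDR notation, one per line, can be appended to the array; as a hypothetical example, a locally maintained file of networks could be added as an extra source:
NETWORK_GENERATORS=(
    # ... existing generators ...
    # hypothetical, manually curated list of CIDR networks, one per line
    "cat /etc/apache2/manual-networks.txt"
)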
The script is meant to run for a long time, depending on the size of the lists produced by the generators array, given that linear time is needed to get the lists inserted into the database. However, in most cases, the script is meant to run rarely, such that adding the script to a weekly cron job should be sufficient, given generators that do not change much.
The lookup script and configuration from the Apache section within this document still apply and can then be pointed at the database created by this script in order to block, or apply whatever other policy is required to, connecting IP addresses that can be found within the generated database.