About

One of the problems with blocking IP addresses in Apache is that, conventionally, the IP addresses are listed as an enumeration, such that the Apache process ends up walking through the entries until it reaches a match. This implies a time complexity of $O(1)$ at best, when the entry being looked up happens to be the first in the list, and a time complexity of $O(n)$, with $n$ being the size of the list, when the entry is the last on the list. This makes dealing with large lists very difficult and computationally expensive, because the stateless nature of the HTTP protocol means the list has to be checked upon every single request. Large blacklists such as the apache ultimate bad bot blocker are ambitious, but they suffer from the same time complexity problem as any other list in Apache that ends up being scanned linearly.

Ultimately, a better solution is to use IP sets via ipset, which provide network hashes by default such that matches are performed by the kernel itself, combined with iptables in order to conditionally ban or allow connections from various IP addresses. However, there are times when ipset and iptables cannot be used, such as when a website is heavily reverse-proxied, for example via Cloudflare, and the real IP address is only revealed by decapsulating the protocol (in this case HTTP). In such cases Apache must apply the filters itself and cannot rely on the incoming connection's IP address being the real IP address of the client.
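
As an illustration of the kernel-side approach, a minimal sketch of an ipset-backed block list might look as follows; the set name blacklist and the example network 192.0.2.0/24 are arbitrary placeholders for this example:

# create a kernel-side set of networks; membership tests are hash lookups
ipset create blacklist hash:net
# add an offending network to the set
ipset add blacklist 192.0.2.0/24
# drop packets whose source address matches an entry in the set
iptables -I INPUT -m set --match-set blacklist src -j DROP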

Hashmaps

The fastest way would be to hash all the IP addresses (or domains) into a hashmap such that an $O(1)$ lookup can instantaneously determine whether the IP address is on the list. This alleviates having to run through the entire list upon every request and expending a terrible amount of computational time.

Some leviathan blogpost from 2017 details the efforts to get a blocklist working using an external program. However, the blog post does not seem to solve the issue. The author describes using an external Perl script that seems to have the same linear time complexity in looking up the IP address.

Using a Database, Instant Lookups and Apache

The GNU dbm is a good candidate due to being designed with hashmaps in mind, such that key lookups have an $O(1)$ time complexity. This is not mentioned explicitly in the gdbm documentation, yet looking at the source code reveals that gdbm_fetch calls _gdbm_findkey, which computes a hash and then locates the key using buckets.

The gdbm command-line tools can be installed on Debian via:

aptitude install gdbmtool
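
As a quick sanity check that keys can be stored and retrieved, gdbmtool can be driven directly from the command line; the file name test.dbm below is just a scratch database used for this example:

# create an empty database, mirroring the approach used by the scripts below
gdbmtool --newdb test.dbm reorganize
# store an IP address as both key and value
gdbmtool test.dbm store 192.0.2.1 192.0.2.1
# fetch the value back; prints 192.0.2.1 if the key is present
gdbmtool test.dbm fetch 192.0.2.1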

As an example, a script can be written to generate a database containing the Amazon IP addresses that are publicly listed, and then apply some policy to all connections from the Amazon IP range. One could, for instance, block rented machines on Amazon that are used by spammers such as Semrush to access content whilst seeming legitimate.

The script can be scheduled with cron in order to run periodically and update the database of IP addresses every time it runs:

#!/usr/bin/env sh
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This script is meant to run from cron and it will generate a database   #
# of IP addresses belonging to the Amazon IP address range that can then  #
# be used with Apache rules in order to block Amazon machines.            #
###########################################################################
 
EXTRACT_NETWORKS=`curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | .ip_prefix' | sed -r 's/^"(.+?)"$/\1/g'`
truncate -s 0 /etc/apache2/aws.dbm
gdbmtool --quiet --newdb /etc/apache2/aws.dbm reorganize >/dev/null 2>&1
for NETWORK in $EXTRACT_NETWORKS; do
    for ADDRESS in $(prips $NETWORK); do
        gdbmtool /etc/apache2/aws.dbm store "$ADDRESS" "$ADDRESS"
    done
done

The script additionally uses prips, a tool that expands the network prefixes provided by Amazon into the full list of individual IP addresses, because the resulting database should contain all the IPs spelled out in order to achieve $O(1)$ lookups on exact keys.
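
For instance, assuming prips is invoked with CIDR notation, a small network expands as follows (the prefix is just an example):

prips 192.0.2.0/30
# outputs:
# 192.0.2.0
# 192.0.2.1
# 192.0.2.2
# 192.0.2.3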

The database file is placed at /etc/apache2/aws.dbm and is then read by Apache. For example, a blacklist can be constructed with Apache that loads the database and then drops the connection from any Amazon IP address previously generated, whilst spending $O(1)$ time to accomplish the task.

The following configuration follows the idea of using RewriteMap to match IP addresses against the database and relies on a helper script placed at /usr/local/bin/aws2apache:

# based on https://www.ispcolohost.com/2017/02/03/keeping-amazon-ec2-crap-off-your-website/
<IfModule mod_rewrite.c>
    RewriteMap aws prg:/usr/local/bin/aws2apache
    RewriteCond ${aws:%{REMOTE_ADDR}} ^1$
    RewriteRule .* - [F,L]
</IfModule>
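
The snippet above requires mod_rewrite to be loaded; on Debian this would typically be done along the lines of:

a2enmod rewrite
systemctl restart apache2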

However, the script placed at /usr/local/bin/aws2apache is started once by mod_rewrite as a prg: map; it reads the IP addresses to be looked up from its standard input, one per line, and uses gdbmtool to perform each lookup, thereby achieving the $O(1)$ time complexity that was sought after:

#!/usr/bin/env sh
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This is a short script meant to be run by mod_rewrite as a prg: map: it #
# reads lookup keys (IP addresses) from standard input, one per line, and #
# checks each one against the DBM database, printing 1 if the entry is    #
# found or 0 otherwise. The script is meant to work with the apache       #
# configuration listed at:                                                #
#   * https://grimore.org/fuss/apache                                     #
#         #black_or_whitelist_ip_addresses_without_slowdown               #
# in order to batch block a large number of IP addresses without slowdown #
###########################################################################
 
while read ADDRESS; do
    # an empty lookup key can never be on the list
    if [ -z "$ADDRESS" ]; then
        echo 0
        continue
    fi
 
    FETCH=`gdbmtool --read-only /etc/apache2/aws.dbm fetch "$ADDRESS" 2>/dev/null`
    if [ "$FETCH" = "$ADDRESS" ]; then
        # the address is in the database, so signal a match
        echo 1
    else
        # the address is not in the database
        echo 0
    fi
done
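
The map script can be tested by hand by feeding it an address on standard input; the address below is only an example and should be replaced with one that is known to be in the database:

printf '198.51.100.7\n' | /usr/local/bin/aws2apache
# prints 1 if 198.51.100.7 is present in /etc/apache2/aws.dbm, 0 otherwise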

Scaling-Up

The generator script in the previous section needs to list the entire IP range for every network, such that most of the time taken is not algorithmic but is lost on starting and stopping gdbmtool. However, the purpose of the script is to generate the IP addresses seldomly, perhaps once every week, yet use the list of IP addresses frequently, in the algorithmic sense, with IP addresses having to be checked against the list upon every request, such that the time spent generating the list of IPs is negligible relative to its usage.
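
Should the generation time ever become a concern, the per-address overhead could be reduced by batching the store commands into a single gdbmtool invocation; this is only a sketch, under the assumption that gdbmtool reads commands from its standard input when no command is passed on the command line:

# expand a network and feed one store command per address to a single
# gdbmtool process instead of starting gdbmtool once per address
prips 192.0.2.0/24 | awk '{ print "store " $1 " " $1 }' | gdbmtool /etc/apache2/aws.dbm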

Here is a script that expands on the very same idea by adding multiple sources:

#!/usr/bin/env bash
###########################################################################
##  Copyright (C) Wizardry and Steamworks 2024 - License: MIT            ##
###########################################################################
# This script is meant to run from cron and it will generate a database   #
# of IP addresses based on various generators defined within the array    #
# meant for network address generation.                                   #
###########################################################################
 
DATABASE=/etc/apache2/ip-block.dbm
NETWORK_GENERATORS=(
   # Amazon AWS
   "curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | .ip_prefix' | sed -r 's/^\"(.+?)\"$/\1/g'"
   # Peerblock Level 1
   "curl -s -L 'http://list.iblocklist.com/?list=bt_level1&fileformat=p2p&archiveformat=gz' -o - | gunzip | cut -d: -f2 | grep -E '^[-0-9.]+$' | xargs ipcalc -rnb | grep '/'"
)
 
# Acquire a lock.
LOCK_FILE='/var/lock/apacheipblockdatabase'
if mkdir "$LOCK_FILE" >/dev/null 2>&1; then
    trap '{ rm -rf $LOCK_FILE; }' KILL QUIT TERM EXIT INT HUP
else
    exit 0
fi
 
if [ ! -f "$DATABASE" ]; then
    gdbmtool --quiet --newdb "$DATABASE" reorganize >/dev/null 2>&1
fi
 
printf "Collecting networks and adding to database.\n"
IFS=$' '
COUNT=1
for COMMAND in "${NETWORK_GENERATORS[@]}"; do
    printf "Using generator: $COUNT/${#NETWORK_GENERATORS[@]}\n"
    COUNT=$((COUNT+1))
    eval $COMMAND | while read NETWORK; do
        printf "Processing network $NETWORK\n"
        prips $NETWORK | while read ADDRESS; do
            # skip the rest of the network if the address is already present;
            # this assumes the network was added in full on a previous run
            RESULT=$(gdbmtool --read-only "$DATABASE" fetch "$ADDRESS" 2>/dev/null)
            if [ "$RESULT" = "$ADDRESS" ]; then
                break
            fi
            # store the address as both key and value, skipping it on failure
            gdbmtool "$DATABASE" store "$ADDRESS" "$ADDRESS" 2>/dev/null >/dev/null
            if [ "$?" -eq 1 ]; then
                continue
            fi
            printf "\rAdded $ADDRESS               "
        done
        printf "\rDone.                     \n"
    done
done
printf "All networks collected.\n"

where NETWORK_GENERATORS is an array that contains commands producing IP networks in CIDR notation; each network is processed within a loop, expanded to a full list of IP addresses via prips and then added to the database via gdbmtool.
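
New sources can be added by appending further commands to the array; for instance, a hypothetical local file of manually maintained networks, one CIDR prefix per line, could be included as follows (the file path is only an example):

NETWORK_GENERATORS+=(
    # manually maintained local block list, one network per line
    "cat /etc/apache2/manual-blocks.txt"
)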

The script is meant to run for a long time, depending on the size of the lists produced by the generators array, given that linear time is needed to insert the lists into the database. However, in most cases the script is meant to run rarely, such that adding the script to a weekly cron job should be sufficient for generators that do not change much.
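
For example, assuming the generator script above is saved as /usr/local/bin/generate-ip-block-db (both the path and the schedule are arbitrary choices for this example), a system crontab entry in /etc/crontab or a file under /etc/cron.d could read:

# run the IP block database generator every Sunday at 04:00
0 4 * * 0 root /usr/local/bin/generate-ip-block-db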

The lookup script and the Apache configuration from the previous section apply here as well and can be used to reference the database created by this script, in order to block, or apply whatever other policy is required to, connecting IP addresses that are found within the generated database.
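
A minimal sketch of the adapted configuration, assuming the lookup script from the previous section is copied to a hypothetical /usr/local/bin/ipblock2apache with its database path changed to /etc/apache2/ip-block.dbm, would then be:

<IfModule mod_rewrite.c>
    RewriteMap ipblock prg:/usr/local/bin/ipblock2apache
    RewriteCond ${ipblock:%{REMOTE_ADDR}} ^1$
    RewriteRule .* - [F,L]
</IfModule>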

