In computer engineering, and more precisely in the design of user interfaces, the "principle of least surprise" is a design principle stating that an interface should behave like previously designed interfaces for the sake of continuity.
One of the big problems of our times is the lack of "documentation" for otherwise good software, which in turn makes the software itself unusable in many cases. Even though automated tools are available for generating Application Programmer Interface (API) documentation, more often than not developers consider the API itself to be equivalent to documentation. This applies to the very many software packages that use Swagger and other tools to generate API documentation, especially where REST and other interoperability is involved. In reality, documentation is a superset of the API and involves much more than an API listing can provide, including details that are essential to any developer thinking of using the software as a component in their own workflow. Documentation is tough to write and needs dedication to maintain, given the changes between various releases of a software package, such that it is no wonder developers just prefer the output of a tool that generates API documentation instead of dedicating time to writing proper documentation.
Either way, as it stands, it is important to remember that "documentation" is not synonymous with "API documentation": "documentation" is a superset that includes "API documentation" along with much more.
One of the ideas discussed in the NAS transformation of a hard-drive enclosure coupled with a mini/micro-PC is the idea of splitting storage by purpose, which is worth mentioning separately. Very large and low-cost storage is typically found in formats that are not that technologically advanced; for example, in terms of money per byte, spinning-platter disks in a 3.5" format will always be far cheaper than, say, an NVMe device of comparable size. This means that for extremely large storage quantities, the cost of NVMe-like storage rises steeply whereas a classic spinning drive scales roughly linearly with the amount of storage it provides. You could go the Richie Rich way, blow all the money and store everything on NVMe but, given the differences between technologies and scaling prices, that is just deliberately wasting money for no reason.
Another concern is wear-and-tear: it does not make much sense to use a high-cost and very fast storage medium whose write cycles would end up diminished by being hammered with writes of discardable data such as log files or temporary downloads that need to be cleaned up and are also fairly expendable. Similarly, hammering spinning disks with write cycles, just like any other medium, diminishes their lifespan, and for most usage patterns it does not make sense to, say, place the operating system and the bulk files on the same large spinning drive. One critical application where this contextual storage split proved very useful has been the IoT automatic recording with Desktop notifications project, where it was found that automatically-stored video clips were all extremely volatile but also required a high-IO usage pattern. The camera would record continuously, stopped only by motion, but most of the recordings were unimportant or false positives and could just be thrown away, such that it became a form of technological "sacrilege" to abuse expensive storage with minute-by-minute IO from camera footage that would automatically end up deleted anyway. In order to not only reduce the strain on the storage devices but also optimize the recordings, it was decided to use a temporary filesystem in RAM to store the recordings, and only after being curated would the recordings be committed to long-term storage. Not only did that alleviate write cycles, it also sped up saving the video streams, because saving clips to RAM did not create any bottleneck and the videos stored to RAM never ended up corrupt.
So, in some ways, it was determined that an ideal NVR would benefit from a large amount of RAM to temporarily store recordings, and only after being curated by a human, for example following an incident, should the recordings be relocated to long-term storage.
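As a minimal sketch of that staging pattern, assuming a tmpfs mount such as /dev/shm and a hypothetical long-term storage directory (both paths are our own examples, not fixed), committing a curated clip is a single move out of RAM:

```shell
# commit a curated clip from the RAM staging area to long-term storage;
# anything left in the tmpfs directory simply vanishes on reboot
commit_clip() {
  ram_clip=$1    # e.g. /dev/shm/nvr/clip-0001.mp4 (staged recording)
  store_dir=$2   # e.g. /mnt/storage/recordings (spinning drive)
  mkdir -p "$store_dir"
  # this mv is the only write that ever touches the long-term medium
  mv "$ram_clip" "$store_dir/"
}
```

Uncurated recordings never touch the disk at all, which is exactly the write-cycle saving described above.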
The next observation for this idea is that various usage patterns require, or would allow, different types of storage, such that the cost of storage can be optimized contextually. Say, a Linux operating system filesystem could even run off read-only storage, given that only very few parts of the operating system require write access to the drive, for instance log files or other temporary files that have an ephemeral profile anyway and could instead be piped over the network to some centralized service that stores and analyzes them. There is another problem regarding storage and that is the "myth" of "cheap RAID", which is a "blasphemous" concept, starting with the unavailability of hardware technology like NCQ on cheap SATA drives and up to the Mean-Time-Between-Failures (MTBF) of commercial drives that is simply trash compared to industrial storage that is very expensive. The idea that you can buy several "cheap" SATA drives, maybe even external drives that are cheaper, use a screwdriver to pop out the SATA drive and then build a "cheap" RAID, simply does not hold, because the quality of these SATA drives actually matches the usage pattern they were designed for, namely external drives that store some files "now and then", not, say, members of a ZFS pool that resilvers all the time! The thing is that even if MTBF could provide a probabilistic model for determining when a drive will fail, the MTBF does not account for random failures which, given large-scale production, seem ever more predominant. A random failure in this context would be, for example, a very expensive NVMe drive that just stops working for no reason, grossly undershooting its MTBF, and not as a consequence of its usage cycle.
This means that investing in expensive storage and then using that storage for purposes that destroy the medium for no reason (ie: hammered with log files that are not even read) artificially raises the price of the project, for the very reason of misunderstanding the various usage cases of different technologies. While MTBF relies on some preset environmental conditions, it is also the case that the environmental conditions used for determining the MTBF rarely match the environment where the product is actually used, with lots of external conditions that turn out to have a massive influence on the actual failure rate. For example, when gathering the motivation for dropping RAID solutions in favor of a monolithic build during the NAS transformation, one of the sources of inspiration was a relatively old write-up on the Internet from AKCP that showcased a paper from National Instruments claiming that exceeding the thermal design range of a hard-drive by even 5°C would be the functional equivalent of using the drive continuously for more than two years, which shows that environmental parameters have a hefty effect on hardware. Interestingly, the observation here is that environmental parameters are never even semantically perceived as part of the usage pattern of hardware but rather as something that varies wildly between applications.
For Servarr usage, a "download" folder or partition is typically a very frequently-accessed folder that is also fairly dirty in terms of stability, with lots of broken downloads that need to be cleaned periodically, and with most of the data within that folder being hit-or-miss à la Schroedinger's cat: a download either succeeds, in which case it can be moved to permanent storage, or it is a failed download, in which case it is a giant wear-and-tear bomb. Fortunately, deleting a file on an operating system does not additionally imply zeroing out the bytes but rather just unlinking the file node from the filesystem tree, such that the data merely becomes eligible to be overwritten; even so, consider that some downloads are large and can reach up into the hundreds of GB, which is byte data that ends up committed to the drive and, in case the download is not what was expected, those hundreds of GB just end up deleted as garbage while burning through the storage cycles for no useful purpose. A "download" folder is hence very different in terms of storage constraints from long-term storage or even the root filesystem of the operating system that runs the software, which means that the underlying technology could, and more than likely should, be different. We would unapologetically recommend storing downloads on a cheap USB thumb drive that is just connected via a USB port. Nowadays USB thumb drives reach up into the terabytes, and flash storage is fairly cheap but also not great in terms of performance. Furthermore, downloads should only be a temporary buffer and the total space requirements should only scale, say, with the seeding requirements specified by various trackers; holding onto failed downloads and other garbage just fills up the drive for no purpose. In the case of catastrophic failure of the USB drive, the drive can just be tossed, whereas tossing out an expensive NVMe that held the downloads would sting considerably more.
As a fully-working example, for a Servarr stack one could settle on, say, a read-only root filesystem, an extremely cheap USB drive of up to 1 TB to store the "download" folder, and a large 3.5" hard-drive to store the final files after they have been processed by the Servarr stack.
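Sketched as an /etc/fstab fragment, with device names and mount points that are assumptions for illustration rather than a tested configuration, the layout could look like:

```
# read-only root on the boot medium
/dev/mmcblk0p2  /               ext4  ro,noatime          0  1
# cheap ~1TB USB thumb drive for the volatile "download" folder
/dev/sda1       /mnt/downloads  ext4  rw,noatime,nofail   0  2
# large 3.5" spinning drive for the processed, long-term files
/dev/sdb1       /mnt/media      ext4  rw,noatime,nofail   0  2
```

The nofail option keeps the machine booting even if the expendable USB stick has died, which fits the "just toss it" philosophy above.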
Another good example hinting that a filesystem should be contextually or semantically partitioned given its varying usage is the installer of the Debian Linux distribution, where one of the prompts asks whether all the files should go into the same partition and marks that option as "beginner", which shows that correct planning usually maps various sub-trees of the filesystem to different storage devices with varying properties. One of the common practices, for example, is to relocate ephemeral files such as log files or temporary files into RAM via tmpfs, and it is clear that the FHS allows mounting various top-level directories like /var or /tmp on different storage mediums, even ignoring the typical corporate diskless setups via NFS and NIS, just for the purpose of saving money on storage by not being ignorant about storage mediums and their recommended usage patterns.
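As an illustration, relocating such ephemeral directories into RAM amounts to two tmpfs lines in /etc/fstab, with the size limits below being arbitrary examples:

```
tmpfs  /tmp      tmpfs  rw,noatime,size=512m  0  0
tmpfs  /var/log  tmpfs  rw,noatime,size=128m  0  0
```

The trade-off is that logs vanish on reboot, which is another reason to pipe them over the network to a centralized collector as mentioned above.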
The following section describes security anti-patterns exhibited in various environments where security for the sake of security has the effect of hampering development or, counter-intuitively, of making a system even less secure.
Some opensource software packages have had updates over the years that pertain to security, with the two main highlights being the requirement of certain permissions on dependent files and the requirement that the daemon run under (or not under) a particular account such as the root account, with both of these expectations being shameless anti-patterns.
First, requiring permissions for dependent files makes the software dependent on a filesystem that is able to store permissions in the first place, which does not cover all filesystems but rather a restricted set, most of them meant for multi-user systems.
For example, NTFS does not have a corresponding POSIX compatibility layer that would allow a mounted NTFS filesystem to store Linux or POSIX permissions, which means that an NTFS filesystem mounted on Linux will default to allowing all users to read and write each and every file. A good instance of the anti-pattern consists in daemons that require certain files on the filesystem to be given a certain set of permissions; for example, the MySQL daemon will refuse to read configuration files if they are world-writable even though the functionality of MySQL itself is not contingent upon the permissions of its configuration files. This leads to a problem where the MySQL health-check script healthcheck.sh
will never work for a MySQL daemon running on top of a Linux-mounted NTFS filesystem, even if the MySQL daemon runs within a Docker container, given that bind-mounts just pass permissions through. Instead, healthcheck.sh
will error with:
Warning: World-writable config file '/var/lib/mysql/.my-healthcheck.cnf' is ignored
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
Unknown error
Warning: World-writable config file '/var/lib/mysql/.my-healthcheck.cnf' is ignored
ERROR 1045 (28000): Access denied for user 'root'@'::1' (using password: NO)
healthcheck connect failed
and simply refuse to run at all.
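One hedged workaround, assuming the ntfs-3g driver, is to fake acceptable ownership and permissions at mount time rather than patch the software: the uid/gid and fmask/dmask options below make every file on the volume appear private to the mysql user, although the NTFS volume itself still stores no POSIX permissions (the device and mount point are illustrative):

```
# make all files on the NTFS volume appear owned by mysql, mode 0600/0700
mount -t ntfs-3g -o uid=$(id -u mysql),gid=$(id -g mysql),fmask=0177,dmask=0077 \
    /dev/sdb1 /var/lib/mysql
```

This satisfies the permission check without the filesystem actually being able to hold permissions, which rather proves the point that the check is cosmetic.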
Second, requiring that the daemon run under this-or-that privileged or unprivileged account is a matter of relativism given containerization. Typically, software runs within Docker containers as root because it is launched from the very init script that starts when the container starts, but the root account within the container does not map to root outside the container and, given that one of the explicit goals of containerization is privilege separation, the requirement from a software daemon to not run under a particular user, let alone root, is as superfluous as it is an impediment to interoperability with other components.
Lastly, both of these requirements massively decrease the portability of the software itself, with the code having to branch on the platform and, in case the software runs under Linux, add an exception to check whether this-or-that file has certain permissions or whether the daemon runs under one user in particular. This is very ugly, is based on assumptions about the layout of the operating system, precludes the goals of containerization and defeats the purpose of writing platform-agnostic portable code, just for the sake of cargo-culting some security trope.
Outsourcing security has the implicit effect of inheriting all the flaws, habits, incompetences and particularities of the target enterprise that the security is outsourced to. For example, Cloudflare's "security" is inherited from Project Honeypot, which long ago used to be a moderate blacklist of IP addresses containing machines across the Internet that were either known to be compromised or had participated in a recent attack on the Internet. Unfortunately, blacklists are not used at all anymore in 2025, not even by pirates, because they are unreliable given that TCP/IP defines an IP address as public information and not personally-identifying information. What happens often, for example, is that many regions of the planet have clients behind carrier-grade NAT, where a whole building or even a small settlement is routed through a single IP address, which makes attributing behavior to that IP address very shallow given that the address masks a large number of clients. We ourselves had to lower the security settings of Cloudflare many times because Cloudflare kept blacklisting Chinese IP addresses, but those addresses eventually migrated to other customers that then became blocked.
The same applies to hardware; for example, FortiGate, made by Fortinet, a company founded by two Chinese brothers, is a hardware-level firewall that can perform things such as deep-packet inspection or work as a transparent, seamless proxy. Ironically, the default blacklist supplied with FortiGate blocks all major human rights organizations such as transparency.org, for no reason in particular. Whilst that should be more or less irrelevant for most purposes, it does not match contexts where the hardware is meant to be used by people who work in human rights organizations, research or journalism, because the firewall blocks human rights activism by default.
Similarly, both solutions blanket-ban anonymizing networks such as "tor" or "i2p", or attempt to do so via deep-packet inspection, mainly because these companies have a contradictory dual stance where they monetize security while at the same time claiming to provide it, such that whilst an anonymizing network like "tor" might be secure, it also eludes the data collection that Cloudflare performs and is thus inconvenient business-wise. There is little reason to block anonymizing networks, except that they would not provide very good data if intercepted; in terms of bandwidth these networks are unable to carry out large-scale attacks and would collapse way before. For example, "tor" on its own does not even support torrenting, because the bandwidth would saturate the network to the point of being unusable, while "i2p" implements torrenting but only within its own "i2p" network, with no contact to outside trackers.
One global phenomenon that pertains both to computer security and to physical security in general is that, for most of these bulk-security companies, the very top-level organizations are granted in many cases blanket passthroughs to the point of being completely whitelisted. Google, for example, is a company that is mostly whitelisted and bypassed by blacklisting filters because it is deemed too large to be a security risk. This relative judgement ends up with funny consequences, such as Google vans being allowed right about anywhere, or spam filters outright whitelisting GMail to the point that some percentage of spam ends up coming from GMail itself, for the very bonus of being able to bypass most automated spam filters via the gratuitous reputation.
Both of these examples go to highlight that outsourcing security, particularly when the checks being implemented rely more on a matter of preference, is a heavy anti-pattern, and that anyone would be better off implementing their own security policy, protections and response mechanisms. This is why some of the more valuable blocklists are generated dynamically as attacks happen and have a fast expiry time per entry, such that they can be used as a first-response buffer without casting too tall a shadow onto IP addresses that are not stable by definition.
Copying large files is a time-consuming operation using regular command-line tools like cp, dd, copy, or just copying files using a graphical interface, because these tools are meant to perform a bit-exact copy of a file. The concept of "copying files" reaches back to Kernighan & Ritchie's "The C Programming Language", with the most basic example being found in chapter 1.5.1, "File Copying":
#include <stdio.h>

/* copy input to output; 1st version */
main()
{
    int c;

    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
}
which copies standard input to standard output by reading from stdin one character at a time (held in an int so that the out-of-band EOF value can be represented). Most tools, regardless how sophisticated, follow the same pattern more or less. For example, dd just adds some control in terms of skipping some bytes from the input before actually starting to copy, or copying an exact number of bytes and then terminating.
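The skip/count mechanics can be sketched on a throwaway file; the paths are scratch examples of our own:

```shell
# create a 10-byte sample file
printf 'abcdefghij' > /tmp/sample.bin
# copy 4 bytes starting at offset 3; everything else is K&R-style copying
dd if=/tmp/sample.bin of=/tmp/slice.bin bs=1 skip=3 count=4 2>/dev/null
cat /tmp/slice.bin   # prints "defg"
```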
Nevertheless, all these tools fail to account for the situation where two files already exist and the user would just like to synchronize the changes between a source file and a destination file. Imagine that two large files exist, such as two ISO files, and the user would like to copy or move one ISO file onto the other (perhaps the other one is "broken"). When the copy or move operation is started (moving files being more or less a matter of copying and then deleting the old file), regardless of whether the copying is performed on the console or using the graphical interface (like Explorer on Windows), the operations performed can be reduced to the simple K&R example where the ISO file is copied onto the other ISO file by copying all bytes.
In case bytes between the source and the destination ISO files are identical, the copying tool completely disregards this detail and just blindly copies the data over, overwriting the existing blocks with, well, the exact same data. While this might have been acceptable many decades ago, when storage was extremely expensive and scarce and files consisted mainly of documents or were short in length, nowadays copying existing data on top of the very same existing data is a very inefficient operation that, aside from the massive waste of time, also has the impact of reducing the lifespan of the storage medium.
Curiously, many decades ago a cracking utility named X-Copy appeared on the Commodore Amiga scene, boasting its capability of performing a sector-exact copy between two floppy disks. The utility of such a program pertained to recovering data from broken media or copying diskettes whose software copy protection would be spent if the diskette were copied the regular way. X-Copy had a set of features where it could skip identical sectors and perform exact copies of sectors by hashing and re-reading them until they were identical, and it quickly became a referential tool for copying diskettes, to the point that everyone had to have a copy of X-Copy.
In modern times where storage is abundant and files can be large, one can think of the problem in terms of streams where not only would there be bit-by-bit similarities between two large files but rather entire sequences or sub-sequences of bits between a source and destination would be identical, such that copying over data that already exists results in a waste of time and the burning of the life-cycle of the storage medium.
While the Commodore Amiga came and left like an alien landing on the planet and then vanishing, many years later a tool called rsync came to be, as if nothing using the same principles had ever existed, that managed to create differences between a source and destination and then only copy over those differences instead of all the data when it already existed on the destination. CVS, the predecessor to Subversion, itself a predecessor of the more modern Git, did not have the capability of telling the difference between two binary files, and Subversion was the first source-code management tool that managed to store only binary differences between revisions, such that when Subversion came out CVS was abandoned quickly, because repositories hosted on CVS would double in size every time binary files were committed between revisions (in fact, it became a rule of "good practice" to not commit binary data to CVS and to keep it separate, with only source code being committed to CVS). At the same time, the cracking scene promoted tools such as bsdiff, a tool that is able to create patches by comparing two files and then generating a binary delta between them, very much like the simple diff tool does for source code. Torrents are a great technological showcase with a built-in awareness of the bitwise layout of a file, such that large files are transferred by segmenting the file into pieces and then only requesting pieces of the file from peers, thereby speeding up the operation overall and eliminating single points of failure where the source might stall or go away.
With all things considered, all these tools had the capability to create a difference file between two binary files and only store or apply the difference such that already existing blocks that were the same between a source and destination would not have to be copied or updated in any way.
To this date, it is quite surprising that the notion of "copying files" has not been beefed up and that "innovation" in this area remains moot when the prior history is established and there exists a theoretical background as well as referential tools like rsync that are able to work with differences, whether binary deltas or entire source-destination comparisons, in order to minimize the expenditure and time spent copying files.
Windows, for example, regardless of its hyped-up releases of Windows 10 and then 11, does not innovate on this topic at all, and the "file copy" operation is the same as it ever was since the very start of "The C Programming Language" by Kernighan & Ritchie. Copying a large file on top of another large file on Windows, even if the files are very similar on a binary level, is a dumb operation where Windows sets a cursor at the beginning of the source file and then churns all the way to the end, copying everything from the source to the destination while disregarding any similarity between the files.
Luckily there are ways to perform partial copies between two files manually, by comparing them and generating file differences that can then be applied on top of the destination such that only the differences are changed. The tools to mention are the likes of bsdiff and xdelta3, of which xdelta3 is perhaps the most recent and is also available on Windows, where it is popular in the ROM cracking scene where people modify games. Using these tools is very similar from one to the other, with the workflow being along the lines of first creating a patch between two files and then applying the patch to the destination file.
For the purpose of copying over a large binary file, the procedure can then be reduced to two steps: first create a delta between the source and the destination, then apply the delta onto the destination, which describes theoretically the working mode of these tools. There are some properties here, namely that blocks already identical between source and destination are never rewritten, such that both time and write cycles are spent only on the actual differences.
One could go on but the gains seem clear, so here is an instantiation as an example of restoring a file from a previously stored BtrFS snapshot. Namely, when a snapshot exists, restoring a large file is, from the point of view of BtrFS, a matter of just copying it over from the snapshot folder; however, since the file is presumably very large (assume, for example, a terabyte-large file, or a disk image of hundreds of gigabytes) and we would like to restore the file without copying it entirely, a difference delta file is first created between the snapshotted or backed-up file and the current file:
xdelta3 -e -s /mnt/volume/.snapshots/20250811/S /mnt/volume/S /tmp/dS.xdelta3
where:
/mnt/volume/.snapshots/20250811/S is the path to a file within the BtrFS snapshot volume that the user wants to copy,
/mnt/volume/S is the path to an existing file onto which the user would like to copy the source file,
/tmp/dS.xdelta3 is the path to a file where the difference between /mnt/volume/.snapshots/20250811/S and /mnt/volume/S should be stored.
After the file /tmp/dS.xdelta3 is generated by the xdelta3 tool, it can be applied onto /mnt/volume/S, which is the destination file onto which the user wanted to copy the snapshotted or backed-up file /mnt/volume/.snapshots/20250811/S:
xdelta3 -d -f -s /mnt/volume/S /tmp/dS.xdelta3 /mnt/volume/S.new && mv /mnt/volume/S.new /mnt/volume/S
where:
/mnt/volume/S is the file to modify,
/tmp/dS.xdelta3 is the file containing the delta difference,
/mnt/volume/S.new is a temporary output file that is moved over the original once the patch succeeds, which amounts to an in-place patch (xdelta3 cannot safely read the source file while overwriting that very same file, hence the temporary).
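The whole restore can be wrapped in a small helper of our own making (not a tool from the text): it uses xdelta3 when available, decodes into a temporary file before moving it over the destination, falls back to a plain copy when xdelta3 is absent so the sketch stays self-contained, and verifies the result with cmp:

```shell
# restore $2 (live file) to the contents of $1 (snapshotted file)
restore_from_snapshot() {
  snap=$1; live=$2
  if command -v xdelta3 >/dev/null 2>&1; then
    delta=$(mktemp)
    # encode a delta that transforms the live file into the snapshot,
    # then decode it into a temporary file and move it into place
    xdelta3 -e -f -s "$live" "$snap" "$delta" &&
      xdelta3 -d -f -s "$live" "$delta" "$live.new" &&
      mv "$live.new" "$live"
    rm -f "$delta"
  else
    # no xdelta3 on this machine: degrade to a whole-file copy
    cp "$snap" "$live"
  fi
  # exit status 0 iff the live file now matches the snapshot
  cmp -s "$snap" "$live"
}
```

Writing to a temporary file and moving it into place is a deliberate design choice: a crash mid-decode leaves the original file untouched.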
Otherwise, the process of transferring two large directory trees whilst minding data that might already be transferred is covered by tools such as rsync that, iff. the -W (whole-file) parameter is omitted, will perform file checks and then update files based on their differences:
rsync -vaxHAKS --partial --append-verify --delete /source /destination
/source is a source directory,
/destination is a target directory.
The parameters --partial and --append-verify ensure that file transfers can be resumed in case they are interrupted: the flag --partial allows preserving partially transferred files (like large files when the transfer is interrupted, in case "rsync" is closed or crashes), such that when the transfer is issued again --append-verify will check whether the hash of the data in the source file matches the hash of the data in the interrupted partial file, and iff. the hashes match, "rsync" will resume transferring the file by appending to its end.
Note that rsync just linearly compares two files to determine where to seek into the partially transferred file in order to continue copying it. With --append-verify, in case there is a difference between the source and the partially transferred file, within the length of the partially transferred file, rsync will resort to transferring the whole file again without creating a patch file. An even more dangerous option that is deprecated is --append, which just blindly copies over the source file by appending to the end of the partially transferred file; this will generate non-equal copies between the source and destination in case either has been changed in the meanwhile, which is why --append-verify is preferred.
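The append-verify idea can be sketched in plain shell; the function below is our own illustration of the principle rather than rsync's actual implementation: it checks that the partial destination is a byte-exact prefix of the source and only then appends the remaining bytes, otherwise it degrades to a whole-file copy:

```shell
# resume a copy only if the partial destination is a prefix of the source
append_verify_copy() {
  src=$1; dst=$2
  size=$(wc -c < "$dst" 2>/dev/null || echo 0)
  if [ "$size" -gt 0 ] && cmp -s -n "$size" "$src" "$dst"; then
    # prefix matches: append only the bytes past the partial length
    dd if="$src" bs=1 skip="$size" >> "$dst" 2>/dev/null
  else
    # prefix differs or no partial file: fall back to a whole-file copy
    cp "$src" "$dst"
  fi
}
```

The cmp -n prefix check is exactly the "verify" part that the blind --append behaviour lacks.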
Teracopy on Windows manages to account for file pieces and then uses multiple threads in order to transfer larger files, more than likely with the theoretical hope that the copying threads will be distributed among the CPUs or cores, thereby, just like torrents, eliminating the possibility that a single thread gets outscheduled by the operating system and removing a single point of failure. However, as far as we know, Teracopy does not implement delta transfers, so when transferring already-existing files Teracopy more than likely acts like the Unix rsync tool, appending at the end of the file in case the partially transferred file matches the source up to its length, in order to implement its "resume" feature.
Historically, at the inception of the WWW, tools made for downloading, like "wget" (or later "curl"), did not even have a way to resume partial transfers, which made downloading a nightmarish operation when you were on dialup and the phone accidentally hung up, such that you'd have to dial the ISP again and restart the transfer from the very beginning. HTTP has since gained Range Requests (seek into a file and transfer part of it) and Chunked Transfers (transfer pieces), and wget got its -c parameter allowing a large transfer to be resumed. What is interesting about a tool like wget compared to Teracopy or rsync is that resuming in wget relies on first checking the local file's size, asking the HTTP server for the source file's size, performing some arithmetic and then determining from where the source file should be requested from the HTTP server in order to resume the transfer. However, to this date there is no built-in hashing à la "rsync" within the HTTP protocol, so compared to rsync, wget relies only on arithmetic to determine whether the source file has changed and, if it has, wget just restarts the download. This means that wget has the weakest resume of all the tools, because its resume operation is contingent upon the HTTP protocol, which only implements replies to requests for pieces of files, such that wget has no possibility to check the consistency of the transfer.
Maybe in future versions of HTTP delta transfers will be implemented, which is what current development is being geared towards with the notion of distributed websites that leverage torrent technology as a transfer technology. One very cute free tool for watching movies is Popcorn Time, which works a treat for streaming movies directly using torrent technology, because all the client has to do is transfer movie-file pieces in linear and contiguous slabs of data while prioritizing the pieces closer to the current playback cursor over the pieces found later in the file.