In computer engineering, and more precisely in the design of user interfaces, the "principle of least surprise" is a design principle stating that an interface should behave like previously designed interfaces for the sake of continuity.
One of the big problems of our times is the lack of "documentation" for otherwise good software, which in turn makes the software itself unusable in many cases. Even though automated tools are available for generating Application Programmer Interface (API) documentation, more often than not developers consider the API itself to be equivalent to documentation. This applies to the very many software packages that use Swagger and other tools to generate API documentation, especially where REST and other interoperability is involved. In reality, documentation is a superset of the API and involves much more than an API listing can provide, including details that are essential to any developer thinking of using the software as a component in their own workflow. Documentation is tough to write and needs dedication to maintain, given the changes between various releases of a software package, such that it is no wonder developers just prefer the output of a tool that generates API documentation instead of dedicating time to writing proper documentation.
Either way, as it stands, it is important to remember that "documentation" is not synonymous with "API documentation": "documentation" is a superset that includes "API documentation" along with much more.
One of the ideas discussed in the NAS transformation of a hard-drive enclosure coupled with a mini/micro-PC is the idea of splitting storage by purpose, which is worth mentioning separately. Very large and low-cost storage is typically found in formats that are not that technologically advanced; for example, in terms of money per byte, spinning-platter disks in a 3.5" format will always be far cheaper than, say, an NVMe device of comparable size. This means that for extremely large storage quantities, the cost of NVMe-like storage rises steeply whereas a classic spinning drive scales roughly linearly with the amount of storage it provides. You could go the Richie Rich way, blow all the money and store everything on NVMe but, given the differences between technologies and scaling prices, that is just deliberately wasting money for no reason.
Another concern is wear-and-tear: it does not make much sense to use a high-cost and very fast storage medium whose write cycles would end up diminished by being hammered with writes of discardable data such as log files or temporary downloads that need to be cleaned up and are also fairly expendable. Similarly, hammering spinning disks with write cycles, just like any other medium, diminishes their lifespan, and for most usage patterns it does not make sense to, say, place the operating system and the bulk files on the same large spinning drive. One critical application where this contextual storage split proved very useful has been the IoT automatic recording with Desktop notifications project, where it was found that automatically-stored video clips were all extremely volatile but also required a high-IO usage pattern. The camera would record continuously, stopped only by motion, but most of the recordings were unimportant or false positives and could just be thrown away, such that it became a form of technological "sacrilege" to abuse expensive storage with minute-by-minute IO from camera footage that would automatically end up deleted anyway. In order to not only reduce the strain on the storage devices but also optimize the recordings, it was decided to use a temporary filesystem in RAM to store the recordings, and only after being curated would the recordings be committed to long-term storage. Not only did that alleviate write cycles, it also sped up saving the video streams, because saving clips to RAM did not create any bottleneck and the videos stored to RAM never ended up corrupt.
So, in some ways, it was determined that an ideal NVR would benefit from a large amount of RAM to temporarily store recordings, and only after being curated by a human, for example following an incident, should the recordings be relocated to long-term storage.
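As a minimal sketch of that staging pattern, assuming a tmpfs mount such as /dev/shm and a hypothetical long-term storage directory (both paths are our own examples, not fixed), committing a curated clip is a single move out of RAM:

```shell
# commit a curated clip from the RAM staging area to long-term storage;
# anything left in the tmpfs directory simply vanishes on reboot
commit_clip() {
  ram_clip=$1    # e.g. /dev/shm/nvr/clip-0001.mp4 (staged recording)
  store_dir=$2   # e.g. /mnt/storage/recordings (spinning drive)
  mkdir -p "$store_dir"
  # this mv is the only write that ever touches the long-term medium
  mv "$ram_clip" "$store_dir/"
}
```

Uncurated recordings never touch the disk at all, which is exactly the write-cycle saving described above.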
The next observation for this idea is that various usage patterns require, or would allow, different types of storage, such that the cost of storage can be optimized contextually. Say, a Linux operating system filesystem could even run off read-only storage, given that only very few parts of the operating system require write access to the drive, for instance log files or other temporary files that have an ephemeral profile anyway and could instead be piped over the network to some centralized service that stores and analyzes them. There is another problem regarding storage and that is the "myth" of "cheap RAID", which is a "blasphemous" concept, starting with the unavailability of hardware technology like NCQ on cheap SATA drives and up to the Mean-Time-Between-Failures (MTBF) of commercial drives that is simply trash compared to industrial storage that is very expensive. The idea that you can buy several "cheap" SATA drives, maybe even external drives that are cheaper, use a screwdriver to pop out the SATA drive and then build a "cheap" RAID, simply does not hold, because the quality of these SATA drives actually matches the usage pattern they were designed for, namely external drives that store some files "now and then", not, say, members of a ZFS pool that resilvers all the time! The thing is that even if MTBF could provide a probabilistic model for determining when a drive will fail, the MTBF does not account for random failures which, given large-scale production, seem ever more predominant. A random failure in this context would be, for example, a very expensive NVMe drive that just stops working for no reason, grossly undershooting its MTBF, and not as a consequence of its usage cycle.
This means that investing in expensive storage and then using that storage for purposes that destroy the medium for no reason (ie: hammered with log files that are not even read) artificially raises the price of the project, for the very reason of misunderstanding the various usage cases of different technologies. While MTBF relies on some preset environmental conditions, it is also the case that the environmental conditions used for determining the MTBF rarely match the environment where the product is actually used, with lots of external conditions that turn out to have a massive influence on the actual failure rate. For example, when gathering the motivation for dropping RAID solutions in favor of a monolithic build during the NAS transformation, one of the sources of inspiration was a relatively old write-up on the Internet from AKCP that showcased a paper from National Instruments claiming that exceeding the thermal design range of a hard-drive by even 5°C would be the functional equivalent of using the drive continuously for more than two years, which shows that environmental parameters have a hefty effect on hardware. Interestingly, the observation here is that environmental parameters are never even semantically perceived as part of the usage pattern of hardware but rather as something that varies wildly between applications.
For Servarr usage, a "download" folder or partition is typically a very frequently-accessed folder that is also fairly dirty in terms of stability, with lots of broken downloads that need to be cleaned periodically, and with most of the data within that folder being hit-or-miss à la Schroedinger's cat: a download either succeeds, in which case it can be moved to permanent storage, or it is a failed download, in which case it is a giant wear-and-tear bomb. Fortunately, deleting a file on an operating system does not additionally imply zeroing out the bytes but rather just unlinking the file node from the filesystem tree, such that the data merely becomes eligible to be overwritten; even so, consider that some downloads are large and can reach up into the hundreds of GB, which is byte data that ends up committed to the drive and, in case the download is not what was expected, those hundreds of GB just end up deleted as garbage while burning through the storage cycles for no useful purpose. A "download" folder is hence very different in terms of storage constraints from long-term storage or even the root filesystem of the operating system that runs the software, which means that the underlying technology could, and more than likely should, be different. We would unapologetically recommend storing downloads on a cheap USB thumb drive that is just connected via a USB port. Nowadays USB thumb drives reach up into the terabytes, and flash storage is fairly cheap but also not great in terms of performance. Furthermore, downloads should only be a temporary buffer and the total space requirements should only scale, say, with the seeding requirements specified by various trackers; holding onto failed downloads and other garbage just fills up the drive for no purpose. In the case of catastrophic failure of the USB drive, the drive can just be tossed, whereas tossing out an expensive NVMe that held the downloads would sting considerably more.
As a fully-working example, for a Servarr stack one could settle on, say, a read-only root filesystem, an extremely cheap USB drive of up to 1 TB to store the "download" folder, and a large 3.5" hard-drive to store the final files after they have been processed by the Servarr stack.
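Sketched as an /etc/fstab fragment, with device names and mount points that are assumptions for illustration rather than a tested configuration, the layout could look like:

```
# read-only root on the boot medium
/dev/mmcblk0p2  /               ext4  ro,noatime          0  1
# cheap ~1TB USB thumb drive for the volatile "download" folder
/dev/sda1       /mnt/downloads  ext4  rw,noatime,nofail   0  2
# large 3.5" spinning drive for the processed, long-term files
/dev/sdb1       /mnt/media      ext4  rw,noatime,nofail   0  2
```

The nofail option keeps the machine booting even if the expendable USB stick has died, which fits the "just toss it" philosophy above.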
Another good example hinting that a filesystem should be contextually or semantically partitioned given its varying usage is the installer of the Debian Linux distribution, where one of the prompts asks whether all the files should go into the same partition and marks that option as "beginner", which shows that correct planning usually maps various sub-trees of the filesystem to different storage devices with varying properties. One of the common practices, for example, is to relocate ephemeral files such as log files or temporary files into RAM via tmpfs, and it is clear that the FHS allows mounting various top-level directories like /var or /tmp on different storage mediums, even ignoring the typical corporate diskless setups via NFS and NIS, just for the purpose of saving money on storage by not being ignorant about storage mediums and their recommended usage patterns.
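As an illustration, relocating such ephemeral directories into RAM amounts to two tmpfs lines in /etc/fstab, with the size limits below being arbitrary examples:

```
tmpfs  /tmp      tmpfs  rw,noatime,size=512m  0  0
tmpfs  /var/log  tmpfs  rw,noatime,size=128m  0  0
```

The trade-off is that logs vanish on reboot, which is another reason to pipe them over the network to a centralized collector as mentioned above.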
The following section describes security anti-patterns exhibited in various environments where security for the sake of security has the effect of hampering development or, counter-intuitively, of making a system even less secure.
Some opensource software packages have had updates over the years that pertain to security, with the two main highlights being the requirement of certain permissions on dependent files and the requirement that the daemon run under (or not under) a particular account such as the root account, with both of these expectations being shameless anti-patterns.
First, requiring permissions for dependent files makes the software dependent on a filesystem that is able to store permissions in the first place, which does not cover all filesystems but rather a restricted set, most of them meant for multi-user systems.
For example, NTFS does not have a corresponding POSIX compatibility layer that would allow a mounted NTFS filesystem to store Linux or POSIX permissions, which means that an NTFS filesystem mounted on Linux will default to allowing all users to read and write each and every file. A good instance of the anti-pattern consists in daemons that require certain files on the filesystem to be given a certain set of permissions; for example, the MySQL daemon will refuse to read configuration files if they are world-writable even though the functionality of MySQL itself is not contingent upon the permissions of its configuration files. This leads to a problem where the MySQL health-check script healthcheck.sh
will never work for a MySQL daemon running on top of a Linux-mounted NTFS filesystem, even if the MySQL daemon runs within a Docker container, given that bind-mounts just pass permissions through. Instead, healthcheck.sh
will error with:
Warning: World-writable config file '/var/lib/mysql/.my-healthcheck.cnf' is ignored
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
Unknown error
Warning: World-writable config file '/var/lib/mysql/.my-healthcheck.cnf' is ignored
ERROR 1045 (28000): Access denied for user 'root'@'::1' (using password: NO)
healthcheck connect failed
and simply refuse to run at all.
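One hedged workaround, assuming the ntfs-3g driver, is to fake acceptable ownership and permissions at mount time rather than patch the software: the uid/gid and fmask/dmask options below make every file on the volume appear private to the mysql user, although the NTFS volume itself still stores no POSIX permissions (the device and mount point are illustrative):

```
# make all files on the NTFS volume appear owned by mysql, mode 0600/0700
mount -t ntfs-3g -o uid=$(id -u mysql),gid=$(id -g mysql),fmask=0177,dmask=0077 \
    /dev/sdb1 /var/lib/mysql
```

This satisfies the permission check without the filesystem actually being able to hold permissions, which rather proves the point that the check is cosmetic.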
Second, requiring that the daemon run under this-or-that privileged or unprivileged account is a matter of relativism given containerization. Typically, software runs within Docker containers as root because it is launched from the very init script that starts when the container starts, but the root account within the container does not map to root outside the container and, given that one of the explicit goals of containerization is privilege separation, the requirement from a software daemon to not run under a particular user, let alone root, is as superfluous as it is an impediment to interoperability with other components.
Lastly, both of these requirements massively decrease the portability of the software itself, with the code having to branch on the platform and, in case the software runs under Linux, add an exception to check whether this-or-that file has certain permissions or whether the daemon runs under one user in particular. This is very ugly, is based on assumptions about the layout of the operating system, precludes the goals of containerization and defeats the purpose of writing platform-agnostic portable code, just for the sake of cargo-culting some security trope.
Outsourcing security has the implicit effect of inheriting all the flaws, habits, incompetences and particularities of the target enterprise that the security is outsourced to. For example, Cloudflare's "security" is inherited from Project Honeypot, which long ago used to be a moderate blacklist of IP addresses containing machines across the Internet that were either known to be compromised or had participated in a recent attack on the Internet. Unfortunately, blacklists are not used at all anymore in 2025, not even by pirates, because they are unreliable given that TCP/IP defines an IP address as public information and not personally-identifying information. What happens often, for example, is that many regions of the planet have clients behind carrier-grade NAT, where a whole building or even a small settlement is routed through a single IP address, which makes attributing behavior to that IP address very shallow given that the address masks a large number of clients. We ourselves had to lower the security settings of Cloudflare many times because Cloudflare kept blacklisting Chinese IP addresses, but those addresses eventually migrated to other customers that then became blocked.
The same applies to hardware; for example, FortiGate, made by Fortinet, a company founded by two Chinese brothers, is a hardware-level firewall that can perform things such as deep-packet inspection or work as a transparent, seamless proxy. Ironically, the default blacklist supplied with FortiGate blocks all major human rights organizations such as transparency.org, for no reason in particular. Whilst that should be more or less irrelevant for most purposes, it does not match contexts where the hardware is meant to be used by people who work in human rights organizations, research or journalism, because the firewall blocks human rights activism by default.
Similarly, both solutions blanket-ban anonymizing networks such as "tor" or "i2p", or attempt to do so via deep-packet inspection, mainly because these companies have a contradictory dual stance where they monetize security while at the same time claiming to provide it, such that whilst an anonymizing network like "tor" might be secure, it also eludes the data collection that Cloudflare performs and is thus inconvenient business-wise. There is little reason to block anonymizing networks, except that they would not provide very good data if intercepted; in terms of bandwidth these networks are unable to carry out large-scale attacks and would collapse way before. For example, "tor" on its own does not even support torrenting, because the bandwidth would saturate the network to the point of being unusable, while "i2p" implements torrenting but only within its own "i2p" network, with no contact to outside trackers.
One global phenomenon that pertains both to computer security and to physical security in general is that, for most of these bulk-security companies, the very top-level organizations are granted in many cases blanket passthroughs to the point of being completely whitelisted. Google, for example, is a company that is mostly whitelisted and bypassed by blacklisting filters because it is deemed too large to be a security risk. This relative judgement ends up with funny consequences, such as Google vans being allowed right about anywhere, or spam filters outright whitelisting GMail to the point that some percentage of spam ends up coming from GMail itself, for the very bonus of being able to bypass most automated spam filters via the gratuitous reputation.
Both of these examples go to highlight that outsourcing security, particularly when the checks being implemented rely more on a matter of preference, is a heavy anti-pattern, and that anyone would be better off implementing their own security policy, protections and response mechanisms. This is why some of the more valuable blocklists are generated dynamically as attacks happen and have a fast expiry time per entry, such that they can be used as a first-response buffer without casting too tall a shadow onto IP addresses that are not stable by definition.
Copying large files is a time-consuming operation using regular command-line tools like cp, dd, copy, or just copying files using a graphical interface, because these tools are meant to perform a bit-exact copy of a file. The concept of "copying files" reaches back to Kernighan & Ritchie's "The C Programming Language", with the most basic example being found in chapter 1.5.1, "File Copying":
#include <stdio.h>

/* copy input to output; 1st version */
main()
{
    int c;

    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
}
which copies standard input to standard output by reading from stdin one character at a time (held in an int so that the out-of-band EOF value can be represented). Most tools, regardless how sophisticated, follow the same pattern more or less. For example, dd just adds some control in terms of skipping some bytes from the input before actually starting to copy, or copying an exact number of bytes and then terminating.
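The skip/count mechanics can be sketched on a throwaway file; the paths are scratch examples of our own:

```shell
# create a 10-byte sample file
printf 'abcdefghij' > /tmp/sample.bin
# copy 4 bytes starting at offset 3; everything else is K&R-style copying
dd if=/tmp/sample.bin of=/tmp/slice.bin bs=1 skip=3 count=4 2>/dev/null
cat /tmp/slice.bin   # prints "defg"
```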
Nevertheless, all these tools fail to account for the situation where two files already exist and the user would just like to synchronize the changes between a source file and a destination file. Imagine that two large files exist, such as two ISO files, and the user would like to copy or move one ISO file onto the other (perhaps the other one is "broken"). When the copy or move operation is started (moving files being more or less a matter of copying and then deleting the old file), regardless of whether the copying is performed on the console or using the graphical interface (like Explorer on Windows), the operations performed can be reduced to the simple K&R example where the ISO file is copied onto the other ISO file by copying all bytes.
In case bytes between the source and the destination ISO files are identical, the copying tool completely disregards this detail and just blindly copies the data over, overwriting the existing blocks with, well, the exact same data. While this might have been acceptable many decades ago, when storage was extremely expensive and scarce and files consisted mainly of documents or were short in length, nowadays copying existing data on top of the very same existing data is a very inefficient operation that, aside from the massive waste of time, also has the impact of reducing the lifespan of the storage medium.
Curiously, many decades ago a cracking utility named X-Copy appeared on the Commodore Amiga scene, boasting its capability of performing a sector-exact copy between two floppy disks. The utility of such a program pertained to recovering data from broken media or copying diskettes whose software copy protection would be spent if the diskette were copied the regular way. X-Copy had a set of features where it could skip identical sectors and perform exact copies of sectors by hashing and re-reading them until they were identical, and it quickly became a referential tool for copying diskettes, to the point that everyone had to have a copy of X-Copy.
In modern times where storage is abundant and files can be large, one can think of the problem in terms of streams where not only would there be bit-by-bit similarities between two large files but rather entire sequences or sub-sequences of bits between a source and destination would be identical, such that copying over data that already exists results in a waste of time and the burning of the life-cycle of the storage medium.
While the Commodore Amiga came and left like an alien landing on the planet and then vanishing, many years later a tool called rsync came to be, as if nothing using the same principles had ever existed, that managed to create differences between a source and destination and then only copy over those differences instead of all the data when it already existed on the destination. CVS, the predecessor to Subversion, itself a predecessor of the more modern Git, did not have the capability of telling the difference between two binary files, and Subversion was the first source-code management tool that managed to store only binary differences between revisions, such that when Subversion came out CVS was abandoned quickly, because repositories hosted on CVS would double in size every time binary files were committed between revisions (in fact, it became a rule of "good practice" to not commit binary data to CVS and to keep it separate, with only source code being committed to CVS). At the same time, the cracking scene promoted tools such as bsdiff, a tool that is able to create patches by comparing two files and then generating a binary delta between them, very much like the simple diff tool does for source code. Torrents are a great technological showcase with a built-in awareness of the bitwise layout of a file, such that large files are transferred by segmenting the file into pieces and then only requesting pieces of the file from peers, thereby speeding up the operation overall and eliminating single points of failure where the source might stall or go away.
With all things considered, all these tools had the capability to create a difference file between two binary files and only store or apply the difference such that already existing blocks that were the same between a source and destination would not have to be copied or updated in any way.
To this date, it is quite surprising that the notion of "copying files" has not been beefed up and that "innovation" in this area remains moot when the prior history is established and there exists a theoretical background as well as referential tools like rsync that are able to work with differences, whether binary deltas or entire source-destination comparisons, in order to minimize the expenditure and time spent copying files.
Windows, for example, regardless of its hyped-up releases of Windows 10 and then 11, does not innovate on this topic at all, and the "file copy" operation is the same as it ever was since the very start of "The C Programming Language" by Kernighan & Ritchie. Copying a large file on top of another large file on Windows, even if the files are very similar on a binary level, is a dumb operation where Windows sets a cursor at the beginning of the source file and then churns all the way to the end, copying everything from the source to the destination while disregarding any similarity between the files.
Luckily there are ways to perform partial copies between two files manually, by comparing them and generating file differences that can then be applied on top of the destination such that only the differences are changed. The tools to mention are the likes of bsdiff and xdelta3, of which xdelta3 is perhaps the most recent and is also available on Windows, where it is popular in the ROM cracking scene where people modify games. Using these tools is very similar from one to the other, with the workflow being along the lines of first creating a patch between two files and then applying the patch to the destination file.
For the purpose of copying over a large binary file, the procedure can then be reduced to two steps: first create a delta between the source and the destination, then apply the delta onto the destination, which describes theoretically the working mode of these tools. There are some properties here, namely that blocks already identical between source and destination are never rewritten, such that both time and write cycles are spent only on the actual differences.
One could go on but the gains seem clear, so here is an instantiation as an example of restoring a file from a previously stored BtrFS snapshot. Namely, when a snapshot exists, restoring a large file is, from the point of view of BtrFS, a matter of just copying it over from the snapshot folder; however, since the file is presumably very large (assume, for example, a terabyte-large file, or a disk image of hundreds of gigabytes) and we would like to restore the file without copying it entirely, a difference delta file is first created between the snapshotted or backed-up file and the current file:
xdelta3 -e -s /mnt/volume/.snapshots/20250811/S /mnt/volume/S /tmp/dS.xdelta3
where:
/mnt/volume/.snapshots/20250811/S is the path to a file within the BtrFS snapshot volume that the user wants to copy,
/mnt/volume/S is the path to an existing file onto which the user would like to copy the source file,
/tmp/dS.xdelta3 is the path to a file where the difference between /mnt/volume/.snapshots/20250811/S and /mnt/volume/S should be stored.
After the file /tmp/dS.xdelta3 is generated by the xdelta3 tool, it can be applied onto /mnt/volume/S, which is the destination file onto which the user wanted to copy the snapshotted or backed-up file /mnt/volume/.snapshots/20250811/S:
xdelta3 -d -f -s /mnt/volume/S /tmp/dS.xdelta3 /mnt/volume/S.new && mv /mnt/volume/S.new /mnt/volume/S
where:
/mnt/volume/S is the file to modify,
/tmp/dS.xdelta3 is the file containing the delta difference,
/mnt/volume/S.new is a temporary output file that is moved over the original once the patch succeeds, which amounts to an in-place patch (xdelta3 cannot safely read the source file while overwriting that very same file, hence the temporary).
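The whole restore can be wrapped in a small helper of our own making (not a tool from the text): it uses xdelta3 when available, decodes into a temporary file before moving it over the destination, falls back to a plain copy when xdelta3 is absent so the sketch stays self-contained, and verifies the result with cmp:

```shell
# restore $2 (live file) to the contents of $1 (snapshotted file)
restore_from_snapshot() {
  snap=$1; live=$2
  if command -v xdelta3 >/dev/null 2>&1; then
    delta=$(mktemp)
    # encode a delta that transforms the live file into the snapshot,
    # then decode it into a temporary file and move it into place
    xdelta3 -e -f -s "$live" "$snap" "$delta" &&
      xdelta3 -d -f -s "$live" "$delta" "$live.new" &&
      mv "$live.new" "$live"
    rm -f "$delta"
  else
    # no xdelta3 on this machine: degrade to a whole-file copy
    cp "$snap" "$live"
  fi
  # exit status 0 iff the live file now matches the snapshot
  cmp -s "$snap" "$live"
}
```

Writing to a temporary file and moving it into place is a deliberate design choice: a crash mid-decode leaves the original file untouched.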
Otherwise, the process of transferring two large directory trees whilst minding data that might already be transferred is covered by tools such as rsync that, iff. the -W (whole-file) parameter is omitted, will perform file checks and then update files based on their differences:
rsync -vaxHAKS --partial --append-verify --delete /source /destination
/source is a source directory,
/destination is a target directory.
The parameters --partial and --append-verify ensure that file transfers can be resumed in case they are interrupted: the flag --partial allows preserving partially transferred files (like large files when the transfer is interrupted, in case "rsync" is closed or crashes), such that when the transfer is issued again --append-verify will check whether the hash of the data in the source file matches the hash of the data in the interrupted partial file, and iff. the hashes match, "rsync" will resume transferring the file by appending to its end.
Note that rsync just linearly compares two files to determine where to seek into the partially transferred file in order to continue copying it. With --append-verify, in case there is a difference between the source and the partially transferred file, within the length of the partially transferred file, rsync will resort to transferring the whole file again without creating a patch file. An even more dangerous option that is deprecated is --append, which just blindly copies over the source file by appending to the end of the partially transferred file; this will generate non-equal copies between the source and destination in case either has been changed in the meanwhile, which is why --append-verify is preferred.
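The append-verify idea can be sketched in plain shell; the function below is our own illustration of the principle rather than rsync's actual implementation: it checks that the partial destination is a byte-exact prefix of the source and only then appends the remaining bytes, otherwise it degrades to a whole-file copy:

```shell
# resume a copy only if the partial destination is a prefix of the source
append_verify_copy() {
  src=$1; dst=$2
  size=$(wc -c < "$dst" 2>/dev/null || echo 0)
  if [ "$size" -gt 0 ] && cmp -s -n "$size" "$src" "$dst"; then
    # prefix matches: append only the bytes past the partial length
    dd if="$src" bs=1 skip="$size" >> "$dst" 2>/dev/null
  else
    # prefix differs or no partial file: fall back to a whole-file copy
    cp "$src" "$dst"
  fi
}
```

The cmp -n prefix check is exactly the "verify" part that the blind --append behaviour lacks.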
Teracopy on Windows manages to account for file pieces and then uses multiple threads in order to transfer larger files, more than likely with the theoretical hope that the copying threads will be distributed among the CPUs or cores, thereby, just like torrents, eliminating the possibility that a single thread gets outscheduled by the operating system and removing a single point of failure. However, as far as we know, Teracopy does not implement delta transfers, so when transferring already-existing files Teracopy more than likely acts like the Unix rsync tool, appending at the end of the file in case the partially transferred file matches the source up to its length, in order to implement its "resume" feature.
Historically, at the inception of the WWW, tools made for downloading, like "wget" (or later "curl"), did not even have a way to resume partial transfers, which made downloading a nightmarish operation when you were on dialup and the phone accidentally hung up, such that you'd have to dial the ISP again and restart the transfer from the very beginning. HTTP has since gained Range Requests (seek into a file and transfer part of it) and Chunked Transfers (transfer pieces), and wget got its -c parameter allowing a large transfer to be resumed. What is interesting about a tool like wget compared to Teracopy or rsync is that resuming in wget relies on first checking the local file's size, asking the HTTP server for the source file's size, performing some arithmetic and then determining from where the source file should be requested from the HTTP server in order to resume the transfer. However, to this date there is no built-in hashing à la "rsync" within the HTTP protocol, so compared to rsync, wget relies only on arithmetic to determine whether the source file has changed and, if it has, wget just restarts the download. This means that wget has the weakest resume of all the tools, because its resume operation is contingent upon the HTTP protocol, which only implements replies to requests for pieces of files, such that wget has no possibility to check the consistency of the transfer.
Maybe in future versions of HTTP delta transfers will be implemented, which is what current development is being geared towards with the notion of distributed websites that leverage torrent technology as a transfer technology. One very cute free tool for watching movies is Popcorn Time, which works a treat for streaming movies directly using torrent technology, because all the client has to do is transfer movie-file pieces in linear and contiguous slabs of data while prioritizing the pieces closer to the current playback cursor over the pieces found later in the file.