This post documents the changes I made to my ZFS server setup to fix slow hard disk access to my performance-sensitive datasets.
When you read random data from a hard disk, you are either lucky and the data is already in a cache, or unlucky and the disk has to seek to find it. The average seek time on my WD Red disks is 30 ms.
So although the disks are capable of perhaps hundreds of MB/s given an optimal read request, performance is very much lower for the typical read requests that one of my virtual machine clients makes against its virtual hard disk.
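To put a rough number on that, here is a back-of-envelope sketch. The 30 ms seek time is the figure quoted above; the 4 KiB request size is only an assumed example of a small random read, not something I measured.

```shell
#!/bin/sh
# Rough throughput of a disk that has to seek before every read.
#   $1: average seek time in milliseconds (30 ms quoted above)
#   $2: size of each random read in KiB (4 KiB is an assumed example)
random_read_kib_per_s() {
    seek_ms=$1
    req_kib=$2
    # reads per second = 1000 / seek_ms; integer arithmetic is close enough here
    echo $(( 1000 / seek_ms * req_kib ))
}

random_read_kib_per_s 30 4
```

With these inputs the estimate comes out at roughly 130 KiB/s, which is why a purely random workload feels nothing like the sequential figure on the data sheet.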
ZFS provides performance optimizations to help alleviate this. A ZIL (ZFS intent log) is written by ZFS before writing the data proper. This redundant writing provides integrity against loss of power part way through a write cycle, but it also increases the load on the hard disk.
The ZIL can be moved to a separate disk, in which case it is called an SLOG (separate log). Putting it on a faster disk (e.g. an SSD) can improve performance of the system as a whole by making writes faster. The SLOG doesn’t need to be very big: just the amount of data that would be written in a few seconds. With a quiet server, I see that the used space on my SLOG is 20 MB.
Secondly, there is a read cache. ZFS attempts to predict reads based on frequency of access, and caches data in something called an ARC. You can also provide a cache on an SSD (or NVMe) device, which is called a level 2 ARC (L2ARC). Adding an L2ARC effectively extends the size of the cache. On my server, when it’s not doing anything disk-intensive, I see a used L2ARC of about 50 GB.
A benefit of an SSD is that it has no physical seek time, so random reads perform much better than on a rotating disk. The transfer rate is limited by its electronics, not by the rotational speed of a physical platter.
NVMe drives have an advantage over SATA drives in that they use multi-lane PCIe interfaces, which raises the transfer rate substantially above the 6 Gb/s limit of today’s SATA.
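For reference, attaching a log device and a cache device to an existing pool looks roughly like the sketch below. The pool name and device paths are placeholders I made up for illustration, and the function only prints the commands rather than running them, so you can check them against your own system first.

```shell
#!/bin/sh
# Print (not run) the commands that attach an SLOG and an L2ARC to a pool.
#   $1: pool name, $2: device for the log, $3: device for the cache
# All three values used below are hypothetical placeholders.
show_cache_setup() {
    pool=$1; slog_dev=$2; l2arc_dev=$3
    echo "zpool add $pool log $slog_dev"      # separate intent log (SLOG)
    echo "zpool add $pool cache $l2arc_dev"   # level 2 ARC
    echo "zpool iostat -v $pool"              # per-device usage, including log and cache
}

show_cache_setup tank /dev/disk/by-id/EXAMPLE-part1 /dev/disk/by-id/EXAMPLE-part2
```

The last command is one way to see how much of the log and cache devices is actually in use.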
I wanted to improve my ZFS performance beyond the limitations of my Western Digital Red hard disks. Replacing the 16 TB mirrored pool (consisting of 4 x 8 TB disks, plus a spare) would take 17 x 2 TB SSDs (eight mirrored pairs plus a spare). A 2 TB Samsung Evo Pro disk costs £350 in early 2020, and is intended for server applications (5 years warranty or 2,400 TB written). At this cost, replacing the entire pool would be almost £6,000, which is way too expensive for me. Perhaps I’ll do this in years to come when the cost has come down.
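The arithmetic behind that figure can be sketched as follows, assuming two-way mirrors throughout. The function name and parameter order are just for illustration.

```shell
#!/bin/sh
# Number of SSDs and total cost to rebuild a pool from two-way mirrors.
#   $1: usable capacity wanted (TB)   $2: size of each SSD (TB)
#   $3: number of spares              $4: price per SSD (pounds)
ssd_pool_cost() {
    usable_tb=$1; disk_tb=$2; spares=$3; price=$4
    # each disk's worth of usable space needs two disks in a mirror
    disks=$(( usable_tb / disk_tb * 2 + spares ))
    echo "$disks disks, $(( disks * price )) pounds"
}

ssd_pool_cost 16 2 1 350
```

For 16 TB usable, 2 TB disks, one spare and £350 per disk, this gives 17 disks and £5,950.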
My current approach is to create a fast pool based on a single 2TB SSD, and host only those datasets that need the speed on this pool. The problem this approach then creates is that the 2TB SSD pool has no redundancy.
I already had a backup server in a different physical location. The main server wakes the backup server once a day and pushes an incremental backup of all ZFS datasets.
However, I wanted a local copy on the slower pool that could be synchronized with the fast pool fairly frequently and, more importantly, that I could revert to quickly (e.g. by switching a virtual machine hard disk) if the fast pool dataset was hosed.
So I decided to move the speed-critical datasets to the fast pool, and perform an hourly incremental backup to a copy in the slow pool.
I already used zrep to back up most of my datasets to my backup server.
I added zrep backups from the fast pool datasets to their slow pool backups. As all these datasets already had a backup on the backup server, I set the ZREPTAG environment variable to a new value “zrep-local” for this purpose so that zrep could treat the two backup destinations as distinct.
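As a small illustration of how the tag keeps the two destinations apart, the sketch below prints the sync command for each case. The "local" and "remote" labels and the dataset name are hypothetical; the only things taken from my setup are the ZREPTAG variable and the value zrep-local.

```shell
#!/bin/sh
# Print the zrep command line for one of the two backup destinations.
#   $1: "local" (slow pool copy) or "remote" (backup server)
#   $2: dataset on the fast pool
zrep_sync_cmd() {
    dest=$1; dataset=$2
    case $dest in
        # the local copy uses its own tag, so its snapshots and properties are kept separate
        local)  echo "ZREPTAG=zrep-local zrep sync $dataset" ;;
        # the backup server keeps zrep's default tag
        remote) echo "zrep sync $dataset" ;;
        *)      echo "unknown destination: $dest" >&2; return 1 ;;
    esac
}

zrep_sync_cmd local fast/vmdisk1
zrep_sync_cmd remote fast/vmdisk1
```

Because the tag is part of both the snapshot names and the property names that zrep writes, the two replication streams do not interfere with each other.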
“I added” above hides some subtlety. Zrep is not designed for local backup like this, even though it treats a “localhost” destination as something special. The zrep init command with a localhost destination creates a broken configuration in which zrep subsequently considers both the original and the backup to be masters. It is necessary to go one level under the hood of zrep to set the correct configuration, thus:
zrep changeconfig -f $fastPool/$1 localhost $slowPool/$1
zrep changeconfig -f -d $slowPool/$1 localhost $fastPool/$1
zrep sentsync $fastPool/$1@zrep-local_000001
zrep sync $fastPool/$1
A zrep backup can fail for various reasons, so it is worth keeping an eye on it and making sure that failures are reported to you via email. One cause of failure is some process having modified the backup destination. If the dataset is not mounted, such modification should not occur, but in my experience zrep found cause to complain anyway. So, as part of my local backup, I roll the destination back to its latest zrep snapshot before issuing a zrep sync.
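The rollback-then-sync step can be sketched like this. It is a minimal sketch, not my exact script: the -r flag on the rollback (which destroys any snapshots newer than the target), the use of mail(1) and the root recipient are all assumptions you should adapt.

```shell
#!/bin/sh
# Pick the newest zrep-local snapshot from a list of snapshot names on stdin.
# The counter in the name is fixed width (e.g. zrep-local_000001), so a plain sort orders them.
latest_zrep_snap() {
    grep '@zrep-local_' | sort | tail -n 1
}

# Roll the slow pool copy back to its newest zrep snapshot, then sync from the fast pool.
#   $1: fast pool, $2: slow pool, $3: dataset name without the pool prefix
local_backup() {
    fastPool=$1; slowPool=$2; ds=$3
    # -d 1 limits the listing to snapshots of this dataset only
    snap=$(zfs list -d 1 -t snapshot -H -o name "$slowPool/$ds" | latest_zrep_snap)
    # assumption: -r is acceptable here; it destroys snapshots newer than $snap
    [ -n "$snap" ] && zfs rollback -r "$snap"
    # assumption: a working mail(1); replace "root" with a real address
    ZREPTAG=zrep-local zrep sync "$fastPool/$ds" ||
        echo "zrep sync of $fastPool/$ds failed" | mail -s "zrep failure" root
}

# Usage (hypothetical names): local_backup fast slow vmdisk1
```

Nothing runs until local_backup is called, so the file can be sourced from a cron script and invoked once per dataset.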
Interaction with zfs-auto-snapshot
If you are running zfs-auto-snapshot on your system (and if not, why not?), this tool has two implications for local backup. Firstly, it attempts to modify your backup pool, which upsets zrep. Secondly, if you address the first problem, you end up with lots of zfs-auto-snapshot snapshots on the backup pool as there is then no reason why these should expire.
You solve the first problem by setting the ZFS property com.sun:auto-snapshot=false on all such datasets.
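A sketch of applying that to several backup datasets is below. The property zfs-auto-snapshot checks is com.sun:auto-snapshot; the dataset names are made up, and the function only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Print the command that excludes each backup dataset from zfs-auto-snapshot.
# Dataset names are read from stdin, one per line; nothing is executed.
show_disable_autosnap() {
    while read -r ds; do
        echo "zfs set com.sun:auto-snapshot=false $ds"
    done
}

# hypothetical dataset names
printf '%s\n' slow/vmdisk1 slow/vmdisk2 | show_disable_autosnap
```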
You solve the second problem by creating an equivalent of the zfs-auto-snapshot expire behaviour and running it on the slow pool after performing a backup.
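The expiry rule itself is simple: within one category, list the snapshots oldest first and destroy all but the newest N. The sketch below shows just that selection step on made-up snapshot names, without touching a pool. It relies on GNU head accepting a negative line count, which is an assumption about your platform.

```shell
#!/bin/sh
# Given snapshot names on stdin (oldest first), print the ones to destroy:
# everything in the named category except the last $keep entries.
#   $1: category label (e.g. hourly), $2: number of snapshots to keep
snaps_to_delete() {
    category=$1; keep=$2
    # GNU head: "-n -N" prints all but the last N lines
    grep "zfs-auto-snap_$category" | head -n "-$keep"
}

# Example with made-up snapshot names: three hourly snapshots, keep two
printf '%s\n' \
    'slow/vm@zfs-auto-snap_hourly-2020-03-01-0900' \
    'slow/vm@zfs-auto-snap_daily-2020-03-01-0925' \
    'slow/vm@zfs-auto-snap_hourly-2020-03-01-1000' \
    'slow/vm@zfs-auto-snap_hourly-2020-03-01-1100' |
    snaps_to_delete hourly 2
```

Only the 0900 hourly snapshot is selected for deletion; the daily snapshot is ignored because it belongs to a different category.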
The following code performs this operation:
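# process snapshots for the stated zfs-auto-snapshot category, keeping the stated number
process_category() {
    ds=$1; zfsCategoryLabel=$2; keep=$3
    # head --lines=-N (GNU) prints all but the last N lines, so the last $keep listed survive
    snapsToDelete=`zfs list -rt snapshot -H -o name $slowPool/$ds | grep $zfsCategoryLabel | head --lines=-$keep`
    for snap in $snapsToDelete; do
        zfs destroy $snap
    done
}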
# echo processing $ds
process_category $ds "frequent" 4
process_category $ds "hourly" 24
process_category $ds "daily" 7
process_category $ds "weekly" 4
process_category $ds "monthly" 12
# get list of datasets in fast pool
dss=`zfs get -r -s local -o name -H zrep-local:master $fastPool`
for ds in $dss
# remove pool name