I recently installed a new home NAS server. For data protection, all disks are to be encrypted using dm-crypt. However, performance was far from what I expected. After some searching I found the reason (and a proposed solution) in a very interesting article by Ignat Korchagin (a video talk about the topic is also available here). His analysis is very detailed and definitely worth a read.
So instead of repeating all that, I’ll concentrate on the practical aspects and add some real-world performance numbers.
Modifications
For now (until the required patches are accepted by the kernel community) we need to build and replace two kernel modules: a modified version of dm-crypt and xtsproxy. Details about how to do that can be found here.
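The linked instructions essentially boil down to building the two modules out of tree and swapping them in at runtime. A minimal sketch of what that looks like (module file names and paths are assumptions on my part; follow the linked guide for the real build steps):

# Close all encrypted mappings first, then unload the in-tree dm-crypt
linux # dmsetup remove_all
linux # rmmod dm_crypt
# Load the patched dm-crypt and the xtsproxy module (paths are examples)
linux # insmod ./dm-crypt.ko
linux # insmod ./xtsproxy.ko
# Verify both modules are present
linux # lsmod | grep -E 'dm_crypt|xtsproxy'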
Benchmarking
But what performance gain can be expected, and how can it be measured on a real-world system? First, let’s see what our hardware is capable of in the different areas involved.
Basic performance data
The first thing we’ll look at is crypto performance. Fortunately, the cryptsetup framework provides us with a simple benchmark (the following test was done on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with AES-NI available):
linux # cryptsetup benchmark -c aes-xts
# Tests are approximate using memory only (no storage IO).
# Algorithm | Key | Encryption | Decryption
aes-xts 256b 1431.9 MiB/s 1386.5 MiB/s
The second thing we care about is block I/O performance. We’ll be using a RAM disk for that, so we won’t be limited by disk or SSD performance. To do that, let’s create a 4 GB RAM disk at /dev/ram0 (hint: adapt that size to your system – I’d suggest about 50% of your total RAM):
linux # modprobe brd rd_nr=1 rd_size=$(( 4 * 1024 * 1024 ))
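Just to double-check that the RAM disk came up with the intended size (not part of the original steps, but a cheap sanity check; rd_size is given in KiB, so 4 * 1024 * 1024 KiB is exactly 4 GiB):

linux # blockdev --getsize64 /dev/ram0
4294967296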
And now use fio for some basic benchmarks (I reduced the output for better readability and added some inline comments).
linux # apt install fio
# 4k block I/O, read / write
linux # fio --filename=/dev/ram0 --readwrite=read --bs=4k --direct=1 --name=plain --runtime=60 --time_based
# read
<...>
lat (usec) : 2=99.83%, 4=0.13%, 10=0.03%, 20=0.01%, 50=0.01%
<...>
READ: bw=2541MiB/s (2664MB/s), 2541MiB/s-2541MiB/s (2664MB/s-2664MB/s), io=16.0GiB (17.2GB), run=6449-6449msec
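# The original doesn't show the write invocation; presumably it's the same
# command with --readwrite=write:
linux # fio --filename=/dev/ram0 --readwrite=write --bs=4k --direct=1 --name=plain --runtime=60 --time_based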
# write
<...>
lat (usec) : 2=99.86%, 4=0.11%, 10=0.03%, 20=0.01%, 50=0.01%
<...>
WRITE: bw=2484MiB/s (2604MB/s), 2484MiB/s-2484MiB/s (2604MB/s-2604MB/s), io=16.0GiB (17.2GB), run=6597-6597msec
So we get about 2.6 GB/s for both read and write operations (mixed read/write runs end up at roughly the same numbers when summed up).
                 | cryptsetup [MiB/s] | fio on /dev/ram0 [MB/s]
Read / Decrypt   | 1387               | 2664
Write / Encrypt  | 1432               | 2604
So the theoretical maximum of encrypted I/O is somewhere around 924 MB/s (see here for the explanation and the formula).
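I haven’t re-derived the linked formula, but the number is consistent with treating encryption and block I/O as two serial stages, so the combined throughput is limited to 1 / (1/crypto + 1/io). Plugging in the write-path numbers from the table above gives almost exactly the quoted value (the read path works out to roughly 912 MB/s the same way):

linux # awk 'BEGIN { print 1 / (1/1432 + 1/2604) }'
923.917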
Performance data with crypto applied
We’ll now run the same benchmarks against an encrypted block device on top of the RAM disk (using the default dm-crypt module):
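The exact mapping setup isn’t shown here; one plausible way to reproduce it is a plain dm-crypt mapping with a throwaway random key, matching the aes-xts / 256-bit setup benchmarked above, and then pointing fio at the mapper device (the mapping name is just an example):

linux # cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 --key-file /dev/urandom /dev/ram0 ram0_crypt
linux # fio --filename=/dev/mapper/ram0_crypt --readwrite=read --bs=4k --direct=1 --name=crypt --runtime=60 --time_based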
# read
<...>
lat (usec) : 10=0.34%, 20=88.26%, 50=11.37%, 100=0.02%, 250=0.01%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 4=0.01%
<...>
READ: bw=263MiB/s (276MB/s), 263MiB/s-263MiB/s (276MB/s-276MB/s), io=15.4GiB (16.6GB), run=60001-60001msec
# write
<...>
lat (usec) : 20=98.60%, 50=1.38%, 100=0.03%, 250=0.01%, 1000=0.01%
lat (msec) : 10=0.01%
<...>
WRITE: bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=13.0GiB (15.0GB), run=60001-60001msec
As you can see, we only get about 27% of the theoretically possible throughput – that’s really bad. For details on where this performance is lost and why, please consult Ignat’s article.
And now with the patches applied:
# read
<...>
lat (usec) : 10=99.94%, 20=0.06%, 50=0.01%, 100=0.01%
<...>
READ: bw=676MiB/s (709MB/s), 676MiB/s-676MiB/s (709MB/s-709MB/s), io=15.0GiB (17.2GB), run=24223-24223msec
# write
<...>
lat (usec) : 10=99.95%, 20=0.05%, 50=0.01%
<...>
WRITE: bw=612MiB/s (642MB/s), 612MiB/s-612MiB/s (642MB/s-642MB/s), io=15.0GiB (17.2GB), run=26744-26744msec
That’s still quite a bit lower than theory predicts, but at least we now get about 66% of the optimum – way better than before. And what’s even better: not only did the throughput increase massively, the latency also dropped by about a factor of two!
So in direct comparison:
              | Plain Kernel 5.4 | With Patches | Improvement
Read [MB/s]   | 276              | 709          | ~2.5×
Write [MB/s]  | 250              | 642          | ~2.5×
Latency [us]  | ~20              | ~10          | ~2×
This is about the same speedup as mentioned in the original article.