I recently installed a new home NAS server. For data protection, all disks are to be encrypted using dm-crypt. However, performance was far from what I expected. After some searching I found the reason (and a proposed solution) in a very interesting article by Ignat Korchagin (a video talk about the topic is also available here). His analysis is very detailed and definitely worth a read.
So instead of repeating all that, I’ll concentrate on the practical aspects and add some real-world performance numbers.
Modifications
For now (until the required patches are accepted by the kernel community) we need to build and replace two kernel modules: a modified version of dm-crypt and xtsproxy. Details about how to do that can be found here.
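The linked instructions essentially boil down to building the two modules out of tree and swapping them in at runtime. A minimal sketch of what that looks like (module file names and paths are assumptions on my part; follow the linked guide for the real build steps):

# Close all encrypted mappings first, then unload the in-tree dm-crypt
linux # dmsetup remove_all
linux # rmmod dm_crypt
# Load the patched dm-crypt and the xtsproxy module (paths are examples)
linux # insmod ./dm-crypt.ko
linux # insmod ./xtsproxy.ko
# Verify both modules are present
linux # lsmod | grep -E 'dm_crypt|xtsproxy'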
Benchmarking
But what performance gain can be expected, and how can it be measured on a real-world system? First, let’s see what our hardware is capable of in the different areas involved.
Basic performance data
The first thing we’ll look at is crypto performance. Fortunately, the cryptsetup framework provides us with a simple benchmark (the following test was done on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with AES-NI available):
linux # cryptsetup benchmark -c aes-xts
# Tests are approximate using memory only (no storage IO).
# Algorithm | Key | Encryption | Decryption
aes-xts 256b 1431.9 MiB/s 1386.5 MiB/s
The second thing we care about is block I/O performance. We’ll be using a RAM disk for that, so we won’t be limited by disk or SSD performance. To do that, let’s create a 4 GB RAM disk at /dev/ram0 (hint: adapt that size to your system – I’d suggest about 50% of your total RAM):
linux # modprobe brd rd_nr=1 rd_size=$(( 4 * 1024 * 1024 ))
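Just to double-check that the RAM disk came up with the intended size (not part of the original steps, but a cheap sanity check; rd_size is given in KiB, so 4 * 1024 * 1024 KiB is exactly 4 GiB):

linux # blockdev --getsize64 /dev/ram0
4294967296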
And now use fio for some basic benchmarks (I reduced the output for better readability and added some inline comments).
linux # apt install fio
# 4k block I/O, read / write
linux # fio --filename=/dev/ram0 --readwrite=read --bs=4k --direct=1 --name=plain --runtime=60 --time_based
# read
<...>
lat (usec) : 2=99.83%, 4=0.13%, 10=0.03%, 20=0.01%, 50=0.01%
<...>
READ: bw=2541MiB/s (2664MB/s), 2541MiB/s-2541MiB/s (2664MB/s-2664MB/s), io=16.0GiB (17.2GB), run=6449-6449msec
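# The original doesn't show the write invocation; presumably it's the same
# command with --readwrite=write:
linux # fio --filename=/dev/ram0 --readwrite=write --bs=4k --direct=1 --name=plain --runtime=60 --time_based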
# write
<...>
lat (usec) : 2=99.86%, 4=0.11%, 10=0.03%, 20=0.01%, 50=0.01%
<...>
WRITE: bw=2484MiB/s (2604MB/s), 2484MiB/s-2484MiB/s (2604MB/s-2604MB/s), io=16.0GiB (17.2GB), run=6597-6597msec
So we get about 2.6 GB/s for both read and write operations (mixed read/write runs end up at roughly the same numbers when summed up).
                 | cryptsetup [MiB/s] | fio on /dev/ram0 [MB/s]
Read / Decrypt   | 1387               | 2664
Write / Encrypt  | 1432               | 2604
So the theoretical maximum of encrypted I/O is somewhere around 924 MB/s (see here for the explanation and the formula).
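I haven’t re-derived the linked formula, but the number is consistent with treating encryption and block I/O as two serial stages, so the combined throughput is limited to 1 / (1/crypto + 1/io). Plugging in the write-path numbers from the table above gives almost exactly the quoted value (the read path works out to roughly 912 MB/s the same way):

linux # awk 'BEGIN { print 1 / (1/1432 + 1/2604) }'
923.917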
Performance data with crypto applied
We’ll now run the same benchmarks against an encrypted block device on top of the RAM disk (using the default dm-crypt module):
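The exact mapping setup isn’t shown here; one plausible way to reproduce it is a plain dm-crypt mapping with a throwaway random key, matching the aes-xts / 256-bit setup benchmarked above, and then pointing fio at the mapper device (the mapping name is just an example):

linux # cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 --key-file /dev/urandom /dev/ram0 ram0_crypt
linux # fio --filename=/dev/mapper/ram0_crypt --readwrite=read --bs=4k --direct=1 --name=crypt --runtime=60 --time_based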
# read
<...>
lat (usec) : 10=0.34%, 20=88.26%, 50=11.37%, 100=0.02%, 250=0.01%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 4=0.01%
<...>
READ: bw=263MiB/s (276MB/s), 263MiB/s-263MiB/s (276MB/s-276MB/s), io=15.4GiB (16.6GB), run=60001-60001msec
# write
<...>
lat (usec) : 20=98.60%, 50=1.38%, 100=0.03%, 250=0.01%, 1000=0.01%
lat (msec) : 10=0.01%
<...>
WRITE: bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=13.0GiB (15.0GB), run=60001-60001msec
As you can see, we only get about 27% of the theoretically possible throughput – that’s really bad. For details on where this performance is lost and why, please consult Ignat’s article.
And now with the patches applied:
# read
<...>
lat (usec) : 10=99.94%, 20=0.06%, 50=0.01%, 100=0.01%
<...>
READ: bw=676MiB/s (709MB/s), 676MiB/s-676MiB/s (709MB/s-709MB/s), io=15.0GiB (17.2GB), run=24223-24223msec
# write
<...>
lat (usec) : 10=99.95%, 20=0.05%, 50=0.01%
<...>
WRITE: bw=612MiB/s (642MB/s), 612MiB/s-612MiB/s (642MB/s-642MB/s), io=15.0GiB (17.2GB), run=26744-26744msec
That’s still quite a bit lower than theory predicts, but at least we now get about 66% of the optimum – way better than before. And what’s even better: not only did the throughput increase massively, the latency also dropped by about a factor of two!
So in direct comparison:
              | Plain Kernel 5.4 | With Patches | Improvement
Read [MB/s]   | 276              | 709          | ~2.5×
Write [MB/s]  | 250              | 642          | ~2.5×
Latency [us]  | ~20              | ~10          | ~2×
This is about the same speedup as mentioned in the original article.