[1/2] GPG & performance: a deep-dive
From: "Robin H. Johnson" <email@example.com>
Date: Sun, 19 Jun 2016 08:39:04 +0000

Hi all,

I've been looking at backup options for a deployment, and in considering obnam, I like its general speed, but found that it dropped unacceptably when encryption was enabled.

TL;DR: suggestions
- Right now: set '-z 0' in obnam's symmetric crypto call, for an immediate 10% performance boost.
- Plan for moving to PyCrypto or other for symmetric crypto.

A first-pass examination pointed strongly to obnam's use of GPG symmetric encryption. I improved the obnam-benchmark tool to help take the measurements below; the changes are on GitHub (URL at the end of this mail). But first, let's look at how GPG does symmetric encryption.

GPG symmetric encryption (S2K) does the following:
- takes a passphrase & data input,
- optionally transforms the passphrase (see s2k-digest-algo, s2k-mode, s2k-count),
- optionally compresses the input (compress-level),
- enciphers the output (see s2k-cipher-algo),
- emits the output in the S2K structure, which also records all of the above s2k-* parameters.

Stock obnam simply calls 'gpg -c' for symmetric encryption. In the absence of any other configuration, this generally has the following defaults:
- s2k-digest-algo=SHA1
- s2k-mode=3 (key stretching by repeated hashing)
- s2k-count=varies, my systems are 25M..65M
- compress-level=6 (ZLIB level 6)
- s2k-cipher-algo=AES128

It uses the cipher in a modified CFB mode [RFC4880, sec 5.7, ... "Tag 9"].

Naively, you might think that GPG is fast enough. Sure, take a 1GB incompressible input, as a single file:

   0.6s | cat in > out
   0.8s | gpg --store -z 0
  22.3s | gpg --store -z 6
   5.6s | gpg --symmetric -z 0
  27.4s | gpg --symmetric -z 6   # Default settings!

S2K packet encoding: 33% slower
Compression, used by default: 5-28x performance hit
Symmetric enciphering: ~7x performance hit
Overall: 45x slower

The catch is that obnam calls gpg many, many times, with much smaller inputs, so we have to pay the startup costs many times over.
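The summary figures above can be re-derived from the five timings. The following is a minimal sketch of that arithmetic only; the dictionary keys and the helper name `ratio` are made up for illustration, and the numbers are the ones quoted in the mail, not new measurements.

```python
# Wall-clock seconds for the 1GB incompressible input, as quoted in the mail.
timings = {
    "cat": 0.6,
    "store -z 0": 0.8,
    "store -z 6": 22.3,
    "symmetric -z 0": 5.6,
    "symmetric -z 6": 27.4,
}


def ratio(slow: str, fast: str) -> float:
    """How many times longer `slow` takes than `fast` (hypothetical helper)."""
    return timings[slow] / timings[fast]


# Packet encoding alone: gpg --store without compression vs. plain cat.
packet_overhead = ratio("store -z 0", "cat") - 1.0
# Compression cost, measured without and with the cipher in the pipeline.
compression_on_store = ratio("store -z 6", "store -z 0")
compression_on_symmetric = ratio("symmetric -z 6", "symmetric -z 0")
# Cipher cost with compression switched off.
enciphering = ratio("symmetric -z 0", "store -z 0")
# Default 'gpg -c' behaviour vs. plain cat.
overall = ratio("symmetric -z 6", "cat")

print(f"packet encoding: {packet_overhead:.0%} slower")
print(f"compression: {compression_on_symmetric:.1f}x to {compression_on_store:.1f}x")
print(f"enciphering: {enciphering:.1f}x")
print(f"overall: {overall:.1f}x")
```

The two compression ratios differ so much because the same compression cost is divided by a much larger baseline once the cipher is already in the pipeline.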
I set out to measure the cost breakdown of using gpg:
- exec overhead
- S2K packet overhead
- symmetric encryption
- S2K compression

With the stock codebase, the gpg encryption plugin has approximately this performance effect for me:
- On the many_files benchmark, it's only a 20% hit, but there are only 256 unique values.
- On the big_file benchmark, it's ~45x slower (than cat).

First, the reference/stock runs:
A: rsync -a live backup && rsync -a backup restore
B: stock obnam, run with obnam-benchmark, production.yaml, no encryption
C: stock as B, with gpg encryption (compressed symmetric encryption)

Now the modified code variants:
W: gpg, symmetric encryption, uncompressed
X: gpg, s/--symmetric/--store/, compressed
Y: gpg, s/--symmetric/--store/, uncompressed
Z: HACK gpg-symmetric to just return the raw block
PYC: quick hack for PyCrypto AES-256-CTR

B->Z: obnam's overhead in asymmetric encryption only.
Z->Y: this is the overhead added by obnam using GPG symmetric encryption (on top of the asymmetric management).
Y->X, C->W: this is the overhead added by S2K compression, on stored and on enciphered data respectively.
Y->W: this is the overhead added by enciphering, with no compression.

Timing data:
------------
In seconds, average of 3 runs.
B1 = many_files
B2 = one_big_file

      Benchmark
Test |   B1  |   B2
-----+-------+-------
A    |  15.0 |   5.1
B    | 225.4 |  10.2
C    | 284.8 | 272.5
-----+-------+-------
W    | 288.3 | 246.4
X    | 266.1 |  57.7
Y    | 266.4 |  33.9
Z    | 249.9 |  14.4
-----+-------+-------
PYC  | 236.4 |  33.6
-----+-------+-------

The obnam-benchmark changes are at:
https://github.com/robbat2/obnam-benchmarks/tree/robbat2/flexibility
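To make the step-by-step overheads easier to read off the timing table, the differences between runs can be computed directly. This is only a sketch of the subtraction: the names `runs`, `steps` and `delta` are illustrative, the step descriptions paraphrase the mail, and the compression-on-cipher step is taken here as W to C (uncompressed to compressed).

```python
# (B1 many_files, B2 one_big_file) in seconds, copied from the timing table.
runs = {
    "A": (15.0, 5.1),
    "B": (225.4, 10.2),
    "C": (284.8, 272.5),
    "W": (288.3, 246.4),
    "X": (266.1, 57.7),
    "Y": (266.4, 33.9),
    "Z": (249.9, 14.4),
    "PYC": (236.4, 33.6),
}

# (from, to, what the step adds); descriptions paraphrase the mail.
steps = [
    ("B", "Z", "asymmetric key management only"),
    ("Z", "Y", "calling gpg and packet framing"),
    ("Y", "X", "compression, no cipher"),
    ("Y", "W", "cipher, no compression"),
    ("W", "C", "compression on top of the cipher"),
    ("B", "PYC", "PyCrypto AES-256-CTR hack vs. no encryption"),
]


def delta(src: str, dst: str) -> tuple:
    """Extra seconds in run `dst` compared with run `src`, per benchmark."""
    return tuple(round(d - s, 1) for s, d in zip(runs[src], runs[dst]))


for src, dst, label in steps:
    b1, b2 = delta(src, dst)
    print(f"{src:>3} -> {dst:<3} B1 {b1:+7.1f}s  B2 {b2:+7.1f}s  {label}")
```

Read this way, the cipher step (Y to W) accounts for most of the one_big_file cost, while on many_files no single step stands out.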
From: Jan Niggemann <firstname.lastname@example.org>
Date: Sun, 19 Jun 2016 15:08:51 +0200

Robin,

I don't understand everything you wrote, but I can see the effort you put into this. If I understand correctly, this will boost backing up large files by orders of magnitude and speed up backing up lots of smaller files to some degree. This definitely sounds like something worth considering.

Thank you very much, I'm sure Lars will get in touch with you!

Jan
From: Adrien CLERC <email@example.com>
Date: Sun, 19 Jun 2016 22:53:37 +0200

On 19/06/2016 at 10:39, Robin H. Johnson wrote:
> - Plan for moving to PyCrypto or other for symmetric crypto

I totally understand the goal, and can only support you. However, PyCrypto is not compatible with PyPy, whereas 'Cryptography' is. That's why paramiko made the switch recently (http://www.paramiko.org/changelog.html). Based on https://github.com/paramiko/paramiko/pull/394 it seems that the switch is better in every respect. Do you have the time to look into this solution?

Adrien

_______________________________________________
obnam-support mailing list
firstname.lastname@example.org
http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org
From: "Robin H. Johnson" <email@example.com>
Date: Tue, 21 Jun 2016 22:43:29 +0000

On Sun, Jun 19, 2016 at 10:53:37PM +0200, Adrien CLERC wrote:
> On 19/06/2016 at 10:39, Robin H. Johnson wrote:
> > - Plan for moving to PyCrypto or other for symmetric crypto
> I totally understand the goal, and can only support you. However,
> PyCrypto is not compatible with PyPy, as 'Cryptography' is. That's why
> paramiko did the switch recently (http://www.paramiko.org/changelog.html).
> Based on https://github.com/paramiko/paramiko/pull/394 it seems that the
> switch is better on every aspect. Do you have the time to inspect this
> solution?

If you look at email 2/2, the big concern isn't which toolkit we pick, but that the data can be linked to the correct toolkit when it comes time to decrypt it later, and that old chunks are also still correctly handled.

I have used Cryptography.io elsewhere, and it was on par with PyCrypto performance for what I was doing, but I didn't benchmark heavily.

At that point, with files including a header that describes how we need to handle them, encryption.py can become an interface, and we can implement many variations.

Example variations:
1. Pure GPG (present state)
2. GPG for asymmetric, PyCrypto/Cryptography.io/PyNaCl for symmetric
3. Pure PyCrypto/Cryptography.io/PyNaCl
4. Upgrade path variations - Read ANY, write Y

#2 would be a trivial upgrade for existing repos: just upgrade all of your clients and enable it. Old chunks would continue to be encrypted with GPG, and still readable, while new chunks are generated much faster.

If you don't care about the above, you can have a PyCrypto or Cryptography.io implementation tomorrow, but recovering your data in future disasters will be much more painful.
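As a rough illustration of the "header that describes how to handle a chunk" idea and the "read ANY, write Y" upgrade path, here is a minimal sketch. Everything in it is hypothetical: the `OBCH` marker, the one-byte backend id, and the function names are invented for this example and are not obnam's format or API. The two backends are deliberately toy stand-ins that perform NO real encryption; a real implementation would plug GPG or a library cipher in behind the same interface.

```python
import struct
from typing import Callable, Dict, NamedTuple

MAGIC = b"OBCH"                 # hypothetical marker, not an obnam format
HEADER = struct.Struct(">4sB")  # marker + one-byte backend id


class Backend(NamedTuple):
    encrypt: Callable[[bytes], bytes]
    decrypt: Callable[[bytes], bytes]


def _xor_ff(data: bytes) -> bytes:
    """Toy reversible transform. NOT encryption, only a placeholder."""
    return bytes(b ^ 0xFF for b in data)


# Registry of every backend a client can still read.
BACKENDS: Dict[int, Backend] = {
    1: Backend(lambda d: d, lambda d: d),  # stand-in for the legacy GPG path
    2: Backend(_xor_ff, _xor_ff),          # stand-in for a faster library cipher
}


def write_chunk(data: bytes, backend_id: int) -> bytes:
    """Write with the one configured backend, tagging the chunk with its id."""
    backend = BACKENDS[backend_id]
    return HEADER.pack(MAGIC, backend_id) + backend.encrypt(data)


def read_chunk(blob: bytes) -> bytes:
    """Read a chunk written by any registered backend, chosen via the header."""
    magic, backend_id = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("chunk has no recognised header")
    if backend_id not in BACKENDS:
        raise ValueError(f"unknown backend id {backend_id}")
    return BACKENDS[backend_id].decrypt(blob[HEADER.size:])


old_chunk = write_chunk(b"backup data", backend_id=1)  # written before upgrade
new_chunk = write_chunk(b"backup data", backend_id=2)  # written after upgrade
print(read_chunk(old_chunk) == read_chunk(new_chunk))
```

The point of the design is that the reader never has to guess: the chunk itself names the backend, so old and new chunks can coexist in one repository.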
From: Adrien CLERC <firstname.lastname@example.org>
Date: Wed, 22 Jun 2016 10:29:21 +0200

On 22/06/2016 at 00:43, Robin H. Johnson wrote:
> Example variations:
> 1. Pure GPG (present state)
> 2. GPG for asymmetric, PyCrypto/Cryptography.io/PyNaCl for symmetric
> 3. Pure PyCrypto/Cryptography.io/PyNaCl
> 4. Upgrade path variations - Read ANY, write Y

This seems like good thinking. I also think that #2 is a good option for migration and compatibility (for various reasons, people might trust one piece of software and not another).

Adrien