[1/2] GPG & performance: a deep-dive

From: "Robin H. Johnson" <robbat2@gentoo.org>
Date: Sun, 19 Jun 2016 08:39:04 +0000

   Hi all,
   
   I've been looking at backup options for a deployment, and in considering
   obnam, I like its general speed, but found that it dropped unacceptably
   when encryption was enabled.
   
   TL;DR: suggestions
   - Right now: set '-z 0' in obnam's symmetric crypto call, for an
     immediate 10% performance boost.
   - Plan for moving to PyCrypto or other for symmetric crypto
   
   A first-pass examination pointed strongly to obnam's use of GPG
   symmetric encryption.
   
   I improved the obnam-benchmark tool to help take these measurements
   below, the changes are on GitHub [1]; but first let's look at how GPG
   does symmetric encryption.
   
   GPG symmetric encryption (S2K) does the following:
   - takes a passphrase & data input,
   - optionally transforms the passphrase.
     (see s2k-digest-algo, s2k-mode, s2k-count)
   - optionally compresses the input
     (compress-level)
   - enciphers the result
     (see s2k-cipher-algo)
   - emits the output in the S2K structure
     This records all of the above s2k-* parameters as well.
   
   Stock obnam simply calls 'gpg -c' for symmetric encryption.
   In the absence of any other configuration, this generally has the
   following defaults:
   - s2k-digest-algo=SHA1
   - s2k-mode=3 (key stretching by repeated hashing)
   - s2k-count=varies, my systems are 25M..65M
   - compress-level=6 (ZLIB level 6)
   - s2k-cipher-algo=AES128
   
   It uses the cipher in a modified CFB mode [RFC4880, sec 5.7, ... "Tag 9"]
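   
   As a rough illustration of the defaults listed above, the sketch below
   builds a gpg command line that spells them out explicitly. The helper
   name gpg_symmetric_argv is hypothetical and is not obnam code. The
   flags are standard GnuPG options, and "AES" is GnuPG's name for
   AES-128. Passphrase handling is deliberately left out, and s2k-count
   is only passed when given, since its default varies per system.

```python
def gpg_symmetric_argv(compress_level=6, cipher="AES", digest="SHA1",
                       s2k_mode=3, s2k_count=None):
    """Build (but do not run) a 'gpg --symmetric' command line.

    Hypothetical helper for illustration only. The defaults mirror the
    values listed in the message; real defaults depend on the gpg build
    and on the user's configuration.
    """
    argv = [
        "gpg", "--batch", "--symmetric",
        "--s2k-cipher-algo", cipher,
        "--s2k-digest-algo", digest,
        "--s2k-mode", str(s2k_mode),
    ]
    if s2k_count is not None:
        # Only pin the iteration count when the caller asks for it.
        argv += ["--s2k-count", str(s2k_count)]
    argv += ["--compress-level", str(compress_level)]
    return argv


# Defaults as described, then the '-z 0' equivalent with compression off.
print(" ".join(gpg_symmetric_argv()))
print(" ".join(gpg_symmetric_argv(compress_level=0)))
```

   Setting compress_level=0 here corresponds to the '-z 0' suggestion in
   the TL;DR.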
   
   Naively, you might think that GPG is fast enough. To check, take a 1GB
   incompressible input as a single file (wall-clock times):
   
    0.6s | cat in > out
    0.8s | gpg --store -z 0
   22.3s | gpg --store -z 6
    5.6s | gpg --symmetric -z 0
   27.4s | gpg --symmetric -z 6 # Default settings!
   
   S2K packet encoding: 33% slower
   Compression, used by default: 5-28x performance hit
   Symmetric enciphering: ~7x performance hit
   Overall: 45x slower
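   
   The ratios above can be re-derived from the five timings. This is just
   the arithmetic, written out as a small Python check; the dictionary
   keys are illustrative names, not gpg options.

```python
# Wall-clock seconds for the 1GB incompressible input, from the table above.
timings = {
    "cat": 0.6,
    "store_z0": 0.8,    # gpg --store -z 0
    "store_z6": 22.3,   # gpg --store -z 6
    "sym_z0": 5.6,      # gpg --symmetric -z 0
    "sym_z6": 27.4,     # gpg --symmetric -z 6 (default settings)
}


def ratio(slow, fast):
    """How many times longer 'slow' took than 'fast'."""
    return timings[slow] / timings[fast]


print("packet encoding:      %.2fx" % ratio("store_z0", "cat"))
print("compression (store):  %.1fx" % ratio("store_z6", "store_z0"))
print("compression (cipher): %.1fx" % ratio("sym_z6", "sym_z0"))
print("enciphering:          %.1fx" % ratio("sym_z0", "store_z0"))
print("overall:              %.1fx" % ratio("sym_z6", "cat"))
```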
   
   The catch is that obnam calls gpg many many times, with much smaller
   inputs, so we have to pay the startup costs many times over.
   
   I set out to measure the cost breakdown of using gpg:
   - exec overhead
   - S2K packet overhead
   - symmetric encryption
   - S2K compression
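   
   One way to get a feel for the exec overhead on its own is to time how
   long it takes to spawn a trivial child process repeatedly. The sketch
   below uses the Python interpreter as a stand-in child so that it runs
   without gpg installed. That substitution is an assumption: a real gpg
   invocation would also include gpg's own startup work, so this is not
   the measurement method used for the numbers in this mail.

```python
import subprocess
import sys
import time


def mean_spawn_seconds(argv, runs=10):
    """Average wall-clock time to start argv and wait for it to exit."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Stand-in child that does nothing; replace with a gpg command line
    # to include gpg's startup cost as well.
    child = [sys.executable, "-c", "pass"]
    print("mean spawn time: %.4f s" % mean_spawn_seconds(child))
```

   Multiplying the per-spawn figure by the number of chunks written gives
   a rough lower bound on what many small gpg calls cost before any
   crypto happens.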
   
   With the stock codebase, the gpg encryption plugin has approximately
   this performance effect for me:
   - On the many_files benchmark, it's only a 20% hit, but there are only
     256 unique values.
   - On the big_file benchmark, it's ~45x slower (than cat).
   
   First, the reference/stock runs:
   A: rsync -a live backup && rsync -a backup restore
   B: stock obnam, run with obnam-benchmark, production.yaml, no encryption
   C: stock as B, with gpg encryption (compressed symmetric encryption)
   
   Now the modified code variants:
   W: gpg, symmetric encryption, uncompressed
   X: gpg, s/--symmetric/--store/, compressed
   Y: gpg, s/--symmetric/--store/, uncompressed
   Z: HACK gpg-symmetric to just return the raw block
   PYC: quick hack for PyCrypto AES-256-CTR
   
   B->Z: obnam's overhead in asymmetric encryption only.
   Z->Y: this is the overhead added by obnam using GPG symmetric encryption
         (on top of the asymmetric management).
   Y->X, 
   C->W: this is the overhead added by S2K compression, on the stored and
         the enciphered variants respectively.
   Y->W: this is the overhead added by enciphering, with no compression
   
   Timing data:
   ------------
   in seconds, average of 3 runs.
   B1 = many_files
   B2 = one_big_file
   
         Benchmark
   Test|  B1  |  B2
   ----+------+-------
   A   |  15.0|   5.1
   B   | 225.4|  10.2
   C   | 284.8| 272.5
   ----+------+------
   W   | 288.3| 246.4
   X   | 266.1|  57.7
   Y   | 266.4|  33.9
   Z   | 249.9|  14.4
   ----+------+------
   PYC | 236.4|  33.6
   ----+------+------
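   
   The per-step overheads described before the table can be read off as
   differences between rows. A small sketch of that subtraction follows;
   the C/W pair is taken in the W to C direction here so that every delta
   reads as cost added by the later variant. Names are illustrative.

```python
# Seconds (average of 3 runs) as (many_files, one_big_file), from the table.
timings = {
    "A": (15.0, 5.1),
    "B": (225.4, 10.2),
    "C": (284.8, 272.5),
    "W": (288.3, 246.4),
    "X": (266.1, 57.7),
    "Y": (266.4, 33.9),
    "Z": (249.9, 14.4),
    "PYC": (236.4, 33.6),
}

# (from, to, what the step adds)
steps = [
    ("B", "Z", "obnam's asymmetric encryption management"),
    ("Z", "Y", "calling gpg at all (--store, uncompressed)"),
    ("Y", "X", "compression on stored data"),
    ("Y", "W", "enciphering, no compression"),
    ("W", "C", "compression on enciphered data"),
]


def delta(a, b):
    """Seconds added going from variant a to variant b, per benchmark."""
    return tuple(round(tb - ta, 1) for ta, tb in zip(timings[a], timings[b]))


for a, b, label in steps:
    many, big = delta(a, b)
    print("%s->%s  B1 %+7.1f  B2 %+7.1f  %s" % (a, b, many, big, label))
```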
   
   [1] https://github.com/robbat2/obnam-benchmarks/tree/robbat2/flexibility
From: Jan Niggemann <jn@hz6.de>
Date: Sun, 19 Jun 2016 15:08:51 +0200

   Robin,
   
   I don't understand everything you wrote, but I can see the effort you  
   put into this.
   If I understand correctly, then this will boost backing up large files  
   by orders of magnitude and speed up backing up lots of smaller files  
   by some degree. This definitely sounds like something worth  
   considering.
   
   Thank you very much, I'm sure Lars will get in touch with you!
   Jan
From: Adrien CLERC <adrien@antipoul.fr>
Date: Sun, 19 Jun 2016 22:53:37 +0200

   On 19/06/2016 at 10:39, Robin H. Johnson wrote:
   > - Plan for moving to PyCrypto or other for symmetric crypto
   I totally understand the goal, and can only support you. However,
   PyCrypto is not compatible with PyPy, whereas 'Cryptography' is. That's why
   paramiko switched recently (http://www.paramiko.org/changelog.html).
   Based on https://github.com/paramiko/paramiko/pull/394 it seems that the
   switch is better in every respect. Do you have the time to look into this
   solution?
   
   Adrien
   
   _______________________________________________
   obnam-dev mailing list
   obnam-dev@obnam.org
   http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-dev-obnam.org
From: "Robin H. Johnson" <robbat2@gentoo.org>
Date: Tue, 21 Jun 2016 22:43:29 +0000

   On Sun, Jun 19, 2016 at 10:53:37PM +0200, Adrien CLERC wrote:
   > On 19/06/2016 at 10:39, Robin H. Johnson wrote:
   > > - Plan for moving to PyCrypto or other for symmetric crypto
   > I totally understand the goal, and can only support you. However,
   > PyCrypto is not compatible with PyPy, whereas 'Cryptography' is. That's why
   > paramiko switched recently (http://www.paramiko.org/changelog.html).
   > Based on https://github.com/paramiko/paramiko/pull/394 it seems that the
   > switch is better in every respect. Do you have the time to look into this
   > solution?
   If you look at email 2/2, the big concern isn't which toolkit we pick,
   but that the data can be linked to the correct toolkit when it comes
   time to decrypt it later, and that old chunks are also still correctly
   handled.
   
   I have used Cryptography.io elsewhere, and it was on par with PyCrypto's
   performance for what I was doing, but I didn't benchmark heavily.
   
   At that point, with files including a header that describes how we need
   to handle them, encryption.py can become an interface, and we can
   implement many variations. 
   
   Example variations:
   1. Pure GPG (present state)
   2. GPG for asymmetric, PyCrypto/Cryptography.io/PyNaCl for symmetric
   3. Pure PyCrypto/Cryptography.io/PyNaCl
   4. Upgrade path variations - Read ANY, write Y
   
   #2 would be a trivial upgrade for existing repos: just upgrade all of
   your clients and enable it. Old chunks would continue to be encrypted
   with GPG, and still readable, while new chunks are generated much
   faster.
   
   If you don't care about the above, you can have a PyCrypto or
   Cryptography.io implementation tomorrow, but recovering your data in
   future disasters will be much more painful.
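   
   To make the header idea above concrete, here is a minimal sketch of
   "read ANY, write Y": each new chunk carries a small header naming the
   backend that wrote it, and chunks without a header are treated as
   legacy. Everything in it is hypothetical: the OBX1 marker, the header
   layout, and the class and function names are not obnam's format or
   API. The two backends are trivial reversible transforms standing in
   for GPG and for a library cipher; they are NOT encryption.

```python
MAGIC = b"OBX1"  # hypothetical marker; not an actual obnam format


class LegacyBackend:
    """Placeholder for the existing GPG path. NOT encryption."""
    name = b"legacy"

    def encode(self, data):
        return data[::-1]

    def decode(self, data):
        return data[::-1]


class FastBackend:
    """Placeholder for a library-based cipher. NOT encryption."""
    name = b"fast"

    def encode(self, data):
        return bytes(b ^ 0x5A for b in data)

    def decode(self, data):
        return bytes(b ^ 0x5A for b in data)


# Read ANY: every known backend stays registered for decoding.
REGISTRY = {b.name: b for b in (LegacyBackend(), FastBackend())}
# Write Y: new chunks use exactly one configured backend.
WRITER = REGISTRY[b"fast"]


def wrap(plaintext, backend=WRITER):
    """Prefix the encoded chunk with MAGIC, name length, and backend name."""
    header = MAGIC + bytes([len(backend.name)]) + backend.name
    return header + backend.encode(plaintext)


def unwrap(blob):
    """Decode a chunk, choosing the backend from its header."""
    if not blob.startswith(MAGIC):
        # No header: assume a chunk written before headers existed.
        return REGISTRY[b"legacy"].decode(blob)
    name_len = blob[len(MAGIC)]
    start = len(MAGIC) + 1
    name = blob[start:start + name_len]
    return REGISTRY[name].decode(blob[start + name_len:])


old_chunk = REGISTRY[b"legacy"].encode(b"written before the upgrade")
new_chunk = wrap(b"written after the upgrade")
print(unwrap(old_chunk))
print(unwrap(new_chunk))
```

   The sketch assumes a legacy chunk never happens to begin with the
   marker bytes; a real format would need a more robust way to tell
   headerless chunks apart.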
From: Adrien CLERC <adrien@antipoul.fr>
Date: Wed, 22 Jun 2016 10:29:21 +0200

   On 22/06/2016 at 00:43, Robin H. Johnson wrote:
   > Example variations:
   > 1. Pure GPG (present state)
   > 2. GPG for asymmetric, PyCrypto/Cryptography.io/PyNaCl for symmetric
   > 3. Pure PyCrypto/Cryptography.io/PyNaCl
   > 4. Upgrade path variations - Read ANY, write Y
   This seems like good thinking. I also think that #2 is a good option for
   migration and compatibility (for some reason, people might trust one
   piece of software and not another).
   
   Adrien
   