Variable Chunksize

11c32688f6ae4c039e5fef65b5007a88
ATHENS REPLICA BOOKSHELF

From: "Laurence Perkins (OE)" <lperkins@openeye.net>
Date: Wed, 19 Jul 2017 17:23:16 +0000

   Stumbled across a couple of rsync-style backup programs that use
   variable-sized chunks to improve backup performance.  It might be
   worth considering adding this as an option.
   
   The concept is actually relatively simple.  Instead of splitting chunks
   at fixed sizes, chunks are split wherever the data matches some
   heuristic.  The common way to do it is to run a byte-by-byte hash over
   the stream and put a chunk boundary wherever the hash meets some
   criterion.  With a bit of statistics you can set what the average chunk
   size will be just by tweaking the criterion slightly.  (So, read, say,
   the first 1KB and hash it; then drop the first byte from the hashing
   window, add byte 1KB+1 to the end of it, and repeat.  You walk over the
   data a byte at a time while still keeping enough data in the hash input
   to produce interesting output.  Split the chunk wherever the first X
   bytes of the hash are zero, or whatever's convenient.)
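
   A rough sketch of that splitting loop, using a simple polynomial
   rolling hash so the window doesn't have to be re-hashed from scratch
   for every byte.  The window size, multiplier, and split criterion
   below are placeholders rather than tuned values, and "low bits of the
   hash are all zero" stands in for "first X bytes of the hash are zero":

       WINDOW = 1024           # sliding window size, the 1KB from above
       MASK = (1 << 20) - 1    # ~1 position in 2**20 splits: ~1MiB chunks
       PRIME = 257             # odd multiplier for the rolling hash
       MOD = 1 << 32           # keep the hash value in 32 bits

       def split_points(data, window=WINDOW, mask=MASK):
           """Yield the end offset of every chunk in data (bytes)."""
           drop = pow(PRIME, window - 1, MOD)  # weight of departing byte
           rolling = 0
           chunk_start = 0
           for i, byte in enumerate(data):
               if i - chunk_start >= window:
                   # Remove the byte sliding out of the window...
                   rolling = (rolling - data[i - window] * drop) % MOD
               # ...and mix in the byte sliding in.
               rolling = (rolling * PRIME + byte) % MOD
               # Split once a full window has been hashed and the low
               # bits of the hash are all zero.
               if i - chunk_start + 1 >= window and (rolling & mask) == 0:
                   yield i + 1
                   chunk_start = i + 1
                   rolling = 0
           if chunk_start < len(data):
               yield len(data)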
   
   Other than that, the rest of the chunking and indexing routines work
   the same way they do now.
   
   The advantage is that adding a single byte to the start of a file
   doesn't cause the entire file to be re-uploaded.  It just increases
   the size of the first chunk by one byte and re-uses the rest.  (Or, if
   the insertion lands exactly on a chunk boundary, it changes the first
   two chunks.  But still, that's a lot less data to re-upload.)  It will
   also increase the amount of deduplication between nearly-identical
   files.
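
   To make the re-use claim concrete, here's a quick test using the
   split_points sketch above, with deliberately small parameters so the
   test data can stay small (the exact numbers are arbitrary):

       import hashlib
       import os

       def chunk_digests(data, **kw):
           digests, start = [], 0
           for end in split_points(data, **kw):
               digests.append(hashlib.sha256(data[start:end]).hexdigest())
               start = end
           return digests

       original = os.urandom(256 * 1024)
       shifted = b"!" + original   # one byte inserted at the very start

       old = chunk_digests(original, window=64, mask=(1 << 12) - 1)
       new = chunk_digests(shifted, window=64, mask=(1 << 12) - 1)

       old_set = set(old)
       fresh = [d for d in new if d not in old_set]
       print(len(new), "chunks,", len(fresh), "need uploading")

   Most runs of that report only one new chunk; with fixed-size chunks
   the same one-byte shift would make every chunk different.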
   
   The downside is the overhead involved in walking a hash algorithm
   across the data, but you don't need a particularly CPU-intensive hash,
   or even any significant collision resistance; you just need one that
   produces enough entropy to get your chunk sizes into the range you
   want.
   
   This feature would significantly increase Obnam's deduplication
   capabilities without affecting the complexity of the datastore (it
   already supports multiple chunk sizes in the repo).
   
   I will take a look and see if I can figure out how to add this in, but
   I'm not a particularly tidy coder on the best of days, so someone with
   a bit more Python design experience than I have might be quicker at it.
    Like...  A lot quicker...
   
   LMP
From: Lars Wirzenius <liw@liw.fi>
Date: Wed, 19 Jul 2017 21:12:32 +0300

   This has, in fact, been discussed before, and it's my intention to
   look at implementing this once FORMAT GREEN ALBATROSS is finished. The
   issue is that variable-sized chunking may increase the number of
   chunks by a lot, and I'm not sure Obnam's data structures will deal
   with that sufficiently efficiently.
   
   On Wed, Jul 19, 2017 at 05:23:16PM +0000, Laurence Perkins (OE) wrote:
   > Stumbled across a couple of rsync-style backup programs that use
   > variable-sized chunks to improve backup performance.  It might be
   > worth considering adding this as an option.
   > [...]
From: "Laurence Perkins (OE)" <lperkins@openeye.net>
Date: Thu, 20 Jul 2017 17:23:37 +0000

   Hmmm...  it shouldn't actually increase the number of chunks on
   average.  The ones that come out a little smaller will get balanced out
   by ones that come out a little bigger.  I suppose a really unlucky
   dataset could end up with a whole pile of tiny chunks, but adding a
   minimum and maximum chunk size wouldn't change the algorithm that much,
   and would mitigate that problem.
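
   That guard is just a small tweak to the split test from the sketch in
   my earlier mail, something along these lines (the sizes are arbitrary
   examples, not tuned values):

       MIN_CHUNK = 256 * 1024        # never emit chunks smaller than this
       MAX_CHUNK = 4 * 1024 * 1024   # force a boundary here at the latest

       def should_split(chunk_size, rolling, mask):
           if chunk_size < MIN_CHUNK:
               return False
           if chunk_size >= MAX_CHUNK:
               return True
           return (rolling & mask) == 0   # otherwise the usual hash test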
   
   To illustrate, say we're storing a string of random digits 0-9. 
   Statistically, if the set is large enough, splitting every ten
   characters should yield approximately the same number of chunks as
   splitting every time a 1 comes up. The bigger the repo, the more likely
   it is to come out the same.
   
   Since the data going in obviously won't be random, the key is to use a
   hash function whose output is (at least statistically) uniformly
   distributed.
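
   To put a number on it: if the hash output really is uniform and we
   split wherever its low K bits are zero, chunk sizes follow a geometric
   distribution with a mean of 2**K bytes, whatever the input looks like.
   A quick simulation (K here is arbitrary) bears that out:

       import random

       K = 13                   # "split when the low 13 hash bits are zero"
       random.seed(1)

       sizes, current = [], 0
       for _ in range(5_000_000):           # five million byte positions
           current += 1
           if random.getrandbits(K) == 0:   # stand-in for the hash test
               sizes.append(current)
               current = 0

       print(sum(sizes) / len(sizes))       # comes out near 2**13 == 8192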
   
   When I get some spare time I'll see if I can hack it into the chunking
   algorithm and test what happens with some real data sets.  I suspect
   the real challenge will be finding a hash algorithm with sufficient
   entropy that can be computed that many times without the backup taking
   a month, or having to resort to offloading the calculations to the
   GPU...
   
   
   
   On Wed, 2017-07-19 at 21:12 +0300, Lars Wirzenius wrote:
   > This has, in fact, been discussed before, and it's my intention to
   > look at implementing this once FORMAT GREEN ALBATROSS is finished.
   > The issue is that variable-sized chunking may increase the number of
   > chunks by a lot, and I'm not sure Obnam's data structures will deal
   > with that sufficiently efficiently.
   > [...]
From: Lars Wirzenius <liw@liw.fi>
Date: Sat, 22 Jul 2017 17:17:56 +0300

   For more precise deduplication, it's almost certainly necessary to use
   smaller chunks than the megabyte-sized ones Obnam currently uses by
   default.  Smaller chunks mean more chunks.
   
   On Thu, Jul 20, 2017 at 05:23:37PM +0000, Laurence Perkins (OE) wrote:
   > When I get some spare time I'll see if I can hack it into the chunking
   > algorithm and test what happens with some real data sets.
   
   Please do.
From: Ben Boeckel <mathstuf@gmail.com>
Date: Mon, 24 Jul 2017 13:19:42 -0400

   On Mon, Jul 24, 2017 at 17:12:15 +0000, Laurence Perkins (OE) wrote:
   > Which just made me realise what kind of space savings I could get on
   > one of my archives...  I guess this just moved up the priority list a
   > ways...  Any suggestions for a hash that's fast enough to crawl over
   > large files without hurting performance too much while still giving
   > reasonable control over the average chunk size?
   
   What about the hashing used by casync in its chunking? From its
   announcement[1]:
   
       The "chunking" algorithm is based on a the buzhash rolling hash
       function. SHA256 is used as strong hash function to generate digests
       of the chunks. xz is used to compress the individual chunks.
   
   It sounds like you use a rolling hash to find the chunks and then store
   them according to their SHA256 hash.
   
   --Ben
   
   [1]http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
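
   For reference, the attraction of buzhash for this job is that rolling
   it forward by one byte is just a couple of rotates and XORs.  A
   minimal sketch of a buzhash-style rolling window is below; the table,
   window size, and API are illustrative placeholders, not what casync
   actually uses:

       import random

       WINDOW = 48              # bytes covered by the rolling window
       MASK32 = 0xFFFFFFFF

       # Per-byte random table.  A real implementation ships a fixed
       # table so chunk boundaries stay stable across runs and machines.
       random.seed(0)
       TABLE = [random.getrandbits(32) for _ in range(256)]

       def rotl(value, count):
           count %= 32
           return ((value << count) | (value >> (32 - count))) & MASK32

       class Buzhash:
           """Rolling hash over the last WINDOW bytes seen."""

           def __init__(self):
               self.window = bytearray()
               self.digest = 0

           def roll(self, byte):
               # Age everything already in the window by one rotation,
               # then mix in the new byte.
               self.digest = rotl(self.digest, 1) ^ TABLE[byte]
               self.window.append(byte)
               if len(self.window) > WINDOW:
                   # The departing byte has been rotated WINDOW times
                   # since it was added, so cancel exactly that term.
                   old = self.window.pop(0)
                   self.digest ^= rotl(TABLE[old], WINDOW)
               return self.digest

   A chunker would then test something like (hash.roll(byte) & mask) == 0
   once the window is full, and store each finished chunk under its
   SHA256 digest, as the casync description says.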
   
From: "Laurence Perkins (OE)" <lperkins@openeye.net>
Date: Mon, 24 Jul 2017 17:12:15 +0000

   On Sat, 2017-07-22 at 17:17 +0300, Lars Wirzenius wrote:
   > For more precise deduplication, it's almost certainly necessary to
   > use
   > smaller chunks than the megabyte sized ones Obnam currently uses by
   > default. Smaller chunks mean more chunks.
   > 
   > 
   Smaller chunk sizes make deduplication more precise regardless of the
   type of splitting, but variable splitting should generate some pretty
   big savings on similar data even without reducing the chunk size,
   because it is better at finding identical runs of data: it doesn't
   rely on them sitting at fixed offsets.
   
   To illustrate, consider the following two datasets:
   
   aaaaabbbbb
   
   caaaaabbbbb
   
   With a fixed chunk size of 5, the first dataset makes two chunks and
   the second makes three with no chunks shared.
   
   With a variable chunk algorithm that splits when it sees "ab", they
   both make two chunks, and they share one.
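
   In code form, with "split right after every ab" as a toy stand-in for
   the hash criterion:

       def fixed_chunks(data, size=5):
           return [data[i:i + size] for i in range(0, len(data), size)]

       def chunks_after_ab(data):
           chunks, start = [], 0
           for i in range(1, len(data)):
               if data[i - 1:i + 1] == "ab":
                   chunks.append(data[start:i + 1])
                   start = i + 1
           if start < len(data):
               chunks.append(data[start:])
           return chunks

       print(fixed_chunks("aaaaabbbbb"))       # ['aaaaa', 'bbbbb']
       print(fixed_chunks("caaaaabbbbb"))      # ['caaaa', 'abbbb', 'b']
       print(chunks_after_ab("aaaaabbbbb"))    # ['aaaaab', 'bbbb']
       print(chunks_after_ab("caaaaabbbbb"))   # ['caaaaab', 'bbbb']

   The 'bbbb' chunk comes out identical in both, so only the first chunk
   of the second dataset would need to be stored.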
   
   As with all deduplication, the chunk size must be tuned to the type of
   data you're working with, but even with the current 1MB chunk size,
   files larger than that will benefit any time data gets inserted into
   the middle, which is a common occurrence for things like sparse VM
   images.
   
   Which just made me realise what kind of space savings I could get on
   one of my archives...  I guess this just moved up the priority list a
   ways...  Any suggestions for a hash that's fast enough to crawl over
   large files without hurting performance too much while still giving
   reasonable control over the average chunk size?
   
   LMP
From: "Laurence Perkins (OE)" <lperkins@openeye.net>
Date: Mon, 24 Jul 2017 18:15:48 +0000

   On Mon, 2017-07-24 at 20:20 +0300, Lars Wirzenius wrote:
   > On Mon, Jul 24, 2017 at 05:12:15PM +0000, Laurence Perkins (OE) wrote:
   > > Smaller chunk sizes make deduplication more precise regardless of
   > > the type of splitting, but variable splitting should generate some
   > > pretty big savings on similar data even without reducing the chunk
   > > size, because it is better at finding identical runs of data: it
   > > doesn't rely on them sitting at fixed offsets.
   > 
   > If you have actual measurements of this, please report them. Some
   > years ago when this idea first came up in the Obnam context, using
   > the proposed type of chunking without reducing chunk size
   > significantly didn't help de-duplication much. Only when the average
   > chunk size became much smaller did de-duplication get a lot better,
   > but then the number of chunks became a problem.
   > 
   > Guessing isn't helpful here. Even if it were, now is not a good time
   > for me to spend any time on this, and I don't want to even consider a
   > patch for this until green albatross is in shape.
   
   Mathematically it's going to depend a lot on your dataset.  You will
   never see any benefit on files smaller than two chunks, nor on files
   where new data is simply appended to the end.  This is probably the
   kind of data most users have at this point (along with data that is
   never modified), so it's definitely not worth diverting your attention
   away from completing green albatross.
   
   However, when backing up sparse disk images, cloned VMs, chroot
   tarballs, certain kinds of database file, or anything else that spans
   multiple chunks and is routinely subjected to random insertions, I
   commonly see fixed-chunking tools (including Obnam) go quickly until
   they hit the first inserted bit of data and then re-transfer the
   entire rest of the file, because all the chunk offsets are off.  For
   one of my machines this often results in hundreds of gigabytes of
   unnecessary data transfer per backup run.  That's probably not
   representative of the typical Obnam user, but a few people have asked
   on the mailing list for advice about how to optimize such things, so I
   doubt I'm the only one.
   
   I strongly suspect that all the extra hashing is going to crater
   performance regardless, but I'll give it a shot.  Since the repo
   formats are chunk-size-agnostic, the impact on other sections of the
   program should be virtually nil.
From: Lars Wirzenius <liw@liw.fi>
Date: Mon, 24 Jul 2017 20:20:43 +0300

   On Mon, Jul 24, 2017 at 05:12:15PM +0000, Laurence Perkins (OE) wrote:
   > Smaller chunk sizes make deduplication more precise regardless of the
   > type of splitting, but variable splitting should generate some pretty
   > big savings on similar data even without reducing the chunk size,
   > because it is better at finding identical runs of data: it doesn't
   > rely on them sitting at fixed offsets.
   
   If you have actual measurements of this, please report them. Some
   years ago when this idea first came up in the Obnam context, using
   the proposed type of chunking without reducing chunk size
   significantly didn't help de-duplication much. Only when the average
   chunk size became much smaller did de-duplication get a lot better,
   but then the number of chunks became a problem.
   
   Guessing isn't helpful here. Even if it were, now is not a good time
   for me to spend any time on this, and I don't want to even consider a
   patch for this until green albatross is in shape.