
Obnam performance anecdote

From: Teddy Hogeborn <teddy@recompile.se>
Date: Mon, 19 Dec 2016 16:04:04 +0100

   I just finished a full first-time backup run of an entire system,
   totalling ~530 Gigabytes, using an SFTP repository, local Gigabit
   Ethernet, no encryption.  It took more than two weeks to complete:
   
   2016-12-19 13:27:30 INFO VFS: baseurl=sftp://user@server/~/path/repository roundtrips=396218366
   2016-12-19 13:27:30 INFO Backup performance statistics:
   2016-12-19 13:27:30 INFO * files found: 12551555
   2016-12-19 13:27:30 INFO * files backed up: 12551554
   2016-12-19 13:27:30 INFO * uploaded chunk data: 426849686120 bytes (397 GiB)
   2016-12-19 13:27:30 INFO * total uploaded data (incl. metadata): 3160151243393 bytes (2 TiB)
   2016-12-19 13:27:30 INFO * total downloaded data (incl. metadata): 7084787046861 bytes (6 TiB)
   2016-12-19 13:27:30 INFO * transfer overhead: 9818088604134 bytes (8 TiB)
   2016-12-19 13:27:30 INFO * duration: 1349326.5834 s (374h48m47s)
   2016-12-19 13:27:30 INFO * average speed: 308.928469748 KiB/s
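
   [Editorial note: the figures in this log excerpt are internally consistent.
   A quick arithmetic check, under the assumption that "transfer overhead"
   simply means all traffic that is not chunk data:]

```python
# Sanity check of the statistics quoted above. The "transfer overhead"
# definition used here is an assumption: total traffic minus chunk data.
chunk_data = 426_849_686_120      # uploaded chunk data, bytes (397 GiB)
uploaded = 3_160_151_243_393      # total uploaded incl. metadata
downloaded = 7_084_787_046_861    # total downloaded incl. metadata
duration_s = 1_349_326.5834       # 374h48m47s

overhead = uploaded + downloaded - chunk_data
print(overhead)                   # 9818088604134, matching the log line

# Average speed is chunk data over wall-clock time, in KiB/s.
print(round(chunk_data / duration_s / 1024, 2))  # 308.93, matching the log
```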
   
   It also took more than 25 Gigabytes of RAM, with spikes going above
   that - see attached graph.  The green part (labeled "apps") is what to
   look at.  After the first week, it stabilized to around 20 GB.
   
   I will assume that two weeks and 25 GB of RAM aren't expected behavior, so
   I wonder where on the roadmap this is planned to be fixed?  I.e. how can
   I know when I should try Obnam again?
   
   Will "green albatross" ameliorate any of these problems?
   
   /Teddy Hogeborn
From: Lars Wirzenius <liw@liw.fi>
Date: Mon, 19 Dec 2016 16:09:48 +0100

   On Mon, Dec 19, 2016 at 04:04:04PM +0100, Teddy Hogeborn wrote:
   > Will "green albatross" ameliorate any of these problems?
   
   Possibly. Why don't you try it and tell us?
From: Teddy Hogeborn <teddy@recompile.se>
Date: Sun, 22 Jan 2017 15:16:05 +0100

   Lars Wirzenius <liw@liw.fi> writes:
   
   > On Mon, Dec 19, 2016 at 04:04:04PM +0100, Teddy Hogeborn wrote:
   > > Will "green albatross" ameliorate any of these problems?
   >
   > Possibly. Why don't you try it and tell us?
   
   I have.  Unfortunately, I did not run with --debug, so no detailed
   statistics, but even though it only took about 6 Gigabytes of RAM at
   most, it took closer to *three weeks* to complete this time.
   
   /Teddy Hogeborn
From: Teddy Hogeborn <teddy@recompile.se>
Date: Sun, 22 Jan 2017 16:09:23 +0100

   Lars Wirzenius <liw@liw.fi> writes:
   
   > > > On Mon, Dec 19, 2016 at 04:04:04PM +0100, Teddy Hogeborn wrote:
   > > > > Will "green albatross" ameliorate any of these problems?
   > > >
   > > > Possibly. Why don't you try it and tell us?
   > > 
   > > I have.  Unfortunately, I did not run with --debug, so no detailed
   > > statistics, but even though it only took about 6 Gigabytes of RAM at
   > > most, it took closer to *three weeks* to complete this time.
   >
   > Can you share your config, please? 6 gigabytes seems excessive for the
   > current version of Obnam.
   
   No config file was used, just command line options:
   
   obnam backup --repository \
   sftp://obnam@$BACKUPSERVER/~/$HOSTNAME/repository \
   --output=/root/obnam-ga.log --ssh-key=/root/.ssh/id_obnam \
   --repository-format=green-albatross-20160813 --exclude=/sys \
   --exclude=/proc --exclude=/dev --exclude=/run --exclude=/net \
   --exclude=/home /
   
   This was on the same server and same set of files that I ran the other
   backup on, so you can look there for some statistics on the number of
   files, etc.
   
   "obnam --version" gives "1.20.2".  "dpkg -s obnam" gives "Version:
   1.20.2-1.debian8".
   
   I could live with 6 Gigs of memory, but not with close to three weeks of
   runtime for a full backup of ~550 GB.
   
   /Teddy Hogeborn
From: Lars Wirzenius <liw@liw.fi>
Date: Sun, 22 Jan 2017 16:56:33 +0200

   On Sun, Jan 22, 2017 at 03:16:05PM +0100, Teddy Hogeborn wrote:
   > Lars Wirzenius <liw@liw.fi> writes:
   > 
   > > On Mon, Dec 19, 2016 at 04:04:04PM +0100, Teddy Hogeborn wrote:
   > > > Will "green albatross" ameliorate any of these problems?
   > >
   > > Possibly. Why don't you try it and tell us?
   > 
   > I have.  Unfortunately, I did not run with --debug, so no detailed
   > statistics, but even though it only took about 6 Gigabytes of RAM at
   > most, it took closer to *three weeks* to complete this time.
   
   Can you share your config, please? 6 gigabytes seems excessive for the
   current version of Obnam.
From: Lars Wirzenius <liw@liw.fi>
Date: Sun, 22 Jan 2017 20:46:48 +0200

   On Sun, Jan 22, 2017 at 04:09:23PM +0100, Teddy Hogeborn wrote:
   > Lars Wirzenius <liw@liw.fi> writes:
   > 
   > > > > On Mon, Dec 19, 2016 at 04:04:04PM +0100, Teddy Hogeborn wrote:
   > > > > > Will "green albatross" ameliorate any of these problems?
   > > > >
   > > > > Possibly. Why don't you try it and tell us?
   > > > 
   > > > I have.  Unfortunately, I did not run with --debug, so no detailed
   > > > statistics, but even though it only took about 6 Gigabytes of RAM at
   > > > most, it took closer to *three weeks* to complete this time.
   > >
   > > Can you share your config, please? 6 gigabytes seems excessive for the
   > > current version of Obnam.
   ...
   > I could live with 6 Gigs of memory, but not with close to three weeks of
   > runtime for a full backup of ~550 GB.
   
   I don't see anything in your configuration that would make things
   particularly slow. I assume you have a very large number of files in
   your live data. How many files? Run "obnam generations" with the same
   options as you use for "obnam backup" to see.
   
   How long is the ping time to your backup server? What's the bandwidth
   like?
   
   Did the six weeks for the initial backup include removing checkpoint
   generations made along the way? (That's currently a very slow and
   memory hungry operation.) If so, using --leave-checkpoints will tell
   Obnam to not remove them.
From: Teddy Hogeborn <teddy@recompile.se>
Date: Mon, 23 Jan 2017 09:26:12 +0100

   Lars Wirzenius <liw@liw.fi> writes:
   
   > > > > > > Will "green albatross" ameliorate any of these problems?
   > > > > >
   > > > > > Possibly. Why don't you try it and tell us?
   > > > > 
   > > > > I have.  Unfortunately, I did not run with --debug, so no
   > > > > detailed statistics, but even though it only took about 6
   > > > > Gigabytes of RAM at most, it took closer to *three weeks* to
   > > > > complete this time.
   > > > 
   > > > Can you share your config, please? 6 gigabytes seems excessive for
   > > > the current version of Obnam.
   > ...
   > > I could live with 6 Gigs of memory, but not with close to three
   > > weeks of runtime for a full backup of ~550 GB.
   >
   > I don't see anything in your configuration that would make things
   > particularly slow. I assume you have a very large number of files in
   > your live data. How many files? Run "obnam generations" with the same
   > options as you use for "obnam backup" to see.
   
   Output attached.
   
   > How long is the ping time to your backup server?
   
   Same local network, Gigabit Ethernet.  0.4 milliseconds.
   
   > What's the bandwidth like?
   
   1 Gigabit Ethernet switch, not otherwise under heavy use.
   
   > Did the six weeks for the initial backup
   
   That's two and almost three weeks for full backups using the default
   repository format and the green albatross, respectively; not six weeks.
   
   > include removing checkpoint generations made along the way?
   > (That's currently a very slow and memory hungry operation.) If so,
   > using --leave-checkpoints will tell Obnam to not remove them.
   
   Yes, but I don't think the checkpoint removal kicked in until the end of
   the backup procedure?  As the graph shows, the memory consumption was
   climbing and large during most of the initial two weeks, not only at the
   end.
   
   /Teddy Hogeborn
From: Lars Kruse <lists@sumpfralle.de>
Date: Tue, 24 Jan 2017 11:06:58 +0100

   Hello Teddy,
   
   thank you for providing more details of your interesting case!
   
   
   The timestamps of the checkpoints of your backup show that the backup
   procedure slowed down significantly at some points (measuring "slow"
   as checkpoints completed per unit of time).
   
   Here is a sample of checkpoints which took more than 24 hours to complete:
   
   > [..]
   > 55	2017-01-01 19:04:09 +0100 .. 2017-01-09 06:32:49 +0100 (0 files, 0 bytes)  (checkpoint)
   > [..]
   > 65	2017-01-09 23:08:23 +0100 .. 2017-01-10 05:25:19 +0100 (0 files, 0 bytes)  (checkpoint)
   > [..]
   > 110	2017-01-11 07:51:06 +0100 .. 2017-01-13 02:16:47 +0100 (0 files, 0 bytes)  (checkpoint)
   > [..]
   > 128	2017-01-13 16:53:34 +0100 .. 2017-01-19 18:21:48 +0100 (0 files, 0 bytes)  (checkpoint)
   
   I ran the output through a script and received the following distribution:
   * the fastest checkpoint took 39 seconds
   * 50% of the checkpoints took less than two minutes
   * 90% took less than ten minutes
   * 99% took less than ten hours
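
   [Editorial note: the script itself was not posted. A sketch of the kind
   of script that could produce this distribution, assuming the generation
   lines look exactly like the sample quoted above:]

```python
# Compute the duration of one checkpoint from an "obnam generations"
# output line (line format assumed to match the sample quoted above).
import re
from datetime import datetime

def checkpoint_duration(line):
    """Return the wall-clock duration of a generation line, in seconds."""
    start, end = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(end, fmt)
            - datetime.strptime(start, fmt)).total_seconds()

line = ("55\t2017-01-01 19:04:09 +0100 .. "
        "2017-01-09 06:32:49 +0100 (0 files, 0 bytes)  (checkpoint)")
print(checkpoint_duration(line))  # 646120.0 - the slowest checkpoint
```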
   
   I assume that a large number of small files needs much more network
   traffic, I/O and processing time than a few big files. I could imagine
   that this is the reason for the huge difference between the quickest
   (39 s) and the slowest (646120 s) checkpoints of your backup.
   
   Maybe you could show us the size distribution of your files in the backup?
    obnam ls | awk '{print int(log($5)/log(10))}' | sort -n | uniq -c
   (this will probably take quite a while)
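
   [Editorial note: for anyone not fluent in awk, here is a rough Python
   equivalent of that pipeline. It buckets every file by the decimal order
   of magnitude of its size; that the size sits in the fifth
   whitespace-separated column, as in "ls -l" output, is an assumption.]

```python
# Bucket file sizes by decimal order of magnitude and count each bucket,
# mirroring the awk one-liner above.
import math
from collections import Counter

def size_histogram(lines):
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[4].isdigit():
            continue  # skip headers, directories, malformed lines
        size = int(fields[4])
        # log10(0) is undefined; put empty files in bucket 0 explicitly
        buckets[int(math.log10(size)) if size > 0 else 0] += 1
    return dict(sorted(buckets.items()))

sample = [
    "-rw-r--r-- 1 user group 123 2017-01-24 /etc/motd",
    "-rw-r--r-- 1 user group 45678 2017-01-24 /var/log/syslog",
]
print(size_histogram(sample))  # {2: 1, 4: 1}
```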
   
   
   Additionally, maybe you could show us more Munin graphs of that period?
   I/O, CPU and network traffic would probably be interesting. Graphs from
   both sides (the repository host and the backup client) would be perfect.
   
   Thank you!
   
   Cheers,
   Lars
From: Lars Wirzenius <liw@liw.fi>
Date: Tue, 24 Jan 2017 11:31:53 +0200

   On Mon, Jan 23, 2017 at 09:26:12AM +0100, Teddy Hogeborn wrote:
   > Lars Wirzenius <liw@liw.fi> writes:
   > > I don't see anything in your configuration that would make things
   > > particularly slow. I assume you have a very large number of files in
   > > your live data. How many files? Run "obnam generations" with the same
   > > options as you use for "obnam backup" to see.
   > 
   > Output attached.
   
   12.5 million files. That's a lot, much more than I've tested.
   
   > > How long is the ping time to your backup server?
   > 
   > Same local network, Gigabit ethernet.  0.4 milliseconds.
   
   That's fast enough that it's not a concern.
   
   > > Did the six weeks for the initial backup
   > 
   > That's two and almost three weeks for full backups using the default
   > repository format and the green albatross, respectively; not six
   > weeks.
   
   OK.
   
   > > include removing checkpoint generations made along the way?
   > > (That's currently a very slow and memory hungry operation.) If so,
   > > using --leave-checkpoints will tell Obnam to not remove them.
   > 
   > Yes, but I don't think the checkpoint removal kicked in until the end of
   > the backup procedure?  As the graph shows, the memory consumption was
   > climbing and large during most of the initial two weeks, not only at the
   > end.
   
   Yeah, checkpoint removal happens at the end. If memory consumption is
   that large during the backup phase, I'm a bit worried. In my tests
   memory use is much, much less. Are you measuring process size or
   resident memory size? Do you get memory use from the Obnam log file
   (grep for "VmRSS:") or some other way?
   
   How big is your checkpoint interval? I guess it's the default 1 GiB.
From: Teddy Hogeborn <teddy@recompile.se>
Date: Tue, 24 Jan 2017 22:58:10 +0100

   Lars Wirzenius <liw@liw.fi> writes:
   
   > > > I don't see anything in your configuration that would make things
   > > > particularly slow. I assume you have a very large number of files
   > > > in your live data. How many files? Run "obnam generations" with
   > > > the same options as you use for "obnam backup" to see.
   > > 
   > > Output attached.
   >
   > 12.5 million files. That's a lot, much more than I've tested.
   
   A lot of those are probably Maildir folders full of years' accumulated
   spam, i.e. a lot of small files in the same directory.  Could this
   explain why it's so slow?  I mean, could it be that it's reading and
   re-reading the same directory listing over and over for every new file
   to open?  (File system is ext4.)  I'm just thinking out loud here, but I
   guess it's something you could test.
   
   > > > Did the [two] weeks for the initial backup include removing
   > > > checkpoint generations made along the way?  (That's currently a
   > > > very slow and memory hungry operation.) If so, using
   > > > --leave-checkpoints will tell Obnam to not remove them.
   > > 
   > > Yes, but I don't think the checkpoint removal kicked in until the
   > > end of the backup procedure?  As the graph shows, the memory
   > > consumption was climbing and large during most of the initial two
   > > weeks, not only at the end.
   >
   > Yeah, checkpoint removal happens at the end. If memory consumption is
   > that large during the backup phase, I'm a bit worried. In my tests
   > memory use is much, much less. Are you measuring process size or
   > resident memory size?
   
   Virtual, I think.  But I don't think there was a big difference.
   
   > Do you get memory use from the Obnam log file (grep for "VmRSS:") or
   > some other way?
   
   I simply used "top" and the munin graph I attached to my previous mail.
   (I cannot find that string anywhere in the log file.)
   
   > How big is your checkpoint interval? I guess it's the default 1 GiB.
   
   Whatever the default is.
   
   /Teddy Hogeborn
From: Lars Kruse <lists@sumpfralle.de>
Date: Thu, 26 Jan 2017 21:21:14 +0100

   Hello Teddy,
   
   
   Am Tue, 24 Jan 2017 22:58:10 +0100
   schrieb Teddy Hogeborn <teddy@recompile.se>:
   
   > Lars Wirzenius <liw@liw.fi> writes:
   >  [...]
   > > 12.5 million files. That's a lot, much more than I've tested.  
   > 
   > A lot of those are probably Maildir folders full of years' accumulated
   > spam, i.e. a lot of small files in the same directory.  Could this
   > explain why it's so slow?
   
   I was doing backups with Obnam for a mail server as well.
   It had around 2.5 million files consuming 340 GB of space (on ext4).
   If I remember correctly, it took well below one day (but that was still
   too long for this very exotic use-case).
   
   
   Could you please run the file size distribution evaluation that I
   proposed in my other email? I am curious whether it could give an
   indication of what is causing the weirdness:
   
    obnam ls | awk '{print int(log($5)/log(10))}' | sort -n | uniq -c
   
   In case you have access to more munin graphs: the month graphs for network
   traffic, IO and CPU would be interesting, as well.
   
   Cheers,
   Lars