Incomplete files in the backup after SFTP connection losses

a17ab3c6b58248a3af3d5fc81c4b2fab
RATCHET INFANCY SCALLION

From: Ulrich Klauer <ulrich@chirlu.de>
Date: Wed, 19 Oct 2016 00:54:50 +0200

   Hi!
   
   I'm using obnam 1.19.1 (on Debian testing) to back up a few hundred  
   GiB of data, mostly large audio files somewhere between 200 MiB and 4  
   GiB each, to a remote repository via SFTP.
   
   Unfortunately, the network connection is somewhat unreliable; from  
   time to time, there will be up to 100% packet loss for seconds or  
   minutes. When it's several minutes, the SFTP connection will usually  
   time out, resulting in an SSHException from paramiko ("Server  
   connection dropped"). I then resume by running "obnam force-lock" and  
   repeating the aborted "obnam backup". (I've also reduced the  
   "checkpoint" parameter to 150Mi in .obnam.conf.)
   
   I got somewhat suspicious when I saw that the progress indicator had  
   increased considerably just after such a restart (e.g. immediately  
   before the crash, "found N files, 34.13 GiB", but shortly after the  
   start of the new backup "found M files, 36.38 GiB" when there simply  
   had not been the time to upload more than 2 GiB). When the backup was  
   finished after several cycles, I was therefore not too surprised that  
   "obnam verify" found a number of truncated files, like so:
   
   > verify failure: RDF30DX: /path/to/file.wav: more data at end of file
   
   For example, one such file is 2234613896 B (approx. 2131.1 MiB) long  
   in the live data, and that's also exactly what "obnam ls" reports for  
   the backup. However, restoring this file from the backup with "obnam  
   restore" results in a file that is 650117120 B (620 MiB exactly) long.  
   This is the output:
   
   > File checksum restore error: /path/to/file.wav  
   > (b512a39af2a2a3ad5425b24d95153b9d vs )
   > Restored 1 files, downloaded 620.0 MiB in 15m3s at 703.0 KiB/s average speed
   > ERROR: RD6259X: There were errors when restoring
   
   (The checksum given is that of the truncated file, I verified it by  
   running "md5sum".)
   
   A comparison (with "cmp") between the original file and the restored  
   one confirms the result of "obnam verify": The restored data is  
   correct as far as it exists, i.e. the restored file is exactly the  
   first 620 MiB of the original file.
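   
   For anyone who wants to repeat that check without cmp, here is a small  
   Python sketch; the paths are placeholders:
   
       # Confirm that the restored file is a byte-for-byte prefix of the
       # original. The paths are placeholders.
       def is_prefix(restored_path, original_path, bufsize=1 << 20):
           with open(restored_path, "rb") as r, open(original_path, "rb") as o:
               while True:
                   block = r.read(bufsize)
                   if not block:
                       return True   # end of the restored file: prefix holds
                   if block != o.read(len(block)):
                       return False
       
       print(is_prefix("/restore/path/to/file.wav", "/path/to/file.wav"))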
   
   
   So it seems to me that obnam, when the SFTP connection is lost, in  
   certain cases marks a file as finished long before all of its contents  
   have actually been uploaded. It's not just about the last MiB block;  
   hundreds of blocks can be missing from a file. I consider this a  
   serious bug, because it gives the user a false sense of security ("My  
   data is safely backed up!") when the backup is, in fact, incomplete.
   
   Is this a known issue? And is there a way to complete the backup?  
   Another "obnam backup" run finishes immediately because it believes  
   the data to be completely backed up already and unchanged. My best  
   idea so far (but I haven't actually tried it yet) is to repeat the  
   backup under a new client name and rely on the block deduplication to  
   avoid re-uploading the data that is already there. However, the backup  
   will still take some time and might fail again due to connectivity  
   issues, and then I would need a third client name ... or a fourth ...  
   or fifth.
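   
   Concretely, I would run something like
   
       obnam backup --client-name=attempt2 /path/to/data
   
   assuming I'm right that the client-name setting can be overridden on  
   the command line like any other obnam setting.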
   
   Ulrich
   
   
From: Maximilian Gaukler <development@maxgaukler.de>
Date: Sat, 22 Oct 2016 22:34:10 +0200

   Hi,
   
   under conditions similar to those you describe, I had the problem that  
   `obnam fsck` complained about various data structure (not content)  
   errors; the messages varied from run to run. I suspect that the  
   transaction handling in obnam is not as robust as you would expect  
   from, e.g., a real filesystem or database, although a short test in  
   which I simulated random connection aborts failed to produce data  
   corruption like you describe.
   
   From an outside view, I would wish for the following on the first run  
   after a connection was aborted (see the sketch below):
   
   - clean up the stale lock (but only if the lock was certainly created  
   by the connecting client, not by someone else)
   - acquire the lock
   - roll back the uncommitted transaction, so that `obnam fsck` is  
   completely happy and no junk data is left behind
   - start a new transaction
   - do the work ...
   - commit the transaction
   
   Of course, it's not that easy to actually implement this in a way that 
   works via SFTP.
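   
   A minimal sketch of that sequence, with a toy in-memory class standing  
   in for the real repository (every name here is made up for  
   illustration; none of this is obnam's actual internals):
   
       class LockHeldError(Exception):
           pass
       
       class Repository:
           """Toy in-memory stand-in for the remote repository."""
           def __init__(self):
               self.lock_owner = None   # who currently holds the lock
               self.committed = []      # data a finished generation points to
               self.uncommitted = []    # half-written data from the current run
       
           def acquire_lock(self, client_id):
               if self.lock_owner is not None:
                   raise LockHeldError(self.lock_owner)
               self.lock_owner = client_id
       
           def rollback_uncommitted(self):
               self.uncommitted = []    # drop junk left by an aborted run
       
           def commit(self):
               self.committed += self.uncommitted
               self.uncommitted = []
       
       def safe_backup(repo, client_id, chunks):
           owner = repo.lock_owner
           if owner is not None and owner != client_id:
               raise LockHeldError(owner)   # someone else's lock: leave it
           repo.lock_owner = None           # our own stale lock: clean it up
           repo.acquire_lock(client_id)
           try:
               repo.rollback_uncommitted()  # start from an fsck-clean state
               for chunk in chunks:         # "do work": upload the data
                   repo.uncommitted.append(chunk)
               repo.commit()
           finally:
               repo.lock_owner = None       # release the lock
       
       safe_backup(Repository(), "client-1", [b"chunk1", b"chunk2"])
   
   The hard part is that SFTP offers no transaction primitive spanning  
   several files, so the rollback and commit steps would have to be built  
   from single-file operations that can themselves be interrupted.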
   
   Max
   
From: Lars Wirzenius <liw@liw.fi>
Date: Mon, 26 Dec 2016 17:32:56 +0200

   On Wed, Oct 19, 2016 at 12:54:50AM +0200, Ulrich Klauer wrote:
   > I got somewhat suspicious when I saw that the progress indicator had
   > increased considerably just after such a restart (e.g. immediately before
   > the crash, "found N files, 34.13 GiB", but shortly after the start of the
   > new backup "found M files, 36.38 GiB" when there simply had not been the
   > time to upload more than 2 GiB).
   
   You've misunderstood what the Obnam progress output says. It says that
   it's found N files that together contain S amount of data, not that
   it's uploaded that many files or that much data. The uploaded amount
   is shown separately.
   
   > So it seems to me that obnam, when the SFTP connection is lost, in certain
   > cases marks a file as finished long before all of its contents have actually
   > been uploaded.
   
   I don't see, from the code, how this could happen, or why the test  
   suite, which explicitly tests for crashes, would fail to notice this.  
   I'm not saying there isn't a bug like that, but I'm going to need  
   someone to give me a way to reliably, and preferably automatically,  
   reproduce the problem before I spend much time investigating it. Sorry.
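   
   Something along these lines would be a starting point; a rough sketch  
   only, since killing the client process merely approximates a dropped  
   SSH connection, and the repository URL and data path are placeholders:
   
       import random
       import subprocess
       import time
       
       REPO = "sftp://user@host/~/backup-repo"   # placeholder
       DATA = "/path/to/live/data"               # placeholder
       
       def obnam(subcommand, *args, **kwargs):
           # Only subcommands mentioned in this thread are used here.
           return subprocess.run(
               ["obnam", subcommand, "--repository", REPO, *args], **kwargs)
       
       # Interrupt the backup repeatedly, as the flaky network would.
       for _ in range(10):
           proc = subprocess.Popen(
               ["obnam", "backup", "--repository", REPO, DATA])
           time.sleep(random.uniform(5, 120))
           if proc.poll() is not None:
               break                # this run finished before we got to it
           proc.kill()              # "connection dropped"
           proc.wait()
           obnam("force-lock", check=True)
       
       # Let one run finish undisturbed, then compare backup and live data.
       obnam("backup", DATA, check=True)
       obnam("verify", check=True)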