RATCHET INFANCY SCALLION
Incomplete files in the backup after SFTP connection losses
From: Ulrich Klauer <email@example.com> Date: Wed, 19 Oct 2016 00:54:50 +0200 Hi! I'm using obnam 1.19.1 (on Debian testing) to back up a few hundred GiB of data, mostly large audio files somewhere between 200 MiB and 4 GiB each, to a remote repository via SFTP. Unfortunately, the network connection is somewhat unreliable; from time to time, there will be up to 100% packet loss for seconds or minutes. When it's several minutes, the SFTP connection will usually time out, resulting in an SSHException from paramiko ("Server connection dropped"). I then resume by running "obnam force-lock" and repeating the aborted "obnam backup". (I've also reduced the "checkpoint" parameter to 150Mi in .obnam.conf.) I got somewhat suspicious when I saw that the progress indicator had increased considerably just after such a restart (e.g. immediately before the crash, "found N files, 34.13 GiB", but shortly after the start of the new backup "found M files, 36.38 GiB" when there simply had not been the time to upload more than 2 GiB). When the backup was finished after several cycles, I was therefore not too surprised that "obnam verify" found a number of truncated files, like so: > verify failure: RDF30DX: /path/to/file.wav: more data at end of file For example, one such file is 2234613896 B (approx. 2131.1 MiB) long in the live data, and that's also exactly what "obnam ls" reports for the backup. However, restoring this file from the backup with "obnam restore" results in a file that is 650117120 B (620 MiB exactly) long. This is the output: > File checksum restore error: /path/to/file.wav > (b512a39af2a2a3ad5425b24d95153b9d vs ) > Restored 1 files, downloaded 620.0 MiB in 15m3s at 703.0 KiB/s average speed > ERROR: RD6259X: There were errors when restoring (The checksum given is that of the truncated file, I verified it by running "md5sum".) A comparison (with "cmp") between the original file and the restored one confirms the result of "obnam verify": The restored data is correct as far as it exists, i.e. the restored file is exactly the first 620 MiB of the original file. So it seems to me that obnam, when the SFTP connection is lost, in certain cases marks a file as finished long before all of its contents have actually been uploaded. It's not just about the last MiB block; hundreds of blocks can be missing from a file. I consider this a serious bug, because it gives the user a false sense of security ("My data is safely backed up!") when the backup is, in fact, incomplete. Is this a known issue? And is there a way to complete the backup? Another "obnam backup" run finishes immediately because it believes the data to be completely backed up already and unchanged. My best idea so far (but I haven't actually tried it yet) is to repeat the backup under a new client name and rely on the block deduplication to avoid re-uploading the data that is already there. However, the backup will still take some time and might fail again due to connectivity issues, and then I would need a third client name ... or a fourth ... or fifth. Ulrich _______________________________________________ obnam-support mailing list firstname.lastname@example.org http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org
From: Maximilian Gaukler <email@example.com> Date: Sat, 22 Oct 2016 22:34:10 +0200 Hi, under similar conditions as you describe, I had the problem that `obnam fsck` complained about varying data structure (not content) errors. The messages varied from time to time. I suspect that the transaction handling in obnam is not as complete as you would expect e.g. from a real filesystem or database, although a short test in which I simulated random connection aborts failed to put up data corruption as you describe. From an outside view I would wish for the following at the first run after a connection was aborted: - cleanup stale lock ( = if lock was certainly made by the connecting client and not someone else ) - acquire lock - roll back uncommitted transaction so that `obnam fsck` is completely happy and no junk data is left behind. - start new transaction - do work ... - commit transaction Of course, it's not that easy to actually implement this in a way that works via SFTP. Max _______________________________________________ obnam-support mailing list firstname.lastname@example.org http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org
From: Lars Wirzenius <email@example.com> Date: Mon, 26 Dec 2016 17:32:56 +0200 On Wed, Oct 19, 2016 at 12:54:50AM +0200, Ulrich Klauer wrote: > I got somewhat suspicious when I saw that the progress indicator had > increased considerably just after such a restart (e.g. immediately before > the crash, "found N files, 34.13 GiB", but shortly after the start of the > new backup "found M files, 36.38 GiB" when there simply had not been the > time to upload more than 2 GiB). You've misunderstood what the Obnam progress output says. It says that it's found N files that together contain S amount of data, not that it's uploaded that many files or that much data. The uploaded amount is shown separately. > So it seems to me that obnam, when the SFTP connection is lost, in certain > cases marks a file as finished long before all of its contents have actually > been uploaded. I don't see, from code, how this could happen, or why the test suite, which explicitly tests for crashes, would fail to notice this. I'm not saying there isn't a bug like that, but I'm going go need someone to give me a to reliably, and preferably automatically, reproduce the problem before I spend much time investigating it. Sorry.