Andre Noll [Tue, 9 Nov 2010 17:43:49 +0000 (18:43 +0100)]
Change default program for all hooks from /bin/true to true.
At least on Mac OS, true is /usr/bin/true, not /bin/true. So the old default
/bin/true causes all hooks to fail on these systems. Since we execute external
programs via execvp() anyway, there is no need to hardcode the path.
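A minimal sketch of why the bare name suffices: execvp() searches $PATH whenever the program name contains no slash. The helper name run_by_name() and its return convention are assumptions made for illustration and are not taken from the dss sources.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run a program given by bare name and return its
 * exit status, or -1 if it could not be run or did not exit normally.
 */
static int run_by_name(const char *prog)
{
	char *const argv[] = {(char *)prog, NULL};
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* No slash in prog, so execvp() looks it up in $PATH. */
		execvp(prog, argv);
		_exit(127); /* only reached if the exec failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this, run_by_name("true") succeeds regardless of whether the binary lives in /bin or /usr/bin.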
Andre Noll [Fri, 14 May 2010 12:43:51 +0000 (14:43 +0200)]
Introduce snapshot recycling.
When snapshotting large file systems whose contents do not change much between
snapshots, we end up removing large amounts of files just to recreate (hard
links to) most of them afterwards. This patch changes snapshot creation so that
outdated, redundant and orphaned snapshots are reused as the basis for new
snapshots. A new snapshot is created only if no existing one is suitable for
recycling.
Andre Noll [Wed, 12 May 2010 09:00:55 +0000 (11:00 +0200)]
Unify sending of signals.
This patch introduces dss_kill(), a wrapper for kill(2) which
prints a nice log message and checks the return value of the
underlying call to kill().
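A wrapper of this kind might look roughly as follows. This is only a sketch: the name kill_logged(), the log format and the return convention (0 on success, negative errno on failure) are assumptions, not the actual dss_kill() implementation.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a kill(2) wrapper: log the attempt, send the
 * signal, and report failure as a negative errno value.
 */
static int kill_logged(pid_t pid, int sig)
{
	fprintf(stderr, "sending signal %d to pid %d\n", sig, (int)pid);
	if (kill(pid, sig) < 0) {
		int err = errno; /* save before calling anything else */

		fprintf(stderr, "kill(%d, %d) failed: %s\n", (int)pid, sig,
			strerror(err));
		return -err;
	}
	return 0;
}
```

Funnelling all signal sending through one function keeps the log messages uniform and makes it hard to forget the error check.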
Andre Noll [Fri, 16 Apr 2010 11:39:28 +0000 (13:39 +0200)]
Invalidate create_pid if create process has died.
We're checking create_pid against zero at several places, for example before
sending a signal to the create process. So set create_pid to zero in
handle_sigchld() if the create process just died.
Andre Noll [Thu, 25 Mar 2010 13:49:47 +0000 (14:49 +0100)]
Reuse old rsync argv if rsync has to be restarted.
If rsync must be restarted due to an exit code of 12 or 13,
create_rsync_argv() was called even if the old rsync_argv should
be reused in this case. This (correctly) triggers the assertion
assert(!name_of_reference_snapshot);
in create_rsync_argv(). Fix this by not calling create_rsync_argv()
if there is a reference snapshot.
Andre Noll [Fri, 12 Mar 2010 14:47:07 +0000 (15:47 +0100)]
Avoid busy loop on rsync exit status 12 or 13.
Although we set the next snapshot time to now + 60 seconds in case
rsync exits with exit status 12 or 13, we fail to check this time
barrier in case the snapshot creation status is HS_NEEDS_RESTART.
Fix this by adding an additional check in the switch() statement
of the select loop. As this change would trigger the assertion
Andre Noll [Mon, 1 Feb 2010 09:21:35 +0000 (10:21 +0100)]
Introduce --no-resume.
If the dss daemon (or the rsync process) is killed while a snapshot
is being created, e.g. because of a server shutdown, the latest
snapshot remains incomplete until it is removed by the usual snapshot
pruning mechanism.
This patch changes the snapshot creation behaviour if the
most recently created snapshot happens to be incomplete and the
new --no-resume option is not given. In this case the directory
of the incomplete snapshot is reused as the destination directory
for the new snapshot.
This change saves disk space and reduces the snapshot creation time,
depending of course on how far the previous rsync process got before
it was interrupted.
Andre Noll [Fri, 28 Aug 2009 13:23:57 +0000 (15:23 +0200)]
Properly invalidate create_pid also for the post-create hook.
If the process associated with the create_pid dies, handle_sigchld()
investigates snapshot_creation_status to tell whether the pre-create
hook, the rsync process or the post-create hook has died.
In the first two cases, handle_pre_create_hook_exit() and
handle_rsync_exit() are called, respectively. Both functions correctly
invalidate create_pid (by resetting it to zero).
However, the post-create hook handling code fails to reset
create_pid. This causes dss to send SIGTERM to this pid on exit,
which might be fatal as the pid might have been reassigned to some
unrelated process in the meantime.
Fix this bug by moving the invalidation of create_pid to the end of
the "if (pid == create_pid)" clause, which even saves a line of code.
Many thanks to Sebastian Stark who pointed out that bug.
Andre Noll [Fri, 28 Aug 2009 09:28:57 +0000 (11:28 +0200)]
Improve error diagnostics.
When parsing the command line options we must not error out if a
required option was not given because that option might be specified
in the config file. Therefore we have to call cmdline_parser_ext()
with params->check_required = 0.
However, if --config-file is not given and the default config file
(~/.dssrc) does not exist, we end up with no check for required
options at all.
In particular, if the required --dest-dir option is not given,
conf.dest_dir is NULL and we call chdir(NULL) which fails with EFAULT
at least on Linux. This causes dss to print the error message
Aug 28 11:35:07 main: Bad address
which is not really helpful. Fix this shortcoming by calling
cmdline_parser_ext() _again_ if no config file was read by
parse_config_file(). This second call uses params->check_required =
1, so that a proper error message is printed if any required options
are missing.
Andre Noll [Fri, 28 Aug 2009 09:12:30 +0000 (11:12 +0200)]
Improve next_snapshot_is_due().
Currently it's a bit weird how next_snapshot_is_due() decides whether
the next snapshot time has to be (re-)computed:
On startup, next_snapshot_time is zero as it is declared
static.
next_snapshot_is_due() checks whether next_snapshot_time is
greater than the current time. If yes, then next_snapshot_time
need not be updated and the function returns false.
Otherwise (e.g. if it is called for the first time),
next_snapshot_time is recomputed, next_snapshot_is_due()
checks again if it is greater than the current time and
returns false if it is, true otherwise.
Consequently, dss computes the next snapshot time twice per snapshot.
Moreover, it compares next_snapshot_time twice against the current time
where one comparison would suffice. The code is thus less efficient
and harder to understand than necessary. This patch addresses both
issues. It introduces the two trivial helper functions
next_snapshot_time_is_valid() and invalidate_next_snapshot_time().
The former function simply tests next_snapshot_time against zero. It
is called from next_snapshot_is_due(). If it returns false, the new
compute_next_snapshot_time() is called (which makes next_snapshot_time
valid). Next, the usual comparison against the current time is
performed.
invalidate_next_snapshot_time() sets next_snapshot_time to zero. It
is called from pre_create_hook() and from handle_sighup(); the latter
call is necessary because changes in the config file might lead to
different snapshot creation times.
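The resulting control flow can be sketched as below. The function names are those mentioned above, but their bodies, the extra "now" parameter and the fixed snapshot_interval are assumptions made to keep the example self-contained; the real computation depends on the configuration and the existing snapshots.

```c
#include <stdint.h>

static int64_t next_snapshot_time; /* zero means "not valid" */

/* Hypothetical fixed spacing between snapshots, in seconds. */
static const int64_t snapshot_interval = 3600;

static int next_snapshot_time_is_valid(void)
{
	return next_snapshot_time != 0;
}

static void invalidate_next_snapshot_time(void)
{
	next_snapshot_time = 0;
}

/* Placeholder computation; makes next_snapshot_time valid. */
static void compute_next_snapshot_time(int64_t now)
{
	next_snapshot_time = now + snapshot_interval;
}

/* One computation at most, and exactly one comparison per call. */
static int next_snapshot_is_due(int64_t now)
{
	if (!next_snapshot_time_is_valid())
		compute_next_snapshot_time(now);
	return next_snapshot_time <= now;
}
```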
Andre Noll [Thu, 27 Aug 2009 14:27:29 +0000 (16:27 +0200)]
Simplify computation of next snapshot time.
Using an int64_t rather than a struct timeval for the next snapshot time
makes the code simpler and more readable as we don't have to use the
tv_xxx() functions to perform manipulations.
Andre Noll [Thu, 27 Aug 2009 12:55:02 +0000 (14:55 +0200)]
Fix off-by-one bug in find_outdated_snapshot().
The man page sayeth:
"dss removes any snapshots older than n times u",
where n is the number of unit intervals and u is the duration of
a unit interval. As intervals count from zero, this means that a
snapshot should be considered outdated if its interval number
is greater than _or equal to_ n.
However, the current code only removes snapshots in intervals
strictly greater than n. Fix this bug and clarify the documentation.
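The corrected comparison can be illustrated with a small sketch. Both helper names and the way the interval number is derived from the snapshot age are assumptions for illustration; only the ">=" versus ">" distinction comes from the description above.

```c
#include <stdint.h>

/*
 * Hypothetical: number of the unit interval a snapshot falls into,
 * counting from zero for the most recent interval.
 */
static int64_t interval_of(int64_t now, int64_t creation_time,
		int64_t unit_seconds)
{
	return (now - creation_time) / unit_seconds;
}

/* Outdated means interval >= n; the buggy check was effectively '>'. */
static int snapshot_is_outdated(int64_t interval, int64_t num_intervals)
{
	return interval >= num_intervals;
}
```

With n = 4 and a unit of one day, a snapshot that is exactly four days old lands in interval 4 and is now considered outdated.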
Andre Noll [Wed, 26 Aug 2009 11:24:12 +0000 (13:24 +0200)]
Make snapshot creation trump snapshot removal.
This patch changes the behaviour of dss in case the following three
conditions are all met:
(1) There is at least one snapshot which could be deleted
(orphaned, redundant, outdated).
(2) Disk space is not low.
(3) A new snapshot is due.
A common case where these conditions are fulfilled is when dss is
started after it was not running for some time, for example due to
a server crash on Friday evening...
In this situation the old code created a new snapshot only after all
orphaned/redundant/outdated snapshots had been removed, which can take hours
and is not what one usually wants to happen in the above mentioned
scenario. Instead, it is desirable to create a new snapshot ASAP,
and only after this snapshot has been created, the removal of old
snapshots should take place.
This patch implements this behaviour and goes even one step further:
If disk space is not low and a new snapshot is due or being created,
dss won't trigger snapshot removal any more.
Another positive side effect of this change is that snapshot creation
times become more stable since the rsync process will only be
interrupted by a rm process if disk space is low.
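The new policy boils down to a single predicate, sketched here under the assumption that the three conditions are available as booleans. The function name and its parameters are hypothetical and do not appear in the dss sources.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: may dss start removing a snapshot right now?
 * Low disk space always permits removal; otherwise removal has to wait
 * while a snapshot is due or being created.
 */
static bool removal_may_start(bool disk_space_low, bool snapshot_due,
		bool snapshot_being_created)
{
	if (disk_space_low)
		return true;
	return !snapshot_due && !snapshot_being_created;
}
```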
Andre Noll [Mon, 29 Jun 2009 09:36:25 +0000 (11:36 +0200)]
Try harder to avoid removing the reference snapshot.
With the old code, it was possible that dss decided to remove the
snapshot which is currently being used as the hardlink directory for
the current rsync process. This patch changes the behaviour so that
reference snapshots are never removed.
The downside of this approach is of course that it is now easier to
fill up the disk.
Andre Noll [Thu, 4 Jun 2009 11:47:35 +0000 (13:47 +0200)]
Clean up snapshot removal logic.
Replace the remove_xxx_snapshot() functions by find_xxx_snapshot()
and start the actual removal in the caller. This simplifies the
code a bit and makes it much more readable. It also simplifies the
--dry-run handling for com_prune().
Andre Noll [Fri, 29 May 2009 20:09:06 +0000 (22:09 +0200)]
Remove orphaned snapshots first if disk space is low.
If the dss process gets killed, an orphaned snapshot might result.
Detect this case and prefer to remove such orphaned snapshots before
resorting to removing the oldest snapshot.
Andre Noll [Fri, 29 May 2009 11:49:13 +0000 (13:49 +0200)]
Restart rsync also on exit value 12.
An exit value of 12 means "Error in rsync protocol data stream",
and it happens regularly for rsync-3 without apparent reason. So
restart the rsync process also in this case.
Andre Noll [Wed, 8 Apr 2009 13:23:01 +0000 (15:23 +0200)]
Implement rm-hooks.
This adds calls for the pre-remove and the post-remove hooks, similar to
the pre-create/post-create hooks. If the pre-remove hook fails, snapshot
deletion is deferred until the hook succeeds.
Andre Noll [Mon, 30 Mar 2009 10:49:03 +0000 (12:49 +0200)]
Fix the pre-create hook.
Returning non-zero from the pre-create hook caused dss to exit with
"unexpected exit code" rather than waiting until the hook returns
zero.
Fix this bug and also reduce the verbosity of the log messages caused
by executing the pre-create hook: It should be enough to tell the user
only once per hour that no more snapshots are going to be created.
Andre Noll [Mon, 16 Mar 2009 15:57:56 +0000 (16:57 +0100)]
Use only one global variable for snapshot creation pids.
There's no need to have pre_create_hook_pid, rsync_pid and
post_create_hook_pid because only one of them can be running at
any point in time. We can always tell which it is by examining the
snapshot_creation_status.
So replace these three variables by the single create_pid variable.
Besides eliminating two global variables, this change also fixes a
real bug: If the dss process catches SIGINT or SIGTERM, the old code
would only kill a running rsync process but not the pre-create or
post-create hook. However, the new code kills whatever create process
is currently running, which is the right thing to do.
Andre Noll [Mon, 8 Dec 2008 16:17:50 +0000 (17:17 +0100)]
Fix check when to use rsync locally.
We can do this if (a) remote_host_arg is "localhost" and (b)
remote_user_arg is the same as logname. The old code only
looked at the logname and thus tried to use rsync locally even
if a remote_host_arg was specified.
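The corrected check, expressed as a standalone predicate. The function name and the plain-string parameters are assumptions; the actual code reads these values from the parsed configuration.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical predicate: rsync may run locally only if both conditions
 * hold, (a) the host is "localhost" and (b) the remote user equals the
 * login name. The old code tested (b) alone.
 */
static bool use_rsync_locally(const char *remote_host,
		const char *remote_user, const char *logname)
{
	return strcmp(remote_host, "localhost") == 0
		&& strcmp(remote_user, logname) == 0;
}
```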
Andre Noll [Thu, 6 Nov 2008 09:32:01 +0000 (10:32 +0100)]
Prevent busy loops on rsync exit code 13.
We restart the rsync process in case it returned with exit code 13
which unfortunately happens for some unknown reasons even with a
valid configuration.
This may lead to a busy loop, so wait at least one minute before
restarting rsync.
Sebastian Stark [Wed, 22 Oct 2008 12:59:11 +0000 (14:59 +0200)]
Open /dev/null for reading AND writing when executing rsync.
This is needed for child processes to be able to write to fd 2 without failing.
For example, rsync will not be able to write an error message because of "Bad
file descriptor" which in turn leads to rsync exiting with meaningless exit
code 13 ("Errors with program diagnostics"), masking the actual error and exit
code.
The fact that rsync uses exit code 13 in that case makes this bug particularly
painful since 13 is interpreted by dss as a temporary rsync error that can be
"fixed" by simply restarting rsync. This can lead to an infinite loop,
obviously.
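A sketch of the idea, assuming a hypothetical helper that points one descriptor at /dev/null. The key detail is O_RDWR: a descriptor opened with O_RDONLY would make every write to it fail with "Bad file descriptor".

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: make file descriptor "target" refer to /dev/null,
 * opened for reading and writing. Returns 0 on success, -1 on error.
 */
static int point_fd_at_devnull(int target)
{
	int fd = open("/dev/null", O_RDWR);

	if (fd < 0)
		return -1;
	if (fd == target)
		return 0; /* target was closed and got reused by open() */
	if (dup2(fd, target) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}
```

A child process would call this for descriptors 0, 1 and 2 between fork() and exec, so that rsync can still write its diagnostics to fd 2.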
Andre Noll [Wed, 21 May 2008 14:36:13 +0000 (16:36 +0200)]
Fix the exit hook.
As dss_exec_cmdline_pid() uses the space character as a separator
to split the command line, the words of the error message were
passed as separate parameters to the exit hook.
Use dss_exec() directly to avoid this flaw, i.e. to pass the full
error message as $1 to the exit hook.