
Re: fsfs revprop packing in f5 Re: Does fsfs revprop packing no longer allow usage of traditional backup software?

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Thu, 7 Jul 2011 01:33:27 +0300

Some discussion on IRC concerning editing manifest records in-place,
rather than via move-into-place of a tempfile, boiled down to "manifest
records should not cross OS page boundaries", and therefore "manifest
record length (i.e., the number of bytes for one revision's manifest
entry) should be a power of 2".
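
The power-of-two argument can be checked with a small sketch. This is
not Subversion code: the function names, the candidate page sizes, and
the assumption that records are packed back to back starting at offset 0
(no unaligned header in front of them) are all illustrative. It also
assumes page and block sizes are themselves powers of two and no smaller
than one record.

```python
# Assumed page/block sizes to test against; all powers of two.
PAGE_SIZES = (512, 4096, 65536)


def crosses_boundary(offset, length, page_size):
    """Return True if bytes [offset, offset + length) span two pages."""
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return first_page != last_page


def any_record_crosses(record_len, page_size, n_records=10000):
    """Check fixed-length records packed back to back from offset 0."""
    return any(
        crosses_boundary(i * record_len, record_len, page_size)
        for i in range(n_records)
    )


if __name__ == "__main__":
    for page_size in PAGE_SIZES:
        for record_len in (16, 24, 32):
            verdict = any_record_crosses(record_len, page_size)
            print(page_size, record_len, "crosses" if verdict else "ok")
```

A power-of-two record length divides every larger power-of-two page
size, so a record that starts on a multiple of its own length always
ends inside the same page. A length such as 24 does not divide 512, so
some record eventually straddles a boundary.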

This came up in the context of a proposal involving in-place editing of
manifests, which hopefully one of us will bring to this list in
a separate email if it proves fruitful.

00:54:59 @danielsh | when you say "I/O reordering", do you mean
00:55:11 @danielsh | "write()s by any one process are (not) done in chronological order"?
00:55:23 @stefan2 | exactly
00:55:32 @danielsh | ok
00:55:36 @danielsh | so, yes, that's one point I had in mind
00:55:41 @danielsh | today we only rely on move-into-place
00:55:52 @danielsh | any [pg]-esque suggestion means we require something new
00:56:05 @stefan2 | but they should become visible in chronological order between processes - at least within the same
                        | address page
00:56:06 @danielsh | in this case, correct handling of overwriting of bytes in a file
00:56:34 @danielsh | re what you just said
00:56:40 @danielsh | why? also between threads of the same process
00:56:53 @danielsh | [ and I don't follow how/why address spaces factor in ]
00:57:24 @stefan2 | file caches use memory?
00:57:56 @danielsh | ah.
00:58:01 @stefan2 | the OS might reorder stuff for different pages but not for the same
00:58:15 @stefan2 | (no idea how relevant / likely that actually is)
00:58:18 @danielsh | so you said should == 'will probably be' rather than 'should == need to be'
00:58:38 @stefan2 | yes
00:59:02 @danielsh | so. if the manifest record crosses a page boundary we might have a VERY edge case bug?
00:59:42 @danielsh | have to admit I wouldn't have considered that... I would have stopped at the file level not at the
                        | page-that-file-is-cached-in level
01:01:16 @stefan2 | I have no idea how an OS makes shared file content visible to the respective processes. the edge
                        | case might happen only if a "record" crosses the page boundary
01:02:16 @danielsh | ack
01:02:49 @danielsh | you're saying the OS might not be ACID'ing the view of the file. (since it presents a view that
                        | never was present on disk)
01:02:59 @danielsh | who knows, that may be a risk we'll take
01:03:25 @stefan2 | there is copy-on-write semantics for (some forms of) memory mapped files under windows. Depending on
                        | when the respective page is mapped it may see the old or new content (the old page got copied while
                        | the new one didn't need to back then)
01:03:45 @peterS | danielsh: if the manifest record crosses a page boundary, the reason this might matter is in case of
                        | an OS crash. one block is written, the next not
01:04:29 @stefan2 | danielsh: yes. but I have no actual facts / pointers to support that suspicion.
01:04:34 @peterS | but if we're using 16-byte records and aligning to 16 byte boundaries, it would take a very strange
                        | disk block size to make this possible
01:05:34 @peterS | if we think it's unreasonable to assume 1000 revprop blobs will always average less than 256 GB per
                        | blob, though, then we'll need more than 16 bytes. or a binary representation as I've argued for
                        | before
01:05:34 @danielsh | stefan2, peterS: so between the two of you you're arguing that records should not cross page
                        | boundaries
01:05:56 @danielsh | fine, and I'll raise that point on dev@ for posterity
01:06:18 @peterS | I think that's reasonable. and we don't even know the disk or filesystem block size, i.e., we can't
                        | _really_ assume it's greater than, say, 512 bytes
01:06:31 @peterS | so we really want to align on a power-of-two
01:06:33 @stefan2 | peterS: any 2^N size should work
01:06:42 @peterS | ...stefan2 agreed (:
01:06:48 @stefan2 | indeed ;)
01:07:09 @danielsh | +1
Received on 2011-07-07 00:35:08 CEST

This is an archived mail posted to the Subversion Dev mailing list.