[RFC] FSFS filesystem options (long, sorry)

From: Malcolm Rowe <malcolm-svn-dev_at_farside.org.uk>
Date: 2007-03-05 08:08:51 CET

[Summary: I'd like to implement per-repository options for FSFS, stored
in a new config file at creation time. Search for 'the plan'
if you want to know how, or read on if you want to know why.]

I've been thinking a bit recently about FSFS's scalability and
performance, and there are two non-backward-compatible things that I'd
like to be able to implement.

The first is the ability to split your revs/ and revprops/ directories
into separate 'buckets' or 'shards', so that we don't require a
repository with a million revs to contain a million files in one
directory.

I know that our (well, my) normal position on this has been to suggest
that the admin use a better-suited filesystem (say, ext3/htree), but
in some cases the administrator may not have the ability to change the
filesystem - particularly if they're using some form of NAS.

For large repositories, the ability to easily copy a block of revisions
off to another storage device and then remount that into the main
revision tree by means of a symlink or mount-point is also quite nice
(instead of requiring that they do it rev-by-rev; I'll handwave past the
question of how you do that switch atomically).

Finally, and again dependent on the filesystem, there may be a hard
limit on the maximum number of entries allowed in a directory.

So the solution is pretty straightforward: just change the structure so
that (with, say, 10000 files per directory) r0,r1,r10000,r10001 go into
revs/0/0, revs/0/1, revs/1/10000, revs/1/10001, etc.
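
To make the mapping concrete, here is a rough sketch in Python (not the
actual FSFS code, which is C). The function name and the
max_files_per_dir parameter are made up for illustration; passing None
models today's flat layout.

```python
import posixpath


def rev_path(revs_dir, rev, max_files_per_dir=None):
    """Return the path of the file holding revision `rev`.

    max_files_per_dir is a hypothetical per-repository option;
    None means the existing flat layout (revs/<rev>).
    """
    if max_files_per_dir is None:
        return posixpath.join(revs_dir, str(rev))
    # Integer division picks the shard: with 10000 files per directory,
    # r0..r9999 land in shard 0, r10000..r19999 in shard 1, and so on.
    shard = rev // max_files_per_dir
    return posixpath.join(revs_dir, str(shard), str(rev))
```

The same function would serve for revprops/ by passing a different
revs_dir.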

But at the same time, I don't want to force this change upon anyone
who does have a decent OS filesystem. The scheme above would almost
certainly be less efficient on any tree-based filesystem, since we're
adding another level to the hierarchy. Additionally, the correct maximum
number of files will differ depending upon the filesystem.

Okay, that's the first thing I'd like to do. The second is a little
more specialised.

For those people who are using a NAS to store the repository, FSFS
really really really sucks. I did some trivial profiling (with strace)
recently to measure the number of open/read-or-write/close cycles that
FSFS goes through to make a small commit (an add of two files and one
directory containing another file, in my case).

The resultant number of open-close style operations was something like
130 or so, for four nodes updated. Each node was read fully three or
four times, and rewritten fully three times over. The txn root node was
fully re-read 18 times!

(Opening and closing files over a network file system - and in particular,
NFS - really really sucks, in case you didn't know.)

Now some of those problems I've identified as fixable, and so I hope to
be able to fix them in due course. The main problems, though, are design
ones, and very much harder to fix - the noderevs need to be marshalled
to disk and back again several times, for example.

So I've a much simpler way to fix this problem: allow the repository
administrator to specify an area of the _local_ filesystem to use for
marshalling transaction data. Only the proto-rev and revprops files
actually need to be stored on the NAS: the others can just remain on the
local disk until it's time to do the final commit.

(So we might be writing to, say, /var/tmp/myrepo/1-1.txn/node._1.0 and
/nfs/repos/myrepo/db/transactions/1-1.txn/rev).
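
A minimal Python sketch of that routing decision, under some loud
assumptions: the helper name is invented, and I'm assuming the
proto-rev file is called 'rev' (as in the example path above) and the
revprops file 'props'. Everything else goes to the local directory when
one is configured.

```python
import posixpath

# Assumption: these are the only transaction files that must live in the
# repository itself; 'props' as the revprops filename is a guess.
SHARED_FILES = {"rev", "props"}


def txn_file_path(repo_db, local_txn_dir, txn_id, filename):
    """Pick the directory for one file of an in-progress transaction.

    local_txn_dir is the hypothetical per-repository option; None means
    keep everything under <repo_db>/transactions as today.
    """
    if local_txn_dir is None or filename in SHARED_FILES:
        base = posixpath.join(repo_db, "transactions")
    else:
        base = local_txn_dir
    return posixpath.join(base, txn_id + ".txn", filename)
```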

But again, this isn't suitable for everyone. And in this case, there's
not even a reasonable default location for the local storage.

So, here's the plan:

- Accept FSFS filesystem options at 'svnadmin create' time.
   (perhaps in the cases above we'd name them --fsfs-max-files-per-dir=N
   and --fsfs-local-txn-dir=/foo).
- Store those options into a new db/config file, probably in our normal
   config format. If neither option is used, don't create a file.
- Bump the fs format number if the db/config file is created, because
   older clients won't know to read it.
- The compatibility rule is: if a client encounters an option it doesn't
   know about, abort: it can't safely read the filesystem.
- Don't offer any way to change the config manually after creation save
   for a dump/load cycle, at least not initially.
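
Roughly, the create-time and read-time halves of the plan might look
like the Python sketch below. Everything named here is a placeholder:
the section name 'fsfs', the option spellings, and the two format
numbers are assumptions for illustration only, and Python's
configparser merely stands in for our own config parser.

```python
import configparser
import io

SECTION = "fsfs"                       # hypothetical section name
KNOWN_OPTIONS = {"max-files-per-dir": int, "local-txn-dir": str}
OLD_FORMAT, NEW_FORMAT = 1, 2          # placeholder format numbers


class UnsupportedOption(Exception):
    """A reader met an option it doesn't know; it must not continue."""


def create_config(**options):
    """Return (format_number, config_text) for a new repository.

    With no options set, no file is written (config_text is None) and
    the format number is not bumped.
    """
    chosen = {k.replace("_", "-"): str(v)
              for k, v in options.items() if v is not None}
    if not chosen:
        return OLD_FORMAT, None
    parser = configparser.ConfigParser(interpolation=None)
    parser[SECTION] = chosen
    out = io.StringIO()
    parser.write(out)
    return NEW_FORMAT, out.getvalue()


def read_config(text):
    """Parse db/config text, aborting on any unknown section or option."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        if section != SECTION:
            raise UnsupportedOption("unknown section: " + section)
        for name, value in parser[section].items():
            if name not in KNOWN_OPTIONS:
                raise UnsupportedOption("unknown option: " + name)
            result[name] = KNOWN_OPTIONS[name](value)
    return result
```

An older client never gets as far as read_config: it sees the bumped
format number and refuses the filesystem before looking for db/config.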

In the future, there might also be some options that _can_ be readily
changed at runtime (writing out svndiff0 in preference to svndiff1,
for example). I'm not proposing any of those at the moment, but if
they did come along, I'd probably put them into a separate section of
the config ("[hints]" comes to mind). Unknown options in that section
would not be a fatal error for clients.
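
The difference between the two kinds of section could be sketched like
this in Python. The input is an already-parsed config (a dict of
sections); the 'fsfs' section name and the 'prefer-svndiff0' hint are
invented for the example, and only "[hints]" comes from the text above.

```python
KNOWN_REQUIRED = {"max-files-per-dir", "local-txn-dir"}  # hypothetical names
KNOWN_HINTS = {"prefer-svndiff0"}                        # hypothetical name


def check_config(sections):
    """Return the options this reader will honour.

    sections maps section name -> {option: value}. Unknown options are
    fatal everywhere except in [hints], where they are simply skipped.
    """
    honoured = {}
    for section, options in sections.items():
        for name, value in options.items():
            if section == "hints":
                if name in KNOWN_HINTS:
                    honoured[name] = value
                # An unknown hint is ignored: it cannot affect
                # whether the filesystem is readable.
            elif section == "fsfs" and name in KNOWN_REQUIRED:
                honoured[name] = value
            else:
                raise ValueError("cannot safely read filesystem: "
                                 "unknown option %s/%s" % (section, name))
    return honoured
```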

Thoughts?

Regards,
Malcolm

Received on Mon Mar 5 08:09:10 2007