Re: Sharded FSFS repositories - summary

From: Ph. Marek <philipp.marek_at_bmlv.gv.at>
Date: 2007-03-14 08:54:14 CET

On Tuesday 13 March 2007 15:34, Malcolm Rowe wrote:
...
> But neither of those are the main reason to do this. (realistically,
> how many times a month will a typical admin do an 'ls' in revs/ ?)
>
> 4000 revs is a good compromise: it's big enough that it scales to large
> repositories (ASF's repository would be halfway towards needing another
> level if we went with 1000 files-per-shard), and it's small enough that
> it works everywhere we need it to (even on Coda, it seems :-)).
Well, with 4000 you can't tell at a glance where r454513 is, can you?
As I said, I'd prefer an integer power of ten.
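
To illustrate, here is a minimal sketch. It assumes the simplest one-level
layout, where a revision file lives in directory number rev // shard_size;
the function name shard_of is made up for this example and is not taken from
the FSFS code.

```python
def shard_of(rev, shard_size):
    """Return the shard directory number for a revision.

    Assumes a hypothetical one-level layout: revs/<shard>/<rev>.
    """
    return rev // shard_size


# With 1000 per shard the directory is just the revision number with the
# last three digits cut off; with 4000 you need to do a division.
print(shard_of(454513, 1000))  # 454
print(shard_of(454513, 4000))  # 113
```

With a power of ten the shard can be read straight off the revision number,
which is all I'm asking for.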

> It doesn't look like multi-level trees would be needed for performance
> until you hit somewhere around c.100M revisions, and I'm not aware of
> anyone who's anywhere near that level yet :-)
As you said above, ASF (and KDE) are about to reach a million revisions, so
with 1000 files per shard three levels would be better.
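
The arithmetic behind that, as a rough sketch: assume every directory holds
at most per_level entries and count the file level itself as one level, so
one level is the flat layout and two levels is a single layer of shard
directories. The helper levels_needed is hypothetical, purely for
illustration.

```python
def levels_needed(revisions, per_level):
    """Smallest number of levels such that per_level ** levels >= revisions.

    One level means a flat directory; two levels means one layer of shard
    directories above the revision files; and so on.
    """
    levels = 1
    capacity = per_level
    while capacity < revisions:
        capacity *= per_level
        levels += 1
    return levels


# 1000 per level: two levels hold exactly one million revisions,
# so one more revision already asks for a third level.
print(levels_needed(1_000_000, 1000))  # 2
print(levels_needed(1_000_001, 1000))  # 3
# 4000 per level: two levels hold 16 million revisions.
print(levels_needed(1_000_001, 4000))  # 2
```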

> > If this number would go into an existing file (format, fs-type), it would
> > not require another read;
>
> Sure, but it's the complexity that concerns me - we really need to
> demonstrate a tangible benefit to make it that much more complex.
Ok.

> > and if we allowed not one, but two such numbers here,
> > the repository could be re-arranged on-line.
> > (The fs-layer had to be looking for both files until one was found - as
> > was already recommended).
> Why would you ever need to do something like that?
>
> (What Karl recommended was online conversion from the 'flat' to
> 'sharded' scheme, which I still think is too complex for the slight
> benefit [of having a slightly faster upgrade] it gives).
For on-line rearrangement. That may be a seldom-used operation, but especially
for the big repositories it should be as painless as possible.
And if you already look for two paths (with and without sharding), you might
as well look for two sharded layouts.
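
Roughly what I have in mind, as a sketch only: the fs layer tries each
configured layout in turn until the revision file is found. The names
candidate_paths and find_rev, the revs/<shard>/<rev> path scheme, and the use
of None for the flat layout are all assumptions for this example; this is not
how FSFS is implemented, and it ignores locking against a concurrent
rearrangement.

```python
import os


def candidate_paths(revs_dir, rev, shard_sizes):
    """Yield one possible location of a revision file per layout.

    A shard size of None stands for the flat (unsharded) layout.
    """
    for size in shard_sizes:
        if size is None:
            yield os.path.join(revs_dir, str(rev))
        else:
            yield os.path.join(revs_dir, str(rev // size), str(rev))


def find_rev(revs_dir, rev, shard_sizes):
    """Return the first existing candidate path, or None if none exists."""
    for path in candidate_paths(revs_dir, rev, shard_sizes):
        if os.path.exists(path):
            return path
    return None
```

While a repository is being moved from, say, 4000 to 1000 per shard, the
lookup would be called with shard_sizes=[1000, 4000]; once the move is done,
the old size is dropped from the list again.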

> > Have you seen my mail regarding the transaction-directories? Maybe the
> > naming there could be done with the same function.
> They could, but how frequently do you commit transactions with 100,000
> changed files? Maybe on an initial import, but in that case the time
> spent writing the data is going to dwarf the time spent looking up the
> entries, or at least that's my intuition. You're quite welcome to
> benchmark the difference to see what it actually is.
I did, some time ago.
Testing ext3 with or without dir-index against bdb for a large number of files
(10 000 or more) showed bdb to be the fastest.
Whether that's because bdb is a bit better with file handling (deltification
*after* commit, IIRC?) or just because the number of files is much smaller, I
can't say.

The difference between ext3 without dir-index and bdb was: bdb finished in 3
hours, ext3 got killed after 12 without having finished.
ext3 with dir-index? Don't recall, maybe 4 hours.

As a side note: just a "dpkg-query -L <packages>" of the changed packages in
debian-unstable (from yesterday to today) gives 2638 lines. That includes
directories -- which are not files -- but they have properties too, like
normal files.
So if you dist-upgrade only once a week, you're likely to get 10 000 files
changed.

Regards,

Phil

Received on Wed Mar 14 08:54:39 2007
