
Re: Sharded FSFS repositories - summary

From: Matthias Wächter <matthias.waechter_at_tttech.com>
Date: 2007-03-15 13:49:27 CET

On 13.03.2007 13:47, Ph. Marek wrote:
>> - We'll create shards of 4000 entries each. That's large enough that
>> someone would have to hit 16M revisions before a larger value would be
>> an improvement, but small enough that it has reasonable performance
>> (and support) on all filesystems that I'm aware of. It's also a
>> power-of-ten, so easier for humans to understand.
> 4000 is no (integer) power of ten, so would not really be better.
> Quick, in which directory is revision 421712? (see KDEs repository)
>
> If I understand you correctly, you want to have
> 0/1
> 0/2
> 0/3
> ...
> 0/3999
> 1/4000
> 1/4001
> ...
> 2/8000
> and so on. Right?
>
> I'd prefer to have a *real* (integer :-) power of ten, eg. 1000. And TBH, 4000
> is a bit too much (for me, at least) - 1000 would be high, but acceptable.
> (I'd really prefer 100 and 3 or 4 levels - but I seem to be alone with that.)
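To make the "quick, which directory?" point concrete, here is a minimal sketch (names are mine, not from any proposal) of the lookup under both shard sizes:

```python
def shard_dir(rev, shard_size):
    """Return the top-level directory that would hold revision `rev`
    when each shard holds `shard_size` revisions."""
    return rev // shard_size

# With 4000-revision shards, finding the directory takes a division:
print(shard_dir(421712, 4000))  # -> 105
# With 1000-revision shards, you can just drop the last three digits:
print(shard_dir(421712, 1000))  # -> 421
```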

How about:

        0/
        1/1
        2/2
        3/3
        ...
        3999/3999
        0/4000
        1/4001
        ...
        3999/7999
        0/8000

and so on?
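The layout above amounts to a modulo on the revision number. A sketch (the function name is hypothetical, not from any implementation):

```python
def round_robin_path(rev, shards=4000):
    """Top-level directory cycles round-robin through `shards` slots;
    the revision file itself keeps its full number."""
    return f"{rev % shards}/{rev}"

print(round_robin_path(1))     # -> "1/1"
print(round_robin_path(4001))  # -> "1/4001"
print(round_robin_path(7999))  # -> "3999/7999"
```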

Advantage: Each top-level directory could be stored on a separate storage device, probably increasing bandwidth, since revisions are typically read sequentially by number. (Q: Is this true? Or does it presume pipelined file access that is not yet implemented?)

Disadvantage 1: All top-level directories are created before the 4000th revision, so you don't see the repository "grow" at the top level by the number of sub-directories.

Disadvantage 2: You cannot move "finished" directories onto non-backed-up storage (assuming a good archive of them exists), since every directory may receive new files every now and then.

I like the idea of having the divisor be a power of 10 (or should the revision numbers be stored in hex? Then take 4096, which keeps the offset within a shard to 3 hex digits :)).
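If revisions were written in hex with 4096-revision shards, the split would be a plain bit-shift rather than a division. A sketch of that idea (my own illustration, not part of any proposal):

```python
def hex_shard_path(rev):
    """With 4096-revision shards, the shard is the high bits and the
    within-shard offset is exactly the low 3 hex digits."""
    return f"{rev >> 12:x}/{rev & 0xFFF:03x}"

print(hex_shard_path(421712))  # 421712 = 0x66f50 -> "66/f50"
```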

Besides that, multiple levels would be fine, too, and could reduce the impact of Disadvantage 2 above. I would suggest using the top-level directories in a round-robin fashion to maximize throughput, while the second level would follow the base proposal:

        0/
        1/0/1
        2/0/2
        3/0/3
        ...
        9/0/9
        0/0/10
        ...
        9/0/9999
        0/1/10000
        ...
        9/1/19999
        0/2/20000

and so on.
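The two-level scheme above can be sketched as follows (function and parameter names are mine, for illustration):

```python
def two_level_path(rev, top=10, group=10000):
    """Top level cycles round-robin over `top` slots (rev % 10);
    the second level advances once every `group` (10,000) revisions."""
    return f"{rev % top}/{rev // group}/{rev}"

print(two_level_path(1))      # -> "1/0/1"
print(two_level_path(10000))  # -> "0/1/10000"
print(two_level_path(20000))  # -> "0/2/20000"
```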

In this scheme, I only have 10 top-level directories (easier to split over multiple disks), and each of them gains one new sub-directory with every step of 10,000 revisions, while each sub-directory contains only 1,000 revisions. With each step of 10,000 revisions, all "finished" second-level directories could be excluded from subsequent backups.

Of course, this all only makes sense if there is a performance benefit for splitting sequential accesses over multiple storage spaces.

- Matthias

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Mar 15 13:49:55 2007

This is an archived mail posted to the Subversion Dev mailing list.
