On 3/13/07, John Peacock <jpeacock@rowman.com> wrote:
>
> Malcolm Rowe wrote:
> > - We'll create shards of 4000 entries each. That's large enough that
> > someone would have to hit 16M revisions before a larger value would be
> > an improvement, but small enough that it has reasonable performance
> > (and support) on all filesystems that I'm aware of. It's also a
> > power-of-ten, so easier for humans to understand.
>
> I have to say that I find "revs/N/12345 where N = 12345/constant" to be
> most human unfriendly, where N isn't an actual power of 10. I can't
> divide large numbers by 4000 in my head, but I could if it were 1000.
> I'm also concerned about the performance characteristics of NTFS (in
> particular), which seems to degrade much more quickly (to the point
> where I find it hard to even get a directory of the parent folder when
> a child folder has thousands of entries).
>
> I suggest that we write a quick script to generate a variety of
> sharding schemes and test them on multiple filesystems, rather than
> just picking something out of thin air. It may be that a multilevel
> system that is closer to a hashing algorithm will be superior to any
> arbitrary [fixed] division.
I think that some kind of multi-level system would be best. Either
something like a new top-level folder every 10,000 revisions, with
sub-folders grouping every 1,000 revisions, or maybe even a top-level
folder every 100,000 revisions, with a sub-folder every 10,000 revisions
and additional sub-folders every 1,000 revisions. This latter option
might be over-optimizing a bit for really large repositories, so I would
probably go with a folder every 10,000 revisions and then, inside those
folders, break it down by every 1,000 revisions.
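
To make that concrete, here is a throwaway sketch (Python, with the
shard sizes above used only as placeholders, and the "revs/" prefix
purely illustrative) of how a revision number would map to a path under
the single-level scheme Malcolm described versus the two-level scheme
I'm suggesting:

    # Throwaway sketch: map a revision number to a shard path.
    # Shard sizes and the "revs/" prefix are illustrative only.

    def single_level_path(rev, shard_size=4000):
        """Single-level scheme: revs/N/REV where N = REV // shard_size."""
        return "revs/%d/%d" % (rev // shard_size, rev)

    def two_level_path(rev, outer=10000, inner=1000):
        """A folder every 10,000 revisions, sub-folders every 1,000."""
        return "revs/%d/%d/%d" % (rev // outer, rev // inner, rev)

    for r in (999, 12345, 1000000):
        print(r, single_level_path(r), two_level_path(r))

For r12345 that gives revs/3/12345 in the first scheme and
revs/1/12/12345 in the second, which at least a human can read off at a
glance.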
--
Thanks
Mark Phippard
http://markphip.blogspot.com/