On Tue, May 22, 2012 at 12:35 PM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
>> Having a stable hash function sure doesn't seem like it would
>> account for that reduction. Can you please elaborate?
>
>
> Subversion up to and including 1.7 serializes directories
> as string->string hashes in FSFS. wordpress.org puts its projects
> at the top level of its repository (just like Apache), so every
> commit writes a new version of that top-level directory. At >26k
> projects, that's >1.4MB per revision.
>
> In 1.8, one may activate directory deltification. After serialization,
> the resulting text will be deltified just like any other node and
> the result zip-compressed. Many revisions are now about 2KB.
> However, that hinges on successive versions of the directories
> producing similar serialized text. Even a random "shift" by a
> large number of entries will leave no 64-byte matches (our xdelta
> granularity) within the 100KB text windows used by xdelta.
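
Concretely, the failure mode looks like this. A minimal sketch of that
kind of serializer (names made up, not the actual FSFS code): the entry
set can be identical between two revisions, but if the hash iterates in
a different order, the two serialized texts share almost no 64-byte
runs and xdelta has nothing to copy.

    #include <apr_hash.h>
    #include <apr_strings.h>

    /* Serialize a directory by walking the hash in iteration order.
     * If apr_hash_first() visits entries in a different order after a
     * hash-function change, two serializations of the *same* entries
     * share almost no 64-byte runs, so xdelta emits a full text
     * instead of a small delta. */
    static const char *serialize_dir(apr_hash_t *entries, apr_pool_t *pool)
    {
        const char *text = "";
        apr_hash_index_t *hi;

        for (hi = apr_hash_first(pool, entries); hi; hi = apr_hash_next(hi)) {
            const void *name;
            void *node_id;

            apr_hash_this(hi, &name, NULL, &node_id);
            text = apr_pstrcat(pool, text, (const char *)name, " => ",
                               (const char *)node_id, "\n", NULL);
        }
        return text;
    }
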
Thanks for the explanation. However, this is a very specific use case
- why can't we just use an explicitly sorted data structure here? I
know Greg coded up a red-black tree for pocore...I'm sure there are
plenty of guaranteed-sorted data structures under an ALv2 license. =)
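
For illustration, here is a minimal sketch of that direction built on
plain apr_tables rather than pocore's tree (all names made up): an
array kept sorted by key, so iteration order is stable by construction.

    #include <string.h>
    #include <apr_tables.h>

    /* A deterministically ordered map: an apr_array kept sorted by
     * key, with binary-search insertion.  Iteration order is stable
     * by construction, unlike apr_hash_t. */
    typedef struct { const char *key; void *val; } entry_t;

    static int lower_bound(const apr_array_header_t *arr, const char *key)
    {
        int lo = 0, hi = arr->nelts;

        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (strcmp(APR_ARRAY_IDX(arr, mid, entry_t).key, key) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    static void sorted_map_set(apr_array_header_t *arr,
                               const char *key, void *val)
    {
        int idx = lower_bound(arr, key);

        if (idx < arr->nelts
            && strcmp(APR_ARRAY_IDX(arr, idx, entry_t).key, key) == 0) {
            APR_ARRAY_IDX(arr, idx, entry_t).val = val;  /* replace */
            return;
        }
        apr_array_push(arr);  /* grow by one element */
        memmove(&APR_ARRAY_IDX(arr, idx + 1, entry_t),
                &APR_ARRAY_IDX(arr, idx, entry_t),
                (arr->nelts - 1 - idx) * sizeof(entry_t));
        APR_ARRAY_IDX(arr, idx, entry_t).key = key;
        APR_ARRAY_IDX(arr, idx, entry_t).val = val;
    }

The array would be created with apr_array_make(pool, 0,
sizeof(entry_t)); a red-black tree would get insertion down from O(n)
to O(log n), but the ordering guarantee is the same.
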
> Well, I am a developer and reproducibility between test runs
> *does* matter to me.
If ordering matters, then the presentation layer should enforce
ordering. I would never expect a hash table to produce a sorted
ordering across runs. (In fact, if it does, it probably means it has
a flaw in the implementation... which is exactly what happened with
APR.)
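
Sorting at the output boundary is cheap and local. A sketch, assuming
NUL-terminated string keys (the function name is made up):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <apr_hash.h>
    #include <apr_tables.h>

    static int cmp_cstring(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Print the keys of HT in sorted order, leaving the hash itself
     * unordered: collect the keys, qsort() them, walk the array. */
    static void print_sorted_keys(apr_hash_t *ht, apr_pool_t *pool)
    {
        apr_array_header_t *keys = apr_array_make(pool, apr_hash_count(ht),
                                                  sizeof(const char *));
        apr_hash_index_t *hi;
        int i;

        for (hi = apr_hash_first(pool, ht); hi; hi = apr_hash_next(hi)) {
            const void *key;

            apr_hash_this(hi, &key, NULL, NULL);
            APR_ARRAY_PUSH(keys, const char *) = key;
        }
        qsort(keys->elts, keys->nelts, keys->elt_size, cmp_cstring);

        for (i = 0; i < keys->nelts; i++)
            printf("%s\n", APR_ARRAY_IDX(keys, i, const char *));
    }
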
> On a more general note: We don't use hashes as a means to
> randomize our data. For us, they are simply containers with
> an average O(1) insertion and lookup behavior. The APR interface
> also allows for iterating that container - so it *has* an ordering
> and it has been quite stable under different operations and
> over many versions of APR.
Iterating isn't the same as ordering.
> The change in 1.4.6 did *not* solve the fundamental performance
> problem, but it does make our lives harder - at least for a while.
> If we want reproducible UI behavior, we must now eliminate
> the use of hashes in all relevant output functions and replace
> them with e.g. sorted arrays. That may take some time.
Sure.
>> And, the third point doesn't make any sense to me without further
>> explanation. -- justin
>>
> When we e.g. do an "svn ls -v" (TSVN repo browser), we
> create and fill the revprop hash for the respective revision
> multiple times for each entry in the directory - just to produce
> a few bytes of output. The hash function showed up in profiles
> at 10% of the total runtime.
Sorry - I'm not following how a sorted hash table achieves this; it
sounds like what you need is reuse of a data structure (avoiding
memory allocations or something), not a hash that is guaranteed to be
stable on every call.
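
A sketch of that kind of reuse (the cache and its lifetime are
hypothetical; svn_fs_revision_proplist() is the real API): keep one
long-lived hash mapping revision numbers to their already-parsed
revprops.

    #include <apr_hash.h>
    #include <apr_strings.h>
    #include <svn_error.h>
    #include <svn_fs.h>

    /* Fetch REV's revprops through a long-lived CACHE (rev -> props),
     * so that listing a 26k-entry directory parses each revision's
     * revprops once instead of once per directory entry. */
    static svn_error_t *
    cached_revprops(apr_hash_t **props,
                    apr_hash_t *cache,   /* lives as long as the listing */
                    svn_fs_t *fs,
                    svn_revnum_t rev)
    {
        *props = apr_hash_get(cache, &rev, sizeof(rev));
        if (*props == NULL) {
            apr_pool_t *cache_pool = apr_hash_pool_get(cache);
            svn_revnum_t *key = apr_pmemdup(cache_pool, &rev, sizeof(rev));

            SVN_ERR(svn_fs_revision_proplist(props, fs, rev, cache_pool));
            apr_hash_set(cache, key, sizeof(*key), *props);
        }
        return SVN_NO_ERROR;
    }
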
> So, I tuned that. Because apr_hash_t is so popular in our code,
> that very localized improvement will give (small) returns in
> improved performance all over the place.
Yes, but there may very well be penalties in other places that you
haven't looked at yet. I know that I've seen shift+incr appear in
traces before - switching to a much more expensive hash algorithm is
going to make those cases worse. -- justin
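
For reference, the classic "times 33" string hash in the style of
APR's default hash function (a sketch, not verbatim APR source). The
multiply compiles down to exactly the shift+add seen in traces, so any
replacement with heavier per-byte mixing does strictly more work in
this loop:

    #include <apr.h>

    /* The classic "times 33" string hash, in the style of APR's
     * apr_hashfunc_default().  hash * 33 == (hash << 5) + hash, which
     * is the shift+add visible in profiler traces; a hash function
     * with stronger per-byte mixing necessarily spends more cycles
     * per key byte than this. */
    static unsigned int times33(const char *key, apr_ssize_t klen)
    {
        const unsigned char *p = (const unsigned char *)key;
        unsigned int hash = 0;
        apr_ssize_t i;

        for (i = 0; i < klen; i++)
            hash = hash * 33 + p[i];
        return hash;
    }
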
Received on 2012-05-23 02:30:58 CEST