
Re: Using svn_hash__make instead of apr_hash__make

From: Stefan Fuhrmann <eqfox_at_web.de>
Date: Tue, 22 May 2012 21:35:04 +0200

On 20.05.2012 17:41, Justin Erenkrantz wrote:
> On Sun, May 20, 2012 at 4:37 AM, Stefan Fuhrmann
> <stefanfuhrmann_at_alice-dsl.de> wrote:
>> Directory deltification making wordpress.org
>> go from 400+GB to 10GB *is* a reason.
>> Without stable hashes, we would need special
>> code for hash deltification.
> Having a stable hash function sure doesn't seem like this would
> account for that reduction. Can you please elaborate?

Subversion up to and including 1.7 serializes directories
as string->string hashes in FSFS. wordpress.org uses projects
as the top level of its repository (just like Apache). So, every
commit writes a new version of that top-level directory. At >26k
projects, that's >1.4MB per revision.

In 1.8, one may activate directory deltification. After serialization,
the resulting text will be deltified just like any other node and
the result will be zip-compressed. Many revisions are now about 2KB.
However, that hinges on successive versions of a directory
producing nearly identical serialized text. Even a random "shift" by
a larger number of entries will leave no 64 byte matches (our xdelta
granularity) within the 100k text windows used by xdelta.
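
The ordering argument can be illustrated with a rough sketch. It is
not FSFS or xdelta code: the serialization format, the entry names
and the block-matching measure are all made up for illustration. It
only counts how many aligned 64-byte blocks of a new directory text
already occur somewhere in the old one, once with the entry order
kept and once with it shuffled.

```python
import random

BLOCK = 64  # match granularity mentioned in the mail


def serialize(entries, order):
    """Serialize a name->id mapping in the given key order (made-up format)."""
    parts = []
    for name in order:
        value = entries[name]
        parts.append(f"K {len(name)}\n{name}\nV {len(value)}\n{value}\n")
    return "".join(parts).encode()


def matched_fraction(old, new, block=BLOCK):
    """Fraction of aligned block-sized chunks of `new` found anywhere in `old`."""
    seen = {old[i:i + block] for i in range(len(old) - block + 1)}
    chunks = [new[i:i + block] for i in range(0, len(new) - block + 1, block)]
    if not chunks:
        return 0.0
    return sum(chunk in seen for chunk in chunks) / len(chunks)


def demo(n=2000, seed=1):
    """Compare a stable and a shuffled entry order after a one-entry change."""
    rng = random.Random(seed)
    entries = {f"project-{i:05d}": f"node-{rng.getrandbits(64):016x}"
               for i in range(n)}
    stable = sorted(entries)
    old = serialize(entries, stable)

    # One "commit" changes a single entry; the value keeps its length.
    changed = dict(entries)
    changed["project-00042"] = "node-ffffffffffffffff"

    shuffled = stable[:]
    rng.shuffle(shuffled)
    return (matched_fraction(old, serialize(changed, stable)),
            matched_fraction(old, serialize(changed, shuffled)))
```

With the order kept, nearly every block of the new text is found in
the old one; with the order shuffled, hardly any block is, so there
is little left for a delta to reuse.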

>> Again, these are my reasons for using svn_hash__make:
>>
>> * consistent behavior of SVN across different APR versions
>> * give devs time to check all the 500+ places that create
>> hashes throughout SVN for implicit assumptions on
>> ordering and such.
>> * performance improvement; particularly with directory-
>> or property-related operations
> I don't believe the first two matter in any tangible way.

Well, I am a developer and reproducibility between test runs
*does* matter to me.

On a more general note: We don't use hashes as a means to
randomize our data. For us, they are simply containers with
an average O(1) insertion and lookup behavior. The APR interface
also allows for iterating that container - so it *has* an ordering
and it has been quite stable under different operations and
over many versions of APR.

The change in APR 1.4.6 did *not* solve the fundamental performance
problem, but it makes our life harder - at least for a while.
If we want a reproducible UI behavior, we must now eliminate
the use of hashes in all relevant output functions and replace
them with e.g. sorted arrays. That may take some time.
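
A minimal sketch of the "sorted arrays" idea: the output function
copies the hash entries into an array and sorts it by key, so the
printed order no longer depends on the container. The example uses
a Python dict, whose insertion order merely stands in for an
implementation-dependent hash order; the function name and the
property values are made up.

```python
def render_props(props):
    """Return output lines for a name->value mapping, sorted by name."""
    return [f"{name} : {value}" for name, value in sorted(props.items())]


# Same content, built in a different order (made-up values).
a = {"svn:log": "fix typo", "svn:author": "alice", "svn:date": "2012-05-22"}
b = {"svn:date": "2012-05-22", "svn:log": "fix typo", "svn:author": "alice"}
```

Iterating a and b directly visits the keys in different orders,
while render_props(a) and render_props(b) produce identical lines.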

> And, the third point doesn't make any sense to me without a further
> explanation. -- justin
>
When we do e.g. an "svn ls -v" (TSVN repo browser), we create
and fill the revprop hash for the respective revision
multiple times for each entry in the directory - just to produce
a few bytes of output. The hash function showed up in profiles
at 10% of the total runtime.

So, I tuned that. Because apr_hash_t is so popular in our code,
that very localized improvement will give (small) returns in
improved performance all over the place.
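
To see why the hash function can matter that much: a string hash
has to touch every byte of every key on each insertion and lookup.
The sketch below assumes the classic "times 33" function, which as
far as I know is what APR's default string hash has traditionally
been. It is not the tuned svn_hash__make code, and the function
name is made up.

```python
def times33(key: bytes, seed: int = 0) -> int:
    """Times-33 string hash, truncated to 32 bits.

    One multiply-add per key byte; `seed` is the starting value.
    """
    h = seed
    for byte in key:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h
```

Filling a revprop hash several times per directory entry repeats
this loop for the same few keys over and over, which is how it can
add up in a profile.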

-- Stefan^2.
Received on 2012-05-22 23:35:40 CEST

This is an archived mail posted to the Subversion Dev mailing list.
