Re: Comparison testing: { FSFS, BDB } x { 1.5.4, trunk }

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: Thu, 30 Oct 2008 20:59:23 -0500

C. Michael Pilato wrote:
> Hyrum K. Wright wrote:
>> C. Michael Pilato wrote:
>>> C. Michael Pilato wrote:
>>>> I did a little bit of simple comparison testing between FSFS and BDB in
>>>> 1.5.4 and trunk. My testing involved loading a dumpfile of 5000 revisions
>>>> (taken from our own repository), then doing some single-revision dumps at a
>>>> few time-slices across the loaded repository.
>>>>
>>>> Here are the highlights (percentages are ballpark estimates).
>>>>
>>>> In trunk, FSFS:
>>>>
>>>> is significantly slower (30%) for write operations. I have no idea why.
>>>>
>>>> is a bit faster for reads (20%).
>>>>
>>>> showed no meaningful disk usage changes. But I'm pretty sure this is
>>>> an artifact of the testing dataset, which isn't as up-to-date-branch-heavy
>>>> as more recent revision ranges in our source tree are.
>>>>
>>>> In trunk, Berkeley DB:
>>>>
>>>> is significantly faster (50%) for write operations. This is almost
>>>> certainly because post-commit deltification is doing a single
>>>> deltification instead of touching a chain of files.
>>>>
>>>> is significantly slower (300%) for read operations. Distance to
>>>> nearest fulltext?
>>>>
>>>> showed significant improvement in disk usage (20% savings) in trunk.
>>>> For the same reasons that FSFS didn't show much improvement here, I
>>>> must assume rep-sharing wasn't the real win here. More likely the
>>>> minimization of fulltexts (one per line of history) is the win here.
>>>>
>>>> In all things except disk usage (now in trunk), FSFS remains a clear winner
>>>> over BDB in this testing.
>>>>
>>>> Attached are the script I used and a spreadsheet with the actual findings.
>>> I've got an uncommitted patch which causes Berkeley DB to store *both* MD5
>>> and SHA1 checksums, and to be able to cough up the one required by callers.
>>> I re-ran the numbers with this patch, and have attached an updated
>>> spreadsheet. What I find is that the space and performance cost for
>>> calculating and storing both checksums is minimal (atop what the trunk code
>>> was already doing). But the read costs dropped by half! I suspect this is
>>> because svn_fs_file_md5_checksum() forces a walk over the file contents if
>>> the MD5 checksum isn't readily available in the database, which is the case
>>> in the current trunk code.
>> Where do we still use svn_fs_file_md5_checksum() explicitly? I thought most of
>> those calls had been switched to svn_fs_file_checksum(), which can have the same
>> behavior you describe, but isn't forced to.
>
> Turns out the calls I was thinking of *are* using svn_fs_file_checksum().
> But they also pass TRUE for force. (These are in libsvn_repos/dump.c.) Six
> of one, a half-dozen of the other...

That's what I figured. I *think* that's the only place we force a checksum
calculation, and that's because we want the md5 to be there for older clients if
somebody's doing a dump-load from 1.6 to pre-1.6. Otherwise, we could just put
whatever checksum we had, sha1 or md5, and then let the loader put the same kind
of checksum in the target repo. That would also save a few of our "ignore this
checksum 'cause it ain't the right kind" conditionals.
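
To make the "force" behavior concrete, here is a rough toy model in Python. It is not the Subversion code: FileNode, file_checksum() and the content_walks counter are hypothetical names invented for illustration, and only the general idea (return a stored checksum if one of the requested kind exists, otherwise walk the contents only when forced) is taken from the discussion above.

```python
import hashlib


class FileNode:
    """Toy stand-in for a stored file (hypothetical, not an svn type)."""

    def __init__(self, contents, stored=None):
        self.contents = contents
        # Checksums already recorded in the "database", e.g. {"sha1": "..."}.
        self.stored = dict(stored or {})
        # Counts how many times the full contents had to be read.
        self.content_walks = 0


def file_checksum(node, kind, force):
    """Return the checksum of the requested kind.

    If it is already stored, return it with no content walk. If it is
    missing, compute it from the contents only when force is true;
    otherwise return None.
    """
    if kind in node.stored:
        return node.stored[kind]
    if not force:
        return None
    node.content_walks += 1  # the expensive path: read the whole file
    return hashlib.new(kind, node.contents).hexdigest()


data = b"hello, repository\n"

# Node that recorded only SHA1: asking for MD5 with force walks the contents.
sha1_only = FileNode(data, {"sha1": hashlib.sha1(data).hexdigest()})
unforced = file_checksum(sha1_only, "md5", force=False)
forced = file_checksum(sha1_only, "md5", force=True)

# Node that recorded both kinds: the forced MD5 request is a plain lookup.
both = FileNode(data, {
    "md5": hashlib.md5(data).hexdigest(),
    "sha1": hashlib.sha1(data).hexdigest(),
})
cached = file_checksum(both, "md5", force=True)
```

In this model the forced MD5 request costs one full read when only SHA1 was recorded and none when both kinds are stored, which is consistent with the suspicion above about why storing both checksums cut the read cost.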

-Hyrum


Received on 2008-10-31 02:59:54 CET
