Re: Compressed Pristines (Summary)

From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Sat, 31 Mar 2012 09:16:33 -0700 (PDT)

----- Original Message -----
> From: Stefan Sperling <stsp_at_elego.de>
> To: Ashod Nakashian <ashodnakashian_at_yahoo.com>
> Cc: "dev_at_subversion.apache.org" <dev_at_subversion.apache.org>
> Sent: Saturday, March 31, 2012 7:47 PM
> Subject: Re: Compressed Pristines (Summary)
>
> On Sat, Mar 31, 2012 at 01:31:15AM -0700, Ashod Nakashian wrote:
>> ::Summary::
>
> Thanks for summarizing the discussion.
> This summary was very useful to me.
>
>> 3) Sqlite has a very reputable code-base and performance, in addition
>> to already being utilized by SVN. Using it for small pristine file
>> storage has been proposed. The major disadvantage is the
>> possibility of abusing and/or overloading Sqlite with this kind of
>> usage that it probably isn't optimized to handle.
>
> Sqlite should be capable of handling this well.
> Are you aware of fossil (http://fossil-scm.org)?
> This is the version control system used to manage sqlite, and not
> surprisingly it stores all of its revision and working copy meta
> data in sqlite. Fossil is kept very simple at its core and might
> have scalability issues with very large projects (such as webkit)
> but nevertheless it is very inspiring. It might be worthwhile to
> take a look at the details of how fossil stores revision file data
> (because fossil is a distributed version control system the pristine
> files are simply stored in the last committed revision).

I hadn't known that fossil uses sqlite, although I was familiar with it (if nothing else, because Sqlite uses it!).

So it's fair to say I'm ignorant about the details, but I must say this: a repository, just like Git's pack files, doesn't necessarily need good (or any) support for deletion. This is a critical point, though I can see why it might not be obvious at first. At least one person (sorry, I don't readily have a name) raised a possible reinventing-the-wheel flag over the proposed pack format versus Git's pack file. Git's pack *is* the repository. Technically, a repository needs to at least emulate the behavior of deletes, but deletes certainly aren't a performance issue for it, because repositories almost never support arbitrary deletions (that is, history rewriting by removing historical versions of files). Git has some support for history rewriting (typically via rebase), but even then it can defer housekeeping cleanup to git-gc, which is slow and by definition an offline operation. Compare this with our case, where we keep only the latest revision known to us and all files are subject to updates/modification and deletion upon invoking SVN commands. Whatever our pristine storage is, it must do these rather swiftly. (I know I'm oversimplifying, but I trust I'm not misrepresenting.)

If we find some way around this requirement (say, by doing things "quick'n'dirty" and deferring cleanups), then we can probably reuse one of the numerous archive formats, be it Git's pack or any other. But that comes at a cost and, unlike Git, we don't have any advantages to offer in return. Git can keep deleted items until git-gc is invoked; should we support something similar, we would need to be consistent and probably support arbitrary revision history, which is out of scope. Sqlite (which internally uses a b-tree pointing to fixed-size pages that overflow using linked lists) is designed for fast additions/modifications/deletions of typically tiny data (a row is reasonably assumed to be -much- smaller than a page in most cases) and *without* promising a compact footprint, which we dearly care about. We will be doing the same on KBytes' worth of data for each entry. This is something that we must certainly research more with actual data. However, in my mind our use-case is quite different from what Sqlite is designed to do best, which is why I'm suggesting we do some benchmarking if we go with Sqlite.
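
To make the benchmarking suggestion concrete, here is a rough sketch of the kind of measurement I have in mind. It is only an illustration under stated assumptions, not a proposed implementation: the table name `pristine`, the function `benchmark_pristine_blobs`, and the 8 KB blob size are all made up for the example, and random data stands in for real pristine files. It stores compressed blobs larger than a page, deletes every other one, and compares the database file size before and after an explicit VACUUM.

```python
import os
import sqlite3
import tempfile
import time
import zlib


def benchmark_pristine_blobs(db_path, n_files=200, blob_size=8192):
    """Insert n_files blobs, delete every other one, and report timings and file sizes.

    Hypothetical helper for illustration only; the schema is an assumption.
    """
    conn = sqlite3.connect(db_path)
    # Be explicit: without auto-vacuum, pages freed by DELETE stay in the file.
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY, data BLOB)")

    # Random bytes barely compress, so each row is ~blob_size and, being
    # larger than a database page, spills onto overflow pages.
    rows = [(f"file-{i:06d}", zlib.compress(os.urandom(blob_size)))
            for i in range(n_files)]

    start = time.perf_counter()
    with conn:  # one transaction for all inserts
        conn.executemany("INSERT INTO pristine (checksum, data) VALUES (?, ?)", rows)
    insert_seconds = time.perf_counter() - start
    size_full = os.path.getsize(db_path)

    doomed = [(key,) for key, _ in rows[::2]]  # every other entry
    start = time.perf_counter()
    with conn:
        conn.executemany("DELETE FROM pristine WHERE checksum = ?", doomed)
    delete_seconds = time.perf_counter() - start
    size_after_delete = os.path.getsize(db_path)

    # VACUUM rebuilds the file and is the only point where space is returned.
    start = time.perf_counter()
    conn.execute("VACUUM")
    vacuum_seconds = time.perf_counter() - start
    size_after_vacuum = os.path.getsize(db_path)

    remaining = conn.execute("SELECT COUNT(*) FROM pristine").fetchone()[0]
    conn.close()
    return {
        "insert_seconds": insert_seconds,
        "delete_seconds": delete_seconds,
        "vacuum_seconds": vacuum_seconds,
        "size_full": size_full,
        "size_after_delete": size_after_delete,
        "size_after_vacuum": size_after_vacuum,
        "remaining": remaining,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = benchmark_pristine_blobs(os.path.join(tmp, "pristine.db"))
        for name, value in result.items():
            print(f"{name}: {value}")
```

The numbers that matter for us would be the delete time and the gap between `size_after_delete` and `size_after_vacuum`: the deletes themselves may well be fast, but the freed pages are only handed back by a separate cleanup pass, which is the deferred-housekeeping trade-off described above. Real measurements would of course need actual working-copy data rather than random bytes.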

I just wanted to make this clear, to be sure we're not talking at cross purposes at this point.

>
> I haven't yet spent enough time thinking about compressed pristines
> to have an opinion about which approach we should take. But I'm very
> much looking forward to seeing a patch that implements a first step
> or first milestone of whichever approach we're going to settle on.
>
> I'm glad to see somebody driving this feature forward. Thanks!
>

Thanks Stefan! I'm excited about the prospects myself.

-Ash
Received on 2012-04-01 01:23:49 CEST
