
Re: RFC: Delta indexing and composition

From: Branko Čibej <brane_at_xbc.nu>
Date: 2001-07-30 23:13:53 CEST

kfogel@collab.net wrote:

>I presume the OFFSET in "(OFFSET WINDOW)" is an offset into the
>reconstructed fulltext, and that the series of OFFSETs seen in
>
> DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
>
>must be in increasing order, so that there is a reasonably efficient
>way to track down the window(s) one needs for a given range? Might it
>not be better to give the full range reconstructed by each window, as
>in
>
> DELTA ::= (("delta" FLAG ...) (OFFSET LEN WINDOW) ...) ;
>
There's the SIZE field in the DIFF production ...

>or even
>
> DELTA ::= (("delta" FLAG ...) (RANGE-START RANGE-END WINDOW) ...) ;
>
>That way, you'd be able to find out the total size of a reconstructed
>text without actually having to reconstruct it. Well, perhaps if
>that's the problem, it would be simpler just to stick with your
>original scheme and add one more piece of data:
>
> DELTA ::= (("delta" FLAG ...) RECONSTRUCTED-SIZE (OFFSET WINDOW) ...) ;
>
... but yes, the whole idea of this is to have an ordered sequence of
windows that you can do binary searches on, keyed by offsets into the
reconstructed plaintext.
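
To make the lookup concrete, here is a minimal sketch in Python. It is
not Subversion code: the function name, the flat list of offsets, and
the separately supplied total size are assumptions for illustration. It
only shows how a sorted list of window start offsets lets you find the
windows covering a byte range with two binary searches.

```python
from bisect import bisect_right

def windows_for_range(offsets, total_size, start, length):
    """Return indices of the windows overlapping [start, start + length).

    offsets    -- window start offsets into the reconstructed text,
                  strictly increasing, with offsets[0] == 0 (assumed)
    total_size -- size of the whole reconstructed text (assumed known)
    """
    if length <= 0 or start < 0 or start >= total_size:
        return []
    end = min(start + length, total_size)
    # Last window whose start offset is <= the first requested byte.
    first = bisect_right(offsets, start) - 1
    # Last window whose start offset is <= the last requested byte.
    last = bisect_right(offsets, end - 1) - 1
    return list(range(first, last + 1))
```

With windows starting at 0, 100, 250 and 400 in a 500-byte text, a
request for 200 bytes at offset 120 touches only the second and third
windows; the others never need to be read or applied.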

>Anyway:
>
>The main concern is implementation schedule right now, of course.
>Branko, it would be awesome if you could make this change, but would
>you have time to do it before August 15th?
>
Oh, no. This is strictly post-M3 stuff. It doesn't add any
functionality, it's only a performance improvement.

> It's okay if you don't;
>just check in your design to the notes/ directory somewhere, and Mike
>and/or I will get to it.
>
Will do that.

>Semi-related issue:
>
>The other thing I've been thinking about is planning ahead for storing
>large files outside the repository db tables... or, perhaps, storing
>all "real" data outside the tables? That is, perhaps we should
>reserve db tables for metadata, and store file contents (and prop
>lists? and dir entries lists?) out in the fs, using the usual sort of
>directory/file hash scheme that one works up for such circumstances.
>
>So instead of a `strings' table, there would be a new subdir in the
>repository, "strings/" or whatever, and everything that now lives in
>the `strings' table would live there instead. Or, alternatively, file
>contents would live out there, but dir entries and prop lists might
>still live in a `strings' table...
>
>The motivation for this is the presumed greater efficiency of reaching
>far into a file in the filesystem as opposed to a value in a Berkeley
>DB table. Does someone here know these tradeoffs fairly well?
>
The Sleepycat home page only mentions database and object size limits,
but doesn't say anything about performance. I think we should ask them.

My gut feeling is that accessing the data in the database should in
general be faster than going through your common-or-garden filesystem,
and that supporting external storage matters more to administrators
than to users.

>Note that this stuff doesn't necessarily need to be finalized before
>M3. We can change the repository format even after we start
>self-hosting, although it might be a mild pain to do so if it means
>exporting and re-importing old data (or losing history) when we
>upgrade. Anyway, the point is we shouldn't put off self-hosting even
>if we think the storage layout might change.
>
>Thoughts?,
>

We'll definitely have to think about this if we want to support large
(i.e., several GB + years of history) repositories, distributed over
several filesystems. Every repository-like database I've seen does
things this way.

We'll also have to invent a way to expunge old revisions and metadata
from the repository to secondary storage, or for archiving; but the two
aren't strictly related.

In any case, our schema is flexible enough that we can implement
out-of-DB storage on top of it (e.g., by extending the semantics of
string keys to include external references). But personally I'd like to
keep the option of having the whole thing in a single database, because
it's much handier for smaller repositories.
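
As a rough illustration of what "extending the semantics of string
keys" could look like, here is a small Python sketch. Everything in it
is hypothetical: the "ext:" prefix, the dict standing in for the
`strings' table, and the function name are made up for the example and
are not part of the actual schema.

```python
import os

EXTERNAL_PREFIX = "ext:"  # hypothetical marker for out-of-DB strings

def read_string(key, table, external_root, offset=0, length=None):
    """Read part of a string, from the table or from an external file.

    key           -- string key; "ext:<relative path>" means external
    table         -- mapping of key -> bytes (stand-in for the DB table)
    external_root -- directory holding externally stored strings
    """
    if key.startswith(EXTERNAL_PREFIX):
        rel_path = key[len(EXTERNAL_PREFIX):]
        path = os.path.join(external_root, rel_path)
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read() if length is None else f.read(length)
    # Ordinary key: the whole value lives in the table.
    data = table[key]
    end = None if length is None else offset + length
    return data[offset:end]
```

Callers see the same interface either way, so a small repository can
keep every key in the table while a large one moves selected strings
out. A real implementation would also have to validate the path and
keep the external files consistent with database transactions.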

-- 
Brane Čibej
    home:    <brane_at_xbc.nu>             http://www.xbc.nu/brane/
    work:    <branko.cibej_at_hermes.si>   http://www.hermes-softlab.com/
      ACM:   <brane_at_acm.org>            http://www.acm.org/
Received on Sat Oct 21 14:36:33 2006

This is an archived mail posted to the Subversion Dev mailing list.
