
Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

From: Qi Fred <fred.qi_at_gmail.com>
Date: 2006-05-08 17:43:56 CEST

On 5/8/06, Peter N. Lundblad <peter@famlundblad.se> wrote:

> Hi,
>
> As this was posted here, I reply on the list. I know there are other
> applications for this as well. I hope all applicants will be able to
> benefit from this information. (Also, note that what I say may not be
> the consensus of the project - I'm only one member.)
>
> In short, I think the proposal is a good starting point, but there are
> things needing more thought or reconsideration.
>
> Qi Fred writes:
> > The following features are planned to be implemented:
> >
> > - By setting options in the runtime configuration files, users can
> > (a) switch between using original and compressed text bases, and
>
> I assume these options will determine which method gets used when
> checking out? Do you imagine the user being able to switch existing
> working copies?

Sure. Some users will create their initial checkouts with an svn client that
supports compressed text bases, but others will not. Switching means a user
can upgrade the client smoothly without redoing the checkout. Another problem
is that decompression can be very time-consuming, so some users will prefer
to work with original text bases. A mechanism that supports switching is
therefore necessary.
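
To make the trade-off concrete, here is a minimal sketch of storing a text
base in either form and switching an existing one in place. It assumes zlib as
the compression format (the proposal does not name one), and the helper names
are hypothetical; it is not how Subversion stores text bases.

```python
import zlib


def store_text_base(contents: bytes, compressed: bool) -> bytes:
    """Return the bytes that would be written to disk for a text base."""
    return zlib.compress(contents) if compressed else contents


def load_text_base(stored: bytes, compressed: bool) -> bytes:
    """Return the pristine contents, decompressing if necessary."""
    return zlib.decompress(stored) if compressed else stored


def switch_text_base(stored: bytes, was_compressed: bool, now_compressed: bool) -> bytes:
    """Re-encode an existing text base without a fresh checkout."""
    pristine = load_text_base(stored, was_compressed)
    return store_text_base(pristine, now_compressed)


# A repetitive source file compresses well; the cost is a decompression
# step on every operation that needs the pristine copy.
pristine = b"int main(void) { return 0; }\n" * 200
on_disk = store_text_base(pristine, compressed=True)
restored = switch_text_base(on_disk, was_compressed=True, now_compressed=False)
```

Switching only rewrites the local copy, so no network access is needed in this
sketch.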

> > (b) enable or disable caching large binary files.
> >
> > - By specifying a special property on a certain file, one of the
> > three caching mechanisms can be chosen: original, compressed, and
> > excluded (caching disabled). Note that the text bases can be
> > excluded on the client side only if the file is a binary one.
> >
> Do you propose to use versioned properties for this? I'd say this
> should only be a client-side option.
>
> Why limit optional text bases to binary files? Many small files also
> take up much disk space on many filesystems.
>
> > But disabling the caching of text bases changes the work model of
> > Subversion because comparison (diff) and generation of deltas depend
> > directly on text bases.
>
> Note that you don't strictly need the text base to generate a text
> delta, it would just be a delta containing only new data, making
> effectively a compressed fulltext. There is nothing saying that a
> delta sent to the server must be minimal.

Thanks, this is very useful. I am not clear on how deltas are generated or
whether minimal deltas are used in commits. Do you mean that we need not
modify any server code if the client sends the full text as a delta in a
commit?
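
As I understand the point, a delta is a list of instructions, and one that
contains only "new data" instructions never reads the base at all. The sketch
below illustrates this with a simplified, hypothetical instruction set; it is
not Subversion's actual svndiff format or API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class CopyFromSource:
    """Copy `length` bytes starting at `offset` in the base text."""
    offset: int
    length: int


@dataclass
class NewData:
    """Insert literal bytes carried inside the delta itself."""
    data: bytes


Op = Union[CopyFromSource, NewData]


def apply_delta(source: bytes, ops: List[Op]) -> bytes:
    """Rebuild the target text from the base text and a list of instructions."""
    out = bytearray()
    for op in ops:
        if isinstance(op, CopyFromSource):
            out += source[op.offset:op.offset + op.length]
        else:
            out += op.data
    return bytes(out)


def fulltext_delta(target: bytes) -> List[Op]:
    """A valid but non-minimal delta: the whole target sent as new data."""
    return [NewData(target)]


base = b"hello world\n"
target = b"hello brave world\n"

# A small delta needs the base text to know what can be copied.
minimal = [CopyFromSource(0, 6), NewData(b"brave "), CopyFromSource(6, 6)]

# A fulltext delta can be produced without any text base on the client.
full = fulltext_delta(target)
```

Both deltas reconstruct the same target; the fulltext one is simply larger on
the wire.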

> > If a file without a cached text base has been modified and is to
> > be committed, there are three (or more) potential working cycles:
> >
> > 1) abort and warn the user
>
> That's not good. This makes the feature pretty useless except for
> read-only working copies...
>
> > 2) temporarily download the base revision
> >
> Could as well send a fulltext delta to the server.

It would be better if the server accepted compressed deltas.

> > 3) make Subversion work without cached text bases
> > - split large binary files into small blocks, for example, 32KB
> > - store the very short message digests of all blocks locally
> > - detect changes by comparing digests of corresponding blocks
> > - send only the changed blocks to the server or request and
> > download only the changed blocks to the client.
> > - generate deltas and commit changes (on server or client side).
>
> What happens when someone inserts one byte near the beginning of the
> file? We need an rsync-like algorithm if we want to do this. I think
> this is an optional optimization. People will need to trade disk
> usage (storing text bases) versus network usage.

The average performance is better than that of the two previous suggestions.
Optimizing the worst case would be time-consuming, and I am not sure whether
there is enough time within the Summer of Code limits.
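
For reference, the fixed-block scheme and its worst case can be sketched as
follows. The block size is shrunk to 4 bytes so the effect is visible (the
proposal suggests 32KB), SHA-1 is used only as an example digest, and the
function names are hypothetical.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # demo value only; the proposal suggests 32KB


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Digest each fixed-size block of the file."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old_digests: List[bytes], new_data: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Indices of blocks in new_data whose digest differs from the stored one.

    Blocks past the end of the old file count as changed. Blocks that
    disappeared because the file shrank are not reported in this sketch.
    """
    new_digests = block_digests(new_data, block_size)
    return [i for i, digest in enumerate(new_digests)
            if i >= len(old_digests) or old_digests[i] != digest]


old = b"AAAABBBBCCCCDDDD"
stored = block_digests(old)  # what the client would keep instead of a text base

# Overwriting one byte in place touches a single block.
overwrite = changed_blocks(stored, b"AAAAXBBBCCCCDDDD")

# Inserting one byte at the front shifts every block boundary,
# so every block looks changed: the case Peter describes.
insertion = changed_blocks(stored, b"XAAAABBBBCCCCDDDD")
```

An rsync-like rolling checksum would avoid the second case by matching blocks
at arbitrary offsets, at the cost of a more complex implementation.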

> > All the above working cycles solve the problem introduced by disabling
> > the caching of text bases. The first one can be easily implemented, but
> > introduces inconvenient manual operations. The latter two cycles
> > require modifications on both the client and server sides. The
> > problem of the second one is the heavy load of transmission during a
> > commit. Since the contents of large files change seldom, the second
> > cycle is feasible. The third one raises the concern of collisions in
> > message digest algorithms. There is a report that different contents
> > can give the same MD5 digest (http://eprint.iacr.org/2004/199.pdf),
> > but collisions have not been found in the SHA-1 algorithm. Some
> > investigation should be done to avoid collisions. I prefer to
> > implement the third working model.
> >
> I'm no expert in this area, but I'm pretty sure the collisions concern
> the cryptographic uses of MD5, so I don't think we need to worry about
> that. Others may want to comment here.

I would like to use the MD5 algorithm, but there is a risk that some files
would not be committed to the server correctly.

> > According to these discussions, I suggest adding a section of
> > runtime configuration options and a special property to manage text
> > bases.
> >
> > ** Runtime Configurations for text-base Management
> >
> > I suggest adding a new section, 'text-base', to the set of options
> > of runtime configuration. This section provides options for text
> > base management on the client side:
> >
> > - compressed: This is a binary option (yes/no). This instructs the
> > Subversion client to cache compressed or original text bases. Set
> > this to 'yes' to enable caching text bases in compressed format.
> >
> > - exclude-large-bins: This is a binary switch (yes/no). Set this
> > variable to 'yes' if the user wants Subversion to disable caching
> > large binary files automatically. Whether the file is large or not
> > is determined by comparing its size with a threshold
> > specified by the variable 'exclusion-threshold'.
> >
> > - exclusion-threshold: This option should be a positive number. Its
> > value determines whether a binary file is large enough to turn off
> > the caching of its corresponding text base. The suggested default
> > value is 512KB.
>
> The two options above could be combined into one. Please keep the
> number of user options low.
>
> > - digest-block-size: This variable specifies the size of blocks the
> > binary files will be split into. This option should be a positive
> > number and its default value is suggested to be 32KB.
>
> Drop this. Who will know how to tweak this (uh, and the method
> doesn't work anyway:-)

You are right.
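
With 'digest-block-size' dropped, the remaining options could be read along
these lines. This is only a sketch: the '[text-base]' section is the proposed
one and does not exist in any released client, the helper name is
hypothetical, and treating "large" as "at or above the threshold" is my
assumption.

```python
import configparser

# Hypothetical runtime configuration fragment using the proposed option names.
SAMPLE_CONFIG = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 524288
"""

DEFAULT_THRESHOLD = 512 * 1024  # suggested default of 512KB, in bytes


def text_base_mode(config: configparser.ConfigParser,
                   size_bytes: int, is_binary: bool) -> str:
    """Pick 'original', 'compressed', or 'excluded' for one file."""
    exclude = config.getboolean("text-base", "exclude-large-bins", fallback=False)
    threshold = config.getint("text-base", "exclusion-threshold",
                              fallback=DEFAULT_THRESHOLD)
    # Assumption: a file is "large" when its size is at or above the threshold.
    if exclude and is_binary and size_bytes >= threshold:
        return "excluded"
    if config.getboolean("text-base", "compressed", fallback=False):
        return "compressed"
    return "original"


config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

large_binary = text_base_mode(config, 1024 * 1024, is_binary=True)
small_binary = text_base_mode(config, 1024, is_binary=True)
large_text = text_base_mode(config, 1024 * 1024, is_binary=False)
```

Combining the two exclusion options, as Peter suggests, would amount to
treating a missing or zero threshold as "never exclude".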

> > ** Special Property for text-base Management
> >
> > A special property, 'svn:text-base', is suggested to be added. This
> > property indicates the way Subversion stores the text base of
> > the corresponding file. Its value can be one of the following:
>
> As I said above, this shouldn't be versioned. You may need to extend
> the .svn/entries file, though.

This is a good suggestion. Is there a user interface for accessing the
.svn/entries file in the current Subversion client? I think we need a new
command that lets users access this file.

> A problem with the user interface sketched is that there is no way to
> specify the textbase handling per working copy, but only per user.
> Say one repository is on your LAN and another is in China (I live in
> Sweden:-).

That is not the case, since there is a --config-dir option that lets the user
choose a different configuration directory for each invocation.

> Regards,
> //Peter
>

--
Best Regards,
Fred Qi
Received on Mon May 8 17:49:58 2006

This is an archived mail posted to the Subversion Dev mailing list.