Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

From: Peter N. Lundblad <peter_at_famlundblad.se>
Date: 2006-05-08 15:42:23 CEST

Hi,

As this was posted here, I reply on the list. I know there are other
applications for this as well. I hope all applicants will be able to
benefit from this information. (Also, note that what I say may not be
the consensus of the project - I'm only one member.)

In short, I think the proposal is a good starting point, but there are
things needing more thought or reconsideration.

Qi Fred writes:
> The following features are planned to be implemented:
>
> - By setting options in the runtime configuration files, users can
> (a) switch between using original and compressed text bases, and

I assume these options will determine which method gets used when
checking out? Do you imagine the user being able to switch existing
working copies?

> (b) enable or disable caching large binary files.
>
> - By specifying a special property on a certain file, one of the
> three caching mechanisms can be chosen: original, compressed, and
> excluded (caching disabled). Note that the text bases can be
> excluded on client side only if the file is a binary one.
>
Do you propose to use versioned properties for this? I'd say this
should only be a client-side option.

Why limit optional text bases to binary files? Many small files also
take up much disk space on many filesystems.
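
To put a rough number on the small-file point, here is a minimal sketch.
It assumes a filesystem that allocates whole 4096-byte blocks with no
tail packing; the block size and the function names are illustrative
assumptions, not anything Subversion defines.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes occupied on disk when space is handed out in whole blocks.

    Assumption: no tail packing, and an empty file uses no data blocks.
    """
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return blocks * block_size


def text_base_footprint(file_sizes, block_size: int = 4096) -> int:
    """Total disk space taken by one text base per file."""
    return sum(allocated_bytes(size, block_size) for size in file_sizes)


# A thousand 100-byte files hold 100 000 bytes of content, but their
# text bases occupy a thousand full blocks under the assumption above.
footprint = text_base_footprint([100] * 1000)
```

Under these assumptions the text bases of many small files can cost far
more disk space than their contents suggest.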

> But disabling the caching of text bases changes the work model of
> Subversion because comparison (diff) and generation of deltas depend
> directly on text bases.

Note that you don't strictly need the text base to generate a text
delta; it would just be a delta containing only new data, which is
effectively a compressed fulltext. Nothing says that a delta sent to
the server must be minimal.
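
A minimal sketch of that idea, using a toy instruction format with
"copy from source" and "new data" operations. This is not the svndiff
format; the tuple layout and function names are assumptions made up
for illustration.

```python
import zlib


def fulltext_delta(target: bytes) -> list:
    """Build a delta without looking at any source (text base).

    The delta has a single 'new data' instruction carrying the whole
    target, compressed. It is valid, just not minimal.
    """
    return [("new", zlib.compress(target))]


def apply_delta(source: bytes, delta: list) -> bytes:
    """Apply a toy delta: ('copy', offset, length) or ('new', payload)."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        elif op[0] == "new":
            out += zlib.decompress(op[1])
        else:
            raise ValueError("unknown delta instruction: %r" % (op[0],))
    return bytes(out)


target = b"some file contents\n" * 50
# The receiver reconstructs the target no matter what source it holds,
# because the delta never refers to the source.
rebuilt = apply_delta(b"whatever the old revision was", fulltext_delta(target))
```

The receiving side applies such a delta the same way as any other; it
only costs more bytes on the wire than a delta with copy instructions.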

> If a file without a cached text base has been modified and is to
> be committed, there are three (or more) potential working cycles:
>
> 1) abort and warn the user

That's not good. This makes the feature pretty useless except for
read-only working copies...

> 2) temporarily download the base revision
>
You could just as well send a fulltext delta to the server.

> 3) make Subversion work without cached text bases
> - split large binary files into small blocks, for example, 32KB
> - stores locally the very short message digests of all blocks
> - detect changes by comparing digests of corresponding blocks
> - send only the changed blocks to the server or request and
> download only the changed blocks to the client.
> - generate deltas and commit changes (on server or client side).

What happens when someone inserts one byte near the beginning of the
file? Every later block shifts, so every digest changes. We need an
rsync-like algorithm if we want to do this. I think this is an
optional optimization. People will need to trade disk usage (storing
text bases) against network usage.
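
The one-byte-insert problem can be shown with a small sketch. It
compares fixed-offset block digests with a brute-force sliding-window
search for old blocks at any offset. The sliding search rehashes every
window, which is the step rsync makes cheap with a rolling checksum;
that optimization is left out here. The 8-byte block size and all
names are assumptions chosen to keep the demo small.

```python
import hashlib

BLOCK = 8  # tiny block size for the demo; the proposal suggests 32KB


def block_digests(data: bytes, block: int = BLOCK) -> list:
    """MD5 digest of each fixed-offset block (last one may be short)."""
    return [hashlib.md5(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def changed_fixed_blocks(old: bytes, new: bytes, block: int = BLOCK) -> int:
    """Count block positions whose digests differ or exist on one side only."""
    a, b = block_digests(old, block), block_digests(new, block)
    same = sum(1 for x, y in zip(a, b) if x == y)
    return max(len(a), len(b)) - same


def bytes_matched_sliding(old: bytes, new: bytes, block: int = BLOCK) -> int:
    """Greedy scan of `new`, matching full-size old blocks at any offset."""
    known = {hashlib.md5(old[i:i + block]).digest()
             for i in range(0, len(old) - block + 1, block)}
    pos = matched = 0
    while pos + block <= len(new):
        if hashlib.md5(new[pos:pos + block]).digest() in known:
            matched += block   # reuse an old block, skip past it
            pos += block
        else:
            pos += 1           # no match here, slide by one byte
    return matched


old = bytes(range(64))   # eight distinct 8-byte blocks
new = b"\xff" + old      # one byte inserted at the very beginning
fixed_changed = changed_fixed_blocks(old, new)
sliding_matched = bytes_matched_sliding(old, new)
```

With fixed offsets every block of the shifted file looks changed, while
the sliding search still finds all 64 original bytes.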

> All the above working cycles solve the problem introduced by disabling
> caching text bases. The first one can be easily implemented, but
> introduces inconvenient manual operations. The latter two cycles
> require modifications in both the client and server sides. The
> problem of the second one is the heavy load of transmission during a
> commit. Since the contents of large files change seldom, the second
> cycle is feasible. The third one concerns the collision of message
> digest algorithms. There is a report that different contents give
> same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
> collisions have not been found in SHA-1 algorithm. Some
> investigations should be done to avoid collisions. I prefer to
> implement the third working model.
>
I'm no expert in this area, but I'm pretty sure the collisions concern
the cryptographic uses of MD5, so I don't think we need to worry about
that. Others may want to comment here.

> According to these discussions, I suggest to add a section of
> runtime configuration options and a special property to manage text
> bases.
>
> ** Runtime Configurations for text-base Management
>
> I suggest to add a new section, 'text-base', to the set of options
> of runtime configuration. This section provides options of text
> bases management on the client side:
>
> - compressed: This is a binary option (yes/no). This instructs
> Subversion client to cache compressed or original text bases. Set
> this to 'yes' to enable caching text bases in compressed format.
>
> - exclude-large-bins: This is a binary switch (yes/no). Set this
> variable to 'yes' if the user wants Subversion to disable caching
> large binary files automatically. Whether the file is large or not
> is determined by comparing its size with a threshold that is
> specified by the variable 'exclusion-threshold'.
>
> - exclusion-threshold: This option should be a positive number. Its
> value describes whether a binary file is large enough to turn off
> the caching of its corresponding text-base. The suggested default
> value is 512KB.

The two options above could be combined into one. Please keep the
number of user options low.
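
One way the combined option could look, as a sketch only. The section
name 'text-base' comes from the proposal; the option name
'exclude-larger-than', the size suffixes, and the rule "unset means
exclusion is off" are hypothetical and exist nowhere in Subversion.

```python
import configparser

UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Parse '512KB', '2MB' or a plain byte count into bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


def exclusion_threshold(config_text: str):
    """Return the threshold in bytes, or None if exclusion is disabled.

    A single hypothetical option replaces the yes/no switch plus the
    separate threshold: leaving it unset turns the feature off.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("text-base", "exclude-larger-than", fallback="")
    return parse_size(raw) if raw.strip() else None


def keep_text_base(file_size: int, threshold) -> bool:
    """True if a text base should be cached for a file of this size."""
    return threshold is None or file_size <= threshold


threshold = exclusion_threshold("[text-base]\nexclude-larger-than = 512KB\n")
```

With this shape the user has one knob to understand instead of two that
must agree with each other.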

> - digest-block-size: This variable specifies the size of blocks the
> binary files will be split into. This option should be a positive
> number and its default value is suggested to be 32KB.

Drop this. Who will know how to tweak this? (Uh, and the method
doesn't work anyway :-)

> ** Special Property for text-base Management
>
> A special property, 'svn:text-base', is suggested to be added. This
> property indicates the way Subversion stores the text base of
> the corresponding file. Its value can be one of the following:

As I said above, this shouldn't be versioned. You may need to extend
the .svn/entries file, though.

A problem with the user interface sketched is that there is no way to
specify the textbase handling per working copy, but only per user.
Say one repository is on your LAN and another is in China (I live in
Sweden:-).
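
A sketch of the lookup this would need: a per-working-copy setting, if
present, overrides the per-user default. The three mode names come from
the proposal; where the working-copy value would be stored and the
function name are assumptions.

```python
from typing import Optional

MODES = ("original", "compressed", "excluded")


def effective_mode(user_default: str, wc_override: Optional[str] = None) -> str:
    """Pick the text-base mode for one working copy.

    wc_override is the hypothetical per-working-copy setting; None
    means the working copy does not specify one.
    """
    mode = wc_override if wc_override is not None else user_default
    if mode not in MODES:
        raise ValueError("unknown text-base mode: %r" % (mode,))
    return mode


# A LAN repository can drop text bases while a distant one keeps them.
lan_wc = effective_mode("compressed", wc_override="excluded")
remote_wc = effective_mode("compressed")
```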

Regards,
//Peter

Received on Mon May 8 15:43:45 2006