
Re: Optional/compressed text bases

From: Ron <lists_at_rzweb.com>
Date: 2006-05-08 18:00:42 CEST

> > All the above working cycles solve the problem introduced by
> > disabling the caching of text bases. The first one can be easily
> > implemented, but introduces inconvenient manual operations. The
> > latter two cycles require modifications on both the client and
> > server sides. The problem with the second one is the heavy
> > transmission load during a commit. Since the contents of large
> > files seldom change, the second cycle is feasible. The third one
> > raises the concern of collisions in message digest algorithms.
> > There is a report that different contents can give the same MD5
> > digest (http://eprint.iacr.org/2004/199.pdf). But collisions have
> > not been found in the SHA-1 algorithm. Some investigation should be
> > done to avoid collisions. I prefer to implement the third working
> > model.
> >
> I'm no expert in this area, but I'm pretty sure the collisions concern
> the cryptographic uses of MD5, so I don't think we need to worry about
> that. Others may want to comment here.

I wish I could remember the link, but I read about computing the MD5 of
the file forward and the MD5 of the file backwards. The two MD5 values,
together with the size of the data, gave a chance of collision so small
as to be (almost) impossible. Subversion could also add the modification
time to reduce it even further.
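
A minimal sketch of the idea recalled above, assuming the whole file fits in memory. The function name `file_fingerprint` is hypothetical and is not anything in Subversion. It only shows how the three values would be combined and makes no claim about the actual collision odds.

```python
import hashlib

def file_fingerprint(data: bytes):
    """Return (size, MD5 of the bytes, MD5 of the reversed bytes)."""
    forward = hashlib.md5(data).hexdigest()
    backward = hashlib.md5(data[::-1]).hexdigest()  # same bytes, last to first
    return (len(data), forward, backward)

def probably_unchanged(old_fingerprint, data: bytes) -> bool:
    """True if the data matches a previously stored fingerprint."""
    return file_fingerprint(data) == old_fingerprint

stored = file_fingerprint(b"some binary content")
print(probably_unchanged(stored, b"some binary content"))
print(probably_unchanged(stored, b"some binary c0ntent"))
```

Two files would have to match in size and in both digests to be mistaken for each other under this scheme.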

Maybe (almost) impossible isn't good enough, but it was 1 in the
billions of trillions if I remember correctly. I am far from an expert
in this area, so this may be common knowledge/debunked.

I would love to see Subversion store some kind of hash rather than the
full file. I work on projects with many many gigabytes of binary data
and hate to have my entire project stored twice.

Ron

Peter N. Lundblad wrote:
> Hi,
>
> As this was posted here, I reply on the list. I know there are other
> applications for this as well. I hope all applicants will be able to
> benefit from this information. (Also, note that what I say may not be
> the consensus of the project - I'm only one member.)
>
> In short, I think the proposal is a good starting point, but there are
> things needing more thought or reconsideration.
>
> Qi Fred writes:
> > The following features are planned to be implemented:
> >
> > - By setting options in the runtime configuration files, users can
> > (a) switch between using original and compressed text bases, and
>
> I assume these options will determine which method gets used when
> checking out? Do you imagine the user being able to switch existing
> working copies?
>
> > (b) enable or disable caching large binary files.
> >
> > - By specifying a special property on a certain file, one of the
> > three caching mechanisms can be chosen: original, compressed, and
> > excluded (caching disabled). Note that the text bases can be
> > excluded on client side only if the file is a binary one.
> >
> Do you propose to use versioned properties for this? I'd say this
> should only be a client-side option.
>
> Why limit optional text bases to binary files? Many small files also
> take up much disk space on many filesystems.
>
> > But disabling the caching of text bases changes the work model of
> > Subversion because comparison (diff) and generation of deltas depend
> > directly on text bases.
>
> Note that you don't strictly need the text base to generate a text
> delta, it would just be a delta containing only new data, making
> effectively a compressed fulltext. There is nothing saying that a
> delta sent to the server must be minimal.
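
A rough sketch of the point above, using a made-up delta representation rather than Subversion's real svndiff format: a delta is a list of "copy from base" and "new data" instructions, and a delta made only of new data can be applied with no text base at all. The names `apply_delta` and `fulltext_delta` are hypothetical.

```python
def apply_delta(ops, base=b""):
    """Rebuild the target from a list of ("copy", offset, length) and
    ("new", data) instructions. Only "copy" needs the base."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

def fulltext_delta(target):
    """A valid but non-minimal delta: the whole target as new data."""
    return [("new", target)]

base = b"hello world"
target = b"hello brave world"

# A minimal delta reuses the base; it cannot be built without it.
minimal = [("copy", 0, 6), ("new", b"brave "), ("copy", 6, 5)]
print(apply_delta(minimal, base) == target)

# The fulltext delta is larger, but needs no base on either side.
print(apply_delta(fulltext_delta(target)) == target)
```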
>
> > If a file without a cached text base has been modified and is about
> > to be committed, there are three (or more) potential working cycles:
> >
> > 1) abort and warn the user
>
> That's not good. This makes the feature pretty useless except for
> read-only working copies...
>
> > 2) temporarily download the base revision
> >
> Could as well send a fulltext delta to the server.
>
> > 3) make Subversion work without cached text bases
> > - split large binary files into small blocks, for example, 32KB
> > - store the very short message digests of all blocks locally
> > - detect changes by comparing digests of corresponding blocks
> > - send only the changed blocks to the server or request and
> > download only the changed blocks to the client.
> > - generate deltas and commit changes (on server or client side).
>
> What happens when someone inserts one byte near the beginning of the
> file? We need an rsync-like algorithm if we want to do this. I think
> this is an optional optimization. People will need to trade disk
> usage (storing text bases) versus network usage.
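
The one-byte insertion problem raised above can be shown with a small sketch. It assumes fixed 32KB blocks compared position by position, as in the proposal; the helper names are hypothetical and the data is pseudo-random test input.

```python
import hashlib
import random

BLOCK_SIZE = 32 * 1024  # 32KB, the block size suggested in the proposal

def block_digests(data, block_size=BLOCK_SIZE):
    """MD5 digest of each fixed-size block of the data."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old, new):
    """Indices of blocks whose digests differ at the same position."""
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Blocks that exist in only one of the two lists count as changed.
    changed.extend(range(min(len(old), len(new)), max(len(old), len(new))))
    return changed

n = 8 * BLOCK_SIZE
original = random.Random(0).getrandbits(8 * n).to_bytes(n, "big")

# Overwrite one byte in place: only the block containing it changes.
overwritten = original[:10] + bytes([original[10] ^ 0xFF]) + original[11:]
# Insert one byte at the front: every later byte shifts by one.
inserted = b"\x00" + original

base = block_digests(original)
print(changed_blocks(base, block_digests(overwritten)))    # one block
print(len(changed_blocks(base, block_digests(inserted))))  # every block
```

With position-based blocks the insertion makes every block look changed, so nothing is saved over sending the full text. An rsync-like rolling checksum avoids this by matching blocks at arbitrary offsets.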
>
> > All the above working cycles solve the problem introduced by
> > disabling the caching of text bases. The first one can be easily
> > implemented, but introduces inconvenient manual operations. The
> > latter two cycles require modifications on both the client and
> > server sides. The problem with the second one is the heavy
> > transmission load during a commit. Since the contents of large
> > files seldom change, the second cycle is feasible. The third one
> > raises the concern of collisions in message digest algorithms.
> > There is a report that different contents can give the same MD5
> > digest (http://eprint.iacr.org/2004/199.pdf). But collisions have
> > not been found in the SHA-1 algorithm. Some investigation should be
> > done to avoid collisions. I prefer to implement the third working
> > model.
> >
> I'm no expert in this area, but I'm pretty sure the collisions concern
> the cryptographic uses of MD5, so I don't think we need to worry about
> that. Others may want to comment here.
>
> > Based on this discussion, I suggest adding a section of runtime
> > configuration options and a special property to manage text
> > bases.
> >
> > ** Runtime Configurations for text-base Management
> >
> > I suggest adding a new section, 'text-base', to the set of runtime
> > configuration options. This section provides options for text-base
> > management on the client side:
> >
> > - compressed: This is a binary option (yes/no). It instructs the
> > Subversion client to cache compressed or original text bases. Set
> > this to 'yes' to enable caching text bases in compressed format.
> >
> > - exclude-large-bins: This is a binary switch (yes/no). Set this
> > variable to 'yes' if the user wants Subversion to disable caching
> > large binary files automatically. Whether the file is large or not
> > is determined by comparing its size with a threshold that is
> > specified by the variable 'exclusion-threshold'.
> >
> > - exclusion-threshold: This option should be a positive number. Its
> > value determines whether a binary file is large enough to turn off
> > the caching of its corresponding text base. The suggested default
> > value is 512KB.
>
> The two options above could be combined into one. Please keep the
> number of user options low.
>
> > - digest-block-size: This variable specifies the size of blocks the
> > binary files will be split into. This option should be a positive
> > number and its default value is suggested to be 32KB.
>
> Drop this. Who will know how to tweak this? (Uh, and the method
> doesn't work anyway :-)
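
For illustration only, here is how the proposed 'text-base' section might look and be interpreted. None of these options exist in Subversion; they are taken from the proposal quoted above (without 'digest-block-size'). Python's configparser stands in for Subversion's own INI-style config reader, and treating "large" as "at or above the threshold" is an assumption.

```python
import configparser

# Hypothetical section of the runtime config file, as proposed above.
SAMPLE = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 512KB
"""

def parse_size(text):
    """Parse a size such as '512KB' or a plain byte count."""
    text = text.strip().upper()
    if text.endswith("KB"):
        return int(text[:-2]) * 1024
    return int(text)

def should_cache_text_base(cfg, is_binary, size):
    """Decide whether a file's text base is cached under the proposal."""
    section = cfg["text-base"]
    if not section.getboolean("exclude-large-bins", fallback=False):
        return True
    threshold = parse_size(section.get("exclusion-threshold", fallback="512KB"))
    # Assumption: only binary files at or above the threshold are excluded.
    return not (is_binary and size >= threshold)

cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE)

print(should_cache_text_base(cfg, is_binary=True, size=600 * 1024))
print(should_cache_text_base(cfg, is_binary=True, size=100 * 1024))
print(should_cache_text_base(cfg, is_binary=False, size=600 * 1024))
```

Combining the two options, as suggested, could mean treating a missing or zero threshold as "exclusion off", so that 'exclude-large-bins' is no longer needed.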
>
> > ** Special Property for text-base Management
> >
> > I suggest adding a special property, 'svn:text-base'. This
> > property indicates the way Subversion stores the text base of the
> > corresponding file. Its value can be one of the following:
>
> As I said above, this shouldn't be versioned. You may need to extend
> the .svn/entries file, though.
>
>
> A problem with the user interface sketched is that there is no way to
> specify the textbase handling per working copy, but only per user.
> Say one repository is on your LAN and another is in China (I live in
> Sweden:-).
>
> Regards,
> //Peter
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Mon May 8 18:01:31 2006

This is an archived mail posted to the Subversion Dev mailing list.
