Re: FSFS format7 and compressed XML bundles

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Wed, 6 Mar 2013 18:55:55 +0000 (GMT)

Vincent Lefevre wrote:

> On 2013-03-05 16:52:30 +0000, Julian Foad wrote:
>> Vincent Lefevre wrote:
> [about server-side vs client-side]
[...]
> Because the diff between two huge compressed files is generally huge
> (unless some rsync-friendly option has been applied, when available).
> So, if the client doesn't uncompress the data for the server, it will
> have to send a huge diff or a huge compressed file, even though the
> diff between the uncompressed data may be small. So, if
> deconstruction/reconstruction is possible (canonical form),
> it is much more efficient to do this on the client side.

Certainly that is true.
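
A rough way to check this claim for a particular compressor is to
change one byte of some sample text and count how much of the new data
cannot be copied, chunk by chunk, from the old data, both before and
after compression. The sketch below assumes zlib's deflate at level 6
and uses a crude rsync-like proxy for delta size. The helper names and
the sample data are hypothetical. The numbers depend on the data and
the compressor, so no particular result is claimed here.

```python
import random
import zlib


def unmatched_bytes(old: bytes, new: bytes, block: int = 16) -> int:
    """Crude delta-size proxy: bytes of `new` not covered by any
    `block`-sized chunk that also occurs somewhere in `old`."""
    known = {old[i:i + block] for i in range(len(old) - block + 1)}
    i = unmatched = 0
    while i < len(new):
        if new[i:i + block] in known:
            i += block      # whole chunk could be copied from `old`
        else:
            i += 1          # this byte would have to be sent literally
            unmatched += 1
    return unmatched


def sample_text(lines: int = 2000, seed: int = 0) -> bytes:
    """Deterministic, compressible filler text (illustrative only)."""
    rng = random.Random(seed)
    words = ["commit", "checkout", "delta", "revision", "branch", "merge"]
    return "\n".join(
        " ".join(rng.choice(words) for _ in range(8)) for _ in range(lines)
    ).encode()


old_plain = sample_text()
mid = len(old_plain) // 2
# Replace a single byte in the middle with one that occurs nowhere else.
new_plain = old_plain[:mid] + b"#" + old_plain[mid + 1:]

old_packed = zlib.compress(old_plain, 6)
new_packed = zlib.compress(new_plain, 6)

plain_cost = unmatched_bytes(old_plain, new_plain)
packed_cost = unmatched_bytes(old_packed, new_packed)
print("plain: ", plain_cost, "of", len(new_plain), "bytes unmatched")
print("packed:", packed_cost, "of", len(new_packed), "bytes unmatched")
```

For the uncompressed pair the unmatched count is bounded by about two
chunk lengths, because only the chunk containing the changed byte and
a short tail fail to match. For the compressed pair there is no such
bound, which is the point of the quoted paragraph.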

>>>> That point _is_ specific to a server-side solution. With a
>>>> client-side solution, the user's word processor may not mind if a
>>>> versioning operation such as a commit (through a decompressing
>>>> plug-in) followed by checkout (through a re-compressing plug-in)
>>>> changes the bit pattern of the compressed file, so long as the
>>>> uncompressed content that it represents is unchanged.
>>>
>>> I disagree.
>>
>> It's not clear what you disagree with.
>
> With the second sentence ("... may not mind ..."), thus with the first
> sentence too.
[...]
>>> The word processor may not mind (in theory, because
>>> in practice, one may have bugs that depend on the bit pattern,
>>> and it would be bad to expose the user to such kind of bugs and
>>> non-deterministic behavior), but for the user this may be important.
>>> For instance, a different bit pattern will break a possible signature
>>> on the compressed file.
>>
>> I agree that it *may* be important for the user, but the users have
>> control so they can use this client-side scheme in scenarios where
>> it works for them and not use it in other scenarios.
>
> But one should need a scheme that will also work in the case where
> users care about the bit pattern of the compressed file.

> Moreover even when the users know that the exact bit pattern of the
> compressed file is not important at some time, this may no longer
> be true in the future. For instance, some current word processor may
> ignore the dates in zip files, but future ones may take them into
> account. So, you need to wonder what data are important in a zip
> file, including undocumented ones used by some implementations (as
> the zip format allows extensions). Taking them into account when it
> appears that these data become meaningful is too late, because such
> data would have already been lost in past versions of the Subversion
> repository.

If you are thinking about a solution that we can apply automatically,
then yes it would need to work in the case where users care about
preserving the bit pattern.

I was thinking about an opt-in system, where the user is in control of
specifying which files get processed in this way. If the user is unsure
whether the non-preservation of bit pattern is going to be important
for their word processor files in the future, they can ask the provider
of their word processor whether this kind of modification is officially
supported. In many cases the answer will be "yes, we explicitly support
that kind of archiving".
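
The concern quoted above about dates in zip files and about signatures
can be made concrete with a small sketch. It builds two zip archives
with identical member content but different stored timestamps, using
Python's standard zipfile module, and compares a digest of each whole
archive. The member name "content.xml", the payload and the helper
names are made up for illustration.

```python
import hashlib
import io
import zipfile


def build_zip(payload: bytes, date_time: tuple) -> bytes:
    """Return a one-member zip archive whose member carries `date_time`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("content.xml", date_time=date_time)
        zf.writestr(info, payload, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def member_bytes(archive: bytes) -> bytes:
    """Return the uncompressed content of the single member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read("content.xml")


payload = b"<doc><p>Hello, Subversion</p></doc>" * 100
first = build_zip(payload, (2013, 3, 5, 16, 52, 30))
second = build_zip(payload, (2013, 3, 6, 18, 55, 54))

same_content = member_bytes(first) == member_bytes(second)
same_digest = hashlib.sha256(first).digest() == hashlib.sha256(second).digest()
print("same uncompressed content:", same_content)   # True
print("same archive digest:      ", same_digest)    # False
```

A tool that stores only the uncompressed member content and rebuilds
the archive on checkout would therefore have to record the timestamps,
and any other metadata it wants to survive, or accept that a signature
over the original archive no longer verifies.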

> On 2013-03-05 17:10:02 +0000, Julian Foad wrote:
[...]
>> Let me take that back. The point that I interpreted as being the
>> most significant impact of what Philip said, namely that the
>> Subversion protocols and system design require reproducible content,
>> is only a problem when done server-side. Other impacts of that same
>> point, such as you mentioned, are applicable no matter whether
>> server-side or client-side.
>
> The Subversion protocols and system design *currently* require
> reproducible content, but if new features are added, e.g. due to the
> fact that the users don't mind about the exact compressed content of
> some file, then it could be decided to change the protocols and the
> requirements (the server could consider some canonical uncompressed
> form as a reference).

Conceivably.

> [...]
>> So my main point is that the server-side expand/compress is a
>> non-starter of an idea, because it violates basic Subversion
>> requirements, whereas client-side is a viable option for some use
>> cases.
>
> I would reject the server-side expand/compress, not because of the
> current requirements (which could be changed to more or less match
> what happens on the client side), but because of performance reasons
> (see my first paragraph of this message).

Interesting thoughts.

The design of a bit-pattern-preserving solution is an interesting
challenge. In general a compression algorithm may have no canonical
form, and its output may not even be deterministically reproducible
using only data that is available in the compressed file; in those
cases I don't see any theoretical solution. However, perhaps some
commonly used compressed formats are found in practice to be in a form
that the compression algorithm can reconstruct, if given a set of
parameters that we are able to extract from the compressed data.

Perhaps it would be possible to design a scheme that scans the data
stream for any such blocks (in one of the compression schemes it has
been designed to recognize, such as 'deflate'), extracts the parameters
necessary for exact recompression, and decompresses and recompresses
at the points where we benefit from diffing the decompressed form.
This could work in theory if we can extract these parameters, and if
diffing the plain text and preserving these parameters is cheaper than
just diffing the compressed data.

I don't know if anything like that would be feasible. It may be
possible in theory but too complex in practice. The parameters we need
to extract would include such things as the Huffman coding tables used
and also parameters that influence deeper implementation details of the
compression algorithm. And of course for each compression algorithm
we'd need an implementation that accepts all of these parameters -- an
off-the-shelf compression library probably wouldn't support this.
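
A much simpler relative of this idea can be sketched for the narrow
case of a zlib stream that was produced by the same zlib build that
will recompress it: decompress, then try a small grid of compressor
settings until one reproduces the original bytes exactly. This is only
an illustration, not something Subversion does. The function names are
hypothetical, and the grid (level, memLevel, three strategies, fixed
window size) is an assumption about which knobs matter.

```python
import itertools
import zlib
from typing import Optional, Tuple

Params = Tuple[int, int, int]  # (level, memLevel, strategy)
STRATEGIES = (zlib.Z_DEFAULT_STRATEGY, zlib.Z_FILTERED, zlib.Z_HUFFMAN_ONLY)


def recompress(plain: bytes, params: Params) -> bytes:
    """Compress `plain` as a zlib stream with the given settings."""
    level, mem_level, strategy = params
    comp = zlib.compressobj(level, zlib.DEFLATED, 15, mem_level, strategy)
    return comp.compress(plain) + comp.flush()


def find_recompression_params(packed: bytes) -> Optional[Params]:
    """Return settings that reproduce `packed` byte for byte, or None
    if no combination in the grid does."""
    plain = zlib.decompress(packed)
    grid = itertools.product(range(10), range(1, 10), STRATEGIES)
    for params in grid:
        if recompress(plain, params) == packed:
            return params
    return None


sample = b"<doc><p>Hello, Subversion</p></doc>\n" * 200
packed = recompress(sample, (9, 8, zlib.Z_DEFAULT_STRATEGY))
found = find_recompression_params(packed)
print("parameters that reproduce the stream:", found)
```

When the search succeeds, storing the plain text plus the small
parameter tuple is enough to rebuild the exact compressed bytes. When
it returns None, for example because the stream came from a different
deflate implementation, the data would have to be stored as-is, which
matches the caveat above that an off-the-shelf library may not expose
enough parameters.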

I was assuming that the only feasible solutions would be opt-in solutions where the user is willing to accept that the bit pattern is not preserved.

- Julian
Received on 2013-03-06 19:56:31 CET
