AW: Compressed Pristines (Custom Format?)

From: Markus Schaber <m.schaber_at_3s-software.com>
Date: Fri, 23 Mar 2012 16:49:45 +0000

Hi,

> -----Original Message-----

> The gist of the above is that if we choose a better-than-gz compression
> and combine the files both *before* and *after* compression, we'll have
> much more significant results than what we have now just with gz on
> individual files. This can be noticed using tar.bz2, for example, where
> the result is not unlike what we can achieve with the custom file format
> (although bz2 is probably too slow for our needs).

Maybe xz (LZMA2) is the algorithm to look at. It usually offers a better trade-off between CPU usage and compression ratio, and decompression is nearly as fast as gz.

> I also like the fact that the pristine files are opaque and don't
> encourage the user to mess with them. Markus raised this point as
> "debuggability". I don't see "debuggability" as a user requirement (it is
> justifiably an SVN dev/maintainer requirement) and I don't find reason to
> add it as one. On the contrary, there are many reasons to suspect the user
> is doing something gravely wrong when they mess with the pristine files.

So maybe a developer tool for (un)packing pristine archives should be created.

> Another point raised by Markus is to store "common pristine" files and
> reuse them to reduce network traffic.

This point is independent of the compression of the pristine store. Both a working-copy-local and a common pristine store can profit from compression.

> Sqlite may be used as Branko has suggested. I'm not opposed to this. It
> has it's shortcomings (not exploiting inter-file similarities which point
> #3 makes, for one) but it can be considered as a compromise between
> individual gz files and the custom pack file. The basic idea would be to
> store "small" files (after compression) in wc.db and have "link" to
> compressed files on disk for "large" files.

Maybe a distinct pristine.db would be better than putting them in wc.db, but I'm not sure about that.

> My main concern is that
> frequent updates to small files will leave the sqlite file with heavy
> external fragmentation (holes within the file unused but consuming disk-
> space).

Usually, sqlite re-uses free space within the same database rather efficiently.

> The solution is to "vacuum" wc.db, but that depends on its size,
> will lock it when vacuuming and other factors, so we can't do it as
> routine.

"svn cleanup" would be a good opportunity.
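The free-space behaviour discussed above can be observed with a small experiment. This is only a sketch using Python's standard sqlite3 module on a throwaway database; the table name `blobs` and the payload size are made up for illustration and have nothing to do with the real wc.db schema.

```python
import os
import sqlite3
import tempfile

def page_stats(conn):
    """Return (total pages in the file, pages currently on the freelist)."""
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return pages, free

with tempfile.TemporaryDirectory() as tmpdir:
    # isolation_level=None means autocommit, which VACUUM requires.
    conn = sqlite3.connect(os.path.join(tmpdir, "demo.db"), isolation_level=None)
    conn.execute("PRAGMA auto_vacuum = NONE")  # freed pages stay in the file
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")

    payload = bytes(4000)  # hypothetical "small pristine"
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO blobs (data) VALUES (?)", [(payload,)] * 200)
    conn.execute("COMMIT")

    # Deleting rows leaves holes: the pages move to the freelist.
    conn.execute("DELETE FROM blobs WHERE id % 2 = 0")
    pages_after_delete, free_after_delete = page_stats(conn)

    # New rows are placed on free pages before the file is extended.
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO blobs (data) VALUES (?)", [(payload,)] * 50)
    conn.execute("COMMIT")
    pages_after_reinsert, free_after_reinsert = page_stats(conn)

    # VACUUM rebuilds the file and drops the remaining free pages.
    conn.execute("VACUUM")
    pages_after_vacuum, free_after_vacuum = page_stats(conn)
    conn.close()
```

Comparing the three pairs of numbers shows the file size staying flat while free pages are reused, and shrinking only after the VACUUM, which is the step that a cleanup command could trigger.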

I just had another idea: we could store the metadata in the SQLite database.

In the wc.db, in the pristine table, store 4 columns[1]: filename, offset, length, algorithm.

"filename" denotes the container file name. Payload files are first concatenated, then compressed, then put into the container. Offset and length are measured in bytes within the decompressed byte stream. "Algorithm" denotes the compression algorithm, with one value reserved for uncompressed storage. If a container grows beyond a specific limit, a new file is created.

The main advantage of storing the metadata in SQLite is that we do not need to invent any new file format.
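A minimal sketch of this layout, under several assumptions: the table is a simplified stand-in for the real pristine table, zlib stands in for whatever algorithm would be chosen, the algorithm codes and the container name `pack-0001` are hypothetical, and a dict replaces the container files on disk.

```python
import sqlite3
import zlib

# Hypothetical algorithm codes; 0 is the value reserved for uncompressed storage.
ALGO_NONE, ALGO_ZLIB = 0, 1

def pack_container(payloads, algorithm):
    """Concatenate first, then compress. Returns (blob, [(offset, length), ...])."""
    spans, offset = [], 0
    for data in payloads:
        spans.append((offset, len(data)))  # positions in the decompressed stream
        offset += len(data)
    raw = b"".join(payloads)
    blob = zlib.compress(raw) if algorithm == ALGO_ZLIB else raw
    return blob, spans

def read_pristine(db, containers, checksum):
    """Look up the metadata row, decompress the container, slice out one file."""
    filename, offset, length, algorithm = db.execute(
        'SELECT filename, "offset", length, algorithm '
        "FROM pristine WHERE checksum = ?",
        (checksum,),
    ).fetchone()
    blob = containers[filename]
    raw = zlib.decompress(blob) if algorithm == ALGO_ZLIB else blob
    return raw[offset:offset + length]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pristine (checksum TEXT PRIMARY KEY, filename TEXT, "
    '"offset" INTEGER, length INTEGER, algorithm INTEGER)'
)

files = {
    "sha1-a": b"#include <stdio.h>\n",
    "sha1-b": b"int main(void) { return 0; }\n",
}
blob, spans = pack_container(list(files.values()), ALGO_ZLIB)
containers = {"pack-0001": blob}  # stands in for a container file on disk
for checksum, (offset, length) in zip(files, spans):
    db.execute(
        "INSERT INTO pristine VALUES (?, ?, ?, ?, ?)",
        (checksum, "pack-0001", offset, length, ALGO_ZLIB),
    )
```

Calling `read_pristine(db, containers, "sha1-b")` returns the second file's bytes. Note that this naive reader decompresses the whole container to extract one file, which is one reason a size limit per container matters.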

Some other positive aspects (some of them are clearly also possible using your original proposal):

- This allows us to apply concatenation and compression orthogonally, on a container-by-container basis:
  - We can handle short, incompressible files just by concatenating them in an uncompressed container.
  - We can handle large, highly compressible files the "each file has its own container" way.

- By reserving a special length value (like -1 or SQL NULL) for "look at the file on disk", we can quickly upgrade existing working copies without touching the pristine files at all.
- This way, we could make the WC upgrade implicit again, as we only add three columns with well-defined default values (offset=0, length=-1, algorithm=uncompressed) to the table.
- "svn cleanup" could grow an option to reorganize / optimize the pristine storage.
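The implicit upgrade described in the points above could look roughly like the following. This is a sketch against a simplified, hypothetical pristine table (the real wc.db schema has more columns), with -1 chosen as the "look at the file on disk" marker and 0 as the code for uncompressed storage.

```python
import sqlite3

ALGO_NONE = 0  # assumed code for "uncompressed"
ON_DISK = -1   # assumed length marker for "look at the file on disk"

# A pre-upgrade working copy with one existing pristine row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY, refcount INTEGER)")
db.execute("INSERT INTO pristine VALUES ('sha1-a', 1)")

def upgrade(db):
    """Add the three columns; existing rows pick up the defaults automatically."""
    db.execute('ALTER TABLE pristine ADD COLUMN "offset" INTEGER NOT NULL DEFAULT 0')
    db.execute("ALTER TABLE pristine ADD COLUMN length INTEGER NOT NULL DEFAULT -1")
    db.execute("ALTER TABLE pristine ADD COLUMN algorithm INTEGER NOT NULL DEFAULT 0")

def storage_kind(db, checksum):
    """Tell a reader whether to open the legacy file or a container."""
    offset, length, algorithm = db.execute(
        'SELECT "offset", length, algorithm FROM pristine WHERE checksum = ?',
        (checksum,),
    ).fetchone()
    return "standalone file" if length == ON_DISK else "container"

upgrade(db)
legacy_row = db.execute(
    'SELECT "offset", length, algorithm FROM pristine WHERE checksum = ?',
    ("sha1-a",),
).fetchone()
```

After the upgrade, the old row reads back as offset 0, length -1, algorithm 0, so `storage_kind(db, "sha1-a")` reports a standalone file and the pristine data itself is never touched.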

- "Debuggability" is given to some degree:
  - If the SQLite db is still intact, the pristines can be decompressed and split into pieces with tools like zcat, head and tail.
  - Even if the db is borked, most file formats have heuristic "begin/end" markers (#include lines, JFIF header, etc.) that let forensics find the offset and size using hexdump.

- As most current decompressors for gz and lzma transparently support the decompression of streams which are "first compressed, then concatenated", we could even try to exploit transfer encodings (like transparent gz compression in http) which might already deliver us compressed files.
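The "first compressed, then concatenated" property can be checked quickly with Python's standard gzip and lzma modules, used here only as convenient stand-ins for the decompressors svn would actually link against. The two payloads are arbitrary examples.

```python
import gzip
import lzma

first = b"pristine text of file one\n"
second = b"pristine text of file two\n"

# Compress each file on its own, then simply append the compressed streams.
joined_gz = gzip.compress(first) + gzip.compress(second)
joined_xz = lzma.compress(first) + lzma.compress(second)

# Both decompressors walk through all concatenated streams and return
# the concatenation of the decompressed payloads.
plain_from_gz = gzip.decompress(joined_gz)
plain_from_xz = lzma.decompress(joined_xz)
```

Both results equal `first + second`, so a file that already arrives compressed could be appended to a container without being recompressed, at the cost of not exploiting similarities between files.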

The disadvantage clearly is that we need a few more bits when storing that metadata in the SQLite database instead of in our own file. But in my eyes, these few bytes do not outweigh the overhead of inventing our own metadata storage format, including correct synchronization, transaction safety, etc., which are already provided reliably by SQLite.

Best regards

Markus Schaber

[1] Plus the additional columns like ref_count, checksum etc., which are needed by svn, but are not of interest for this discussion.

-- 
___________________________
We software Automation.
3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax +49-831-54031-50
Email: m.schaber@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: http://www.3s-software.com/index.shtml?sample_projects
Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Received on 2012-03-23 17:50:30 CET
