Re: AW: Compressed Pristines (Custom Format?)

From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Fri, 23 Mar 2012 11:09:59 -0700 (PDT)

>________________________________
> From: Markus Schaber <m.schaber_at_3s-software.com>
>To: Ashod Nakashian <ashodnakashian_at_yahoo.com>
>Cc: "dev_at_subversion.apache.org" <dev_at_subversion.apache.org>
>Sent: Friday, March 23, 2012 8:49 PM
>Subject: AW: Compressed Pristines (Custom Format?)
>
>Hi,
>
>> -----Original Message-----
>
>> The gist of the above is that if we choose a better-than-gz compression
>> and combine the files both *before* and *after* compression, we'll have
>> much more significant results than what we have now just with gz on
>> individual files. This can be noticed using tar.bz2, for example, where
>> the result is not unlike what we can achieve with the custom file format
>> (although bz2 is probably too slow for our needs).
>
>Maybe xz (lzma2) is the algorithm to look at. It usually has a better ratio for cpu_usage/compression_factor, and decompression is nearly as fast as gz.

LZMA is a candidate, but my experience tells me it will be too slow for our needs. There are other good candidates that I hope will give stunning results. The secret algorithm is PPMd (also available in the LZMA SDK), which isn't patented (AFAIK) and has some really fast implementations with licenses compatible with SVN's. In time, we'll get to actually benchmarking them and picking a winner based on the results.
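To make the "combine before compressing" point concrete, here is a rough sketch of the kind of measurement I have in mind. It only uses the codecs that ship with Python (zlib, bz2, lzma; PPMd is not among them), and the synthetic "files" are an assumption standing in for real pristines, so the numbers mean nothing beyond showing the trend.

```python
import bz2
import lzma
import zlib

# Synthetic stand-ins for pristine files (assumption): many small texts
# that share most of their content, as source files in one tree often do.
files = [
    (f"/* file {i} */\n" + "#include <stdio.h>\nint main(void) { return 0; }\n" * 20).encode()
    for i in range(50)
]

codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def per_file_size(compress):
    """Total size when every file is compressed on its own."""
    return sum(len(compress(f)) for f in files)


def solid_size(compress):
    """Size when the files are concatenated first and compressed once."""
    return len(compress(b"".join(files)))


raw = sum(len(f) for f in files)
results = {name: (per_file_size(c), solid_size(c)) for name, c in codecs.items()}
for name, (individual, solid) in results.items():
    print(f"{name:5s} raw={raw} per-file={individual} solid={solid}")
```

A real benchmark would of course run over actual working copies and also record compression and decompression time, since speed is the open question here.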

>
>> I also like the fact that the pristine files are opaque and don't
>> encourage the user to mess with them. Markus raised this point as
>> "debuggability". I don't see "debuggability" as a user requirement (it is
>> justifiably an SVN dev/maintainer requirement) and I don't find reason to
>> add it as one. On the contrary, there are many reasons to suspect the user
>> is doing something gravely wrong when they mess with the pristine files.
>
>So maybe a developer tool for (un)packing pristine archives should be created.

Yes, that's a very reasonable suggestion. It can use the same wc lib.

>
>> Another point raised by Markus is to store "common pristine" files and
>> reuse them to reduce network traffic.
>
>This point is independent of the compression of the pristine store. Both a working-copy-local and a common pristine store can profit from compression.

I'd still keep it separate from the current topic/feature.

>
>> Sqlite may be used as Branko has suggested. I'm not opposed to this. It
>> has its shortcomings (not exploiting inter-file similarities which point
>> #3 makes, for one) but it can be considered as a compromise between
>> individual gz files and the custom pack file. The basic idea would be to
>> store "small" files (after compression) in wc.db and have "link" to
>> compressed files on disk for "large" files.
>
>Maybe a distinct pristine.db is better than putting them in wc.db, but I'm not sure about that.
>
>> My main concern is that
>> frequent updates to small files will leave the sqlite file with heavy
>> external fragmentation (holes within the file unused but consuming disk-
>> space).
>
>Usually, sqlite re-uses free space within the same database rather efficiently.

I don't have data on this, but I know that deletes don't repack the db file. Unless vacuum is used or other data is written, these unused pages (as they are called in the sqlite parlance) will remain wasted.
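For anyone who wants to check this, a small sketch along these lines should show the effect. It assumes the default rollback-journal mode and auto_vacuum disabled (set explicitly below); the table name and blob sizes are made up for the test and have nothing to do with the real wc.db schema.

```python
import os
import sqlite3
import tempfile


def freelist_demo(path):
    # isolation_level=None gives autocommit mode; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA auto_vacuum = NONE")  # make the assumption explicit
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO blobs (data) VALUES (?)",
        [(os.urandom(2000),) for _ in range(500)],
    )
    conn.execute("COMMIT")

    # Deleting rows moves their pages to the freelist; the file keeps its size.
    conn.execute("DELETE FROM blobs")
    free_after_delete = conn.execute("PRAGMA freelist_count").fetchone()[0]
    size_after_delete = os.path.getsize(path)

    # VACUUM rebuilds the database file and drops the free pages.
    conn.execute("VACUUM")
    free_after_vacuum = conn.execute("PRAGMA freelist_count").fetchone()[0]
    size_after_vacuum = os.path.getsize(path)
    conn.close()
    return free_after_delete, size_after_delete, free_after_vacuum, size_after_vacuum


with tempfile.TemporaryDirectory() as tmp:
    stats = freelist_demo(os.path.join(tmp, "test.db"))
print(stats)
```

Free pages do get reused by later inserts, which is Markus's point; the sketch only shows that they are not handed back to the filesystem until a vacuum.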

>
>> The solution is to "vacuum" wc.db, but that depends on its size,
>> will lock it when vacuuming and other factors, so we can't do it as
>> routine.
>
>"svn cleanup" would be a good opportunity.

I think we'll probably use that in either case, and possibly add a new command as well. It's too soon to tell.

>
>
>I just had another idea, we could store the metadata in the SQLite database:
>
>In the wc.db, in the pristine table, store 4 columns[1]: filename, offset, length, algorithm.
>
>"filename" denotes the container file name. Payload files are first concatenated, then compressed, then put into the container. Offset and length are byte-offsets in the decompressed bytestream. "Algorithm" denotes the compression algorithm, with one value reserved for uncompressed storage. If a container grows beyond a specific limit, a new file is created.

This is by and large the proposed approach. The only difference is that the proposal has a competing option: a separate index file to store the metadata. The argument for it is that it is fast to read and load, and yields a very fast in-memory data structure for managing the "containers" (which the proposal calls packs). But wc.db is also being considered as an alternative (proposed first by Greg).
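To check that we are talking about the same thing, here is a minimal sketch of the container scheme as I read it. Everything in it is an assumption for illustration: the table layout, the column names (pack_file, data_offset, data_length stand for Markus's filename, offset, length), the algorithm codes, and zlib as the compressor. It is not the real wc.db schema, and it leaves out the "start a new container past a size limit" rule.

```python
import os
import sqlite3
import tempfile
import zlib

ALGO_NONE, ALGO_ZLIB = 0, 1  # hypothetical codes; 0 is reserved for "stored uncompressed"


def open_index():
    """Create an in-memory stand-in for the metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pristine ("
        " checksum TEXT PRIMARY KEY, pack_file TEXT,"
        " data_offset INTEGER, data_length INTEGER, algorithm INTEGER)"
    )
    return conn


def write_pack(conn, pack_dir, pack_file, files, algorithm):
    """Concatenate the payloads, compress once, record offset/length per file."""
    offset, chunks = 0, []
    for checksum, data in files.items():
        conn.execute(
            "INSERT INTO pristine VALUES (?, ?, ?, ?, ?)",
            (checksum, pack_file, offset, len(data), algorithm),
        )
        chunks.append(data)
        offset += len(data)  # offsets refer to the decompressed byte stream
    payload = b"".join(chunks)
    if algorithm == ALGO_ZLIB:
        payload = zlib.compress(payload)
    with open(os.path.join(pack_dir, pack_file), "wb") as f:
        f.write(payload)
    conn.commit()


def read_pristine(conn, pack_dir, checksum):
    """Look up one pristine and slice it out of the decompressed container."""
    pack_file, offset, length, algorithm = conn.execute(
        "SELECT pack_file, data_offset, data_length, algorithm"
        " FROM pristine WHERE checksum = ?",
        (checksum,),
    ).fetchone()
    with open(os.path.join(pack_dir, pack_file), "rb") as f:
        payload = f.read()
    if algorithm == ALGO_ZLIB:
        payload = zlib.decompress(payload)
    return payload[offset:offset + length]


files = {"sha1-aaa": b"int main(void) { return 0; }\n", "sha1-bbb": b"# README\n" * 5}
with tempfile.TemporaryDirectory() as pack_dir:
    conn = open_index()
    write_pack(conn, pack_dir, "pack-0001", files, ALGO_ZLIB)
    restored = {c: read_pristine(conn, pack_dir, c) for c in files}
    conn.close()
```

Note that this naive reader decompresses the whole container to fetch a single file, which is exactly the kind of speed cost that needs measuring for either metadata store.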

>
>The main advantage of storing the metadata in SQLite is that we do not need to invent any new file format.
>
>Some other positive aspects (some of them are clearly also possible using your original proposal):
>
>- This allows applying concatenation and compression orthogonally, on a container-by-container basis:
> - So we can handle short, non-compressible files just by concatenating them in an uncompressed container.
> - We can handle large, well-compressible files the "each file has its own container" way.
>
>- By reserving a special length value (like -1 or SQL NULL) for "look at the file on disk", we can quickly upgrade existing working copies without touching the pristine files at all.
> - This way, we could make the WC upgrade implicit again, as we only add three columns with well-defined default values (offset=0, length=-1, algorithm=uncompressed) to the table.
> - "svn cleanup" could grow an option to reorganize / optimize the pristine storage.
>
>- "Debuggability" is preserved to some degree:
> - If the SQLite db is still intact, the pristines can be decompressed and split into pieces just with tools like zcat, head and tail.
> - Even if the db is borked, most file formats have some heuristic "begin/end" markers (#include lines, JFIF header, etc.) which allow forensics to find the offset and size using hexdump.
>
>- As most current decompressors for gz and lzma transparently support the decompression of streams which are "first compressed, then concatenated", we could even try to exploit transfer encodings (like transparent gz compression in http) which might already deliver us compressed files.
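The "first compressed, then concatenated" property is easy to confirm with Python's standard library, whose gzip and lzma one-shot decompressors both accept multi-stream input. The payloads below are made up; this is only a sketch of the property, not of how SVN would receive data over HTTP.

```python
import gzip
import lzma

part_a = b"first pristine\n" * 10
part_b = b"second pristine\n" * 10

# Compress each part independently, then concatenate the compressed streams.
gz_stream = gzip.compress(part_a) + gzip.compress(part_b)
xz_stream = lzma.compress(part_a) + lzma.compress(part_b)

# Both decompressors walk through every member/stream in the input
# and return the concatenation of the decompressed results.
gz_whole = gzip.decompress(gz_stream)
xz_whole = lzma.decompress(xz_stream)

print(gz_whole == part_a + part_b, xz_whole == part_a + part_b)
```

The flip side is that streams compressed separately share no dictionary, so this layout gives up the inter-file gains of compressing after concatenation.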
>
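The implicit-upgrade idea from the list above (add columns whose defaults mean "look at the file on disk") could look roughly like this. The starting table is a simplified, hypothetical stand-in for the real pristine table, and the column names and the -1 sentinel are assumptions taken from Markus's description.

```python
import sqlite3

ON_DISK = -1   # sentinel length: the pristine is still a plain file on disk
ALGO_NONE = 0  # reserved code for uncompressed storage

conn = sqlite3.connect(":memory:")
# Simplified pre-upgrade table with one existing pristine row.
conn.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY, refcount INTEGER)")
conn.execute("INSERT INTO pristine VALUES ('abc123', 1)")

# The upgrade only adds columns; existing rows pick up the defaults
# without any pristine file being read or rewritten.
for ddl in (
    "ALTER TABLE pristine ADD COLUMN data_offset INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE pristine ADD COLUMN data_length INTEGER NOT NULL DEFAULT -1",
    "ALTER TABLE pristine ADD COLUMN algorithm INTEGER NOT NULL DEFAULT 0",
):
    conn.execute(ddl)

row = conn.execute(
    "SELECT data_offset, data_length, algorithm FROM pristine WHERE checksum = 'abc123'"
).fetchone()
needs_repack = row[1] == ON_DISK  # a later cleanup pass could move it into a pack
print(row, needs_repack)
```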
>The disadvantage clearly is that we need a few more bits when storing that metadata in the SQLite database instead of in our own file. But in my eyes, these few bytes do not outweigh the overhead of inventing our own metadata storage format, including correct synchronization, transaction safety, etc., which are already provided reliably by SQLite.

The overhead in size isn't an issue (at least not a major one); rather, the overhead in *speed* is. At least that's my argument. I need to expand that section to cover both approaches with their pros and cons, but first let's agree on the topic being debated: to use a custom format, or not.

Thanks,
Ash

>
>Best regards
>
>Markus Schaber
>
>[1] Plus the additional columns like ref_count, checksum etc., which are needed by svn, but are not of interest for this discussion.
Received on 2012-03-23 19:10:38 CET