Re: pristine store design

From: Stefan Sperling <stsp_at_elego.de>
Date: Wed, 3 Mar 2010 17:51:17 +0100

On Wed, Mar 03, 2010 at 11:03:59AM -0500, Greg Stein wrote:
> We have one area for metadata, and that is the sqlite file named wc.db.
> Spreading metadata around is not very attractive. New custom file
> formats is also unattractive.

That's true. But it may be worth making an exception for this.
Keep in mind that once the pristine is created, its data never changes.

We could use a skel for the header if we decide to make an exception
and store metadata about pristines inside the pristines themselves.
That way we won't have to write a new custom parser.

We'd still use the wc.db for any volatile data concerning the pristine.
But so far it looks like the only volatile data is the mtime, which
according to Neels was invented as a way to "cheaply verify" partial
reads (WTF?). The scheme proposed below makes storing the mtime unnecessary.

> You should not have any problems moving a pristine into place and
> storing a record into the db. Sync is not the problem.

DB access needs to be synchronised, whereas creating the pristine is
lock-free. Why not just go lock-free all the way?

> Funky headers in the file means that I cannot do:
> $ sha1sum myfile.c
> abcdef...
> $ sha1sum .svn/pristine/ab/abcdef...
> abcdef...

Why would you want to do that?

If the pristine has the same name as the SHA1 of myfile.c, you know
they are related. You just could not validate a pristine without
using some svn command that knows how to skip the pristine's header
and verify just the content. But why would you want to use Subversion's
pristine store without also using svn? Which tool, if not svn itself,
would want to verify pristines?

> The files you're talking about storing do not actually *have* the
> desired checksum. That's just nasty.

I think that splitting up pristines for large files is a nice way
of dealing with the problem of verifying partial reads.
Require callers to read at least a whole chunk of which the checksum
is known and you can verify even partial reads relatively cheaply.

It also helps in dealing with file system limitations in the future,
e.g. if the pristine store resides on a file system that does not
support files as large as the filesystem the working copy is
sitting in -- pristine store on FAT32, working copy on ZFS :)

For example:

pristine for 16GB file (named "sha1 of 16GB file"):
[header saying number of fixed-size blocks, size of last block in bytes,
 and sha1 of what follows][sha1 name of block 1][sha1 name of block 2][...]

pristine for block 1 (named "sha1 of block 1's content"):
[header][block 1 data]

pristine for block 2 (named "sha1 of block 2's content"):
[header][block 2 data]


A block is "just another pristine".
So a block can happen to also serve as a pristine for a different 10MB
file which happens to have the same content as the block.
We set the block size to something fixed, like 10MB.
The last block is allowed to be smaller and runs till EOF.
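
To make the block scheme concrete, here is a rough Python sketch. It is
only an illustration under several assumptions: the store is an in-memory
dict rather than files under .svn/pristine, the per-pristine header is
left out, and the names store_pristine and read_range are made up for
this example. A partial read verifies only the blocks it touches.

```python
import hashlib

BLOCK_SIZE = 10 * 1024 * 1024  # the 10MB example size, taken as 10 MiB


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def store_pristine(store, data, block_size=BLOCK_SIZE):
    """Store data under its SHA1 name, splitting it into blocks if large."""
    name = sha1_hex(data)
    if len(data) <= block_size:
        store[name] = ("plain", data)
        return name
    block_names = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]  # the last block may be shorter
        block_name = sha1_hex(block)
        store[block_name] = ("plain", block)  # a block is just another pristine
        block_names.append(block_name)
    fixed_blocks = len(block_names) - 1
    last_size = len(data) - fixed_blocks * block_size
    store[name] = ("index", fixed_blocks, last_size, block_names)
    return name


def read_range(store, name, offset, length, block_size=BLOCK_SIZE):
    """Read a byte range, verifying every block that the range touches."""
    entry = store[name]
    if entry[0] == "plain":
        data = entry[1]
        if sha1_hex(data) != name:
            raise ValueError("corrupt pristine " + name)
        return data[offset:offset + length]
    _, _fixed_blocks, _last_size, block_names = entry
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(block_names) - 1)
    out = bytearray()
    for i in range(first, last + 1):
        block = store[block_names[i]][1]
        if sha1_hex(block) != block_names[i]:
            raise ValueError("corrupt block " + block_names[i])
        out += block
    start = offset - first * block_size
    return bytes(out[start:start + length])
```

Storing a file whose content equals one of the blocks returns the name
of that block, so the two share one entry in the store.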

We can stat the size of small pristines, and parse a small header
to find the size of large pristines:
  number of fixed-size blocks * block size + size of last block
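
As a worked example of that arithmetic, here is a small Python sketch.
It assumes binary units (16GB as 16 GiB, 10MB as 10 MiB) and that the
last block is always counted separately, even when it happens to be
full-sized. The function names are made up for illustration.

```python
def split_size(total, block_size):
    """Return (fixed_blocks, last_block_size) for a file of `total` bytes."""
    if total <= 0:
        return 0, 0
    fixed_blocks = (total - 1) // block_size  # the last block is counted separately
    return fixed_blocks, total - fixed_blocks * block_size


def pristine_size(fixed_blocks, last_block_size, block_size):
    """Recover the total size from the two numbers kept in the header."""
    return fixed_blocks * block_size + last_block_size


GiB = 1024 ** 3
MiB = 1024 ** 2
fixed_blocks, last_size = split_size(16 * GiB, 10 * MiB)
print(fixed_blocks, last_size)  # 1638 4194304
print(pristine_size(fixed_blocks, last_size, 10 * MiB) == 16 * GiB)  # True
```

So under these assumptions a 16GB file becomes 1638 full blocks plus a
4 MiB last block, and the header needs only those two numbers.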

Of course, reading large files would involve a bit of seeking.

Also, this extension is optional -- for 1.7, we can write 16GB
pristines as single files and verify them upon read (even for partial
reads). Later, we can add the proposed scheme on top and get cheaply
verifiable partial reads. Or if we get there in time, we can already
have it in 1.7.

Starting with 1.7, the header would at least store the MD5 of the content
of the pristine. The header indicates its own size, so we're backwards
compatible forever since it's easy to skip unknown header data and read
and verify the content.
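
A minimal Python sketch of such a self-sizing header follows. The
layout is purely hypothetical: a 4-byte big-endian header length, then
the 16-byte MD5 of the content, then any later fields. A real
implementation would more likely use a skel, as suggested above; the
fixed binary layout and the function names here are only for
illustration.

```python
import hashlib
import struct

MD5_LEN = 16


def write_pristine(content, extra=b""):
    """Build a pristine: [header length][MD5 of content][extra][content]."""
    header = hashlib.md5(content).digest() + extra
    return struct.pack(">I", len(header)) + header + content


def read_pristine(blob):
    """Skip the header, including unknown fields, and verify the content."""
    (header_len,) = struct.unpack_from(">I", blob, 0)
    if header_len < MD5_LEN or 4 + header_len > len(blob):
        raise ValueError("bad header length")
    md5 = blob[4:4 + MD5_LEN]
    content = blob[4 + header_len:]  # anything after the MD5 is skipped unread
    if hashlib.md5(content).digest() != md5:
        raise ValueError("content does not match header MD5")
    return content
```

A reader written this way accepts a pristine whose header carries extra
fields it does not know about, because it only trusts the stated header
length to find where the content starts.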

Received on 2010-03-03 17:52:19 CET

This is an archived mail posted to the Subversion Dev mailing list.