pristine store database -- was: pristine store design

From: Neels J Hofmeyr <neels_at_elego.de>
Date: Tue, 16 Feb 2010 14:54:56 +0100

Philip Martin wrote:
> Neels J Hofmeyr <neels_at_elego.de> writes:
>
>> THE PRISTINE STORE
>> ==================
>>
>> The pristine store is a local cache of complete content of files that are
>> known to be in the repository. It is hashed by a checksum of that content
>> (SHA1).
>
> I'm not sure whether you are planning one table per pristine store or
> one table per working copy, but I think it's one per pristine store.
> Obviously it makes no difference until pristine stores can be
> shared (and it might be one per directory in the short term depending
> on when we stop being one database per directory).

Thanks for that. This is the tip of an iceberg called 'a pristine store does
not equal a working copy [root]'.

The question is how to store the PRISTINE table (also see below) once it
serves various working copies. Will we have a separate SQLite db store, and
create a new file system entity called 'pristine store' that the user can
place anywhere, like a working copy?

We could also keep pristine store and working copy welded together, so that
one working copy can use the pristine store of another working copy. A
'pristine store' that isn't used as a working copy would then just be a
--depth=empty checkout of any folder URL of that repository. In practice
this has the same effect as completely separating pristine stores from
working copies (there is another SQLite store somewhere else), but we can
simply re-use the WC API. There is no need for a separate pristine *store*
API (create new store, contact local store database, indicate a store
location, check presence given a location, etc.).

>
>> SOME IMPLEMENTATION INSIGHTS
>> ============================
>>
>> There is a PRISTINE table in the SQLite database with columns
>> (checksum, md5_checksum, size, refcount)
>>
>> The pristine contents are stored in the local filesystem in a pristine file,
>> which may or may not be compressed (opaquely hidden behind the pristines API).
>> The goal is to be able to have a pristine store per working copy, per user as
>> well as system-wide, and to configure each working copy as to which pristine
>> store(s) it should use for reading/writing.
>>
>> There is a canonical way of getting a given CHECKSUM's pristine file name for
>> a given working copy without contacting the WC database (static function
>> get_pristine_fname()).
>>
>> When interacting with the pristine store, we want to, as appropriate, check
>> for (combos of):
>>   db-presence    - presence in the PRISTINE table with noted file size > 0
>>   file-presence  - pristine file presence
>>   stat-match     - PRISTINE table's size and mtime match file system
>>   checksum-match - validity of data in the file against the checksum
>>
>> file-presence is gotten for free from a successful stat-match (fstat),
>> checksum-match (fopen) and unchecked read of the file (fopen).
>>
>> How fast we consider things:
>>   db-presence    - very fast to moderately fast (in case of "empty db cache")
>>   file-presence  - slow (fstat or fopen)
>>   stat-match     - slow (fstat plus SQLite query)
>>   checksum-match - super slow (reading, checksumming)
>
> I'm prepared to believe a database query can be faster than stat when
> the inode cache is cold, but what about when the inode cache is hot?

Also thanks for this!

I don't know that much about database/file system benchmarks, let alone on
different platforms. My initial classifications are mostly guessing, mixed
with provocative prodding to wake up more experienced devs ;)

I'm also not really aware how expensive it is to calculate a checksum while
reading a stream for other purposes. How much CPU time does it add if the
file I/O would happen anyway? Is it negligible?

I guess we'll ultimately have to just try out what performs best.
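
One rough way to "just try it out" is a micro-benchmark like the sketch
below. It is only an illustration: the table layout follows the column list
quoted above, but the column types, the file size and the function name
time_stat_vs_db() are assumptions. It measures only the hot-cache case in a
single process, so it says nothing about cold caches or lock contention.

```python
import os
import sqlite3
import tempfile
import time


def time_stat_vs_db(iterations=1000):
    """Return (stat_seconds, db_seconds) for repeated lookups of one pristine."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pristine")
        with open(path, "wb") as f:
            f.write(b"x" * 4096)

        db = sqlite3.connect(os.path.join(tmp, "wc.db"))
        # Columns as listed in the design notes; the types are assumed.
        db.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY, "
                   "md5_checksum TEXT, size INTEGER, refcount INTEGER)")
        db.execute("INSERT INTO pristine VALUES (?, ?, ?, ?)",
                   ("abc", "def", 4096, 1))
        db.commit()

        # file-presence / size via stat()
        start = time.perf_counter()
        for _ in range(iterations):
            os.stat(path)
        stat_seconds = time.perf_counter() - start

        # db-presence / size via a keyed SQLite lookup
        start = time.perf_counter()
        for _ in range(iterations):
            db.execute("SELECT size FROM pristine WHERE checksum = ?",
                       ("abc",)).fetchone()
        db_seconds = time.perf_counter() - start

        db.close()
        return stat_seconds, db_seconds


if __name__ == "__main__":
    stat_s, db_s = time_stat_vs_db()
    print(f"stat: {stat_s:.6f}s  sqlite: {db_s:.6f}s")
```

The numbers will differ per platform and file system, which is the point of
measuring rather than guessing.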

> If the database query requires even one system call then it could well
> be slower. Multiple processes accessing a working copy, or writing to
> the pristine store, might bias this further towards stat being faster.
> If we decide to share the pristine store between several working
> copies then a shared database could become a bottleneck.
>
> [...]
>
>> Use case "need": "I want to use this pristine's content, definitely."
>> ---------------
>> pseudocode:
>>   pristine_check(&present, checksum, _usable)       (3)
>>   if !present:
>>     get_pristine_from_repos(checksum, ra)           (9)
>>   pristine_read(&stream, checksum)                  (6)
>>
>> (3) check for _usable:
>>     - db-presence
>>       - if the checksum is not present in the table, return that it is not
>>         present (don't check for file existence as well).
>>     - stat-match (includes file-presence)
>>       - if the checksum is present in the table but file is bad/not there,
>>         bail, asking user to 'svn cleanup --pristines' (or sth.)
>>
>> (9) See use case "fetch". After this, either the pristine file is ready for
>>     reading, or "fetch" has bailed already.
>>
>> (6) fopen()
>
>
> I think this is the most important case from a performance point of
> view. This is what 'svn status' et al. use, and it's important for
> GUIs as a lot of the "feel" depends on how fast a process can query
> the metadata.

Agreed.
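
To make the quoted "need" flow concrete, here is a minimal sketch of it. It
is not the wc-ng implementation. The flat file layout (file named after the
checksum), the mtime_ns column, the PristineCorrupt exception and the
fetch() callback are all assumptions made for illustration.

```python
import os
import sqlite3


class PristineCorrupt(Exception):
    """The table lists the pristine but the file is missing or altered."""


def pristine_check(db, store_dir, checksum):
    """Step (3): db-presence first, then stat-match on the file."""
    row = db.execute("SELECT size, mtime_ns FROM pristine WHERE checksum = ?",
                     (checksum,)).fetchone()
    if row is None:
        return False  # not in the table: do not look at the file system
    try:
        st = os.stat(os.path.join(store_dir, checksum))
    except FileNotFoundError:
        raise PristineCorrupt(checksum)
    if (st.st_size, st.st_mtime_ns) != tuple(row):
        raise PristineCorrupt(checksum)
    return True


def need_pristine(db, store_dir, checksum, fetch):
    """Use case "need": return an open stream on the pristine content."""
    if not pristine_check(db, store_dir, checksum):
        fetch(checksum)  # step (9); hypothetical stand-in for use case "fetch"
    return open(os.path.join(store_dir, checksum), "rb")  # step (6)
```

In this sketch a caller would catch PristineCorrupt and tell the user to run
some cleanup, as the pseudocode suggests.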

> If we were to do away with the PRISTINE table, then we would not have
> to worry about it becoming a bottleneck. We don't need the existence
> check if we are just about to open the file, since opening the file
> proves that it exists.

<rant>Yes, I meant that, semantically, there has to be an existence check.
You're right that it is gotten for free from opening the file. It's still
important to note where the antenna sits that detects non-existence.</rant>

> We obviously have the checksum already, from
> the BASE/WORKING table, so we only need the PRISTINE table for the
> size/mtime. Perhaps we could store those in the BASE/WORKING table
> and eliminate the PRISTINE table, or is this too much of a layering
> violation? The pristine store is then just a sharded directory, into
> which we move files and from which we read files.

-1

While we could store size&mtime in the BASE/WORKING tables, this causes size
and mtime to be stored multiple times (wherever a pristine is referenced)
and involves editing multiple entries when a pristine is removed/added due
to high-water-mark or repair. That would be nothing less than horrible.
Taking one step away from that, each working copy should have a dedicated
table that stores size and mtime only once. Then we still face the situation
that size and mtime are stored multiple times (once per working copy), and
where, if a central pristine store is restructured, every working copy has
to be updated. Bad idea.

Instead, we could not store size and mtime at all! :)

They are merely half-checks for validity. During normal operation, size and
mtime should never change, because we don't open write streams to pristines.
If anyone messes with the pristine store accidentally, we would pick it up
with the size, or if that stayed the same, with the mtime. But we can pick
up all cases of bitswaps/disk failure *only* by verifying *full checksum
validity*!
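
A small sketch of that gap, with made-up helper names (stat_match(),
checksum_match(), demo()): a file is altered without changing its length,
its mtime is put back, and the size/mtime check still passes while the SHA1
check fails.

```python
import hashlib
import os
import tempfile


def stat_match(path, size, mtime_ns):
    """Half-check: compare recorded size and mtime against the file system."""
    st = os.stat(path)
    return st.st_size == size and st.st_mtime_ns == mtime_ns


def checksum_match(path, sha1_hex):
    """Full check: read the whole file and compare its SHA-1."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest() == sha1_hex


def demo():
    """Return (stat_match, checksum_match) for a silently corrupted file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pristine")
        content = b"pristine content\n"
        with open(path, "wb") as f:
            f.write(content)
        st = os.stat(path)
        recorded_sha1 = hashlib.sha1(content).hexdigest()

        # Simulated corruption: same length, one byte changed, mtime restored.
        with open(path, "wb") as f:
            f.write(b"pristine c0ntent\n")
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

        return (stat_match(path, st.st_size, st.st_mtime_ns),
                checksum_match(path, recorded_sha1))
```

Here demo() reports that the stat check passes and the checksum check does
not, which is exactly the case size and mtime cannot catch.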

So, while checking size and mtime gives a sense of basic sanity, it is
really just a puny excuse for not checking full checksum validity. If we
really care about correctness of pristines, *every* read of a pristine
should verify the checksum along the way. (That would mean always reading
the complete pristine, even if just a few lines from the middle are needed.)
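
Verifying "along the way" could look roughly like the sketch below: the
digest is updated with each chunk that is read anyway, and a mismatch is
reported at end of stream. The names read_verified() and ChecksumMismatch
are invented for this example.

```python
import hashlib


class ChecksumMismatch(Exception):
    """The content read does not hash to the expected SHA-1."""


def read_verified(stream, expected_sha1, chunk_size=64 * 1024):
    """Yield chunks from stream; raise at end of stream if the SHA-1 differs."""
    digest = hashlib.sha1()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # checksum work piggybacks on the read
        yield chunk
    if digest.hexdigest() != expected_sha1:
        raise ChecksumMismatch(expected_sha1)
```

Note the caveat: the consumer has already seen the data by the time the
mismatch is raised, so it must be able to discard what it did with it.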

* neels dreams of disks that hardware-checksum on-the-fly

If I further follow my dream of us emulating such hardware, we would store
checksums for sub-chunks of each pristine, so that we can read small
sections of pristines, being sure that the given section is correct without
having to read the whole pristine.
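
As a sketch of the sub-chunk idea (the 4096-byte chunk size, the function
names and the in-memory checksum list are assumptions, nothing like this
exists yet): store one SHA1 per fixed-size chunk, then verify only the
chunks a partial read touches.

```python
import hashlib

CHUNK = 4096  # assumed chunk size


def chunk_checksums(data, chunk=CHUNK):
    """Return one SHA-1 hex digest per fixed-size chunk of data."""
    return [hashlib.sha1(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]


def read_range_verified(stream, checksums, offset, length, chunk=CHUNK):
    """Read [offset, offset + length), verifying only the chunks it touches.

    The range is assumed to lie within the pristine.
    """
    if length <= 0:
        return b""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    stream.seek(first * chunk)
    buf = bytearray()
    for index in range(first, last + 1):
        block = stream.read(chunk)
        if hashlib.sha1(block).hexdigest() != checksums[index]:
            raise ValueError(f"chunk {index} is corrupt")
        buf += block
    start = offset - first * chunk
    return bytes(buf[start:start + length])
```

A read in the middle of the file then costs at most the touched chunks plus
their digests, instead of the whole pristine.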

Whoa, look where you got me now! ;)

I think it's a very valid question. Chuck the mtime and size, thus get rid
of the PRISTINE table, thus do away with checking for any inconsistency
between table and file system, also do away with possible database
bottlenecks, and reduce the location of the pristine store to a mere local
abspath. We have the checksum, we have the filename. Checking mtime and
length protects against accidental editing of the pristine files. But any
malicious or hw-failure corruption can in fact be *protected* by keeping
mtime and length intact! ("hey, we checked it, it must be correct.")
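
With no table, locating a pristine is pure string manipulation. A minimal
sketch follows. The function name is borrowed from the static
get_pristine_fname() mentioned earlier, but this signature and the
two-character shard directory are assumptions, not the actual layout.

```python
import os


def get_pristine_fname(store_abspath, sha1_hex, shard_chars=2):
    """Map a SHA-1 hex digest to a file path under the store directory."""
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("expected a lowercase 40-character SHA-1 hex digest")
    # Shard by the leading characters so no single directory grows too large.
    return os.path.join(store_abspath, sha1_hex[:shard_chars], sha1_hex)
```

Presence is then simply whether opening that path succeeds.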

Let's play through a corrupted pristine (with unchanged mtime/length). This
is just theoretical...

Commit modification:

  - User makes a checkout / revert / update that uses a locally
    corrupted pristine. The corrupted pristine thus sits in the WC.

  - User makes a text mod

  - User commits

  - Client/network layer communicate the *delta* between the local pristine
    and the local mod to the repository, and the checksum of the modified
    text.

  - Repos applies the delta to the intact pristine it has in *its* store.

  - Repos finds the resulting checksum to be *different* from the client's
    checksum, because the underlying pristine was corrupt.

--> Yay! No need to do *ANY* local verification at all!!

Of course, in case the client/network layer decide to send the full text
instead of a delta, the corruption is no longer detected. :(
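
The commit scenario can be played through with a toy model. This is not
svndiff or the real commit protocol: the copy/data delta format below and
all function names are invented for illustration, using difflib to find the
ranges the delta copies from its base.

```python
import difflib
import hashlib


def make_delta(base, target):
    """Toy delta: copy ranges of base, or carry literal data from target."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops


def apply_delta(base, ops):
    """Rebuild the target text from base and a toy delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def commit_detects_corruption(intact, corrupt, working):
    """True if the repository-side checksum differs from the client's."""
    delta = make_delta(corrupt, working)             # client side
    client_sha1 = hashlib.sha1(working).hexdigest()
    result = apply_delta(intact, delta)              # repository side
    return hashlib.sha1(result).hexdigest() != client_sha1
```

In this model the mismatch shows up only when the delta copies the corrupted
bytes from the base. If the user's modification happens to overwrite them,
or the full text is sent, the repository has nothing to compare against.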

Merge and commit:

  - User makes a merge that uses a locally corrupted pristine.

  - The merge *delta* applied to the working copy is incorrect.

  - User does not notice the corruption (e.g. via --accept=mine-full)

  - User commits

  - Repos accepts the changes based on the corrupted pristine that was
    used to get the merge delta, because it can't tell the difference
    from a normal modification.

--> My goodness, merge needs to check pristine validity on each read,
    as if it wasn't slow enough. But as discussed above, even if merge
    checked mtime and length, it would not necessarily detect disk failure
    and crafted malicious corruption.

Thanks, Philip.

I'm now challenging the need to store mtime and length, and suggesting we
do more checksumming instead. The checksumming overhead could be smaller
than the cost of a database bottleneck.

For future optimisation, I'm also suggesting that checksums for small chunks
of each pristine be stored in addition, while the pristine is still indexed
by the full checksum.
(Which may imply a db again :/ , but that db would only be hit if we're
trying to save time by reading just a small bit of the pristine)

Everyone, please prove me wrong!

Thanks,
~Neels

Received on 2010-02-16 14:55:41 CET
