Re: pristine store database -- was: pristine store design

From: Neels J Hofmeyr <neels_at_elego.de>
Date: Tue, 16 Feb 2010 15:24:10 +0100

Commenting on myself...

Neels J Hofmeyr wrote:
> Philip Martin wrote:
>> Neels J Hofmeyr <neels_at_elego.de> writes:
>>
>>> THE PRISTINE STORE
>>> ==================
>>>
>>> The pristine store is a local cache of complete content of files that are
>>> known to be in the repository. It is hashed by a checksum of that content
>>> (SHA1).
>> I'm not sure whether you are planning one table per pristine store or
>> one table per working copy, but I think it's one per pristine store.
>> Obviously it makes no difference until pristine stores can be
>> shared (and it might be one per directory in the short term, depending
>> on when we stop having one database per directory).
>
> Thanks for that. This is the tip of an iceberg called 'a pristine store does
> not equal a working copy [root]'.
>
> The question is how to store the PRISTINE table (also see below) once it
> serves various working copies. Will we have a separate SQLite db store, and
> create a new file system entity called 'pristine store' that the user can
> place anywhere, like a working copy?
>
> We could also keep pristine store and working copy welded together, so that
> one working copy can use the pristine store of another working copy, and
> that a 'pristine store' that isn't used as a working copy is just a
> --depth=empty checkout of any folder URL of that repository. It practically
> has the same effect as completely separating pristine stores from working
> copies (there is another SQLite store somewhere else), but we can just
> re-use the WC API, no need to have a separate pristine *store* API (create
> new store, contact local store database, indicate a store location, checking
> presence given a location, etc.).

If we have a single wc.db per user, we can also easily have a single
pristine store per user. Until then, we should probably use a separate
pristine store per WC...?

A system-wide pristine store also has to cope with write permissions, so
that may be a different thing entirely (like a local service daemon instead
of a publicly writable file system location...)

>
>>> SOME IMPLEMENTATION INSIGHTS
>>> ============================
>>>
>>> There is a PRISTINE table in the SQLite database with columns
>>> (checksum, md5_checksum, size, refcount)
>>>
>>> The pristine contents are stored in the local filesystem in a pristine file,
>>> which may or may not be compressed (opaquely hidden behind the pristines API).
>>> The goal is to be able to have a pristine store per working copy, per user as
>>> well as system-wide, and to configure each working copy as to which pristine
>>> store(s) it should use for reading/writing.
>>>
>>> There is a canonical way of getting a given CHECKSUM's pristine file name for
>>> a given working copy without contacting the WC database (static function
>>> get_pristine_fname()).
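
A minimal sketch of such a checksum-to-filename mapping, in Python for brevity. The layout (a two-character shard directory below the store root) and the helper name pristine_fname() are assumptions for illustration only, not the actual get_pristine_fname():

```python
import hashlib
import os

def pristine_fname(store_root, sha1_hex):
    """Map a SHA1 hex digest to a file path without touching any database.

    Assumed layout: <store_root>/<first two hex chars>/<full digest>.
    """
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("not a SHA1 hex digest: %r" % (sha1_hex,))
    return os.path.join(store_root, sha1_hex[:2], sha1_hex)

# Example: the path is derived from the content's checksum alone.
digest = hashlib.sha1(b"hello\n").hexdigest()
path = pristine_fname("/wc/.svn/pristine", digest)
```

Because the name depends only on the store location and the checksum, any caller holding a checksum can go straight to the file system.
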
>>>
>>> When interacting with the pristine store, we want to, as appropriate, check
>>> for (combos of):
>>> db-presence - presence in the PRISTINE table with noted file size > 0
>>> file-presence - pristine file presence
>>> stat-match - PRISTINE table's size and mtime match file system
>>> checksum-match - validity of data in the file against the checksum
>>>
>>> file-presence is gotten for free from a successful stat-match (fstat),
>>> checksum-match (fopen) and unchecked read of the file (fopen).
>>>
>>> How fast we consider things:
>>> db-presence - very fast to moderately fast (in case of "empty db cache")
>>> file-presence - slow (fstat or fopen)
>>> stat-match - slow (fstat plus SQLite query)
>>> checksum-match - super slow (reading, checksumming)
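
To make the difference between stat-match and checksum-match concrete, here is a rough Python sketch. The function names and the idea of recording mtime in nanoseconds are assumptions; the demo at the end simulates a same-size corruption whose mtime has been restored:

```python
import hashlib
import os
import tempfile

def stat_match(path, recorded_size, recorded_mtime_ns):
    """stat-match: a single stat(); a missing file counts as a mismatch."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size == recorded_size and st.st_mtime_ns == recorded_mtime_ns

def checksum_match(path, sha1_hex):
    """checksum-match: read the whole file and compare its SHA1."""
    h = hashlib.sha1()
    try:
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
    except FileNotFoundError:
        return False
    return h.hexdigest() == sha1_hex

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pristine")
    with open(path, "wb") as f:
        f.write(b"good content")
    st = os.stat(path)
    sha1 = hashlib.sha1(b"good content").hexdigest()
    ok_before = (stat_match(path, st.st_size, st.st_mtime_ns)
                 and checksum_match(path, sha1))

    # Overwrite with same-length data, then put the old mtime back.
    with open(path, "wb") as f:
        f.write(b"evil content")
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
    stat_after = stat_match(path, st.st_size, st.st_mtime_ns)
    checksum_after = checksum_match(path, sha1)
    missing = stat_match(os.path.join(tmp, "nope"), 0, 0)
```

In the demo the stat-match still passes after the corruption, while the checksum-match fails.
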
>> I'm prepared to believe a database query can be faster than stat when
>> the inode cache is cold, but what about when the inode cache is hot?
>
> Also thanks for this!
>
> I don't know that much about database/file system benchmarks, let alone on
> different platforms. My initial classifications are mostly guessing, mixed
> with provocative prodding to wake up more experienced devs ;)
>
> I'm also not really aware how expensive it is to calculate a checksum while
> reading a stream for other purposes. How much cpu time does it add if the
> file I/O would happen anyway? Is it negligible?
>
> I guess we'll ultimately have to just try out what performs best.
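
One way to try this out is a read wrapper that hashes the bytes as they pass through, so no second pass over the file is needed. A minimal Python sketch follows; the class name and the choice to verify once a read returns no more data are assumptions:

```python
import hashlib
import io

class ChecksummingReader:
    """Wrap a binary stream and hash every byte as it is read."""

    def __init__(self, stream, expected_sha1_hex):
        self._stream = stream
        self._expected = expected_sha1_hex
        self._hash = hashlib.sha1()

    def read(self, size=-1):
        data = self._stream.read(size)
        if data:
            self._hash.update(data)
        elif size != 0:
            # End of stream reached: only now is the checksum complete.
            if self._hash.hexdigest() != self._expected:
                raise OSError("pristine checksum mismatch")
        return data

def read_all(reader, block_size=4096):
    """Consume the reader to the end, which also triggers verification."""
    out = bytearray()
    while True:
        block = reader.read(block_size)
        if not block:
            return bytes(out)
        out += block

good = b"abc" * 1000
good_sha1 = hashlib.sha1(good).hexdigest()
intact_copy = read_all(ChecksummingReader(io.BytesIO(good), good_sha1))
```

Note that a mismatch is only reported at end of stream, so a consumer may already have acted on bad data by then.
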
>
>> If the database query requires even one system call then it could well
>> be slower. Multiple processes accessing a working copy, or writing to
>> the pristine store, might bias this further towards stat being faster.
>> If we decide to share the pristine store between several working
>> copies then a shared database could become a bottleneck.
>>
>> [...]
>>
>>> Use case "need": "I want to use this pristine's content, definitely."
>>> ---------------
>>> pseudocode:
>>>   pristine_check(&present, checksum, _usable)      (3)
>>>   if !present:
>>>     get_pristine_from_repos(checksum, ra)          (9)
>>>   pristine_read(&stream, checksum)                 (6)
>>>
>>> (3) check for _usable:
>>>     - db-presence
>>>       - if the checksum is not present in the table, return that it is not
>>>         present (don't check for file existence as well).
>>>     - stat-match (includes file-presence)
>>>       - if the checksum is present in the table but file is bad/not there,
>>>         bail, asking user to 'svn cleanup --pristines' (or similar).
>>>
>>> (9) See use case "fetch". After this, either the pristine file is ready for
>>> reading, or "fetch" has bailed already.
>>>
>>> (6) fopen()
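
The "need" flow above, as a toy Python sketch. A dict stands in for the PRISTINE table, the fetch callback stands in for the repository access, and all names (ToyPristineStore, need, fetch) are hypothetical:

```python
import hashlib
import os
import tempfile

class ToyPristineStore:
    def __init__(self, root):
        self.root = root
        self.table = {}  # sha1 hex -> (size, mtime_ns); stands in for PRISTINE

    def fname(self, sha1_hex):
        return os.path.join(self.root, sha1_hex)

    def install(self, sha1_hex, content):
        if hashlib.sha1(content).hexdigest() != sha1_hex:
            raise ValueError("fetched content does not match checksum")
        path = self.fname(sha1_hex)
        with open(path, "wb") as f:
            f.write(content)
        st = os.stat(path)
        self.table[sha1_hex] = (st.st_size, st.st_mtime_ns)

    def check_usable(self, sha1_hex):
        """(3): db-presence first, then stat-match."""
        row = self.table.get(sha1_hex)
        if row is None:
            return False  # not in the table: don't look at the file system
        try:
            st = os.stat(self.fname(sha1_hex))
        except FileNotFoundError:
            raise RuntimeError("pristine file missing; cleanup needed")
        if (st.st_size, st.st_mtime_ns) != row:
            raise RuntimeError("pristine file changed; cleanup needed")
        return True

def need(store, sha1_hex, fetch):
    if not store.check_usable(sha1_hex):
        store.install(sha1_hex, fetch(sha1_hex))  # (9)
    return open(store.fname(sha1_hex), "rb")      # (6)

text = b"pristine text\n"
key = hashlib.sha1(text).hexdigest()
fetch_calls = []

def fetch(sha1_hex):
    fetch_calls.append(sha1_hex)
    return text

with tempfile.TemporaryDirectory() as tmp:
    store = ToyPristineStore(tmp)
    with need(store, key, fetch) as stream:
        first = stream.read()   # fetched, then read
    with need(store, key, fetch) as stream:
        second = stream.read()  # already present, no fetch
    os.remove(store.fname(key))
    try:
        need(store, key, fetch)
        bailed = False
    except RuntimeError:
        bailed = True           # table says present, file is gone
```
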
>>
>> I think this is the most important case from a performance point of
>> view. This is what 'svn status' et al. use, and it's important for
>> GUIs as a lot of the "feel" depends on how fast a process can query
>> the metadata.
>
> Agreed.
>
>> If we were to do away with the PRISTINE table, then we would not have
>> to worry about it becoming a bottleneck. We don't need the existence
>> check if we are just about to open the file, since opening the file
>> proves that it exists.
>
> <rant>Yes, I meant that, semantically, there has to be an existence check.
> You're right that it is gotten for free from opening the file. It's still
> important to note where the antenna sits that detects non-existence.</rant>
>
>> We obviously have the checksum already, from
>> the BASE/WORKING table, so we only need the PRISTINE table for the
>> size/mtime. Perhaps we could store those in the BASE/WORKING table
>> and eliminate the PRISTINE table, or is this too much of a layering
>> violation? The pristine store is then just a sharded directory, into
>> which we move files and from which we read files.
>
> -1
>
> While we could store size&mtime in the BASE/WORKING tables, this causes size
> and mtime to be stored multiple times (wherever a pristine is referenced)
> and involves editing multiple entries when a pristine is removed/added due
> to high-water-mark or repair. That would be nothing less than horrible.
> Taking one step away from that, each working copy should have a dedicated
> table that stores size and mtime only once. Then we still face the situation
> that size and mtime are stored multiple times (once per working copy), and
> where, if a central pristine store is restructured, every working copy has
> to be updated. Bad idea.
>
> Instead, we could not store size and mtime at all! :)

A big BUT is that we also need to store and send the MD5 checksum for
backwards compatibility with older servers/clients. So we'll definitely need
a database until 2.0, because of the MD5 compat alone.
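
For illustration, both checksums (and the size) can be gathered in one pass over the content when a pristine is installed. A small Python sketch, with a hypothetical helper name:

```python
import hashlib
import io

def sha1_and_md5(stream, block_size=65536):
    """Return (sha1_hex, md5_hex, size) from a single read of the stream."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    size = 0
    for block in iter(lambda: stream.read(block_size), b""):
        sha1.update(block)
        md5.update(block)
        size += len(block)
    return sha1.hexdigest(), md5.hexdigest(), size

# The values a (checksum, md5_checksum, size, ...) row would carry.
row = sha1_and_md5(io.BytesIO(b"some pristine text\n"))
```
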

We also currently have a 'compressed' flag stored, which allows optionally
compressing pristines. I think it's debatable if that is really useful. The
pristine store should be *fast* and, ideally, random-access-able. Opening a
decompression stream is kind of at odds with that; it's optimising for disk space,
and that's inherently not what the pristine store is for. I'd lose it.

~Neels

>
> They are merely half-checks for validity. During normal operation, size and
> mtime should never change, because we don't open write streams to pristines.
> If anyone messes with the pristine store accidentally, we would pick it up
> with the size, or if that stayed the same, with the mtime. But we can pick
> up all cases of bitswaps/disk failure *only* by verifying *full checksum
> validity*!
>
> So, while checking size and mtime gives a sense of basic sanity, it is
> really just a puny excuse for not checking full checksum validity. If we
> really care about correctness of pristines, *every* read of a pristine
> should verify the checksum along the way. (That would include always reading
> the complete pristine, even if just a few lines in the middle are needed.)
>
> * neels dreams of disks that hardware-checksum on-the-fly
>
> If I further follow my dream of us emulating such hardware, we would store
> checksums for sub-chunks of each pristine, so that we can read small
> sections of pristines, being sure that the given section is correct without
> having to read the whole pristine.
>
> Whoa, look where you got me now! ;)
>
> I think it's a very valid question. Chuck the mtime and size, thus get rid
> of the PRISTINE table, thus do away with checking for any inconsistency
> between table and file system, also do away with possible database
> bottlenecks, and reduce the location of the pristine store to a mere local
> abspath. We have the checksum, we have the filename. Checking mtime and
> length protects against accidental editing of the pristine files. But any
> malicious or hw-failure corruption can in fact be *protected* by keeping
> mtime and length intact! ("hey, we checked it, it must be correct.")
>
> Let's play through a corrupted pristine (with unchanged mtime/length). This
> is just theoretical...
>
> Commit modification:
>
> - User makes a checkout / revert / update that uses a locally
> corrupted pristine. The corrupted pristine thus sits in the WC.
>
> - User makes a text mod
>
> - User commits
>
> - Client/network layer communicate the *delta* between the local pristine
> and the local mod to the repository, and the checksum of the modified
> text.
>
> - Repos applies the delta to the intact pristine it has in *its* store.
>
> - Repos finds the resulting checksum to be *different* from the client's
> checksum, because the underlying pristine was corrupt.
>
> --> Yay! No need to do *ANY* local verification at all!!
>
> Of course, in case the client/network layer decide to send the full text
> instead of a delta, the corruption is no longer detected. :(
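
The commit scenario can be played through with a toy delta format. This Python sketch uses difflib as a stand-in for the real delta encoding (an assumption, this is not svndiff), and the function names are hypothetical:

```python
import difflib
import hashlib

def make_delta(base, target):
    """Describe target as copies from base plus literal new data."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops

def apply_delta(base, ops):
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

intact = b"line one\nline two\nline three\n"   # what the repository has
corrupt = intact.replace(b"two", b"twq")        # client's damaged pristine
working = corrupt + b"line four\n"              # checkout plus a text mod

delta = make_delta(corrupt, working)            # client side
client_sha1 = hashlib.sha1(working).hexdigest()

server_text = apply_delta(intact, delta)        # repository side
detected = hashlib.sha1(server_text).hexdigest() != client_sha1

# Fulltext instead of delta: the checksum matches what was sent.
fulltext_detected = hashlib.sha1(working).hexdigest() != client_sha1
```

Here the delta copies the damaged region from the base, so the repository reconstructs a different text and the checksums disagree. In the fulltext case nothing is flagged.
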
>
>
> Merge and commit:
>
> - User makes a merge that uses a locally corrupted pristine.
>
> - The merge *delta* applied to the working copy is incorrect.
>
> - User does not note the corruption (e.g. via --accept=mine-full)
>
> - User commits
>
> - Repos accepts the changes based on the corrupted pristine that was
> used to get the merge delta, because it can't tell the difference
> from a normal modification.
>
> --> My goodness, merge needs to check pristine validity on each read,
> as if it wasn't slow enough. But as discussed above, even if merge
> checked mtime and length, it would not necessarily detect disk failure
> and crafted malicious corruption.
>
>
> Thanks, Philip.
>
> I'm now challenging the need to store mtime and length, and suggesting we do
> more checksumming instead. The checksumming overhead could be smaller than
> the slowdown from a database bottleneck.
>
> For future optimisation, I'm also suggesting that checksums for small chunks
> of each pristine be stored additionally, while the pristine is still
> indexed by the full checksum.
> (Which may imply a db again :/ , but that db would only be hit if we're
> trying to save time by reading just a small bit of the pristine)
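
A rough Python sketch of the chunk idea. The chunk size, the helper names and keeping the per-chunk checksums in a plain list are all assumptions; where they would really be stored is the open question:

```python
import hashlib
import io

CHUNK = 4096  # assumed chunk size

def chunk_checksums(data, chunk=CHUNK):
    """SHA1 of every chunk, in order; the last chunk may be short."""
    return [hashlib.sha1(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]

def read_range_verified(stream, sums, offset, length, chunk=CHUNK):
    """Read [offset, offset+length), verifying only the chunks it touches."""
    if length <= 0:
        return b""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    stream.seek(first * chunk)
    out = bytearray()
    for idx in range(first, last + 1):
        block = stream.read(chunk)
        if idx >= len(sums) or hashlib.sha1(block).hexdigest() != sums[idx]:
            raise OSError("chunk %d failed verification" % idx)
        out += block
    start = offset - first * chunk
    return bytes(out[start:start + length])

data = bytes(range(256)) * 64            # 16384 bytes, four chunks
sums = chunk_checksums(data)
section = read_range_verified(io.BytesIO(data), sums, 5000, 100)

damaged = bytearray(data)
damaged[6000] ^= 0xFF                    # flip bits inside the second chunk
damaged = bytes(damaged)
untouched = read_range_verified(io.BytesIO(damaged), sums, 9000, 100)
```

A read only pays for the chunks it overlaps. Damage in other chunks goes unnoticed by that read, which is the trade-off.
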
>
> Everyone, please prove me wrong!
>
> Thanks,
> ~Neels
>

Received on 2010-02-16 15:24:55 CET