Neels J Hofmeyr <neels_at_elego.de> writes:
> THE PRISTINE STORE
> ==================
>
> The pristine store is a local cache of complete content of files that are
> known to be in the repository. It is hashed by a checksum of that content
> (SHA1).
I'm not sure whether you are planning one table per pristine store or
one table per working copy, but I think it's one per pristine store.
Obviously it makes no difference until pristine stores can be
shared (and it might be one per directory in the short term, depending
on when we stop having one database per directory).
> SOME IMPLEMENTATION INSIGHTS
> ============================
>
> There is a PRISTINE table in the SQLite database with columns
> (checksum, md5_checksum, size, refcount)
>
> The pristine contents are stored in the local filesystem in a pristine file,
> which may or may not be compressed (opaquely hidden behind the pristines API).
> The goal is to be able to have a pristine store per working copy, per user as
> well as system-wide, and to configure each working copy as to which pristine
> store(s) it should use for reading/writing.
>
> There is a canonical way of getting a given CHECKSUM's pristine file name for
> a given working copy without contacting the WC database (static function
> get_pristine_fname()).
>
> When interacting with the pristine store, we want to, as appropriate, check
> for (combos of):
> db-presence - presence in the PRISTINE table with noted file size > 0
> file-presence - pristine file presence
> stat-match - PRISTINE table's size and mtime match file system
> checksum-match - validity of data in the file against the checksum
>
> file-presence is gotten for free from a successful stat-match (fstat),
> checksum-match (fopen) and unchecked read of the file (fopen).
>
> How fast we consider things:
> db-presence - very fast to moderately fast (in case of "empty db cache")
> file-presence - slow (fstat or fopen)
> stat-match - slow (fstat plus SQLite query)
> checksum-match - super slow (reading, checksumming)
I'm prepared to believe a database query can be faster than stat when
the inode cache is cold, but what about when the inode cache is hot?
If the database query requires even one system call then it could well
be slower. Multiple processes accessing a working copy, or writing to
the pristine store, might bias this further towards stat being faster.
If we decide to share the pristine store between several working
copies then a shared database could become a bottleneck.
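One way to put rough numbers on the hot-cache case is a small timing harness like the one below. It is only a sketch: the table layout, file names and the function name time_stat_vs_query are made up for illustration, and Python's interpreter overhead will blur the difference that C code would see. It times repeated stat calls on one file against repeated single-row SQLite lookups, both with everything already cached.

```python
import os
import sqlite3
import tempfile
import timeit


def time_stat_vs_query(n=1000):
    """Return (seconds for n stat calls, seconds for n SQLite lookups)."""
    tmpdir = tempfile.mkdtemp()

    # A stand-in pristine file; after the first stat its inode is cached.
    path = os.path.join(tmpdir, "pristine")
    with open(path, "wb") as f:
        f.write(b"x" * 100)

    # A stand-in PRISTINE table (hypothetical layout, not the real schema).
    db = sqlite3.connect(os.path.join(tmpdir, "wc.db"))
    db.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY, size INTEGER)")
    db.execute("INSERT INTO pristine VALUES ('abc', 100)")
    db.commit()

    def do_stat():
        return os.stat(path)

    def do_query():
        cur = db.execute("SELECT size FROM pristine WHERE checksum = 'abc'")
        return cur.fetchone()

    t_stat = timeit.timeit(do_stat, number=n)
    t_query = timeit.timeit(do_query, number=n)
    db.close()
    return t_stat, t_query


if __name__ == "__main__":
    t_stat, t_query = time_stat_vs_query()
    print("stat:  %.6f s" % t_stat)
    print("query: %.6f s" % t_query)
```

The result depends on the platform, the filesystem and on how many processes contend for the database, so it is a starting point for measurement rather than an answer.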
[...]
> Use case "need": "I want to use this pristine's content, definitely."
> ---------------
> pseudocode:
>   pristine_check(&present, checksum, _usable)       (3)
>   if !present:
>     get_pristine_from_repos(checksum, ra)           (9)
>   pristine_read(&stream, checksum)                  (6)
>
> (3) check for _usable:
>     - db-presence
>       - if the checksum is not present in the table, return that it is not
>         present (don't check for file existence as well).
>     - stat-match (includes file-presence)
>       - if the checksum is present in the table but file is bad/not there,
>         bail, asking user to 'svn cleanup --pristines' (or sth.)
>
> (9) See use case "fetch". After this, either the pristine file is ready for
> reading, or "fetch" has bailed already.
>
> (6) fopen()
I think this is the most important case from a performance point of
view. This is what 'svn status' et al. use, and it's important for
GUIs as a lot of the "feel" depends on how fast a process can query
the metadata.
If we were to do away with the PRISTINE table, then we would not have
to worry about it becoming a bottleneck. We don't need the existence
check if we are just about to open the file, since opening the file
proves that it exists. We obviously have the checksum already, from
the BASE/WORKING table, so we only need the PRISTINE table for the
size/mtime. Perhaps we could store those in the BASE/WORKING table
and eliminate the PRISTINE table, or is this too much of a layering
violation? The pristine store is then just a sharded directory, into
which we move files and from which we read files.
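A minimal sketch of such a table-less store, in Python for brevity. The function names, the two-hex-digit sharding and the optional size argument (standing in for a size recorded in BASE/WORKING) are all assumptions for illustration, not existing Subversion API.

```python
import hashlib
import os
import tempfile


def pristine_path(store_root, sha1_hex):
    # Shard on the first two hex digits so no single directory grows huge.
    return os.path.join(store_root, sha1_hex[:2], sha1_hex)


def pristine_install(store_root, data):
    """Write data into the store and return its SHA1 hex digest."""
    sha1_hex = hashlib.sha1(data).hexdigest()
    final = pristine_path(store_root, sha1_hex)
    shard_dir = os.path.dirname(final)
    os.makedirs(shard_dir, exist_ok=True)
    # Write to a temporary file in the same directory, then move it into
    # place, so readers never see a partially written pristine.
    fd, tmp = tempfile.mkstemp(dir=shard_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, final)
    return sha1_hex


def pristine_read(store_root, sha1_hex, size=None):
    """Open a pristine for reading; the open doubles as the existence check.

    If size is given, compare it against fstat on the open file.
    """
    # Raises FileNotFoundError if the pristine is absent.
    f = open(pristine_path(store_root, sha1_hex), "rb")
    if size is not None and os.fstat(f.fileno()).st_size != size:
        f.close()
        raise ValueError("pristine size mismatch for " + sha1_hex)
    return f
```

Here a missing pristine surfaces as a failed open rather than a failed table lookup, and the only per-file metadata the caller needs is the checksum and size it already holds.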
--
Philip
Received on 2010-02-15 21:06:31 CET