
pristine store design

From: Neels J Hofmeyr <neels_at_elego.de>
Date: Mon, 15 Feb 2010 14:45:46 +0100

Hi all,

taking stock of the current state of the pristine store API and finding
design docs missing, I have created a "design paper" to clarify.

I'd be grateful if you could glance over it and straighten out my picture
where necessary. Upon approval, I'll check it into notes/ so we can edit.

Note, if my view is correct, this design text implies small changes to the
current state of the API:
- no need for _pristine_write()
- need to add _pristine_forget()
- change the _pristine_checkmode_t enum.

Still missing completely: how to handle a "high-water mark", i.e. how to
determine which pristines get forgotten first.

Thanks!
~Neels

THE PRISTINE STORE
==================

The pristine store is a local cache of complete content of files that are
known to be in the repository. It is hashed by a checksum of that content
(SHA1).

SOME IMPLEMENTATION INSIGHTS
============================

There is a PRISTINE table in the SQLite database with columns
 (checksum, md5_checksum, size, refcount)

The pristine contents are stored in the local filesystem in a pristine file,
which may or may not be compressed (opaquely hidden behind the pristines API).
The goal is to be able to have a pristine store per working copy, per user as
well as system-wide, and to configure each working copy as to which pristine
store(s) it should use for reading/writing.
  
There is a canonical way of getting a given CHECKSUM's pristine file name for
a given working copy without contacting the WC database (static function
get_pristine_fname()).
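
As a rough illustration of such a database-free lookup, the sketch below
derives a pristine file path purely from the checksum. The directory layout
(a "pristine" subdirectory sharded by the first two hex digits) and the names
sha1_hex and pristine_fname are assumptions made for this example; they are
not what get_pristine_fname() actually does.

```python
import hashlib
import os


def sha1_hex(data: bytes) -> str:
    """Return the SHA1 checksum of DATA as a lowercase hex string."""
    return hashlib.sha1(data).hexdigest()


def pristine_fname(store_root: str, checksum_hex: str) -> str:
    """Map a checksum to a file path without touching any database.

    Assumed layout: <store_root>/pristine/<first two hex digits>/<checksum>
    """
    if len(checksum_hex) != 40:
        raise ValueError("expected a 40-digit SHA1 hex checksum")
    return os.path.join(store_root, "pristine", checksum_hex[:2], checksum_hex)
```

Because the path is a pure function of the checksum, any caller that knows
the checksum can stat or open the file directly.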

When interacting with the pristine store, we want to, as appropriate, check
for (combos of):
  db-presence - presence in the PRISTINE table with noted file size > 0
  file-presence - pristine file presence
  stat-match - PRISTINE table's size and mtime match file system
  checksum-match - validity of data in the file against the checksum

file-presence comes for free with a successful stat-match (fstat),
checksum-match (fopen) or unchecked read of the file (fopen).

How fast we consider things:
  db-presence - very fast to moderately fast (in case of "empty db cache")
  file-presence - slow (fstat or fopen)
  stat-match - slow (fstat plus SQLite query)
  checksum-match - super slow (reading, checksumming)

On the method of writing pristine files:
  In short: *ALWAYS* copy to pristine temp and checksum along with that, then
  'mv' into place.

  To avoid incomplete pristines, we never want to have a write stream to a
  pristine file location. Instead, we write to a tempfile on the same
  filesystem device as the pristine store, after which a filesystem 'mv' puts
  it into place and a row is stored in the database. That way, a pristine file
  is either intact or doesn't exist (unless corrupted by alien ray guns).
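
A minimal sketch of this copy-checksum-then-'mv' pattern follows. It assumes
the tempfile may live directly inside the store directory (which guarantees
the same device) and that the final file name is simply the SHA1 hex digest;
write_pristine is a hypothetical helper, and the database row is left out.

```python
import hashlib
import os
import tempfile


def write_pristine(src, store_dir):
    """Copy the binary stream SRC into STORE_DIR under its SHA1 name.

    The data goes to a tempfile inside STORE_DIR, is checksummed along the
    way, and is then renamed into place, so no partially written file is
    ever visible under the final name.
    """
    sha1 = hashlib.sha1()
    fd, tmp_path = tempfile.mkstemp(dir=store_dir, prefix="tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in iter(lambda: src.read(65536), b""):
                sha1.update(chunk)
                tmp.write(chunk)
        final_path = os.path.join(store_dir, sha1.hexdigest())
        os.replace(tmp_path, final_path)  # the 'mv'; overwrites an old file
    except BaseException:
        # Never leave a stray tempfile behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return sha1.hexdigest(), final_path
```

The rename only replaces the directory entry, so readers see either the old
state or the complete new file.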
  
  When fetching a new pristine from the repository, get a temporary-file
  location from the pristine API, receive and checksum into that file, then
  install. Typically, we have already checked whether the pristine is stored.

  When there is some new content (added/modified file in the working copy), a
  pristine with the same content may already exist "coincidentally" -- that
  may be common in certain use scenarios, but is generally less common.
  Discussing optimal ways in different situations:
  - New file is *not* a temporary file ('svn add'/'commit', we are not allowed
    to 'mv' the file away), and a pristine with the same checksum and content
    as this file does *not* exist yet: The file needs to be checksummed and
    copied to the pristine store.
    --> write to pristine-tempfile, checksum along the way, install.
  - New file is *not* a temporary file ('svn add'/'commit'), but a pristine
    with the same content already exists. To catch this case, we would need to
    checksum the file *before* copying to the tempfile and then skip the copy
    if a match is found. BUT the file could change between us checksumming it
    in place, finding no match, and then copying! We should copy to a
    tempfile first anyway.
  - New file *is* a tempfile (we may 'mv' it away) and happens to be on the
    same filesystem device as the pristine store. We could checksum in-place
    and call pristine_install() on the tempfile directly. *But* code, in this
    case, should already have chosen the pristine-store tempfile location to
    begin with.
  So, it does not make sense to optimise here.
  (Depending on hardware, copying across devices can be faster or slower
  than copying within the same device. Can't optimise on that.)

On overwriting orphaned pristine files that have no database entry:
  "Maybe the user saved his favourite aunt's final poetic words in a file that
  has a name that is this checksum and which ended up in the pristine store
  accidentally ..."
  This is super highly unlikely, given that the file name would have to match
  the checksum. We don't really need to play safe.
  But, if we find a pristine file where the db didn't expect one, it may
  sometimes be faster to checksum that file and find it intact than to
  overwrite with data possibly received from a repository.
  But again, that implies that something must have gone wrong (interrupted at
  just the right time/bug/db corruption) and we don't need to optimise for
  that.
  So, when storing a pristine, we can simply `mv` over an existing checksum
  file in the pristine store, not bothering to even check if it exists.

USE CASES
=========

Pseudocode implies args for reference to working copy, pool etc. as needed.
pristine_* = svn_wc__db_pristine_*
_usable = svn_wc__db_checkmode_usable, etc.

Use case "new": "I have locally-created content which should be stored
-------------- as a pristine."
Another pristine with identical content may exist coincidentally, in which
case we technically don't need to write the pristine at all. But, as discussed
in "On the method of writing pristine files" above, we should copy the new
content to a tempfile anyway to make sure nothing changes it between
checksumming and copying. So, this is exactly the same as use case "store"!

Use case "fetch": "I'm in one of the get_pristine_from_repos() places (9).
---------------- I want to get a pristine from the repository and store it."
This will contact the repos to read a revision's content. Even though the
repos provides a checksum for the content, we still need to validate that
checksum on our end.
To get the data from the repos, employ use case "store" below.
(We don't need to checksum on our end before storing the pristine;
checksumming is included in use case "store".)
 

   
Use case "store": "I am going to receive data that should be stored
---------------- as a new pristine!"
We are going to receive data which we will checksum (to validate, or even to
determine the checksum in the first place), after which we want to place it in
the pristine file. We want to write to a temporary file, which should be on
the same file system as the pristine store, to ensure a fast mv into place
once we are sure the file is complete and checksummed.
We want to write the PRISTINE table row after the pristine file is in place.
(The API should provide a temporary dir, callers can use e.g.
svn_stream_open_unique() to get a temporary file inside it).
 
pseudocode:
 pristine_get_tempdir(&tempdir) (1)

 svn_stream_open_unique(&tempstream, &tempfile_path, tempdir)
 tempstream = svn_stream_checksummed2(tempstream, &write_checksum)
 write to and close tempstream

 if a_priori_checksum && a_priori_checksum != write_checksum:
   bail: data corruption during copy
 
 if paranoid_beyond_recognition:
   pristine_check(&present, write_checksum, _usable)
   if present:
     pristine_read(&stream, write_checksum)
     read(&my_content, tempfile_path)
     compare(&same, my_content, stream)
     if !same:
       bail: hash collision o_O
 
 pristine_install(tempfile_path, write_checksum) (2)

(1) Get a temporary folder on the same file system device as the pristine
    store used by the given working copy.

(2) Take a temporary file (in the temp folder from (1)) and filesystem-'mv' it
    into the proper pristine file location, then create a database row in the
    PRISTINE table.
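
Step (2) might look roughly like the sketch below. It is a Python stand-in
under stated assumptions, not the svn_wc__db_pristine_install() C API: the
two-column schema is a simplification of the PRISTINE table, the store is a
flat directory keyed by checksum, and the db and store_dir parameters are
made up for the example.

```python
import os
import sqlite3

# Assumed, simplified schema; the real PRISTINE table has more columns.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS pristine "
    "(checksum TEXT PRIMARY KEY, size INTEGER)"
)


def pristine_install(db, store_dir, tempfile_path, checksum):
    """Move TEMPFILE_PATH into the store, then record it in the table."""
    final_path = os.path.join(store_dir, checksum)
    # The file goes into place first; os.replace() overwrites any orphan
    # that may already sit at this location.
    os.replace(tempfile_path, final_path)
    # Only now is the row written, so a row is never recorded before its
    # complete file is in place.
    db.execute(
        "INSERT OR REPLACE INTO pristine (checksum, size) VALUES (?, ?)",
        (checksum, os.path.getsize(final_path)),
    )
    db.commit()
    return final_path
```

If the process dies between the rename and the insert, the result is an
orphaned file without a row, which the next install simply overwrites.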

Use case "need": "I want to use this pristine's content, definitely."
---------------
pseudocode:
 pristine_check(&present, checksum, _usable) (3)
 if !present:
   get_pristine_from_repos(checksum, ra) (9)
 pristine_read(&stream, checksum) (6)

(3) check for _usable:
     - db-presence
       If the checksum is not present in the table, return that it is not
       present (don't check for file existence as well).
     - stat-match (includes file-presence)
       If the checksum is present in the table but file is bad/not there,
       bail, asking user to 'svn cleanup --pristines' (or similar).

(9) See use case "fetch". After this, either the pristine file is ready for
    reading, or "fetch" has bailed already.

(6) fopen()
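
The _usable check in (3) could be sketched as below. This is only an
illustration: it assumes a simplified table with (checksum, size) columns and
a flat store directory, compares size only (no mtime), and returns a BROKEN
value where the text above would bail. Usable and check_usable are
hypothetical names.

```python
import os
import sqlite3
from enum import Enum


class Usable(Enum):
    ABSENT = "absent"    # no row in the table: caller may fetch from repos
    PRESENT = "present"  # row exists and the file matches the recorded size
    BROKEN = "broken"    # row exists but the file is missing or mismatched


def check_usable(db, store_dir, checksum):
    """Check db-presence first, then stat-match against the recorded size."""
    row = db.execute(
        "SELECT size FROM pristine WHERE checksum = ?", (checksum,)
    ).fetchone()
    if row is None:
        return Usable.ABSENT  # not in the db: skip the file check entirely
    try:
        st = os.stat(os.path.join(store_dir, checksum))
    except FileNotFoundError:
        return Usable.BROKEN
    return Usable.PRESENT if st.st_size == row[0] else Usable.BROKEN
```

A caller in use case "need" would fetch on ABSENT, read on PRESENT and bail
on BROKEN.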

Use case "would use": "I'd use this pristine if it's here already,
-------------------- if not I still have other options."
pseudocode:
 pristine_check(&present, checksum, _usable) (3)
 if !present:
   pursue other options
 else:
   pristine_read(&stream, checksum) (6)

Use case "could use": "I have more complex options, and I want to get a
-------------------- *fast* count on how many of certain pristines I can
                       expect to exist. Does this one show?"
pseudocode:
 pristine_check(&present, checksum, _known) (4)

(4) ### Depending on which turns out to be faster:
    Only check db-presence
    ### OR
    Only check file-presence

Use case "routine check": "I want to do some effort to validate this pristine"
------------------------
pseudocode:
 pristine_check(NULL, checksum, _valid) (5)

(5) check for _valid:
    - stat-match (includes db-presence and file-presence)
    - checksum-match
    Bail if this checksum is in the db but the content is invalid/file is
    missing.
    If it is not known in the db in the first place, return *PRESENT as FALSE,
    but in this case we're not even interested (completely missing is also
    valid).
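
The checksum-match half of (5) is the expensive part, since the whole file
has to be read. A minimal sketch, assuming an uncompressed pristine file and
a SHA1 hex digest as the checksum (checksum_matches is a hypothetical
helper):

```python
import hashlib


def checksum_matches(path, expected_sha1_hex):
    """Re-read the file at PATH in chunks and compare its SHA1 digest."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == expected_sha1_hex
```

Reading in chunks keeps memory use flat regardless of the pristine's size.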

Use case "repair": "I've seen this pristine is børken. Restore or remove!"
-----------------
pseudocode:
 pristine_repair(&present, checksum) (7)
 [if !present:
    get_pristine_from_repos(checksum, ra) (9)
  pristine_read(&stream, checksum) (6)
 ]

(7) Iff the pristine file exists in the file system and the content matches
    the checksum, (re)write the PRISTINE table row with the checksum, mtime
    and size and return *PRESENT as TRUE.
    In all other cases, remove all of table row and file. We will fetch the
    pristine from the repos the next time it is needed. *PRESENT = FALSE.
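
A sketch of (7), again under assumptions: a simplified (checksum, size)
table with no mtime column, a flat store directory, and the whole file read
into memory for brevity. The Python signature is made up for the example.

```python
import hashlib
import os
import sqlite3


def pristine_repair(db, store_dir, checksum):
    """Return True if the pristine is intact and its row was (re)written."""
    path = os.path.join(store_dir, checksum)
    try:
        with open(path, "rb") as f:
            actual = hashlib.sha1(f.read()).hexdigest()
    except FileNotFoundError:
        actual = None

    if actual == checksum:
        # File is intact: (re)write the row from what is on disk.
        db.execute(
            "INSERT OR REPLACE INTO pristine (checksum, size) VALUES (?, ?)",
            (checksum, os.path.getsize(path)),
        )
        db.commit()
        return True

    # Missing or corrupt: drop both the row and the file.
    db.execute("DELETE FROM pristine WHERE checksum = ?", (checksum,))
    db.commit()
    if os.path.exists(path):
        os.unlink(path)
    return False
```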

Use case "remove": "I have lowered or hit the watermark or failed to repair.
----------------- This pristine has to go."
                   (or am moving pristines to another pristine store...)
pseudocode:
 pristine_forget(checksum) (8)

(8) Without bailing on non-existence, remove anything that exists about
    this checksum: PRISTINE table row, pristine file.
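
The "don't bail on non-existence" behaviour of (8) can be sketched as
follows, assuming a simplified table keyed by checksum and a flat store
directory (the db and store_dir parameters are not part of the API above):

```python
import os
import sqlite3


def pristine_forget(db, store_dir, checksum):
    """Remove the row and the file for CHECKSUM; missing pieces are fine."""
    # DELETE is a no-op when no such row exists.
    db.execute("DELETE FROM pristine WHERE checksum = ?", (checksum,))
    db.commit()
    try:
        os.unlink(os.path.join(store_dir, checksum))
    except FileNotFoundError:
        pass  # already gone, nothing to do
```

Calling it twice in a row is harmless, which suits cleanup after a failed
repair.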

API
===
Above use cases result in these API calls:

 _usable, _known, _valid: enum values. (3),(4),(5)

 pristine_get_tempdir(&tempdir) (1)
 pristine_install(tempfile, checksum) (2)
 pristine_check(&present, checksum, _usable) (3)
 pristine_check(&present, checksum, _known) (4)
 pristine_check(NULL, checksum, _valid) (5)
 pristine_read(&stream, checksum) (6)
 pristine_repair(&present, checksum) (7)
 pristine_forget(checksum) (8)

CENTRAL PRISTINE STORES
=======================
### TODO: How to configure (per WC and user), how to get write access for
multiple users while safeguarding against malicious corruption, whether to
have read-only pristine stores with a per-user writable store, whether to have
a chain of pristine stores that get asked for presence of a pristine, how to
record which working copies use which pristine store to get a valid refcount.

Received on 2010-02-15 14:46:31 CET

This is an archived mail posted to the Subversion Dev mailing list.
