RE: pristine store database -- was: pristine store design

From: Bert Huijben <bert_at_qqmail.nl>
Date: Tue, 16 Feb 2010 15:58:04 +0100

> -----Original Message-----
> From: Neels J Hofmeyr [mailto:neels_at_elego.de]
> Sent: dinsdag 16 februari 2010 14:55
> To: dev_at_subversion.apache.org
> Cc: Philip Martin; Greg Stein; Hyrum Wright
> Subject: pristine store database -- was: pristine store design
>
> Philip Martin wrote:
> > Neels J Hofmeyr <neels_at_elego.de> writes:
> >
> >> THE PRISTINE STORE
> >> ==================
> >>
> >> The pristine store is a local cache of complete content of files that
> >> are known to be in the repository. It is hashed by a checksum of that
> >> content (SHA1).
> >
> > I'm not sure whether you are planning one table per pristine store or
> > one table per working copy, but I think it's one per pristine store.
> > Obviously it makes no difference until pristine stores can be
> > shared (and it might be one per directory in the short term depending
> > on when we stop being one database per directory).
>
> Thanks for that. This is the tip of an iceberg called 'a pristine store
> does not equal a working copy [root]'.
>
> The question is how to store the PRISTINE table (also see below) once it
> serves various working copies. Will we have a separate SQLite db store,
> and create a new file system entity called 'pristine store' that the user
> can place anywhere, like a working copy?

The idea was not to move just the pristine store, but also wc.db to a
central location.

In this case the .svn directory will just have a 'look there' marker,
indicating where the wc database is. (Another option would be to just write
the location in the Subversion configuration, but this would make it very
hard to detect whether a directory is really a working copy or just a normal
directory.)

The wc.db database schema (as designed by Greg) was designed to handle
multiple working copies. (All wc related tables have a wc_id column,
indicating which working copy the record applies to).
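
To illustrate the wc_id idea, here is a minimal sketch in Python with
sqlite3. The table and column layout below is a simplified, hypothetical
stand-in and not the real wc.db schema; only the notion that every row
carries a wc_id comes from the paragraph above.

```python
import sqlite3

# Hypothetical, heavily simplified schema. The real wc.db has more tables
# and columns; only the "every row has a wc_id" idea is taken from the mail.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WCROOT (id INTEGER PRIMARY KEY, local_abspath TEXT UNIQUE);
    CREATE TABLE BASE_NODE (
        wc_id INTEGER NOT NULL REFERENCES WCROOT (id),
        local_relpath TEXT NOT NULL,
        checksum TEXT,
        PRIMARY KEY (wc_id, local_relpath)
    );
""")

def add_node(abspath, relpath, checksum):
    """Record a node under the working copy rooted at abspath."""
    conn.execute(
        "INSERT OR IGNORE INTO WCROOT (local_abspath) VALUES (?)", (abspath,))
    (wc_id,) = conn.execute(
        "SELECT id FROM WCROOT WHERE local_abspath = ?", (abspath,)).fetchone()
    conn.execute(
        "INSERT INTO BASE_NODE VALUES (?, ?, ?)", (wc_id, relpath, checksum))

def nodes_of(abspath):
    """List the relative paths recorded for one working copy."""
    rows = conn.execute(
        "SELECT b.local_relpath FROM BASE_NODE b "
        "JOIN WCROOT w ON w.id = b.wc_id "
        "WHERE w.local_abspath = ? ORDER BY b.local_relpath", (abspath,))
    return [r[0] for r in rows]

# Two working copies share one database; the same relpath can appear in both.
add_node("/home/a/wc1", "README", "sha1-aaa")
add_node("/home/a/wc2", "README", "sha1-aaa")
add_node("/home/a/wc2", "src/main.c", "sha1-bbb")
```

With such a layout, one central database file can serve several working
copies, and each query is simply restricted to the wc_id in question.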

This week some other mail talked about the performance characteristics of a
central database, but I don't think this is really relevant here.

The normal situation is that there is only one (single-threaded) operation
performing changes on a working copy at a time (and possibly multiple
readers). In this situation the performance of one open database with the
SQLite + filesystem caching support of an exclusively opened file should be
much better than deleting and recreating our own database files everywhere
through the filesystem. The old working copy used an exclusive per-directory
write lock to the same effect.

<snip>

> While we could store size&mtime in the BASE/WORKING tables, this causes
> size and mtime to be stored multiple times (wherever a pristine is
> referenced) and involves editing multiple entries when a pristine is
> removed/added due to high-water-mark or repair. That would be nothing less
> than horrible. Taking one step away from that, each working copy should
> have a dedicated table that stores size and mtime only once. Then we still
> face the situation that size and mtime are stored multiple times (once per
> working copy), and where, if a central pristine store is restructured,
> every working copy has to be updated. Bad idea.

The size&mtime in BASE_NODE and WORKING_NODE don't relate to pristine data,
but to the in-WC files. If a file's date and size haven't changed, 'svn
status' sees the file as unmodified.
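
As a rough illustration of that kind of check (a sketch in Python, not the
actual Subversion code; the function name and the recorded values are made
up for the example):

```python
import os
import tempfile

def looks_unmodified(path, recorded_size, recorded_mtime_ns):
    """Cheap check: True if size and mtime still match the recorded values.

    Hypothetical helper. It only stats the file and never reads its content.
    """
    st = os.stat(path)
    return st.st_size == recorded_size and st.st_mtime_ns == recorded_mtime_ns

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "file.txt")
    with open(path, "w") as f:
        f.write("hello\n")

    # Record size and mtime, as the node tables would for an in-WC file.
    st = os.stat(path)
    recorded = (st.st_size, st.st_mtime_ns)
    before = looks_unmodified(path, *recorded)

    # A local edit changes the size, so the cheap check no longer passes.
    with open(path, "a") as f:
        f.write("more\n")
    after = looks_unmodified(path, *recorded)
```

The point is that only a stat call is needed per file; the content has to
be looked at only when the recorded values no longer match.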

Note that the PRISTINE table currently doesn't have a mtime column. It does
have SIZE, MD5 and COMPRESSION columns; the MD5 column allows storing the
MD5 hash for communicating over editor-v1.0 (needed for Subversion 1.0-1.7
compatibility).

> Instead, we could not store size and mtime at all! :)

Or we could store both to perform simple consistency checks...

A query over checksum from BASE_NODE and WORKING_NODE + older_checksum,
left_checksum and right_checksum from ACTUAL_NODE would give a list of
all PRISTINE records that are in use.
The rest can safely be deleted if disk space is required.
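
A minimal sketch of such a query, using Python's sqlite3 against toy tables.
The table and column names follow the ones mentioned above, but the tables
are cut down to the bare minimum and the sample rows are invented for the
example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the wc.db tables (assumed layout).
    CREATE TABLE PRISTINE (checksum TEXT PRIMARY KEY, size INTEGER);
    CREATE TABLE BASE_NODE (local_relpath TEXT, checksum TEXT);
    CREATE TABLE WORKING_NODE (local_relpath TEXT, checksum TEXT);
    CREATE TABLE ACTUAL_NODE (local_relpath TEXT, older_checksum TEXT,
                              left_checksum TEXT, right_checksum TEXT);

    INSERT INTO PRISTINE VALUES
        ('aaa', 10), ('bbb', 20), ('ccc', 30), ('ddd', 40);
    INSERT INTO BASE_NODE VALUES ('f1', 'aaa');
    INSERT INTO WORKING_NODE VALUES ('f2', 'bbb');
    INSERT INTO ACTUAL_NODE VALUES ('f3', NULL, 'ccc', NULL);
""")

# Every checksum that some node still refers to. NULLs are filtered out so
# that the NOT IN below behaves as expected.
IN_USE = """
    SELECT checksum FROM BASE_NODE WHERE checksum IS NOT NULL
    UNION SELECT checksum FROM WORKING_NODE WHERE checksum IS NOT NULL
    UNION SELECT older_checksum FROM ACTUAL_NODE
          WHERE older_checksum IS NOT NULL
    UNION SELECT left_checksum FROM ACTUAL_NODE
          WHERE left_checksum IS NOT NULL
    UNION SELECT right_checksum FROM ACTUAL_NODE
          WHERE right_checksum IS NOT NULL
"""

def unreferenced_pristines(conn):
    """Return checksums of PRISTINE rows that no node refers to."""
    rows = conn.execute(
        "SELECT checksum FROM PRISTINE WHERE checksum NOT IN ("
        + IN_USE + ") ORDER BY checksum")
    return [r[0] for r in rows]

unused = unreferenced_pristines(conn)
```

Here only 'ddd' is unreferenced, so it is the only candidate for deletion.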

> They are merely half-checks for validity. During normal operation, size
> and mtime should never change, because we don't open write streams to
> pristines.
> If anyone messes with the pristine store accidentally, we would pick it up
> with the size, or if that stayed the same, with the mtime. But we can pick
> up all cases of bitswaps/disk failure *only* by verifying *full checksum
> validity*!

Good luck verifying 20 files with a total of 32 GB of data over a LAN :)

Well, working over a LAN is not a design requirement for WC-NG, but a lot of
our users use NFS or CIFS... And checking via fstat is a lot cheaper than
reading all data.

> So, while checking size and mtime gives a sense of basic sanity, it is
> really just a puny excuse for not checking full checksum validity. If we
> really care about correctness of pristines, *every* read of a pristine
> should verify the checksum along the way. (That would include to always
> read the complete pristine, even if just a few lines along the middle are
> needed)
>
> * neels dreams of disks that hardware-checksum on-the-fly
>
> If I further follow my dream of us emulating such hardware, we would store
> checksums for sub-chunks of each pristine, so that we can read small
> sections of pristines, being sure that the given section is correct
> without having to read the whole pristine.
>
> Whoa, look where you got me now! ;)
>
> I think it's a very valid question. Chuck the mtime and size, thus get rid
> of the PRISTINE table, thus do away with checking for any inconsistency
> between table and file system, also do away with possible database
> bottlenecks, and reduce the location of the pristine store to a mere local
> abspath. We have the checksum, we have the filename. Checking mtime and
> length protects against accidental editing of the pristine files. But any
> malicious or hw-failure corruption can in fact be *protected* by keeping
> mtime and length intact! ("hey, we checked it, it must be correct.")

That would still leave MD5. Having a checksum before processing the file is
a common use-case in editor-1.0. And calculating a checksum over a really
large file and then reading it again for the operation is not cheap.

Another reason to keep this record around is supporting compression. In that
case we don't have the size by a cheap fstat anymore.

> Let's play through a corrupted pristine (with unchanged mtime/length).
> This is just theoretical...
>
> Commit modification:
>
> - User makes a checkout / revert / update that uses a locally
> corrupted pristine. The corrupted pristine thus sits in the WC.
>
> - User makes a text mod
>
> - User commits
>
> - Client/network layer communicate the *delta* between the local pristine
> and the local mod to the repository, and the checksum of the modified
> text.
>
> - Repos applies the delta to the intact pristine it has in *its* store.
>
> - Repos finds the resulting checksum to be *different* from the client's
> checksum, because the underlying pristine was corrupt.
>
> --> Yay! No need to do *ANY* local verification at all!!
>
> Of course, in case the client/network layer decide to send the full text
> instead of a delta, the corruption is no longer detected. :(
>
>
> Merge and commit:
>
> - User makes a merge that uses a locally corrupted pristine.

This is cheap to detect. If we have to read the file anyway, we can just
recalculate the hash via the stream hash APIs. The disk IO for reading the
file is the real pain with current CPUs. (At least for SHA1/MD5. SHA256 is
still painful in my tests.)
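
The idea of hashing in the same pass as the read can be sketched like this.
This is a Python analogue with made-up names (ChecksummingReader,
read_and_verify), not the Subversion stream API itself:

```python
import hashlib
import io

class ChecksummingReader:
    """Wrap a binary stream and hash every byte that passes through read()."""

    def __init__(self, stream, algorithm="sha1"):
        self._stream = stream
        self._hash = hashlib.new(algorithm)

    def read(self, size=-1):
        data = self._stream.read(size)
        self._hash.update(data)
        return data

    def hexdigest(self):
        return self._hash.hexdigest()

def read_and_verify(stream, expected_sha1, chunk_size=65536):
    """Read the whole stream once; return (content, checksum_matches)."""
    reader = ChecksummingReader(stream)
    chunks = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks), reader.hexdigest() == expected_sha1

content = b"pristine text\n"
expected = hashlib.sha1(content).hexdigest()

# Intact pristine: the single read pass also confirms the checksum.
data, ok = read_and_verify(io.BytesIO(content), expected)

# Corrupted pristine of the same length: the mismatch shows up at the end
# of the same pass, without a second scan of the file.
_, ok_corrupt = read_and_verify(io.BytesIO(b"pristine t3xt\n"), expected)
```

The consumer sees the data as usual; the verification result is available
once the stream has been read to the end.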

> - The merge *delta* applied to the working copy is incorrect.
>
> - User does not note the corruption (e.g. via --accept=mine-full)

You would probably see our binary diff fail before you get here, and if the
RA layer provides a hash for the result or input file you can use that for
verification on the streams. (See the update editor for some examples where
we do that in one pass since 1.6.)

> - User commits
>
> - Repos accepts the changes based on the corrupted pristine that was
> used to get the merge delta, because it can't tell the difference
> from a normal modification.
>
> --> My goodness, merge needs to check pristine validity on each read,
> as if it wasn't slow enough. But as discussed above, even if merge
> checked mtime and length, it would not necessarily detect disk failure
> and crafted malicious corruption.

I don't know if we check .svn-base files for merges now. I know we do that
for 'svn update'. Before 1.6 introduced the stream checksum APIs, this was
an additional scan, but that is not necessary anymore.

> Thanks, Philip.
>
> I'm now challenging the need to store mtime and length, and a need to do
> more checksumming instead. The checksumming overhead could be smaller
> than the database bottleneck slew.

I assume that for most small working copies a normal wc.db is read in full
by the filesystem cache and the hard disk's read-ahead buffer. (While a hard
drive spins around, it usually reads the next few sectors after the read
operation into its internal buffer if it has nothing better to do.)

As we usually look at wc-file statuses at the same time, this file should be
pretty hot in the cache (and the strict locking of the db file by SQLite
even allows caching portions of this file over network connections).

Bert

> For future optimisation, I'm also suggesting pristines should have
> additionally stored checksums for small chunks of each pristine, while
> still being indexed by the full checksum.
> (Which may imply a db again :/ , but that db would only be hit if we're
> trying to save time by reading just a small bit of the pristine)
>
> Everyone, please prove me wrong!
>
> Thanks,
> ~Neels
Received on 2010-02-16 15:58:45 CET
