Re: pristine store design

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Wed, 03 Mar 2010 12:05:26 +0000

On Wed, 2010-03-03, Neels J Hofmeyr wrote:
> Julian Foad wrote:
> > On Thu, 2010-02-18, Neels J Hofmeyr wrote:
> >> Great, moving forward fast on pristine design questions!
> >
> > Hi Neels.
> >
> > Did you start working the new knowledge into a document? Lots of stuff
> > was said in this thread and it would be useful to see where we are at.
>
> thanks for poking :)

No problem. I'm glad you're working on this.

Please go ahead with this pristine store. Delete the "_write" interface.
Implement the path-based ones; at least they're already declared. It's
really important to get it moving.

I want us to be able to move on to bigger things like considering
whether to build a "pristine whole nodes" store on top of the "pristine
texts" store, because I think we should be able to reference a whole
node (especially a tree-conflict old/theirs/mine node) with a single
handle, whether that handle be a checksum or a "repos:rev:path" triple.

I continue to argue below why I think a stream-based interface is better
than path-based, but the choice of "write" method is not a critical
decision, as long as it stays within libsvn_wc. I'm not keen to spend
more time on that, now that I've looked and as far as I can tell it
doesn't need to be exposed outside libsvn_wc.

> > I have a couple of comments.
> >
> >
> > THE PRISTINE-WRITE API
> >
> > I was thinking about the "write" API and how an API designed around a
> > stream is surely better than one designed around "tell me the path where
> > I can put a file".
> >
> > The objection to "give me a writable stream and I'll write to it" was
> > that the stream close handler wouldn't know whether the stream was
> > closed after a successful write or because of an error. We can rectify
> > this by adding a second step: require the caller to call a special
> > "commit" API to close the stream on success, and any other way of
> > closing the stream would abandon the new content.
> >
> > // Caller passes the checksum if it knows it, or passes NULL and
> > // the store in that case will calculate the checksum.
> > stream = pristine_start_new_file(expected_checksum);
> >
> > [caller writes to stream]
> >
> > // Now the store commits the new text, verifies the checksum
> > // (if it was given an expected one) and returns it.
> > new_checksum = pristine_close_new_file(stream);
>
> Let's also look at the advantages of the current design of
> svn_wc__db_pristine_get_tempdir() and _install():
>
> pseudocode:
> [[[
> pristine_get_tempdir(&tempdir)
>
> /* Make tempfile */
> svn_stream_open_unique(&tempstream, &tempfile_path, tempdir)
>
> /* Write and checksum */
> tempstream = svn_stream_checksummed2(tempstream, &write_checksum)
> write to and close tempstream
>
> pristine_install(tempfile_path, write_checksum)
> ]]]
>
> When there is only a write-stream available, some options get lost, while
> _get_tempdir() and _install() offer the whole palette of possibilities. For
> convenience, we can offer a wrapper for the first three instructions above.
> But additionally, we can offer the possibility to determine a file location
> *on the same file system device* as the pristine store. This is useful when
> we can provide some external tool or specialized function X with a
> destination path in advance. We create a tempfile in pristine_temp_dir() and
> pass that location to X. X writes or OS-copies or <whatever>s to the final
> device, possibly saving one copy-across-fs-devices operation per pristine.

Sure. That's my "case (4)" of the cases I enumerated. But I think it's
a less important case. See my investigation of "actual uses" below,
near the end of this email.

> Also, X may already have a checksum tee in its implementation, obsoleting
> the need to calculate again. In that case it's nice that pristine_install()
> doesn't care where the checksum came from.

Sure, but, for the cases where X writes to a stream, a stream-based API
(slightly different from how I suggested) could do that too.

> (Also, if we are going to
> validate the checksum on "every" read, there is no need to be paranoid about
> correctness of the checksum when writing -- we anyway have to "expect"
> corruptions to occur after any successful write.)
>
> Note that your proposed usage is a subset of the usage provided by
> _get_temp_dir() and _install(). Your way does KISS some API complexity
> goodbye when thinking in human terms, but the OS will find the
> _get_temp_dir() and _install() simpler in some cases :P

Yes, but also the stream way allows the pristine store to grow new
features transparently - see "disadvantages" below.

> The simpler usage of above pseudocode would be:
>
> pseudocode:
> [[[
> /* convenience function provided by API or whomever. */
> svn_error_t *
> pristine_get_write_stream(stream_t **write_stream,
> svn_checksum_t *write_checksum,
> const char **temp_path)
> {
> pristine_get_tempdir(&tempdir)
>
> /* Make tempfile */
> svn_stream_open_unique(&tempstream, &tempfile_path, tempdir)
>
> /* Write checksum to the checksum location provided by caller */
> *write_stream = svn_stream_checksummed2(tempstream, write_checksum)
> }
>
> ...
>
> /* calling code */
> {
> pristine_get_write_stream(&stream, &write_checksum, &stream_temp_path)
> write to and close stream
> if (had priori checksum and does not match write_checksum):
> bail
> pristine_install(stream_temp_path, write_checksum)
> }
> ]]]
>
> This is similar to what you propose,

Yes, and that looks neater, but still has all the problems caused by
exposing the path.

> when considering that your
> implementation would also need a baton to pass the stream temp path along
> (right?).

The API wouldn't need an extra baton parameter. It would return a stream
whose internal "baton" (user data) stores any metadata it needs.

> So IMHO it boils down to: Do you think it is better to opaquify
> such baton for the sacrifice of not being able to get a write destination on
> the target device in advance?

Yes, I think that is the better API for general use. (Note that the
stream would always write to a file that is on the destination device.
The only thing you can't get with this API is a path to that file (or
directory).)

Now let's look at some disadvantages of "return a directory path".

If the caller creates the file, the caller chooses the file's name. We
have to trust the caller to use the right kind of creation functions to
give a safely unique name.

If the caller creates the file, the pristine store can't keep track of
what's in its temporary directory. Does it matter? Maybe not, yet.

If the caller creates the file, the caller chooses the file's
permissions. The pristine store loses the ability to control privacy.
Does it matter? Maybe not, yet.

If the caller writes to the file, the pristine store doesn't have a
streamy way to compress the data if it wants to - it would have to read
it again. Does it matter? Maybe we never will, or maybe we'll want to
change the API anyway when we support compressed texts so it wouldn't
matter.

If the path is exposed just within libsvn_wc, these issues stay under
our control, so that's OK. If the path is exposed in a public API, to
libsvn_client and beyond, that's when I worry.

The other thing we can do is look at the actual uses and see what would
be best for them.

* svn_wc_add_repos_file4() is a public API that takes (as an input
parameter) the new pristine contents as a stream and writes it to the
pristine store, among other things.

* update_editor.c:apply_textdelta() gets a readable stream of the
existing text base file, and a writable stream to a new (temporary) text
base file, and passes the two streams to svn_txdelta_apply(). Presumably
some other function then moves the temp base file to the real base path.

It looks like it would be slightly easier and more "natural" for both of
those functions to get a writable pristine stream than to create an
intermediate file itself and then wrap it in a stream. Not a big deal,
perhaps.

> I vote for removing _pristine_write() and for keeping _get_temp_dir() and
> _install(), plus adding a convenience function in the spirit of above
> pseudocode to make the stream-only case sit comfortably.
>
> Yay?
>
> >
> > Now let's examine the ways in which a caller might want to give new
> > content to the store:
> >
> > 1. Caller asks for a writable stream and pushes content to that, then
> > calls a "commit" function.
> >
> > 2. Caller has a readable stream and wants the store to pull from that.
> >
> > 3. Caller has a (non-temporary) file and wants the store to read from
> > that file.
> >
> > 4. Caller has to create a temporary file for reasons beyond its
> > control (output of an external tool perhaps) and wants the store to take
> > the entire file by an atomic move if possible. This is the case where it
> > would be more efficient if it know where to put the file in the first
> > place.
> >
> > The caller can easily implement 2 and 3 in terms of an API that provides
> > 1, so that just leaves 1 and 4 that are worthwhile as an API.
> >
> > I feel that (1) is by far the more important one to have, and (4) is a
> > specialist optimisation.
>
> ...which may nevertheless come up and is not that different in its
> implementation, IMHO. As above. Internal API.

> > VERIFYING CHECKSUMS
[...]

[DBs and concurrency ...]
[...]

I'm not replying to those parts in this email.

- Julian
Received on 2010-03-03 13:06:06 CET

This message: [ Message body ]
Next message: Stefan Sperling: "Re: [PATCH] sasl service name for svnserve"
Previous message: Matthew Bentham: "Re: [PATCH] wc-ng - Remove use of entry_t from calculate_target_mergeinfo"
In reply to: Neels J Hofmeyr: "Re: pristine store design"
Next in thread: Neels J Hofmeyr: "Re: pristine store design"
Reply: Neels J Hofmeyr: "Re: pristine store design"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]