Re: svn commits: r33357, r33358

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Sun, 5 Oct 2008 18:22:30 +0300 (Jerusalem Daylight Time)

Greg Stein wrote on Sun, 5 Oct 2008 at 06:59 -0700:
> On Sun, Oct 5, 2008 at 3:41 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > Suppose you check out two wc's of the same repos at the same time. If
> > the TEXT_BASE row was updated immediately, then as soon as the first wc
> > fetched its list of checksums of files to download, the second wouldn't
> > download any of them;
>
> Not really. The second WC would note that all of the STORED_SIZE
> columns are NULL meaning that the content is not present, or got
> partially downloaded, or whoknowswhat. But for all intents and
> purposes, it does not have it. Thus, for safety, the second WC should
> go ahead and request all that content. (it has no idea that another
> process is doing it at the same time; coulda been last week, for all
> it knows!)
>

Oops, right.

> > whereas, if TEXT_BASE is only updated after
> > download completes, then it's possible for both wc's to download the
> > same (identical) file simultaneously -- unnecessarily.
>
> For safety, yes: they'll both download it. Sure, maybe that could be
> optimized, but it is no worse than today :-)
>

This argument can save you a lot of work :-)

> >...
> >> It seems more problematic to fetch the content first, then insert a
> >> row. That's just kind of a gut feel... I don't have immediate reasons
> >> why offhand.
> >
> > (see above: I also prefer inserting the row first, but haven't good
> > reasons either)
>
> I figured this out the other day. The scenario is an interrupted
> download. There are two choices here: add a row, then grab the content
> (RTC), or grab the content and add a row (CTR).
>
> In RTC: if the content is never downloaded, or partially downloaded,
> then the row will have NULL for the STORED_SIZE. When you read the
> database, you'll immediately know that you don't have valid content.
> If some content is present, you could actually checksum it to find
> that you have the content, but it just didn't get recorded into the
> database properly.
>

And even if the checksum doesn't match, you can assume that the
start-of-file that you have locally *is* valid, and download just the
rest of the file... (if ra supported that)

> In CTR: the row is missing, so the content on disk is unknown. Maybe
> all there, or maybe a partial file. You have no information.
>
> >> > Most of the time the pristine files don't change under our feet. (When
> >> > they do, they're corrupted.) So we can skip the size check and always
> >> > do only the more expensive check (calculating the checksum). And if we
> >> > skip the size check, we won't need stored_size in the database.
> >>
> >> We can do a quick apr_stat() to see if the file is present and the
> >> right size, and have a high confidence that it is usable.
> >
> > I wouldn't trust the size alone: it's too easy to run 'find | xargs sed'
> > that doesn't modify the sizes. That's why I assumed we would checksum
> > the base file before declaring it usable.
>
> Well... I don't think we go to that extent today, so I'm not prepared
> to distrust the text base to that level.

Good point, I'm convinced.

> Shoot... if you think the
> text base is *that* fragile, then you're gonna have to worry about
> race conditions. "Oh hey. The text base checksum still matches. Good.
> <CORRUPTION> Now, let me go ahead and use it." Ooops.
>
> So yeah. If the stored size matches what is on disk, then I'm prepare
> to trust it.
>

I think current libsvn_wc doesn't do even that (it just trusts the bases
blindly?). But, since the size check costs nothing, I see no reason not
to do it.

> Note that svn_wc__db_pristine_check() *does* have a mode that will
> re-run a checksum. We can use that mode from "svn cleanup" or
> somethign.
>
> >> If the stored_size is not event present, then the file is not going to
> >> be there, or it has only been partially-fetched.
> >
> > I agree that it's useful for the partially-fetched case. But in that case
> > a boolean would work as well (just store an arbitrary non-NULL value at
> > the point you would store the size).
>
> If the text base's size changes, then we have a fast-path exit to
> "corrupt", so hey... SIZE DOES MATTER.
>
> :-P
>
> > The rest is pure logic: since I don't trust the size as corruption
> > indicator and since I think a boolean would work as well as the size for
> > detecting the partial-fetch case, therefore I am not yet convinced that
> > we need to store the size at all.
> >
> > (You're welcome to convince me otherwise, of course :))
>
> If the size is the same, then we can always run a checksum. But we can
> skip the checksum if whatever happened to the text base has changed
> the size. If we want to go nuts on corruption detection, then we could
> also store a timestamp.
>
> >...
> >> > The metadata location (per-dir, wc root, etc.) is typically per-wc, not
> >> > per-versioned-dir. As the code is, we'll just set a key in the hash for
> >> > every single directory in the wc, recursively. But we could store the
> >> > information (whether the pristine dir is stored in-place or at the wc
> >> > root) just for the wc root and let deeper dirs inherit its setting.
> >>
> >> Whenever an operation hits a subdirectory, we don't want to scan
> >> upwards to find the wc-root. We want to just *go* to where the data
> >> is. Thus, we have pdh.base_dir.
> >
> > I thought we wouldn't need to scan because each dir would know what wc
> > it is part of. (If we're going to put metadata at wc-root by default,
>
> How would it know that?
>
> In the default case, a versioned dir will NOT have a .svn
> subdirectory. Also, there will be no central datastore. Everything
> will be wc-root. So given an arbitrary directory, how do you determine
> whether it is versioned at ALL, let alone where the wcroot is? Answer:
> you traverse up the directory.
>
> During a single svn command, you might hit some file in a subdirectory
> as one of the arguments. The fact that it is "under" one of the
> previously-discovered wcroots is NOT sufficient. You still have to
> scan upwards to find a switched root, or an svn:external or somesuch.
>

If you have .svn dirs: store the URL given to 'svn checkout' (or the key
to the WORKING_COPY table).

If have central datastore: do the upwards-scan in the datastore (not in
the filesystem). (Requires extending the DB to list switched/external
dirs.)

If you don't have .svn dirs, and don't have central datastore: First
time, scan up the filesystem, no other way. Second time, do you have
the result of scanning of the parent? If so, you just have to check if
you are an 'exception' to the parent (such as: completely unrelated,
independent wc; switched; external; etc.), and wouldn't have to do
a complete scan.

Of course, if you don't have the parent's scan results available, you
just have to scan all the way up. (And then you'd have the same problem
in every operation that needs the wc root, not just in finding the
pristine/ files.)

> > having a cheap way to find it seems natural.) And the scan would be in
> > the hash, not in the disk (except for the first scan). (Lastly, isn't
> > there a recursion a few layers above us that we can take advantage of?)
>
> Once you fill the hash, then yes: it'll be fast. And as you smack new
> directories, as they traverse upwards, you may run into a
> previously-known directory and terminate your scan.
>
> >> I also didn't envision proactively setting subdir values, but lazily
> >> filling them in. Today, the code will create a new pdh structure since
> >> we're doing per-versioned-dir base_dir values.
> >
> > Lazy filling or not, by the end of the checkout you'll have a hash key
> > for every versioned dir in the wc just checked out... right? Couldn't
> > memory usage, etc., become significant?
>
> We've previously stated that an in-memory map of directories (and even
> files! ugh) is okay. Take a look at the commit process some time: it
> holds all that crap in memory. It is *never* okay to store a file's
> contents in memory, but we've been okay with metadata.
>

I didn't know that. Guess it's fine, then, by precedent.

> In the long run, we'll have a bunch of directory paths mapping to a
> single pdh structure, so it "shouldn't be too bad".
>
> >...
> >> And note that we can't just have db.wc_root ... any given operation
> >> could cover *multiple* wc roots (think of svn:externals or switched
> >> subdirs).
> >
> > open_many() allows that too, no?
>
> Correct. open_many() is intended to (in one pass) find all the common
> pdh structures. Future requests into the API might need to look up
> more, but that depends on how you use the API.
>

> There is an open question: maybe just open a db handle, and then as
> you start using it, it will open datastores as necessary, rather than
> trying to do all that up front. It might simplify the API. (I don't
> have a good argument right now for opening all the datastores up
> front)
>

Verifying that all of them exist and are readable?

I don't see much difference, unless you also want to close them as soon
as we don't need them (otherwise they'll all be opened soon after the
start). And then, how do we know that a certain datastore isn't needed
(at least for now)?

Also, what is 'all datastores' in the metadata-in-.svn case? I'm sure
you didn't mean to open all of them recursively, but that's what we'll
need to be able to answer the same questions.

> >...
> >> Note that if the result is HOME or GIVEN_PATH, then all the hash
> >> values would point to the same PDH structure. Well... at least for
> >> a given WC. I envision an eventual possibility of different configs
> >> for different WCs. For example, a WC on an NFS mount might be WC_ROOT,
> >> while your personal WCs are all GIVEN_PATH.
> >
> > The part that surprises me is "eventual". Are you saying we won't be able
> > to operate on metadata-at-.svn-dirs and on metadata-at-wc-root wc's in the
> > same operation ('svn up wc1 wc2') or in subsequent operations under the
> > same environment ('svn up wc1; svn up wc2') by default?
>
> In one operation, some of the datastores may be wcroot, and others
> centralized. I don't know what the commands will look like or the
> configs or anything to get that kind of granularity, which is why I
> said "eventual". The APIs definitely support metadata-per-wc to be
> located anywhere, and to mix/match that at will.

okay.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
For additional commands, e-mail: dev-help_at_subversion.tigris.org
Received on 2008-10-05 17:43:46 CEST

This message: [ Message body ]
Next message: Lieven Govaerts: "[PATCH] Drastically reduce the number of opened ra_sessions during merge."
Previous message: Bert Huijben: "RE: [PATCH] Working copy on Samba share"
In reply to: Greg Stein: "Re: svn commits: r33357, r33358"
Next in thread: Greg Stein: "Re: svn commits: r33357, r33358"
Reply: Greg Stein: "Re: svn commits: r33357, r33358"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]