[ cc += dev@. summary for dev@: investigating issue #4129: predecessor
count of rN is not incremented by one wrt that of r(N-1); see
http://subversion.tigris.org/issues/show_bug.cgi?id=4129 ]
Jason Wong wrote on Thu, Mar 15, 2012 at 07:57:47 -0700:
> On Wed, Mar 14, 2012 at 5:15 PM, Daniel Shahaf <danielsh_at_elego.de> wrote:
> > Jason Wong wrote on Tue, Mar 13, 2012 at 06:57:59 -0700:
> >> On Fri, Mar 2, 2012 at 8:12 AM, Daniel Shahaf <danielsh_at_elego.de> wrote:
> >> > Jason Wong wrote on Fri, Mar 02, 2012 at 07:32:38 -0800:
> >> >> On Fri, Mar 2, 2012 at 2:58 AM, Daniel Shahaf <danielsh_at_elego.de> wrote:
> >> >> > Jason Wong wrote on Thu, Mar 01, 2012 at 10:01:26 -0800:
> >> >> >> I have had a developer here create a build of the latest SVN code
> >> >> >> with your changes you mentioned in r1294470 for the svnadmin verify
> >> >> >
> >> >> > Okay, that's great news, for two reasons:
> >> >> >
> >> >> > 1. It means building svn on windows isn't as painful as it used to be :)
> >> >>
> >> >> Actually, it did take some work to get it going as we did not have
> >> >> another system available to us and also did not have VC++ 6. We had
> >> >> to use VS 2010 in order to do this. Also, for the other components
> >> >> required (python,perl etc), the files after the install were copied
> >> >> to the workstation to see if it would work as we did not want to
> >> >> change the current workstation configuration by running the
> >> >> installers. All in all, it did seem to work.
> >> >>
> >> >
> >> > Okay. The normal build requires just the *.exe and *.dll files to be
> >> > placed appropriately (such that the *.exe's and httpd's find their
> >> > libsvn_* DLL's at runtime) --- it doesn't require Administrator access,
> >> > for example.
> >> >
> >> > To clarify, Perl is only required to build OpenSSL; it is not required
> >> > to build APR, Neon, or Subversion.
> >> >
> >> >> >
> >> >> > 2. It means I can ask you to build a custom server with the 'inprocess'
> >> >> > cache disabled, or (if all else fails) to bisect, per my previous email.
> >> >> >
> >> >> > One of the things you could try is to disable caching: simply modify
> >> >> > the function create_cache() in libsvn_fs_fs/caching.c to always return
> >> >> > NULL in *CACHE_P. See below for another suggestion.
> >> >> >
> >> >> >> command. We have run 'svnadmin verify' against every revision of our
> >> >> >> hotcopy of our repository taken when we first brought this issue to
> >> >> >> the forums and are now tracking down each of the revisions to see
> >> >> >> what actions were being done at those times.
> >> >> >>
> >> >> >
> >> >> > Thanks! I do hope this work enables us to pinpoint and fix the bug.
> >> >>
> >> >> I will be going through the list to see what else was happening at the
> >> >> same time on the apache server since it was alluded to that there may
> >> >> be concurrency issues. I know the last two times that this error has
> >> >> popped up, we had two svn operations starting at around the same time
> >> >> according to the Apache logs. I will go through the previous apache
> >> >> history to see if this was always the case or not.
> >> >>
> >> >
> >> > Thanks, looking forward to hear what you come up with.
> >> >
> >> > FWIW, Justin's reply suggests that the error was seen on three different
> >> > platforms --- Windows, Solaris, and FreeBSD --- so that should narrow
> >> > down the range of possible explanations.
> >> >
> >> > (I'll also note that at ASF's installation we are not running into new
> >> > instances of the bug.)
> >>
> >> Hi Daniel.
> >>
> >
> > Hi. Sorry for the delay --- was away from svn the last few days.
> >
>
> No problem. I have been really busy as well past couple weeks on other
> tasks.
>
> >> I haven't gone through all the cases yet, but I have made progress
> >> through quite a number of them and a pattern seems to be coming up.
> >>
> >
> > Is it safe to summarize your findings as: in every instance of the bug
> > (as determined by the new 'svnadmin verify' output), the victim revision
> > was started whilst (victim-1) was in progress?
> >
>
> From what is there so far, yes. We do have different operations
> occurring at the same time, but for these ones, I see MERGE and DELETE
> verbs overlapping in the same or near time intervals according to the
> Apache logs. I just did a quick look in the Apache logs during a time
> window where the bug wasn't triggered, and was able to see cases where
> I have the following:
> rev(x-1) merge
> rev(x) merge
> rev(x) delete
> rev(x-1) delete
>
> This seems fine.
> The case that I had seen in my reported cases are as follows:
> rev(x-1) merge
> rev(x) merge
> rev(x-1) delete
> rev(x) delete
>
I have tried forcing this sequence in the debugger, but that did not
reproduce the bug.
I was using trunk_at_1301511 mod_dav_svn, trunk client, and httpd-2.4.
I set a breakpoint in ra_neon's commit.c:commit_close_edit() and
ra_serf's commit.c:close_edit() before the DELETE of the activity. and
sent the requests in the following order: WC1-MERGE, WC2-MERGE,
WC1-DELETE, WC2-DELETE. I tried both serf and neon. I tried
mod_dav_svn of trunk_at_1301511 and 1.6.x_at_1301511. (Only httpv1 makes
sense, since in httpv2 DELETEs do not occur.)
I conclude that the specific MERGE/MERGE/DELETE/DELETE order, by itself,
is not sufficient to reproduce the bug.
I'd like to try and force the race at the other end --- force both httpd
threads to get as far as svn_fs_fs__commit_txn() or even svn_fs_fs__commit()
at the same time. I implemented that by adding a sleep(5+0) call in
fs_fs.c [1], and starting the two concurrent commits within a second of
each other. When I did that, the second commit took 10 seconds to
complete (I assume the loop in svn_fs_fs__commit() iterated twice), but
the bug still did not reproduce. In that case I used ra_serf and 1.6.x
server.
@@ -6011,1 +6013,2 @@ svn_fs_fs__commit(svn_revnum_t *new_rev_p,
+ if (getenv("SVN_")) sleep(5);
SVN_ERR(svn_fs_fs__with_write_lock(fs, commit_body, &cb, pool));
> I would have to look more closely if needed to show that this case
> always triggers the bug or not. We do not have a large userbase here
> so it is harder to get a lot of overlapping hits for this type
> of case.
>
Can you relate the time of the MERGE and DELETE operations (from the
access log) to the time of the commit?
The best way to determine the time of the commit is to look at the
modification time of the revision file (REPOS/db/revs/0/42) belonging
to it; this happens just before the commit is finalized.
In particular it would be interesting to see when rX's commit was done
relative to r(X-1)'s DELETE.
>
> > That by itself is an everyday occurence, but I think it's nonetheless
> > a useful piece of information. I'll try and digest it further later
> > when I'm less sleepy (it's way past midnight here).
> >
> > (As I understand ra_dav, the MERGE verb corresponds to the FS level
> > svn_fs_commit_txn(). Someone please correct me if I'm wrong.)
> >
> > Thanks,
> >
> > Daniel
> >
> >> I have attached 2 txt files. One shows the modified svnadmin verify
> >> output from the binaries we built. The other shows the revisions and
> >> what appears to have been occuring at the time of the bug. I figure
> >> better to provide this now rather than delay any longer for the rest
> >> of the results.
> >>
> >> I will continue to go through the rest of the events and see if
> >> there are other differences seen when the issue occurs. I hope
> >> this information helps.
> >>
> >> Thanks.
> >>
> >> Jason
> >
[1] Yes, this is rather crude. I couldn't do this with 'gdb -args httpd
-X' because I needed two concurrent httpd threads. In retrospect
I think there was an easier solution (use a threaded non-processed MPM,
or perhaps gdb the parent before it forked), but I only realized that
solution when I wrote this paragraph.
Received on 2012-03-18 15:31:06 CET