As you may have noticed, we're not self-hosting yet. I'm too
exhausted at the moment (it's 4:45am Chicago time) to give a very good
description of the problem, but I'll try to do a summary. Greg Stein
will follow up with more detail soon.
The showstopper appears to be caused by unfreed Berkeley DB resources.
We had some scripts do massive numbers of multi-file commits against a
repository, and eventually they started failing, with (among other
things) DB_INCOMPLETE errors, in both the ra_local and ra_dav cases. See
http://www.sleepycat.com/docs/api_c/db_close.html for more on that.
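As a rough illustration (not the scripts actually used), the failures in a
run like this could be tallied from the captured client and server logs.
The marker strings are ones that appear in the transcript below; the
function name and the idea of feeding it a list of log lines are
assumptions made for this sketch.

```python
from collections import Counter

# Error markers seen in the transcript; extend as new ones turn up.
MARKERS = ("DB_INCOMPLETE", "DB_RUNRECOVERY", "Broken pipe")


def tally_errors(log_lines, markers=MARKERS):
    """Count how many log lines mention each known error marker."""
    counts = Counter()
    for line in log_lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "svn_error: DB_INCOMPLETE",
        "write failed: Broken pipe",
        "Commit succeeded.",
    ]
    for marker, count in tally_errors(sample).items():
        print(marker, count)
```

Counting per marker rather than per commit keeps the sketch simple; a line
that mentions two markers is counted once under each.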
There are some other problems as well. Some of the http PUT requests
are failing; there are some working copy locking issues cropping up;
and a few other, more minor things here and there. See the bunches of
new issues filed yesterday and today.
We have reproduction recipes for most things, and leads on the rest,
but the point is that to go self-hosting tonight would be more pain
than it's worth. We could self-host with almost any degree of working
copy bugginess, but repository maintenance issues are another story.
We need to get that stuff fixed before self-hosting.
My personal lesson for next time is: don't wait until a few days
before self-hosting to do the massive tests. Our test suite was
simply not heavy-duty enough -- running tons of commits against a big
tree revealed problems that running few commits against small trees
never uncovered. This will be fixed right away.
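A heavier test along these lines can be sketched as a small generator that
emits a shell script of many commits against random groups of files. This
is only an illustration under assumptions: the function name, the `echo`
used to dirty each file, and the group-size limit are hypothetical, and the
real mass-commit script may have worked differently.

```python
import random
import shlex


def make_mass_commit(files, n_commits, max_group=5, seed=0):
    """Return shell lines that modify and commit random groups of files."""
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    lines = []
    for i in range(1, n_commits + 1):
        size = rng.randint(1, min(max_group, len(files)))
        group = rng.sample(files, size)
        # Dirty each chosen file so the commit has something to send.
        for path in group:
            lines.append(f"echo change-{i} >> {shlex.quote(path)}")
        targets = " ".join(shlex.quote(p) for p in group)
        lines.append(f'svn ci -m "mass commit {i}" {targets}')
    return lines


if __name__ == "__main__":
    tree = ["IDEAS", "README", "subversion/libsvn_fs/dag.c"]
    print("\n".join(make_mass_commit(tree, 3)))
```

Seeding the generator means the same sequence of commits can be regenerated
when a particular commit number turns out to be the one that fails.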
I am off-line tomorrow (flying to my friends' wedding in
Massachusetts); back in the saddle Monday. Do watch this space for
more from Greg Stein, who has been working like crazy the last couple
of days tracking a lot of this stuff down.
I have just enough brain left at the moment to avoid making estimates.
On Monday, I will appraise the situation with a more alert mind and see
what we're looking at. We will be self-hosting very soon.
Disappointed but undaunted,
-Karl
P.S. For your entertainment, here is the IRC transcript:
<gstein> actually. can I restart apache?
<gstein> that might unblock it
oooooh.
Yes, feel free. But. Um. I just deleted the repos.
<gstein> no problem
I'm rebuilding with Berkeley 3.3.11 src in the tree as subdir "db/".
using ./configure --with-apxs=/usr/local/apache2/bin/apxs
(But have not installed Berkeley 3.3.11 on the system.)
Just removed /usr/local/lib/libsvn_* on svn.collab.net.
<gstein> noticed...
If any of this is getting in your way, say sumthin and I'll ask first
:-)
<gstein> restarting the apache was a bit funky
(just now backtracked, am building and installing 3.3.11 first, then will
<gstein> I'm wondering if it got tweaked in some way and was holding open the db
build subversion using the 3.3.11 db/ subdir,
Could be.
I *really* wanted to debug right into Berkeley, but for some reason gdb
<gstein> you don't need the db/ subdir if you're installing 3.3
didn't seem to know where the berkeley sources were, even though
I installed it.
hmmm.
Which version will subversion use, if both are installed, then?
<gstein> I believe it will choose 3.2, but you can force it to 3.3
<gstein> the --with-berkeley-db config switch
oh, --with-berkely
never mind, you typed faster than me. :-)
./configure --with-apxs=/usr/local/apache2/bin/apxs --with-berkeley-db=/usr/local/BerkeleyDB3.x
is that right?
s/x/3/
<gstein> DB.3.x
<gstein> (an extra dot)
<gstein> but yah
thanks
[action] gstein thinks we could auto-find /usr/local/apache2
<gstein> hmm
<gstein> I'm not sure we want to go with db 3.3 yet
<gstein> we don't know why the first hung up
(kfogel holds off on configuring)
No, that's true, we don't.
We also don't have a reproduction recipe.
<gstein> let's put things back using 3.2 and see if it repros
However, Mike and I did the following experiment:
copied the repository to our local boxes
ran statically-linked svnadmin lsrevs and lstxns on it, so we could gdb
and hung in that call to fs->get()
it was in a select(), down in a mutex lock function the name of which I
don't remember.
<gstein> but how do you know that the copy didn't monkey the db?
copying from one box to another, you mean?
<gstein> yah
We don't, except that Berkeley is "supposed" to be impervious to that.
<gstein> I'm a bit concerned because apache wasn't restarting properly
(hands hurtin', gonna phone)
<gstein> so I'm wondering what was going on in there
greg, I'm wondering about this message toward the end of "make install":
libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
<gstein> no big deal
<gstein> dang. I wish mIRC gave some notification from the background when a message comes across
by the way, that tree is already at rev 2 or 3, because I've committed a few
things to it.
IRC used to ring a bell on my machine when I got msg'd.
Now it doesn't.
I don't know what changed. :-)
(It was a different machine... different company... different life.)
<gstein> heh
<gstein> I moved my window so I'll (theoretically) see it scroll
<gstein> I just did "time svn co ..." to see how long a checkout takes
<gstein> of course, this is over the loopback interface... but still it is DAV
<gstein> 16 seconds wall clock.
<gstein> got a conflict. svn up. 'G' code. committed.
<gstein> no problem.
<gstein> of course, the conflict message is ugly. got some work to do there.
okay, got a minor problem. I tried committing three files, including update.c,
got a conflict because you'd modified it (again).
No sweat. Instead of updating, I decided to commit only the other two
files: IDEAS, and subversion/libsvn_fs/dag.c.
<gstein> neat
So I typed this:
svn ci -m "blah" IDEAS subversion/libsvn_fs/dag.c
and got this error:
[gauss]:/home/kfogel/src/wc1/trunk>svn ci -m "Changed three files again, but only committing two, leaving update.c out of this commit." IDEAS subversion/libsvn_fs/dag.c
svn_error: #21010 : <Bogus filename>
commit failed: while sending tree-delta.
svn_error: #21010 : <Bogus filename>
svn_wc__wcprop_get: failed to load props from disk.
svn_error: #21010 : <Bogus filename>
wcprop_list: non-existent path '/home/kfogel/src/wc1/trunk/dag.c'.
Icko. Though we can even self-host with a honker like that; ppl would
<gstein> hmm. I just got an error trying to commit with ../../INSTALL
just have to be careful with their commits. :-)
<gstein> let me move up and try committing down
It's converting to abs path anyway, I think, but try moving up and committing
down, yeah.
<gstein> what happened with that dag thing?
<gstein> that path is way off
yeah!
off by two components, not even just one
weird.
<gstein> on multiple targets, it probably doesn't combine properly?
hey.
let me try committing just dag.c without IDEAS.
<gstein> that is: they both went from the top
Maybe, yeah.
<gstein> I've done that with update.c, I believe
Hold on, here I go.
HEH.
You nailed it.
<gstein> multiple targets. busted.
Well, that's not so bad.
<gstein> *nod*
I think I'll debug it in place.
See you in a bit.
<gstein> just found that with committing down w/ IDEAS on the line
[action] gstein nods
(keep hammering, baby :-) )
<gstein> tried committing no change. no luck :-)
what happened?
<gstein> it just returned
<gstein> in the future, it might want to say "no change", but at least it didn't check in a no-change
<gstein> wall clock time to commit update.c from the top: 4 seconds
oh! I thought by "no luck" you meant it did something wrong, instead
of merely non-intuitive.
+wipes brow+
<gstein> hehe ... sorry
no sweat.
no wait, I don't mean that.
I *did* sweat. :-)
Anyway, back to debugging.
<gstein> wow. 2 secs to update the tree. picked up a new dag.c and a new README.
<gstein> seems like we scan the WC *really* fast
<gstein> there is a noticeable pause in the commit. you notice that?
<gstein> wow. that update on the top: it also updated *all* the vsn numbers in the WC
<gstein> wonder how much is the box vs. svn
<gstein> (I could have wacky expectations from working on that P120 for so long)
Yeah, been noticing that commit pause.
wondrin' about it.
<gstein> not gonna delay M3
<gstein> but something to look into
+1
<gstein> I'm thinking to write a script to do a thousand commits to random groups of files
<gstein> prolly want to disable the email... oh!
<gstein> I bet that is the delay: the commit email
<gstein> gonna disable the mail for a bit to do some testing.
okay
<gstein> do you need the mail?
(curious how it could affect the... Oh. I see.
no, I don't need the mail, they've been working fine enough.
)
go ahead & disable
goddam gcc has optimized out every single variable in a block here
grrrrr.
<gstein> oof
We need to turn off optimization. :-)
<gstein> no way...
huh?
<gstein> go build a debug one :-)
Oh, yeah, I just mean for debugging.
<gstein> I thought it did
But right now, who would possibly want to build this *without* debugging? :-)
<gstein> heh
So maybe it should be the default until beta.
<gstein> it *is* the mail
<gstein> 3.77 and 3.59 seconds for w/ email. when I removed it: 0.18 seconds
<gstein> those are wall clock times
<gstein> for the client, the user/sys times are basically the same (as you'd expect)
<gstein> so... at some point in the future, we may want to look at how to optimize the commit-email script
Mike already has some big optimizations, actually.
He can basically do the whole thing by running `svnlook' once instead of
4 times or whatever it is.
<gstein> excellent
I'm mailing this section of my irc buffer to Mike right now.
all right, buster.
+ Step 30.
+
+ After installation, please visit your local 7-11. Pick up two pints
+ of olive oil. Stop by Krispy Kreme on the way home and get donuts.
+ Pour the olive oil around your dining table, making a neat circle on
+ the floor. Get in the circle with the donuts and a book of matches.
+ Light the olive oil. Sit down and enjoy the donuts.
?????????
:-)
[action] gstein smiles innocently
okay, the problem is _not_ in svn_path_condense_targets().
after that's called, we have
(gdb) p *(((svn_stringbuf_t **) condensed_targets->elts)[0])
$80 = {data = 0x812968c "IDEAS", len = 5, blocksize = 6, pool = 0x812925c}
(gdb) p *(((svn_stringbuf_t **) condensed_targets->elts)[1])
$81 = {data = 0x81296a4 "subversion/libsvn_fs/dag.c", len = 26,
blocksize = 27, pool = 0x812925c}
looks good so far.
(still goin'...)
<gstein> nod. and what is the root?
<gstein> is the root '.' or 'subversion/libsvn_fs' ?
root is `.'
<gstein> hmm
I'm in ~/src/svn/trunk/, that is.
<gstein> nod
(so those paths are fine).
Now I'm in svn_client_commit(), which I notice is calling svn_path_condense_targets() *again*.
Sigh.
let's see what it does this time. :-)
<gstein> hehe
still good
now I'm in send_to_repos(), with an entirely correct target list.
Hmmmm.
chugga chugga...
<gstein> :-)
<gstein> I wonder if ra_local does it right
problem might be in the driver, not in either ra layer.
We'll see..
heh heh!
#if 0
/* ### kff todo: after M2, put this back wrapped in a conditional,
for implementing 'svn --trace' */
SVN_ERR (svn_test_get_editor (&test_editor,
&test_edit_baton,
svn_stream_from_stdio (stdout, pool),
0,
base_dir,
pool));
svn_delta_compose_editors (&editor, &edit_baton,
editor, edit_baton,
test_editor, test_edit_baton, pool);
#endif /* 0 */
(Haven't found the bug yet, just thought above was an amusing thing
to run across.)
<gstein> hehe
<gstein> okay. should I wait for my mass commits until you're done debuggin?
stepping into svn_wc_crawl_local_mods()...
You mean commits to the testing repos?
Or to subversion's current CVS repos?
<gstein> svn's current
<gstein> no
:-)
<gstein> not CVS. the testing one.
I know what you mean.
Yeah, commit away. This bug has got to be local, right?
<gstein> sorry. thought you mean /testing/current. I mean /repos/svn
right.
commit to it, it's fine.
<gstein> prolly local. but didn't want you to get a conflict while you're working there.
<gstein> okay. gonna disable email. and then slam the repos :-)
Well, just don't commit to IDEAS and dag.c, then.
<gstein> I'll take that out of the potential list. got a script to generate the commits :-)
hehe.
I thought you might. :)
OH.
Wait.
<gstein> it is going now... :-)
Do make a copy of the repos first, in case it gets corrupted.
Oh, never mind then, no biggie.
<gstein> up to #62
<gstein> 103
Revision 103???
Tell me when we get to 1729. :-)
<gstein> hehe
<gstein> I think that I just hung it
<gstein> I tried an svnlook at the same time
<gstein> yup. hung
<gstein> waiting on a lock
??
(phone call came, off soon)
<gstein> db is hung
[action] gstein points at irc log above
"u know it's waiting on a lock?" was the question I meant to ask
<gstein> attached to the process with gdb
okay, off phone
any other details?
<gstein> on phone
ok
/me is embarrassed to admit that he's never attached to a process with gdb
/me would like to do so, however, as he anticipates greatly increased future need for this in his life
<gstein> gdb program pid#
<gstein> had to do it as root, in this case (for the apache process)
++ beautiful ++
thank you for that, mmmm, nugget.
<gstein> ok. well... we have a hung up server now :-(
Maybe if I learn how to read someday, I can actually read the documentation.
How many revisions did it get to?
<gstein> 280
<gstein> it was probably the svnlook that I ran at the same time
Is there any way we can forcibly clear that lock?
Ooooooooh.
<gstein> so... it probably would have done 1000
<gstein> but that would simply be because it was done serially
if only we hadn't tried to do something else at the same time.
<gstein> as soon as two people tried to do something at the same time...
bleargh.
yeah.
Well.
<gstein> better to find now
yup
question is, do we want to go to Berkeley 3.3.11? If there is a way
<gstein> in any case, the commits weren't really working right anyway, it would seem
to un-hang a repos, I'm not sure...
How were they not working?
<gstein> just checked the error log. I had forgot about the a/b/c and d/e/f problem you were working on
<gstein> so the multi-file commits were simply choking
oh, so the script tried multi-file commits that way?
ah.
okay, at least that's (likely) not a server or db problem.
<gstein> regardless, it did 187 commits
that's good.
Want to try the same thing with Berkeley 3.3.11?
<gstein> hmm
<gstein> sure. why not. we have a repro case
(Or rather, try reproducing again, just to make sure we can make it hang,
[info] ben has joined #svn
then try to reproduce it but with 3.3.11
BEN!
<ben> doh!
<gstein> your initial thought was good.
Heyy there.
<gstein> let's try to repro on 3.2 one more time
deal.
<gstein> if we can, then we'll use that recipe against 3.3
<gstein> first, we have to clear the locks, tho
Ben: call me 922-2784 for quick status update
Umm.
<ben> okay
Can you set up a different repos, svn2 or something,
<ben> (just listening in.) :-)
just in case the curre
oha never mind.
[info] ben stopped wasting time: Started wasting time elsewhere
I can't use the current one anyway.
Blow that baby away.
/me goes back to debugging the client-side problem
<gstein> db_stat might tell us more info. looking at it now.
<gstein> do you need a good repos?
<gstein> woah. svnlook core dumped
greg: don't blow away that repos
Ben is telling me something important about the bug
(the one I'm debugging)
Okay, it looks like I will need to contact a repos of some sort,
and start the activity. But don't necessarily need the repos to
be the same one my wc is based on.
<gstein> ok
So if you can make a working repos be there again, then that's probably
enough for me. Maybe it also should have at least rev 4 or 5 in it,
so my wc is not too out of whack.
<gstein> let's move repositories/svn to repositories/svn-broken
yes
<gstein> and then can you do the import for a new svn?
Yes, I will import right now?
<gstein> (I don't have the recipe for that handy)
<gstein> let me move the old one
I can tell you what I do.
<gstein> ok
<gstein> moved
okay, this is what I type:
as user `svn':
svn import file:///usr/www/repositories/svn /home/kfogel/src/import-me trunk
there.
<gstein> ok
I keep that `import-me/' tree around and never touch it.
<gstein> importing... done
<gstein> you should be set. let me restart apache
<gstein> all set. I just did 'svn stat' all right
<gstein> your IDEAS and dag.c are going to be tweaked, tho
<gstein> but I'll be working with the broken one now
thanks.
should be okay, even with them tweaked
Because I just need to get to the resource report stage, where it fetches
local wcprops and sends them over.
So the highest rev in the new repos is 1, right?
<gstein> yup
<gstein> okay. did a recovery on that db.
<gstein> still some problem in 'nodes'
<gstein> oh. wait. db_verify man page talks about custom ordering, which I believe we use.
damn
svn_error: #21073 : <RA layer's server request failed>
The CHECKOUT request failed (http #409) (/repos/svn/$svn/ver/19.1/trunk/IDEAS)
no big deal, just mean my wc can't be used with this repos.
<gstein> urk
I'll re-check out and start again, know where to jump to this time
<gstein> well... you can breakpoint well later
<gstein> right
exactly
<gstein> I'll leave yours alone and set a new pointer to this broken repos. see if it still works
thanks
<gstein> ok. I'm off for a bit. back later to work on the db thing
okay
Do you know how long?
I think I've got it.
Whew.
<gstein> prolly another hour.
<gstein> i.e. 9pm to 9:30pm my time
<gstein> then I'm gonna try to repro the db hang
<gstein> hmm. the recovery didn't do the full recovery
<gstein> there is some kind of problem with the dbenv
<gstein> when will you be back from your trip?
<gstein> oh: back on the repro. so I'll try to repro. assuming I can, then I'll switch to 3.3
I'm back on Monday. :-(
<gstein> then try the repro again. if it *doesn't* happen, then we may be okay
We should settle for that. :-)
<gstein> can I suggest that I will announce a "provisional M3" until you return?
sure.
<gstein> (assuming 3.3 works and/or I find how to get 3.2 to work)
But let's announce it together tonight
Oh, but you may not be able to do it before I have to leave here, is
<gstein> um. 9:30pm here is late for you
what you mean. Get around to repro-ing here.
<gstein> right
<gstein> yah :-(
Uh, I know it's late, but I long ago gave up hope of sleeping tonight.
<gstein> that is why I called it "provisional"
I would rather release an M3, even provisional, than get some extra
sleep tonight.
<gstein> okee dokee
I'm almost done with this bug.
(I found it, just have to make the right fix).
<gstein> cool. what was it? ra_dav or something else?
It was in the adm crawler.
That's a paddlin' for Ben. :-)
<gstein> hehe
<gstein> you should see issue 443 (iirc)
Can I use your repro script to start hitting the DB
/me looks at 443
<gstein> that is why he is due a paddle :-)
so can test the 3.2 vs 3.3 thing
<gstein> hmm. not that one. lessee... what was it...
are you sure 443?
Ah, no, okay.
Come to think of it, there's no point calling M3 "provisional". We always
<gstein> 444
knew it would have bugs. So if it has them, it's still M3. :-)
<gstein> (next one)
/me wants to release M3 ***tonight***
<gstein> fine w/ me. I really meant "provisional" as in,
<gstein> "I'm announcing 'provisional M3' now, but when Karl returns, he can bless it as 'real M3'"
reading issue
<gstein> but if you're gonna he-man it tonite, then I'm game
<gstein> I have another 9 hours of wake time left, at least
I'm gonna he-man it tonight, because I want to have an enjoyable weekend.
<gstein> great!
I'll still be here at 11:30 Chicago time when you get back, don't worry.
But if I can have your repos-hitting script, then I will work on that too
after this bug is finished.
i.e., work on reproduction of the db hang
<gstein> the scripts are in /home/gstein/tmp
<gstein> the generated script is "mass-commit"
<gstein> (as if you couldn't guess :-)
<gstein> ok. gotta go. back in a short while.
see you later
svn ci -m "Committing two files explicitly." IDEAS subversion/libsvn_fs/dag.c
Changing /home/kfogel/src/svn/trunk/IDEAS
Changing /home/kfogel/src/svn/trunk/subversion/libsvn_fs/dag.c
Commit succeeded.
Bug fixed.
Next. :-)
I'm working directly in your home dir as `gstein' right now.
Nope, I take that back. I'm copying your whole script setup and doing
it in my home dir.
(since your wc refers to an obsolete repos anyway, so I have to check
it out again).
Reproduced the DB hang.
Have stopped apache, and removed libsvn_* from /usr/local/lib.
Am rebuilding svn (with the bugfix, incidentally) and DB 3.3.
Hmmm. Subversion doesn't want to build with 3.3 even when told to.
at the end of configure --with-apxs=etc... --with-berkeley-db=/usr/local/BerkeleyDB.3.3
It says these two lines
checking for Berkeley DB in /usr/local/BerkeleyDB.3.3... no
configure: error: Could not find Berkeley DB 3.2.9.
(though berkeley is installed there).
I'm changing the configuration helpers to want 3.3.11.
_Still_ doesn't work.
Hmmmm....
(same error)
Funny. /usr/local/BerkeleyDB.3.3/include/db.h certainly does claim the
correct version number.
<gstein> dunno. maybe after changing the helpers... rerun autogen.sh ?
did that
<gstein> did you do that?
<gstein> hmm
Am looking at our ac-helper for Berkeley DB.
Have you read everything above in the irc buffer already?
<gstein> just did
cool.
<gstein> saw you repo'd
yup.
I say, time to switch to 3.3.11.
(now if only we knew how :-) )
<gstein> hmm. I'll grab the source and poke at it some. i.e. not muck with /usr/local/src while you're in there
I'm not in /usr/local/src.
<gstein> how did you repro? do an svnlook while the mass commit?
I just do this in my home working.
in kfogel's working copy.
Yes , I ran svnlook
it took a few times.
<gstein> ah. well... when you do a "make install" you do it from /usr/local/src, right?
No, not right now.
[action] gstein nods re: svnlook
<gstein> ok
I just do it from my home dir.
<gstein> well... I'm gonna go see if I can get bdb 3.3 to take
<gstein> second pair of eyes and all
mind if I phone and give my hands a rest?
(phone you, that is, for a brief chat)
<gstein> no problem.
heh, no answer, so I have to phone him to ask the question.
Ah, there he is.
:-)
<gstein> the configure test is linking against libdb-3.1
<gstein> from /lib
<gstein> ah!
<gstein> may have it... one sec
Cool!
(hmm, when did you write that? :-) )
<gstein> just now
ahh. I just happened to check IRC now.
<gstein> I got it linked to db-3.3, but the RTL couldn't load it
hunh.
<gstein> which means ld.so didn't know about it
ahhhhhh....
<gstein> which can be fixed with --rpath or ld.so.conf
/etc/ld.so.conf time?
[info] This looks like spam to me: etc/ld.so.conf
/etc/ld.so.conf
(the irc server apparently thought I was spamming when I wrote that)
weird
<gstein> yah. weird.
If we change ld.so.conf, we have to run `ldconfig', right?
(until next reboot, that is)
<gstein> oh, no problem on that
<gstein> one issue is that we were running with 3.2, yet that isn't in ld.so.conf
<gstein> therefore, why do I need to do it for 3.3?
yeah, i wuz just thinking that.
<gstein> or is it that we're actually running against 3.1? (gasp)
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOHHHHHHHHHHHHHHH
my goodness.
<gstein> hmm. did you recently blow away /usr/local/lib ?
But why would the RTL complain for 3.3.11, but not 3.2.9?
libsvn_* yeah
not all of usr/local/lib
<gstein> right. but I thot you did that hours ago.
not so recently, but I don't think I've reinstalled since then.
<gstein> still not put back, or did you do it again :-)
<gstein> ok
if you didn't install, then they're not there.
<gstein> gotcha
No, wait.
I think I *did* install them.
<gstein> no matter. I'll put them back
Oooooh, here's what happened
- blew them away
- built with 3.2.9
- reproduced the bug
<gstein> perms?
- blew them away again
<gstein> ah
(you need root password?)
<gstein> nope
<gstein> got an ssh key for it
- never did reinstall, because I couldn't build w/ 3.3.11
So they're probably not there, and that's probably why. :-)
<gstein> ok
Anyway: go ahead and blow away everything.
<gstein> I can get this
libs, repositories, whatever.
<gstein> okee
I'm working on post-M3 instructions; give a hoot when you're ready for
repos slamming.
(BTw, you updated, right? Your mass-commit script should work now,
with the bug fix for that wc problem)..
<gstein> haven't updated. will in a sec.
<gstein> I just realized why the ld.so.conf didn't matter...
<gstein> when I did a traceback on that hung server,
<gstein> the DB stuff was statically linked into libsvn_fs
????
Ohhh.
libsvn_fs is a shared, dynamic library.
<gstein> yup. but DB was statically linked into it.
But once you load it in, you get Berkeley DB along with it, because
that is statically linked into fs.
<gstein> right
[head spins]
<gstein> I'm just gonna add 3.3 to ld.so.conf
<gstein> which is the "right" thing for the box
But wait. Why shouldn't we just be able to statically link DB 3.3.11
into the fs, like we were doing with 3.2.9?
<gstein> because we don't want to
(Or is that undesirable?)
gotcha.
otay
As long as it's not likely to be a significant variable here, let's
do it the Right Way.
<gstein> ah. and I know the diff during configure
<gstein> for 3.2, there is only a .a available
<gstein> therefore, the config test statically linked in the right lib
Oh. But 3.3 makes a .so too?
<gstein> but for 3.3, there is a .so. the linker linked against that. when run, it would pick up libdb-3.1
<gstein> and *that* fails
Bleecccch.
I see.
*shiver*
<gstein> point is: 3.2 was built non-shared only. 3.3 was built with both
<gstein> dunno why, but that is the case.
<gstein> now that ld.so.conf is fixed. lemme try the configure now...
[action] gstein is off and running... you may return to your regularly scheduled program :-)
/me returns to his irregularly scheduled programming
I'm dying to know -- any reproduction, or is it playing nice?
<gstein> just got it built
<gstein> about to build a new repos
<gstein> and run a couple tests
<gstein> looks like things are linking against 3.3 properly
<gstein> (e.g. the .so linking)
<gstein> importing the tree now
<gstein> seems to be working fine
got a _lot_ of the administrivia work done. I'd say now there's 30-40 minutes
between when we have the M3 tree and the whole thing is done.
Maybe less.
(But remembers we still have to test w/ auth on)
<gstein> yah
<gstein> checking out the repos now
<gstein> I have checked my repos with auth on
<gstein> so I don't see any issues
<gstein> I've noticed that the password is not hidden (when asked). same for you?
<gstein> e.g is it from my terminal? or bug in APR?
are you running in Emacs or something?
<gstein> nope. just a normal ssh shell.
oh, that's right: Emacs is hiding it for me, so I wouldn't know.
<gstein> I presume that it isn't hidden for you?
<gstein> hmm. something about the terminal then.
[action] gstein shrugs
It is, but it's not the terminal, it'
<gstein> ok. I'm ready to do the mass-commit.
it's Emacs regexp matching on the password prompt. :)
<gstein> ah!
[action] gstein chuckles
<gstein> shall I start the commit?
Yeah! Is apache back up, then? (I noticed it was down)
I won't touch the db while you do mass commit
<gstein> it's up. I restarted it a bit ago
but let me know if the script still gets those errors (the non-hang
<gstein> okay. two phases: mass commit until it completes. see how svn stands up
errors, I mean)
<gstein> yah. we'll check for errors
then phase 2: svnlook while it runs?
[action] gstein bings
/me nods
[action] gstein mumbles something about great minds...
[action] gstein winks
/me wishes he had any mind at all right now, let alone a great one
go for it
i'll check back soon
<gstein> hmm. lots of errors. go ahead and do whatever. I'll work on this for now.
<gstein> doesn't look like the problems we saw before re: wrong dirs
ok
/me uses "ok" to mean "not very ok"
:-)
does doing a local "make check" work?
<gstein> dunno. didn't try that :-)
living on the edge, huh?
<gstein> yah...
I'm done with everything I can do before we have the tree.
What can I help with here?
<gstein> not sure.
ok, I can wait, too.
<gstein> waiting for this to complete
<gstein> then we can review the error log to see what is up
paste in the errors?
<gstein> one sec
<gstein> some of that "broken pipe" stuff
I've been tailing error log
<gstein> mass-error ?
<gstein> or /var/log/httpd/error-log
the latter
<gstein> I believe that is simply apache recycling children
<gstein> but then again: signal 6 is related to abort()
plus a few odds and ends, like:
"could not write the file contents"
and
"Delta source ended unexpectedly"
<gstein> ah. I see that
<gstein> the two are part of the same request
what does it mean?
<gstein> (stacked errors)
<gstein> dunno offhand
ok
<gstein> commit #813 so far
huh.
<gstein> hmm. 819
Is repository rev 813 the same as attempted commit 813?
or are there some lost ones?
<gstein> repos rev is prolly lower. one sec...
DON'T DO IT MAN
BACK OFF FROM THAT LEDGE
<gstein> :-)
were you about to run svnlook?
<gstein> fgrep -c Commit mass-output
:-)
<gstein> 404 commits succeeded
hmmmmm.
That's a great batting average. But one wants better from one's revision
control system.
<gstein> yah
what # is it on now?
<gstein> mass-error has problems show
<gstein> 991
(typo there?)
<gstein> yah. shown
<gstein> done. wall clock was 12 minutes
whew
okay it's ~gstein/tmp/mass-error, right?
<gstein> yah
<gstein> hmm. 515 lines of "Broken pipe"
whew, what a mess
yeah
it's always when trying to commit a lot of files at once
<gstein> I betcha that the diff is busted (intuition)
which diff?
<gstein> so the server chokes. and that filters back.
<gstein> svndiff
hmmm
bad deltas being sent to server?
<gstein> maybe
"bad" as in invalid, recognizably bad, maybe
<gstein> it is just that we have errors during "sending postfix text-delta"
run make check with that thing, let's see what it does in simple cases
with just local access.
(want to isolate db 3.3.11 incompats from problems inherent in subversion)
<gstein> good idea
<gstein> you logged in as svn? go to /usr/local/src and run that
will do
<gstein> I'll continue looking at these errors
<gstein> there are also some errors about locked WCs
yeah, saw
<gstein> I'm gonna run an svnlook to see what happens. nothing else going on right now.
<gstein> (want to see the current rev#)
the test suite passes just fine
<gstein> woah
??
did svnlook bite you?
<gstein> try: svnlook /usr/www/repositories/svn
okay
holy cow
<gstein> ok. I'm gonna shut down apache.
well, let's try it, especially as 3.3.11 claims to have
fixed some bugs in running recovery (!).
<gstein> altho we know nobody else is talking to the db. but just in case.
We have two choices here:
svnadmin recover ..../svn
or
db_recover -c -h ..../svn/db
<gstein> I see in the apache status, no other connections at the moment.
<gstein> ok. we'll make two copies of the repos.
ahh!
<gstein> try both on the copies.
good plan
<gstein> but shutting down apache to lock out access is first bit. one sec.
okay
<gstein> copying first one
ok, i just wait
<gstein> darn repos is up to 224 meg
whew
i hate this bloat
<gstein> yah. that is a bit extreme
does berkeley *really* need all those zeroes?
or is it us, I wonder... hmmm.
<gstein> I wonder if there is a GC type thing we can run
maybe
to learn later
<gstein> ok. both copies
<gstein> see /usr/www/repo...
okay, I'll take svnadmin, you take db_recover
<gstein> ok
heh, i just get a runrecover error again
DB_RUNRECOVERY: Fatal error, run database recovery
that's what it says when I run "svnadmin recover"
<gstein> hmm. that isn't helpful :-)
no, I'd say it's fairly losing.
what about you?
<gstein> running db_recover now
i'm doing the -c -h flags from memory of watching mike earlier today
<gstein> yah. those are right. I did it before, too
what is the result
it is still running?
<gstein> still running
that's good
sure beats a runrecover error, anyway :)
mind if I clean up some old repositories in there?
<gstein> no problem
don't think we need main*, nor svn-broken, nor was_svn
will clean all those.
<gstein> ok. db_recover completed
<gstein> but... I bet it is really broken now. one sec
try it
<gstein> yup. error opening the environment
<gstein> that is what happened with svn-broken after I ran db_recover on it.
what's the point of recovery if it don't recover nothing?
/me wonders
<gstein> yah. that is what I thot
Well.
Anyway, what if we generated a mass-commit script that
never committed more than two or three files at a time?
<gstein> and what would be the point of that?
I'm wondering if the problem only shows up when many
files are committed at once.
If that is the case...
well, hmmm, still, it does corrupt the repos.
<gstein> well... let's take a look at the failing cases. should be easy to find.
<gstein> and yah: there is that :-)
If it were solely a client side issue, then I'd say, fix it after M3.
<gstein> I'm gonna see what I can find about this recovery thing
But this is db side.
Let's see if it always fails in the same commit.
<gstein> gonna make one more copy of svn to ensure that we don't accidentally break it by doing something.
First failed was commit #3, right?
<gstein> gonna call it svn-pristine-broken
heh
Anni won't be accidentally woken if I phone, right?
<gstein> no. she's right here.
oh.
Hi, Anni!
Then I'll phone
<gstein> I think I've learned some about recovery procedures...
<gstein> wow. reading the berkeley docs... there is a good amount of regular maintenance to do regarding archival and log truncating
<gstein> that 220 meg or whatever is because we haven't tossed the logs and stuff
<gstein> we're going to need kind of a summary doc on how to maintain the db environment in the svn repos
<gstein> found the reason for the aborts in the error log. we did not install an FS warning func. so it abort()'d
<gstein> I wrote a func to log the message. interesting results...
I'm back
<gstein> we're getting DB_INCOMPLETE back from the db->close
<gstein> ah!
hoo-hoo!
<gstein> but getting that is weird. the doc says that is cuz somebody is writing while we are closing
<gstein> but in our test case, nobody is writing
hmm
Btw, have we done mass-commit over ra_local?
(just curious)
<gstein> nope
<gstein> worth a try
yeah
<gstein> found something interesting...
I'll do it right now (easy to set up, no apache conf involved)
yeah?
<gstein> fgrep 500 /home/gstein/tmp/mass-error
ok
<gstein> *only* those two files are getting 500
<gstein> nothing else
it's all the same file!
<gstein> .html and .txt
oh, and .html
what the *heck*
(and I note that .html doesn't even belong in the tree, what's it doing in there? Oh, have to remember
<gstein> hmm. tmp/pass-1/mass-* is the first run
to take it out of the import source.
<gstein> it also shows ltmain.sh got a 500 a couple times
<gstein> but most the svn-design
<gstein> anyways... running a new pass with the warning func installed
ok
<gstein> no more broken pipes
<gstein> obviously, those came from the server dropping the connection. reading the socket would throw SIGPIPE
I just created `karl-repos'
great!
<gstein> read the log re: the db repos?
Glad that's explained
you mean read mass-error?
( ? )
<gstein> nah.. the bit about log files in the db area
<gstein> we need a regimen of running db_archive and tossing log files and stuff
<gstein> that would drop that 220 meg
oh, you mean the irc log?
<gstein> hehe. yah
will it grow without bound?
Or does it slow down eventually?
<gstein> I believe without bound
<gstein> it keeps the log files until you deign to toss them
<gstein> I saw something about a program you run to see if any are in use. if not... 'rm'
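The log-tossing regimen described here could be sketched roughly as follows. This is only a sketch under assumptions: it assumes Berkeley DB's db_archive utility is on the PATH and that, run with -h on the environment directory, it lists the log files no longer in use (-a asks for absolute pathnames). The function names are hypothetical, and nothing here is an existing Subversion tool.

```python
import os
import subprocess


def unused_log_files(db_home):
    """Ask db_archive which log files in db_home are no longer in use.

    Assumption: db_archive is installed and on PATH. -a requests absolute
    pathnames; -h names the database environment directory.
    """
    result = subprocess.run(
        ["db_archive", "-a", "-h", db_home],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def remove_files(paths):
    """Delete the given files and return the number of bytes reclaimed."""
    reclaimed = 0
    for path in paths:
        reclaimed += os.path.getsize(path)
        os.remove(path)
    return reclaimed
```

A periodic job might call remove_files(unused_log_files(db_home)) on the repository's db directory, ideally after copying the listed files somewhere safe, since removed logs can no longer be used for recovery from an older backup.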
<gstein> #360
[action] gstein goes to get more port
Can they really mean that? I mean, who's supposed to run software that
requires human maintenance?
If it's not in use, why doesn't Berkeley toss it?
<gstein> not my software :-)
what's the #360 referring to.
?
<gstein> commit #360
(Importing into karl-repos)
ok, hold on
<gstein> 369 now
oh!  You're running again, gotcha.
<gstein> prolly a big doggy cuz you're smacking the system too :-) (no biggy)
<gstein> yes. hitting with svnlook right now wouldn't be good :-)
yeah.
<gstein> I'm concentrating on fixing the mass-error stuff first
<gstein> then we can try the hang
agreed
<gstein> there is also the DB_INCOMPLETE that we're seeing
<gstein> reading some of those db docs are scary. apps crashing will tend to leave the db in a recovery-needed state
<gstein> but doing that also seems to require copying __db.* from backup
<gstein> after I did the db_recover, the __db.* were missing. so the db didn't work.
<gstein> so I 'restored' them from backup: created a new repos using svnadmin. then copied the __db.* files.
<gstein> the db env worked again
No way!
Yeah, we'll definitely need to doc this.
Or rather, I'll need to *read* the docs that Sleepycat wrote. :-)
<gstein> point being that I believe db_archive folds the log files into __db. then you back up __db.
[action] gstein chuckles
<gstein> yah. there is something "big" that needs to happen.
<gstein> so a recover occurs from the last __db backup or somesuch. no idea. I'm spouting here :-)
but the point is, you don't just run "db_recover". You do some stuff
to your db env, *then* you run db_recover. That it?
<gstein> run db_recover, then copy some __db files in. then you're running
<gstein> db_recover *deletes* the __db files
oh, i had order reversed, okay
I see.
/me wants Postgres backend
<gstein> but there must be some kind of policy re: backing up that stuff. archiving the files. deleting. etc
[action] gstein grins at karl
<gstein> now you see why I want a mysql backend
<gstein> but yah: pg works well too
yup
sheeeeeeeeeeeeesh.
<gstein> what's the 'du' ?
that I'm running right now?
long story
<gstein> woah! we've got some 'splainin to do... the apache processes are at 35M ram
ew
and no one's even doing anything, right?
sheeeesh.
Right now, I'm user `svn', in ~svn/, running "du -s wc-bkp". There, it
just finished.
Took a coupla minutes to tell me 137 megs. That's ridiculous.
<gstein> remember that I'm smacking that disk, too
<gstein> slamming commits in there
<gstein> that head musta been grinding awfully fierce :-)
Yeah.
[action] gstein noted the commits were moving slower
I can't hear it though.
<gstein> woof. they're up to like 60 meg
the apache's?
/me wonders if httpd-2.0 doesn't have some pool usage issues?
<gstein> yah
So let's see: every day, we have to
<gstein> I'm not doing a lot of subpool stuff on the server
1) kill and restart apache
<gstein> nah. the processes will auto-recycle
2) archive some berkeley .db files somewhere
oh, that's right, okay.
this box clearly does not have great i/o
i've been removing the same tree for 5 minutes now :-)
<gstein> hmm. well. we should configure it that way. they won't recycle with our current config.
[action] gstein chuckles
oh.
what do we have to put in httpd.conf?
<gstein> with one disk, there isn't *any* machine that can do well with that
<gstein> MaxRequestsPerChild should be set to, say, 1000
'spose that's true, yeah.
<gstein> so every 1000 requests, the child dies. the parent will then restart a new one.
<gstein> the mem seems stable between 60M and 70M
<gstein> the way the pools work: peak usage remains
hmmm, not understanding the MaxRequestsPerChild thing.
I would think that would lead to bigger processes, not smaller, because
the same child will handle more requests, no?
Oooooh.
But currently it's set to 0, meaning, no max?
/me notes that 1000 > 0 only in some contexts
<gstein> right. no max.
<gstein> the pool reuses the mem.
<gstein> so you see peak usage
gotcha
<gstein> one request is done. it goes back in the pool for the next request
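The point about peak usage and MaxRequestsPerChild can be illustrated with a toy model. It is not Apache's allocator: it simply assumes a child's footprint stays at the largest request it has served so far, and that a nonzero limit makes the parent replace the child after that many requests. The function and parameter names are made up for the illustration.

```python
def simulate_child(request_sizes_mb, max_requests_per_child=0):
    """Toy model of one server child's memory footprint, in MB.

    Assumption: pool memory is reused but never returned, so the footprint
    equals the largest request served so far. A limit of 0 means the child
    is never recycled. Returns the footprint seen after each request.
    """
    footprint = 0
    served = 0
    history = []
    for size in request_sizes_mb:
        if max_requests_per_child and served >= max_requests_per_child:
            # The old child exits; a fresh, small one takes over.
            footprint = 0
            served = 0
        footprint = max(footprint, size)
        served += 1
        history.append(footprint)
    return history
```

With no limit, one 60 MB request leaves the child at 60 MB for good; with a limit, the footprint drops back once the child is recycled. So a limit like 1000 bounds how long a peak lingers rather than making any single request cheaper.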
okay, got a clean home dir for `svn' now.
gonna check out locally from karl-repos
WHEW
badness
<gstein> what was the test for the local?
<gstein> badness?
this is a repos with 1 revision, a full import of the subversion
tree
then I tried to do a checkout...
(hold on, cut & paste may be broken here)
[svn@svn svn]$ svn co file:///usr/www/repositories/svn -d svn
svn_error: #21066 : <Berkeley DB error>
Berkeley DB error while closing `nodes' database for filesystem /usr/www/repo\
sitories/svn/db:
DB_INCOMPLETE: Cache flush was unable to complete
[svn@svn svn]$
there
oh
never mind
my bad.
Sorry. I hope I didn't screw something up.
(wrong repository!)
<gstein> heh
now checkout working fine
<gstein> lessee if it hung
<gstein> no hang
oh well
:-)
<gstein> but I'm working on errors first. we'll try the hang later :-)
<gstein> and we're still seeing those incompletes... solve in a while tho
okay, I'm gonna run ~svn/tmp/mass-commit, which will use ~svn/tmp/svn as
its wc
.
All totally local, no Apache involved.
[action] gstein nods
<gstein> interesting to see what happens with those svn-design docs
yeah.
so far everything peachy
when you run `mass-commit', do you manually redirect stdout and stderr?
<gstein> yah
ah.
I shoulda done that. I'm just running in a shell buffer.
<gstein> ./mass-commit > mass-output 2> mass-error
I'll see everything, but I won't get err separated from output
it's okay, there's absolutely no error output yet.
<gstein> nice
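For anyone driving such a script from a program rather than a shell, the same separation of output and errors might look like this. It is a minimal sketch; run_with_separate_logs is a hypothetical helper, not part of the mass-commit script.

```python
import subprocess


def run_with_separate_logs(argv, out_path, err_path):
    """Run argv with stdout written to out_path and stderr to err_path.

    Returns the command's exit status.
    """
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        return subprocess.run(argv, stdout=out, stderr=err).returncode
```

Calling run_with_separate_logs(["./mass-commit"], "mass-output", "mass-error") would mirror the shell redirection above. The cost is the one already noted in the conversation: with two files, it is harder to tell which commit an error belongs to.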
<gstein> hmm...
yeah, a useful datapoint
<gstein> I wonder if I'm not escaping the delta text(!)
??????????
hey!
hmmm!
<gstein> xml-escaping
<gstein> ra_local doesn't have to
It's committed svn-design.txt twice and svn-design.html once, at least
right
Do you think...
<gstein> hmm. no... I don't either
No?
<gstein> that is in a PUT
[action] gstein thinks more...
/me listens to gstein think
<gstein> BANG BANG CHUGGA CHUGGA BOOM ... **GRRRRRRRINDDDDDD*
<gstein> :-)
oops finally saw some error
Commit succeeded.
Changing svn/trunk/expat-lite/hashtable.h
Changing svn/trunk/subversion/clients/win32/svn_com/SVNCOM.rc
Changing svn/trunk/notes/entries-handling.txt
Changing svn/trunk/notes/difftools/pics/xxdiff-README
Changing svn/trunk/subversion/tests/libsvn_vcdiff/target1.txt
Changing svn/trunk/subversion/include/svn_error.h
Commit succeeded.
svn_error: #21021 : <Attempted to lock an already-locked dir>
commit failed: while sending tree-delta.
svn_error: #21021 : <Attempted to lock an already-locked dir>
working copy locked: /home/svn/tmp/svn/trunk
Changing svn/trunk/subversion/libsvn_wc/lock.c
<gstein> ah!
<gstein> hmm. but I don't know where my errors occur. gonna stop mine. got enuf data.
heh, funny, it had a locking error around lock.c
<gstein> rofl
/me savors the irony
okay, I got the same error again many commits later
mostly it's succeeding
<gstein> I believe that most of my commits succeeded this time
<gstein> the repos is at rev=663
In a PUT request, you tell it the length of the data before you send
right?
<gstein> yah
/me wonders aloud how we know the length of an svndiff stream before we
<gstein> dump the diff to a file. then put it.
send it...
ah.
thanx
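The "dump it to a file, then PUT it" approach can be sketched as below: a stream of unknown length is spooled to a temporary file so that its size is known before anything is sent. This illustrates the idea only; spool_stream is a hypothetical name and this is not the ra_dav code.

```python
import tempfile


def spool_stream(chunks):
    """Write an iterable of byte chunks to a temporary file.

    Returns (file_object, length). The file is rewound, ready to be read
    as a request body whose Content-Length is `length`.
    """
    spool = tempfile.TemporaryFile()
    length = 0
    for chunk in chunks:
        spool.write(chunk)
        length += len(chunk)
    spool.seek(0)
    return spool, length
```

The price is that the whole body is written to disk once and read back once before the request goes out.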
[action] gstein mumbles .svn_commit.3164.00001.ra_dav
is that what those things are?
[action] gstein nods
/me gags nicely
:-)
<gstein> hehe
<gstein> you can gag when I tell that we'll spin up a thread with a pipe between them so we can avoid the file
<gstein> one thread writes to pipe. neon reads from pipe.
[action] gstein grins
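The thread-plus-pipe idea could look roughly like this: one thread writes the stream into a pipe while the caller's reader consumes the other end, so no temporary file is needed. This is a sketch under assumptions; stream_through_pipe and consume are hypothetical names, and the consume callable stands in for whatever HTTP library would read the request body.

```python
import os
import threading


def stream_through_pipe(chunks, consume):
    """Feed byte chunks to consume() through an OS pipe via a writer thread.

    consume receives a readable binary file object and returns a result.
    """
    read_fd, write_fd = os.pipe()

    def writer():
        # Closing the write end signals end-of-stream to the reader.
        with os.fdopen(write_fd, "wb") as sink:
            for chunk in chunks:
                sink.write(chunk)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        with os.fdopen(read_fd, "rb") as source:
            return consume(source)
    finally:
        thread.join()
```

The trade-off is that the reader no longer knows the total length in advance, so the length question raised earlier would have to be answered some other way.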
so far, precisely two errors.
Don't know what commit I'm on... oh, I can count them, hold on
#204
<gstein> 205
<gstein> 'ps' does wonders
how did you do that?
[action] gstein grins evilly
no, really.
I'd like to know.
<gstein> 'ps'
did you run emacsclient?
ps???????????
But. but, but.
<gstein> sure. it shows the svn commit
<gstein> and the log message says what commit #
<gstein> which is on the cmdline, of course
OH!
Okay. (whew)
I felt my boundaries breakin' apart there for a moment.
<gstein> heh
I'm gonna let it run to the end.
3 working copy errors in 313 commits.
<gstein> I had 6 in 600 some
okay.
<gstein> all working copy locked errors
<gstein> then there is the PUT problem
So same ratio, and probably same error, in fact.
Wait -- you mean you had 6 total in 600, or 6 of this kind of error,
and N others?
<gstein> svn_error: #21021 : <Attempted to lock an already-locked dir>
<gstein> commit failed: while sending tree-delta.
<gstein> svn_error: #21021 : <Attempted to lock an already-locked dir>
<gstein> working copy locked: /home/gstein/tmp/svn/trunk
right.
6 of those in 600.
<gstein> yup
But besides those, how many other errors?
<gstein> where'd yours appear? numeric wise? any pattern?
I'll look
<gstein> the locked thing. and the PUT thing. that appears to be it.
<gstein> no other errors in 600 plus
<gstein> 662
<gstein> I don't have interleaved output, tho... I can't tell what the errors apply to.
<gstein> gonna regen the script
<gstein> with some add'l output
Well, I don't see any pattern, except that it's always /home/svn/tmp/svn/trunk
that's locked, never some subdir of it...
<gstein> right
you know what?
I think we can live with these things, if the db isn't getting corrupted.
Let's try some `svnlook' on karl-repos?
<gstein> I've got a repeatable error trying to commit svn-design.txt
wonderful!
what the heck is it about that file...
<gstein> right :-)
(not that it will be in our tree _anyway_, so who cares about the bug, right?)
<gstein> heh
<gstein> ltmain.sh doesn't go in the tree either
<gstein> but there *is* something ugly here :-)
anything else you can think of?
Yes, definitely something ugly.
What is the repro recipe? Is it fairly simple
?
<gstein> we've got the lock problem. and this PUT issue.
<gstein> I'll work on PUT. you work on lock?
Just "try to commit svn-design.txt"?
<gstein> dunno the full recipe, but my repos is in a state such that the PUT doesn't work
Okay.
Let's do this:
<gstein> so yah: committing the file simply barfs with the 500
[action] gstein listens
I'm going to be running out of time in about an hour
<gstein> I bet :-)
(have to wrap Ian and Janine's present, go home, find a suit,
drive to airport, etc. :-)
The result of the PUT bug is that the commit never takes, right?
Nothing goes into the repos.
<gstein> appears so, yes
the result of the lock bug is... run svn cleanup. :-)
or something.
but the point is, it also doesn't take.
<gstein> the file is at rev 23. the rest is 600 some
huh?
<gstein> nah. the lock just resolves itself it seems
hmm.
whichever.
<gstein> oh. svn-design.txt is rev=23. the repos is at 600+
(not blowing problem off, just saying it can be debugged along with
everything else, later)
<gstein> on the lock. dunno the extent.
It may well corrupt the working copy.
But it seems unlikely that any problem is touching the server.
I mean, a wc error means the wc can't do stuff. :-)
<gstein> the 500 is not completing the commit. should be safe.
The one thing we need to make sure of is that the hang/corruption bug
doesn't happen.
<gstein> right
So I'm going to svnlook on karl-repos, you do same.
shit
wait,
bad error,
<gstein> all right then. let me restart the mass commit and try the look.
just saw it go by
hold on
[action] gstein holds...
Changing svn/trunk/INSTALL
svn_error: #21066 : <Berkeley DB error>
commit failed: while calling close_edit()
svn_error: #21066 : <Berkeley DB error>
Berkeley DB error while reading representation for filesystem /usr/www/reposi\
tories/karl-repos/db:
Cannot allocate memory
svn_error: #21021 : <Attempted to lock an already-locked dir>
commit failed: while sending tree-delta.
svn_error: #21021 : <Attempted to lock an already-locked dir>
working copy locked: /home/svn/tmp/svn/trunk
Changing svn/trunk/subversion/clients/win32/svn_com/MarshalArray.h
Out of memory ??
sigh
but it's still committing and succeeding after that, oddly enough.
did someone's pool get freed?
anyway, I think we can proceed with the `svnlook' test.
<gstein> ok. gotta reset first.
reset what?
<gstein> hmm. maybe not. just go again, I guess. :-)
you mean reset the machine?
<gstein> no no
:-)
<gstein> meant reset the repos and the WC
/me guessed wrong
oh!
nah.
<gstein> but don't need to.
<gstein> just rerun and keep appending changes
<gstein> ok. gonna start mass-commit now.
you're starting mass-commit?
on which repos?
<gstein> on /repos/svn
ooooooh
I'm getting DB_INCOMPLETE now.
And other errors.
"unexpected end of svndiff input"
<gstein> wow. lots of locked dirs
this is all part of the same mass-commit script, over ra_local
<gstein> I started mine again. it is rerunning.
svnlook ran and saw commit #603 just fine, though.
<gstein> urf. ran the whole thing. it is grinding...
I think we're straining this box.
<gstein> hehe
I got another DB_INCOMPLETE: cache flush was unable to complete.
<gstein> what is the svnlook cmd you're running
now both of them have happened during my svnlook cmd
<gstein> yah. the apache-based repos is spitting that out all the time.
just "svnlook /usr/www/repositories/karl-repos"
<gstein> goes in the error_log now
<gstein> the whole tree?
so it will show author, log, date, and latest tree summary
right
<gstein> as in... grind grind grind. pop! tree?
uh, right
<gstein> ok. just takes a while
wow.
it completed.
and the commits are still going, some succeeding.
<gstein> yup. my commits are still going.
So.
<gstein> gonna let one more svnlook go.
you started it? (or want me to?)
<gstein> then I'm gonna do it and hit ^C when it pauses :-)
okay
go for it
<gstein> commits still seem to be going
i'm waiting for one to _succeed_, though. :-)
there, one succeeded.
<gstein> yah. svnlook can take a while to print that tree
I wonder, Greg, if there are a lot of stale txns in that repository, though.
<gstein> uh oh. may have hung.
shit
<gstein> I bet there are.
what makes you say may have hung?
I just saw another successful commit go by
you mean svnlook may have hung?
<gstein> svnlook appeared to have hung when I did "svnlook . changed"
<gstein> the mass commit stopped at 107 when I did that
<gstein> hasn't resumed
oh, you were running against your own repos
I didn't realize
<gstein> yup
crap
<gstein> damn. hung.
I'm still able to run 'svnlook changed' a lot and get no hang
But so what. hmm.
can phone you now?
<gstein> I can call you. whats the #
(312) 922-2784
Received on Sat Oct 21 14:36:36 2006