On 08/04/2010 05:38 AM, Julian Foad wrote:
> Hi Blair. I had a look but can't see how this can occur.
>
> On Sat, 2010-07-31, Blair Zajac wrote:
>> We run long running, persistent processes that expose the svn_fs and svn_repos
>> APIs via RPC (using ZeroC's Ice). We've seen some occasional core dumps after
>> the processes have been up for two weeks or more. We're running against 1.6.5.
>>
>> The processes have LRU caches of fs_t and repos_t that they use if there is an
>> RPC request onto the same repos or revision.
>>
>> The core dumps only appear to happen during the commit phase in checksum related
>> code. In the first, svn_checksum_dup() is getting a pointer to address 0x48
>> which is odd and the other svn_checksum__from_digest() is getting a pointer to
>> an out of bounds address.
>>
>> We haven't had the cycles to investigate them because they happen rarely and we
>> have more pressing issues. I'm posting them here just in case anybody else has
>> seen these or has an idea.
>>
>> Regards,
>> Blair
>>
>> #0 svn_checksum_dup (checksum=0x48, pool=0x19730448) at
>> subversion/libsvn_subr/checksum.c:190
>> #1 0x00002abf50dabf87 in read_representation (contents_p=0x2aaadf514c50,
>> fs=0x2aab0af5a730, rep=0x19731478, pool=0x19730448)
>> at subversion/libsvn_fs_fs/fs_fs.c:2859
>
> fs_fs.c:2859 is in rep_read_get_baton() which is actually a subroutine
> of read_representation(), presumably in-lined by the compiler and
> therefore not visible in the debug info.
>
> The pointer 'rep->md5_checksum' is apparently 0x48 at this time.
>
>> #2 0x00002abf50dac762 in svn_fs_fs__rep_contents_dir (entries_p=0x2aaadf514cd0,
>> fs=0x2aab0af5a730, noderev=0x19940428, pool=0x19730448)
>> at subversion/libsvn_fs_fs/fs_fs.c:3351
>
> (Similarly, fs_fs.c:3351 is in get_dir_contents(), a subroutine of
> svn_fs_fs__rep_contents_dir().)
>
> (FWIW, we can see from the location of line 3351 that rep->txn_id is
> null.)
>
>> #3 0x00002abf50dadc83 in svn_fs_fs__set_entry (fs=0x2aab0af5a730,
>> txn_id=0x15ac3ca8 "683229-ewby", parent_noderev=0x19940428, name=0x1bf5a8d0
>> "hash:28",
>> id=0x2aab1316b378, kind=svn_node_dir, pool=0x1723ae78) at
>> subversion/libsvn_fs_fs/fs_fs.c:4651
>
> (FWIW, at this level the rep of interest has been passed in as
> 'parent_noderev->data_rep'.)
>
>> #4 0x00002abf50da1dee in set_entry (parent=0x2aab127a9378, name=0x1bf5a8d0
>> "hash:28", id=0x2aab1316b378, kind=svn_node_dir,
>> txn_id=0x15ac3ca8 "683229-ewby", pool=0x1723ae78) at
>> subversion/libsvn_fs_fs/dag.c:346
>> #5 0x00002abf50db33a7 in merge (conflict_p=0x2aab0d1b4de8,
>> target_path=0x2abf50db7530 "/", target=0x2aab127a9378, source=0x2aab18bde098,
>> ancestor=0x27a17c98, txn_id=0x15ac3ca8 "683229-ewby",
>> mergeinfo_increment_out=0x0, pool=0x2aaaf76aed28) at
>> subversion/libsvn_fs_fs/tree.c:1422
>> #6 0x00002abf50db34fa in merge_changes (ancestor_node=0x27a17c98,
>> source_node=0x2aab18bde098, txn=<value optimized out>, conflict=0x2aab0d1b4de8,
>> pool=0x2aaaf76aed28) at subversion/libsvn_fs_fs/tree.c:1602
>> #7 0x00002abf50db3cae in svn_fs_fs__commit_txn (conflict_p=0x0,
>> new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>> at subversion/libsvn_fs_fs/tree.c:1701
>> #8 0x00002abf509782cb in svn_repos_fs_commit_txn (conflict_p=0x0,
>> repos=0x2aab10c99df8, new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>> at subversion/libsvn_repos/fs-wrap.c:51
>> #9 0x00000000004fd819 in SvnUtil::commit_transaction (repos=0x2aab10c99df8,
>> fs=0x2aab0af5a730, txn=0x15ac3c80, client_info=<value optimized out>,
>> pool=@0x2aab309374f8) at SvnUtil.cpp:276
>> #10 0x00000000004f10e8 in SvnTransactionI::commit (this=0x2aab309374f0,
>> clientInfo=@0x2aaadf5153d0, current=<value optimized out>)
>> at SvnTransactionI.cpp:964
>> #11 0x000000000049ef47 in VnpIce::SvnTransaction::___commit
>> (this=0x2aab309374f0, __inS=@0x2aaadf515580, __current=@0x2aaadf515580)
>> at vnp_svn_backend.cpp:6796
>> #12 0x00002abf50099ea0 in IceInternal::Incoming::invoke (this=0x2aaadf515580,
>> servantManager=@0x2aaadf515930) at Incoming.cpp:447
>> #13 0x00002abf5006a9ab in Ice::ConnectionI::invokeAll (this=0x23f96b50,
>> stream=@0x2aaadf515cb0, invokeNum=1, requestId=4873841, compress=0 '\0',
>> servantManager=@0x2aaadf515930, adapter=@0x2aaadf515920) at
>> ConnectionI.cpp:2440
>> #14 0x00002abf5006eac7 in Ice::ConnectionI::message (this=0x23f96b50,
>> stream=@0x2aaadf515cb0, threadPool=<value optimized out>) at ConnectionI.cpp:1110
>> #15 0x00002abf5017f92a in IceInternal::ThreadPool::run (this=0x150ee070) at
>> ThreadPool.cpp:519
>> #16 0x00002abf5018013f in IceInternal::ThreadPool::EventHandlerThread::run
>> (this=0x151016d0) at ThreadPool.cpp:759
>> #17 0x00002abf50448e02 in startHook (arg=0x151016d0) at Thread.cpp:368
>> #18 0x0000003ae6a062f7 in start_thread () from /lib64/libpthread.so.0
>> #19 0x00000033b0ad1b6d in clone () from /lib64/libc.so.6
>> Current language: auto; currently c
>
> These two back-traces appear to be showing the very same code path.
>
> Again due to in-lining, I presume, fs_fs.c:2859 calls svn_checksum_dup()
> but the debugger shows, in this case, only its subroutine
> svn_checksum__from_digest().
>
>> #0 0x00002ac15f8bd28c in svn_checksum__from_digest (
>> digest=0x646c6f663a706e76<Address 0x646c6f663a706e76 out of bounds>,
>> kind=1949987428, result_pool=<value optimized out>)
>> at subversion/libsvn_subr/checksum.c:77
>
> Here, digest is ASCII "dmof:pnv" and kind is hex 743a7264 is ASCII
> "t:rd"...
>
>> 77 memcpy((unsigned char *)checksum->digest, digest, DIGESTSIZE(kind));
>> #1 0x00002ac15f057f87 in read_representation (contents_p=0x2aaab92d7c50,
>> fs=0x2aab80d53590, rep=0x2aabbd57e748, pool=0x2aab1c15a0e8)
>> at subversion/libsvn_fs_fs/fs_fs.c:2859
>
> ... meaning that 'rep->md5_checksum' points to readable memory but not
> to a valid checksum object.
>
> Thus, in both cases, it looks like that pointer was either uninitialized
> or subsequently corrupted.
>
> I searched in 1.6.5 code for creation of a representation_t structure
> and found nowhere that leaves it uninitialized. At least one place
> allocates the structure with apr_pcalloc (zero-initialized) and doesn't
> subsequently fill in ->md5_checksum, but that's not what you're seeing.
> I searched for creation of node_revision_t structures and found no
> places where the data_rep member is uninitialized.
>
> Thus I can only suggest looking for memory corruption.
Hi Julian,
Thanks for taking your time to look at this, I appreciate it.
I haven't followed the code path in detail, but my gut agrees with you
on memory corruption.
The code that uses svn_fs.h and svn_repos.h is C++ with very little raw
pointer manipulation, so my inclination is that it's not the C++ code
treading over this memory but something in the svn layer.
This seems like a hard problem to track, as it occurs only after a long
period of uptime in a high load performance setting so we just can't
have debuggers or other tools attached to the process. I'll run our our
backend severs on the dev cluster in valgrind and see if we pick up
anything there.
Regards,
Blair
Received on 2010-08-04 19:32:57 CEST