Re: Another 1.4 release critical bug

From: Garrett Rooney <rooneg_at_electricjellyfish.net>
Date: 2006-08-01 23:07:25 CEST

On 7/31/06, Philip Martin <philip@codematters.co.uk> wrote:
> "Garrett Rooney" <rooneg@electricjellyfish.net> writes:
>
> > I'm now able to run basic_tests.py over svnserve with a bdb 4.4 repos
> > with apr built with pool debugging and under valgrind with no errors,
> > so I think we're on the right track ;-)
>
> Unfortunately it doesn't work with --enable-dso which is how some
> (most?) distributions build Subversion. I think the problem is that
> the callbacks happen after the DSO has been unloaded and so the
> callback code is no longer part of the executable. Using trunk:
>
> $ valgrind -q svnadmin create --fs-type bdb repo
> ==16619== Jump to the invalid address stated on the next line
> ==16619== at 0x44CF79D: ???
> ==16619== by 0x41DC43B: pool_clear_debug (apr_pools.c:1372)
> ==16619== by 0x41DC624: pool_destroy_debug (apr_pools.c:1457)
> ==16619== by 0x41DC423: pool_clear_debug (apr_pools.c:1369)
> ==16619== by 0x41DC624: pool_destroy_debug (apr_pools.c:1457)
> ==16619== by 0x41DC6EC: apr_pool_destroy_debug (apr_pools.c:1499)
> ==16619== by 0x41DC241: apr_pool_terminate (apr_pools.c:1269)
> ==16619== by 0x41DD6B7: apr_terminate (start.c:82)
> ==16619== by 0x42ACAE1: exit (exit.c:54)
> ==16619== by 0x4296E3D: (below main) (libc-start.c:245)
> ==16619== Address 0x44CF79D is not stack'd, malloc'd or (recently) free'd
> ==16619==
> ==16619== Process terminating with default action of signal 11 (SIGSEGV)
> ==16619== Access not within mapped region at address 0x44CF79D
> ==16619== at 0x44CF79D: ???
> ==16619== by 0x41DC43B: pool_clear_debug (apr_pools.c:1372)
> ==16619== by 0x41DC624: pool_destroy_debug (apr_pools.c:1457)
> ==16619== by 0x41DC423: pool_clear_debug (apr_pools.c:1369)
> ==16619== by 0x41DC624: pool_destroy_debug (apr_pools.c:1457)
> ==16619== by 0x41DC6EC: apr_pool_destroy_debug (apr_pools.c:1499)
> ==16619== by 0x41DC241: apr_pool_terminate (apr_pools.c:1269)
> ==16619== by 0x41DD6B7: apr_terminate (start.c:82)
> ==16619== by 0x42ACAE1: exit (exit.c:54)
> ==16619== by 0x4296E3D: (below main) (libc-start.c:245)
> Segmentation fault
>
> $ valgrind -q svn ls file://`pwd`/repo
> ==16621== Jump to the invalid address stated on the next line
> ==16621== at 0x47BC79D: ???
> ==16621== by 0x422943B: pool_clear_debug (apr_pools.c:1372)
> ==16621== by 0x4229624: pool_destroy_debug (apr_pools.c:1457)
> ==16621== by 0x4229423: pool_clear_debug (apr_pools.c:1369)
> ==16621== by 0x4229624: pool_destroy_debug (apr_pools.c:1457)
> ==16621== by 0x42296EC: apr_pool_destroy_debug (apr_pools.c:1499)
> ==16621== by 0x4229241: apr_pool_terminate (apr_pools.c:1269)
> ==16621== by 0x422A6B7: apr_terminate (start.c:82)
> ==16621== by 0x4314AE1: exit (exit.c:54)
> ==16621== by 0x42FEE3D: (below main) (libc-start.c:245)
> ==16621== Address 0x47BC79D is not stack'd, malloc'd or (recently) free'd
> ==16621==
> ==16621== Process terminating with default action of signal 11 (SIGSEGV)
> ==16621== Access not within mapped region at address 0x47BC79D
> ==16621== at 0x47BC79D: ???
> ==16621== by 0x422943B: pool_clear_debug (apr_pools.c:1372)
> ==16621== by 0x4229624: pool_destroy_debug (apr_pools.c:1457)
> ==16621== by 0x4229423: pool_clear_debug (apr_pools.c:1369)
> ==16621== by 0x4229624: pool_destroy_debug (apr_pools.c:1457)
> ==16621== by 0x42296EC: apr_pool_destroy_debug (apr_pools.c:1499)
> ==16621== by 0x4229241: apr_pool_terminate (apr_pools.c:1269)
> ==16621== by 0x422A6B7: apr_terminate (start.c:82)
> ==16621== by 0x4314AE1: exit (exit.c:54)
> ==16621== by 0x42FEE3D: (below main) (libc-start.c:245)
> Segmentation fault

Ok, so here's my first cut at a patch. The fs loader changes seem
pretty reasonable, but for some totally unclear reason they are not
enough to make the problem go away. I also have to make the ra loader
code allocate its dsos in a global pool; if I don't, I continue to get
segfaults during global pool destruction. Why? I have no clue. Did
this stuff ever work?

Note that this is totally not committable at this point, but I'd LOVE
it if someone else could look at it and tell me what they think...

-garrett

[[[
Make a valiant attempt at curing the insanity of the dso loading code.

* subversion/libsvn_fs/fs-loader.c
  (dso_cache): New global hash of dso objects.
  (load_module): Add locking, use the global common pool to hold the
   dso, and use the new cache.

* subversion/libsvn_ra/ra_loader.c
  (load_ra_module): Use a global pool to hold the dso for the ra layer
   since for some reason we crash during apr_terminate otherwise.
]]]
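
For anyone trying to picture the failure, here is a toy model in plain C.
It is not APR code and not the attached patch: toy_pool, toy_module and
simulate_shutdown are made-up names, and the ordering rule (child pools
are destroyed first, then a pool's own cleanups run newest-first) is an
assumption of the model. It only sketches why a cleanup that points into
a dso goes wrong once the dso's unload has already run, and why holding
the dso in a long-lived global pool sidesteps that.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of this is APR or Subversion. */
#define MAX_CLEANUPS 8
#define MAX_CHILDREN 4

typedef void (*cleanup_fn)(void *baton);

typedef struct toy_pool {
    cleanup_fn fns[MAX_CLEANUPS];
    void *batons[MAX_CLEANUPS];
    int n_cleanups;
    struct toy_pool *children[MAX_CHILDREN];
    int n_children;
} toy_pool;

typedef struct toy_module {
    int loaded;      /* 1 while the module's code is "mapped" */
    int good_calls;  /* callbacks that ran while the module was loaded */
    int bad_jumps;   /* callbacks that ran after unload (a crash in real life) */
} toy_module;

static void toy_pool_init(toy_pool *p)
{
    p->n_cleanups = 0;
    p->n_children = 0;
}

static void toy_pool_add_child(toy_pool *parent, toy_pool *child)
{
    assert(parent->n_children < MAX_CHILDREN);
    parent->children[parent->n_children++] = child;
}

static void toy_cleanup_register(toy_pool *p, cleanup_fn fn, void *baton)
{
    assert(p->n_cleanups < MAX_CLEANUPS);
    p->fns[p->n_cleanups] = fn;
    p->batons[p->n_cleanups] = baton;
    p->n_cleanups++;
}

/* Model assumption: destroy children first, then run cleanups newest-first. */
static void toy_pool_destroy(toy_pool *p)
{
    int i;
    for (i = 0; i < p->n_children; i++)
        toy_pool_destroy(p->children[i]);
    p->n_children = 0;
    for (i = p->n_cleanups - 1; i >= 0; i--)
        p->fns[i](p->batons[i]);
    p->n_cleanups = 0;
}

/* Stands in for the cleanup that unloads the dso. */
static void unload_module(void *baton)
{
    toy_module *m = baton;
    m->loaded = 0;
}

/* Stands in for a cleanup whose code lives inside the dso. */
static void module_callback(void *baton)
{
    toy_module *m = baton;
    if (m->loaded)
        m->good_calls++;
    else
        m->bad_jumps++;
}

/* Returns how many callbacks ran after the module was unloaded. */
int simulate_shutdown(int dso_in_global_pool)
{
    toy_pool global, sub;
    toy_module mod = { 1, 0, 0 };

    toy_pool_init(&global);
    toy_pool_init(&sub);
    toy_pool_add_child(&global, &sub);

    /* The unload cleanup goes wherever the dso was "loaded". */
    toy_cleanup_register(dso_in_global_pool ? &global : &sub,
                         unload_module, &mod);
    /* The callback into the module is registered later, in the global pool. */
    toy_cleanup_register(&global, module_callback, &mod);

    toy_pool_destroy(&global);
    return mod.bad_jumps;
}
```

With the dso in the subpool, simulate_shutdown(0) counts one callback
after unload, the toy equivalent of the jump to an unmapped address in
Philip's trace. With the dso registered first in the global pool,
simulate_shutdown(1) counts none, because the unload runs last.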


text/plain attachment: broken-dso-stuff.diff

Received on Tue Aug 1 23:08:10 2006
