Friday, March 21, 2008

Exceptions, gcc and Solaris 10 AMD 64bit

For a little while now Alex and I have been battling a problem on Solaris 10 64bit AMD. We have been trying to port Firebird 2.0.x to 64bit Solaris. However, after fixing a few issues (e.g. defining the platform), we found that isql_static would core dump when trying to create a database. This is where the fun begins...

The default debugger for gcc shipped with Solaris 10 can't handle 64bit applications. Solution - build your own from a later version. At least this way we can now step through the Firebird code and try to see what's happening.

After some careful debugging sessions we found that Firebird itself seems to be OK. When you try to create a database, Firebird first tries to open it as an existing database. If it can't find one, it throws an internal exception and goes on to create it.
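The open-then-create flow described above can be sketched roughly as follows. This is not Firebird's code: `open_failed`, `existing_databases`, `open_database` and `open_or_create` are hypothetical names made up for illustration, and the "disk" is just an in-memory set. The point is only that the exception is part of the normal path, so a broken unwinder turns an ordinary create into a crash.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the engine's internal "cannot open file" error.
struct open_failed : std::runtime_error {
    explicit open_failed(const std::string& path)
        : std::runtime_error("cannot open " + path) {}
};

// Hypothetical in-memory "disk": the set of databases that already exist.
static std::set<std::string> existing_databases;

// Throws open_failed if the database does not exist yet.
void open_database(const std::string& path) {
    if (existing_databases.count(path) == 0)
        throw open_failed(path);
}

// Returns true if the database had to be created, false if it already existed.
bool open_or_create(const std::string& path) {
    try {
        open_database(path);
        return false;
    } catch (const open_failed&) {
        // Expected on first use: the open failed, so create the database.
        existing_databases.insert(path);
        return true;
    }
}
```

Calling `open_or_create` twice with the same path should create on the first call and simply open on the second.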

The problem in isql_static occurs when the exception is thrown...

Program received signal SIGSEGV, Segmentation fault.
0x0000000000488385 in __EH_FRAME_BEGIN__ ()

(gdb) bt
#0 0x0000000000488385 in __EH_FRAME_BEGIN__ ()
#1 0xfffffd7ffedacf3c in _Unwind_RaiseException_Body () from /lib/64/libc.so.1
#2 0xfffffd7ffedad129 in _Unwind_RaiseException () from /lib/64/libc.so.1
#3 0xfffffd7ffef3c71e in __cxa_throw (obj=0x1, tinfo=0x1, dest=0x474e5543432b2b00)
at /builds/sfw10-gate/usr/src/cmd/gcc/gcc-3.4.3/libstdc++-v3/libsupc++/eh_throw.cc:75
#4 0x000000000056510b in Firebird::status_exception::raise (status_vector=0x8d19e0)
at ../src/common/fb_exception.cpp:197
#5 0x0000000000694137 in ERR_punt () at ../src/jrd/err.cpp:562
#6 0x0000000000693ced in ERR_post (status=335544344) at ../src/jrd/err.cpp:441
#7 0x00000000005c66d4 in PIO_open (dbb=0xfffffd7ffec29050, string=@0xfffffd7fffdec9e0,
trace_flag=false, connection=0x0, file_name=@0xfffffd7fffdeca20, share_delete=false)
at ../src/jrd/os/posix/unix.cpp:646

So Firebird looks OK, i.e. it's doing what it should do, but Solaris crashes with a core dump when the exception is thrown.

Some more detective work uncovered the following:

The problem is caused by the use of the -lc flag, which was explicitly added to the linker command line by autoconf.

Because we always use g++/gcc as the linker, there is no need for us to add libc explicitly - the compiler does that itself when it invokes ld. But autoconf seems to be doing it anyway.

When -lc is added explicitly, libc (the Solaris native library) ends up on ld's command line before libgcc_s, which is where gcc's own exception support lives. When that happens, any call to a function that is present under the same name in both libraries goes to libc, not libgcc_s.

There is at least one such function - _Unwind_RaiseException(), which is called by g++ generated code when an exception is thrown. Instead of the libgcc_s version of _Unwind_RaiseException(), the libc version is called (as the backtrace above shows), and not with the parameters the Solaris native function expects - cue core dump.
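One way to check which library a symbol such as `_Unwind_RaiseException` actually resolves to is to ask the dynamic linker at run time. The helper below is a minimal sketch, not something we used in the Firebird build: `defining_object` is a hypothetical name, and it assumes a platform whose `<dlfcn.h>` provides `dlsym` with `RTLD_DEFAULT` and `dladdr` (Solaris and Linux both do; older glibc versions also need `-ldl` at link time).

```cpp
#include <dlfcn.h>
#include <string>

// Returns the path of the loaded object that defines `symbol` in the
// global lookup scope, or an empty string if it cannot be resolved.
std::string defining_object(const char* symbol) {
    // Look the symbol up using the default search order, which is the
    // same order that decides between libc and libgcc_s.
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (!addr)
        return std::string();

    // Ask which loaded object contains that address.
    Dl_info info;
    if (dladdr(addr, &info) == 0 || !info.dli_fname)
        return std::string();
    return info.dli_fname;
}
```

Printing `defining_object("_Unwind_RaiseException")` from a binary built with and without the explicit -lc should show whether the lookup lands in libc or in libgcc_s.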

Solution:

Don't use -lc

2 comments:

Glenn West said...

Hi Paul.

Unrelated.

I was looking at your rather nice Firebird/Interbase Sync Utility, and it got me wondering about using Firebird for large Rails applications. There, a single server can never "keep" up.

So multiple "db" servers is a norm.

Any thinking on Rails and Scaling using multiple database clones?

Glenn West
glennswest@yahoo.com.sg
mentalpagingspace.blogspot.com

Unknown said...

We have just solved a similar problem: our binary was linked against a shared library and when an exception was thrown in the shared library, a segmentation fault occurred in the file libgcc_s.so. It was not possible to catch the exception and continue the program.

Our system is Debian, we are working on 64 bit computers with g++ 4.3.2. The problem mentioned above did not show on 32 bit (Open SuSE) systems.

The reason was: the shared library was linked against another static library. Our automake mechanism told us that this is not portable, but everything seemed to work... except catching exceptions.

Now we have changed our build mechanism. The shared library is not linked against the static library anymore. Instead we include the static library when we link our executable and suddenly everything works correctly. We can throw exceptions and catch them and there are no segmentation faults anymore.

We found that solution with the following steps: Create a very simple example consisting just of a main file and a shared library (if you need one). Try to throw and catch exceptions. Probably you will succeed. It does not matter whether you use autoconf/libtool or not. Now increase the complexity until you arrive at your real world situation. As soon as you detect that your exception is not caught anymore you probably are very close to the solution!
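A starting point for that kind of experiment might look like the sketch below. It is a single file for brevity; in the real test `library_call` would be moved into the shared library and `call_and_catch` would stay in the main program. Both names are hypothetical, chosen only for this illustration.

```cpp
#include <stdexcept>
#include <string>

// In the real experiment this function would live in the shared library.
void library_call(bool fail) {
    if (fail)
        throw std::runtime_error("thrown inside library_call");
}

// In the real experiment this would be part of the main program.
// Returns the message of the caught exception, or "" if nothing was thrown.
std::string call_and_catch(bool fail) {
    try {
        library_call(fail);
    } catch (const std::exception& e) {
        return e.what();
    }
    return std::string();
}
```

If `call_and_catch(true)` returns the message, exception handling works in that configuration. Add the real link flags and libraries one at a time and rerun; the first change that makes it crash or stop catching points at the culprit.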