10 April 2007

Heart of Darkness

I had heard tales of its existence. A dark, soul-less being whose only purpose in this world was to cause pain and misery. It strikes without warning and without prejudice. By the time your eyes come upon it, it is already too late. It has but one name, and few dare speak it. But hear me now when I say that I have seen the beast, and live to tell the tale.

It was this past Monday evening. I was online changing out redo log groups to change from 20M logfiles to 100M logfiles. I finished and all was well. An hour later I checked back and peeked into the alert log, and there it stood. It literally sent a shiver down my spine and made my stomach turn. I will show you what I saw now. Women and children should probably leave the room.

Dear readers, GAZE UPON YOUR DESTRUCTION:

ORA-00600: internal error code, arguments: [17281], [1001], [0x79E0FFE0], [], [], [], [], []

For those of you still reading and have not had your retinas burned out at the mere sight of its name, I shall now tell of my encounter and eventual escape from ... the horror.

Naturally my first thought was that I had somehow SRSLY messed up the redo log group changeover. I went over my steps and log output, and nothing was wrong. The ORA-00600 didn't come up until 20 minutes after I was done. So I did some more log perusing and the 00600 had been coming up earlier that day, and in fact had begun at around 9:30 PM on Sunday evening, coincidentally after some application upgrades.

Metalink note 39361.1 told me that this particular error had to deal with invalid cursors being closed. It also said it should have been fixed in Oracle 10gR1. The stack trace told me that it was JBoss doing it. Looking at timestamps, it was JBoss's IdleRemover that was generating the ORA-00600. BUT WHY??!!?!

The stack trace also had "ORA-01403: no data found" in it, but I couldn't find any mention of this error in the alert log or in the application logs. Then, as if sent by angels on a ray of light, Lewis Cunningham posted an article on the fact that Oracle doesn't report NO DATA FOUND (ORA-01403) exceptions when it should. I quickly confirmed this in both sqlplus and a quick jdbc application.

My theory now was that the JBoss application was generating an ORA-01403, but Oracle wasn't throwing the error back. Instead, the JBoss cursor was left invalid. Then when the IdleRemover awoke to clean up connections, it generated the ORA-00600 when it got to that particular connection. Perhaps not full of technical argument, but it made sense in my head.

One of the developers mocked up a similar test to run through JBoss on his workstation. He ran the code to call a stored function that would raise an ORA-01403, and then we waited for the IdleRemover to kick in. Then it happened ... NOTHING.

I was a bit crestfallen, until a minute later the developer noted that he was still using the ojdbc14.jar from 10.2.0.2, and we had upgraded our jars in production to 10.2.0.3 this past Sunday evening. Sure enough, we were able to duplicate the error 100% when using the 10.2.0.3 ojdbc14 driver.

So JBoss 4.0.5 GA with ojdbc 10.2.0.3 and Oracle RDBMS 10.2.0.2 don't jive. It doesn't help that Oracle isn't throwing ORA-01403, which I think is a bug, as does Lewis. Nor does it help that our application did not appear to notice the ORA-01403 (which it wouldn't be able to) or ORA-00600. So we're reverting back to ojdbc 10.2.0.2 for now. We'll have to find out why JBoss/Hibernate is triggering the ORA-01403, and then see if Oracle knows why the 10.2.0.3 jars are triggering the ORA-00600. Of course it could very well be the case that the ORA-00600 should be triggered, and that the 10.2.0.2 jars are mistaken for not doing so. The list of bug fixes in ojdbc 10.2.0.3 don't mention anything with regard to that, though.

UPDATE: Eddie Awad provides some more insight into Oracle's behavior with regard to NO DATA FOUND.

1 comment:

  1. fortunately for you and for me the dreaded ORA-00600 does not always mean disaster. I found interesting the question about the NO_DATA_FOUND exception, i think that is a philosophical question on what has to be the behaviour of Oracle

    ReplyDelete