03 June 2014

Beware April 2014 PSU and Golden Gate Integrated Capture Users

When the latest quarterly Patch Set Update (PSU) came out from Oracle, we planned to get it in place as soon as reasonable due to our need to stay current with security patches, and my need to apply what I had assumed were well-tested bug fixes for our 11.2.0.3 installations. However we were in for an unpleasant surprise.

We first applied the April 2014 PSU on our staging & development database hosts and things ran fine. After two weeks, we put it into production. The next morning we noticed that our Golden Gate extracts were abending with a message like this:

2014-05-13 01:41:44 ERROR OGG-02077 Extract encountered a read error in the asynchronous reader thread and is abending: Error code 600, error message:
ORA-00600: internal error code, arguments: [knlogcPackColList:101], [1], [], [], [],[], [], [], [], [], [], []. 

Obviously the PSU was the only change from the day before when these GoldenGate extracts ran fine. The error itself seemed to match Bug 16306373 (OGG Lightweight capture fails with ORA-600 [knlogcPackColList:101]), which affects integrated capture extracts, which was what we were trying to use.

So we had two questions to answer:

  1. Was it really the PSU, and if so do we need to rollback?
  2. Why didn't we see this in development/staging?




The answer to #2 came pretty quickly: the extracts had been stopped prior to applying the PSU in development and were never restarted. Our use of GoldenGate is for a new project that is still not quite solid in form, and so the extracts might be shut down for weeks at a time. However the DBA team was not aware of this (something that will be added to the checklist in the future) and so that part was never tested. We decided to turn the extracts on and, sure enough, we saw the problem within 30 minutes.

As far as #1, we opened an SR with Oracle Support. At first they assured us it wasn't the PSU, but within a few hours had started backing away from that confidence and by the end of the day were suggesting we roll back. However we decided to hold off on that due to the luxury of not needing those extracts to run quite yet (we had just been capturing data in production to make sure we could handle the workload). That gave us and Oracle Support some breathing room to get it fixed.

Eventually Oracle Support did confirm that it was Bug 16306373. The patch for this bug, however, conflicted with another patch we had installed. So they created a new merge patch MLR 18689166 for us. This was applied that evening to just our development database home, and immediately we saw more problems.

First, regularly-scheduled materialized view refreshes were failing with:

ORA-06508: PL/SQL: could not find program unit being called: "SYS.DBMS_SNAPSHOT" 

Then we saw DataPump exports failing with:

ORA-06508: PL/SQL: could not find program unit being called: "SYS.DBMS_LOGREP_UTIL" 

On top of that, GoldenGate gave us these errors trying to unregister the extract:

GGSCI (stagingdb) 14> unregister extract ext1ol database
ERROR: Cannot register or unregister EXTRACT EXTFOO because of the following SQL error: OCI Error ORA-01775: looping chain of synonyms (status = 1775). 

It definitely looked like something in that patch had hosed the catalog. Oracle quickly found the problem:

The problem with the invalid dictionary objects after the installation of patch 18689166 is due to the execution of script e1102000.sql that is being called from postinstall.sql 
e1102000.sql is a downgrade scripts and removes some objects from the data dictionary resulting in other dependent objects to become invalid. 

It is here that I should point out that Oracle had this merge patch in their QA for over 4 days. I could have seriously hurt myself from simultaneously rolling my eyes and face-palming as violently as I did.

The fix was to restart the instance into UPGRADE mode and run the catupgrd.sql script from $ORACLE_HOME/rdbms/admin, which rebuilds the catalog. We confirmed that this did fix the problem in DEV, and proceeded to patch the other databases, this time commenting out the call to e1102000.sql n the postinstall.sql script, per Oracle's workaround.

So, not only did the April 2014 PSU introduce a bug, but then the patch for that bug made things incredibly worse. We were fortunate in that we did not yet need the GoldenGate extracts in production. However many others will not have that luxury and will want to double-check that bug and the PSU if they're using GoldenGate and integrated capture.

No comments:

Post a Comment