21 July 2015

The Leap Second is No Laughing Matter (if you have java on an older Linux kernel)

Earlier this month we began getting frequent email warnings from our EM12c server that some agents were experiencing read timeouts. Then we saw that the emagent java process was using A LOT of CPU, regularly around 500% and sometimes as high as 800% as seen in "top". Restarting the agent did nothing.

I opened an SR with Oracle Support, where I was first instructed to apply a JDBC patch and then a PSU agent patch. No change in behavior.

Courtney Llamas from the Oracle EM team reached out and suggested it might be due to the leap second, directing me to these MOS docs:

  • Enterprise Manager Management Agent or OMS CPU Use Is Excessive near Leap Second Additions on Linux (Doc ID 1472651.1)
  • Leap Second Hang - CPU Can Be Seen at 100% (Doc ID 1472421.1)
The workaround is to restart ntpd (or reboot the server):

# /etc/init.d/ntpd stop
# date -s "`date`"        # reset the system clock
# /etc/init.d/ntpd start


I monitored top while my system admin restarted ntpd and reset the clock. As soon as he did, java CPU usage dropped like a rock.
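
If you want to watch the same thing on your own host, a loop like this does the trick. The grep pattern is only a guess at how the agent's java process shows up in the process list, so adjust it to match your install:

# poll CPU use of java processes every 5 seconds and pick out the agent's
while true; do
    ps -o pid,pcpu,etime,args -C java | grep -i agent
    sleep 5
done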



While I'm incredibly grateful that Courtney provided the solution in basically five minutes, I'm even more upset that Oracle Support had me chasing everything but the actual cause for the 20 days that my original SR has been open.

Of course the real joke is on me, since I first reported the error on July 1 and we all joked on Twitter about how it was probably due to the leap second. The fault also lies with me, since I failed to notice that our kernel version (2.6.32-220) was still vulnerable to this (fixed in 2.6.32-279). See Maris Elsins' great write-up (which I apparently skimmed too lightly).
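
Checking exposure takes seconds; the running kernel is what matters, and anything older than the 2.6.32-279 fix deserves a closer look:

# uname -r           # the running kernel; anything before 2.6.32-279 is still affected
# rpm -q kernel      # the kernel packages actually installed (RHEL/OEL)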

Upgrade to Oracle 12c, Get the Huge Trace Files for Free!

Last week we began testing a copy of our production database on Oracle 12c (12.1.0.2). This past weekend we were alerted that the disk holding our ADR diagnostic directory was nearly full, and we noticed some pretty big (and recent) trace files there. Then it happened twice more. The last time, the disk filled before we could get to it (thankfully only dev, and during the evening), which tells you how fast it filled: the largest file was over 18GB and took only 6 hours to get that big.
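
A quick way to see what is eating the disk, and to clear it out once you've saved what you need, is a find plus an adrci purge. The path, home name, and purge age below are only examples; run "show homes" in adrci and check your own diagnostic_dest first:

# find /u01/app/oracle/diag/rdbms -name '*.trc' -printf '%s %p\n' | sort -rn | head    # ten largest trace files
# adrci exec="set home diag/rdbms/mydb/mydb; purge -age 60 -type trace"                # drop trace files older than 60 minutes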

I saved the three biggest trace files to a large NFS mount and ran a trace file purge (the adrci one-liner above) just to get our dev database back up. When I looked at those files, I saw they were all from sqlplus sessions running the same DELETE statement, dumping reports like this:


----- Cursor Obsoletion Dump sql_id=7q0kj0sp5k779 -----
Parent cursor obsoleted 1 time(s). maxchild=1024 basephd=0xae2a2ca308 phd=0xae2a2ca308
----- Dump Cursor sql_id=7q0kj0sp5k779 xsc=0x7ffbf191fd50 cur=0x7ffbf2702670 -----


At first I thought it was a user session setting some oddball trace event. However, our team found Doc ID 1955319.1 (Huge Trace Files Created Containing "----- Cursor Obsoletion Dump sql_id=%s -----").

Long story short, it's an unpublished bug introduced in 12.1.0.2 along with the cursor obsoletion diagnostic dump "enhancement". I don't think they intended files this big, although the Doc ID title does say "Huge". The workaround is to disable the dump completely via this hidden parameter:


alter system set "_kks_obsolete_dump_threshold" = 0;
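
If you want to confirm what the parameter is currently set to, the usual hidden-parameter query works (run it as SYSDBA, since the x$ views aren't visible to normal users):

-- current value of the cursor obsoletion dump threshold (0 = dump disabled)
select a.ksppinm name, b.ksppstvl value
from   x$ksppi a, x$ksppcv b
where  a.indx = b.indx
and    a.ksppinm = '_kks_obsolete_dump_threshold';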


There is this note at the end though:
Note: The underlying cursor sharing problem should always be highlighted and investigated to ensure that the reason for the non-sharing is known and fully understood.
Which is definitely good advice.
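
For us that means figuring out why this DELETE keeps growing new child cursors. Something like this is a reasonable starting point, using the sql_id from the dump above; V$SQL_SHARED_CURSOR has dozens of Y/N "reason" columns and the ones below are just a few common suspects:

-- how many child cursors the statement has piled up
select sql_id, count(*) child_cursors
from   v$sql
where  sql_id = '7q0kj0sp5k779'
group  by sql_id;

-- which sharing criteria each child failed (Y in a column means that reason applied)
select child_number, bind_mismatch, bind_length_upgradeable,
       optimizer_mismatch, roll_invalid_mismatch
from   v$sql_shared_cursor
where  sql_id = '7q0kj0sp5k779'
order  by child_number;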