21 July 2015

The Leap Second Is No Laughing Matter (If You Run Java on an Older Linux Kernel)

Earlier this month we began getting frequent email warnings from our EM12c server that some agents were experiencing read timeouts. Then we saw that the emagent Java process was using A LOT of CPU: regularly around 500%, but sometimes as high as 800%, as reported by "top". Restarting the agent did nothing.

I opened an SR with Oracle Support, where I was first instructed to apply a JDBC patch and then a PSU agent patch. No change in behavior.

Courtney Llamas from the Oracle EM team reached out and suggested it might be due to the leap second, directing me to these MOS docs:

  • Enterprise Manager Management Agent or OMS CPU Use Is Excessive near Leap Second Additions on Linux (Doc ID 1472651.1)
  • Leap Second Hang - CPU Can Be Seen at 100% (Doc ID 1472421.1)

The workaround is to stop ntpd, reset the system clock, and start ntpd again (or simply reboot the server):

# /etc/init.d/ntpd stop
# date -s "`date`"    # reset the system clock
# /etc/init.d/ntpd start


I monitored top while my system admin restarted ntpd and reset the clock. As soon as he did, java CPU usage dropped like a rock.



While I'm incredibly grateful that Courtney provided the solution in basically 5 minutes, I'm even more upset that Oracle Support had me doing everything but that for the 20 days my original SR had been open.

Of course the real joke is on me, since I first reported the error on July 1 and we all joked on Twitter that it was probably due to the leap second. The fault also lies with me, since I failed to notice that our kernel version (2.6.32-220) was still vulnerable to this bug (fixed in 2.6.32-279). See Maris Elsins' great write-up (which I apparently skimmed too lightly).
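
Had I run something like the following, I might have caught this sooner. It is only a rough sketch: it compares the running kernel release against the 2.6.32-279 threshold mentioned above, using GNU sort's version ordering. The function name kernel_predates_fix and the variable FIXED_RELEASE are my own inventions for illustration, and the comparison is only meaningful for kernels on the same 2.6.32 line; vendors backport fixes, so treat the answer as a hint and confirm against the MOS docs.

```shell
#!/bin/bash
# Threshold taken from the post: the fix landed in 2.6.32-279.
# (Assumption: your kernel follows the same 2.6.32-NNN numbering.)
FIXED_RELEASE="2.6.32-279"

# Hypothetical helper: exits 0 if the given kernel release sorts
# strictly before FIXED_RELEASE, 1 otherwise.
kernel_predates_fix() {
    local release="$1"
    local short="$release"
    local lowest

    # Keep only the leading "x.y.z-build" part,
    # e.g. 2.6.32-220 from 2.6.32-220.el6.x86_64
    if [[ "$release" =~ ^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+) ]]; then
        short="${BASH_REMATCH[1]}"
    fi

    # Same release as the fix: not vulnerable by this rough test.
    [ "$short" = "$FIXED_RELEASE" ] && return 1

    # sort -V orders version strings; the first line is the older one.
    lowest=$(printf '%s\n%s\n' "$short" "$FIXED_RELEASE" | sort -V | head -n 1)
    [ "$lowest" = "$short" ]
}

# Check the kernel this script is running on.
if kernel_predates_fix "$(uname -r)"; then
    echo "Kernel $(uname -r) is older than $FIXED_RELEASE: check the leap second docs"
else
    echo "Kernel $(uname -r) is not older than $FIXED_RELEASE"
fi
```

On our server, uname -r reported a 2.6.32-220 release, so this would have printed the warning.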
