28 January 2016

Sending notifications from Oracle Enterprise Manager to VictorOps

We use VictorOps for our paging/notification system, and we're pretty happy with it so far. On the DBA team, we've just been using a simple email gateway to send notifications from Oracle Enterprise Manager (EM) to VictorOps. Even then, we can only send the initial notification and can't really send an automated recovery without more hacking than it's worth. Not a big deal, but it would be nice to have some more functionality.

So yesterday I decided I'd just sort it all out since VictorOps has a nice REST API and Enterprise Manager has a nice OS script notification method framework. The initial result can be found on my github: entmgr_to_victorops.sh.

It doesn't do anything fancy, but it will handle the messages sent by your notification rules and pass them on to VictorOps. It keys on the incident ID to track which events it is sending follow-up (i.e. RECOVERY) messages for.
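
Under the hood it's essentially just a curl POST to the VictorOps generic REST endpoint. A minimal sketch (the API key, routing key, and variable names here are placeholders, not the script's actual code):

$ curl -X POST -H "Content-Type: application/json" \
    -d '{"message_type": "CRITICAL", "entity_id": "'"$INCIDENT_ID"'", "state_message": "'"$MESSAGE"'"}' \
    "https://alert.victorops.com/integrations/generic/20131114/alert/$API_KEY/$ROUTING_KEY"

Sending a later POST with "message_type": "RECOVERY" and the same entity_id is what closes the incident out, which is why keying on the incident ID matters.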

Please do let me know if you find any bugs or have any requests or suggestions for it.

Many thanks to Sentry Data Systems (my employer) for allowing me to share this code. It isn't mind-blowing stuff but should save you a few hours of banging your head against a wall.

10 November 2015

Just XFS Things

$ uptime

16:36:42 up 4 days, 12:28, 6 users, load average: 1029.20, 995.42, 865.77

See https://www.centos.org/forums/viewtopic.php?f=47&t=52412 and https://access.redhat.com/solutions/532663 (requires RedHat subscription) for details. And defrag your XFS volumes.
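
If you want to check how fragmented a volume actually is before (and after) defragmenting, the xfs tools make it quick; device and mount point here are placeholders:

# xfs_db -r -c frag /dev/sdb1    # report the fragmentation factor (read-only)
# xfs_fsr -v /mnt/data           # defragment the mounted filesystem online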

21 July 2015

The Leap Second is No Laughing Matter (if you have java on an older Linux kernel)

Earlier this month we began getting frequent email warnings from our EM12c server that some agents were experiencing read timeouts. Then we saw that the emagent java process was using A LOT of CPU, regularly around 500% but sometimes as high as 800% as seen from "top". Restarting the agent did nothing.

I opened an SR with Oracle Support, where I was first instructed to apply a JDBC patch and then a PSU agent patch. No change in behavior.

Courtney Llamas from the Oracle EM team reached out and suggested it might be due to the leap second, directing me to these MOS docs:

  • Enterprise Manager Management Agent or OMS CPU Use Is Excessive near Leap Second Additions on Linux (Doc ID 1472651.1)
  • Leap Second Hang - CPU Can Be Seen at 100% (Doc ID 1472421.1)

The workaround is to restart ntpd (or reboot the server):

# /etc/init.d/ntpd stop
# date -s "`date`"    # reset the system clock
# /etc/init.d/ntpd start

I monitored top while my system admin restarted ntpd and reset the clock. As soon as he did, java CPU usage dropped like a rock.

While I'm incredibly grateful that Courtney provided the solution in basically 5 minutes, I'm even more upset that Oracle Support had me doing everything but that for the 20 days my original SR had been open.

Of course the real joke is on me, since I first reported the error on July 1 and we all joked on twitter about how it was probably due to the leap second. The fault also lies with me, since I failed to notice that our kernel version (2.6.32-220) was still vulnerable to this (it was fixed in 2.6.32-279). See Maris Elsins' great write-up (which I apparently skimmed too lightly).
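
Checking your own exposure is a one-liner; per the versions above, anything older than 2.6.32-279 is suspect:

$ uname -r
2.6.32-220.el6.x86_64    # uh oh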

Upgrade to Oracle 12c, Get the Huge Trace Files for Free!

Last week we began testing a copy of our production database on Oracle 12c. This past weekend we were alerted that the disk holding our ADR diagnostic directory was nearly full. We noticed some pretty big (and recent) trace files there. Then it happened twice more. The last time filled the disk before we could get to it (thankfully only dev, and during the evening), meaning it filled up fast. The largest file was over 18GB, and it took only 6 hours to get that big.

I saved the three biggest trace files to a large NFS mount and did a trace file purge just to get our dev database back up. When I looked at those files, I saw they were all sqlplus sessions running the same DELETE statement after reports like this:

----- Cursor Obsoletion Dump sql_id=7q0kj0sp5k779 -----
Parent cursor obsoleted 1 time(s). maxchild=1024 basephd=0xae2a2ca308 phd=0xae2a2ca308
----- Dump Cursor sql_id=7q0kj0sp5k779 xsc=0x7ffbf191fd50 cur=0x7ffbf2702670 -----

At first I thought it was a user session setting some oddball trace event. However, our team found Doc ID 1955319.1 (Huge Trace Files Created Containing "----- Cursor Obsoletion Dump sql_id=%s -----").

Long story short, it's an unpublished bug introduced with the cursor obsoletion diagnostic dump "enhancement". I don't think they intended this, though, even though they did say "Huge". The workaround is to disable it completely, via this hidden parameter:

alter system set "_kks_obsolete_dump_threshold" = 0;

There is this note at the end though:
Note: The underlying cursor sharing problem should always be highlighted and investigated to ensure that the reason for the non-sharing is known and fully understood.
Which is definitely good advice.
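
And if you find yourself doing the same emergency cleanup I did, adrci handles the trace purge nicely. A rough sketch, with a made-up diag home (note that purge -age is specified in minutes):

$ adrci
adrci> show homes
adrci> set home diag/rdbms/mydb/mydb
adrci> purge -age 60 -type trace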

06 January 2015

Returning Error Codes from sqlplus to Shell Scripts

When I have to run SQL scripts, I prefer to do it via executable ksh scripts (I use ksh because it's usually an Oracle prerequisite, so I can rely on it being installed). One thing that I've only recently started adding to my scripts is exiting sqlplus on error and returning the ORA error code. This is done via the WHENEVER SQLERROR feature of sqlplus:

sqlplus / as sysdba <<EOF

        whenever sqlerror exit sql.sqlcode
        alter tablespace foo
                rename to foo_old;

        create tablespace foo
                datafile size 100m;

        alter table foo move tablespace $TABLESPACE_NAME nocompress;
EOF

RETURN_CODE=$?

if [ $RETURN_CODE -ne 0 ]; then
        echo "*** Tablespace renaming error code $RETURN_CODE. ***"
        exit $RETURN_CODE
fi

In this example we rename a tablespace and then create a new tablespace in its place and move a table there. It seems rather contrived but this is actually what I'm doing to move tables to an uncompressed and unencrypted tablespace to test storage vendor deduplication claims. But I digress ...

To test this, I used a tablespace that doesn't exist, which results in an ORA-00959 error. However, the return code I got was 191. I spent a good portion of the day testing and retesting with all sorts of debug output until I stumbled across this comment. Since Linux (and, I'm told, Unix) return codes only go up to 255, the ORA sqlcode value of 959 wraps around, leaving only the remainder of a simple modulo operation:

959 % 256 = 191
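
You can see the wrapping right in the shell, no sqlplus required:

$ sh -c 'exit 959'; echo $?
191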

And suddenly the skies were cleared.

As always, hope this helps!

12 November 2014

Making Copies of Copies with Oracle RMAN

I recently had need to make a copy of an image copy in Oracle RMAN. Since it wasn't immediately obvious to me, I thought it was worth sharing once I had it sorted out. I was familiar with making a backup of a backup, but had never thought about making a copy of a copy.

First you need to create an image copy of your database or tablespace. For the sake of example, I'll make a copy of the FOO tablespace. The key is to assign a tag to it that you can use for later reference. I'll use the tag "DTSCOPYTEST":

backup as copy
    tablespace foo
    tag 'DTSCOPYTEST'
    format '+DG1';

So I have my image copy in the DG1 diskgroup. Now say we want to make a copy of that for some testing purpose and put it in a different diskgroup. For that, we need the "BACKUP AS COPY COPY" command, and we'll want to specify the copy we just took by using the tag we assigned:

backup as copy
    copy of tablespace foo
    from tag 'DTSCOPYTEST'
    tag 'DTSCOPYTEST2'
    format '+DG2';

As you'd guess, RMAN makes a copy of the first copy, writing it to the specified format location.
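
If you want to verify the result, listing the copies will show both of them along with their tags:

list copy of tablespace foo;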

As always, hope this helps!

04 November 2014

The Importance of Backups (A Cautionary Block Recovery Tale)

Just wanted to share a quick story with everyone. As I was in the airport waiting to fly to Oracle OpenWorld this year, I noticed a flurry of emails indicating that part of our storage infrastructure for our standby production database had failed. Long story short, my co-workers did a brilliant job of stabilizing things and keeping recovery working. However, we ended up with more than a few block corruptions.

Using the RMAN command "validate database", we could then see the list of corrupt blocks in the v$database_block_corruption view. All that was needed was to run "recover corruption list" in RMAN, which will dig into datafile copies and backups to do what it can to repair or replace the corrupt blocks and then recover the datafiles. Of course, nothing is ever that easy for us!
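
For reference, the RMAN side of it boils down to just two commands (with the view checked from sqlplus in between); a sketch:

validate database;
# corrupt blocks are now recorded in v$database_block_corruption
recover corruption list;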

The storage we were writing our weekly backups to had been having problems, and the latest weekly backup had failed. We ended up having to go back 2 weeks into our backups to get the datafile blocks and archivelogs needed to eventually complete the corruption recovery. I also immediately moved our backups to more reliable storage, so that we're never in the situation of wondering whether or not we have the backups we need.

So, triple-check your backup plan, validate your backups and TEST RECOVERY SCENARIOS! You can't say your backups are valid until you use them to perform a restore/recovery, and you don't want to find out the hard way that you forgot something.