29 January 2013

Learning to Love the ADR - Part 2: Diagnostic Files and ADRCI

This post is the second in a short series about Oracle 11g ADR. These posts are based on the presentations I gave at the NoCOUG Summer Conference in August 2011 and at the RMOUG Training Days in February 2012.

Wherever the DIAGNOSTIC_DEST is (remember that the default is the $ORACLE_BASE value), Oracle will create a "diag" directory underneath it. Under this you'll find directories based on instance type, for example "rdbms" or "asm" or "tnslsnr". Dig deeper to find directories per database/instance and then the alert, trace, incident, etc. files. What is new with the ADR is the use of XML files for the alert log, as "log.xml" in the "alert" directory. The human-readable alert_XXX.log that you're used to is now in the "trace" directory. One important distinction is that the XML files are maintained by Oracle, archived once they reach 10Mb in size and purged based on policy settings that you can define. The trace files, just as with previous versions, require you to set up some sort of regular rotation/archiving/purging to prevent them from growing out of control. There are other advantages to the XML files, as we'll see in future posts.

SO. MANY. FILES.

Another not-so-fun surprise for those upgrading to Oracle 11g is that the new version generates a LOT more trace data than before. If you monitor those directories for size, your thresholds are likely to be blown away. Some of this was tamed in 11gR2, thankfully. One way to help cut back on some of trace (trc/trm suffixes) file generation is to set the hidden parameter _disable_health_check to TRUE. As with any hidden/underscore parameter, do NOT set this without approval from Oracle support (at least not in production). Another way which is easier, is to run regular PURGE jobs via adrci, which we will get to in another post.

Problems and Incidents

Oracle and the ADR nicely track problems and incidents. It is important to note that they are not one and the same. Problems are critical errors in the database (eg ORA-00600, 07445, 04031, etc). An incident is a single occurrence of a problem. When an incident occurs, Oracle creates incident trace files in addition to the normal alert log message and error trace file. You'll see this noted after the "Incident details in:" line in the alert log. For example:

Incident details in: /home/oracle/app/oracle/diag/rdbms/orcl/orcl/incident/incdir_12201/orcl_ora_18457_i12201.trc


However, there is a smarter way to look for incidents and alerts than scraping the human-readable alert log like some kind of animal. We'll use the ADRCI.



ADRCI

ADRCI is the command-line tool to navigate and manage the ADR. The name itself stands for ADR Command Interpreter. When you launch ADRCI, you have to point it to a specific home directory to work under. You first get the list of homes with the "show homes" command, and then set the home you wish to work on. E.g.:

[oracle@localhost ~]$ adrci

ADRCI: Release 11.2.0.2.0 - Production on Fri Dec 28 14:32:41 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/home/oracle/app/oracle"
adrci> show homes
ADR Homes:
diag/rdbms/orcl/orcl
diag/tnslsnr/localhost/listener
adrci> set home diag/rdbms/orcl/orcl

Now I can view incident and alert log information pertaining to my database home. If I wanted to troubleshoot problems with my listener, I would set the home to that directory instead.

ADRCI HELP

Everything you want to know about ADRCI commands can be found by using the help facility with in ADRCI, it is really very well documented. I'm not going to explain everything you can do with ADRCI here, but do want to cover a few of the more useful commands. One listing I would take the time to digest is:

adrci> help show alert

You'll be glad you did. And that takes us into ...

Viewing Alert Log with ADRCI

One of the most useful features of ADRCI is being able to view the alert log. You can even use it to view a running tail of the alert log (even on Windows!). Everything starts with the "show alert" command. By default this will open the entire alert log in an external text editor (like Notepad or vim, it is based on the editor option), unless the "-term" option is specified. To get our running tail, we would use this command:

adrci> show alert -tail -f

Note that the -tail option implies the -term.

Now this by itself isn't anything new. Where the real power comes in is in the ability for adrci to show alert log entries based on predicate searches. Again, the "help show alert" command will show you all of the options, but one quick example would be this:

adrci> show alert -p "message_text like '%ORA-%'" -term;

This will return any alert log entries that have a message body containing an ORA- error. This is much smarter than running grep on the text alert log since it includes message timestamps and other contextual information than just getting a line of text. One thing to note is that the search is case-sensitive, and so far I haven't seen any way around that (no upper or lower function in adrci). You can also do patter-matching on other fields, such as host_id, user_id, originating_timestamp, etc.

For example, to get a list of all alert log messages in the last hour:

adrci> show alert -p "originating_timestamp >= systimestamp-1/24" -term;

Or, to get all ORA-600 messages in the last 30 days:

adrci> show alert -p "originating_timestamp >= systimestamp-30 and message_text like '%ORA-600%'" -term;

It should be obvious by now that this is perfect for alert log monitoring scripts. ADRCI can execute scripts and spool output to log files as well, just as SQL*Plus does. This leads us to ...

What about those incidents?

Ah yes. This all started because we wanted a quick way to find incidents reported in our database. Well, here it is:

adrci> show incident

That's it. You'll get a list of the incidents recorded in the ADRCI (based on the available XML data), including the incident ID, problem key and time of the incident. Later in this series, I'll set up an incident to demonstrate how a typical on-call DBA can use ADRCI for quick troubleshooting and information gathering using the incident ID.

ADRCI Scripts

I'm the master of the segue. Anyway, as I said, ADRCI can call scripts just like SQL*Plus. And, like SQL*Plus, it uses the "@" sign to execute them interactively:

adrci> @/home/dts/scripts/test.adrci

You can also tell adrci to execute a script from the command line, like this:

$ adrci script=/home/dts/scripts/test.adrci

Here's an example script that you might use to monitor the alert log for ORA errors in the past 5 minutes and write them to a log file:

# ADRCI script to find alert log errors
SPOOL /home/dts/logs/alert_log_errors.log
ECHO "ALERT LOG ERRORS:"; 
SET HOMEPATH diag/rdbms/orcl/orcl; 
SHOW ALERT -TERM -P "MESSAGE_TEXT LIKE '%ORA-%' AND ORIGINATING_TIMESTAMP >= SYSTIMESTAMP-5/1440";
SPOOL OFF

The semi-colon line terminator is optional, the script runs the same with or without them. As you can see, the pound character "#" is used to start comments, and the "echo" command just outputs text. Output from one example run on my sandbox is:

ALERT LOG ERRORS:

ADR Home = /home/oracle/app/oracle/diag/rdbms/orcl/orcl:
*************************************************************************
2012-07-12 11:40:06.309000 -05:00
ORA-1543 signalled during: create temporary tablespace migtemp
tempfile size 1G...

For my next article, however, we'll do a quick look at how the ADR purges old log files, and when it might not do so.

30 January 2013 ADDENDUM: X$DBGALERTEXT


I was working on the rest of this series when I realized I forgot to mention X$DBALERTEXT. You (or your monitoring scripts) can query X$DBALERTEXT to get the same information as ADRCI does. This is an external table that references all of the log*.xml files, and so is dependent on those files being where they should be (just like ADRCI). According to MOS Doc ID 961682.1, this data is "effectively indexed by date," so date-based queries are "efficient and performant."

There is an inst_id column in this table, but from my tests it is NOT RAC-aware. The only data I could see was from the local node.

As of Oracle 11.2, there is a related view named V$DIAG_ALERT_EXT, which is for all intents and purposes, the same as X$DBGALERTEXT. However the new view includes alert information from ALL homes under the ADR (ASM, listeners, clients, etc.).

No comments:

Post a Comment