26 August 2009

Turn Off db_cache_advice To Avoid Latch Contention Bugs

Originally posted on The Pythian Group blog.

A couple of weeks ago, we noticed some timeouts in some of our standard Oracle RDBMS health check scripts on a new instance. I had just migrated this instance to bigger, better, badder hardware and so it had been given more SGA to use, namely a bigger buffer cache. The software version was still Oracle 10.2.0.2, as we wanted to introduce as few variables as possible (we were already moving to a new platform with an endian change).

At first the timeouts were infrequent, but over the course of a week they grew in frequency to the point where none of the checks were finishing in the allowed timeframe. We ran an AWR report, and tucked far down in the “Latch Activity” section, a colleague noticed this:
Latch Activity                             DB/Inst: FOO/foo  Snaps: 156-157
-> "Get Requests", "Pct Get Miss" and "Avg Slps/Miss" are statistics for
   willing-to-wait latch get requests
-> "NoWait Requests", "Pct NoWait Miss" are for no-wait latch get requests
-> "Pct Misses" for both should be very close to 0.0

                                           Pct    Avg   Wait                 Pct
                                    Get    Get   Slps   Time       NoWait NoWait
Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
...
simulator lru latch          10,032,617    3.3    0.7  44950      336,837    0.3
...
transaction branch alloc        112,412    0.0    0.0      0            0    N/A
undo global data                466,321    0.0    0.0      0            0    N/A
user lock                         7,440    0.8    0.4      1            0    N/A
          -------------------------------------------------------------

The “simulator lru latch” event brought us to MetaLink note 5918642.8 and bug 5918642, which affects 10g and 11g prior to 10.2.0.4 and 11.1.0.7, respectively. The bug is in the database buffer cache advisor, controlled by the parameter db_cache_advice, which defaults to ON (depending on statistics_level). The note simply states:
High simulator lru latch contention can occur when db_cache_advice is set to ON if there is a large buffer cache.
We simply set db_cache_advice to OFF (thankfully it is a dynamic parameter), and pretty quickly our checks were running just fine.
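
For reference, the change itself is a one-line ALTER SYSTEM; something like this (add SCOPE=BOTH if you run on an spfile and want the setting to survive a restart):

-- db_cache_advice is dynamic, so this takes effect immediately
ALTER SYSTEM SET db_cache_advice = OFF;

-- confirm the new setting
SELECT name, value FROM v$parameter WHERE name = 'db_cache_advice';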

My suggestion is to turn this off unless you are actively using the cache advisor to tune an instance, and to turn it back off as soon as that tuning is done.
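
If you do leave it on while tuning, the advisor’s projections are exposed through V$DB_CACHE_ADVICE; a query along these lines shows the estimated physical reads at each candidate cache size (the 8K block size in the predicate is just an assumption, adjust it to match your instance):

-- Review the advisor's size-vs-physical-reads projections for the
-- DEFAULT buffer pool before switching the advice off.
SELECT size_for_estimate, size_factor, estd_physical_reads
  FROM v$db_cache_advice
 WHERE name = 'DEFAULT'
   AND block_size = 8192
 ORDER BY size_for_estimate;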

NOTE: As Mladen Gogala pointed out in the comments, turning this off will cause problems if you are using automatic memory management (i.e., sga_target > 0). Re-pasting his comment here:
The problem with that advice is that it will prevent automatic memory management from resizing the buffer cache, and the instance will end up with a huge, mostly empty, shared pool and default buffer cache. Automatic memory management is biased toward the shared pool even with the cache advice turned on; without it, the buffer cache will be reduced to the minimum size, usually only 64MB. If you disable cache advice, I would also recommend disabling the automatic memory management and configuring the SGA manually.
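
If you do go that route, the manual setup amounts to zeroing sga_target and sizing the main pools yourself. A rough sketch only, with placeholder sizes rather than recommendations:

-- Illustrative only: switch off automatic SGA management and size
-- the major pools by hand to suit your workload.
ALTER SYSTEM SET sga_target = 0 SCOPE=BOTH;
ALTER SYSTEM SET db_cache_size = 4G SCOPE=BOTH;
ALTER SYSTEM SET shared_pool_size = 1G SCOPE=BOTH;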

19 August 2009

Setting up Network ACLs in Oracle 11g… For Dummies

Originally posted on The Pythian Group blog.


Having recently performed a test upgrade for a client from Oracle RDBMS 10g to 11g, I can tell you that one of the big changes likely to require action on your part as DBA is the new fine-grained access control for the packages UTL_SMTP, UTL_TCP, UTL_MAIL, UTL_HTTP and UTL_INADDR. As part of its checks, the Oracle 11g pre-upgrade tool will notify you of users that will require new privileges.

Of course, Oracle’s post-upgrade network ACL setup documentation is much more confusing than it needs to be, at least for simple minds like me. A colleague stepped forward with a simple set of commands for a basic setup that even the tired and stressed can understand.

I’ll share that here, with some basic explanation:

BEGIN
  -- Create the new ACL, naming it "netacl.xml", with a description.
  -- Also, provide one starter privilege, granting user FOO
  -- the privilege to connect.
  DBMS_NETWORK_ACL_ADMIN.CREATE_ACL('netacl.xml',
    'Allow usage to the UTL network packages', 'FOO', TRUE, 'connect');

  -- Now grant the privilege to resolve DNS names to FOO,
  -- and then grant connect and resolve to user BAR.
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'FOO', TRUE, 'resolve');
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'BAR', TRUE, 'connect');
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'BAR', TRUE, 'resolve');

  -- Specify which hosts this ACL applies to.
  -- For simplicity, we're saying all (*).
  -- You might want to specify certain hosts to lock this down.
  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL('netacl.xml', '*');

  -- The ACL changes only take effect once committed.
  COMMIT;
END;
/


As you can see, this example will let the FOO and BAR database users connect and resolve to any host. The ASSIGN_ACL section in the full package documentation (see link below) details how this can be used to lock down a user’s ability to make outside connections.
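
For example, to allow those same users to reach only a single (hypothetical) mail relay on the SMTP port, the assignment might look something like this instead:

BEGIN
  -- Restrict the ACL to one host and one port range.
  -- 'smtp.example.com' is a placeholder; substitute your own relay.
  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL('netacl.xml', 'smtp.example.com', 25, 25);
  COMMIT;
END;
/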

Of course, nothing beats reading the Oracle 11g DBMS_NETWORK_ACL_ADMIN documentation, where you can see some examples of stricter ACL setups.
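
One more handy thing: you can verify what ended up in place by querying the data dictionary, along these lines:

-- Which hosts/ports each ACL is assigned to
SELECT host, lower_port, upper_port, acl
  FROM dba_network_acls;

-- Which principals hold which privileges in each ACL
SELECT acl, principal, privilege, is_grant
  FROM dba_network_acl_privileges;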

Sending Timezone-Aware Email with UTL_SMTP

Originally posted at The Pythian Group blog.

I’m back again with another installment in what I hope will be a long-running “Quick Tips for Newbies” series.

At The Pythian Group, we have employees all over the globe, from our headquarters in Ottawa to regional offices in Boston, Prague, India and Sydney, and a few scattered remote workers in Seattle, Paris, Kiev, Brazil, South Africa and Wisconsin, among other places. In other words, we are spread across multiple timezones, and since it wasn’t too long ago that everyone was in Ottawa, this is something that still presents little quirks.

One such quirk: email generated by one of our internal Oracle instances, via a stored procedure that used UTL_SMTP to send the messages, did not have timezone information in the “Date” email header. As a result, messages would be stamped with the hour in the Eastern timezone (Ottawa time), but mail clients would assume that hour was local. Depending on where you are relative to Ottawa, that could be many hours in the past or future. Of course, this wouldn’t be noticed if you were in Ottawa or even Boston, both in Eastern. For the rest of us it was, at the very least, an annoyance, but one that is easily fixed.
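
The fix is simply to include the UTC offset when building the “Date” header. Here’s a minimal sketch using SYSTIMESTAMP; in the real procedure the resulting string would be written out with UTL_SMTP.WRITE_DATA along with the other headers:

DECLARE
  -- Build an RFC 2822-style Date header with the timezone offset,
  -- e.g. "Wed, 19 Aug 2009 14:05:33 -0400".
  l_date_hdr VARCHAR2(100);
BEGIN
  l_date_hdr := 'Date: '
             || TO_CHAR(SYSTIMESTAMP,
                        'Dy, DD Mon YYYY HH24:MI:SS TZHTZM',
                        'NLS_DATE_LANGUAGE=AMERICAN');
  DBMS_OUTPUT.PUT_LINE(l_date_hdr);
END;
/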

ORA-16069? You May Need A New Standby Controlfile

Originally posted on The Pythian Group blog.

On a recent Monday, I had to perform an emergency Oracle standby switchover for a client whose primary instance host had mysteriously rebooted itself over the previous day. Confidence in that host was, understandably, shaken.

The Oracle Data Guard configuration is a 3-instance setup using Data Guard Broker: one primary, we’ll call it OraA, feeding two standby instances, OraB and OraC. In this particular configuration, we perform switchovers between OraA and OraC. Caught in the middle is OraB, which is on a 60-minute standby delay.

After this particular switchover, OraB started complaining with this message in the alert log:


ORA-16069: Archive Log standby database activation identifier mismatch
We had seen this occasionally in prior switchovers, and the problem would fix itself once the standby delay passed and the OraB standby processed the log notifying it of the switchover. This time, however, recovery had stopped and more than enough time had elapsed. OraA and OraC were performing perfectly fine.

Most of the references I found suggested that the standby instance would have to be completely rebuilt. Not an appetizing option. A search of MetaLink turned up Bug 4048687, which seemed to describe a similar problem, although on a different OS/platform. The solution there was to recreate the standby controlfile. Trust me, it sounds more drastic than it is!

Here’s how to do it in just 6 easy steps!
  1. Shut down the misbehaving standby.
  2. Copy one of the current standby controlfiles somewhere for safekeeping (just in case).
  3. On the primary instance, create a new standby controlfile: alter database create standby controlfile as '/tmp/stdby.ctl';
  4. Transfer that new standby controlfile to the standby host.
  5. Copy the new controlfile to the controlfile location(s) used by the instance (you have more than one, right?).
  6. STARTUP MOUNT the standby instance. If you use the Data Guard Broker, it should automatically begin recovery for you; otherwise, restart managed recovery with alter database recover managed standby database disconnect;
Voila. Standby recovery should resume nicely, assuming logs are there to apply.
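
Once it is back, a quick way to confirm that managed recovery really is chewing through logs again is to watch the apply process on the standby; something like this (column list trimmed for readability):

-- On the standby: MRP0 should show APPLYING_LOG (or WAIT_FOR_LOG)
-- and the sequence# should advance as logs are applied.
SELECT process, status, thread#, sequence#
  FROM v$managed_standby
 WHERE process LIKE 'MRP%';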