As year end fast approaches, the volume of data processed in our ERP environment surges sharply; I don't think I would be mistaken in saying that this is the time of the year when the most data processing happens in our environment.
Problem Statement: Users (and, of course, our alerts) reported that various concurrent programs were failing with ORA-30036 (unable to extend segment in undo tablespace).
Symptoms and observations:
- Storage devices reported a considerable surge in the number of writes to disk starting at 9:00 AM on 11-Dec-07
- The alert log recorded ORA-30036
- The concurrent request log and/or output files reported ORA-30036
- An alert was continuously firing, reporting a long-running custom program
The Approach:
One of the key things was to stay focused on the issue. In these kinds of situations there are bound to be distractions, wherein users and developers alike "guess" that every other issue they are experiencing is "somehow" related to ORA-30036, that there is a "general" slowdown of the system, and so on. The following is the approach we used to narrow down the issue:
1. Using statspack, we narrowed down the top 5 SQL statements that were doing the most I/O. In fact, Nikhil, my colleague who was part of this throughout the issue, contributed significantly in this area.
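The statspack report itself is generated with the standard spreport.sql script; as a rough live cross-check (just a sketch, not part of our statspack output), a query along these lines against v$sql can also surface the heaviest I/O statements:
Eg. select * from (select hash_value, disk_reads, executions, substr(sql_text,1,60) stmt from v$sql order by disk_reads desc) where rownum <= 5;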
2. Obtained the details of the concurrent programs that had failed with the ORA-30036 error, and also noted whether they completed successfully in subsequent runs.
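In E-Business Suite, the failed requests can be pulled from the standard FND tables; the following is only a sketch, assuming the usual fnd_concurrent_requests / fnd_concurrent_programs_tl columns:
Eg. select r.request_id, p.user_concurrent_program_name, r.actual_start_date, r.actual_completion_date, r.completion_text
from fnd_concurrent_requests r, fnd_concurrent_programs_tl p
where r.concurrent_program_id = p.concurrent_program_id and r.program_application_id = p.application_id
and p.language = 'US' and r.status_code = 'E' and r.completion_text like '%ORA-30036%';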
3. Scanned the alert log to see when this error was first reported
4. Determined which active sessions were consuming undo space, using a join between v$session and v$transaction. We were primarily interested in this because we wanted to see whether the failed programs would succeed if we asked the users to "retry" the failed jobs.
Eg. select a.sid, a.username, b.used_urec, b.used_ublk from v$session a, v$transaction b where a.saddr=b.ses_addr;
Multiplying the used_ublk column by the db_block_size gives the undo space used by the session in bytes.
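To fold that arithmetic into the query itself, one option (a sketch, pulling db_block_size from v$parameter) is:
Eg. select a.sid, a.username, b.used_urec, round(b.used_ublk * p.value/1024/1024, 2) undo_mb from v$session a, v$transaction b, (select to_number(value) value from v$parameter where name = 'db_block_size') p where a.saddr = b.ses_addr;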
5. One of the most important and decisive pieces of information in this issue came from dba_undo_extents.
SELECT STATUS, SUM(BYTES)/1024/1024 MB, COUNT(*) EXTENT_COUNT FROM DBA_UNDO_EXTENTS GROUP BY STATUS;
The above query helped us see the amount of space held by "Active", "Expired" and "Unexpired" undo extents.
Active refers to extents in which active transactions are currently holding space.
Expired refers to extents that no longer hold active transactions and are also outside the time frame defined by the undo_retention parameter, and hence are available for reuse.
Unexpired refers to extents that are not active but still fall within the undo_retention time frame. The time is measured from the end of the transaction, which is either a commit or a rollback.
Also, a couple of other things that help in understanding the issue are queries on dba_free_space and dba_data_files, used in conjunction with dba_undo_extents, to see whether all of the space in the undo tablespace has been consumed.
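For example, something along these lines shows allocated versus free space for the undo tablespace (the name APPS_UNDOTS1 is only a placeholder; substitute the actual undo tablespace name):
Eg. select 'allocated' what, sum(bytes)/1024/1024 mb from dba_data_files where tablespace_name = 'APPS_UNDOTS1'
union all
select 'free', sum(bytes)/1024/1024 from dba_free_space where tablespace_name = 'APPS_UNDOTS1';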
In our case, it turned out that the custom program had been running for more than 24 hours and was updating close to 50 million records without committing in between. This program also showed up among the top SQL statements in the statspack report, which also implied that it was generating a lot of undo data. In addition, we had a large chunk of extents in our undo tablespace, over 26GB, in "unexpired" status, most of which could be attributed to the failed previous run of the same custom program, which had been terminated minutes before the kick-off of the run that was currently executing. This led to a situation where the other programs could not overwrite the unexpired extents and ran into ORA-30036.
Though the error message, when looked at in isolation in the context of the failed requests, might trigger a knee-jerk reaction to increase the undo tablespace, a careful analysis may reveal that tuning or redesigning the offending program is the right thing to do. Of course, there are genuine situations where increasing the undo tablespace is the right thing to do, since reducing undo_retention instead would result in an increased number of ORA-1555 occurrences. In such situations, the key is to strike a balance between undo_retention, the undo tablespace size and the ORA-1555 errors.
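One way to keep an eye on that balance (a sketch, assuming a release where v$undostat tracks these counters) is to watch both the space-pressure and ORA-1555 counts over time:
Eg. select begin_time, end_time, undoblks, maxquerylen, ssolderrcnt, nospaceerrcnt from v$undostat order by begin_time;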
- Aravind Kamath