Friday, December 28, 2007

Understanding Desktop Heap from an Apps DBA perspective

One of the issues that an Apps DBA might encounter sooner or later on WINDOWS 2000/2003 platform is "The application failed to initialize properly (0xc0000142). Click on OK to terminate the application." error in the event log of the server. This would also contain the name of executable that could not be initialized. This could be any of the oracle applications executables like OAMKSVC.exe, Apache.exe, ifweb60.exe and so on..If you check the memory availability using the "Task Manager" you will see plenty of memory available.

The Problem:

The above scenario implies that you are running into "desktop heap" issue. Let me try to explain in simple words what desktop heap means, the underlying important aspects from troubleshooting perspective. Desktop heap simply refers to a memory area within a session, which is used to store certain user interface objects like windows. Every process requires a desktop object to run under, every desktop requires a window station to run under and every window station requires a session to run under. One session can have only one interactive window station, but more than one non-interactive winstation. Also, there are three types of desktops - Default, Disconnect and winlogon. This also implies that one window station can have all three types of desktops running under it. I think that Disconnect and winlogon are applicable in a terminal services environment.

The question that would immediately arise is - how does this all relate to my "apache / forms" applications running as a service?

The (apache/forms or any other) service is considered as a "non-interactive" window station and implicitly uses a "Default" desktop. Windows operating system allows you to configure "Default" desktop heap value for both Interactive and non-interactive window stations separately.

Ascertain the issue:

One of the tools available to monitor and ascertain the presence of desktop heap issue is dheapmon. Run this tool to determine the status of desktop heap.

Settings and Corrective Action:

You can configure the value of "default" desktop heap for "interactive" and "non-interactive" window station using the following registry entry: HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows.

The value would look something like %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16

The second and third value (KBs) in the "SharedSection" in the above value represent the desktop heap for interactive and non-interactive window stations. This can be configured. However, please note that each of the sessions would allocate this amount of desktop heap for every "default" desktop running under the session.

Note that if your apache service is running with the "LocalSystem" account, without the "Allow service to interact with desktop" option, then the amount of desktop heap available is 512 (per the above example). This value is allocated for each service as each of the service is considered as non-interactive desktop. However, when you have large number of concurrent users using the forms or apache, the chances are 512 KB is not enough. In such situations, you can check the "Allow service to interact with desktop" option, there by leveraging on the interactive desktop heap value for the Window station 0 (console), which is 3072. One word of caution here, if 3072 is depleted, your apache will be "hung" until someone clicks on the "OK" button on the console for the apache to continue.

The downside of increasing the 512 value of the non-interactive window station to 3072 is that, every service will occupy this amount of memory. This applies to every service that you may have on the server like anti-virus software, backup software, volume manager softwares, system management softwares etc.


One last word, do not worry about the "disconnect and winlogon" desktop heaps, as issues are infrequent in this area. I am not aware if these are configurable or not either.

Wish You Happy Holidays and Happy New Year.

- Aravind Kamath

Thursday, December 13, 2007

ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS'

As year end is fast approaching, the volume of data processed in our ERP environment sees a sharp surge and i dont think i would be mistaken in saying that this is the time of the year when maximum data processing happens in our environment.

Problem Statement: Issue reported by users (and of course alerts) that different concurrent programs are experiencing ORA-30036.

Symptoms and observations:

- Storage devices reported a considerable surge in number of writes to the disk starting 9:00 AM 11-Dec-07
- Alert log has recorded ORA-30036
- The concurrent logs and/or output files have reported ORA-30036
- Alert continuosly firing reporting a long running custom program

The Approach:

One of the key things was to stay focussed on the issue. In these kind of situations, there are bound to be distractions where in users and developers alike would "guess" that every other issue they are experiencing is "somehow" related with ORA-30036 and that there is a "general" slowdown of the system etc etc. The following is the approach we used to narrow down on the issue:

1. Using the statspack, we narrowed down on the top 5 sqls that were doing the max amount of i/o. In fact, Nikhil, my colleague who was part of this all throught the issue contributed significantly in this area.
2. Obtain the details of the concurrent programs that have failed with the ORA-30036 error and also noted down if they had completed successfully in the subsequent runs.
3. Scanned the alert log to see when this error was first reported
4. Effort to determine which active sessions are consuming the undo space using a join on v$session and v$transaction. This we were primarily interested because we wanted to see if the failed programs would succeed if we asked the users to "retry" the failed jobs.

Eg. select a.sid, a.username, b.used_urec, b.used_ublk from v$session a, v$transaction b where a.saddr=b.ses_addr;

Multiplying the used_ublk column with the db_block_size would give us the size of undo space used by the session in bytes.

5. One of the most important and the decisive piece of information in this issue was the information derived from dba_undo_extents.

SELECT DISTINCT STATUS, SUM(BYTES)/1024/1024, COUNT(*) FROM DBA_UNDO_EXTENTS GROUP BY STATUS;

The above query helped us to see the amount of space held by "Active", "Expired" and "UnExpired" undo extents.

Active refers to the extents where the active transactions are holding space.
Expired refers to the extents that are not processing "Active Transactions" as well as out of the time frame defined by undo_retention parameter and hence are available for reuse.
UnExpired refers to the extents that are not active, but fall within the undo_retention time frame. The time is calculated from the end of transaction which is either a commit or a rollback.

Also, a couple other things that will help understand the issue are queries on dba_Free_Space and dba_data_Files to see if all of the space in undo is consumed or not, in conjunction with dba_undo_extents.

In our case, it turned out that the custom program was running for more than 24 hours and was updating close to 50 million records without committing in between. This program also showed up in the top sqls in the statspack report. It would also imply that this program would also generate a lot of undo data. Also, we had a large chunk of extents in our undo tablespace - over 26GB in "unexpired" status most of which would be attributed to the failed previous run of the custom program which was terminated minutes before the kick-off of the same custom program that was currently running. This led to a situation where the other programs could not overwrite the unexpired extents and faced the ORA-30036 error.

Though the error message, when looked in isolation in the context of failed requests, might trigger a knee-jerk reaction to increase the undo tablespace, a careful analysis might reveal that tuning / redesigning the offending program would be the right thing to do. Of course, there can be genuine situations where increasing the undo tablespace is the right thing to do, as reducing the undo_retention would result in increased number of ORA-1555 occurences. In such situations, the key would be to strike a balance between the undo_retention, undo tablespace size and the ORA-1555 errors.

- Aravind Kamath


graphic counter

Sunday, December 2, 2007

How to skip a job stuck on adworker while applying a patch

In this post, i would like to share the steps to skip a job while applying the patch. Well, we are all aware of the hidden option "8" in adctrl which will skip a job. However, consider this situation, you have started a patch and you discover that there is a SQL that is sitting on one of the adworkers doing nothing. You wont be able to progress with the patch because, the patch will not complete unless the "stuck" job on the worker completes. The option 8 will work only when the jobs have failed and the adpatch process has stopped due to the failed jobs.

The following steps will help in this situation:

1. Identify the "process" from v$session of the "stuck" job.
2. On the middle tier where the adpatch process is running, terminate the process with the process id as identified in the first step. (Killing the session on database should also work)
3. You will see that the job fails with "deferment number 1" and some other worker (or the same worker) will pick up the same job
4. Repeat step 1 and 2 again.
5. You will again see the job fail - this time with "deferment 2" and again the job would be started.
6. Repeat step 1 and 2 again.
7. This time, the adpatch will come to a halt reporting an error with the job you are after.
8. Use option 8 in adctrl this time to skip the worker and get over the patch.

- Aravind Kamath