Friday, December 28, 2007

Understanding Desktop Heap from an Apps DBA perspective

One of the issues that an Apps DBA might encounter sooner or later on the Windows 2000/2003 platform is the error "The application failed to initialize properly (0xc0000142). Click on OK to terminate the application." in the event log of the server. The event log entry also contains the name of the executable that could not be initialized. This could be any of the Oracle Applications executables, such as OAMKSVC.exe, Apache.exe, ifweb60.exe and so on. If you check the memory availability using Task Manager, you will see plenty of memory available.

The Problem:

The above scenario implies that you are running into a "desktop heap" issue. Let me try to explain in simple words what desktop heap means and the aspects that matter from a troubleshooting perspective. Desktop heap simply refers to a memory area within a session that is used to store certain user interface objects, such as windows. Every process requires a desktop object to run under, every desktop requires a window station to run under, and every window station requires a session to run under. One session can have only one interactive window station, but more than one non-interactive window station. Also, there are three types of desktops - Default, Disconnect and Winlogon - which implies that one window station can have all three types of desktops running under it. I think that Disconnect and Winlogon are applicable in a terminal services environment.

The question that would immediately arise is - how does this all relate to my "apache / forms" applications running as a service?

The (Apache/Forms or any other) service runs in a "non-interactive" window station and implicitly uses a "Default" desktop. The Windows operating system allows you to configure the "Default" desktop heap value for interactive and non-interactive window stations separately.

Ascertain the issue:

One of the tools available to monitor desktop heap usage and confirm the presence of a desktop heap issue is Microsoft's Desktop Heap Monitor (dheapmon). Run this tool to determine the status of the desktop heap.
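
From memory, a typical Desktop Heap Monitor session looks something like this (run from the directory where the dheapmon kit is installed; do verify the switches against the tool's own help, as I am quoting them from recollection):

dheapmon -l     (loads the dheapmon kernel driver; needed once per boot)
dheapmon        (dumps the percentage of desktop heap used by each desktop)
dheapmon -u     (unloads the driver when you are done)

A desktop showing usage close to 100% is the one running out of heap.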

Settings and Corrective Action:

You can configure the value of "default" desktop heap for "interactive" and "non-interactive" window station using the following registry entry: HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows.

The value would look something like %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16

The second and third values (in KB) in the "SharedSection" portion of the above value represent the desktop heap sizes for interactive and non-interactive window stations respectively. These can be configured. However, please note that each session allocates this amount of desktop heap for every "default" desktop running under the session.
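
For example, to raise the non-interactive desktop heap from 512 KB to 1024 KB, only the third value in SharedSection is changed (shown in isolation below; the rest of the registry value stays untouched, a reboot is required for the change to take effect, and it is wise to back up the registry before editing):

SharedSection=1024,3072,1024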

Note that if your Apache service is running under the "LocalSystem" account without the "Allow service to interact with desktop" option, then the amount of desktop heap available is 512 KB (per the above example). This value is allocated for each service, as each such service is considered a non-interactive desktop. However, when you have a large number of concurrent users on Forms or Apache, chances are that 512 KB is not enough. In such situations, you can check the "Allow service to interact with desktop" option, thereby leveraging the interactive desktop heap of window station 0 (the console), which is 3072 KB. One word of caution here: if the 3072 KB is depleted, your Apache will "hang" until someone clicks the "OK" button on the console for Apache to continue.
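
If you prefer the command line to the services applet for toggling this option, sc.exe can do it (the service name below is hypothetical; note the mandatory space after each "type="):

sc config OracleApacheService type= own type= interact

The service must be restarted for the change to take effect.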

The downside of increasing the non-interactive window station value from 512 to 3072 is that every service will occupy this amount of memory. This applies to every service you may have on the server, such as anti-virus software, backup software, volume managers, system management software and so on.


One last word: do not worry about the "Disconnect" and "Winlogon" desktop heaps, as issues are infrequent in this area. I am also not aware whether these are configurable.

Wish You Happy Holidays and Happy New Year.

- Aravind Kamath

Thursday, December 13, 2007

ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS'

As year end is fast approaching, the volume of data processed in our ERP environment sees a sharp surge, and I don't think I would be mistaken in saying that this is the time of the year when the maximum data processing happens in our environment.

Problem Statement: Users (and of course alerts) reported that different concurrent programs were failing with ORA-30036.

Symptoms and observations:

- Storage devices reported a considerable surge in the number of writes to disk starting 9:00 AM 11-Dec-07
- The alert log recorded ORA-30036
- The concurrent logs and/or output files reported ORA-30036
- An alert was continuously firing, reporting a long-running custom program

The Approach:

One of the key things was to stay focused on the issue. In these kinds of situations, there are bound to be distractions wherein users and developers alike "guess" that every other issue they are experiencing is "somehow" related to ORA-30036, that there is a "general" slowdown of the system, and so on. The following is the approach we used to narrow down on the issue:

1. Using statspack, we narrowed down on the top 5 SQLs that were doing the maximum amount of I/O. In fact, Nikhil, my colleague who was part of this all through the issue, contributed significantly in this area.
2. Obtained the details of the concurrent programs that had failed with the ORA-30036 error and also noted whether they completed successfully in subsequent runs.
3. Scanned the alert log to see when this error was first reported.
4. Determined which active sessions were consuming undo space, using a join on v$session and v$transaction. We were primarily interested in this because we wanted to see whether the failed programs would succeed if we asked the users to "retry" the failed jobs.

Eg. select a.sid, a.username, b.used_urec, b.used_ublk from v$session a, v$transaction b where a.saddr=b.ses_addr;

Multiplying the used_ublk column by the db_block_size gives the size of the undo space used by the session in bytes.
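
A variant of the above that folds the block size into the query (a minimal sketch; it assumes undo blocks are sized at db_block_size, which held true in our case):

Eg. select a.sid, a.username, b.used_urec, b.used_ublk * p.value / 1024 / 1024 undo_mb
from v$session a, v$transaction b, v$parameter p
where a.saddr = b.ses_addr and p.name = 'db_block_size';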

5. One of the most important and decisive pieces of information in this issue was derived from dba_undo_extents.

SELECT STATUS, SUM(BYTES)/1024/1024 MB, COUNT(*) FROM DBA_UNDO_EXTENTS GROUP BY STATUS;

The above query helped us to see the amount of space held by "Active", "Expired" and "UnExpired" undo extents.

Active refers to extents in which active transactions are holding space.
Expired refers to extents that are not serving active transactions and are outside the time frame defined by the undo_retention parameter, and hence are available for reuse.
UnExpired refers to extents that are not active but still fall within the undo_retention time frame. The time is calculated from the end of the transaction, which is either a commit or a rollback.

Also, a couple of other things that help in understanding the issue are queries on dba_free_space and dba_data_files, used in conjunction with dba_undo_extents, to see whether or not all of the space in the undo tablespace is consumed.
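
For example (the tablespace name is the one from the error message; adjust it for your environment):

Eg. select sum(bytes)/1024/1024 free_mb from dba_free_space where tablespace_name = 'UNDOTBS';
Eg. select sum(bytes)/1024/1024 total_mb from dba_data_files where tablespace_name = 'UNDOTBS';

Comparing the two shows how much of the undo tablespace is genuinely free, as opposed to being held by unexpired extents.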

In our case, it turned out that the custom program had been running for more than 24 hours and was updating close to 50 million records without committing in between. This program also showed up in the top SQLs in the statspack report, which implied that it was generating a lot of undo data. In addition, we had a large chunk of extents in our undo tablespace - over 26 GB - in "unexpired" status, most of which could be attributed to a failed previous run of the same custom program, terminated minutes before the kick-off of the run currently in flight. This led to a situation where the other programs could not overwrite the unexpired extents and faced the ORA-30036 error.

Though the error message, when looked at in isolation in the context of the failed requests, might trigger a knee-jerk reaction to increase the undo tablespace, a careful analysis might reveal that tuning or redesigning the offending program is the right thing to do. Of course, there can be genuine situations where increasing the undo tablespace is the right thing to do, as reducing undo_retention would result in an increased number of ORA-1555 occurrences. In such situations, the key is to strike a balance between undo_retention, the undo tablespace size and the ORA-1555 errors.

- Aravind Kamath



Sunday, December 2, 2007

How to skip a job stuck on adworker while applying a patch

In this post, I would like to share the steps to skip a job while applying a patch. We are all aware of the hidden option "8" in adctrl, which will skip a job. However, consider this situation: you have started a patch and you discover that there is a SQL statement sitting on one of the adworkers doing nothing. You won't be able to progress with the patch, because the patch will not complete unless the "stuck" job on the worker completes, and option 8 works only when a job has failed and the adpatch process has stopped because of the failed job.

The following steps will help in this situation:

1. Identify the "process" from v$session for the "stuck" job (see the example query after this list).
2. On the middle tier where the adpatch process is running, terminate the process with the process id identified in the first step. (Killing the session on the database should also work.)
3. You will see that the job fails with "deferment number 1", and some other worker (or the same worker) will pick up the same job.
4. Repeat steps 1 and 2 again.
5. You will again see the job fail, this time with "deferment 2", and again the job will be restarted.
6. Repeat steps 1 and 2 again.
7. This time, adpatch will come to a halt, reporting an error with the job you are after.
8. Use option 8 in adctrl this time to skip the worker and get on with the patch.
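
For step 1, this is the kind of query that helps (a sketch only; the sql_text filter is a placeholder - match it against the statement you saw sitting on the worker):

Eg. select s.sid, s.serial#, s.process
from v$session s, v$sql q
where s.sql_address = q.address
and q.sql_text like '%<text of the stuck statement>%';

The process column gives the client process id on the middle tier that needs to be terminated in step 2.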

- Aravind Kamath

Thursday, November 29, 2007

Setting up Distributed Caching

As the holiday season sets in, my blogging activity seems to have slowed down :).

Going back to tech stuff, one of the newer caching functionalities implemented through the iAS infrastructure in ERP applications is "Distributed Caching". Though the onus of leveraging this caching framework for Self-Service Applications is on the application developer/system administrator, it is essential to have the framework set up accurately in implementations wishing to use this feature, and the upside of having it set up accurately is fewer Jserv errors. Whether we like it or not, the functionality might have crept into the system "inadvertently" via patches applied to it, e.g. for a module-specific upgrade, to overcome a specific issue, or for any other reason.

In a nutshell, the following actions must be performed to set up distributed caching successfully.

1. Apply the appropriate patches:

At the time we implemented this in our environment, we had the following patches:

- 5455628
- 5468797
- 5639951

These patches might have since been superseded, and the recommendation is to stay current on the patches.

2. Entries of interest in jserv.properties

"wrapper.bin.parameters=-DLONG_RUNNING_JVM=true" - This should be set to true.
"wrapper.bin.parameters=-DCACHEMODE=LOCAL" - This should be eliminated completely (typically the autoconfig template would have taken care of this) from the jserv.properties or atleast commented out.

3. Make sure Workflow Components are up and running:

One of the features of "Distributed Caching" is "database invalidation". This comes into play when data objects cached in the Jserv are modified through the Forms interface.

So, for this feature to work correctly, we need to have the "Workflow Java Deferred Agent Listener" up and running at a minimum. Ideally, the recommendation is to have all of the workflow background components running.

4. Housekeeping Jobs:

- Workflow Control Queue Cleanup
- Workflow Background Process (To Process Deferred, Timeout and Stuck)
- Purge Obsolete Workflow Runtime Data

5. Define the appropriate ports:

- FND_CACHE_PORT_RANGE
- JTF_DIST_CACHE_PORT


FND_CACHE_PORT_RANGE needs to be defined as equal to the number of OACoreGroup JVMs that you have. JTF_DIST_CACHE_PORT needs to be defined as well; the default is 12345. If you have more than one instance running, you need to set it separately for each instance such that there is no conflict. One way is to add 3000 to the port specified in jserv.properties for OACoreGroup, e.g. 16000 + 3000 = 19000.
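
For illustration (the values below are hypothetical; substitute your own port and JVM count):

OACoreGroup port in jserv.properties: 16000
JTF_DIST_CACHE_PORT: 19000 (16000 + 3000)
FND_CACHE_PORT_RANGE: 2 (matching two OACoreGroup JVMs)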

Be sure to update the values in the context file so that they are preserved across autoconfig runs. If you are not a big fan of running autoconfig for these small changes, then here is what you need to do:

- Update in jserv.properties
- Update the profile options at site level. This keeps the value consistent across servers (if you have a multi-node installation).


Congratulations!! You have just completed all the steps to set up the Distributed Caching framework. But how do you know that it is set up accurately? Go on to the next step.

6. Validate:

- Log in to OAM
- Click on the "Diagnostics" tab
- In the diagnostics window, under HTML Platform, choose the "Test Caching Framework" test
- Run this test without prerequisites
- This test should complete without any errors


- Aravind Kamath

Saturday, October 27, 2007

Generate Jar Files Hangs While Applying A Patch

Recently, while I was applying a patch in our environment, the generation of jar files hung and would not progress. On a Windows 2000 server, it is known to us that when the number of characters in the PATH statement exceeds 1000, the patch gets stuck at adjkey (the jar file signing process), which we refer to as "genjars.cmd". However, this time around that did not seem to be the case.

The problem was a missing "identitydb.obj" file under the %HOME% directory. The identitydb.obj file, a.k.a. the identity database, holds the certificate authority details alongside the certificate and is required to sign the jar files delivered by the patch.

Actually, it appears that the profile of the owner of the applications file system (in other words, the domain user account that was used to install the apps) had been deleted from this particular server.

- Aravind Kamath

Wednesday, October 24, 2007

Segmentation Violation - adrelink.sh, adlibout.sh

I was in the process of 'adding' an additional mid-tier node to our existing multi-node environment. After the standard steps (cloning and so on), this node went out of sync with the other nodes in terms of patch levels, as the activity took a back seat to other priority issues. Hence, I started patching the new node I was setting up, which had not been patched alongside the other mid-tiers because it was not ready.

This activity brought to the forefront an issue with the extraction of object modules and relinking of executables during patching. I encountered a "Segmentation violation" error and the patch aborted. The patch log did not contain the segmentation violation error message; it appeared only in the command prompt from which I was running the patch.

I started troubleshooting this issue and, after a while, found that the problem was with the MKS Toolkit version. MKS Toolkit 8.0 patch 1 was required, and applying it fixed the issue.

Troubleshooting hints:

- A simple test.sh script that displays the hostname executed successfully
- adrelink.sh would throw the segmentation violation error
- adlibout.sh would throw the segmentation violation error
- At times, sourcing the environment file in order to relink would throw the above error
- Finally, invoking ksh from Start > Programs > MKS Toolkit > ksh would throw the segmentation fault

This helped me eliminate other factors like environment files, environment variables, PATH, etc., and narrow down on MKS, as merely invoking ksh had nothing to do with Oracle Apps.

- Aravind Kamath

Friday, October 19, 2007

Oracle Home Inventory Issue

I was in the process of applying the iAS Rollup 5 patch to our Oracle Applications environment. Despite following the steps to apply the patch, wherein ORACLE_HOME, CLASSPATH, JDK home, etc. are set, the installer refused to continue, with the following error:

"Not able to find ORACLE_HOME to be patched in the inventory please set the same properly and try again.."

There was a suggestion to create a "Global Inventory" by applying the OUI patch and so on, but I thought there should be a way to get past this without having to recreate the global inventory.

Here is how I fixed the issue:

- Set the environment variable %ORAINST% to point to the directory that contains the inventory.xml file under %program files%\oui\..... This ensures that the installer of the rollup patch looks for inventory.xml in this location. (We have an alternative location, created by a different Oracle software installation.)
- Most importantly, edited the Oracle_Home name information in inventory.xml to be consistent with the information in the registry, i.e. HKLM\Software\Oracle\Home1 (which is the $IAS_ORACLE_HOME). For some reason, the home names were inconsistent, which is why we saw the error stated above (see the illustration after this list).
- Then I ran the installer and yes, the installation completed successfully.
- How do I know? ==>
- Log file of the installation.
- cd $IAS_ORACLE_HOME/Apache/Apache
- apache -v
- The output confirms that RUP5 was installed successfully.
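
For illustration, the environment variable and the relevant piece of inventory.xml looked something like the following (the path and home name here are hypothetical; yours will differ):

set ORAINST=D:\Program Files\Oracle\Inventory

<HOME NAME="Home1" LOC="D:\oracle\prodora\iAS" TYPE="O" IDX="1"/>

The NAME attribute is what must match the Oracle home name recorded under HKLM\Software\Oracle in the registry.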

- Aravind Kamath

Monday, October 15, 2007

Mystery Connections from Concurrent Manager Server to Online Server

Hi,

It has been quite some time since I posted my last "AppsDBATechStuff" blog. I was held up with certain "time-bound" activities that occupied most of my time.

Now, I would like to share with you an interesting problem that I got to work on. We were working on certain memory management issues with Jserv in our environment, and it was discovered that there were connections between the online servers and the concurrent manager server. Though this was not the likely culprit in the "memory" issue we were chasing, it caught our attention and I started working on it.

Upon investigation, I was able to determine that the connections were originating from the concurrent manager box and terminating on the online servers. I tied the ports and the processes together and determined that the "Workflow Java Deferred Agent Listener" process was where the connections originated and the "OACoreGroup" Jserv was where they terminated.

Looking at the logs (javacache.log), I was able to determine that the Workflow Java Deferred Agent Listener was making these connections to all the Jservs on the online servers. Further analysis revealed that these connections were being made as part of the "distributed mode Java caching" framework. The "database invalidation" mechanism, in which the WF Java Deferred Agent Listener processes the events relating to cached data object invalidation, is also responsible for notifying the Jservs that previously cached data objects are now invalid.

This is the reason we saw the connections. Though we do not rely on or leverage the component cache mechanism, the above connections were seen because the framework is available and enabled. We continue to rely on "time based invalidation".


- Aravind Kamath

Friday, September 21, 2007

CCMCONNSTRING Registry Value Encryption - Password Change Scenario

Oracle has introduced encryption of the apps password in the registry value CCMCONNSTRING. I do not know exactly which patch introduced this change - FND.H / ATG.H / ATG.H RUP 4 / AD.I.4. Earlier, the value used to be stored unencrypted, and once the apps password was changed, the DBA could manually update this value in plain text. This registry value is used by the concurrent managers at startup.

Of course, FNDCPASS would be used to change the 'APPS', 'APPLSYS' and 'APPS_MRC' passwords. However, in a multi-node environment, some Oracle documentation pointed out that autoconfig needs to be run on each mid-tier after changing the apps password using FNDCPASS on the database tier (this requires the APPL_TOP to be available on the db tier). Autoconfig takes care of updating the apps password in the wdbsvr.app file and the CCMCONNSTRING registry value on the mid-tiers.

Autoconfig is time consuming, especially when you have many nodes and customized configuration files, so I wanted to find an alternative to running it. After researching the problem for a little while, I came up with the following approach:

- Use CCMSETUP to update the ccmconnstring registry value.
- Manually update the wdbsvr.app file under $IAS_ORACLE_HOME/...

One can enter the password in plain text in CCMSETUP and save the changes; CCMSETUP in turn saves the value in an encrypted format in the registry.
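
For reference, the password change itself on the database tier is done with FNDCPASS; the standard invocation (with placeholder credentials) is:

Eg. FNDCPASS apps/<current_apps_pwd> 0 Y system/<system_pwd> SYSTEM APPLSYS <new_apps_pwd>

Changing APPLSYS in SYSTEM mode also changes the APPS (and APPS_MRC, where present) passwords to the same value.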

Troubleshooting Hints:

- Capture the CCMCONNSTRING registry value before any change.
- In a multi-node environment, log on to a non-concurrent-processing node and run FNDCPASS.
- Capture the CCMCONNSTRING registry value after the password change.
- No change in the encrypted value points to the fact that FNDCPASS does not update the CCMCONNSTRING registry value.
- On the concurrent manager box, invoke CCMSETUP, update the values and save. You will see the change in the encrypted CCMCONNSTRING value.
- Bring up the concurrent managers; they work fine.

Thus, we achieved our goal of not running autoconfig and not recreating the concurrent manager service.

- Aravind Kamath

Tuesday, September 18, 2007

The system has not been taken off maintenance mode completely

A problem was reported by users stating that they were seeing the following warning message:

"The system has not been taken off maintenance mode completely"

When I started working on this issue, I made the following observations:

- Users with the profile option "Self Service Personal Home Page mode" set to "Framework Only" saw this warning message when they logged in.
- Users with the profile option "Self Service Personal Home Page mode" set to "Personal Home Page" did not see this warning message.
- Not all users who had "Self Service Personal Home Page mode" set to "Framework Only" experienced this issue; about a third of them did not see the message.
- A patch had been applied in our environment previously.
- A quick check from "adadmin" revealed that maintenance mode had been disabled.

As I continued to work on this issue, I found that the cause of the problem was that the apps services on the mid-tiers were started after patching without maintenance mode first being disabled. Maintenance mode was later disabled, and only then were the services on the third mid-tier brought up. This was why some users were experiencing the warning message.

Corrective Action: bounce the online mid-tier services that were started before maintenance mode was disabled.

Troubleshooting tips:

A comparison of the Apache/Jserv startup timestamp on each online box with the maintenance-mode-disabled timestamp in the adadmin log will reveal whether or not Apache/Jserv was started before maintenance mode was disabled. The corrective action needs to be applied only to those online mid-tiers where services were started before maintenance mode was disabled.

- Aravind Kamath

Friday, September 14, 2007

Maximum Dump File Size

In this post, I will share with you something related to the max_dump_file_size parameter that I came across a while ago.



There was a requirement from our IT team to increase the maximum allowed user trace file size from 50 MB (the trace files would max out at 50 MB) to 500 MB. I took up the request from the developer and started working on it. Interestingly, I observed that max_dump_file_size was set to 102400. As a matter of fact, I was not part of the original team that set up the databases/instances I currently maintain, so I am not too sure why this parameter was set to 102400 in our environment.



Now, when the size is not given with a suffix such as K (KB) or M (MB), the value is interpreted as a maximum size in operating system blocks. I wanted to find out the operating system block size in our Windows environment (though reverse-engineering from the maxed-out trace files would reveal 512). A quick check on the Windows operating system revealed the logical size (cluster size) as 4K (4096 bytes). However, the database files in our environment are on a mounted volume that is a raw partition, and hence it appears that Oracle is using the physical block size of 512 bytes to calculate the maximum size. A select on the lebsz column of sys.x$kccle revealed a size of 512 bytes, which is also the size the Oracle database is picking up.
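
The check itself, connected as SYS (the x$ tables are visible only to SYS unless exposed through a view):

Eg. select max(lebsz) from sys.x$kccle;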



So, in our environment, the maximum trace file size was being calculated as follows, despite the Windows cluster size being 4K:



102400 * 512 = 52428800 bytes.



This translates to 50 MB.
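
To meet the 500 MB requirement, either of the following would work in principle (a sketch; SCOPE=BOTH assumes the instance uses an spfile - otherwise drop it and edit the init.ora):

Eg. alter system set max_dump_file_size = '500M' scope=both;

or, staying with the block-based convention (1024000 * 512 bytes = 500 MB):

Eg. alter system set max_dump_file_size = '1024000' scope=both;

Specifying the suffix explicitly removes any dependence on the OS block size.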



- Aravind Kamath

About me and my upcoming posts

Hi,

This first post of mine is not Apps DBA tech stuff. Instead, I would like to add a couple of lines about myself. I have been in the IT industry for more than 6 years now, and I have spent most of that time as an Oracle Apps DBA. I have worked extensively with Oracle Applications, starting with version 11.5.3 all the way up to 11.5.10. I also had a chance to explore 11.0.3, but quite frankly, I don't remember much of that now. I have dirtied my hands maintaining Oracle Applications on various flavours of Unix and Red Hat Linux. For the last two-plus years, I have been managing Oracle Applications on the Windows platform, something that is not too common (at least I presume so).

What I intend to share with you are my experiences as an Apps DBA on the Windows platform - some of the issues I have faced, solutions and troubleshooting. I am not assuming any level of proficiency in the audience of this blog. Some of the things I write may feel very rudimentary to some readers but may be just the solution for another. I will start publishing technical articles soon. Watch this space for my Apps DBA tech experiences.

Aravind Kamath Posral



