Thursday, June 23, 2011

adoacorectl.sh: exiting with status 152

Today we faced an issue where the oacore jvms were coming up to init state but not alive and then would show down. No errors logged anywhere. Enabling debugging did not give out any information.

Here is how i did the troubleshooting:
1. Validate the connect string (apps_jdbc_url) provided in dbc file by connecting using sqlplus apps/@" apps jdbc url connect string". This was working.

2. When the oacore jservs are in init stage, connect to database and query v$session:

select distinct machine from v$session;
This showed that oacore was indeed making connection to database. This also means that dbc file is NOT the suspect.

3. After clearing persistence folder etc which are usual suspects, all the efforts went in vain.

4. The problem had to be with the configuration because the database connections were going through fine.

5. I analyzed the applications deployed to this jvm and each of the applications were using a custom schema name to connect to the database.

6. This node is an external node in DMZ. This implies this is protected by firewall.

7. For the external nodes, we use OCM to route the connections to the database.

8. Someone had run autoconfig which had overwritten the customization in $IAS_ORACLE_HOME/j2ee/config/oacore/data-sources.xml and it was trying to connect "directly" to database.

9. Changed the connect string in data-sources.xml to route through OCM and bingo! the oacore came "alive"

- Aravind Kamath Posral

Wednesday, June 15, 2011

Clusterware exclusive mode start of Clusterware Ready Services failed at /oracle/product/grid_home/crs/install/crsconfig_lib.pm line 6488

I was installing the 11gR2 CRS on a two node cluster. While executing rootcrs.pl as root user, we encountered the following error:

CRS-2674: Start of 'ora.crsd' on 'lnx-ps-210' failed
CRS-2679: Attempting to clean 'ora.crsd' on 'lnx-ps-210'
CRS-2681: Clean of 'ora.crsd' on 'lnx-ps-210' succeeded
CRS-4000: Command Start failed, or completed with errors.
Clusterware exclusive mode start of Clusterware Ready Services failed at /oracle/product/grid_home/crs/install/crsconfig_lib.pm line 6488

Upon investigation, found the following error in $ORACLE_HOME/log/lnx-ps-210/crsd/crsd.log:

2011-06-14 23:16:15.424: [ OCRAPI][3917672944]clsu_get_private_ip_addresses: no ip addresses found.
[ OCRAPI][3917672944]a_init_clsss: failed to call clsu_get_private_ip_addr (7)
2011-06-14 23:16:15.630: [ OCRAPI][3917672944]a_init:13!: Clusterware init unsuccessful : [44] 2011-06-14 23:16:15.630: [ CRSOCR][3917672944] OCR context init failure. Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2011-06-14 23:16:15.630: [ CRSOCR][3917672944][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)
2011-06-14 23:16:15.630: [ CRSD][3917672944][PANIC] CRSD Exiting. OCR Failed
2011-06-14 23:16:15.630: [ CRSD][3917672944] Done.


To overcome this issue we ran rootcrs.pl deconfig, cleared up the OCR and Voting disk info from failed installation, set the correct permissions for OCR disks and executed root.sh and rootcrs.pl again.

Same result with same set of errors.

I tracked the sequence of events that happen while executing the rootcrs.pl and here is what i found in ORACLE_HOME/cfgtoollogs/rootcrs_lnx-ps-210.log:

2011-06-14 23:16:15: Command output:
> clscfg: -install mode specified
> Successfully accumulated necessary OCR keys.
> Creating OCR keys for user 'root', privgrp 'root'..
> Operation successful.
>End Command output
2011-06-14 23:16:15: Oracle Cluster Registry initialization completed
2011-06-14 23:16:15: Executing cmd: /oracle/product/grid_home/bin/crsctl start resource ora.crsd -init
2011-06-14 23:26:20: Command output:
> CRS-2672: Attempting to start 'ora.crsd' on 'lnx-ps-210'
> CRS-5017: The resource action "ora.crsd start" encountered the following error:
> Start action for daemon aborted
> CRS-2674: Start of 'ora.crsd' on 'lnx-ps-210' failed
> CRS-2679: Attempting to clean 'ora.crsd' on 'lnx-ps-210'
> CRS-2681: Clean of 'ora.crsd' on 'lnx-ps-210' succeeded
> CRS-4000: Command Start failed, or completed with errors.

This means rootcrs.pl is calling crsctl to start crsd. I navigated through the logic used in $ORACLE_HOME/bin/crsctl which led me to crswrapexece.pl. This script gave out the location of the log (/oracle/product/grid_home/log/lnx-ps-210/client) to which this file writes which is were I got the various logs that were written as part of the startup sequence.

Upon digging deep, I found the following errors in crsctl_oracle.log under /oracle/product/grid_home/log/lnx-ps-210/client:

2011-06-14 21:11:01.100: [ OCRMSG][3356149456]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2011-06-14 21:11:01.100: [ OCRMSG][3356149456]GIPC error [29] msg [gipcretConnectionRefused]
2011-06-14 21:11:01.101: [ OCRMSG][3356149456]prom_connect: error while waiting for connection complete [24]
2011-06-14 21:11:01.985: [ OCRMSG][1049650896]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2011-06-14 21:11:01.985: [ OCRMSG][1049650896]GIPC error [29] msg [gipcretConnectionRefused]

This led me to check gpnp log (gpnptool_24455.log):

Oracle Database 11g Clusterware Release 11.2.0.2.0 - Production Copyright 1996, 2010 Oracle. All rights reserved.
2011-06-14 23:13:35.202: [ default][2728037296]

================================================================================
2011-06-14 23:13:35.202: [ default][2728037296]gpnptool client START pid=24443 Oracle Grid Plug-and-Play Tool
2011-06-14 23:13:35.203: [ default][2728037296]gpnptool Command line: /oracle/product/grid_home/bin/gpnptool.bin verify -
p=/oracle/product/grid_home/gpnp/lnx-ps-210/profiles/peer/profile.xml -w=file:/oracle/product/grid_home/gpnp/lnx-ps-210/wallets/peer -wu=peer
2011-06-14 23:13:35.220: [ GPNP][2728037296]Profile signature is valid.

When I checked the file /oracle/product/grid_home/gpnp/lnx-ps-210/profiles/peer/profile.xml, it contained wrong subnet entries which i had typed in erroneously and was not getting updated subsequently by rootcrs.pl. This was not cleaned up by deconfig command as well.

To fix this issue, i renamed the folder lnx-ps-210 under $ORACLE_HOME/gpnp and executed rootcrs.pl and it fixed the issue!

Here are the complete sequence:

1. Clean up failed installation with rootcrs deconfig command as root user.
2. Clean up the OCR and Voting disks which contain information from previous run.
3. Validate the permission for OCR disk to be correct.
4. Move the directory lnx-ps-210 under $ORACLE_HOME/gpnp
5. run $ORACLE_HOME/root.sh as root user
6. run rootcrs.pl script as root user.


- Aravind Kamath Posral

Sunday, June 12, 2011

adoacorectl.sh: exiting with status 204

With 30 mins to golive, we encountered following error in production environment while trying to startup the Apache/Web services:

adoacorectl.sh: exiting with status 204
adoafmctl.sh: exiting with status 204

We have forms running in socket mode, so there were no issues with that. Apache came up fine. There were no issues with OPMN. No other errors were logged.

Subsequently, I figured out that the issue was due to /apps filesystem getting filled up. The processes were unable to write to the disk due to which the startup was not happening. During the course of troubleshooting, I figured out that oacorestd.out and oacorestd.err (predominantly stdout) logs were consuming all the space. We have 5 OACORE process and one oafm process. This means we have 6 logs which had grown to consume all the space.

To overcome this issue, we shutdown all the services and zipped the logs to create space.

Once this was done, I tried to start the services but still encountered the same error. However, this time around, I saw the following error in oacorestd.err:

11/06/12 09:14:05 Error initializing server: /apps/local/CEPROD/inst/apps/CEPROD_c7-iprd-54/ora/10.1.3/j2ee/oacore/config/server.xml, Fatal error at line 1 offset 1 in file:/apps/local/CEPROD/inst/apps/CEPROD_c7-iprd-54/ora/10.1.3/j2ee/oacore/config/server.xml: .:
XML-20108: (Fatal Error) Start of root element expected.

When i checked the server.xml, I found that it had been reduced to zero bytes. At this moment, one of the metalink notes suggested to run autoconfig to fix this issue.

However, we were already out of the outage window and autoconfig would be the last option.

As a workaround to this issue, I went to the logs of the last autoconfig run (we were sure that no patches that bring in new templates were applied
post that autoconfig run) and located the server.xml backup for both oacore and oafm and placed them back and started the services and everything came up as expected.

-Aravind Kamath Posral