Wednesday, June 15, 2011

Clusterware exclusive mode start of Clusterware Ready Services failed at /oracle/product/grid_home/crs/install/crsconfig_lib.pm line 6488

I was installing the 11gR2 CRS on a two node cluster. While executing rootcrs.pl as root user, we encountered the following error:

CRS-2674: Start of 'ora.crsd' on 'lnx-ps-210' failed
CRS-2679: Attempting to clean 'ora.crsd' on 'lnx-ps-210'
CRS-2681: Clean of 'ora.crsd' on 'lnx-ps-210' succeeded
CRS-4000: Command Start failed, or completed with errors.
Clusterware exclusive mode start of Clusterware Ready Services failed at /oracle/product/grid_home/crs/install/crsconfig_lib.pm line 6488

Upon investigation, found the following error in $ORACLE_HOME/log/lnx-ps-210/crsd/crsd.log:

2011-06-14 23:16:15.424: [ OCRAPI][3917672944]clsu_get_private_ip_addresses: no ip addresses found.
[ OCRAPI][3917672944]a_init_clsss: failed to call clsu_get_private_ip_addr (7)
2011-06-14 23:16:15.630: [ OCRAPI][3917672944]a_init:13!: Clusterware init unsuccessful : [44] 2011-06-14 23:16:15.630: [ CRSOCR][3917672944] OCR context init failure. Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2011-06-14 23:16:15.630: [ CRSOCR][3917672944][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)
2011-06-14 23:16:15.630: [ CRSD][3917672944][PANIC] CRSD Exiting. OCR Failed
2011-06-14 23:16:15.630: [ CRSD][3917672944] Done.


To overcome this issue we ran rootcrs.pl deconfig, cleared up the OCR and Voting disk info from failed installation, set the correct permissions for OCR disks and executed root.sh and rootcrs.pl again.

Same result with same set of errors.

I tracked the sequence of events that happen while executing the rootcrs.pl and here is what i found in ORACLE_HOME/cfgtoollogs/rootcrs_lnx-ps-210.log:

2011-06-14 23:16:15: Command output:
> clscfg: -install mode specified
> Successfully accumulated necessary OCR keys.
> Creating OCR keys for user 'root', privgrp 'root'..
> Operation successful.
>End Command output
2011-06-14 23:16:15: Oracle Cluster Registry initialization completed
2011-06-14 23:16:15: Executing cmd: /oracle/product/grid_home/bin/crsctl start resource ora.crsd -init
2011-06-14 23:26:20: Command output:
> CRS-2672: Attempting to start 'ora.crsd' on 'lnx-ps-210'
> CRS-5017: The resource action "ora.crsd start" encountered the following error:
> Start action for daemon aborted
> CRS-2674: Start of 'ora.crsd' on 'lnx-ps-210' failed
> CRS-2679: Attempting to clean 'ora.crsd' on 'lnx-ps-210'
> CRS-2681: Clean of 'ora.crsd' on 'lnx-ps-210' succeeded
> CRS-4000: Command Start failed, or completed with errors.

This means rootcrs.pl is calling crsctl to start crsd. I navigated through the logic used in $ORACLE_HOME/bin/crsctl which led me to crswrapexece.pl. This script gave out the location of the log (/oracle/product/grid_home/log/lnx-ps-210/client) to which this file writes which is were I got the various logs that were written as part of the startup sequence.

Upon digging deep, I found the following errors in crsctl_oracle.log under /oracle/product/grid_home/log/lnx-ps-210/client:

2011-06-14 21:11:01.100: [ OCRMSG][3356149456]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2011-06-14 21:11:01.100: [ OCRMSG][3356149456]GIPC error [29] msg [gipcretConnectionRefused]
2011-06-14 21:11:01.101: [ OCRMSG][3356149456]prom_connect: error while waiting for connection complete [24]
2011-06-14 21:11:01.985: [ OCRMSG][1049650896]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2011-06-14 21:11:01.985: [ OCRMSG][1049650896]GIPC error [29] msg [gipcretConnectionRefused]

This led me to check gpnp log (gpnptool_24455.log):

Oracle Database 11g Clusterware Release 11.2.0.2.0 - Production Copyright 1996, 2010 Oracle. All rights reserved.
2011-06-14 23:13:35.202: [ default][2728037296]

================================================================================
2011-06-14 23:13:35.202: [ default][2728037296]gpnptool client START pid=24443 Oracle Grid Plug-and-Play Tool
2011-06-14 23:13:35.203: [ default][2728037296]gpnptool Command line: /oracle/product/grid_home/bin/gpnptool.bin verify -
p=/oracle/product/grid_home/gpnp/lnx-ps-210/profiles/peer/profile.xml -w=file:/oracle/product/grid_home/gpnp/lnx-ps-210/wallets/peer -wu=peer
2011-06-14 23:13:35.220: [ GPNP][2728037296]Profile signature is valid.

When I checked the file /oracle/product/grid_home/gpnp/lnx-ps-210/profiles/peer/profile.xml, it contained wrong subnet entries which i had typed in erroneously and was not getting updated subsequently by rootcrs.pl. This was not cleaned up by deconfig command as well.

To fix this issue, i renamed the folder lnx-ps-210 under $ORACLE_HOME/gpnp and executed rootcrs.pl and it fixed the issue!

Here are the complete sequence:

1. Clean up failed installation with rootcrs deconfig command as root user.
2. Clean up the OCR and Voting disks which contain information from previous run.
3. Validate the permission for OCR disk to be correct.
4. Move the directory lnx-ps-210 under $ORACLE_HOME/gpnp
5. run $ORACLE_HOME/root.sh as root user
6. run rootcrs.pl script as root user.


- Aravind Kamath Posral

4 comments:

Anonymous said...

Good post! Today I had exactly the same problem but in my case I couldn't find anything about subnet in profile.xml file.

Mariusz said...

Today I've started force the same problem but in my case I cannot find any information about subnet in profile.xml file. Is there another place where I can find subnet settings?

Aravind Kamath said...

Actually, my post above assumes that you had a clean cluvfy output without any network related issues. Hope you have run cluvfy before starting with this..

Aravind

Anonymous said...

Thanks for this post, I had the same issue, entered a wrong subnet for the interconnect and the xml wasn't being cleared up by deconfig... after removing the offending xmls, I was able to proceed