
Clusterware on the 2nd node can’t start after upgrading Oracle 11.1.0.6 to 11.1.0.7

After upgrading from Oracle 11.1.0.6 to Oracle 11.1.0.7, the clusterware failed to start when I ran $ORA_CRS_HOME/install/root111.sh.

Script output:

# /crs/11.1.0/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Successfully stopped Oracle Clusterware resources
Stopping Cluster Synchronization Services.
Shutting down the Cluster Synchronization Services daemon.
Shutdown request successfully issued.
# /crs/11.1.0/install/root111.sh
Creating pre-patch directory for saving pre-patch clusterware files
Completed patching clusterware files to /crs/11.1.0
Relinking some shared libraries.
Relinking of patched files is complete.
Preparing to recopy patched init and RC scripts.
Recopying init and RC scripts.
Startup will be queued to init within 30 seconds.
Starting up the CRS daemons.
Waiting for the patched CRS daemons to start.
This may take a while on some systems.
.
.
.
.
.
.
Timed out waiting for the CRS daemons to start. Look at the system message file and the CRS log files for diagnostics.

Checked $ORA_CRS_HOME/log/hostname/alerthostname.log: no useful information.
Checked the output of ps -ef|grep crs: the evmd process was running.
Checked $ORA_CRS_HOME/log/hostname/evmd/evmd.log and found the following error messages:

2010-07-15 05:50:22.351: [    EVMD][4143711936] EVMD Starting
2010-07-15 05:50:22.351: [    EVMD][4143711936] Initializing OCR
2010-07-15 05:50:22.368: [    EVMD][4143711936] Get OCR context succeeded
2010-07-15 05:50:22.369: [ COMMCRS][83372976]clsc_connect: (0x9857018) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_))

2010-07-15 05:50:22.370: [ CSSCLNT][4143711936]clsssInitNative: failed to connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_)), rc 9

2010-07-15 05:50:22.370: [    EVMD][4143711936] EVMD waiting for CSS to be ready err = 3
2010-07-15 05:50:23.373: [ COMMCRS][83372976]clsc_connect: (0x9857018) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_))

2010-07-15 05:50:23.373: [ CSSCLNT][4143711936]clsssInitNative: failed to connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_)), rc 9

2010-07-15 05:50:23.374: [    EVMD][4143711936] EVMD waiting for CSS to be ready err = 3

So EVMD is waiting for CSSD to start.
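
To see at a glance how often EVMD hit this wait, the log can be counted with grep. This is only a minimal sketch: the function name count_css_waits is made up for illustration, and the search string is simply the message text from the log excerpt above.

```shell
#!/bin/sh
# Count the lines in which EVMD reported that it was waiting for CSS.
# $1: path to evmd.log (in this post:
#     $ORA_CRS_HOME/log/hostname/evmd/evmd.log)
# Like grep itself, this returns a non-zero status when nothing matches.
count_css_waits() {
    grep -c 'EVMD waiting for CSS to be ready' "$1"
}

# Example (not run here):
#   count_css_waits "$ORA_CRS_HOME/log/hostname/evmd/evmd.log"
```

A count that keeps growing while the root script is waiting suggests the real problem is further down the stack, in CSSD or something CSSD depends on.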

Next, check the CSSD log file, $ORA_CRS_HOME/log/hostname/cssd/ocssd.log, which contains the following error messages:

[    CSSD]2010-07-15 06:48:22.354 [4033264560] >TRACE:   kgzf_dskm_conn4: unable to connect to master diskmon in 60260 msec

[    CSSD]2010-07-15 06:48:22.354 [4033264560] >TRACE:   kgzf_send_main1: connection to master diskmon timed out

[    CSSD]2010-07-15 06:48:22.354 [4012284848] >TRACE:   KGZF: Fatal diskmon condition, IO fencing is not available. For additional error info look at the master diskmon log file (diskmon.log)

So, continuing, check diskmon.log at $ORA_CRS_HOME/log/hostname/diskmon/diskmon.log:

[ DISKMON]

        I/O Fencing and SKGXP HA monitoring daemon -- Version 1.0.0.0
        Process 2323 started on 07/15/2010 at 07:00:02.867

[ DISKMON] 07/15/2010 07:00:02.893 [2323:4143413984] dskm main11: skgznp_create(default pipe) failed with error 56810
[ DISKMON] 07/15/2010 07:00:02.894 [2323:4143413984] dskm_main11: error 56810 at location skgznpcre3 - bind() - Address already in use
[ DISKMON]
        Process 2323 exiting on 07/15/2010 at 07:00:02.895

Searching Google and Metalink for error code 56810 returned nothing.
The "Address already in use" text in the message points to a socket bind failure, so I traced diskmon with strace.
strace output for diskmon:
bind(5, {sa_family=AF_FILE, path="/tmp/.oracle_master_diskmon"}, 110) = -1 EADDRINUSE (Address already in use)
close(5)                                = 0
gettimeofday({1279202402, 893531}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
gettimeofday({1279202402, 894007}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
unlink("/tmp/.oracle_master_diskmon")   = -1 EPERM (Operation not permitted)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL}, {0x1e5dcf4, ~[ILL ABRT BUS FPE KILL SEGV USR2 STOP XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x9b5880}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
gettimeofday({1279202402, 895365}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
exit_group(1)                           = ?
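
In a long trace, the failing calls are easier to spot when everything that succeeded is filtered out. The following is a rough sketch that assumes the trace was saved to a file first (strace's -o option does this); the function name failed_syscalls is hypothetical.

```shell
#!/bin/sh
# Print only the system calls that returned -1 together with an errno
# name, e.g. "... = -1 EADDRINUSE (Address already in use)".
# $1: file holding saved strace output
failed_syscalls() {
    grep -E '= -1 E[A-Z0-9]+ \(' "$1"
}

# Example (not run here):
#   failed_syscalls /tmp/diskmon.strace
```

Applied to the excerpt above, this would leave just the bind() and unlink() lines, which are the two calls that matter here.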

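The trace shows bind() on /tmp/.oracle_master_diskmon failing with EADDRINUSE and the follow-up unlink() failing with EPERM, so diskmon could not clear the existing socket file itself. The socket file /tmp/.oracle_master_diskmon turned out to have been created long ago, so I deleted it and reran root111.sh. This time the upgrade root script completed successfully.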

It is also worth checking /var/tmp/.oracle, which holds other socket files used by Oracle Clusterware.
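
The checks on the socket files can be sketched as two small shell helpers. The names path_kind and old_sockets are invented for this example, and the second one assumes a find that supports -maxdepth (GNU and BSD find do). The sketch only reports candidates and deletes nothing.

```shell
#!/bin/sh
# Print what kind of thing a path is: "socket", "other" or "missing".
path_kind() {
    if [ -S "$1" ]; then
        echo socket
    elif [ -e "$1" ]; then
        echo other
    else
        echo missing
    fi
}

# List socket files directly inside directory $1 whose age in whole days
# is greater than $2 (this is find's "-mtime +N" rule).
old_sockets() {
    find "$1" -maxdepth 1 -type s -mtime +"$2" -print
}

# Example (not run here):
#   path_kind /tmp/.oracle_master_diskmon
#   old_sockets /tmp 1
#   old_sockets /var/tmp/.oracle 1
```

An old socket is only a candidate for removal. Before deleting one, it seems prudent to confirm that the clusterware stack is stopped and that no process still has the file open, for example with lsof.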