Tag Archives: clusterware

Clusterware on the 2nd node can't start up after Oracle 11.1.0.6 upgrade to 11.1.0.7

After upgrading from Oracle 11.1.0.6 to Oracle 11.1.0.7, the clusterware on the second node could not start up after running $ORA_CRS_HOME/install/root111.sh.

Script output:

# /crs/11.1.0/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Successfully stopped Oracle Clusterware resources
Stopping Cluster Synchronization Services.
Shutting down the Cluster Synchronization Services daemon.
Shutdown request successfully issued.
# /crs/11.1.0/install/root111.sh
Creating pre-patch directory for saving pre-patch clusterware files
Completed patching clusterware files to /crs/11.1.0
Relinking some shared libraries.
Relinking of patched files is complete.
Preparing to recopy patched init and RC scripts.
Recopying init and RC scripts.
Startup will be queued to init within 30 seconds.
Starting up the CRS daemons.
Waiting for the patched CRS daemons to start.
This may take a while on some systems.
.
.
.
.
.
.
Timed out waiting for the CRS daemons to start. Look at the system message file and the CRS log files for diagnostics.
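The checks that follow were done roughly like this (a sketch, using the paths from this environment):

# ps -ef | grep -E 'crsd|ocssd|evmd|diskmon' | grep -v grep
# /crs/11.1.0/bin/crsctl check crs
# tail -50 /crs/11.1.0/log/`hostname`/alert`hostname`.log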

Checked $ORA_CRS_HOME/log/hostname/alerthostname.log: no useful info.
Checked the output of ps -ef | grep crs and found that the evmd daemon was running.
Checked $ORA_CRS_HOME/log/hostname/evmd/evmd.log and found the following error messages:

2010-07-15 05:50:22.351: [    EVMD][4143711936] EVMD Starting
2010-07-15 05:50:22.351: [    EVMD][4143711936] Initializing OCR
2010-07-15 05:50:22.368: [    EVMD][4143711936] Get OCR context succeeded
2010-07-15 05:50:22.369: [ COMMCRS][83372976]clsc_connect: (0x9857018) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_))

2010-07-15 05:50:22.370: [ CSSCLNT][4143711936]clsssInitNative: failed to connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_)), rc 9

2010-07-15 05:50:22.370: [    EVMD][4143711936] EVMD waiting for CSS to be ready err = 3
2010-07-15 05:50:23.373: [ COMMCRS][83372976]clsc_connect: (0x9857018) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_))

2010-07-15 05:50:23.373: [ CSSCLNT][4143711936]clsssInitNative: failed to connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_stase09_)), rc 9

2010-07-15 05:50:23.374: [    EVMD][4143711936] EVMD waiting for CSS to be ready err = 3

So EVMD is waiting for CSSD to start up.
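To confirm that CSS is really the component that has not come up, the daemons can also be checked one by one (crsctl subcommands as in 10g/11.1; a sketch):

# /crs/11.1.0/bin/crsctl check cssd
# /crs/11.1.0/bin/crsctl check crsd
# /crs/11.1.0/bin/crsctl check evmd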

Then checked the CSSD log file $ORA_CRS_HOME/log/hostname/cssd/ocssd.log
and found the following error messages:

[    CSSD]2010-07-15 06:48:22.354 [4033264560] >TRACE:   kgzf_dskm_conn4: unable to connect to master diskmon in 60260 msec

[    CSSD]2010-07-15 06:48:22.354 [4033264560] >TRACE:   kgzf_send_main1: connection to master diskmon timed out

[    CSSD]2010-07-15 06:48:22.354 [4012284848] >TRACE:   KGZF: Fatal diskmon condition, IO fencing is not available. For additional error info look at the master diskmon log file (diskmon.log)
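Before reading the diskmon log, a quick ps check shows whether the master diskmon daemon is running at all (a sketch):

# ps -ef | grep diskmon | grep -v grep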

So, continuing, checked the diskmon log $ORA_CRS_HOME/log/hostname/diskmon/diskmon.log:

[ DISKMON]

        I/O Fencing and SKGXP HA monitoring daemon -- Version 1.0.0.0
        Process 2323 started on 07/15/2010 at 07:00:02.867

[ DISKMON] 07/15/2010 07:00:02.893 [2323:4143413984] dskm main11: skgznp_create(default pipe) failed with error 56810
[ DISKMON] 07/15/2010 07:00:02.894 [2323:4143413984] dskm_main11: error 56810 at location skgznpcre3 - bind() - Address already in use
[ DISKMON]
        Process 2323 exiting on 07/15/2010 at 07:00:02.895

Searched Google and Metalink for error code 56810: no results.
Judging from the error message, this looked like a network bind error, so I ran diskmon under strace.
strace diskmon output:
bind(5, {sa_family=AF_FILE, path="/tmp/.oracle_master_diskmon"}, 110) = -1 EADDRINUSE (Address already in use)
close(5)                                = 0
gettimeofday({1279202402, 893531}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
gettimeofday({1279202402, 894007}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
unlink("/tmp/.oracle_master_diskmon")   = -1 EPERM (Operation not permitted)
rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL}, {0x1e5dcf4, ~[ILL ABRT BUS FPE KILL SEGV USR2 STOP XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x9b5880}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
gettimeofday({1279202402, 895365}, NULL) = 0
futex(0x96103fc, 0x4 /* FUTEX_??? */, 1) = 1
exit_group(1)                           = ?

Checked the socket file "/tmp/.oracle_master_diskmon" and found it had been created long ago, so I deleted it. Then reran root111.sh, and the upgrade root script completed successfully.
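In other words, the fix was just removing the stale diskmon socket and rerunning the root script, roughly (a sketch of the commands, run as root):

# ls -l /tmp/.oracle_master_diskmon      # old timestamp: a socket left over from before the upgrade
# rm /tmp/.oracle_master_diskmon
# /crs/11.1.0/install/root111.sh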

You can also check /var/tmp/.oracle, which holds the other socket files used by Oracle Clusterware.
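To list all the clusterware socket files at once, something like (a sketch):

# ls -al /var/tmp/.oracle /tmp/.oracle /tmp/.oracle_master_diskmon 2>/dev/null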

Oracle 11.1.0.7, Single Instance, Clusterware, Giving up: Oracle CSS stack appears NOT to be running.

Today I restored an old 11.1.0.7 single-instance ASM environment in order to restore an existing 11.1.0.7 database backup. Following the previous procedure, the steps were:

1. Clean up the current clusterware environment:

rm -rf /etc/oracle
rm -rf /var/tmp/.oracle/*
rm -rf /tmp/.oracle
rm -rf /home/user01/oracle/11g/Clusterware
rm -rf /home/user01/oracle/11g/checkpoints

# vi /etc/inittab, and remove the CRS startup entry at the end.

# init q;

# mv $ORA_CRS_HOME to a new location

# reboot

2. Reconfigure the 11.1.0.7 single instance clusterware [in order to use ASM]

#/home/user01/oracle/11g/crs_11gProd/bin/localconfig reset

This step failed with the following errors:

Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Configuration for local CSS has been initialized

Cleaning up Network socket directories
Setting up Network socket directories
Adding to inittab
Startup will be queued to init within 30 seconds.
Checking the status of new Oracle init process...
Expecting the CRS daemons to be up within 600 seconds.
ls /home/user01/oracle/11g/crs_11gProd/cdata/localhost/local.ocr
Giving up: Oracle CSS stack appears NOT to be running.
Oracle CSS service would not start as installed
Automatic Storage Management(ASM) cannot be used until Oracle CSS service is started

ps -ef|grep crs showed no processes running.

Checked $ORA_CRS_HOME/log/hostname/alertxxxx.log: no errors found. Checked the logs under the crsd and cssd directories: no errors there either.

Cleaned the environment again and repeated the procedure; the same error occurred.

Checked /var/log/messages and found the following entries:

Apr  7 05:30:08 host01 logger: Waiting for filesystem containing /home/user01/oracle/crs_11gR2/bin/crsctl.
Apr  7 05:31:08 host01 logger: Waiting for filesystem containing /home/user01/oracle/crs_11gR2/bin/crsctl.

Suspected that leftovers from a previous installation had not been cleaned up completely, so after running the root script it was still picking up the old crsctl.

The crs_11gR2 directory had already been deleted, so it should no longer be referenced anywhere.

Checked the configuration files and scripts to find the cause.

In the /etc/init.d/ directory:

grep crs_11gR2 *

Found that both init.ohasd and ohasd contained that path, and deleted these two files.
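Roughly, that cleanup was (a sketch; file names taken from the grep above, and the inittab check is an extra precaution, not from the original notes):

# cd /etc/init.d
# grep -l crs_11gR2 *              # listed init.ohasd and ohasd here
# rm -f init.ohasd ohasd           # both still pointed at the deleted crs_11gR2 home
# grep ohasd /etc/inittab          # make sure no leftover respawn entry remains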

# init q;

The background still kept logging the error:

host01 logger: Waiting for filesystem containing /home/user01/oracle/crs_11gR2/bin/crsctl.

#reboot

After the reboot the error stopped appearing in /var/log/messages, so I reran the configuration command.

#/home/user01/oracle/11g/crs_11gProd/bin/localconfig reset

Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Configuration for local CSS has been initialized

Cleaning up Network socket directories
Setting up Network socket directories
Adding to inittab
Startup will be queued to init within 30 seconds.
Checking the status of new Oracle init process...
Expecting the CRS daemons to be up within 600 seconds.
Cluster Synchronization Services is active on these nodes.
        host01
Cluster Synchronization Services is active on all the nodes.
Oracle CSS service is installed and running under init(1M)

The configuration succeeded.
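Once localconfig reports success, the local CSS stack can be double-checked before creating the ASM instance (a sketch, using the 11.1-style crsctl syntax):

# /home/user01/oracle/11g/crs_11gProd/bin/crsctl check cssd
# ps -ef | grep ocssd | grep -v grep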

Oracle 11gR2 clusterware [INS-06006] error

Today, while installing 11gR2 clusterware, right after entering the hostnames and VIPs I hit [INS-06006] Passwordless SSH connectivity not setup between the following node(s): [host1].

Checked the documentation and found that the SSH setup for a clusterware installation requires not only that the two nodes can reach each other, but also that each node can SSH to itself. Adding each node's own RSA key on both nodes fixes it; the commands are:

cat ~/.ssh/id_rsa.pub | ssh user@host1 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh user@host2 "cat - >> ~/.ssh/authorized_keys"
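After adding the keys, it is worth verifying from each node that both hosts, including the node itself, are reachable without a password prompt; a sketch (same placeholder user and host names as above):

for h in host1 host2; do ssh -o BatchMode=yes user@$h hostname; done   # BatchMode makes ssh fail instead of prompting; each line should print the remote hostname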

 

-------------------------

P.S. I ran into another SSH-related problem yesterday, so I am noting it down here as well.

host1 and host2 were configured identically. After setting up SSH, host1 could ssh to host2, but not the other way around. After a lot of Googling with no result, I finally found that the user's home directory on host1 had permission 775. For SSH to work, only the owner may have write permission on the user's home directory. While Googling I also saw claims that the .ssh directory must be writable only by its owner, and that authorized_keys should be 600.
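For reference, the permissions that make this work can be set like so (a sketch; 755 on the home directory, or anything without group/other write):

chmod 755 $HOME                        # home directory must not be group/other writable
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys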