CRS-0184 Cannot communicate with the CRS daemon
An Oracle RAC cluster ran into a problem and reported these errors:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4534: Cannot communicate with Event Manager
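Problem analysis: because our website moved to the cloud, one Oracle RAC cluster was brought back from the IDC data center to the company's own premises. The database was shut down before the move by my manager, but only with su - oracle followed by shu immediate. That stopped the Oracle instances, while the ASM instances were never shut down. After the servers arrived at the company, the network cables were plugged back into their original positions and a startup was attempted. I only brought up the instance on ora101 and then left it alone. Later, when the database was actually needed, I looked at ora102 and realized that neither the database instance nor the ASM instance was running there. Trying to start them produced the errors described below.
First, some background on Oracle RAC.
When the servers need to be rebooted, the Oracle resources should be shut down in the following order.
Method 1:
1) Stop the Oracle database instances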
[grid@ora102 ~]$ srvctl stop database -d ORCL
2) Stop the ASM instances
[grid@ora102 ~]$ srvctl stop asm -n ora102
[grid@ora102 ~]$ srvctl stop asm -n ora101
If the stop fails with an error like the one below, force it:
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option is enough:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
3) Finally, stop CRS as well
[root@ora101 bin]# ./crsctl stop cluster -all
Method 2:
1) Stop the Oracle instance; run this on both nodes
su - oracle
sqlplus / as sysdba
shu immediate
2) Stop the ASM instance; run this on both nodes
su - grid
sqlplus / as sysasm
shu immediate
If that does not work, force the shutdown from SQL*Plus with shu abort:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
3) Finally, stop CRS as well
[root@ora101 bin]# ./crsctl stop cluster -all
Check the status of the database and ASM instances, and of CRS:
[grid@ora101 ~]$ srvctl status asm
ASM is running on ora101,ora102
[grid@ora101 ~]$ srvctl status database -d ORCL
Instance orcl1 is not running on node ora101
Instance orcl2 is not running on node ora102
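The status output above can also be checked from a script. The following is only a minimal sketch: db_instances_down is a hypothetical helper name that is not part of any Oracle tool, and it assumes the "Instance ... is not running on node ..." wording shown above, which may differ between releases.

```shell
# Hypothetical helper: read `srvctl status database -d <db>` output on stdin
# and print one "instance@node" line for every instance reported as down.
# Assumes the message wording shown in this article.
db_instances_down() {
  awk '/^Instance .* is not running on node / { print $2 "@" $NF }'
}
```

A typical call would be srvctl status database -d ORCL | db_instances_down; empty output means no instance was reported as down.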
Now, back to the problem at hand.
[root@ora102 ~]# su - grid
[grid@ora102 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Wed Nov 29 22:28:20 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> startup
The startup failed with an error.
Checking the cluster service status on node ora102 also fails:
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
This error shows that the problem lies with CRS.
Trying to start CRS fails as well (note that this must be run as root):
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
A successful start normally looks like this:
[root@ora102 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Checking the CRS services confirms the problem:
[grid@ora102 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
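To pick out the failing components without reading the output by eye, the check can be filtered. This is a rough sketch under the assumption that healthy lines always end in "is online", as they do in the outputs shown in this article; crs_failed_checks is a hypothetical helper name.

```shell
# Hypothetical helper: read `crsctl check crs` output on stdin and print the
# message code of every line that does not end in "is online".
crs_failed_checks() {
  awk '/^CRS-[0-9]+:/ && !/ is online[ \t\r]*$/ {
    code = $1
    sub(/:$/, "", code)   # "CRS-4535:" -> "CRS-4535"
    print code
  }'
}
```

For example, crsctl check crs | crs_failed_checks prints nothing on a healthy node and one code per failing component otherwise.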
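Next I looked at the IP addresses on node ora102 and found that its VIP and the SCAN IP were no longer there; the VIP was now on node ora101. From this we can conclude that node ora102 had dropped out of the cluster.
The IP configuration: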
[root@ora102 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.44 ora101
192.168.0.45 ora102
192.168.0.46 ora101-vip
192.168.0.47 ora102-vip
192.168.0.48 ora-cluster-scan
172.168.56.101 ora101-priv
172.168.56.102 ora102-priv
Listing the node's addresses shows that only the physical public IP (192.168.0.45) is left; the VIP (192.168.0.47) is gone.
[root@ora102 ~]# ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp11s0f0: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.45/24 brd 192.168.0.255 scope global enp11s0f0
valid_lft forever preferred_lft forever
inet6 fe80::f451:31ab:4b4a:b224/64 scope link
valid_lft forever preferred_lft forever
3: enp11s0f1: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:42 brd ff:ff:ff:ff:ff:ff
inet 172.168.56.102/24 brd 172.168.56.255 scope global enp11s0f1
valid_lft forever preferred_lft forever
inet 169.254.20.215/16 brd 169.254.255.255 scope global enp11s0f1:1
valid_lft forever preferred_lft forever
inet6 fe80::7ee2:d8da:d7fa:12d5/64 scope link
valid_lft forever preferred_lft forever
4: enp0s29f0u2: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether 5e:f3:fc:de:63:43 brd ff:ff:ff:ff:ff:ff
5: virbr0: mtu 1500 qdisc noqueue state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
6: virbr0-nic: mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
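The same check can be scripted instead of scanning the listing by eye. This is a small sketch, assuming the plain ip a layout shown above in which each address appears after the word inet; has_ipv4 is a hypothetical helper name.

```shell
# Hypothetical helper: read `ip a` output on stdin and succeed (exit 0) only
# if the given IPv4 address is configured on some interface.
has_ipv4() {
  awk -v want="$1" '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "inet") {
          split($(i + 1), parts, "/")   # "192.168.0.45/24" -> "192.168.0.45"
          if (parts[1] == want) found = 1
        }
      }
    }
    END { exit(found ? 0 : 1) }
  '
}
```

On ora102, ip a | has_ipv4 192.168.0.47 || echo "ora102-vip is not on this node" would have flagged the missing VIP.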
Resolving the problem
First, try restarting CRS on node 2.
Stop CRS:
[root@ora102 bin]# ./crsctl stop crs
or
[root@ora102 bin]# ./crsctl stop cluster
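Then start the clusterware.
Method 1 and method 2 below differ in scope: crsctl start/stop crs manages only the local node's clusterware stack and cannot act on remote nodes, whereas crsctl start/stop cluster can manage the local clusterware stack or the entire cluster. crsctl start cluster also expects Oracle High Availability Services to be running already.
With -all, crsctl start cluster starts the clusterware on every node, that is, the whole cluster; with -n it starts the clusterware on the named nodes.
Method 1: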
[root@ora102 bin]# ./crsctl start crs
or
Method 2:
[root@ora102 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'ora102'
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'ora102' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'ora102'
CRS-2681: Clean of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'ora102'
CRS-2676: Start of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'ora102'
CRS-2676: Start of 'ora.crsd' on 'ora102' succeeded
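If the problem persists, clear the clusterware configuration on node 2 and then rerun root.sh. Note that rootcrs.pl is the script for cluster nodes, while roothas.pl is its counterpart for a standalone Oracle Restart home.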
[root@ora102 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
[root@ora102 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
[root@ora102 bin]# /u01/app/11.2.0/grid/root.sh
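Then check whether the status is normal. If it is not, restart CRS once more and it should come up.
Checking the status shows that everything is normal: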
[root@ora102 bin]# ./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    ora101
ora.FRA.dg     ora....up.type ONLINE    ONLINE    ora101
ora....ER.lsnr ora....er.type ONLINE    ONLINE    ora101
ora....N1.lsnr ora....er.type ONLINE    ONLINE    ora101
ora.OCR.dg     ora....up.type ONLINE    ONLINE    ora101
ora.asm        ora.asm.type   ONLINE    ONLINE    ora101
ora.cvu        ora.cvu.type   ONLINE    ONLINE    ora101
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    ora101
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    ora101
ora.ons        ora.ons.type   ONLINE    ONLINE    ora101
ora....SM1.asm application    ONLINE    ONLINE    ora101
ora....01.lsnr application    ONLINE    ONLINE    ora101
ora.ora101.gsd application    OFFLINE   OFFLINE
ora.ora101.ons application    ONLINE    ONLINE    ora101
ora.ora101.vip ora....t1.type ONLINE    ONLINE    ora101
ora....SM2.asm application    ONLINE    ONLINE    ora102
ora....02.lsnr application    ONLINE    ONLINE    ora102
ora.ora102.gsd application    OFFLINE   OFFLINE
ora.ora102.ons application    ONLINE    ONLINE    ora102
ora.ora102.vip ora....t1.type ONLINE    ONLINE    ora102
ora.orcl.db    ora....se.type ONLINE    ONLINE    ora101
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    ora101
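A listing this long is easy to misread, so it can help to print only the resources that should be up but are not. The sketch below assumes the five-column crs_stat -t layout shown above (Name, Type, Target, State, Host); crs_resources_not_online is a hypothetical helper name. crs_stat is deprecated in 11.2 in favor of crsctl status resource -t, whose output is laid out differently, so this filter would not apply to it unchanged.

```shell
# Hypothetical helper: read `crs_stat -t` output on stdin and print resources
# whose Target is ONLINE but whose State is something else.
# Resources that are intentionally OFFLINE/OFFLINE (such as ora.gsd) are skipped.
crs_resources_not_online() {
  awk '$1 != "Name" && $3 == "ONLINE" && $4 != "ONLINE" { print $1 }'
}
```

Usage would be ./crs_stat -t | crs_resources_not_online; for the listing above it prints nothing.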
Check the OCR status:
[grid@ora101 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2948
Available space (kbytes) : 259172
ID : 87127720
Device/File Name : +OCR
Device/File integrity check succeeded
Device/File not configured
Device/File not configured
Device/File not configured
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user
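For monitoring, the two figures that matter most in this output are how full the OCR is and whether the integrity check passed. The following sketch extracts both; ocr_summary is a hypothetical helper name and the parsing assumes the label wording shown above.

```shell
# Hypothetical helper: read `ocrcheck` output on stdin and print the share of
# OCR space in use plus whether the cluster registry integrity check passed.
ocr_summary() {
  awk -F':' '
    /Total space/ { total = $2 + 0 }
    /Used space/  { used = $2 + 0 }
    /Cluster registry integrity check succeeded/ { ok = 1 }
    END {
      if (total > 0) printf "used_pct=%.1f ", used * 100 / total
      print "integrity=" (ok ? "ok" : "unknown")
    }
  '
}
```

Run as ocrcheck | ocr_summary. With the figures above (2948 of 262120 kbytes used) the OCR is a little over 1% full.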
Check the CRS status; everything is normal:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Side notes
1. Errors when stopping the ASM instance
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option is enough:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
Or force the shutdown from SQL*Plus with shu abort:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
Check CRS at this point:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Stopping CRS with crsctl stop crs also stops ASM and its disk groups.
The output of the stop shows the VIP failing over:
[root@ora101 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.crsd' on 'ora101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.OCR.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.FRA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'ora101'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ora101.vip' on 'ora101'
CRS-2677: Stop of 'ora.FRA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.DATA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ora101.vip' on 'ora101' succeeded
CRS-2672: Attempting to start 'ora.ora101.vip' on 'ora102'
CRS-2676: Start of 'ora.ora101.vip' on 'ora102' succeeded    <-- the VIP has failed over to ora102
CRS-2677: Stop of 'ora.OCR.dg' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'ora101'
CRS-2677: Stop of 'ora.ons' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'ora101'
CRS-2677: Stop of 'ora.net1.network' on 'ora101' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'ora101' has completed
CRS-2677: Stop of 'ora.crsd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'ora101'
CRS-2673: Attempting to stop 'ora.evmd' on 'ora101'
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ora101'
CRS-2677: Stop of 'ora.evmd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ora101'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ora101'
CRS-2677: Stop of 'ora.cssd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'ora101'
CRS-2677: Stop of 'ora.crf' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ora101'
CRS-2677: Stop of 'ora.gipcd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ora101'
CRS-2677: Stop of 'ora.gpnpd' on 'ora101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
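In a long stop log the failover is easy to miss, because it shows up as a single CRS-2676 "Start of" message in the middle of the stop messages. The sketch below pulls those lines out; relocated_resources is a hypothetical helper name, and it relies on the quoting style of the messages shown above.

```shell
# Hypothetical helper: read `crsctl stop crs` output on stdin and print
# "resource -> node" for every resource that was started during the stop,
# which is how a relocation (such as a VIP failover) appears in the log.
relocated_resources() {
  awk -F"'" '/^CRS-2676: Start of / { print $2 " -> " $4 }'
}
```

Applied to the log above, it reports that ora.ora101.vip was started on ora102.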
To start ASM, start the CRS service first:
[root@ora101 bin]# ./crsctl start crs
[root@ora101 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Start the RAC instances and the database:
[grid@ora102 ~]$ srvctl start asm
PRCC-1014 : asm was already running
[root@ora101 bin]# ./srvctl start database -d ORCL
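2. A brief overview of the CRS architecture
1) Cluster Synchronization Services (CSS): manages the cluster configuration, that is, who is a member, who joins and who leaves, and notifies the members.
2) Cluster Ready Services (CRS): the main program that manages high-availability operations in the cluster. Everything CRS manages is treated as a resource, including databases, instances, services, listeners, VIP addresses and application processes. The CRS process manages cluster resources according to the configuration stored in the OCR, which covers starting, stopping, monitoring and failing them over. When a resource changes state, the CRS process generates an event. Once RAC is installed, the CRS process monitors the resources and automatically restarts any that fail; it generally retries 5 times and gives up if the restart still does not succeed.
3) Event Management (EVM): a background process that publishes the events generated by CRS.
4) Oracle Notification Service (ONS): a publish/subscribe service for communicating FAN (Fast Application Notification) messages.
5) RACG: extends the clusterware to support Oracle-specific requirements and complex resources.
6) Process Monitor Daemon (OPROCD): locked in memory, it monitors the cluster and performs I/O fencing. It works as a hang checker: it checks, sleeps, checks and sleeps again, and if it wakes up later than expected it reboots the node.
Notes:
By default the CRS stack starts automatically when the operating system boots. To turn this off for maintenance, run the first command below as root; the second command turns it back on.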
[root@rac1 bin]# ./crsctl disable crs
[root@rac1 bin]# ./crsctl enable crs
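These commands work by changing the contents of the file /etc/oracle/scls_scr/raw/root/crsstart.
CRS consists of three services, CRS, CSS and EVM, and each service is made up of a series of modules. crsctl can trace each module and write the trace output to the log.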
[root@rac1 bin]# ./crsctl lsmodules css
[root@rac1 bin]# ./crsctl lsmodules evm
Trace the CSSD module (must be run as root):
[root@rac1 bin]# ./crsctl debug log css "CSSD:1"
Configuration parameter trace is now set to 1.
Set CRSD Debug Module: CSSD Level: 1
View the trace log:
[root@rac1 cssd]# pwd
/u01/app/oracle/product/crs/log/rac1/cssd
[root@rac1 cssd]# more ocssd.log
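3. Oracle Cluster Registry (OCR)
The OCR manages the configuration information of Oracle Clusterware and the Oracle RAC databases, much like the Windows registry. Alongside it there is the Oracle Local Registry (OLR), which exists on every node of the cluster and manages that node's own cluster configuration. Oracle Clusterware keeps the configuration of the whole cluster on shared storage, and that storage is the OCR disk. Only one node in the cluster, called the master node, can read and write the OCR disk. Every node keeps a copy of the OCR in memory, and an OCR process on each node reads from that in-memory copy. When the OCR contents change, the OCR process on the master node synchronizes the change to the OCR processes on the other nodes.
ocrcheck:
The ocrcheck command checks the consistency of the OCR contents. Each run writes a log file named ocrcheck_pid.log, where pid is the process ID, under $CRS_HOME/log/nodename/client. The command needs no arguments.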
[root@rac1 bin]# ./ocrcheck
4. Finally, check the database status
1) Check the status of the database instances:
[root@ora102 bin]# ./srvctl status database -d ORCL
Instance orcl1 is running on node ora101
Instance orcl2 is running on node ora102
2) Check the status of the ASM instances:
[root@ora102 bin]# ./srvctl status asm
ASM is running on ora101,ora102
3) Check the CRS status; the following output is what a healthy node shows:
[root@ora102 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Check the daemons individually:
[root@rac1 bin]# ./crsctl check cssd
CSS appears healthy
[root@rac1 bin]# ./crsctl check crsd
CRS appears healthy
[root@rac1 bin]# ./crsctl check evmd
EVM appears healthy
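Summary: an Oracle RAC cluster is a single unit, and its nodes should be started and stopped together. If you start only one node, the other node's VIP fails over to it, and the voting disk vote evicts the other node from the cluster, which is the split-brain situation. The basic approach to fixing it is to restart CRS on the evicted node first (crsctl stop crs, then crsctl start crs). If that does not help, clear the configuration on node 2, rerun root.sh, and then run crsctl start crs to bring CRS up.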
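After restarting CRS on the evicted node, the stack takes a while to come up, so a single crsctl check crs straight after the start can be misleading. The sketch below polls until the check looks healthy. It is an illustration under stated assumptions, not a tested recovery tool: wait_for_crs is a hypothetical helper name, and "healthy" is taken to mean that every CRS-nnnn line ends in "is online", as in the outputs shown in this article.

```shell
# Hypothetical helper: run a status command repeatedly until its output looks
# like a healthy `crsctl check crs` (at least one CRS-nnnn line, and all of
# them ending in "is online"). Returns 0 on success, 1 if attempts run out.
# Usage: wait_for_crs ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_crs() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    verdict=$("$@" 2>&1 | awk '
      /^CRS-[0-9]+:/ { if ($0 ~ / is online[ \t\r]*$/) ok++; else bad++ }
      END { if (ok > 0 && bad == 0) print "healthy"; else print "unhealthy" }
    ')
    if [ "$verdict" = "healthy" ]; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

As root on the evicted node, the restart could then be followed by wait_for_crs 30 10 /u01/app/11.2.0/grid/bin/crsctl check crs, which waits up to about five minutes. The attempt count and delay are arbitrary values chosen for illustration.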