This post walks through how a GoldenGate incident on a production system was handled.
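Problem description:
Our production GoldenGate (GG) deployment went live last Wednesday evening, April 19, and was configured on node 3 of the RAC. When node 3 was rebooted, an oversight meant that only 24 GB of its 32 GB of memory was recognized after startup, and nobody noticed at the time. After a few days of running we found that once or twice a day the concurrency on node 3 became very high: the operating system load average jumped from single digits to 50 or 60, and the database appeared to hang briefly. At exactly those moments the GG extract process was at the top of top, and the operating system had only a few tens of KB of free memory left. We first suspected the NFS mounts, but testing showed nothing wrong there, so we decided to fix node 3's memory as an emergency change. The steps taken are described below.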
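The missing 8 GB went unnoticed for days, so a post-reboot check would have caught it early. Here is a minimal sketch of such a check, assuming a Linux host where `/proc/meminfo` is readable. The function names, the 32 GiB expectation, and the 2 GiB tolerance are illustrative choices, not part of the original procedure.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as integers (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_shortfall_gib(meminfo_text, expected_gib, tolerance_gib=2.0):
    """Return the missing GiB if MemTotal is well below expected, else 0.0.

    The tolerance absorbs the memory the kernel reserves for itself,
    which makes MemTotal slightly smaller than the installed amount.
    """
    total_gib = parse_meminfo(meminfo_text)["MemTotal"] / (1024 * 1024)
    missing = expected_gib - total_gib
    return round(missing, 1) if missing > tolerance_gib else 0.0
```

On the host itself, the text would come from `open("/proc/meminfo").read()`. A nonzero result right after a reboot is a signal to stop and inspect the hardware before putting load back on the node.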
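After work ended at 6 pm, the website and BOSS were still fairly busy between 6 and 9 pm, so nothing was done during that window. At 9 pm I asked the operations staff to stop all Tomcat instances on node 3. I then stopped GG, unmounted NFS, shut down all database processes on node 3, and finally powered the machine off. The GG commands were: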
GGSCI (rac3) 21> stop extksr1
GGSCI (rac3) 21> stop dpksr1
GGSCI (rac3) 21> stop mgr
While the processes were stopping, the GG error log recorded the following:
2012-04-26 20:57:39 INFO OGG-00497 Oracle GoldenGate Capture for Oracle, extksr1.prm: Writing DDL operation to extract trail file.
2012-04-26 21:01:36 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop extksr1.
2012-04-26 21:01:38 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, extksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:39 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 stopped normally.
2012-04-26 21:01:41 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop dpksr1.
2012-04-26 21:01:43 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:43 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, dpksr1.prm: EXTRACT DPKSR1 stopped normally.
2012-04-26 21:01:47 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop mgr.
2012-04-26 21:01:49 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from GGSCI on host 10.1.8.49 (STOP).
2012-04-26 21:01:49 WARNING OGG-00938 Oracle GoldenGate Manager for Oracle, mgr.prm: Manager is stopping at user request.
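Before unmounting anything, it helps to confirm from the log that every Extract group really stopped cleanly. The sketch below scans error-log text for the `OGG-00991 ... stopped normally` lines seen above. The function names are hypothetical, and the line layout is inferred only from the excerpt in this post, so treat the patterns as assumptions.

```python
import re

# Layout inferred from the excerpt above: timestamp, severity, code, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<severity>INFO|WARNING|ERROR)\s+"
    r"(?P<code>OGG-\d{5})\s+(?P<message>.*)$"
)
STOPPED = re.compile(r"EXTRACT (\w+) stopped normally")


def stopped_groups(log_text):
    """Return group names reported as stopped normally, in log order."""
    groups = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if not m or m.group("code") != "OGG-00991":
            continue
        hit = STOPPED.search(m.group("message"))
        if hit:
            groups.append(hit.group(1))
    return groups


def safe_to_shut_down(log_text, expected_groups):
    """True when every expected group has a clean-stop line in the text."""
    return set(expected_groups) <= set(stopped_groups(log_text))
```

The error log is cumulative, so only the lines written since the stop commands were issued should be passed in. Otherwise a clean stop from an earlier day would count.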
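Once the relevant processes had stopped, I unmounted NFS: the mounts for nodes 1 and 2 and the shared storage were all removed with umount. The exact commands are simple and are omitted here. One point worth mentioning: unmounting the shared storage can fail with a "resource busy" error, and adding the -l option (lazy unmount) gets around it. Also, after the GG processes on the source side have all stopped, the processes on the GG target side still show as RUNNING, but the target's error log notes that the extract has stopped: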
2012-04-26 20:54:38 INFO OGG-00484 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Executing DDL operation.
2012-04-26 20:54:38 INFO OGG-00483 Oracle GoldenGate Delivery for Oracle, repksr1.prm: DDL operation successful.
2012-04-26 20:54:38 INFO OGG-01408 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Restoring current schema for DDL operation to [OGG].
2012-04-26 20:58:41 INFO OGG-01735 Oracle GoldenGate Collector: Synchronizing /home/oracle/ggs/trails/t1000239 to disk.
2012-04-26 20:58:41 INFO OGG-01670 Oracle GoldenGate Collector: Closing /home/oracle/ggs/trails/t1000239.
2012-04-26 20:58:41 INFO OGG-01675 Oracle GoldenGate Collector: Terminating because extract is stopped.
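With those steps done, I stopped the database processes and services on node 3 (details omitted), shut the machine down, and notified the colleague standing by in the machine room, who then worked on the memory problem. About 30 minutes later the memory problem was resolved and the server was back up, and I began the follow-up work.
First, start the portmap and nfs services on node 3 (details omitted).
Next, mount nodes 1 and 2 and the shared storage. Starting the mgr process after that failed with the following errors: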
2012-04-26 21:50:18 ERROR OGG-01117 Oracle GoldenGate Command Interpreter for Oracle: Received signal: Program interrupt (2).
2012-04-26 21:50:18 ERROR OGG-01668 Oracle GoldenGate Command Interpreter for Oracle: PROCESS ABENDING.
2012-04-26 21:51:43 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): start mgr.
2012-04-26 21:52:13 ERROR OGG-01454 Oracle GoldenGate Manager for Oracle, mgr.prm: Unable to lock file "/share_disk/ggs/dirpcs/MGR.pcm" (error 37, No locks available).
2012-04-26 21:52:13 ERROR OGG-01668 Oracle GoldenGate Manager for Oracle, mgr.prm: PROCESS ABENDING.
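The "error 37, No locks available" in the OGG-01454 line above is `ENOLCK` on Linux. One way to tell whether file locking works on a mount, before starting Manager, is to take a lock on a scratch file in the same file system. The sketch below does that with POSIX advisory locks. That Manager uses this same kind of lock internally is an assumption, and the function name is made up for illustration.

```python
import errno
import fcntl
import os


def probe_file_lock(path):
    """Try to take and release an exclusive advisory lock on path.

    Returns "ok", "no-locks-available" (ENOLCK, typical of an NFS mount
    whose lock service is not running), or "held-by-another-process".
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking, so the probe never hangs on a held lock.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOLCK:
                return "no-locks-available"
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return "held-by-another-process"
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return "ok"
    finally:
        os.close(fd)
```

The probe should point at a throwaway file under the shared mount, not at `MGR.pcm` itself, so that it cannot interfere with a running Manager.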
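The OGG-01454 error above means that the mgr process could not obtain a lock on a file on the shared storage, which blocks every subsequent step. The fix is simple: start the nfslock service on node 3, then start mgr again. Once mgr was up, the extract process turned out to have abended, and the error log showed the following extract errors: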
2012-04-26 21:54:34 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Rolling over remote file /home/oracle/ggs/trails/t1000240.
2012-04-26 21:54:34 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for target file /home/oracle/ggs/trails/t1000240, at RBA 1022.
2012-04-26 21:54:34 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for all targets.
2012-04-26 21:54:35 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, extksr1.prm: Could not find archived log for sequence 16857 thread 3 under alternative destinations. SQL <SELECT MAX(sequence#) FROM v$log WHERE thread# = :ora_thread>. Last alternative log tried /arch/rac3/3_16857_744833311.dbf, error retrieving redo file name for sequence 16857, archived = 1, use_alternate = 0Not able to establish initial position for sequence 16857, rba 1529360.
2012-04-26 21:54:35 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, extksr1.prm: PROCESS ABENDING.
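The cause is simple: when node 3 was shut down, its VIP failed over to another node, so the archived logs belonging to node 3 (thread 3 in the error above) ended up on other nodes. When GG tried to read node 3's archived logs, the ones it needed were not in the expected directory, so the extract abended. With the cause clear, the fix is straightforward: copy node 3's archived logs over from the other nodes, then start the extract again: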
2012-04-26 21:57:22 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 started.
2012-04-26 21:57:22 INFO OGG-01055 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery initialization completed for target file /share_disk/ggs/trails/s1000239, at RBA 24518902.
2012-04-26 21:57:22 INFO OGG-01478 Oracle GoldenGate Capture for Oracle, extksr1.prm: Output file /share_disk/ggs/trails/s1 is using format RELEASE 10.4/11.1.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 1, Sequence 29645, RBA 18568720, SCN 18.122009990, Apr 26, 2012 9:01:24 PM.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 2, Sequence 28161, RBA 12794496, SCN 18.122010368, Apr 26, 2012 9:01:32 PM.
2012-04-26 21:57:24 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, extksr1.prm: Rolling over remote file /share_disk/ggs/trails/s1000239.
2012-04-26 21:57:24 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for target file /share_disk/ggs/trails/s1000240, at RBA 1019.
2012-04-26 21:57:24 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for all targets.
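The copy step above depends on knowing which archived logs to fetch. The OGG-00446 message already names the thread, the sequence, and the path that was tried, so the list can be derived from it. The sketch below is a rough illustration: the `thread_sequence_resetlogs.dbf` naming is inferred only from the single path in that message, and both function names are hypothetical.

```python
import re

MISSING = re.compile(
    r"Could not find archived log for sequence (?P<seq>\d+) thread (?P<thread>\d+)"
    r".*?Last alternative log tried (?P<path>[^,\s]+)"
)
# Assumed file naming: <thread>_<sequence>_<resetlogs id>.dbf
ARCHIVE_NAME = re.compile(r"^(\d+)_(\d+)_(\d+)\.dbf$")


def missing_archive(message):
    """Extract thread, sequence and tried path from an OGG-00446 message."""
    m = MISSING.search(message)
    if not m:
        return None
    return {
        "thread": int(m.group("thread")),
        "sequence": int(m.group("seq")),
        "path": m.group("path"),
    }


def archives_to_copy(missing, candidate_names):
    """Pick files of the same thread from the missing sequence onward."""
    wanted = []
    for name in candidate_names:
        m = ARCHIVE_NAME.match(name)
        if not m:
            continue
        thread, seq = int(m.group(1)), int(m.group(2))
        if thread == missing["thread"] and seq >= missing["sequence"]:
            wanted.append((seq, name))
    return [name for _, name in sorted(wanted)]
```

`candidate_names` would be a directory listing taken from the archive destination on the other node. The result is the set of files to copy into the directory the extract reads from.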
GG source (primary) side:
GGSCI (rac3) 20> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPKSR1      00:00:00      00:00:00
EXTRACT     RUNNING     EXTKSR1     00:00:00      00:00:04
GG target (standby) side:
GGSCI (rptdb) 7> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     REPKSR1     00:00:00      00:00:00
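For the follow-up monitoring, the `info all` output shown above can be checked mechanically instead of by eye. The sketch below parses captured output and reports any process that is not RUNNING or whose lag exceeds a threshold. The function names and the 60-second threshold are illustrative assumptions, and the parser expects the lag in `HH:MM:SS` form as in the listings above.

```python
def parse_info_all(text):
    """Parse captured GGSCI `info all` output into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in ("MANAGER", "EXTRACT", "REPLICAT"):
            continue  # skips the prompt line, the header and blank lines
        row = {"program": parts[0], "status": parts[1] if len(parts) > 1 else ""}
        if len(parts) >= 5:
            row.update(group=parts[2], lag=parts[3], since_chkpt=parts[4])
        rows.append(row)
    return rows


def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def unhealthy(rows, max_lag_seconds=60):
    """Return the names of processes that are down or lagging."""
    bad = []
    for row in rows:
        if row["status"] != "RUNNING":
            bad.append(row.get("group", row["program"]))
        elif "lag" in row and to_seconds(row["lag"]) > max_lag_seconds:
            bad.append(row["group"])
    return bad
```

An empty list from `unhealthy(parse_info_all(output))` corresponds to the healthy state shown in both listings above.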
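I then watched the systems for a while and found no further problems with the main site or GG. The whole procedure took about an hour. We will keep monitoring for another week.
Recorded here for future reference.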