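Part1: Preface
In a replica set architecture, we often use rs.add() and rs.remove() to adjust the backend database topology. In this case we unexpectedly triggered a MongoDB bug and promptly contacted MongoDB staff about it. The work was an instance migration in production: a MongoDB replica set maintained by the development team was being moved under DBA management. Because neither the hardware nor the version met our standards, we first upgraded the cluster and then used rs.add() and rs.remove() to complete the migration.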
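Part2: Background
The database maintained by the development team was running version 2.6. We first upgraded it to 3.4 and used a 3.4 feature to enable authentication with zero downtime. The original architecture included offline nodes, that is, hidden and non-voting nodes. It was precisely this characteristic that triggered the MongoDB crash.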
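Part1: Overall architecture
The original cluster was a 7-node replica set. Two of the nodes were hidden and were also configured as non-voting (votes: 0).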
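Part2: Migration approach
We used rs.add() to join the new machines to the existing replica set. Once the new nodes had finished syncing, we used rs.remove() to drop the old development machines, which completed the instance migration.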
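The add-then-wait part of this approach can be sketched on plain config documents. This is only an illustration under assumptions: the host names and the `rs0` set name are made up, `add_member` only approximates what rs.add() does to the config (next free `_id`, higher `version`), and `ready_to_remove_old` looks solely at the `stateStr` field of replSetGetStatus output, whereas a fuller check would also compare optimes.

```python
import copy


def add_member(config, host):
    """Return a new config with `host` appended, roughly as rs.add() would."""
    new_config = copy.deepcopy(config)
    next_id = max(m["_id"] for m in new_config["members"]) + 1
    new_config["members"].append({"_id": next_id, "host": host})
    new_config["version"] += 1  # a reconfig must carry a higher version
    return new_config


def ready_to_remove_old(status, new_hosts):
    """True once every new host reports SECONDARY in replSetGetStatus output."""
    states = {m["name"]: m["stateStr"] for m in status["members"]}
    return all(states.get(h) == "SECONDARY" for h in new_hosts)


# Hypothetical example data (host names are placeholders).
config = {
    "_id": "rs0",
    "version": 7,
    "members": [
        {"_id": 0, "host": "old-a:27017"},
        {"_id": 1, "host": "old-b:27017"},
    ],
}
config_after_add = add_member(config, "new-a:27017")

status = {
    "members": [
        {"name": "old-a:27017", "stateStr": "PRIMARY"},
        {"name": "old-b:27017", "stateStr": "SECONDARY"},
        {"name": "new-a:27017", "stateStr": "STARTUP2"},  # still in initial sync
    ]
}
print(config_after_add["members"][-1])
print(ready_to_remove_old(status, ["new-a:27017"]))
```

Keeping the functions free of any server connection makes the plan easy to review before a single command reaches production.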
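Part3: Error log
We had written the migration document in advance. Because rs.remove() returns very quickly, we copied and pasted several rs.remove() calls in one go and removed multiple nodes back to back. This may be one of the causes of the later crash.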
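A more cautious alternative to pasting several rs.remove() calls at once is to remove one member per reconfig and check the set in between. The sketch below only derives the sequence of config documents; the host names are placeholders, and the PyMongo call in the comment is shown as one possible way to apply each step, not as what we ran.

```python
import copy


def removal_steps(config, hosts_to_remove):
    """Yield one config per removed host, each with a higher version."""
    current = copy.deepcopy(config)
    for host in hosts_to_remove:
        remaining = [m for m in current["members"] if m["host"] != host]
        if len(remaining) == len(current["members"]):
            raise ValueError(f"{host} is not a member of the set")
        current = {**current, "members": remaining, "version": current["version"] + 1}
        yield copy.deepcopy(current)


# Hypothetical 5-member config; host names are placeholders.
config = {
    "_id": "rs0",
    "version": 10,
    "members": [{"_id": i, "host": f"node{i}:27017"} for i in range(5)],
}
steps = list(removal_steps(config, ["node3:27017", "node4:27017"]))
for step in steps:
    # In a real run each step would be sent on its own, for example:
    #   client.admin.command("replSetReconfig", step)
    # followed by a pause until replSetGetStatus shows a healthy set.
    print(step["version"], [m["host"] for m in step["members"]])
```

Whether spacing the removals out would have avoided the crash in our case is unknown; it simply removes one of the suspected factors.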
rs.remove()
2018-04-17T14:54:23.793+0800 I NETWORK [conn163572] received client metadata from 192.168.1.100:16400 conn163572: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.4 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }
2018-04-17T14:54:23.811+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57568 #163573 (73 connections now open)
2018-04-17T14:54:23.811+0800 I NETWORK [conn163573] received client metadata from 192.168.1.101:57568 conn163573: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }
2018-04-17T14:54:23.818+0800 I - [replication-25230] Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620
2018-04-17T14:54:23.818+0800 I - [replication-25230] ***aborting after invariant() failure
2018-04-17T14:54:23.822+0800 I NETWORK [thread1] connection accepted from 192.168.1.102:32210 #163574 (74 connections now open)
2018-04-17T14:54:23.822+0800 I NETWORK [conn163574] received client metadata from 192.168.1.102:32210 conn163574: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }
2018-04-17T14:54:23.822+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57569 #163575 (75 connections now open)
2018-04-17T14:54:23.823+0800 I NETWORK [thread1] connection accepted from 192.168.1.103:50661 #163576 (76 connections now open)
2018-04-17T14:54:23.824+0800 I NETWORK [conn163576] received client metadata from 192.168.1.103:50661 conn163576: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }
2018-04-17T14:54:23.830+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57570 #163577 (77 connections now open)
2018-04-17T14:54:23.830+0800 I NETWORK [conn163577] received client metadata from 192.168.1.101:57570 conn163577: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }
2018-04-17T14:54:23.832+0800 I NETWORK [thread1] connection accepted from 192.168.1.104:41163 #163578 (78 connections now open)
2018-04-17T14:54:23.838+0800 F - [replication-25230] Got signal: 6 (Aborted).
0x7ffc3f0a41d1 0x7ffc3f0a3159 0x7ffc3f0a363d 0x7ffc3c0bf500 0x7ffc3bd4f8a5 0x7ffc3bd51085 0x7ffc3e2d7a8e 0x7ffc3ea5883c 0x7ffc3ea58999 0x7ffc3eb19b9d 0x7ffc3eaa2676 0x7ffc3e9c83e9 0x7ffc3e9b0032 0x7ffc3ea41bf7 0x7ffc3e3812f1 0x7ffc3ee25a5a 0x7ffc3ee28e53 0x7ffc3ee2932b 0x7ffc3f0185bc 0x7ffc3f01906c 0x7ffc3f019a56 0x7ffc3fdc7a00 0x7ffc3c0b7851 0x7ffc3be0511d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7FFC3DA22000","o":"16821D1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7FFC3DA22000","o":"1681159"},{"b":"7FFC3DA22000","o":"168163D"},{"b":"7FFC3C0B0000","o":"F500"},{"b":"7FFC3BD1D000","o":"328A5","s":"gsignal"},{"b":"7FFC3BD1D000","o":"34085","s":"abort"},{"b":"7FFC3DA22000","o":"8B5A8E","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7FFC3DA22000","o":"103683C"},{"b":"7FFC3DA22000","o":"1036999"},{"b":"7FFC3DA22000","o":"10F7B9D","s":"_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS8_18OplogQueryMetadataEEENS_6Date_tE"},{"b":"7FFC3DA22000","o":"1080676","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE"},{"b":"7FFC3DA22000","o":"FA63E9","s":"_ZN5mongo4repl31DataReplicatorExternalStateImpl18shouldStopFetchingERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE"},{"b":"7FFC3DA22000","o":"F8E032"},{"b":"7FFC3DA22000","o":"101FBF7","s":"_ZN5mongo4repl12OplogFetcher9_callbackERKNS_10StatusWithINS_7Fetcher13QueryResponseEEEPNS_14BSONObjBuilderE"},{"b":"7FFC3DA22000","o":"95F2F1","s":"_ZN5mongo7Fetcher9_callbackERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEPKc"},{"b":"7FFC3DA22000","o":"1403A5A"},{"b":"7FFC3DA22000","o":"1406E53","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE"},{"b":"7FFC3DA22000","o":"140732B"},{"b":"7FFC3DA22000","o":"15F65BC","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"7FFC3DA22000","o":"15F706C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"7FFC3DA22000","o":"15F7A56","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"7FFC3DA22000","o":"23A5A00"},{"b":"7FFC3C0B0000","o":"7851"},{"b":"7FFC3BD1D000","o":"E811D","s"
:"clone"}],"processInfo":{ "mongodbVersion" : "3.4.9-2.9", "gitVersion" : "dcdd0758e067949b27cfdf61c641905e619756e6", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-279.23.1.mi5.el6.x86_64", "version" : "#1 SMP Mon Sep 2 12:00:41 CST 2013", "machine" : "x86_64" }, "somap" : [ { "b" : "7FFC3DA22000", "elfType" : 3, "buildId" : "BE2829B1404DE963963BBCCA55C6EBBFE34CE20F" }, { "b" : "7FFFA8CFF000", "elfType" : 3, "buildId" : "A6539A72BE0493D91090D15510B7866310CBEA50" }, { "b" : "7FC5C55EA000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D0ABBCCAC542E41D33A638138FEC450AC08A1CF2" }, { "b" : "7FC5C37D9000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "732F8FD5054C4FA43CF0CD4CC8C5FF02CEA3CC54" }, { "b" : "7FC5BCDBF000", "path" : "/usr/lib64/libsasl2.so.2", "elfType" : 3, "buildId" : "C447C77E41A336BA9AEE12D08FBF5D15948D7468" }, { "b" : "7FC5BFF53000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "318EAB33420B000D542F09B91B716BACAB1AD546" }, { "b" : "7FC5C2773000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "3A8D65B9A373C0AFAF106F3A979835B16DBEFF1A" }, { "b" : "7FC5C416B000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "5E9DDD9EE40AD0D4DDD032CE1086E402B7FA955A" }, { "b" : "7FC5C5367000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "15B0822C819020F18BBF0E0C0286373155E03BE2" }, { "b" : "7FC5C40E3000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "A26BC945B5765B1258DB01FFEFB0C4F53F3961D7" }, { "b" : "7FC5C1ACD000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "CE152B8676517F23E7F54AD6408330979BE41443" }, { "b" : "7FC5C44B0000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "14853815DD64F2B830B8DCCB3A958A3804E13EFC" }, { "b" : "7FC5C451D000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F2BBDD778ABFECFBA0C59BBCBA94D1151DDF96E4" }, { "b" : "7FC5C6800000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" 
: 3, "buildId" : "D6BD776B36DAC438642CF84B282956738727901D" }, { "b" : "7FC5C2703000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "31545FEA4F1F72061992E79A1DF461EC719942E8" }, { "b" : "7FC5C04CC000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "5F5EB7F30B61E0DAF6BBF8A367C388A54B7010EA" }, { "b" : "7FC5BEA88000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "76A3DEEB6876CBED69A57D3EBC1E2AFBCA84EC76" }, { "b" : "7FC5BEBA2000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "605701A8AE551604303523B4F0D3A7E98CF9E153" }, { "b" : "7FC5C059E000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "4623A78918C882770E81AE7B5EE9DDF8DD2B6674" }, { "b" : "7FC5BEB72000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "190D45F6743DEF9DF8169D65801D4575B01825BD" }, { "b" : "7FC5BF510000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "68195872ECFB188389D29AAF01031A976FD18168" }, { "b" : "7FC5BEB05000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "DAE2A7E4E8B37D43EF6839FF5D8E012AFCF21A69" }, { "b" : "7FC5BF902000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "8A8734DC37305D8CC2EF8F8C3E5EA03171DB07EC" }, { "b" : "7FC5C16E3000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "A287DC6B86A9823038F057105CE64671E0B392EC" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7ffc3f0a41d1]
mongod(+0x1681159) [0x7ffc3f0a3159]
mongod(+0x168163D) [0x7ffc3f0a363d]
libpthread.so.0(+0xF500) [0x7ffc3c0bf500]
libc.so.6(gsignal+0x35) [0x7ffc3bd4f8a5]
libc.so.6(abort+0x175) [0x7ffc3bd51085]
mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7ffc3e2d7a8e]
mongod(+0x103683C) [0x7ffc3ea5883c]
mongod(+0x1036999) [0x7ffc3ea58999]
mongod(_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS8_18OplogQueryMetadataEEENS_6Date_tE+0x32D) [0x7ffc3eb19b9d]
mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE+0xB6) [0x7ffc3eaa2676]
mongod(_ZN5mongo4repl31DataReplicatorExternalStateImpl18shouldStopFetchingERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE+0x59) [0x7ffc3e9c83e9]
mongod(+0xF8E032) [0x7ffc3e9b0032]
mongod(_ZN5mongo4repl12OplogFetcher9_callbackERKNS_10StatusWithINS_7Fetcher13QueryResponseEEEPNS_14BSONObjBuilderE+0x1E57) [0x7ffc3ea41bf7]
mongod(_ZN5mongo7Fetcher9_callbackERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEPKc+0x621) [0x7ffc3e3812f1]
mongod(+0x1403A5A) [0x7ffc3ee25a5a]
mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x1B3) [0x7ffc3ee28e53]
mongod(+0x140732B) [0x7ffc3ee2932b]
mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x14C) [0x7ffc3f0185bc]
mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xBC) [0x7ffc3f01906c]
mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x96) [0x7ffc3f019a56]
mongod(+0x23A5A00) [0x7ffc3fdc7a00]
libpthread.so.0(+0x7851) [0x7ffc3c0b7851]
libc.so.6(clone+0x6D) [0x7ffc3be0511d]
----- END BACKTRACE -----
Part4: Problem analysis
The error log contains the following line:
Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620
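When scanning a large mongod log for this kind of line, a small parser helps pull out the failed expression and the source location. The regular expression below is our own guess at the layout, based only on the single line shown above; other MongoDB versions may format the message differently.

```python
import re

# Assumed layout: "Invariant failure <expression> <source file> <line number>"
INVARIANT_RE = re.compile(
    r"Invariant failure (?P<expr>.+?) (?P<file>\S+\.(?:cpp|h)) (?P<line>\d+)\s*$"
)


def parse_invariant(log_line):
    """Return (expression, source file, line number) or None if no match."""
    match = INVARIANT_RE.search(log_line)
    if match is None:
        return None
    return match["expr"], match["file"], int(match["line"])


line = (
    "2018-04-17T14:54:23.818+0800 I - [replication-25230] Invariant failure "
    "i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620"
)
parsed = parse_invariant(line)
print(parsed)
```

The file and line number are what lead to the relevant function in the MongoDB source tree.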
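Tracing through the source code, starting from the invariant keyword and working upward, we located the relevant logic. After reading it, we believe the error was caused by the batched rs.remove() calls. The code contains the following check: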
const MemberConfig& ReplSetConfig::getMemberAt(size_t i) const {
    invariant(i < _members.size());
    return _members[i];
}
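For example, in a 5-node replica set the member indexes (taken here to match the _id values) are 0, 1, 2, 3, 4, and _members.size() is 5. The check 4 < 5 holds.
But when we use rs.remove() to delete more than one member, for example the members with _id=3 and _id=4 at the same time, the replica set is left with only 3 members and _members.size() becomes 3.
Yet for some reason i stays at 4. Since 4 < 3 is false, the invariant fails and the removed node aborts and crashes. We have no conclusive evidence for this hypothesis, however.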
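The hypothesis can be illustrated with a toy model. This is not MongoDB code: `ReplSetConfigModel` and `InvariantFailure` are hypothetical names, the exception merely stands in for the abort, and the "cached index" is our unproven assumption about where the stale value of i comes from.

```python
class InvariantFailure(Exception):
    """Stands in for mongod aborting on a failed invariant()."""


class ReplSetConfigModel:
    def __init__(self, hosts):
        self._members = list(hosts)

    def get_member_at(self, i):
        # Mirrors invariant(i < _members.size()) in getMemberAt().
        if not i < len(self._members):
            raise InvariantFailure(
                f"i < _members.size() failed: {i} < {len(self._members)}"
            )
        return self._members[i]


# Five members: valid indexes are 0..4.
old_config = ReplSetConfigModel([f"node{i}:27017" for i in range(5)])
cached_index = 4  # hypothetical: an index remembered from the old config
last_member = old_config.get_member_at(cached_index)

# Two members removed: only indexes 0..2 remain valid.
new_config = ReplSetConfigModel([f"node{i}:27017" for i in range(3)])
try:
    new_config.get_member_at(cached_index)  # stale index from the old config
    crashed = False
except InvariantFailure as exc:
    crashed = True
    print(exc)
```

In mongod the failed check is not an exception that can be caught: the process aborts, which matches the "Got signal: 6 (Aborted)" entry in the log.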
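Part5: Response from MongoDB staff
We also found a related MongoDB JIRA ticket and posted our question there. The ticket describes the same problem as encountered by other users. Their case is more clear-cut than ours: it was caused by running rs.remove on a non-voting node.
Our case did involve non-voting nodes as well. According to the official response, a fix is planned for version 4.1.
Official JIRA link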
https://jira.mongodb.org/browse/SERVER-28079
Part6: Workarounds
1. Promote the non-voting nodes to voting nodes before removing them;
2. Take backups in advance, just in case.
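The first workaround amounts to a reconfig that sets votes back to 1 before any rs.remove(). The sketch below prepares such a config document without touching a server. The host names are placeholders, the function name is our own, and the limit of 7 voting members per replica set is checked because promoting nodes in a set that already has new members added could exceed it. We did not verify that this avoids the crash.

```python
import copy

MAX_VOTING_MEMBERS = 7  # MongoDB limit on voting members per replica set


def promote_non_voting(config):
    """Return a new config in which every votes: 0 member gets votes: 1."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member.get("votes", 1) == 0:
            member["votes"] = 1
    voting = sum(1 for m in new_config["members"] if m.get("votes", 1) > 0)
    if voting > MAX_VOTING_MEMBERS:
        raise ValueError(f"{voting} voting members exceeds {MAX_VOTING_MEMBERS}")
    new_config["version"] += 1
    return new_config


# Hypothetical config shaped like the one in this article: 7 members, 2 of
# them hidden and non-voting. Host names are placeholders.
config = {
    "_id": "rs0",
    "version": 20,
    "members": [{"_id": i, "host": f"node{i}:27017"} for i in range(5)]
    + [
        {"_id": 5, "host": "node5:27017", "hidden": True, "priority": 0, "votes": 0},
        {"_id": 6, "host": "node6:27017", "hidden": True, "priority": 0, "votes": 0},
    ],
}
promoted = promote_non_voting(config)
print([m.get("votes", 1) for m in promoted["members"]])
```

The hidden and priority settings are left untouched, so the promoted nodes still cannot become primary or serve client reads.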
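Summary
This article described a scenario in which rs.remove() can potentially crash MongoDB. Because time was limited, and the machines that crashed were the very nodes we planned to rs.remove() anyway, we decided to stop following up on this issue for now after failing to reproduce it. If you have run into a similar problem and found the root cause and a way to reproduce it, please leave a comment under the post; the author would be very grateful. The author's knowledge is limited and this article was written in a hurry, so it may contain errors or inaccuracies. Corrections from readers are welcome.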