This article explains how the Clock Sweep buffer-replacement algorithm works in PostgreSQL. It first looks at the data structures involved (BufferDesc and BufferTag) and then walks through the source code that implements the sweep.
BufferDesc
Shared descriptor/state data for a single shared buffer.
/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)    /* buffer header is locked */
#define BM_DIRTY                (1U << 23)    /* data needs writing */
#define BM_VALID                (1U << 24)    /* data is valid */
#define BM_TAG_VALID            (1U << 25)    /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)    /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)    /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)    /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)    /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)    /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)    /* permanent buffer (not unlogged,
                                               * or init fork) */

/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
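Since the flag bits above start at bit 22, the single 32-bit state word presumably packs the refcount in the low 18 bits, the usage_count in bits 18-21, and the flag bits in bits 22-31. The standalone sketch below (not PostgreSQL source; the DEMO_* names are made up, and the exact masks are an assumption based on that layout) shows how macros such as BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT, which appear in the code later, can extract the packed fields.

/* A standalone sketch (not PostgreSQL source) of how flags, usage_count and
 * refcount are assumed to be packed into one 32-bit state word:
 * bits 0-17 refcount, bits 18-21 usage_count, bits 22-31 flag bits. */
#include <stdio.h>
#include <stdint.h>

#define DEMO_REFCOUNT_MASK      ((1U << 18) - 1)
#define DEMO_USAGECOUNT_SHIFT   18
#define DEMO_USAGECOUNT_MASK    (0xFU << DEMO_USAGECOUNT_SHIFT)
#define DEMO_USAGECOUNT_ONE     (1U << DEMO_USAGECOUNT_SHIFT)
#define DEMO_BM_DIRTY           (1U << 23)    /* same bit as BM_DIRTY above */

#define DEMO_GET_REFCOUNT(s)    ((s) & DEMO_REFCOUNT_MASK)
#define DEMO_GET_USAGECOUNT(s)  (((s) & DEMO_USAGECOUNT_MASK) >> DEMO_USAGECOUNT_SHIFT)

int main(void)
{
    uint32_t state = 0;

    state += 2;                        /* two pins: refcount = 2 */
    state += 3 * DEMO_USAGECOUNT_ONE;  /* usage_count = 3 */
    state |= DEMO_BM_DIRTY;            /* mark the buffer dirty */

    printf("refcount=%u usage_count=%u dirty=%d\n",
           DEMO_GET_REFCOUNT(state),
           DEMO_GET_USAGECOUNT(state),
           (state & DEMO_BM_DIRTY) != 0);
    /* prints: refcount=2 usage_count=3 dirty=1 */
    return 0;
}

Packing all three fields into one atomic word is what allows, for example, the refcount to be bumped with a single atomic add instead of taking the buffer header spinlock.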
BufferTag
The buffer tag identifies which disk block a buffer contains.
/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
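To make the role of the tag concrete, here is a standalone sketch (not PostgreSQL source; the DemoBufferTag struct and the example OIDs are hypothetical) of how the tag fields map to a file segment and byte offset, assuming the default 8 kB block size and 1 GB (131072-block) segment files:

#include <stdio.h>
#include <stdint.h>

#define DEMO_BLCKSZ      8192      /* default PostgreSQL block size */
#define DEMO_RELSEG_SIZE 131072    /* blocks per 1 GB segment file */

typedef struct DemoBufferTag
{
    uint32_t spcNode;    /* tablespace OID           */
    uint32_t dbNode;     /* database OID             */
    uint32_t relNode;    /* relfilenode              */
    int      forkNum;    /* 0 = main fork            */
    uint32_t blockNum;   /* block number within fork */
} DemoBufferTag;

int main(void)
{
    /* hypothetical tag: block 200000 of some relation in the default tablespace */
    DemoBufferTag tag = {1663, 16384, 16385, 0, 200000};

    uint32_t segno  = tag.blockNum / DEMO_RELSEG_SIZE;
    uint64_t offset = (uint64_t) (tag.blockNum % DEMO_RELSEG_SIZE) * DEMO_BLCKSZ;

    if (segno == 0)
        printf("base/%u/%u, offset %llu\n",
               tag.dbNode, tag.relNode, (unsigned long long) offset);
    else
        printf("base/%u/%u.%u, offset %llu\n",
               tag.dbNode, tag.relNode, segno, (unsigned long long) offset);
    /* prints: base/16384/16385.1, offset 564658176 */
    return 0;
}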
The idea of the algorithm has already been covered in detail in the Buffer Manager material; the pseudocode is excerpted below:
WHILE true
    (1) Obtain the candidate buffer descriptor pointed to by nextVictimBuffer
    (2) IF the candidate descriptor is unpinned THEN
    (3)     IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP  /* the corresponding slot of this descriptor is the victim slot */
            ELSE
                Decrease the candidate descriptor's usage_count by 1
            END IF
        END IF
    (4) Advance nextVictimBuffer to the next one
END WHILE
(5) RETURN buffer_id of the victim
The idea is plain and simple, much simpler than the fancier LRU algorithm.
nextVictimBuffer can be viewed as the hand of a clock, and the buffer pool as a ring that the hand sweeps around over and over. If the buffer under the hand satisfies unpinned (refcount == 0) && usage_count == 0, it is chosen as the victim; otherwise its usage_count is decremented by 1 and the sweep continues until a qualifying buffer is found. The chosen buffer is therefore one of the less recently used buffers.
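As a minimal illustration, the standalone sketch below (not the PostgreSQL implementation; the DemoBuffer struct, the buffer count and the helper names are made up) implements exactly this sweep over a small ring:

#include <stdio.h>

#define NBUF 8

typedef struct DemoBuffer
{
    int refcount;      /* pin count: > 0 means the buffer may not be evicted */
    int usage_count;   /* bumped on access, decayed by the clock hand        */
} DemoBuffer;

static DemoBuffer buffers[NBUF];
static int nextVictimBuffer = 0;   /* the "clock hand" */

/* Sweep the ring until an unpinned buffer with usage_count == 0 is found. */
static int clock_sweep(void)
{
    for (;;)
    {
        int id = nextVictimBuffer;
        nextVictimBuffer = (nextVictimBuffer + 1) % NBUF;   /* advance the hand */

        DemoBuffer *buf = &buffers[id];
        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return id;          /* victim found */
            buf->usage_count--;     /* give it one more chance */
        }
        /* pinned buffers are skipped without touching usage_count */
    }
}

int main(void)
{
    buffers[0].usage_count = 2;
    buffers[1].refcount = 1;        /* pinned */
    buffers[2].usage_count = 1;
    /* buffers[3..7] are unused: refcount = 0, usage_count = 0 */

    printf("victim = %d\n", clock_sweep());   /* prints: victim = 3 */
    return 0;
}

The real StrategyGetBuffer() below adds a trycounter safeguard on top of this: if NBuffers buffers in a row turn out to be pinned, it raises an error rather than spinning forever.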
A code fragment from StrategyGetBuffer():
    ...
    /* Nothing on the freelist, so run the "clock sweep" algorithm */
    trycounter = NBuffers;      /* number of attempts before giving up */
    for (;;)
    {
        /* advance the clock hand and fetch the descriptor under it */
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                /* usage_count - 1 and keep scanning */
                local_buf_state -= BUF_USAGECOUNT_ONE;

                /* reset the try counter */
                trycounter = NBuffers;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    /* add it to the strategy's ring buffer */
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        /* unlock the buffer header and try the next buffer */
        UnlockBufHdr(buf, local_buf_state);
    }
ClockSweepTick() is a helper routine for StrategyGetBuffer(): it moves the clock hand one buffer ahead of its current position and returns the id of the buffer now under the hand.
Its implementation logic is as follows:
/*
 * ClockSweepTick - Helper routine for StrategyGetBuffer()
 *
 * Move the clock hand one buffer ahead of its current position and return the
 * id of the buffer now under the hand.
 */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;

    /*
     * Atomically move hand ahead one buffer - if there's several processes
     * doing this, this can lead to buffers being returned slightly out of
     * apparent order.
     */
    victim =
        pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    if (victim >= NBuffers)
    {
        /* remember the pre-wrap value */
        uint32      originalVictim = victim;

        /* always wrap what we look up in BufferDescriptors */
        victim = victim % NBuffers;

        /*
         * If we're the one that just caused a wraparound, force
         * completePasses to be incremented while holding the spinlock. We
         * need the spinlock so StrategySyncStart() can return a consistent
         * value consisting of nextVictimBuffer and completePasses.
         */
        if (victim == 0)
        {
            uint32      expected;   /* expected value (ignoring wraparound) */
            uint32      wrapped;    /* value after wrapping */
            bool        success = false;

            expected = originalVictim + 1;

            while (!success)
            {
                /*
                 * Acquire the spinlock while increasing completePasses. That
                 * allows other readers to read nextVictimBuffer and
                 * completePasses in a consistent manner which is required for
                 * StrategySyncStart().  In theory delaying the increment
                 * could lead to an overflow of nextVictimBuffers, but that's
                 * highly unlikely and wouldn't be particularly harmful.
                 */
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

                wrapped = expected % NBuffers;

                /* atomic compare-and-swap of the shared counter */
                success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                                         &expected, wrapped);
                if (success)
                    StrategyControl->completePasses++;
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
        }
    }
    return victim;
}

/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return true if values have been exchanged, false otherwise.
 *
 * Full barrier semantics.
 */
static inline bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
                               uint32 *expected, uint32 newval)
{
    AssertPointerAlignment(ptr, 4);
    AssertPointerAlignment(expected, 4);

    return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}

bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                    uint32 *expected, uint32 newval)
{
    bool        ret;

    /*
     * Do atomic op under a spinlock. It might look like we could just skip
     * the cmpxchg if the lock isn't available, but that'd just emulate a
     * 'weak' compare and swap. I.e. one that allows spurious failures. Since
     * several algorithms rely on a strong variant and that is efficiently
     * implementable on most major architectures let's emulate it here as
     * well.
     */
    SpinLockAcquire((slock_t *) &ptr->sema);

    /* perform compare/exchange logic */
    ret = ptr->value == *expected;  /* do ptr and *expected match? */
    *expected = ptr->value;         /* *expected always gets the current value */
    if (ret)
        ptr->value = newval;        /* values matched, so install the new value */

    /* and release lock */
    SpinLockRelease((slock_t *) &ptr->sema);

    return ret;
}
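The hand advance and wraparound handling can be emulated with standard C11 atomics. The sketch below is illustrative only: it keeps the fetch-add plus strong-CAS structure of ClockSweepTick(), but drops the buffer_strategy_lock, so completePasses here is not read consistently with the counter, which is exactly what the spinlock provides in PostgreSQL:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

#define NBUFFERS 1000

static _Atomic uint32_t nextVictimBuffer = 0;
static uint32_t completePasses = 0;   /* protected by a spinlock in PostgreSQL */

/* Illustrative re-implementation of the hand-advance plus wraparound logic. */
static uint32_t demo_clock_sweep_tick(void)
{
    /* atomically move the hand ahead one buffer; the old value is our candidate */
    uint32_t victim = atomic_fetch_add(&nextVictimBuffer, 1);

    if (victim >= NBUFFERS)
    {
        uint32_t originalVictim = victim;

        /* always wrap what we look up in the descriptor array */
        victim = victim % NBUFFERS;

        /* only the caller that caused the wraparound folds the shared counter back */
        if (victim == 0)
        {
            uint32_t expected = originalVictim + 1;
            bool     success = false;

            while (!success)
            {
                uint32_t wrapped = expected % NBUFFERS;

                /* strong CAS: on failure, expected is reloaded with the current value */
                success = atomic_compare_exchange_strong(&nextVictimBuffer,
                                                         &expected, wrapped);
                if (success)
                    completePasses++;
            }
        }
    }
    return victim;
}

int main(void)
{
    atomic_store(&nextVictimBuffer, NBUFFERS - 1);      /* hand just before the end of the ring */
    printf("tick = %u\n", demo_clock_sweep_tick());     /* prints: tick = 999 (last slot) */
    printf("tick = %u\n", demo_clock_sweep_tick());     /* prints: tick = 0 (wrapped around) */
    printf("completePasses = %u\n", completePasses);    /* prints: completePasses = 1 */
    return 0;
}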
That is all on how to understand the Clock Sweep algorithm in PostgreSQL. Thanks for reading.