This section takes a brief look at RelationGetBufferForTuple, the buffer-related function PostgreSQL uses while executing an INSERT. It returns a page of the given relation whose free space is >= the requested length, with the corresponding buffer pinned and exclusive-locked.
BufferDesc
Shared descriptor (state) data for a single shared buffer.
/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */

/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either.  To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
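Because flags, refcount and usagecount are packed into the single atomic state word, a backend can take a lock-free snapshot of them with pg_atomic_read_u32(). The sketch below illustrates this, assuming the BUF_STATE_GET_REFCOUNT/BUF_STATE_GET_USAGECOUNT macros from buf_internals.h; the helper name is ours, and the value read can of course be stale the instant after it is read.

#include "postgres.h"
#include "storage/buf_internals.h"

/* Hypothetical helper: log a lock-free snapshot of a buffer's state word.
 * No buffer header lock is taken, so the values may already be out of date
 * by the time they are printed (see the BufferDesc comments above). */
static void
snapshot_buffer_state(BufferDesc *buf)
{
    uint32      state = pg_atomic_read_u32(&buf->state);

    elog(DEBUG1, "buf %d: refcount=%u usagecount=%u dirty=%d valid=%d",
         buf->buf_id,
         BUF_STATE_GET_REFCOUNT(state),
         BUF_STATE_GET_USAGECOUNT(state),
         (state & BM_DIRTY) != 0,
         (state & BM_VALID) != 0);
}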
BufferTag
The buffer tag identifies which disk block a buffer contains.
/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
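Since the tag serves as the hash key of the shared buffer mapping table, it is normally filled in with INIT_BUFFERTAG (which keeps padding consistent) and compared with BUFFERTAGS_EQUAL. A minimal sketch, assuming those macros from buf_internals.h; the helper itself is hypothetical, not part of PostgreSQL:

#include "postgres.h"
#include "storage/buf_internals.h"
#include "utils/rel.h"

/* Hypothetical illustration: build the tag for block 0 of a relation's main
 * fork and test whether a given buffer header currently holds that block.
 * The comparison is only meaningful while the buffer's tag is assigned
 * (BM_TAG_VALID) and the buffer is pinned, so the tag cannot change. */
static bool
buffer_holds_block_zero(Relation relation, BufferDesc *buf)
{
    BufferTag   tag;

    INIT_BUFFERTAG(tag, relation->rd_node, MAIN_FORKNUM, 0);

    return BUFFERTAGS_EQUAL(tag, buf->tag);
}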
RelationGetBufferForTuple returns a page of the given relation with free space >= the given length; the corresponding buffer is returned pinned and exclusive-locked.
Inputs:
relation - the target table
len - the amount of space needed
otherBuffer - UPDATE only: the previously pinned buffer
options - processing options
bistate - BulkInsertState for bulk inserts
vmbuffer - the first visibility map (vm) buffer
vmbuffer_other - UPDATE only: the vm buffer corresponding to the previously pinned buffer
Notes:
The otherBuffer parameter looks puzzling at first; it exists because of how PostgreSQL handles updates.
An UPDATE is not performed in place: the old tuple is kept (its xmax is set) and the new version is inserted as a new tuple.
If the old tuple and the new tuple end up in different blocks, locking those blocks can deadlock.
For example, Session A updates row 1 of table T; row 1 lives in block 0 and the new version goes into block 2.
Session B updates row 2 of table T; row 2 also lives in block 0 and its new version also goes into block 2.
Both block 0 and block 2 must be locked to complete each UPDATE:
if Session A locks block 2 first while Session B locks block 0 first,
and then A tries to lock block 0 while B tries to lock block 2, the two sessions deadlock.
To avoid this, PostgreSQL requires that blocks of the same relation be locked in increasing block-number order;
to lock blocks 0 and 2, block 0 must be locked before block 2.
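The ordering rule itself is just a comparison of block numbers before locking. The helper below is not part of PostgreSQL, only a sketch of the rule; inside RelationGetBufferForTuple the same ordering is done inline, interleaved with the visibility-map pinning shown in the source further down.

#include "postgres.h"
#include "storage/bufmgr.h"

/* Hypothetical helper: exclusive-lock two pinned buffers of the same
 * relation in increasing block-number order to avoid deadlock. */
static void
lock_two_buffers_in_block_order(Buffer a, Buffer b)
{
    if (BufferGetBlockNumber(a) < BufferGetBlockNumber(b))
    {
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
    }
    else
    {
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
    }
}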
Output:
the buffer allocated for the tuple
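For orientation, the call stack at the end of this article shows the function being reached from heap_insert. In that caller (heapam.c) the call for a plain INSERT is essentially the fragment below, with otherBuffer set to InvalidBuffer and vmbuffer_other to NULL, matching the argument values visible in the backtrace:

/* excerpt-style sketch of the heap_insert() call site; heaptup is the tuple
 * being inserted and vmbuffer is a local Buffer variable of heap_insert */
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                   InvalidBuffer, options, bistate,
                                   &vmbuffer, NULL);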
The main logic is as follows (steps 2 and 6.3 are illustrated with a small example after this list):
1. Initialize local variables.
2. Compute the free space to reserve for the fillfactor.
3. For an UPDATE, get the block of the previously pinned buffer.
4. Pick the target page, targetBlock.
5. If targetBlock is invalid and the FSM may be used, ask the FSM for a candidate.
6. While targetBlock is valid, loop over candidate blocks:
6.1. Read and exclusive-lock the target block, plus otherBuffer if one was given.
6.2. Pin the visibility map page(s).
6.3. Check whether the page has enough free space; if it does, return its buffer.
6.4. Otherwise call RecordAndGetPageWithFreeSpace to get the next targetBlock and loop again.
7. If the loop finds no suitable block, extend the relation.
8. After extending, read the new buffer in P_NEW mode and lock it.
9. Get the page of that buffer and run sanity checks.
10. Report an error if the checks fail; otherwise return the buffer.
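Steps 2 and 6.3 boil down to simple arithmetic: reserve BLCKSZ * (100 - fillfactor) / 100 bytes and require the page to keep at least that much plus the (aligned) tuple length free. A small sketch, assuming PostgreSQL 11 headers and a hypothetical helper name; for a table created WITH (fillfactor = 70) and 8 kB blocks the reservation is 8192 * 30 / 100 = 2457 bytes:

#include "postgres.h"
#include "access/heapam.h"      /* HEAP_DEFAULT_FILLFACTOR (assumed location) */
#include "utils/rel.h"          /* RelationGetTargetPageFreeSpace */

/* Hypothetical helper mirroring the check in steps 2 and 6.3: does a page
 * with pageFreeSpace bytes free accept a tuple of length len while still
 * honouring the relation's fillfactor reservation? */
static bool
tuple_fits_on_page(Relation relation, Size len, Size pageFreeSpace)
{
    Size        saveFreeSpace;

    len = MAXALIGN(len);        /* be conservative, as the real code is */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);

    return len + saveFreeSpace <= pageFreeSpace;
}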
/*
 * RelationGetBufferForTuple
 *
 *	Returns pinned and exclusive-locked buffer of a page in given relation
 *	with free space >= given len.
 *
 *	If otherBuffer is not InvalidBuffer, then it references a previously
 *	pinned buffer of another page in the same relation; on return, this
 *	buffer will also be exclusive-locked.  (This case is used by heap_update;
 *	the otherBuffer contains the tuple being updated.)
 *
 *	The reason for passing otherBuffer is that if two backends are doing
 *	concurrent heap_update operations, a deadlock could occur if they try
 *	to lock the same two buffers in opposite orders.  To ensure that this
 *	can't happen, we impose the rule that buffers of a relation must be
 *	locked in increasing page number order.  This is most conveniently done
 *	by having RelationGetBufferForTuple lock them both, with suitable care
 *	for ordering.
 *
 *	NOTE: it is unlikely, but not quite impossible, for otherBuffer to be the
 *	same buffer we select for insertion of the new tuple (this could only
 *	happen if space is freed in that page after heap_update finds there's not
 *	enough there).  In that case, the page will be pinned and locked only once.
 *
 *	For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
 *	locking them only after locking the corresponding heap page, and taking
 *	no further lwlocks while they are locked.
 *
 *	We normally use FSM to help us find free space.  However,
 *	if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
 *	the end of the relation if the tuple won't fit on the current target page.
 *	This can save some cycles when we know the relation is new and doesn't
 *	contain useful amounts of free space.
 *
 *	HEAP_INSERT_SKIP_FSM is also useful for non-WAL-logged additions to a
 *	relation, if the caller holds exclusive lock and is careful to invalidate
 *	relation's smgr_targblock before the first insertion --- that ensures that
 *	all insertions will occur into newly added pages and not be intermixed
 *	with tuples from other transactions.  That way, a crash can't risk losing
 *	any committed data of other transactions.  (See heap_insert's comments
 *	for additional constraints needed for safe usage of this behavior.)
 *
 *	The caller can also provide a BulkInsertState object to optimize many
 *	insertions into the same relation.  This keeps a pin on the current
 *	insertion target page (to save pin/unpin cycles) and also passes a
 *	BULKWRITE buffer selection strategy object to the buffer manager.
 *	Passing NULL for bistate selects the default behavior.
 *
 *	We always try to avoid filling existing pages further than the fillfactor.
 *	This is OK since this routine is not consulted when updating a tuple and
 *	keeping it on the same page, which is the scenario fillfactor is meant
 *	to reserve space for.
 *
 *	ereport(ERROR) is allowed here, so this routine *must* be called
 *	before any (unlogged) changes are made in buffer pool.
 */
/*
 * Note (from the original walkthrough): a "pinned" buffer is one that is
 * currently in use and therefore must not be evicted from shared buffers.
 */
Buffer
RelationGetBufferForTuple(Relation relation, Size len,
						  Buffer otherBuffer, int options,
						  BulkInsertState bistate,
						  Buffer *vmbuffer, Buffer *vmbuffer_other)
{
	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);	/* may we use the FSM? */
	Buffer		buffer = InvalidBuffer;
	Page		page;
	Size		pageFreeSpace = 0,	/* free space on the candidate page */
				saveFreeSpace = 0;	/* space to keep free (fillfactor) */
	BlockNumber targetBlock,		/* candidate block */
				otherBlock;			/* block of the previously pinned buffer */
	bool		needLock;			/* need the relation-extension lock? */

	len = MAXALIGN(len);		/* be conservative */

	/* Bulk insert is not supported for updates, only inserts. */
	Assert(otherBuffer == InvalidBuffer || !bistate);

	/*
	 * If we're gonna fail for oversize tuple, do it right away
	 *
	 * Reference:
	 * #define MaxHeapTupleSize  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)))
	 * #define MinHeapTupleSize  MAXALIGN(SizeofHeapTupleHeader)
	 */
	if (len > MaxHeapTupleSize)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("row is too big: size %zu, maximum size %zu",
						len, MaxHeapTupleSize)));

	/*
	 * Compute desired extra freespace due to fillfactor option
	 *
	 * Reference:
	 * #define RelationGetTargetPageFreeSpace(relation, defaultff) \
	 *     (BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
	 */
	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
												   HEAP_DEFAULT_FILLFACTOR);

	/* For an UPDATE, get the block number of the previously pinned buffer */
	if (otherBuffer != InvalidBuffer)
		otherBlock = BufferGetBlockNumber(otherBuffer);
	else
		otherBlock = InvalidBlockNumber;	/* just to keep compiler quiet */

	/*
	 * We first try to put the tuple on the same page we last inserted a tuple
	 * on, as cached in the BulkInsertState or relcache entry.  If that
	 * doesn't work, we ask the Free Space Map to locate a suitable page.
	 * Since the FSM's info might be out of date, we have to be prepared to
	 * loop around and retry multiple times. (To insure this isn't an infinite
	 * loop, we must update the FSM with the correct amount of free space on
	 * each page that proves not to be suitable.)  If the FSM has no record of
	 * a page with enough free space, we give up and extend the relation.
	 *
	 * When use_fsm is false, we either put the tuple onto the existing target
	 * page or extend the relation.
	 */
	if (len + saveFreeSpace > MaxHeapTupleSize)
	{
		/* can't fit, don't bother asking FSM */
		targetBlock = InvalidBlockNumber;
		use_fsm = false;
	}
	else if (bistate && bistate->current_buf != InvalidBuffer)
		targetBlock = BufferGetBlockNumber(bistate->current_buf);	/* bulk insert */
	else
		targetBlock = RelationGetTargetBlock(relation);	/* normal insert */

	if (targetBlock == InvalidBlockNumber && use_fsm)
	{
		/*
		 * We have no cached target page, so ask the FSM for an initial
		 * target.
		 */
		targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);

		/*
		 * If the FSM knows nothing of the rel, try the last page before we
		 * give up and extend.  This avoids one-tuple-per-page syndrome during
		 * bootstrapping or in a recently-started system.
		 */
		if (targetBlock == InvalidBlockNumber)
		{
			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

			if (nblocks > 0)
				targetBlock = nblocks - 1;
		}
	}

loop:
	/* ---------- loop until a block that can hold the tuple is found ---------- */
	while (targetBlock != InvalidBlockNumber)
	{
		/*
		 * Read and exclusive-lock the target block, as well as the other
		 * block if one was given, taking suitable care with lock ordering and
		 * the possibility they are the same block.
		 *
		 * If the page-level all-visible flag is set, caller will need to
		 * clear both that and the corresponding visibility map bit.  However,
		 * by the time we return, we'll have x-locked the buffer, and we don't
		 * want to do any I/O while in that state.  So we check the bit here
		 * before taking the lock, and pin the page if it appears necessary.
		 * Checking without the lock creates a risk of getting the wrong
		 * answer, so we'll have to recheck after acquiring the lock.
		 */
		if (otherBuffer == InvalidBuffer)
		{
			/* easy case: plain INSERT, no other buffer involved */
			buffer = ReadBufferBI(relation, targetBlock, bistate);
			if (PageIsAllVisible(BufferGetPage(buffer)))
				visibilitymap_pin(relation, targetBlock, vmbuffer);
			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
		}
		else if (otherBlock == targetBlock)
		{
			/* also easy case: UPDATE where old and new tuple share a block */
			buffer = otherBuffer;
			if (PageIsAllVisible(BufferGetPage(buffer)))
				visibilitymap_pin(relation, targetBlock, vmbuffer);
			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
		}
		else if (otherBlock < targetBlock)
		{
			/* UPDATE, old block < new block: lock other buffer first */
			buffer = ReadBuffer(relation, targetBlock);
			if (PageIsAllVisible(BufferGetPage(buffer)))
				visibilitymap_pin(relation, targetBlock, vmbuffer);
			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
		}
		else
		{
			/* UPDATE, old block > new block: lock target buffer first */
			buffer = ReadBuffer(relation, targetBlock);
			if (PageIsAllVisible(BufferGetPage(buffer)))
				visibilitymap_pin(relation, targetBlock, vmbuffer);
			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
		}

		/*
		 * We now have the target page (and the other buffer, if any) pinned
		 * and locked.  However, since our initial PageIsAllVisible checks
		 * were performed before acquiring the lock, the results might now be
		 * out of date, either for the selected victim buffer, or for the
		 * other buffer passed by the caller.  In that case, we'll need to
		 * give up our locks, go get the pin(s) we failed to get earlier, and
		 * re-lock.  That's pretty painful, but hopefully shouldn't happen
		 * often.
		 *
		 * Note that there's a small possibility that we didn't pin the page
		 * above but still have the correct page pinned anyway, either because
		 * we've already made a previous pass through this loop, or because
		 * caller passed us the right page anyway.
		 *
		 * Note also that it's possible that by the time we get the pin and
		 * retake the buffer locks, the visibility map bit will have been
		 * cleared by some other backend anyway.  In that case, we'll have
		 * done a bit of extra work for no gain, but there's no real harm
		 * done.
		 */
		if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
			GetVisibilityMapPins(relation, buffer, otherBuffer,
								 targetBlock, otherBlock, vmbuffer,
								 vmbuffer_other);
		else
			GetVisibilityMapPins(relation, otherBuffer, buffer,
								 otherBlock, targetBlock, vmbuffer_other,
								 vmbuffer);

		/*
		 * Now we can check to see if there's enough free space here. If so,
		 * we're done.
		 */
		page = BufferGetPage(buffer);
		pageFreeSpace = PageGetHeapFreeSpace(page);
		if (len + saveFreeSpace <= pageFreeSpace)
		{
			/*
			 * use this page as future insert target, too
			 *
			 * Reference:
			 * #define RelationSetTargetBlock(relation, targblock) \
			 *     do { \
			 *         RelationOpenSmgr(relation); \
			 *         (relation)->rd_smgr->smgr_targblock = (targblock); \
			 *     } while (0)
			 */
			RelationSetTargetBlock(relation, targetBlock);
			return buffer;
		}

		/*
		 * Not enough space, so we must give up our page locks and pin (if
		 * any) and prepare to look elsewhere.  We don't care which order we
		 * unlock the two buffers in, so this can be slightly simpler than the
		 * code above.
		 */
		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
		if (otherBuffer == InvalidBuffer)
			ReleaseBuffer(buffer);
		else if (otherBlock != targetBlock)
		{
			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
			ReleaseBuffer(buffer);
		}

		/* Without FSM, always fall out of the loop and extend */
		if (!use_fsm)
			break;

		/*
		 * Update FSM as to condition of this page, and ask for another page
		 * to try.
		 *
		 * Note: if the FSM runs out of candidates it returns
		 * InvalidBlockNumber, which ends the loop.
		 */
		targetBlock = RecordAndGetPageWithFreeSpace(relation,
													targetBlock,
													pageFreeSpace,
													len + saveFreeSpace);
	}

	/* ---------- no suitable block found: extend the relation ---------- */

	/*
	 * Have to extend the relation.
	 *
	 * We have to use a lock to ensure no one else is extending the rel at the
	 * same time, else we will both try to initialize the same new page.  We
	 * can skip locking for new or temp relations, however, since no one else
	 * could be accessing them.
	 */
	needLock = !RELATION_IS_LOCAL(relation);

	/*
	 * If we need the lock but are not able to acquire it immediately, we'll
	 * consider extending the relation by multiple blocks at a time to manage
	 * contention on the relation extension lock.  However, this only makes
	 * sense if we're using the FSM; otherwise, there's no point.
	 */
	if (needLock)
	{
		if (!use_fsm)
			LockRelationForExtension(relation, ExclusiveLock);
		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
		{
			/* Couldn't get the lock immediately; wait for it. */
			LockRelationForExtension(relation, ExclusiveLock);

			/*
			 * Check if some other backend has extended a block for us while
			 * we were waiting on the lock.
			 */
			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);

			/*
			 * If some other waiter has already extended the relation, we
			 * don't need to do so; just use the existing freespace.
			 */
			if (targetBlock != InvalidBlockNumber)
			{
				UnlockRelationForExtension(relation, ExclusiveLock);
				goto loop;
			}

			/* Time to bulk-extend. */
			RelationAddExtraBlocks(relation, bistate);
		}
	}

	/*
	 * In addition to whatever extension we performed above, we always add at
	 * least one block to satisfy our own request.
	 *
	 * XXX This does an lseek - rather expensive - but at the moment it is the
	 * only way to accurately determine how many blocks are in a relation.  Is
	 * it worth keeping an accurate file length in shared memory someplace,
	 * rather than relying on the kernel to do it for us?
	 */
	buffer = ReadBufferBI(relation, P_NEW, bistate);

	/*
	 * We can be certain that locking the otherBuffer first is OK, since it
	 * must have a lower page number.
	 */
	if (otherBuffer != InvalidBuffer)
		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);

	/*
	 * Now acquire lock on the new page.
	 */
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

	/*
	 * Release the file-extension lock; it's now OK for someone else to extend
	 * the relation some more.  Note that we cannot release this lock before
	 * we have buffer lock on the new page, or we risk a race condition
	 * against vacuumlazy.c --- see comments therein.
	 */
	if (needLock)
		UnlockRelationForExtension(relation, ExclusiveLock);

	/*
	 * We need to initialize the empty new page.  Double-check that it really
	 * is empty (this should never happen, but if it does we don't want to
	 * risk wiping out valid data).
	 */
	page = BufferGetPage(buffer);

	if (!PageIsNew(page))
		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
			 BufferGetBlockNumber(buffer),
			 RelationGetRelationName(relation));

	PageInit(page, BufferGetPageSize(buffer), 0);

	if (len > PageGetHeapFreeSpace(page))
	{
		/* We should not get here given the test at the top */
		elog(PANIC, "tuple is too big: size %zu", len);
	}

	/*
	 * Remember the new page as our target for future insertions.
	 *
	 * XXX should we enter the new page into the free space map immediately,
	 * or just keep it for this backend's exclusive use in the short run
	 * (until VACUUM sees it)?  Seems to depend on whether you expect the
	 * current backend to make more insertions or not, which is probably a
	 * good bet most of the time.  So for now, don't add it to FSM yet.
	 */
	RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));

	return buffer;
}
Test script
15:54:13 (xdb@[local]:5432)testdb=# insert into t1 values (1,'1','1');
Call stack
(gdb) b RelationGetBufferForTuple
Breakpoint 1 at 0x4ef179: file hio.c, line 318.
(gdb) c
Continuing.

Breakpoint 1, RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0,
    vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
318         bool        use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
(gdb) bt
#0  RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0, vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
#1  0x00000000004df1f8 in heap_insert (relation=0x7f4f51fe39b8, tup=0x178a478, cid=0, options=0, bistate=0x0) at heapam.c:2468
#2  0x0000000000709dda in ExecInsert (mtstate=0x178a220, slot=0x178a680, planSlot=0x178a680, estate=0x1789eb8, canSetTag=true) at nodeModifyTable.c:529
#3  0x000000000070c475 in ExecModifyTable (pstate=0x178a220) at nodeModifyTable.c:2159
#4  0x00000000006e05cb in ExecProcNodeFirst (node=0x178a220) at execProcnode.c:445
#5  0x00000000006d552e in ExecProcNode (node=0x178a220) at ../../../src/include/executor/executor.h:247
#6  0x00000000006d7d66 in ExecutePlan (estate=0x1789eb8, planstate=0x178a220, use_parallel_mode=false, operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x17a7688, execute_once=true) at execMain.c:1723
#7  0x00000000006d5af8 in standard_ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#8  0x00000000006d5920 in ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307
#9  0x00000000008c1092 in ProcessQuery (plan=0x16b3ac0, sourceText=0x16b1ec8 "insert into t1 values (1,'1','1');", params=0x0, queryEnv=0x0, dest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:161
#10 0x00000000008c29a1 in PortalRunMulti (portal=0x1717488, isTopLevel=true, setHoldSnapshot=false, dest=0x17a7688, altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:1286
#11 0x00000000008c1f7a in PortalRun (portal=0x1717488, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x17a7688, altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:799
#12 0x00000000008bbf16 in exec_simple_query (query_string=0x16b1ec8 "insert into t1 values (1,'1','1');") at postgres.c:1145
#13 0x00000000008c01a1 in PostgresMain (argc=1, argv=0x16dbaf8, dbname=0x16db960 "testdb", username=0x16aeba8 "xdb") at postgres.c:4182
#14 0x000000000081e07c in BackendRun (port=0x16d3940) at postmaster.c:4361
#15 0x000000000081d7ef in BackendStartup (port=0x16d3940) at postmaster.c:4033
#16 0x0000000000819be9 in ServerLoop () at postmaster.c:1706
#17 0x000000000081949f in PostmasterMain (argc=1, argv=0x16acb60) at postmaster.c:1379
#18 0x0000000000742941 in main (argc=1, argv=0x16acb60) at main.c:228
(gdb)