This article explains how the Kubernetes Eviction Manager is started. The walkthrough is kept simple and clear; follow along as we dig into the kubelet source step by step.
When instantiating its kubelet object, the Kubelet calls eviction.NewManager to create an evictionManager object.
```go
// pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
	...
	thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
	}
	...
	// setup eviction manager
	evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
	}
	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
	...
}
```
When the kubelet starts working in its Run method, it launches a goroutine that executes updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the runtime started successfully, it calls initializeRuntimeDependentModules to initialize the modules that depend on the runtime.
```go
// pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
}

// pkg/kubelet/kubelet.go:2040
func (kl *Kubelet) updateRuntimeUp() {
	...
	kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
	...
}
```
Tracing into initializeRuntimeDependentModules shows that the runtime-dependent modules are cadvisor and the evictionManager; initialization simply calls each one's Start method.
```go
// pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
	if err := kl.cadvisor.Start(); err != nil {
		// Fail kubelet and rely on the babysitter to retry starting kubelet.
		// TODO(random-liu): Add backoff logic in the babysitter
		glog.Fatalf("Failed to start cAdvisor %v", err)
	}
	// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
	if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
		kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
	}
}
```
So from here we move on to the evictionManager itself. As the analysis above shows, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup. But before rushing ahead, we first need to see how the Eviction Manager is defined.
```go
// pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
	// used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []Threshold
	// resourceToRankFunc maps a resource to ranking function for that resource.
	resourceToRankFunc map[v1.ResourceName]rankFunc
	// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
	notifiersInitialized bool
}
```
managerImpl is the concrete definition of the evictionManager. The fields worth focusing on are:
config
- The evictionManager configuration (a sketch of wiring these flags follows this list), including:
  - PressureTransitionPeriod (--eviction-pressure-transition-period)
  - MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)
  - Thresholds (--eviction-hard, --eviction-soft)
  - KernelMemcgNotification (--experimental-kernel-memcg-notification)

killPodFunc
- The function invoked to kill a pod during eviction. When the kubelet calls NewManager, this is set to killPodNow (pkg/kubelet/pod_workers.go:285).

imageGC
- When the node reports a diskPressure condition, imageGC deletes unused images to reclaim disk space.

summaryProvider
- Provides the latest aggregated status data for the node and all its pods, i.e. NodeStats and []PodStats.

thresholdsFirstObservedAt
- Records when each threshold was first observed.

thresholdsMet
- Holds the thresholds that have been triggered but not yet resolved, including those still waiting out their grace period.

resourceToRankFunc
- Maps each resource to the ranking function used when picking eviction candidates for that resource.

resourceToNodeReclaimFuncs
- Maps each resource to the functions called to reclaim that resource.

lastObservations
- The eviction signal observations recorded in the previous pass, used to ensure that threshold updates always follow the correct time series.

notifiersInitialized
- A bool indicating whether the threshold notifiers have been initialized, which determines whether kernel memcg notifications can be used to speed up eviction response. It is false when the manager is created; whether kernel memcg notification is actually used depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.
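To make these config knobs concrete, here is a hypothetical sketch of wiring example flag values through eviction.ParseThresholdConfig into an eviction.Config, mirroring the NewMainKubelet snippet above. The flag values are illustrative only, not recommended settings, and the surrounding function context is elided just as in the quoted snippets.

```go
// Hypothetical flag values, written in the grammar the eviction flags accept;
// they are examples for illustration, not defaults.
thresholds, err := eviction.ParseThresholdConfig(
	"memory.available<100Mi,nodefs.available<10%", // --eviction-hard
	"memory.available<300Mi",                      // --eviction-soft
	"memory.available=30s",                        // --eviction-soft-grace-period
	"memory.available=0Mi",                        // --eviction-minimum-reclaim
)
if err != nil {
	return nil, err
}
evictionConfig := eviction.Config{
	PressureTransitionPeriod: 5 * time.Minute, // --eviction-pressure-transition-period
	MaxPodGracePeriodSeconds: 30,              // --eviction-max-pod-grace-period
	Thresholds:               thresholds,
	KernelMemcgNotification:  false, // --experimental-kernel-memcg-notification
}
```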
The kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager. The code of eviction.NewManager is trivial: it just assigns fields.
```go
// pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
	summaryProvider stats.SummaryProvider,
	config Config,
	killPodFunc KillPodFunc,
	imageGC ImageGC,
	recorder record.EventRecorder,
	nodeRef *v1.ObjectReference,
	clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
	manager := &managerImpl{
		clock:                        clock,
		killPodFunc:                  killPodFunc,
		imageGC:                      imageGC,
		config:                       config,
		recorder:                     recorder,
		summaryProvider:              summaryProvider,
		nodeRef:                      nodeRef,
		nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
		thresholdsFirstObservedAt:    thresholdsObservedAt{},
	}
	return manager, manager, nil
}
```
One important point, though: NewManager returns not only the evictionManager but also a lifecycle.PodAdmitHandler instance, evictionAdmitHandler. Note that the code returns the same managerImpl instance under both interfaces (return manager, manager, nil), not two separate objects. evictionAdmitHandler is used by the kubelet to run an admission check before creating a Pod; Pod creation continues only if the check passes. The check is performed by the Admit(attrs *lifecycle.PodAdmitAttributes) method, shown below:
```go
// pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	// the node has memory pressure, admit if not best-effort
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
		if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}
	// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
	glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  reason,
		Message: fmt.Sprintf(message, m.nodeConditions),
	}
}
```
This Admit logic is exactly the EvictionManager's influence on Pod scheduling described in the Scheduler section of the Kubernetes Eviction Manager mechanism analysis. (It is consistent: BestEffort pods, which declare no resource requests or limits, are the first eviction candidates under memory pressure, so admitting new ones would be pointless.)

The kubelet periodically reports Node Conditions to the kube-apiserver, where they are persisted in etcd. When kube-scheduler observes a pressure Node Condition, it blocks further Pods from binding to that node according to the following policy:
| Node Condition | Scheduler Behavior |
| --- | --- |
| MemoryPressure | No new BestEffort pods are scheduled to the node. |
| DiskPressure | No new pods are scheduled to the node. |
We will analyze the code of killPodNow later.
With that, this section has answered what the evictionManager is and where it comes from. Next, let's look at how it starts.
As analyzed above, the kubelet starts the evictionManager while initializing the runtime-dependent modules (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)). So let's start with the Start method:
```go
// pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
	// start the eviction manager monitoring
	go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
	return nil
}
```
Very simple: it launches a goroutine that runs m.synchronize, waits monitoringInterval (10s) after each run completes, then runs m.synchronize again, over and over. A self-contained sketch of this wait.Until pattern follows.
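To make the loop semantics concrete, here is a minimal, runnable sketch of the wait.Until pattern Start relies on. This is an illustration, not kubelet code, and the k8s.io/apimachinery/pkg/util/wait import path is an assumption (the kubelet tree quoted above vendored its own wait package).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait" // assumed import path
)

func main() {
	stopCh := make(chan struct{})
	// wait.Until runs the func, sleeps for the period, and repeats until
	// stopCh closes. Start passes wait.NeverStop, so its loop never exits.
	go wait.Until(func() { fmt.Println("synchronize tick") }, 10*time.Second, stopCh)
	time.Sleep(25 * time.Second) // observe a few ticks
	close(stopCh)
}
```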
Next comes the evictionManager's key working flow:
```go
// pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
	// if we have nothing to do, just return
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return
	}

	// build the ranking functions (if not yet known)
	if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
		// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
		hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
		if err != nil {
			return
		}
		m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
	}

	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
	if err != nil {
		glog.Errorf("eviction manager: unexpected err: %v", err)
		return
	}

	// attempt to create a threshold notifier to improve eviction response time
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
		m.notifiersInitialized = true
		// start soft memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
			glog.Infof("soft memory eviction threshold crossed at %s", desc)
			// TODO wait grace period for soft memory limit
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
		}
		// start hard memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
		}
	}

	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)

	// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}

	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)

	// track when a threshold was first observed
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

	// the set of node conditions that are triggered by currently observed thresholds
	nodeConditions := nodeConditions(thresholds)

	// track when a node condition was last observed
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)

	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)

	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	m.lastObservations = observations
	m.Unlock()

	// determine the set of resources under starvation
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return
	}

	// rank the resources to reclaim by eviction priority
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)

	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)

	// record an event about the resources we are now attempting to reclaim via eviction
	m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
		glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return
	}
	glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)

	// rank the pods for eviction
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	if !ok {
		glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
		return
	}

	// the only candidates viable for eviction are those pods that had anything running.
	activePods := podFunc()
	if len(activePods) == 0 {
		glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return
	}

	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)
	glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))

	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		// record that we are evicting the pod
		m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// this is a blocking call and should only return when the pod and its containers are killed.
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
			continue
		}
		// success, so we return until the next housekeeping interval
		glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
		return
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
}
```
The code is very tidy and the comments are on point. The key flow is as follows:
1. Register, via buildResourceToRankFunc and buildResourceToNodeReclaimFuncs, the per-resource ranking functions used to order Pods for eviction and the reclaim functions used to recover node resources.
2. Obtain the Eviction Signal Observations and the Pod StatsFunc (needed later when ranking Pods) from cAdvisor via makeSignalObservations.
3. If the kubelet is configured with --experimental-kernel-memcg-notification set to true, start the soft & hard memory notifications via startMemoryThresholdNotifier. The moment system usage crosses the soft or hard memory thresholds, the kubelet is notified immediately and triggers evictionManager.synchronize to run the reclaim flow, improving eviction responsiveness.
4. From the observations computed from cAdvisor data and the configured thresholds, compute the thresholds met in this pass via thresholdsMet.
5. Again from the observations and the previously recorded m.thresholdsMet, compute via thresholdsMet the thresholds already recorded but not yet resolved, and merge them with the thresholds from the previous step.
6. Filter the thresholds by comparing the signal timestamps in lastObservations against those in observations, keeping only thresholds whose stats have been updated.
7. Update thresholdsFirstObservedAt and nodeConditions.
8. Filter out the thresholds whose grace period has elapsed between their observed time and now.
9. Update the evictionManager's internal state: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet, lastObservations.
10. Derive starvedResources from the thresholds and sort them; if memory is among the starved resources, memory sorts first (see the sketch after this list).
11. Take the first resource in starvedResources and call reclaimNodeLevelResources to reclaim that resource at the node level. If, after reclaiming, available satisfies thresholdValue + evictionMinimumReclaim, the flow ends here and no user pods are evicted.
12. If reclaimNodeLevelResources is not enough to meet the goal, continue to evict user pods: first rank all active Pods using the function registered earlier by buildResourceToRankFunc.
13. Following that order, call killPodNow to kill the selected pod. If killing a pod fails, skip it and try the next pod in order. As soon as one pod is killed successfully, return; in other words, at most one Pod is killed per pass.
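As a minimal sketch of the resource ordering in step 10, here is what a byEvictionPriority sort consistent with the behavior described above looks like. The byEvictionPriority name appears in the synchronize code; the import path and the helper pickResourceToReclaim are assumptions for illustration.

```go
package eviction

import (
	"sort"

	"k8s.io/kubernetes/pkg/api/v1" // assumed import path for this source tree
)

// byEvictionPriority sorts starved resources so that memory always sorts
// ahead of every other resource; all non-memory resources rank equally.
type byEvictionPriority []v1.ResourceName

func (a byEvictionPriority) Len() int           { return len(a) }
func (a byEvictionPriority) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byEvictionPriority) Less(i, j int) bool { return a[i] == v1.ResourceMemory }

// pickResourceToReclaim is a hypothetical helper mirroring what synchronize
// does inline: sort the starved resources and take the highest-priority one.
func pickResourceToReclaim(starved []v1.ResourceName) v1.ResourceName {
	sort.Sort(byEvictionPriority(starved))
	return starved[0] // memory, if present
}
```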
Two steps in the flow above are the most critical: reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).
```go
// pkg/kubelet/eviction/eviction_manager.go:340
// reclaimNodeLevelResources attempts to reclaim node level resources. returns true if thresholds were satisfied and no pod eviction is required.
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
	nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// attempt to reclaim the pressured resource.
		reclaimed, err := nodeReclaimFunc()
		if err == nil {
			// update our local observations based on the amount reported to have been reclaimed.
			// note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
			signal := resourceToSignal[resourceToReclaim]
			value, ok := observations[signal]
			if !ok {
				glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
				continue
			}
			value.available.Add(*reclaimed)
			// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
			if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
				return true
			}
		} else {
			glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}
	}
	return false
}

// pkg/kubelet/pod_workers.go:283
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
	return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
		// determine the grace period to use when killing the pod
		gracePeriod := int64(0)
		if gracePeriodOverride != nil {
			gracePeriod = *gracePeriodOverride
		} else if pod.Spec.TerminationGracePeriodSeconds != nil {
			gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
		}

		// we timeout and return an error if we don't get a callback within a reasonable time.
		// the default timeout is relative to the grace period (we settle on 2s to wait for kubelet->runtime traffic to complete in sigkill)
		timeout := int64(gracePeriod + (gracePeriod / 2))
		minTimeout := int64(2)
		if timeout < minTimeout {
			timeout = minTimeout
		}
		timeoutDuration := time.Duration(timeout) * time.Second

		// open a channel we block against until we get a result
		type response struct {
			err error
		}
		ch := make(chan response)
		podWorkers.UpdatePod(&UpdatePodOptions{
			Pod:        pod,
			UpdateType: kubetypes.SyncPodKill,
			OnCompleteFunc: func(err error) {
				ch <- response{err: err}
			},
			KillPodOptions: &KillPodOptions{
				PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
					return status
				},
				PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
			},
		})

		// wait for either a response, or a timeout
		select {
		case r := <-ch:
			return r.err
		case <-time.After(timeoutDuration):
			recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
			return fmt.Errorf("timeout waiting to kill pod")
		}
	}
}
```
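A quick worked example of the timeout arithmetic in killPodNow: a hard eviction passes gracePeriodOverride = 0, so timeout = 0 + 0/2 = 0, which is clamped up to the 2s minimum; a soft eviction with --eviction-max-pod-grace-period=30 yields gracePeriod = 30 and a timeout of 30 + 15 = 45s, after which killPodNow records an ExceededGracePeriod event and returns a timeout error.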
At this point, the entire main flow of the evictionManager has been analyzed.