This article explains how the Kubernetes Eviction Manager is started. The walkthrough is kept simple and clear; follow along as we dig into the kubelet source step by step.
When instantiating its kubelet object, the Kubelet calls eviction.NewManager to create an evictionManager object.
```go
// pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
	...
	thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
	}
	...
	// setup eviction manager
	evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
	}
	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
	...
}
```
When the kubelet starts working in its Run method, it launches a goroutine that executes updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the runtime started successfully, it calls initializeRuntimeDependentModules to initialize the modules that depend on the runtime.
```go
// pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
}

// pkg/kubelet/kubelet.go:2040
func (kl *Kubelet) updateRuntimeUp() {
	...
	kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
	...
}
```
Tracing into initializeRuntimeDependentModules shows that the runtime-dependent modules are cadvisor and the evictionManager; initialization simply calls each one's Start method.
```go
// pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
	if err := kl.cadvisor.Start(); err != nil {
		// Fail kubelet and rely on the babysitter to retry starting kubelet.
		// TODO(random-liu): Add backoff logic in the babysitter
		glog.Fatalf("Failed to start cAdvisor %v", err)
	}
	// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
	if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
		kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
	}
}
```
So from here we move on to the evictionManager itself. As the analysis above shows, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup. But before rushing ahead, we first need to see how the Eviction Manager is defined.
```go
// pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
	// used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []Threshold
	// resourceToRankFunc maps a resource to ranking function for that resource.
	resourceToRankFunc map[v1.ResourceName]rankFunc
	// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
	notifiersInitialized bool
}
```
managerImpl is the concrete definition of the evictionManager. The fields worth focusing on are:
config
- The evictionManager configuration (a sketch of wiring these flags follows this list), including:
  - PressureTransitionPeriod (--eviction-pressure-transition-period)
  - MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)
  - Thresholds (--eviction-hard, --eviction-soft)
  - KernelMemcgNotification (--experimental-kernel-memcg-notification)

killPodFunc
- The function invoked to kill a pod during eviction. When the kubelet calls NewManager, this is set to killPodNow (pkg/kubelet/pod_workers.go:285).

imageGC
- When the node reports a diskPressure condition, imageGC deletes unused images to reclaim disk space.

summaryProvider
- Provides the latest aggregated status data for the node and all its pods, i.e. NodeStats and []PodStats.

thresholdsFirstObservedAt
- Records when each threshold was first observed.

thresholdsMet
- Holds the thresholds that have been triggered but not yet resolved, including those still waiting out their grace period.

resourceToRankFunc
- Maps each resource to the ranking function used when picking eviction candidates for that resource.

resourceToNodeReclaimFuncs
- Maps each resource to the functions called to reclaim that resource.

lastObservations
- The eviction signal observations recorded in the previous pass, used to ensure that threshold updates always follow the correct time series.

notifiersInitialized
- A bool indicating whether the threshold notifiers have been initialized, which determines whether kernel memcg notifications can be used to speed up eviction response. It is false when the manager is created; whether kernel memcg notification is actually used depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.
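To make these config knobs concrete, here is a hypothetical sketch of wiring example flag values through eviction.ParseThresholdConfig into an eviction.Config, mirroring the NewMainKubelet snippet above. The flag values are illustrative only, not recommended settings, and the surrounding function context is elided just as in the quoted snippets.

```go
// Hypothetical flag values, written in the grammar the eviction flags accept;
// they are examples for illustration, not defaults.
thresholds, err := eviction.ParseThresholdConfig(
	"memory.available<100Mi,nodefs.available<10%", // --eviction-hard
	"memory.available<300Mi",                      // --eviction-soft
	"memory.available=30s",                        // --eviction-soft-grace-period
	"memory.available=0Mi",                        // --eviction-minimum-reclaim
)
if err != nil {
	return nil, err
}
evictionConfig := eviction.Config{
	PressureTransitionPeriod: 5 * time.Minute, // --eviction-pressure-transition-period
	MaxPodGracePeriodSeconds: 30,              // --eviction-max-pod-grace-period
	Thresholds:               thresholds,
	KernelMemcgNotification:  false, // --experimental-kernel-memcg-notification
}
```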
The kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager. The code of eviction.NewManager is trivial: it just assigns fields.
```go
// pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
	summaryProvider stats.SummaryProvider,
	config Config,
	killPodFunc KillPodFunc,
	imageGC ImageGC,
	recorder record.EventRecorder,
	nodeRef *v1.ObjectReference,
	clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
	manager := &managerImpl{
		clock:                        clock,
		killPodFunc:                  killPodFunc,
		imageGC:                      imageGC,
		config:                       config,
		recorder:                     recorder,
		summaryProvider:              summaryProvider,
		nodeRef:                      nodeRef,
		nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
		thresholdsFirstObservedAt:    thresholdsObservedAt{},
	}
	return manager, manager, nil
}
```
One important point, though: NewManager returns not only the evictionManager but also a lifecycle.PodAdmitHandler instance, evictionAdmitHandler. Note that the code returns the same managerImpl instance under both interfaces (return manager, manager, nil), not two separate objects. evictionAdmitHandler is used by the kubelet to run an admission check before creating a Pod; Pod creation continues only if the check passes. The check is performed by the Admit(attrs *lifecycle.PodAdmitAttributes) method, shown below:
```go
// pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	// the node has memory pressure, admit if not best-effort
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
		if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}
	// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
	glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  reason,
		Message: fmt.Sprintf(message, m.nodeConditions),
	}
}
```
This Admit logic is exactly the EvictionManager's influence on Pod scheduling described in the Scheduler section of the Kubernetes Eviction Manager mechanism analysis. (It is consistent: BestEffort pods, which declare no resource requests or limits, are the first eviction candidates under memory pressure, so admitting new ones would be pointless.)

The kubelet periodically reports Node Conditions to the kube-apiserver, where they are persisted in etcd. When kube-scheduler observes a pressure Node Condition, it blocks further Pods from binding to that node according to the following policy:
| Node Condition | Scheduler Behavior |
| --- | --- |
| MemoryPressure | No new BestEffort pods are scheduled to the node. |
| DiskPressure | No new pods are scheduled to the node. |
We will analyze the code of killPodNow later.
With that, this section has answered what the evictionManager is and where it comes from. Next, let's look at how it starts.
As analyzed above, the kubelet starts the evictionManager while initializing the runtime-dependent modules (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)). So let's start with the Start method:
```go
// pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
	// start the eviction manager monitoring
	go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
	return nil
}
```
Very simple: it launches a goroutine that runs m.synchronize, waits monitoringInterval (10s) after each run completes, then runs m.synchronize again, over and over. A self-contained sketch of this wait.Until pattern follows.
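To make the loop semantics concrete, here is a minimal, runnable sketch of the wait.Until pattern Start relies on. This is an illustration, not kubelet code, and the k8s.io/apimachinery/pkg/util/wait import path is an assumption (the kubelet tree quoted above vendored its own wait package).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait" // assumed import path
)

func main() {
	stopCh := make(chan struct{})
	// wait.Until runs the func, sleeps for the period, and repeats until
	// stopCh closes. Start passes wait.NeverStop, so its loop never exits.
	go wait.Until(func() { fmt.Println("synchronize tick") }, 10*time.Second, stopCh)
	time.Sleep(25 * time.Second) // observe a few ticks
	close(stopCh)
}
```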
Next comes the evictionManager's key working flow:
```go
// pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
	// if we have nothing to do, just return
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return
	}

	// build the ranking functions (if not yet known)
	if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
		// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
		hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
		if err != nil {
			return
		}
		m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
	}

	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
	if err != nil {
		glog.Errorf("eviction manager: unexpected err: %v", err)
		return
	}

	// attempt to create a threshold notifier to improve eviction response time
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
		m.notifiersInitialized = true
		// start soft memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
			glog.Infof("soft memory eviction threshold crossed at %s", desc)
			// TODO wait grace period for soft memory limit
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
		}
		// start hard memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
		}
	}

	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)

	// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}

	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)

	// track when a threshold was first observed
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

	// the set of node conditions that are triggered by currently observed thresholds
	nodeConditions := nodeConditions(thresholds)

	// track when a node condition was last observed
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)

	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)

	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	m.lastObservations = observations
	m.Unlock()

	// determine the set of resources under starvation
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return
	}

	// rank the resources to reclaim by eviction priority
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)

	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)

	// record an event about the resources we are now attempting to reclaim via eviction
	m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
		glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return
	}
	glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)

	// rank the pods for eviction
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	if !ok {
		glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
		return
	}

	// the only candidates viable for eviction are those pods that had anything running.
	activePods := podFunc()
	if len(activePods) == 0 {
		glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return
	}

	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)
	glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))

	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		// record that we are evicting the pod
		m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// this is a blocking call and should only return when the pod and its containers are killed.
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
			continue
		}
		// success, so we return until the next housekeeping interval
		glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
		return
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
}
```
The code is very tidy and the comments are on point. The key flow is as follows:
1. Register, via buildResourceToRankFunc and buildResourceToNodeReclaimFuncs, the per-resource ranking functions used to order Pods for eviction and the reclaim functions used to recover node resources.
2. Obtain the Eviction Signal Observations and the Pod StatsFunc (needed later when ranking Pods) from cAdvisor via makeSignalObservations.
3. If the kubelet is configured with --experimental-kernel-memcg-notification set to true, start the soft & hard memory notifications via startMemoryThresholdNotifier. The moment system usage crosses the soft or hard memory thresholds, the kubelet is notified immediately and triggers evictionManager.synchronize to run the reclaim flow, improving eviction responsiveness.
4. From the observations computed from cAdvisor data and the configured thresholds, compute the thresholds met in this pass via thresholdsMet.
5. Again from the observations and the previously recorded m.thresholdsMet, compute via thresholdsMet the thresholds already recorded but not yet resolved, and merge them with the thresholds from the previous step.
6. Filter the thresholds by comparing the signal timestamps in lastObservations against those in observations, keeping only thresholds whose stats have been updated.
7. Update thresholdsFirstObservedAt and nodeConditions.
8. Filter out the thresholds whose grace period has elapsed between their observed time and now.
9. Update the evictionManager's internal state: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet, lastObservations.
10. Derive starvedResources from the thresholds and sort them; if memory is among the starved resources, memory sorts first (see the sketch after this list).
11. Take the first resource in starvedResources and call reclaimNodeLevelResources to reclaim that resource at the node level. If, after reclaiming, available satisfies thresholdValue + evictionMinimumReclaim, the flow ends here and no user pods are evicted.
12. If reclaimNodeLevelResources is not enough to meet the goal, continue to evict user pods: first rank all active Pods using the function registered earlier by buildResourceToRankFunc.
13. Following that order, call killPodNow to kill the selected pod. If killing a pod fails, skip it and try the next pod in order. As soon as one pod is killed successfully, return; in other words, at most one Pod is killed per pass.
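As a minimal sketch of the resource ordering in step 10, here is what a byEvictionPriority sort consistent with the behavior described above looks like. The byEvictionPriority name appears in the synchronize code; the import path and the helper pickResourceToReclaim are assumptions for illustration.

```go
package eviction

import (
	"sort"

	"k8s.io/kubernetes/pkg/api/v1" // assumed import path for this source tree
)

// byEvictionPriority sorts starved resources so that memory always sorts
// ahead of every other resource; all non-memory resources rank equally.
type byEvictionPriority []v1.ResourceName

func (a byEvictionPriority) Len() int           { return len(a) }
func (a byEvictionPriority) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byEvictionPriority) Less(i, j int) bool { return a[i] == v1.ResourceMemory }

// pickResourceToReclaim is a hypothetical helper mirroring what synchronize
// does inline: sort the starved resources and take the highest-priority one.
func pickResourceToReclaim(starved []v1.ResourceName) v1.ResourceName {
	sort.Sort(byEvictionPriority(starved))
	return starved[0] // memory, if present
}
```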
Two steps in the flow above are the most critical: reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).
```go
// pkg/kubelet/eviction/eviction_manager.go:340
// reclaimNodeLevelResources attempts to reclaim node level resources. returns true if thresholds were satisfied and no pod eviction is required.
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
	nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// attempt to reclaim the pressured resource.
		reclaimed, err := nodeReclaimFunc()
		if err == nil {
			// update our local observations based on the amount reported to have been reclaimed.
			// note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
			signal := resourceToSignal[resourceToReclaim]
			value, ok := observations[signal]
			if !ok {
				glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
				continue
			}
			value.available.Add(*reclaimed)
			// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
			if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
				return true
			}
		} else {
			glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}
	}
	return false
}

// pkg/kubelet/pod_workers.go:283
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
	return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
		// determine the grace period to use when killing the pod
		gracePeriod := int64(0)
		if gracePeriodOverride != nil {
			gracePeriod = *gracePeriodOverride
		} else if pod.Spec.TerminationGracePeriodSeconds != nil {
			gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
		}

		// we timeout and return an error if we don't get a callback within a reasonable time.
		// the default timeout is relative to the grace period (we settle on 2s to wait for kubelet->runtime traffic to complete in sigkill)
		timeout := int64(gracePeriod + (gracePeriod / 2))
		minTimeout := int64(2)
		if timeout < minTimeout {
			timeout = minTimeout
		}
		timeoutDuration := time.Duration(timeout) * time.Second

		// open a channel we block against until we get a result
		type response struct {
			err error
		}
		ch := make(chan response)
		podWorkers.UpdatePod(&UpdatePodOptions{
			Pod:        pod,
			UpdateType: kubetypes.SyncPodKill,
			OnCompleteFunc: func(err error) {
				ch <- response{err: err}
			},
			KillPodOptions: &KillPodOptions{
				PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
					return status
				},
				PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
			},
		})

		// wait for either a response, or a timeout
		select {
		case r := <-ch:
			return r.err
		case <-time.After(timeoutDuration):
			recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
			return fmt.Errorf("timeout waiting to kill pod")
		}
	}
}
```
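A quick worked example of the timeout arithmetic in killPodNow: a hard eviction passes gracePeriodOverride = 0, so timeout = 0 + 0/2 = 0, which is clamped up to the 2s minimum; a soft eviction with --eviction-max-pod-grace-period=30 yields gracePeriod = 30 and a timeout of 30 + 15 = 45s, after which killPodNow records an ExceededGracePeriod event and returns a timeout error.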
At this point, the entire main flow of the evictionManager has been analyzed.