This article looks at how to use the Kubernetes PodDisruptionBudget (PDB). Many people are unsure how PDB should be applied in day-to-day operations, so the notes below collect a simple, practical approach.
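The PodDisruptionBudget object (PDB for short) was added around Kubernetes 1.4 and promoted to Beta in 1.5, yet it was still Beta when 1.9 was released. Never mind that; let's first think about what problem PDB is meant to solve. The PDB feature has been around for more than a year, but I had not looked into it before, mainly for lack of a use case. Recently I have been designing an ElasticSearch as a Service (ESaaS) offering on Kubernetes, and I need to make sure, as far as possible, that every ElasticSearch cluster always has at least one healthy ES client pod, ES master pod, and ES data pod. Many readers will think: a Deployment can set maxUnavailable, so isn't that enough? Besides, isn't the RS controller doing replica control anyway?
Hold on! When does a Deployment's maxUnavailable apply? It guarantees a minimum number of serving replicas while an application deployed by a Deployment is going through a rolling update. And the RS controller? It is just one of the replica controllers. It does not guarantee that the cluster always has a given number of replicas; its job is to bring the actual replica count in line with the desired count as quickly as possible, and it does not care what the actual count is at moments in between. This is where Kubernetes PDB comes in: it protects application availability by setting budgets for voluntary disruptions.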
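Voluntary disruption was mentioned above, so let's sort out the terms: what is a voluntary disruption, and what is an involuntary disruption?
An involuntary disruption is one caused by external factors that are out of our control (or currently hard to control), for example: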
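A hardware failure or kernel crash takes a node down.
The containers run in a VM, and the VM is deleted by mistake or the hypervisor fails.
The cluster suffers a network partition (split-brain). (Kubernetes handles network partitions through the NodeController, but it still does not take application availability into account when it evicts pods.) For an in-depth look at the NodeController, see my earlier posts:
Kubernetes Node Controller source code analysis: execution
Kubernetes Node Controller source code analysis: creation
Kubernetes Node Controller source code analysis: configuration
Kubernetes Node Controller source code analysis: Taint Controller
A node runs short of compute resources because of unreasonable overcommitment; the kubelet eviction this triggers does not take application availability into account either. For an in-depth look at kubelet eviction, see my earlier posts:
Kubernetes Eviction Manager source code analysis
Kubernetes Eviction Manager: how it works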
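PDB does not address involuntary disruptions. So how can we reduce or mitigate their impact on application availability when running on Kubernetes?
Deploy the application with a replica controller such as a Deployment, RS, or StatefulSet wherever possible, with replicas greater than 1.
Set resource requests on the application's containers, so that it still has enough resources even when the node is under heavy resource pressure.
Also consider HA at the physical level: spread an application's replicas across servers, across racks and cabinets, across switches, and so on.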
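The opposite of an involuntary disruption is, naturally, a voluntary disruption: one triggered by a user or cluster administrator in a way Kubernetes can control, for example:
Deleting the controller that manages the pods, such as a Deployment, RS, RC, or StatefulSet.
Triggering a rolling update of the application.
Deleting pods directly in bulk.
Running kubectl drain on a node (node decommissioning, cluster scale-down).
PDB is designed for voluntary disruptions, which fall within what Kubernetes can control; it is not designed for involuntary disruptions.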
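Once the Kube-Node project goes live, it can integrate with cloud providers such as OpenStack, AWS, and GCE to manage nodes automatically, so HNA (Horizontal Node Autoscaler) events may occur frequently. That workflow includes logic similar to draining a node, so PDB is needed to protect application HA.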
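Every app deployed on Kubernetes can have a corresponding PDB object that limits either the maximum number of replicas that may be down or the minimum number of replicas that must stay available during voluntary disruptions, and thereby protects the application's availability.
PDB can protect applications managed by Kubernetes' built-in controllers; in that case the PDB selector must be identical to the selector of the controller object: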
Deployment
ReplicationController
ReplicaSet
StatefulSet
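It can also protect a set of pods that is chosen solely by the PDB selector itself, with two restrictions:
Only .spec.minAvailable can be configured; maxUnavailable cannot be used.
.spec.minAvailable must be an integer, not a percentage.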
Either way, the set of pods a PDB affects is always chosen by its own selector. Make sure that different PDB objects in the same namespace do not use overlapping selectors.
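To make the overlap rule concrete, here is a minimal Go sketch that reports which budgets select a given pod. It only models matchLabels-style selectors (matchExpressions are ignored), and the names selectorMatches and matchingPDBs are hypothetical helpers for illustration, not Kubernetes APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// selectorMatches reports whether every key/value pair of a matchLabels-style
// selector is present in the pod's labels. (Hypothetical helper.)
func selectorMatches(selector, podLabels map[string]string) bool {
	for k, want := range selector {
		got, ok := podLabels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// matchingPDBs returns the sorted names of all budgets whose selector matches
// the pod. More than one name means the selectors overlap for this pod.
func matchingPDBs(pdbs map[string]map[string]string, podLabels map[string]string) []string {
	var names []string
	for name, selector := range pdbs {
		if selectorMatches(selector, podLabels) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	pdbs := map[string]map[string]string{
		"zk-pdb":   {"app": "zookeeper"},
		"tier-pdb": {"tier": "backend"},
	}
	pod := map[string]string{"app": "zookeeper", "tier": "backend"}

	matches := matchingPDBs(pdbs, pod)
	if len(matches) > 1 {
		fmt.Println("overlapping budgets for this pod:", matches)
	}
}
```

A check like this could run before creating a new PDB, since a pod selected by more than one budget cannot be evicted through the eviction subresource.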
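When using PDB, be clear about the type of application you have and the protection you want:
Stateless application: for example, you want at least 60% of the replicas to stay available.
Solution: create a PDB object with minAvailable set to 60%, or maxUnavailable set to 40%.
Single-instance stateful application: the customer must be notified, and must agree, before the instance is terminated.
Solution: create a PDB object with maxUnavailable set to 0, so that Kubernetes blocks eviction of the instance. Notify the user and get consent, then delete the PDB to lift the block, and only then recreate the instance. A rolling update of a single-instance StatefulSet always involves downtime, so single-instance StatefulSets are not recommended in production.
Multi-instance stateful application: the number of available instances must not drop below some number N (for example, because of the election mechanism of a Raft-style application).
Solution: set maxUnavailable=1 or minAvailable=N. The former allows only one instance to be evicted at a time; the latter allows up to expected_replicas - minAvailable instances at a time.
Batch Job: the Job needs one pod to eventually complete the task successfully.
The Job controller has its own mechanism to guarantee this, so no PDB is needed.
For an in-depth look at the Job controller, see my post: Kubernetes Job Controller source code analysis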
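The budget arithmetic behind these choices can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the controller's code: resolve and desiredHealthy are hypothetical names, values are passed as strings such as "2" or "60%", and percentages are rounded up, which is my understanding of how the controller resolves them and should be verified against your Kubernetes version.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// resolve turns "2" or "60%" into an absolute pod count for the given total.
// Percentages are rounded up (an assumption of this sketch).
func resolve(value string, total int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return (pct*total + 99) / 100, nil
	}
	return strconv.Atoi(value)
}

// desiredHealthy returns the minimum number of healthy pods a budget demands.
// Exactly one of minAvailable and maxUnavailable must be non-empty.
func desiredHealthy(minAvailable, maxUnavailable string, expected int) (int, error) {
	if (minAvailable == "") == (maxUnavailable == "") {
		return 0, errors.New("set exactly one of minAvailable and maxUnavailable")
	}
	if minAvailable != "" {
		return resolve(minAvailable, expected)
	}
	maxUn, err := resolve(maxUnavailable, expected)
	if err != nil {
		return 0, err
	}
	if maxUn > expected {
		return 0, nil
	}
	return expected - maxUn, nil
}

func main() {
	cases := []struct {
		name, minAvailable, maxUnavailable string
		replicas                           int
	}{
		{"stateless, minAvailable 60%", "60%", "", 5},
		{"stateless, maxUnavailable 40%", "", "40%", 5},
		{"single instance, maxUnavailable 0", "", "0", 1},
		{"raft-style, minAvailable 2", "2", "", 3},
	}
	for _, c := range cases {
		healthy, err := desiredHealthy(c.minAvailable, c.maxUnavailable, c.replicas)
		if err != nil {
			fmt.Println(c.name, "error:", err)
			continue
		}
		fmt.Printf("%s: keep %d healthy, up to %d may be evicted\n", c.name, healthy, c.replicas-healthy)
	}
}
```

With 5 replicas, minAvailable 60% and maxUnavailable 40% resolve to the same floor of 3 healthy pods in this sketch, which is why the article lists them as alternatives.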
Having thought this through and decided to create a PDB, let's look at how a PodDisruptionBudget is defined. Here is a sample:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
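A PDB definition really comes down to three key fields:
.spec.selector selects the backing set of pods. Best practice is to keep it identical to the selector of the application's Deployment or StatefulSet.
.spec.minAvailable is the number or percentage of pods that must remain available during voluntary disruptions.
.spec.maxUnavailable is the maximum number or percentage of pods that may be unavailable during voluntary disruptions. It requires Kubernetes version >= 1.7 and can only be used for pods managed by a Deployment, RS, RC, or StatefulSet. Prefer .spec.maxUnavailable where possible.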
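Notes:
A single PDB object cannot define both .spec.minAvailable and .spec.maxUnavailable. As mentioned earlier, pods deleted or made unavailable during a rolling update also count as a voluntary disruption, but a rolling update is governed by its own strategy settings (maxSurge and maxUnavailable), so PDB does not interfere with that process.
PDB only guarantees the replica count with respect to voluntary disruptions. Suppose pods are being evicted and the count sits exactly at the .spec.minAvailable or .spec.maxUnavailable limit, and then a previously healthy pod dies because its node goes down (an involuntary disruption). The actual number of pods is now below what the PDB requires. PDB is not a cure-all!
In practice, setting .spec.minAvailable to 100% or .spec.maxUnavailable to 0% blocks pod eviction entirely (rolling updates of Deployments and StatefulSets excepted).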
Create the PDB object with kubectl apply -f zk-pdb.yaml, then list it:
$ kubectl get poddisruptionbudgets
NAME      MIN-AVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb    2               1                     7s
View the full object with kubectl get pdb zk-pdb -o yaml:
$ kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: 2017-08-28T02:38:26Z
  generation: 1
  name: zk-pdb
...
status:
  currentHealthy: 3
  desiredHealthy: 3
  disruptedPods: null
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1
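A PDB object declares the state the user expects to hold during voluntary disruptions. What actually maintains that state is a controller run by kube-controller-manager: the Disruption Controller.
The Disruption Controller mainly watches Pods and PDBs. When it sees an Add/Delete/Update event for a pod or PDB, it puts the corresponding PDB object on a rate-limited queue for a worker to process. The worker's main job is to compute the currentHealthy, desiredHealthy, expectedCount, and disruptedPods values of PodDisruptionBudgetStatus, and then call the API to update the PDB status.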
pkg/controller/disruption/disruption.go:498

func (dc *DisruptionController) trySync(pdb *policy.PodDisruptionBudget) error {
	pods, err := dc.getPodsForPdb(pdb)
	if err != nil {
		dc.recorder.Eventf(pdb, v1.EventTypeWarning, "NoPods", "Failed to get pods: %v", err)
		return err
	}
	if len(pods) == 0 {
		dc.recorder.Eventf(pdb, v1.EventTypeNormal, "NoPods", "No matching pods found")
	}

	expectedCount, desiredHealthy, err := dc.getExpectedPodCount(pdb, pods)
	if err != nil {
		dc.recorder.Eventf(pdb, v1.EventTypeWarning, "CalculateExpectedPodCountFailed", "Failed to calculate the number of expected pods: %v", err)
		return err
	}

	currentTime := time.Now()
	disruptedPods, recheckTime := dc.buildDisruptedPodMap(pods, pdb, currentTime)
	currentHealthy := countHealthyPods(pods, disruptedPods, currentTime)
	err = dc.updatePdbStatus(pdb, currentHealthy, desiredHealthy, expectedCount, disruptedPods)

	if err == nil && recheckTime != nil {
		// There is always at most one PDB waiting with a particular name in the queue,
		// and each PDB in the queue is associated with the lowest timestamp
		// that was supplied when a PDB with that name was added.
		dc.enqueuePdbForRecheck(pdb, recheckTime.Sub(currentTime))
	}
	return err
}
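trySync hands its results to updatePdbStatus, whose body is not shown in this article. As a rough sketch of the arithmetic that turns those values into status.disruptionsAllowed, assuming it is simply the surplus of healthy pods over the desired minimum, clamped at zero (the function name below is my own):

```go
package main

import "fmt"

// disruptionsAllowed sketches how the allowed-disruptions figure can be derived:
// the surplus of healthy pods over the desired minimum, never negative, and
// zero when no pods are expected at all. (Assumed logic, hypothetical name.)
func disruptionsAllowed(currentHealthy, desiredHealthy, expectedCount int32) int32 {
	allowed := currentHealthy - desiredHealthy
	if expectedCount <= 0 || allowed <= 0 {
		return 0
	}
	return allowed
}

func main() {
	// Three healthy pods and a budget that requires two: one eviction may proceed.
	fmt.Println(disruptionsAllowed(3, 2, 3))
	// One pod is already down: no further voluntary disruption is allowed.
	fmt.Println(disruptionsAllowed(2, 2, 3))
}
```

The clamp matters because an involuntary disruption can push currentHealthy below desiredHealthy, and the budget should then read as zero, not as a negative number.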
Here is the definition of PodDisruptionBudgetStatus:
pkg/apis/policy/types.go:48

type PodDisruptionBudgetStatus struct {
	// Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
	// status information is valid only if observedGeneration equals to PDB's object generation.
	// +optional
	ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`

	// DisruptedPods contains information about pods whose eviction was
	// processed by the API server eviction subresource handler but has not
	// yet been observed by the PodDisruptionBudget controller.
	// A pod will be in this map from the time when the API server processed the
	// eviction request to the time when the pod is seen by PDB controller
	// as having been marked for deletion (or after a timeout). The key in the map is the name of the pod
	// and the value is the time when the API server processed the eviction request. If
	// the deletion didn't occur and a pod is still there it will be removed from
	// the list automatically by PodDisruptionBudget controller after some time.
	// If everything goes smooth this map should be empty for the most of the time.
	// Large number of entries in the map may indicate problems with pod deletions.
	DisruptedPods map[string]metav1.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`

	// Number of pod disruptions that are currently allowed.
	PodDisruptionsAllowed int32 `json:"disruptionsAllowed" protobuf:"varint,3,opt,name=disruptionsAllowed"`

	// current number of healthy pods
	CurrentHealthy int32 `json:"currentHealthy" protobuf:"varint,4,opt,name=currentHealthy"`

	// minimum desired number of healthy pods
	DesiredHealthy int32 `json:"desiredHealthy" protobuf:"varint,5,opt,name=desiredHealthy"`

	// total number of pods counted by this disruption budget
	ExpectedPods int32 `json:"expectedPods" protobuf:"varint,6,opt,name=expectedPods"`
}
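The most important fields of PodDisruptionBudgetStatus are DisruptedPods and PodDisruptionsAllowed:
DisruptedPods: holds pods whose eviction has been processed by the apiserver's pod eviction subresource but has not yet been observed by the PDB controller. It is a map whose key is the pod name and whose value is the time the apiserver accepted the eviction request. Each entry has a 2-minute timeout: if the pod still has not been deleted after 2 minutes, it is removed from the map.
PodDisruptionsAllowed: the number of pod disruptions currently allowed.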
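The Disruption Controller's main logic is to update PDB.Status. So who actually enforces maxUnavailable or minAvailable when pods are evicted during a voluntary disruption?
As a reminder, the PDB controller only deals with pods that go through the pod eviction subresource, so the answer has to be found in the pod's EvictionREST.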
pkg/registry/core/pod/storage/eviction.go:81

// Create attempts to create a new eviction. That is, it tries to evict a pod.
func (r *EvictionREST) Create(ctx genericapirequest.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, includeUninitialized bool) (runtime.Object, error) {
	eviction := obj.(*policy.Eviction)

	obj, err := r.store.Get(ctx, eviction.Name, &metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	pod := obj.(*api.Pod)
	var rtStatus *metav1.Status
	var pdbName string
	err = retry.RetryOnConflict(EvictionsRetry, func() error {
		pdbs, err := r.getPodDisruptionBudgets(ctx, pod)
		if err != nil {
			return err
		}

		if len(pdbs) > 1 {
			rtStatus = &metav1.Status{
				Status:  metav1.StatusFailure,
				Message: "This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.",
				Code:    500,
			}
			return nil
		} else if len(pdbs) == 1 {
			pdb := pdbs[0]
			pdbName = pdb.Name
			// Try to verify-and-decrement

			// If it was false already, or if it becomes false during the course of our retries,
			// raise an error marked as a 429.
			if err := r.checkAndDecrement(pod.Namespace, pod.Name, pdb); err != nil {
				return err
			}
		}
		return nil
	})
	if err == wait.ErrWaitTimeout {
		err = errors.NewTimeoutError(fmt.Sprintf("couldn't update PodDisruptionBudget %q due to conflicts", pdbName), 10)
	}
	if err != nil {
		return nil, err
	}

	if rtStatus != nil {
		return rtStatus, nil
	}

	// At this point there was either no PDB or we succeeded in decrementing

	// Try the delete
	_, _, err = r.store.Delete(ctx, eviction.Name, eviction.DeleteOptions)
	if err != nil {
		return nil, err
	}

	// Success!
	return &metav1.Status{Status: metav1.StatusSuccess}, nil
}
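When a pod eviction is requested through EvictionREST, it checks that the pod has at most one matching PDB; with more than one it returns an error. For how to use the Eviction API, see The Eviction API; only a simple sample is given here: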
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}

$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
checkAndDecrement then checks whether the PDB's maxUnavailable or minAvailable would still be satisfied; if so, it decrements pdb.Status.PodDisruptionsAllowed by 1.
If checkAndDecrement succeeds, the pod is actually deleted.
// checkAndDecrement checks if the provided PodDisruptionBudget allows any disruption.
func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) error {
	if pdb.Status.ObservedGeneration < pdb.Generation {
		// TODO(mml): Add a Retry-After header. Once there are time-based
		// budgets, we can sometimes compute a sensible suggested value. But
		// even without that, we can give a suggestion (10 minutes?) that
		// prevents well-behaved clients from hammering us.
		err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
		err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s is still being processed by the server.", pdb.Name)})
		return err
	}
	if pdb.Status.PodDisruptionsAllowed < 0 {
		return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("pdb disruptions allowed is negative"))
	}
	if len(pdb.Status.DisruptedPods) > MaxDisruptedPodSize {
		return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("DisruptedPods map too big - too many evictions not confirmed by PDB controller"))
	}
	if pdb.Status.PodDisruptionsAllowed == 0 {
		err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
		err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)})
		return err
	}

	pdb.Status.PodDisruptionsAllowed--
	if pdb.Status.DisruptedPods == nil {
		pdb.Status.DisruptedPods = make(map[string]metav1.Time)
	}

	// Eviction handler needs to inform the PDB controller that it is about to delete a pod
	// so it should not consider it as available in calculations when updating PodDisruptions allowed.
	// If the pod is not deleted within a reasonable time limit PDB controller will assume that it won't
	// be deleted at all and remove it from DisruptedPod map.
	pdb.Status.DisruptedPods[podName] = metav1.Time{Time: time.Now()}
	if _, err := r.podDisruptionBudgetClient.PodDisruptionBudgets(namespace).UpdateStatus(&pdb); err != nil {
		return err
	}

	return nil
}
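checkAndDecrement mainly checks that pdb.Status.PodDisruptionsAllowed is greater than 0, and that DisruptedPods holds no more than 2000 pods (the Disruption Controller may not perform well enough to support more).
If the checks pass, it decrements pdb.Status.PodDisruptionsAllowed by 1 and adds the pod to the DisruptedPods map, with the current time as the value (the time the apiserver accepted the eviction request).
It then updates the PDB. Because the PDB controller watches PDB update events, this triggers the controller logic again to maintain the PDB status.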
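Note: PDB is also used in the scheduler. During preemptive scheduling based on pod priority, generic_scheduler checks every pod on a node against its PDB when choosing pods to preempt and counts how many would violate a PDB. When selecting a node, it prefers the node with fewer PDB-violating pods.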
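The node preference described in this note can be illustrated as below. This is only a sketch of the single criterion mentioned here; the real scheduler applies further tie-breakers that the article does not cover, and victim, countViolations, and pickNode are hypothetical names.

```go
package main

import "fmt"

// victim is a pod that would be preempted on a node. (Hypothetical type.)
type victim struct {
	name        string
	violatesPDB bool // true if evicting this pod would break its budget
}

// countViolations counts the victims whose preemption would violate a PDB.
func countViolations(victims []victim) int {
	n := 0
	for _, v := range victims {
		if v.violatesPDB {
			n++
		}
	}
	return n
}

// pickNode returns the candidate node with the fewest PDB-violating victims.
// Ties keep the earlier candidate; an empty candidate list yields "".
func pickNode(nodes []string, victims map[string][]victim) string {
	best, bestCount := "", -1
	for _, node := range nodes {
		n := countViolations(victims[node])
		if bestCount == -1 || n < bestCount {
			best, bestCount = node, n
		}
	}
	return best
}

func main() {
	victims := map[string][]victim{
		"node-a": {{"es-data-0", true}, {"es-data-1", true}},
		"node-b": {{"es-client-0", true}, {"batch-1", false}},
		"node-c": {{"batch-2", false}},
	}
	fmt.Println("preempt on:", pickNode([]string{"node-a", "node-b", "node-c"}, victims))
}
```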