kubelet: Mark ready condition as false explicitly for terminal pods #110256
Conversation
force-pushed from 3876e5e to 00c321c
/retest
force-pushed from 00c321c to ac975b4
/retest
force-pushed from ac975b4 to 44c16cd
/cc
Great find!
/kind bug
Thanks Tim! I agree with you: ideally clients should ignore terminal pods and, if the pod is terminal, ignore the ready condition in the first place. We can't guarantee that all clients will do that, though. However, at least pre-1.22 (before this regression was introduced), clients could assume that terminal pods would always report ready=false, which is no longer the case. This PR aims to revert to that behavior and ensure kubelet never reports an "invalid" status update of ready=true for a terminal pod.
Yup, this doesn't make it any worse for clients ignoring phases, but rather brings back the guarantee that terminal pods will always report a ready status of false.
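To make the intended invariant concrete, here is a minimal sketch of the override (not the actual patch; the helper name and reason string are illustrative), using the k8s.io/api types:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// markTerminalPodNotReady forces the Ready and ContainersReady conditions
// to false once the pod phase is terminal (Succeeded or Failed), restoring
// the pre-1.22 guarantee discussed above. Hypothetical helper, not the
// actual kubelet code.
func markTerminalPodNotReady(status *v1.PodStatus) {
	if status.Phase != v1.PodSucceeded && status.Phase != v1.PodFailed {
		return
	}
	for i := range status.Conditions {
		t := status.Conditions[i].Type
		if t == v1.PodReady || t == v1.ContainersReady {
			status.Conditions[i].Status = v1.ConditionFalse
			status.Conditions[i].Reason = "PodCompleted" // illustrative reason string
		}
	}
}

func main() {
	status := v1.PodStatus{
		Phase: v1.PodFailed,
		Conditions: []v1.PodCondition{
			{Type: v1.PodReady, Status: v1.ConditionTrue},
			{Type: v1.ContainersReady, Status: v1.ConditionTrue},
		},
	}
	markTerminalPodNotReady(&status)
	fmt.Printf("%+v\n", status.Conditions) // both conditions are now False
}
```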
/approve
I'm a big fan of leaving future-me clues.
We should move directories around so that podutil OWNERs don't need API approvers.
@@ -69,6 +71,15 @@ func GenerateContainersReadyCondition(spec *v1.PodSpec, containerStatuses []v1.C
		}
	}

	// If the pod phase is failed, explicitly set the ready condition to false for containers since they may be in progress of terminating.
	if podPhase == v1.PodFailed {
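The hunk is truncated here; judging from the review below, this revision returned a not-ready condition inline, roughly like the following hypothetical reconstruction (later replaced by the shared helper):

```go
		// Hypothetical body of the truncated branch above, mirroring the
		// succeeded case; the reason string is a guess.
		return v1.PodCondition{
			Type:   v1.ContainersReady,
			Status: v1.ConditionFalse,
			Reason: "PodFailed",
		}
	}
```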
Why is this not also checking PodSucceeded? Should we leave a comment?
And why does this not call generateContainersReadyConditionForTerminalPhase(podPhase)?
Why is this not also checking PodSucceeded? Should we leave a comment?
PodSucceeded is already handled above on line 66, see here.
And why does this not call generateContainersReadyConditionForTerminalPhase(podPhase)?
I created that helper function to be used in status_manager.go so it can handle either phase. I returned the condition directly inline here since that's what's done in the succeeded case above, but I agree with you: we can switch it to use the helper function as well.
I updated based on your feedback and reused generateContainersReadyConditionForTerminalPhase for both the succeeded and failed cases, so the condition generation logic is in one place.
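Based on this thread, the shared helper presumably looks something like the following sketch (inferred from the discussion and the diff below; details such as the reason strings may differ from the merged code):

```go
package kubeletstatus // hypothetical package name, for illustration only

import v1 "k8s.io/api/core/v1"

// generateContainersReadyConditionForTerminalPhase, as sketched from this
// thread: for a terminal phase it always produces a ContainersReady
// condition with status false, with a phase-appropriate reason.
func generateContainersReadyConditionForTerminalPhase(podPhase v1.PodPhase) v1.PodCondition {
	condition := v1.PodCondition{
		Type:   v1.ContainersReady,
		Status: v1.ConditionFalse,
	}
	if podPhase == v1.PodSucceeded {
		condition.Reason = "PodCompleted" // assumed reason strings
	} else if podPhase == v1.PodFailed {
		condition.Reason = "PodFailed"
	}
	return condition
}
```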
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bobbypage, dchen1107, mrunalp, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold to update based on feedback
Terminal pods may continue to report a ready condition of true because there is a delay in reconciling the ready condition of the containers from the runtime with the pod status. It should be invalid for kubelet to report a terminal phase with a true ready condition. To fix the issue, explicitly override the ready condition to false for terminal pods during status updates. Signed-off-by: David Porter <david@porter.me>
Use a watch to detect invalid pod status updates in graceful node shutdown node e2e test. By using a watch, all pod updates will be captured while the previous logic required polling the api-server which could miss some intermediate updates. Signed-off-by: David Porter <david@porter.me>
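As a sketch of the watch-based approach (assuming a standard client-go clientset; names are illustrative, not the actual e2e test code), the test can consume every status update from a watch channel and fail on the first terminal-phase event that still reports ready=true:

```go
package e2enode // illustrative package, not the real test file

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchForInvalidPodStatus consumes every pod update from a watch, so no
// intermediate status can be missed (unlike polling), and reports the first
// event in which a terminal pod still claims Ready=true.
func watchForInvalidPodStatus(ctx context.Context, client kubernetes.Interface, namespace string) error {
	w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		// Only terminal phases are subject to the ready=false invariant.
		if pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == v1.PodReady && cond.Status == v1.ConditionTrue {
				return fmt.Errorf("pod %s: terminal phase %s with ready=true", pod.Name, pod.Status.Phase)
			}
		}
	}
	return nil
}
```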
force-pushed from d0cd673 to b4b338d
/unhold
Pushed an additional revision to fix suggestions here.
Assuming previous LGTM holds after minor changes: /lgtm
/retest
…10256-upstream-release-1.23 Automated cherry pick of #110256: kubelet: Mark ready condition as false explicitly for terminal pods
…10256-upstream-release-1.22 Automated cherry pick of #110256: kubelet: Mark ready condition as false explicitly for terminal pods
…10256-upstream-release-1.24 Automated cherry pick of #110256: kubelet: Mark ready condition as false explicitly for terminal pods
		Status: v1.ConditionFalse,
		Reason: PodCompleted,
	}
	return generateContainersReadyConditionForTerminalPhase(podPhase)
Going back and looking through this, the len(unknownContainers) == 0 above has me worried.
There are two phases in play - the apiserver phase, and the one the Kubelet tracks. Setting phase to terminal on the apiserver is roughly the same outcome as deleting the pod - the kubelet should inexorably converge the actual pod state to stopped. Likewise if the Kubelet observes a failure or evicts the pod, the phase should inexorably converge. However, internal to the Kubelet status loop the phase should match the actual observed state of the pod containers, and so we should never be in a terminal state here without these containers being in some state we recognize (I think).
This is something I want to have laid out in the pod lifecycle clarification KEP so we can look at all the factors and eliminate my "shoulds" from above - it might be that the correct status when succeeded + > 1 unknown container = pod ready = false here, always. But it's super subtle, and we need this sort of stuff in a clear top level doc, not in comment threads in a PR. I'll add it to the list.
Terminal pods may continue to report a ready condition of true because
there is a delay in reconciling the ready condition of the containers
from the runtime with the pod status. It should be invalid for kubelet
to report a terminal phase with a true ready condition. To fix the
issue, explicitly override the ready condition to false for terminal
pods during status updates.
Signed-off-by: David Porter <david@porter.me>
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #108594
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: