Fix job tracking leaving pods with finalizers #109486
Conversation
/test pull-kubernetes-integration
/test pull-kubernetes-integration
pods, err := clientSet.CoreV1().Pods(jobObj.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(jobObj.Spec.Selector),
})
if err != nil {
Do you need to handle the case where err is something other than NotFound, or is that not a possibility in these tests?
Even if the list has zero items, the query doesn't return a NotFound error.
NotFound can only be returned by a get-by-name call; a list will at most return an empty array.
Change-Id: Ic231ce9a5504d3aae4191901d7eb5fe69bf017ac
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This needs to be cherry-picked back to 1.24 and 1.23 to ensure folks using this feature don't struggle with orphaned pods.
/milestone v1.24
@@ -782,17 +783,17 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rEr
	if uncounted == nil {
		// Legacy behavior: pretend all active pods were successfully removed.
		deleted = active
	} else if deleted != active {
	} else if deleted != active || !satisfiedExpectations {
I don't have context to review this line... if we skip this because we haven't satisfied expectations yet, are we relying on a retry to eventually satisfy those expectations, or is this a one-time short-circuit?
We are not relying on a retry per se. We are relying on the pod creation events to add the job to the workqueue again. This is the same thing we do when starting a job, to avoid creating extra pods.
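The expectations mechanism described above can be sketched roughly as follows (hypothetical types, not the real controller code): the controller records how many pod creations it requested, each observed informer creation event decrements the count and re-queues the job, and sync short-circuits until the count reaches zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// expectations is a minimal stand-in for the controller's expectations
// tracker: a counter of pod creations requested but not yet observed.
type expectations struct{ pending int64 }

// ExpectCreations records that the controller issued n pod Create calls.
func (e *expectations) ExpectCreations(n int64) { atomic.AddInt64(&e.pending, n) }

// CreationObserved is called from the informer's add handler; in the real
// controller this is also where the job is added back to the workqueue.
func (e *expectations) CreationObserved() { atomic.AddInt64(&e.pending, -1) }

// Satisfied reports whether every requested creation has been observed.
func (e *expectations) Satisfied() bool { return atomic.LoadInt64(&e.pending) <= 0 }

func main() {
	e := &expectations{}
	e.ExpectCreations(2) // sync issued 2 pod Create calls

	fmt.Println(e.Satisfied()) // not yet: sync short-circuits, waits for events

	e.CreationObserved() // informer saw pod 1 (job re-queued)
	e.CreationObserved() // informer saw pod 2 (job re-queued)

	fmt.Println(e.Satisfied()) // now the next sync can count pods safely
}
```

Because each creation event re-queues the job, the short-circuit is not one-time: the job is synced again once the informer has caught up.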
// Can't declare the Job as finished yet, as there might be remaining
// pod finalizers.
// pod finalizers or pods that are not in the informer's cache yet.
the informer cache bit makes it sound like we're waiting for the informer to catch up so that expectations are satisfied... is that always going to happen (for example, if the namespace has been deleted and no new pods can be created)?
The expectations are based on successful pod creations. If the namespace gets deleted in between, we would still get at least a pod creation event, because the pod is created with a finalizer.
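A small sketch of why a created pod is always observable (toy types, not the real API objects; the finalizer name "batch.kubernetes.io/job-tracking" is the one used by this feature, but treat its exact spelling here as an assumption): an object carrying a finalizer cannot be fully removed until the controller strips the finalizer, so even a pod doomed by namespace deletion still produces a creation event first.

```go
package main

import "fmt"

// objectMeta is a toy stand-in for Kubernetes object metadata.
type objectMeta struct {
	Name       string
	Finalizers []string
}

// deletable mirrors the garbage-collection rule: an object can only be
// removed from storage once its finalizer list is empty.
func deletable(m objectMeta) bool { return len(m.Finalizers) == 0 }

func main() {
	pod := objectMeta{
		Name:       "sample-job-abc12",
		Finalizers: []string{"batch.kubernetes.io/job-tracking"},
	}
	fmt.Println(deletable(pod)) // blocked until the job controller acts

	// The controller accounts for the pod, then patches the finalizer away.
	pod.Finalizers = nil
	fmt.Println(deletable(pod)) // now deletion can complete
}
```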
We are too late for 1.24.0, so we are aiming to get this in 1.25 and 1.24.1
👋 Bug Triage Shadow for 1.24 here. We discussed this issue on the
/milestone clear
To avoid accidental merge into
/hold cancel
Is code-freeze over soon?
…of-#109486-upstream-release-1.24 Automated cherry pick of #109486: Integration test for backoff limit and finalizers
…of-#109486-upstream-release-1.23 Automated cherry pick of #109486: Integration test for backoff limit and finalizers
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #109485
Special notes for your reviewer:
This PR does not re-enable the feature. We can do so after we get good signal from the integration tests in CI.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: