Fix JobTrackingWithFinalizers when a pod succeeds after the job fails #111646

Merged
k8s-ci-robot merged 1 commit into kubernetes:master on Aug 3, 2022

Member

@alculquicondor alculquicondor commented Aug 2, 2022

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

When JobTrackingWithFinalizers is enabled, this sequence of steps led to a failure:

  1. The job fails (this could be, for example, when there is a zero backoff limit and a pod fails)
  2. The job controller adds all running pods to .status.uncountedTerminatedPods.failed.
  3. One of the running pods finishes successfully.
  4. The job controller's final status update fails.
  5. In the next sync, the job controller tries to count that pod as succeeded, which conflicts with the failed entry recorded in step 2, so the apiserver rejects the update.

The fix is to check if a succeeded pod was considered as failed before and stick with that decision.
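
The fix can be pictured with a minimal, self-contained Go sketch. It is not the controller's code: the `pod`, `uncounted`, and `track` names are hypothetical stand-ins, and plain maps replace the UID sets the controller keeps for `.status.uncountedTerminatedPods`. The only assumption is that a pod UID already recorded as failed must never be recorded as succeeded afterwards.

```go
package main

// podPhase is a simplified stand-in for the pod phases used in this sketch.
type podPhase string

const (
	podRunning   podPhase = "Running"
	podSucceeded podPhase = "Succeeded"
	podFailed    podPhase = "Failed"
)

// pod is a hypothetical, reduced view of a pod: only UID and phase.
type pod struct {
	UID   string
	Phase podPhase
}

// uncounted is a hypothetical stand-in for the succeeded and failed UID
// lists in .status.uncountedTerminatedPods.
type uncounted struct {
	succeeded map[string]bool
	failed    map[string]bool
}

func newUncounted() *uncounted {
	return &uncounted{
		succeeded: map[string]bool{},
		failed:    map[string]bool{},
	}
}

// track records a pod in the uncounted lists during one sync.
// A succeeded pod is counted as succeeded only if it was not already
// recorded as failed; otherwise the earlier decision is kept.
func track(u *uncounted, p pod, jobFailed bool) {
	if p.Phase == podSucceeded && !u.failed[p.UID] {
		u.succeeded[p.UID] = true
		return
	}
	// Once the job has failed, remaining pods are accounted as failed,
	// regardless of the phase they reach later.
	if p.Phase == podFailed || jobFailed {
		u.failed[p.UID] = true
	}
}
```

Calling `track` for a running pod with `jobFailed` set, and again after that pod's phase changes to `Succeeded`, leaves the UID only in the failed set, so a retried status update no longer contradicts the earlier one.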

Which issue(s) this PR fixes:

Ref kubernetes/enhancements#2307

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix JobTrackingWithFinalizers when a pod succeeds after the job is considered failed, which led to API conflicts that blocked finishing the job.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 2, 2022
@alculquicondor
Member Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 2, 2022
@alculquicondor
Member Author

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 2, 2022
@alculquicondor
Member Author

/assign @soltysh

@ahg-g
Member

ahg-g commented Aug 2, 2022

/milestone v1.25

@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Aug 2, 2022
Member

@neolit123 neolit123 left a comment

/approve
for test due to the urgency but deferring lgtm/review to *Job owners.

@neolit123
Member

neolit123 commented Aug 2, 2022

> Fix JobTrackingWithFinalizers when a pod succeeds after the job is considered failed, which led to API conflicts.

release notes should ideally exclude code details and cover what users might see as a bug and post the fix.

@alculquicondor
Member Author

> release notes should ideally exclude code details and cover what users might see as a bug and post the fix.

which part of the note do you think has code details?

JobTrackingWithFinalizers is a feature gate, so user-visible. Should I remove the "which led to API conflicts" part?

@@ -1023,7 +1023,7 @@ func (jm *Controller) trackJobStatusAndRemoveFinalizers(ctx context.Context, job
 		if podFinished || podTerminating || job.DeletionTimestamp != nil {
 			podsToRemoveFinalizer = append(podsToRemoveFinalizer, pod)
 		}
-		if pod.Status.Phase == v1.PodSucceeded {
+		if pod.Status.Phase == v1.PodSucceeded && !uncounted.failed.Has(string(pod.UID)) {
Member

Regarding the comment above at line 1018: are we counting them as failed because we may not get another job sync to remove their finalizers?

Member Author

Correct, that is covered in line 1020.

Member

It is a nit, but I don't think that comment covers it if its sole purpose is to "trigger another sync to remove the finalizers".

Contributor

@soltysh soltysh left a comment

/lgtm
/approve
/hold
to get the integration sorted out

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022
@soltysh
Contributor

soltysh commented Aug 2, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Aug 2, 2022
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, neolit123, soltysh


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 2, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2022
Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47
@alculquicondor
Member Author

Fixed the integration test.

If the finalizers were removed before setting pods to Succeeded, apiserver would delete the pods, because they were unscheduled. Solved by adding a NodeName directly into the pod template.

@alculquicondor
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022
@ahg-g
Member

ahg-g commented Aug 3, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit 182e098 into kubernetes:master Aug 3, 2022
k8s-ci-robot added a commit that referenced this pull request Aug 3, 2022
…of-#111646-upstream-release-1.23

Automated cherry pick of #111646: Fix JobTrackingWithFinalizers when a pod succeeds after the
k8s-ci-robot added a commit that referenced this pull request Aug 3, 2022
…of-#111646-upstream-release-1.24

Automated cherry pick of #111646: Fix JobTrackingWithFinalizers when a pod succeeds after the
@liggitt liggitt removed the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 9, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.