Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix job tracking leaving pods with finalizers #109486

Merged
merged 4 commits into from May 4, 2022

Conversation

alculquicondor
Copy link
Member

@alculquicondor alculquicondor commented Apr 14, 2022

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

  • Add a repro in integration (it's very hard to make it fail, but it does fail)
  • Fix the underlying bug: Do not declare a job as finished until the pod expectations are satisfied.
  • Fix side problem: clusters running with the bug might have left pods with finalizers that belong to finished jobs. Also related to Jobs deletion races with jobs-tracking pods finalizer #109429
  • Enable integration test Foreground job deletion + GC.

Which issue(s) this PR fixes:

Fixes #109485

Special notes for your reviewer:

This PR does not re-enable the feature. We can do so after we get good signal from the integration tests in CI.

Does this PR introduce a user-facing change?

Fix JobTrackingWithFinalizers that:
- was declaring a job finished before counting all the created pods in the status
- was leaving pods with finalizers, blocking pod and job deletions

JobTrackingWithFinalizers is still disabled by default.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 14, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 14, 2022
@alculquicondor alculquicondor changed the title Integration test for backoff limit and finalizers WIP Integration test for backoff limit and finalizers Apr 14, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2022
@alculquicondor
Copy link
Member Author

/test pull-kubernetes-integration

@alculquicondor alculquicondor force-pushed the job-backofflimit branch 2 times, most recently from c4bbf31 to c80d267 Compare April 14, 2022 18:58
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 14, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2022
@alculquicondor
Copy link
Member Author

/test pull-kubernetes-integration

pods, err := clientSet.CoreV1().Pods(jobObj.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: metav1.FormatLabelSelector(jobObj.Spec.Selector),
})
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to discard that err is different than notFound? or is not a possibility in these tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the list has zero items, the query doesn't return a NotFound error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotFound can only be returned in a get by name case, list at most will return an empty array.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 20, 2022
Change-Id: Ic231ce9a5504d3aae4191901d7eb5fe69bf017ac
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 21, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2022
@soltysh
Copy link
Contributor

soltysh commented Apr 21, 2022

This needs to be picked back to 1.24 and 1.23 to ensure folks using this feature don't struggle with orphaned pods.

@ahg-g
Copy link
Member

ahg-g commented Apr 25, 2022

/milestone v1.24
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 25, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Apr 25, 2022
@@ -782,17 +783,17 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rEr
if uncounted == nil {
// Legacy behavior: pretend all active pods were successfully removed.
deleted = active
} else if deleted != active {
} else if deleted != active || !satisfiedExpectations {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have context to review this line... if we skip this because we haven't satisfied expectations yet, are we relying on a retry and eventually satisfying those expectations, or is this a one-time short-circuit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not relying on retry per-se. We are relying on the pod creation events to add the job to the workqueue again. This is the same thing we do when starting a job to avoid creating extra pods.

// Can't declare the Job as finished yet, as there might be remaining
// pod finalizers.
// pod finalizers or pods that are not in the informer's cache yet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the informer cache bit makes it sound like we're waiting for the informer to catch up so that expectations are satisfied... is that always going to happen (for example, if the namespace has been deleted and no new pods can be created)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectations are based on successful pod creations. If the namespace gets deleted in between, we would at least get a pod creation, because the pod is created with a finalizer.

@alculquicondor
Copy link
Member Author

/milestone v1.24

We are too late for 1.24.0, so we are aiming to get this in 1.25 and 1.24.1

@helayoty
Copy link
Member

/milestone v1.24

We are too late for 1.24.0, so we are aiming to get this in 1.25 and 1.24.1

👋 Bug Triage Shadow for 1.24 here, We discussed this issue on the sig-release channel and we see it's best to be removed from 1.24.0, instead you can target 1.24.1 or 1.25.

@JamesLaverack
Copy link
Member

JamesLaverack commented Apr 25, 2022

/milestone clear

To avoid accidental merge into master before the 1.24 freeze finishes.

@JamesLaverack JamesLaverack removed this from the v1.24 milestone Apr 25, 2022
@alculquicondor
Copy link
Member Author

/hold cancel

Is code-freeze over soon?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit 63a618a into kubernetes:master May 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone May 4, 2022
k8s-ci-robot added a commit that referenced this pull request May 9, 2022
…of-#109486-upstream-release-1.24

Automated cherry pick of #109486: Integration test for backoff limit and finalizers
k8s-ci-robot added a commit that referenced this pull request May 9, 2022
…of-#109486-upstream-release-1.23

Automated cherry pick of #109486: Integration test for backoff limit and finalizers
@liggitt liggitt removed the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/batch area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Job tracking might leave pods with finalizers when backoff limit is reached
8 participants