Fix job tracking leaving pods with finalizers #109486
Conversation
/test pull-kubernetes-integration
/test pull-kubernetes-integration
pods, err := clientSet.CoreV1().Pods(jobObj.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(jobObj.Spec.Selector),
})
if err != nil {
Do you need to handle the case where err is something other than NotFound, or is that not a possibility in these tests?
Even if the list has zero items, the query doesn't return a NotFound error.
NotFound can only be returned by a get-by-name call; a list will at most return an empty array.
Change-Id: Ic231ce9a5504d3aae4191901d7eb5fe69bf017ac
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This needs to be cherry-picked back to 1.24 and 1.23 to ensure folks using this feature don't struggle with orphaned pods.
/milestone v1.24
@@ -782,17 +783,17 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rEr
	if uncounted == nil {
		// Legacy behavior: pretend all active pods were successfully removed.
		deleted = active
	} else if deleted != active {
	} else if deleted != active || !satisfiedExpectations {
I don't have context to review this line... if we skip this because we haven't satisfied expectations yet, are we relying on a retry to eventually satisfy those expectations, or is this a one-time short-circuit?
We are not relying on a retry per se. We are relying on the pod creation events to add the job to the workqueue again. This is the same thing we do when starting a job, to avoid creating extra pods.
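The expectations mechanism described above can be sketched roughly as follows (hypothetical types, not the real controller code): the controller records how many pod creations it requested, each observed informer creation event decrements the count and re-queues the job, and sync short-circuits until the count reaches zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// expectations is a minimal stand-in for the controller's expectations
// tracker: a counter of pod creations requested but not yet observed.
type expectations struct{ pending int64 }

// ExpectCreations records that the controller issued n pod Create calls.
func (e *expectations) ExpectCreations(n int64) { atomic.AddInt64(&e.pending, n) }

// CreationObserved is called from the informer's add handler; in the real
// controller this is also where the job is added back to the workqueue.
func (e *expectations) CreationObserved() { atomic.AddInt64(&e.pending, -1) }

// Satisfied reports whether every requested creation has been observed.
func (e *expectations) Satisfied() bool { return atomic.LoadInt64(&e.pending) <= 0 }

func main() {
	e := &expectations{}
	e.ExpectCreations(2) // sync issued 2 pod Create calls

	fmt.Println(e.Satisfied()) // not yet: sync short-circuits, waits for events

	e.CreationObserved() // informer saw pod 1 (job re-queued)
	e.CreationObserved() // informer saw pod 2 (job re-queued)

	fmt.Println(e.Satisfied()) // now the next sync can count pods safely
}
```

Because each creation event re-queues the job, the short-circuit is not one-time: the job is synced again once the informer has caught up.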
// Can't declare the Job as finished yet, as there might be remaining
// pod finalizers.
// pod finalizers or pods that are not in the informer's cache yet.
the informer cache bit makes it sound like we're waiting for the informer to catch up so that expectations are satisfied... is that always going to happen (for example, if the namespace has been deleted and no new pods can be created)?
The expectations are based on successful pod creations. If the namespace gets deleted in between, we would still get at least a pod creation event, because the pod is created with a finalizer.
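A small sketch of why a created pod is always observable (toy types, not the real API objects; the finalizer name "batch.kubernetes.io/job-tracking" is the one used by this feature, but treat its exact spelling here as an assumption): an object carrying a finalizer cannot be fully removed until the controller strips the finalizer, so even a pod doomed by namespace deletion still produces a creation event first.

```go
package main

import "fmt"

// objectMeta is a toy stand-in for Kubernetes object metadata.
type objectMeta struct {
	Name       string
	Finalizers []string
}

// deletable mirrors the garbage-collection rule: an object can only be
// removed from storage once its finalizer list is empty.
func deletable(m objectMeta) bool { return len(m.Finalizers) == 0 }

func main() {
	pod := objectMeta{
		Name:       "sample-job-abc12",
		Finalizers: []string{"batch.kubernetes.io/job-tracking"},
	}
	fmt.Println(deletable(pod)) // blocked until the job controller acts

	// The controller accounts for the pod, then patches the finalizer away.
	pod.Finalizers = nil
	fmt.Println(deletable(pod)) // now deletion can complete
}
```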
We are too late for 1.24.0, so we are aiming to get this in 1.25 and 1.24.1
👋 Bug Triage Shadow for 1.24 here. We discussed this issue on the
/milestone clear
To avoid accidental merge into
/hold cancel
Is code-freeze over soon?
…of-#109486-upstream-release-1.24 Automated cherry pick of #109486: Integration test for backoff limit and finalizers
…of-#109486-upstream-release-1.23 Automated cherry pick of #109486: Integration test for backoff limit and finalizers
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #109485
Special notes for your reviewer:
This PR does not re-enable the feature. We can do so after we get good signal from the integration tests in CI.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: