
Prevent dirty service object leaking between reconciles #109601

Merged
1 commit merged into kubernetes:master on Jul 28, 2022

Conversation

@mdbooth (Contributor) commented Apr 21, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

We were passing a Service object returned by the Service informer directly to the Service controller loop. The loop is expected to modify the Service object during execution, but doing so on the informer's object is not safe: any modification the controller makes pollutes the Service informer's cache, so a potentially dirty object is passed to subsequent invocations of the controller loop.

We fix this issue by passing a copy of the object from the informer instead.
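The aliasing problem and the fix can be sketched with a toy model of an informer cache. The types and the `example.com/id` annotation key below are illustrative stand-ins, not the real client-go machinery; the only real behavior modeled is that a lister hands out pointers into a shared cache:

```go
package main

import "fmt"

// Service is a toy stand-in for v1.Service; only annotations matter here.
type Service struct {
	Name        string
	Annotations map[string]string
}

// DeepCopy mirrors the generated DeepCopy on real API objects.
func (s *Service) DeepCopy() *Service {
	out := &Service{Name: s.Name, Annotations: map[string]string{}}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// informerCache models a shared informer's store: like a real lister,
// Get hands out a pointer to the single cached copy of the object.
type informerCache struct {
	objs map[string]*Service
}

func (c *informerCache) Get(name string) *Service { return c.objs[name] }

func main() {
	cache := &informerCache{objs: map[string]*Service{
		"svc": {Name: "svc", Annotations: map[string]string{}},
	}}

	// Buggy: mutate the object straight from the cache. The next reconcile
	// sees the mutation even though it never reached the apiserver.
	cache.Get("svc").Annotations["example.com/id"] = "abc"
	fmt.Println(len(cache.Get("svc").Annotations)) // 1: cache polluted

	// Fixed (the approach of this PR): reconcile against a copy.
	cache.objs["svc"] = &Service{Name: "svc", Annotations: map[string]string{}}
	clean := cache.Get("svc").DeepCopy()
	clean.Annotations["example.com/id"] = "abc"
	fmt.Println(len(cache.Get("svc").Annotations)) // 0: cache untouched
}
```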

I have described in the comments how this bug affected cloud-provider-openstack, but I suspect it affects multiple cloud providers.

Which issue(s) this PR fixes:

Fixes kubernetes/cloud-provider#59

Special notes for your reviewer:

Only manually tested. See manual test description in comment below.

Does this PR introduce a user-facing change?

Fixes a bug which could have allowed an improperly annotated LoadBalancer service to become active.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot commented

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2022
@k8s-ci-robot commented

@mdbooth: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 21, 2022
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2022
@mdbooth (Contributor, Author) commented Apr 22, 2022

I have now tested this manually and verified that it fixes the problem.

To describe the problem: cloud-provider-openstack adds an annotation to the Service object during reconciliation. In our environment we were incorrectly deploying cloud-provider-openstack without the required permission to patch the service object, which resulted in an error message like:

I0422 10:30:04.823353   64011 event.go:294] "Event occurred" object="lb-test-ns/lb-test-svc2" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to patch service object lb-test-ns/lb-test-svc2: services \"lb-test-svc2\" is forbidden: User \"system:serviceaccount:kube-system:cloud-controller-manager\" cannot patch resource \"services\" in API group \"\" in the namespace \"lb-test-ns\

However, what we observed was that the next time the Service was reconciled after the failure, the reconcile would succeed but the annotation had not been added to the Service object.

The reason was that the Service object passed to cloud-provider-openstack from syncService was dirty from the previous invocation. Because the annotation was still present from the previous run, the Patcher found no difference to patch, and the reconcile erroneously succeeded.
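This empty-diff behaviour can be sketched as follows. `annotationsDiffer` is an illustrative helper, not cloud-provider-openstack's actual patcher, which diffs the whole object:

```go
package main

import (
	"fmt"
	"reflect"
)

// annotationsDiffer reports whether a patch would be needed; the
// annotation map is the only part of the object that matters here.
func annotationsDiffer(old, updated map[string]string) bool {
	return !reflect.DeepEqual(old, updated)
}

func main() {
	want := map[string]string{"example.com/id": "abc"}

	// First reconcile: the informer returns a pristine object, the
	// controller adds the annotation, a diff is detected, and the patch
	// is attempted (failing here with the RBAC error shown above).
	fmt.Println(annotationsDiffer(map[string]string{}, want)) // true

	// Later reconcile with a dirty cache: the "old" object already
	// carries the annotation from the failed run, so no diff is found,
	// no patch is sent, and the reconcile erroneously succeeds.
	fmt.Println(annotationsDiffer(want, want)) // false
}
```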

I have verified that with:

  • cloud-provider-openstack deployed without the required permissions to patch Service objects
  • this patch in place

cloud-provider-openstack will now fail with the above permission error on every reconcile attempt, not just the first one.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 22, 2022
@mdbooth mdbooth changed the title WIP: Prevent dirty service object leaking between reconciles Prevent dirty service object leaking between reconciles Apr 22, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2022
@mdbooth commented Apr 22, 2022

I would appreciate advice on how to write an automated test for this. I think the ideal test would be:

  • Have the informer return a Service object
  • Modify the object in a reconciler, but don't push the modification to apiserver and return an error
  • sync the Service a second time, retrieving the object from the informer
  • assert that the object is not dirty

It might be possible to construct a unit test to do this but it would be quite complicated and probably not very easy to understand or maintain. It feels more like envtest territory?

Alternatively, we could argue that this would really be a test of the behaviour of informers, not the service controller, and doesn't require a separate test?
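The steps above can be sketched as a toy unit test. A real test would use client-go's fake clientset and a SharedInformerFactory, or envtest; the types and names here are illustrative only:

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a toy stand-in for v1.Service.
type Service struct{ Annotations map[string]string }

func (s *Service) DeepCopy() *Service {
	out := &Service{Annotations: map[string]string{}}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// lister is a toy stand-in for the Service informer's lister.
type lister struct{ cached *Service }

func (l *lister) Get() *Service { return l.cached }

// syncService models the controller loop under test: it works on a copy
// (the behaviour this PR introduces), mutates it, then fails before the
// change is pushed to the apiserver.
func syncService(l *lister) error {
	svc := l.Get().DeepCopy()
	svc.Annotations["example.com/id"] = "set-by-reconcile"
	return errors.New("simulated patch failure")
}

func main() {
	l := &lister{cached: &Service{Annotations: map[string]string{}}}

	// Reconcile once; the mutation never reaches the apiserver.
	_ = syncService(l)

	// Retrieve the object from the informer again and assert it is
	// not dirty.
	if _, dirty := l.Get().Annotations["example.com/id"]; dirty {
		panic("cache polluted by previous reconcile")
	}
	fmt.Println("object from informer is clean")
}
```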


@lobziik (Member) commented Apr 22, 2022

lgtm :)

@mdbooth commented Apr 26, 2022

@andrewsykim the impact I've noticed is fairly minor, and by fixing the underlying permission issue we're unlikely to see it often, although the problem still exists whenever there is a transient failure to patch the Service object.

I haven't been following the 1.24 release schedule, so I don't know how frozen we are right now. Is it worth trying to merge this, or wait until after 1.24?

@k8s-triage-robot commented Jul 25, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@mdbooth commented Jul 25, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@mdbooth commented Jul 25, 2022

@andrewsykim This is still a thing, btw. Any idea who can approve?

@thockin (Member) left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2022
@k8s-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit 914406d into kubernetes:master Jul 28, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 28, 2022
Development

Successfully merging this pull request may close these issues.

Service Controller can call provider EnsureLoadBalancer with a dirty Service object
5 participants