
Prevent dirty service object leaking between reconciles #109601

Merged
1 commit merged into kubernetes:master on Jul 28, 2022

Conversation

@mdbooth (Contributor) commented Apr 21, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

We were passing a Service object returned by the Service informer directly to the Service controller loop. The loop is expected to modify the Service object during execution, but doing so on the informer's object is not safe: any modification the controller makes pollutes the Service informer's cache, so a potentially dirty object is passed to subsequent invocations of the controller loop.

We fix this issue by passing a copy of the object from the informer instead.
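The aliasing problem and the fix can be sketched with a toy model of an informer cache. The types and the `example.com/id` annotation key below are illustrative stand-ins, not the real client-go machinery; the only real behavior modeled is that a lister hands out pointers into a shared cache:

```go
package main

import "fmt"

// Service is a toy stand-in for v1.Service; only annotations matter here.
type Service struct {
	Name        string
	Annotations map[string]string
}

// DeepCopy mirrors the generated DeepCopy on real API objects.
func (s *Service) DeepCopy() *Service {
	out := &Service{Name: s.Name, Annotations: map[string]string{}}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// informerCache models a shared informer's store: like a real lister,
// Get hands out a pointer to the single cached copy of the object.
type informerCache struct {
	objs map[string]*Service
}

func (c *informerCache) Get(name string) *Service { return c.objs[name] }

func main() {
	cache := &informerCache{objs: map[string]*Service{
		"svc": {Name: "svc", Annotations: map[string]string{}},
	}}

	// Buggy: mutate the object straight from the cache. The next reconcile
	// sees the mutation even though it never reached the apiserver.
	cache.Get("svc").Annotations["example.com/id"] = "abc"
	fmt.Println(len(cache.Get("svc").Annotations)) // 1: cache polluted

	// Fixed (the approach of this PR): reconcile against a copy.
	cache.objs["svc"] = &Service{Name: "svc", Annotations: map[string]string{}}
	clean := cache.Get("svc").DeepCopy()
	clean.Annotations["example.com/id"] = "abc"
	fmt.Println(len(cache.Get("svc").Annotations)) // 0: cache untouched
}
```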

I have described in the comments how this bug affected cloud-provider-openstack, but I suspect it affects multiple cloud providers.

Which issue(s) this PR fixes:

Fixes kubernetes/cloud-provider#59

Special notes for your reviewer:

Only manually tested. See manual test description in comment below.

Does this PR introduce a user-facing change?

Fixes a bug which could have allowed an improperly annotated LoadBalancer service to become active.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot commented

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2022
@k8s-ci-robot commented

@mdbooth: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 21, 2022
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2022
@mdbooth (Contributor, Author) commented Apr 22, 2022

I have now tested this manually and verified that it fixes the problem.

To describe the problem: cloud-provider-openstack adds an annotation to the Service object during reconciliation. In our environment we were incorrectly deploying cloud-provider-openstack without the required permission to patch the service object, which resulted in an error message like:

I0422 10:30:04.823353   64011 event.go:294] "Event occurred" object="lb-test-ns/lb-test-svc2" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to patch service object lb-test-ns/lb-test-svc2: services \"lb-test-svc2\" is forbidden: User \"system:serviceaccount:kube-system:cloud-controller-manager\" cannot patch resource \"services\" in API group \"\" in the namespace \"lb-test-ns\

However, what we observed was that the next time the Service was reconciled after the failure, the reconcile would succeed but the annotation had not been added to the Service object.

The reason was that the Service object passed to cloud-provider-openstack from syncService was dirty from the previous invocation. Because the annotation was still present from the previous run, the Patcher found no difference to patch, and the reconcile erroneously succeeded.
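This empty-diff behaviour can be sketched as follows. `annotationsDiffer` is an illustrative helper, not cloud-provider-openstack's actual patcher, which diffs the whole object:

```go
package main

import (
	"fmt"
	"reflect"
)

// annotationsDiffer reports whether a patch would be needed; the
// annotation map is the only part of the object that matters here.
func annotationsDiffer(old, updated map[string]string) bool {
	return !reflect.DeepEqual(old, updated)
}

func main() {
	want := map[string]string{"example.com/id": "abc"}

	// First reconcile: the informer returns a pristine object, the
	// controller adds the annotation, a diff is detected, and the patch
	// is attempted (failing here with the RBAC error shown above).
	fmt.Println(annotationsDiffer(map[string]string{}, want)) // true

	// Later reconcile with a dirty cache: the "old" object already
	// carries the annotation from the failed run, so no diff is found,
	// no patch is sent, and the reconcile erroneously succeeds.
	fmt.Println(annotationsDiffer(want, want)) // false
}
```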

I have verified that with:

  • cloud-provider-openstack deployed without the required permissions to patch Service objects
  • this patch in place

cloud-provider-openstack will now fail with the above permission error on every reconcile attempt, not just the first one.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 22, 2022
@mdbooth mdbooth changed the title WIP: Prevent dirty service object leaking between reconciles Prevent dirty service object leaking between reconciles Apr 22, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2022
@mdbooth commented Apr 22, 2022

I would appreciate advice on how to write an automated test for this. I think the ideal test would be:

  • Have the informer return a Service object
  • Modify the object in a reconciler, but don't push the modification to apiserver and return an error
  • sync the Service a second time, retrieving the object from the informer
  • assert that the object is not dirty

It might be possible to construct a unit test to do this but it would be quite complicated and probably not very easy to understand or maintain. It feels more like envtest territory?

Alternatively, we could argue that this would really be a test of the behaviour of informers, not the service controller, and doesn't require a separate test?
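The steps above can be sketched as a toy unit test. A real test would use client-go's fake clientset and a SharedInformerFactory, or envtest; the types and names here are illustrative only:

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a toy stand-in for v1.Service.
type Service struct{ Annotations map[string]string }

func (s *Service) DeepCopy() *Service {
	out := &Service{Annotations: map[string]string{}}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// lister is a toy stand-in for the Service informer's lister.
type lister struct{ cached *Service }

func (l *lister) Get() *Service { return l.cached }

// syncService models the controller loop under test: it works on a copy
// (the behaviour this PR introduces), mutates it, then fails before the
// change is pushed to the apiserver.
func syncService(l *lister) error {
	svc := l.Get().DeepCopy()
	svc.Annotations["example.com/id"] = "set-by-reconcile"
	return errors.New("simulated patch failure")
}

func main() {
	l := &lister{cached: &Service{Annotations: map[string]string{}}}

	// Reconcile once; the mutation never reaches the apiserver.
	_ = syncService(l)

	// Retrieve the object from the informer again and assert it is
	// not dirty.
	if _, dirty := l.Get().Annotations["example.com/id"]; dirty {
		panic("cache polluted by previous reconcile")
	}
	fmt.Println("object from informer is clean")
}
```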


@lobziik (Member) commented Apr 22, 2022

lgtm :)

@mdbooth commented Apr 26, 2022

@andrewsykim the impact I've noticed is fairly minor, and by fixing the underlying permission issue we're unlikely to see it often, although the problem still exists whenever there is a transient failure to patch the Service object.

I haven't been following the 1.24 release schedule, so I don't know how frozen we are right now. Is it worth trying to merge this, or wait until after 1.24?

@k8s-triage-robot commented Jul 25, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@mdbooth commented Jul 25, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@mdbooth commented Jul 25, 2022

@andrewsykim This is still a thing, btw. Any idea who can approve?

@thockin (Member) left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2022
@k8s-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit 914406d into kubernetes:master Jul 28, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 28, 2022
Development

Successfully merging this pull request may close these issues.

Service Controller can call provider EnsureLoadBalancer with a dirty Service object
5 participants