New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid re-syncing LBs for ETP=local services #109706
Avoid re-syncing LBs for ETP=local services #109706
Conversation
Please note that we're already in Test Freeze for the |
Hi @alexanderConstantinescu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/remove-sig api-machinery |
cc: @bowei |
/assign @bowei -- I'm very interested in poking more at the lifecycle of the LB node |
Some additional notes here below from the recent discussion in sig-net that just ended. Also, you'll be able to find the slides presented by me, here: https://docs.google.com/presentation/d/1EsK5FLS4s6E3Ok42XhRz643YtB9uZHn3wohWvcMyHag/edit#slide=id.p My third slide present some ideas around the possibility of removing everything the CCM does w.r.t synchronizing nodes transitioning between
3 and 4. will require enhancement proposal and are beyond the scope of this PR, but we can discuss them here (or I can open a KEP for them and we can discuss them there...?) This PR only concerns itself with finding a solution to either 1. and/or 2., simply because this currently is a bug for clusters at scale and, I believe, we should backport the fix to older releases. I am actually in favour of doing 2.a., meaning: modifying this PR slightly and removing any relation to ETP=local services and checking where endpoints are hosted. I would instead prefer we define an annotation (ex: If we can't agree on 2.a, then 1. (the current state of this PR) will at least help reduce the time taken to sync all LBs by the CCM on large clusters with a lot of services ETP=local, which is already an improvement from today. |
a3a71c3
to
8e4eb5d
Compare
/remove-sig api-machinery |
/ok-to-test |
@thockin : I am concerned the first commit might be wrong, specifically around removing nodes from the LB set based on the schedulability predicate. It seems it was removed with #90823 which listed a production outage caused by the fact that we were syncing LBs as a cause by the schedulability constraint changing on a set of nodes.
That makes me wonder why we even decided to sync updates to nodes which are experiencing a change to this attribute at all. That was added with PR: #90769 - which doesn't list any good reasons for adding it... So:
Now we did the inverse it seems. Sorry for realizing this now.... |
Specifically: - #109706 (comment) - #109706 (comment)
…-taint-unsched [CCM - service controller] addressing left over comments from #109706
Specifically: - kubernetes/kubernetes#109706 (comment) - kubernetes/kubernetes#109706 (comment) Kubernetes-commit: 9ca3d0ec157b57df4a9bf406cba39317a0aecdd2
} | ||
|
||
if err := c.lockedUpdateLoadBalancerHosts(svc, hosts); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may have been locked some time ago, but I can 't see where is locked now
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR reduces the amount of LB re-configurations for ETP=local services by removing the sync completely for any node readiness changes, from the CCM's sync loop. The referenced issue explains the problem in greater detail, so readers are henced directed to that for what concerns the genesis of the problem. A lot has been discussed in this PR and a plan forward has been drafted. The following summarizes the discussion:
Which issue(s) this PR fixes:
Partially fixes #111539 for ETP=local services, specifically: reduces the amount of cloud API calls and service downtime caused by excessive re-configurations of all cluster LBs as an effect of transitioning node readiness state.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig network
/sig cloud-provider