Avoid re-syncing LBs for ETP=local services #109706

alexanderConstantinescu · 2022-04-28T08:34:16Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR reduces the amount of LB re-configurations for ETP=local services by removing the sync completely for any node readiness changes, from the CCM's sync loop. The referenced issue explains the problem in greater detail, so readers are henced directed to that for what concerns the genesis of the problem. A lot has been discussed in this PR and a plan forward has been drafted. The following summarizes the discussion:

Remove syncs w.r.t node state changes in the CCM in 1.25 (follow up PR, once this merges, is to be expected for remaining node conditions)
Maybe / possibly / hopefully backport that to the oldest supported Kube version also supporting ETP=local services
Have the reported status of the LB HC configured for ETP=local services reflect the node state in 1.25 (also in the form of a follow up PR)
Write KEP for dealing with transitioning node state for all service LBs.
Kube >= after KEP : remove the sync logic for node state changes for all services and move them to the HC

Which issue(s) this PR fixes:

Partially fixes #111539 for ETP=local services, specifically: reduces the amount of cloud API calls and service downtime caused by excessive re-configurations of all cluster LBs as an effect of transitioning node readiness state.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Reduce the number of cloud API calls and service downtime caused by excessive re-configurations of cluster LBs with externalTrafficPolicy=Local when node readiness changes (https://github.com/kubernetes/kubernetes/issues/111539). The service controller (in cloud-controller-manager) will avoid resyncing nodes which are transitioning between `Ready` / `NotReady` (only for for ETP=Local Services). The LBs used for these services will solely rely on the health check probe defined by the `healthCheckNodePort` to determine if a particular node is to be used for traffic load balancing.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/sig network
/sig cloud-provider

k8s-ci-robot · 2022-04-28T08:34:17Z

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

k8s-ci-robot · 2022-04-28T08:34:24Z

Hi @alexanderConstantinescu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

leilajal · 2022-04-28T16:37:53Z

/remove-sig api-machinery
/triage accepted

bowei · 2022-04-28T21:14:30Z

cc: @bowei

bowei · 2022-04-28T21:24:49Z

/assign @bowei -- I'm very interested in poking more at the lifecycle of the LB node

alexanderConstantinescu · 2022-04-28T23:14:33Z

Some additional notes here below from the recent discussion in sig-net that just ended. Also, you'll be able to find the slides presented by me, here: https://docs.google.com/presentation/d/1EsK5FLS4s6E3Ok42XhRz643YtB9uZHn3wohWvcMyHag/edit#slide=id.p

My third slide present some ideas around the possibility of removing everything the CCM does w.r.t synchronizing nodes transitioning between NotReady <-> Ready.

Is this PR - which eliminates the mentioned sync for ETP=local services, from the CCM's node sync loop
Is removing that sync completely for all services irrespective of ETP=local or not.
a. Is removing that sync completely only for a selected set of services, indicated by the user through the utilization of an annotation on the service object
Is removing that sync completely for ETP=local services from the CCM, but replacing it by having kube-proxy dial the kubelet's read-only port when answering to the probe executed by the LB against the HealthCheckNodePort. I.e: kube-proxy would keep the current behavior (validating that the node in question is actually hosting endpoints for the service) and AND-ing that with the kubelet's reported state. That means that the LB would continue executing one probe against the HealthCheckNodePort, but kube-proxy would open a local connection against kubelet's read-only port on the node for every LB probe. There was some confusion surrounding this during the meeting, hence why I've chosen to detail this here. This would result in the same behavior as today, but be much more dynamic and efficient at handling transitioning node state - where instead of re-configuring the LB with a new set of nodes, the LB simply probes the node on the HealthCheckNodePort as it currently does, with the result also reflecting the kubelet/node readiness state.
Is removing that sync completely for all services from the CCM. Instead we define a standardised behavior for health checking nodes which are listed as backends for LoadBalancer type services, that all cloud provider implementations must adhere to. All cloud providers today implement this interface: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/cloud.go#L133-L162 , however: when the service is ETP != local they are free to specify custom probes on their LBs. What they do today was described at the bottom of the description to this PR. If we align the behavior for all cloud implementations: we could have them include a probe against the kubelet's read-only port - or do as in the ETP=local case and have the LB query kube-proxy which does multiple checks. In the end, LB's only really care to know if the node can service traffic and if the service NodePort on the node is reachable. All three clouds have chosen one of these two checks, not both. GCE's LB check is actually superfluous, since the CCM also checks the same condition (i.e: the kubelet state/node readiness).

3 and 4. will require enhancement proposal and are beyond the scope of this PR, but we can discuss them here (or I can open a KEP for them and we can discuss them there...?)

This PR only concerns itself with finding a solution to either 1. and/or 2., simply because this currently is a bug for clusters at scale and, I believe, we should backport the fix to older releases.

I am actually in favour of doing 2.a., meaning: modifying this PR slightly and removing any relation to ETP=local services and checking where endpoints are hosted. I would instead prefer we define an annotation (ex: service.beta.kubernetes.io/lb-ignore-node-not-ready) which a user can set on the service object to tell the CCM to ignore this service for node readiness changes. The implementation effort would be minimal and would allow the user to solve both problems presented in my first slides. I, as a user of Kubernetes, should be allowed to disable the CCM's sync for node readiness state if I don't want my LBs to be re-configured as a cause by that. I own my nodes and LBs and should thus be allowed to decide what's best.

If we can't agree on 2.a, then 1. (the current state of this PR) will at least help reduce the time taken to sync all LBs by the CCM on large clusters with a lot of services ETP=local, which is already an improvement from today.

alexanderConstantinescu · 2022-05-04T15:11:03Z

/remove-sig api-machinery

danwinship · 2022-05-31T12:55:50Z

/ok-to-test

thockin · 2022-08-03T23:51:55Z

@kubernetes/sig-scalability
@wojtek-t
@mborsz

FYI re scale - this should be an improvement for real clusters (minor because etp=Local, but eventually for all services)

alexanderConstantinescu · 2022-08-04T11:19:57Z

@thockin : I am concerned the first commit might be wrong, specifically around removing nodes from the LB set based on the schedulability predicate.

It seems it was removed with #90823 which listed a production outage caused by the fact that we were syncing LBs as a cause by the schedulability constraint changing on a set of nodes.

This caused a production outage when nodes in an autoscaling group were marked unschedulable so the autoscaler would gracefully terminate them - instead the ingress controllers on those nodes were immediately removed from the LB pool and no new ingress requests could be accepted.

That makes me wonder why we even decided to sync updates to nodes which are experiencing a change to this attribute at all. That was added with PR: #90769 - which doesn't list any good reasons for adding it...

So:

Service load balancers should include unschedulable nodes #90823 tells me we should not be excluding nodes from the LB if they are not schedulable
service controller: only sync LB node pools when relevant fields change #90769 tells me we should (possibly) not even care about the schedulability at all...

Now we did the inverse it seems. Sorry for realizing this now....

Specifically: - #109706 (comment) - #109706 (comment)

…-taint-unsched [CCM - service controller] addressing left over comments from #109706

Specifically: - kubernetes/kubernetes#109706 (comment) - kubernetes/kubernetes#109706 (comment) Kubernetes-commit: 9ca3d0ec157b57df4a9bf406cba39317a0aecdd2

aojea · 2023-03-22T22:03:52Z

staging/src/k8s.io/cloud-provider/controllers/service/controller.go

 	}
-
-	if err := c.lockedUpdateLoadBalancerHosts(svc, hosts); err != nil {


this may have been locked some time ago, but I can 't see where is locked now

Specifically: - kubernetes#109706 (comment) - kubernetes#109706 (comment)

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/cloudprovider labels Apr 28, 2022

k8s-ci-robot requested review from derekwaynecarr and neolit123 April 28, 2022 08:34

k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2022

thockin self-assigned this Apr 28, 2022

alexanderConstantinescu force-pushed the etp-local-svc branch from a3a71c3 to 8e4eb5d Compare May 4, 2022 15:08

k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022

k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022

alexanderConstantinescu mentioned this pull request Aug 4, 2022

[scale] node state changes causes excessive LB re-configurations #111539

Open

alexanderConstantinescu mentioned this pull request Aug 4, 2022

Cherry pick of #109706: Avoid re-syncing LBs for ETP=local services #111686

Closed

alexanderConstantinescu mentioned this pull request Aug 4, 2022

[CCM - service controller] Remove schedulability predicate for LB set #111691

Merged

alexanderConstantinescu mentioned this pull request Aug 7, 2022

Remove transient node predicates from KCCM's service controller kubernetes/enhancements#3458

Open

8 tasks

alexanderConstantinescu mentioned this pull request Sep 28, 2022

KEP-3458: Remove transient node predicates from KCCM's service controller kubernetes/enhancements#3460

Merged

swetharepakula mentioned this pull request Sep 29, 2022

Starting in 1.25 clusters, services of type=LB and xTP=Local sometimes does not update node backends on load balancers #112793

Closed

This was referenced Sep 30, 2022

Automated cherry pick of #111663: Cleanup service sync path #112798

Closed

Revert 109706 + 111691 #112807

Merged

k8s-ci-robot pushed a commit that referenced this pull request Mar 9, 2023

Address follow up comments from #109706

9ca3d0e

Specifically: - #109706 (comment) - #109706 (comment)

k8s-ci-robot added a commit that referenced this pull request Mar 9, 2023

Merge pull request #111658 from alexanderConstantinescu/etp-local-svc…

87a40ae

…-taint-unsched [CCM - service controller] addressing left over comments from #109706

aojea reviewed Mar 22, 2023

View reviewed changes

kmala mentioned this pull request Apr 14, 2023

cloud provider updating load balancer before registering the node #117375

Closed

alexanderConstantinescu mentioned this pull request Apr 15, 2023

[KCCM] service controller: add providerID Node predicate #117388

Merged

JoelSpeed mentioned this pull request Jul 5, 2023

Load Balancer Node membership not being updated for externalTrafficPolicy Local kubernetes-sigs/cloud-provider-azure#4230

Closed

This was referenced Jul 5, 2023

Issue 4230: remove readiness check for cache exclusion kubernetes-sigs/cloud-provider-azure#4234

Merged

[Azure] Issue 4230: remove readiness check for cache exclusion #119128

Merged

rayowang pushed a commit to rayowang/kubernetes that referenced this pull request Feb 9, 2024

Address follow up comments from kubernetes#109706

513f689

Specifically: - kubernetes#109706 (comment) - kubernetes#109706 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid re-syncing LBs for ETP=local services #109706

Avoid re-syncing LBs for ETP=local services #109706

alexanderConstantinescu commented Apr 28, 2022 •

edited by thockin

k8s-ci-robot commented Apr 28, 2022

k8s-ci-robot commented Apr 28, 2022

leilajal commented Apr 28, 2022

bowei commented Apr 28, 2022

bowei commented Apr 28, 2022

alexanderConstantinescu commented Apr 28, 2022 •

edited

alexanderConstantinescu commented May 4, 2022

danwinship commented May 31, 2022

thockin commented Aug 3, 2022

alexanderConstantinescu commented Aug 4, 2022 •

edited

aojea Mar 22, 2023

		}

		if err := c.lockedUpdateLoadBalancerHosts(svc, hosts); err != nil {

Avoid re-syncing LBs for ETP=local services #109706

Avoid re-syncing LBs for ETP=local services #109706

Conversation

alexanderConstantinescu commented Apr 28, 2022 • edited by thockin

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot commented Apr 28, 2022

k8s-ci-robot commented Apr 28, 2022

leilajal commented Apr 28, 2022

bowei commented Apr 28, 2022

bowei commented Apr 28, 2022

alexanderConstantinescu commented Apr 28, 2022 • edited

alexanderConstantinescu commented May 4, 2022

danwinship commented May 31, 2022

thockin commented Aug 3, 2022

alexanderConstantinescu commented Aug 4, 2022 • edited

aojea Mar 22, 2023

Choose a reason for hiding this comment

alexanderConstantinescu commented Apr 28, 2022 •

edited by thockin

alexanderConstantinescu commented Apr 28, 2022 •

edited

alexanderConstantinescu commented Aug 4, 2022 •

edited