
Bump default burst limit for discovery client to 300 #109141

Merged 2 commits into kubernetes:master on Jul 26, 2022

Conversation

@ulucinar (Contributor) commented Mar 30, 2022

What type of PR is this?

/kind cleanup
/sig cli

What this PR does / why we need it:

The burst limit of the token bucket rate limiter of the discovery client for kubectl was previously bumped to 300 here. Similarly, this PR proposes bumping the burst limit of the default discovery client to 300. With a large number (over 100) of GroupVersions in the cluster, we are now in a better situation with kubectl, which uses a token bucket rate limiter with burst=300 and rate=50.0 qps, but other clients, such as Helm, still experience throttling during the discovery phase.

As mentioned in this comment, it's not a good long-term strategy to keep bumping token bucket parameters as CRD use cases evolve, but this PR proposes the limit currently used by kubectl as the default, which in my opinion will give a consistent experience across API server clients like kubectl, Helm, and others. Currently, in cases where kubectl is not throttled, other clients (which use the default discovery client's token bucket parameters) are.

Having proposed a bump for the default burst limit, I'd also like to ask why we do not remove the rate limiter from the discovery client entirely, given that we now have API Priority and Fairness (APF).

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Default burst limit for the discovery client is now 300.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. sig/cli Categorizes an issue or PR as relevant to SIG CLI. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 30, 2022
@k8s-ci-robot (Contributor)

Hi @ulucinar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 30, 2022
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 30, 2022
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@fedebongio (Contributor)

/assign @Jefftree
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 31, 2022
@Jefftree (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 31, 2022
@muvaf commented Jun 2, 2022

@deads2k @caesarxuchao Hello! Is there a blocker to get this merged?

@Jefftree (Member) commented Jun 3, 2022

We do have a fix for the discovery burst coming this release, so this may not be necessary. There might be some opinions about this PR's approach of increasing the burst, though.

/assign @lavalamp

@lavalamp (Member) commented Jun 3, 2022

I'd like to also ask why we do not remove the rate limiter from the discovery client as we have APF now.

Yeah, this is a great question. What do you think, @deads2k? APF has been beta and default-on for a while now. It's still missing features for watch, but discovery doesn't do watches.

I think for most clusters a burst of 300 and disabling client-side rate limits completely are not that different. And for the rest of the clusters, people won't want things to break at 300 GroupVersions. So I think this is a good first place to just disable the rate limit completely.

@lavalamp (Member) commented Jun 3, 2022

@wojtek-t may also have an opinion about disabling this rate limit.

@deads2k (Contributor) commented Jun 3, 2022

I think for most clusters burst of 300 and disabling client side rate limits completely is not that different.

I think a high burst, even with a fairly high recharge rate, is different from no limit at all. I would rather go to 300 than remove it entirely at this stage.

APF has been beta and default-on for a while now. It's missing features for watch still but discovery doesn't do watches.

We made this a GA criterion, but we haven't tested it that way yet. I agree that discovery is a good place to start, but I'd rather have an organized "test this with P&F" effort before removing the limit.

The condition before this change:

```go
if config.Burst == 0 && config.QPS < 100 {
```

and as initially proposed in this PR:

```go
// if a burst limit is not already configured and
// an avg. rate of `defaultBurst` qps is not already configured
if config.Burst == 0 && config.QPS < defaultBurst {
```
Member

The condition here is a bit weird, since if burst is already set to something like 2, it won't get fixed?

And if burst is set to zero, likely qps will also be zero, and then we don't modify qps?

Contributor

It looks like @liggitt added this condition in 711dc0b. Perhaps he can provide some context?

Member

hrmm... trying to page that back in...

Member

I think the intent was two-fold:

  1. this should be a default, so if Burst is explicitly set we don't want to override it
  2. I vaguely remember the burst bucket defaulting to match QPS at some layer (at least at some point in time), so the config.QPS < 100 guard was to prevent us from defaulting to a burst value lower than what would have been auto-selected. I'm not sure if this guard is still necessary

Contributor (Author)

Thanks for the clarification. Kept the `config.Burst == 0` part of the conjunction and removed the `config.QPS < defaultBurst` condition.

@lavalamp (Member) commented Jun 3, 2022

I agree that discovery is a good place to start, but I'd rather have an organized, "test this with p&f" before removing the limit.

It's pretty early in the release cycle now...

@wojtek-t (Member) commented Jun 6, 2022

We made this a GA criteria, but we haven't tested that way yet. I agree that discovery is a good place to start, but I'd rather have an organized, "test this with p&f" before removing the limit.

+1

It's pretty early in the release cycle now...

We're planning to have something done in this area this cycle (#109614 would be the first thing to scale-test), but without knowing exactly where we are, I'm against disabling it completely now.

@apelisse (Member)

I don't understand the goal of rate-limiting discovery. Are people misusing discovery? Are they misusing it on purpose? What problem does this address?

@wojtek-t (Member)

I don't understand the goal of rate limiting the discovery. Are people mis-using the discovery? are they mis-using on purpose? What problem does this address?

I've seen many bugs where people misconfigured their components, scripts, etc. and overloaded the control plane. The discovery API is no different in that respect.

@apelisse (Member) commented Jul 25, 2022

Discovery API is not different wrt it.

They are a little bit different: discovery results are often only memory-cached, which means the requests re-trigger every time you restart your binary (including kubectl), and the rate limiter doesn't protect you from that either. They are also much more static than most other APIs in Kubernetes, so people are less likely to poll them.

Would it make sense to just disable the rate-limiter for discovery in kubectl? Would that help?

@negz (Contributor) commented Jul 25, 2022

I've seen many bugs where some people misconfigured their components/scripts etc. and were overloading control-plane.

Shouldn't server-side API priority and fairness address this now?

Would it make sense to just disable the rate-limiter for discovery in kubectl? Would that help?

We've already bumped it to 300 in kubectl. I'm open to disabling it there, but we'd also like to at least see the discovery rate limit bumped (or removed) in client-go. Plenty of other Go clients (Helm comes to mind) hit these discovery rate limits and provide a bad user experience; if it's safe to update the default, I'd prefer to do that here in client-go rather than chase down every client and ask them to override the defaults.

@apelisse (Member)

I agree that we don't want to have to fix this in every client. I'm still not sure I really understand what we get from the limiter.

@lavalamp (Member)

The comments here are rehashing things addressed in previous comments: #109141 (comment)

@lavalamp (Member)

I also think we should remove this limit completely, but this is the improvement we can all agree on for the moment.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 26, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, ulucinar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2022
@k8s-ci-robot k8s-ci-robot merged commit 5ac563c into kubernetes:master Jul 26, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 26, 2022
uvegla added a commit to giantswarm/fluxcd-pkg that referenced this pull request Feb 9, 2023
See: kubernetes/kubernetes#109141

Signed-off-by: Laszlo Uveges <laszlo@giantswarm.io>
uvegla added a commit to giantswarm/fluxcd-pkg that referenced this pull request Feb 14, 2023
See: kubernetes/kubernetes#109141

Signed-off-by: Laszlo Uveges <laszlo@giantswarm.io>
ytsarev added a commit to ytsarev/provider-helm that referenced this pull request Feb 25, 2023
* Fixes crossplane-contrib#159

* No more client-side throttling timeouts like

```
I1128 12:44:10.336621       1 request.go:665] Waited for 11.188964474s due to client-side throttling, not priority and fairness, request: GET:https://10.255.0.1:443/apis/cloudidentity.cnrm.cloud.google.com/v1beta1?timeout=32s
I1128 12:44:20.345096       1 request.go:665] Waited for 5.989111872s due to client-side throttling, not priority and fairness, request: GET:https://10.255.0.1:443/apis/binaryauthorization.cnrm.cloud.google.com/v1beta1?timeout=32s
I1128 12:44:31.086707       1 request.go:665] Waited for 1.174828735s due to client-side throttling, not priority and fairness, request: GET:https://10.255.0.1:443/apis/servicedirectory.cnrm.cloud.google.com/v1beta1?timeout=32s
```

* The value of 300 is aligned with the associated kubectl fix in kubernetes/kubernetes#109141

Signed-off-by: Yury Tsarev <yury@upbound.io>