Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify timeout for etcd healthcheck #111399

Merged
merged 1 commit into from Jul 27, 2022
Merged

Conversation

Argh4k
Copy link
Contributor

@Argh4k Argh4k commented Jul 25, 2022

Increase default timeout for etcd healthcheck to 15 seconds.
Add additional etcd check to readyz with 2 seconds timeout.

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR increases default timeout for etcd healthcheck to 15 seconds and adds new etcd check to readiness check with timeout of 2 seconds. Currently, when the control plane is overloaded, healthchecks to etcd can take more than 2 seconds marking kube apsierver unhealthy, even if it is only etcd performance degradation. Adding 2 seconds check to readyz should help with load distribution between apiservers in case of etcd performance degradation.

Which issue(s) this PR fixes:

Fixes #111290

Special notes for your reviewer:

Does this PR introduce a user-facing change?

a new flag `etcd-ready-timeout` has been added. It configures a timeout of an additional etcd check performed as part of readyz check. 

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 25, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @Argh4k. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 25, 2022
@Argh4k
Copy link
Contributor Author

Argh4k commented Jul 25, 2022

/assign @mborsz @wojtek-t

@MadhavJivrajani
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 26, 2022
@wojtek-t
Copy link
Member

I'm fine with this change modulo the comments that I added.

But I would like to give a bit of time to sig-apimachinery folks to take a quick look at it.

/sig api-machinery

if err != nil {
return err
}
c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a method for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only see AddHealthChecks which adds healthcheck to readyz/livez/healthz. It even has We should prefer this to adding healthChecks directly to the config unless we explicitly want to add a healthcheck only to a specific health endpoint. in its description. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63

But that is the function at the generic server level, whereas for etcd checks we're operating still at config level.

That said - instead of doing it manually, please create a AddReadyzCheck to the Config struct.

Copy link
Contributor

@MadhavJivrajani MadhavJivrajani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts
Copy link
Contributor

sttts commented Jul 26, 2022

I can see why we want the readyz flag. I don't see why changing the healthz timeout from 2s to 15s and potentially break users who depend on the old behaviour.

Also this change is not mentioned in the "Does this PR introduce a user-facing change?" section while it is userfacing.

@@ -234,6 +236,14 @@ func (s *EtcdOptions) addEtcdHealthEndpoint(c *server.Config) error {
return healthCheck()
}))

readyCheck, err := storagefactory.CreateReadyCheck(s.StorageConfig, c.DrainedNotify())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't use a different channel here, although given it's actually used as a stopCh it actually makes sense.

@@ -35,7 +35,8 @@ const (

DefaultCompactInterval = 5 * time.Minute
DefaultDBMetricPollInterval = 30 * time.Second
DefaultHealthcheckTimeout = 2 * time.Second
DefaultHealthcheckTimeout = 15 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the motivation and I'm supportive for the motivation as described in:
#111290

But technically that's a breaking change - for people who are not setting the timeout now, they will face a default behavior change.

For me it seems much safer to leave it set to 2s and just rely that people who would like to bump it will use the already existing flag. @deads2k - FYI

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving etcd as 2s and using existing flag to tune this.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 26, 2022
@Argh4k
Copy link
Contributor Author

Argh4k commented Jul 26, 2022

I can see why we want the readyz flag. I don't see why changing the healthz timeout from 2s to 15s and potentially break users who depend on the old behaviour.

Also this change is not mentioned in the "Does this PR introduce a user-facing change?" section while it is userfacing.

Changed defaults back to original values. You are right that changing them, could break it for some users.

@leilajal
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2022
if err != nil {
return err
}
c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63

But that is the function at the generic server level, whereas for etcd checks we're operating still at config level.

That said - instead of doing it manually, please create a AddReadyzCheck to the Config struct.

@Argh4k Argh4k force-pushed the i-111290 branch 3 times, most recently from 969ef19 to 9776963 Compare July 27, 2022 07:44
@k8s-ci-robot k8s-ci-robot added the area/dependency Issues or PRs related to dependency changes label Jul 27, 2022
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall (modulo my one comment), but test failures seems related to this PR.

@@ -110,6 +110,7 @@ require (
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect
golang.org/x/text v0.3.7 // indirect
golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? You're not changing imports...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run /hack/update-vendor.sh because pull-kubernetes-dependecies was failing with:

Your vendored results are different:
diff -Naupr -x BUILD -x 'AUTHORS*' -x 'CONTRIBUTORS*' vendor/k8s.io/apiserver/go.mod /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod
--- vendor/k8s.io/apiserver/go.mod	2022-07-27 07:20:52.922456943 +0000
+++ /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod	2022-07-27 07:23:22.144718498 +0000
@@ -110,6 +110,7 @@ require (
 	golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect
 	golang.org/x/text v0.3.7 // indirect
 	golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect
+	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
 	google.golang.org/appengine v1.6.7 // indirect
 	google.golang.org/genproto v0.0.0-20220502173005-c8bf987b8c21 // indirect
 	google.golang.org/protobuf v1.28.0 // indirect
Vendor Verify failed.
If you're seeing this locally, run the below command to fix your directories:
hack/update-vendor.sh

I'm not quite sure why it is needed. I've changed formatting from %v to %w for error and used cmpopts.EquateErrors() but I do not understand why verify-vendor says that this should be included as indirect dependency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe revert that change then? I would like to avoid combining those two...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed cmpopts, it looks like they were causing this behaviour

@wojtek-t
Copy link
Member

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Argh4k, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit 610b783 into kubernetes:master Jul 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/dependency Issues or PRs related to dependency changes cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow configuration of etcd healthcheck in kube-apiserver
7 participants