Conversation
Pull request overview
Adds an Admin API endpoint to retrieve Azure VM serial console logs for HCP clusters, enabling SRE debugging via boot diagnostics/serial console output.
Changes:
- Introduces a `GET /admin/v1/hcp{resourceId}/serialconsole?vmName=...` handler that calls the Azure Compute boot diagnostics APIs and streams the serial console log blob back to the caller.
- Adds a vmName validation helper plus unit tests, and refactors shared Cluster Service error handling for correct 404 behavior.
- Adds E2E coverage and supporting test framework helpers for discovering a VM and toggling boot diagnostics.
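For context on the validation change above, a helper along these lines could look as follows. The regex here is an assumption loosely based on common Azure VM naming constraints (1–64 characters, alphanumeric start, no trailing hyphen or period); the actual pattern added in internal/validation/validators.go may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// vmNameRE is an assumed approximation of Azure VM naming rules, not the
// exact regex from this PR: 1-64 characters, starting with an alphanumeric,
// then alphanumerics, hyphens, underscores, or periods, and not ending with
// a period or hyphen.
var vmNameRE = regexp.MustCompile(`^[a-zA-Z0-9](?:[a-zA-Z0-9._-]{0,62}[a-zA-Z0-9_])?$`)

// IsValidVMName reports whether name looks like a plausible Azure VM name.
func IsValidVMName(name string) bool {
	return vmNameRE.MatchString(name)
}

func main() {
	fmt.Println(IsValidVMName("aro-node-0")) // typical node pool VM name
	fmt.Println(IsValidVMName(""))           // empty names are rejected
	fmt.Println(IsValidVMName("bad-name-"))  // trailing hyphen rejected
}
```

Rejecting bad names in the handler keeps obviously malformed `vmName` query parameters from ever reaching the ARM compute API.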
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| admin/server/server/admin.go | Registers the new `/serialconsole` HCP admin route. |
| admin/server/handlers/hcp/serialconsole.go | Implements the serial console retrieval handler (boot diagnostics lookup + blob download/stream). |
| admin/server/handlers/hcp/serialconsole_test.go | Adds handler-level unit tests for validation and dependency error paths. |
| admin/server/handlers/hcp/helpers.go | Adds a reusable `ClusterServiceError` helper to correctly translate CS 404s. |
| admin/server/handlers/hcp/helpers_test.go | Unit tests for `ClusterServiceError`. |
| admin/server/handlers/hcp/breakglass/create.go | Switches to the shared `ClusterServiceError` helper (removes the local duplicate). |
| internal/validation/validators.go | Adds an Azure VM name regex + exported validation helper. |
| internal/validation/validate_vmname.go | Adds tests for VM name validation (currently added as a non-`*_test.go` file). |
| test/util/framework/admin_api_helpers.go | Adds test helpers for calling the new endpoint and manipulating/listing VMs via ARM compute. |
| test/e2e/admin_api.go | Adds E2E tests for the serial console happy path and the "boot diagnostics disabled" conflict case. |
| test/testdata/zz_fixture_TestMainListSuitesForEachSuite_rp_api_compat_all_parallel_01rp_api_compat_all_parallel_development.txt | Updates the suite listing fixture to include the new SRE serial console test. |
| test/testdata/zz_fixture_TestMainListSuitesForEachSuite_dev_cd_check_paralleldev_cd_check_parallel.txt | Updates the suite listing fixture to include the new SRE serial console test. |
| admin/server/go.mod | Adds the Azure compute SDK dependency (and adjusts mock dependency classification). |
| admin/server/go.sum | Updates sums for new/updated dependencies. |
| test-integration/go.mod | Adds armcompute/v5 as an indirect dependency (appears unused within test-integration/). |
| test-integration/go.sum | Updates sums for new/updated dependencies. |
| admin/README.md | Documents the new admin endpoint in the API table. |
Force-pushed a0834b8 to bbe2bd9.
Force-pushed bbe2bd9 to 64f0ced.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.
Force-pushed 64f0ced to 207848e.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated 4 comments.
/test e2e-parallel

```go
// download blob content with timeout to avoid stuck handlers on slow blob endpoints
httpClient := &http.Client{
	Timeout: 30 * time.Second,
```
Why use a client-side timeout instead of plumbing contexts?
Removed the client-side timeout
@bennerv should we still do a context.WithTimeout to be explicit about how long we want to wait here? Otherwise we just get whatever the request context defines.
Haven't added the context.WithTimeout yet; I can wait for the decision. Or should we handle it in the middleware at some point, if we want a default request timeout?
My vote:
- We 100% should add middleware with some kind of timeout (5 minutes?) as a stop-gap
- As needed for optimization, individual handlers can/should restrict that timeout even further to exit earlier so SREs aren't kept waiting on reasonably-known failure states for a particular task.
Force-pushed 77f6ef2 to c06f4c5.
Force-pushed c06f4c5 to 2302601.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated 5 comments.
Force-pushed 2302601 to 47d7e02.
Force-pushed 47d7e02 to 87f6c6a.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated no new comments.
Force-pushed ccadc3f to 7d8cf3a.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated 4 comments.
Force-pushed 7d8cf3a to ff7df37.
Force-pushed ff7df37 to c715df0.
Pull request overview
Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.
```go
// specific CloudError. This prevents ReportError from misinterpreting it as
// "HCP resource not found" (the HCP was already found in the database).
// Non-OCM errors are wrapped for ReportError to handle as internal errors.
func ClusterServiceError(err error, what string) error {
```
Initially in this PR the helper was used by the serial console logs implementation, but since that no longer uses it, it could be moved back to breakglass. I decided to keep it here for any similar future use.
Please comment if anyone feels strongly against moving the function here in this PR.
SudoBrendan
left a comment
I think that overall, this /lgtm - if you want to address any of these (or do a follow-up) that's just as well. Thanks Raj!
```go
http.StatusConflict,
arm.CloudErrorCodeConflict,
"",
"Diagnostics might be disabled for VM %s", vmName,
```
nit: as an SRE working a 2am incident - if I saw this error message raw I might be confused and need to review the code to understand the error. Adding a bit of clarity to what this error actually means (and how to resolve it?) for on-call could be helpful if that's even possible. I think this is especially true considering we never expect diagnostics to be off - the oddest cases likely deserve the clearest messaging.
```go
limitedReader := io.LimitReader(blobResp.Body, maxBytes)
_, err = io.Copy(writer, limitedReader)
```
nit: should we write a warning or something if data is truncated at 100MB? Note: this keeps the beginning of the content and drops the end. Should we instead return the end of the log?
Good suggestion, I will address in the next PR where I can just handle the streaming of the logs.
```go
writer.Header().Set("Cache-Control", "no-cache, no-store, must-revalidate")
writer.Header().Set("Pragma", "no-cache")
writer.Header().Set("Expires", "0")
writer.WriteHeader(http.StatusOK)
```
bug: once you write a header, it can NEVER be overwritten. The errors at L183 are silently ignored and go to logs, but SREs are getting 200/OK (and potentially partial data) because the status is committed here.
I think the best way for us to handle this is to buffer the full content and only respond once we have the data in hand. If we're capping at 100MB, I think this is fine?
Yes, that's true and that's the idea when streaming the logs.
I guess I meant: streaming can fail, and there's no great way to handle those failures and notify on-call about them. Considering we're only dealing with 100MB, it might be best to eliminate streaming altogether and only respond once we've got the full data, so we can safely return 200. That has memory implications for admin-api, but it seems reasonable, and I don't think we expect to call this endpoint often (e.g. I don't see us invoking this on every ICM; it's a critical gap we need to fill in rare situations).
Ultimately it's not a huge deal either way, since the scenarios are slim, but considering that silent failures here could lead to incorrect assumptions during incident triage, it seems important to be more correct than resource-aware.
```go
Expect(err.Error()).To(ContainSubstring("400"))
})

It("should return 409 when boot diagnostics is disabled on a VM",
```
could we save a cluster create and NP create cycle by adding the "fetch VM -> disable -> 409" to the previous suite? I guess that conflates positive/negative ... but this feels heavy for a single HTTP test. Could it be an integration test?
I did consider doing it in a single suite, but separated it because the disabling case requires an external action on the VM (turning off boot diagnostics), which might not work in environments with a deny assignment in place. The previous suite works in any environment.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rajdeepc2792, SudoBrendan. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
https://issues.redhat.com/browse/ARO-17482
What
Introduces an Admin API serial console endpoint that gives SREs access to VM serial console logs for debugging boot issues, kernel panics, and other low-level problems in an HCP cluster's node pool VMs.
Why
Special notes for your reviewer
Testing