Skip to content

[WIP] [reconfigurator] start removing expunged zones#5211

Closed
sunshowers wants to merge 6 commits into
sunshowers/spr/main.wip-reconfigurator-start-removing-expunged-zonesfrom
sunshowers/spr/wip-reconfigurator-start-removing-expunged-zones
Closed

[WIP] [reconfigurator] start removing expunged zones#5211
sunshowers wants to merge 6 commits into
sunshowers/spr/main.wip-reconfigurator-start-removing-expunged-zonesfrom
sunshowers/spr/wip-reconfigurator-start-removing-expunged-zones

Conversation

@sunshowers

@sunshowers sunshowers commented Mar 7, 2024

Copy link
Copy Markdown
Contributor

Putting this up for early review. This implements internal bookkeeping for the blueprint builder, as well as database storage for expunged Nexus zones.

This does not implement any of the other things we discussed in Tuesday's meeting -- let's discuss this tomorrow.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
sled_ip_allocators: BTreeMap::new(),
zones: BlueprintZones::new(parent_blueprint),
zones_in_service: parent_blueprint.zones_in_service.clone(),
expunged_nexus_zones: parent_blueprint.expunged_nexus_zones.clone(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to discuss this during the meeting later today, but just jotting down a couple of thoughts here:

  • Once a zone gets added to expunged_nexus_zones, how does the planner know when it's okay to remove it from a subsequent child blueprint?
  • Does this need to be Nexus zones specifically? Or could it be zones_expunged like zones_in_service? (I haven't thought about this much, but am assuming we're eventually going to want to track other kinds of zones that need execution work to be expunged.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to discuss this during the meeting later today, but just jotting down a couple of thoughts here:

  • Once a zone gets added to expunged_nexus_zones, how does the planner know when it's okay to remove it from a subsequent child blueprint?

I was wondering the same thing. Should there also be a 'decommissioned' column for when the zone has been completely removed?

  • Does this need to be Nexus zones specifically? Or could it be zones_expunged like zones_in_service? (I haven't thought about this much, but am assuming we're eventually going to want to track other kinds of zones that need execution work to be expunged.)

I was also thinking the same thing. It seems somewhat silly to have a separate table per zone type unless each of those tables would have different rows because they store other "checkpoint" information while decommissioning a zone. I don't think we were planning to do this inside the zone table, but I'm not sure.

@sunshowers sunshowers Apr 2, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter we've taken into account by switching to a disposition enum, which we'll store inside bp_omicron_zone.

.sleds
.keys()
.map(|sled_id| {
// XXX: do we need to handle, or check for, initially expunged

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say no. The only thing we could do here is mark zones expunged if we find them on sleds that are expunged. But that's more of the planner's job. The idea behind the initial blueprint is that if you executed it, it wouldn't do anything. (That's why this is in BlueprintBuilder, which is not supposed to be that smart, and not Planner.)

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. (I know this is still a WIP.)

Comment on lines +419 to +444
let sled_info = self.sled_resources(sled_id)?;

match sled_info.state {
SledState::Active => {}
SledState::Decommissioned => {
return Err(Error::InvalidEnsure {
zone_tag: ZoneTag::InternalNtp,
sled_id,
reason: InvalidEnsureReason::Decommissioned,
});
}
}

match sled_info.policy {
SledPolicy::InService { .. } => {
// Non-provisionable sleds should still get NTP zones.
}
SledPolicy::Expunged => {
return Err(Error::InvalidEnsure {
zone_tag: ZoneTag::InternalNtp,
sled_id,
reason: InvalidEnsureReason::Expunged,
});
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we could use a helper for this.

Comment on lines +424 to +428
return Err(Error::InvalidEnsure {
zone_tag: ZoneTag::InternalNtp,
sled_id,
reason: InvalidEnsureReason::Decommissioned,
});

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this and the other InvalidEnsure should be PlannerErrors.

Comment on lines +768 to +770
// Check that all the zone IDs exist. It is an internal invariant
// violation for a zone to be in the list of current sled zones, but
// not in zones_in_service.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? Eventually we want to be able to have zones that exist but not in service, and it should be possible to expunge them I think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- I was thinking about this last night and what I did here is wrong. Thanks for catching this.


// Also add the list of expunged Nexus zones to self.expunged_nexus_zones.
self.expunged_nexus_zones.extend(expunge.iter().filter_map(|z| {
(z.zone_type.tag() == ZoneTag::Nexus).then_some(z.id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I think there's already a is_nexus(). (No reason to change this though.)

Comment on lines +803 to +805
// Precondition: we must have checked that it is valid for the zone to be
// added to the sled. (TODO: actually check this here, or use the type
// system to statically verify this.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say more about this? What does "valid" mean? Does this panic or return an error or do something bad if this precondition is violated?

/// error if the sled ID wasn't found.
///
/// Return true in `expunge_fn` to remove a zone, or false to retain it.
pub fn sled_expunge_zones(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why not just a sled_expunge_zone(sled_id, zone_id)?

Comment on lines +80 to +101
// We could in principle have a decommissioned sled which has
// resources allocated to it -- if we do, then we should log this
// as an error because that is not a legal state, but remove those
// zones anyway.
match sled_info.state {
SledState::Active => {}
SledState::Decommissioned => {
error!(
&log,
"sled has state Decommissioned, yet has resources allocated to it; \
removing them here (sled policy is \"{}\")", sled_info.policy;
);

return Some("decommissioned");
}
}

match sled_info.policy {
SledPolicy::InService { .. } => None,
SledPolicy::Expunged => Some("expunged"),
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a sled in state "decommissioned" that's not got policy "expunged"? (If not, I think a lot of this could be simplified -- there'd only be one possible expunge reason and you'd only need to check one field to identify it.)

Comment on lines +165 to +167
// Decommissioned and expunged sleds don't get any services. (These
// are explicit matches so that when more states are added, this
// fails to compile.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly I wonder if this could just look at "expunged" now. It feels like only the policy should determine how we plan resources for the sled.

sunshowers added a commit that referenced this pull request Mar 24, 2024
…on enum next to each zone (#5238)

This greatly simplifies the handling of the expunged disposition in
#5211. It also is more understandable, and less prone to making
mistakes.

This PR only changes the in-memory representation of what is currently
`zones_in_service`. I've skipped doing the DB migration in this PR
because it'll be easier to just do it wholesale in #5211.

There are some small differences in diff output that I think make sense.
But I took a bigger swing at it in #5270, which depends on this.
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from main to sunshowers/spr/main.wip-reconfigurator-start-removing-expunged-zones April 4, 2024 23:35
Created using spr 1.3.6-beta.1
sunshowers added a commit that referenced this pull request Apr 5, 2024
…#5413)

Before we added the `display()` method, the way to get nicely formatted
blueprints was to use the Debug impl. While working on #5211, I noticed
that these callers were still using the old way to do things. Fix this up.

I tested this by making the `Debug` impl for `Blueprint` panic, then saw
which tests failed.

I also realized that we didn't have tests for omdb's blueprints. I believe at
the time it wasn't possible to write a test for this because we didn't
have an initial blueprint -- but now we do, so include a test.

(I also wanted to ensure that the actual blueprint ID was what we expected, so
I expanded the scope of `redact_variable` a bit. Introduce `ExtraRedactions`
so we can handle both fixed- and variable-length redactions, and use the
same logic for UUIDs.)
@sunshowers

Copy link
Copy Markdown
Contributor Author

Turns out it's too hard to make this rebase cleanly, so I'm going to start from scratch and copy over bits as they make sense.

@sunshowers sunshowers closed this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants