[WIP] [reconfigurator] start removing expunged zones#5211
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
| sled_ip_allocators: BTreeMap::new(), | ||
| zones: BlueprintZones::new(parent_blueprint), | ||
| zones_in_service: parent_blueprint.zones_in_service.clone(), | ||
| expunged_nexus_zones: parent_blueprint.expunged_nexus_zones.clone(), |
There was a problem hiding this comment.
Happy to discuss this during the meeting later today, but just jotting down a couple of thoughts here:
- Once a zone gets added to
expunged_nexus_zones, how does the planner know when it's okay to remove it from a subsequent child blueprint? - Does this need to be Nexus zones specifically? Or could it be
zones_expungedlikezones_in_service? (I haven't thought about this much, but am assuming we're eventually going to want to track other kinds of zones that need execution work to be expunged.)
There was a problem hiding this comment.
Happy to discuss this during the meeting later today, but just jotting down a couple of thoughts here:
- Once a zone gets added to
expunged_nexus_zones, how does the planner know when it's okay to remove it from a subsequent child blueprint?
I was wondering the same thing. Should there also be a 'decommissioned' column for when the zone has been completely removed?
- Does this need to be Nexus zones specifically? Or could it be
zones_expungedlikezones_in_service? (I haven't thought about this much, but am assuming we're eventually going to want to track other kinds of zones that need execution work to be expunged.)
I was also thinking the same thing. It seems somewhat silly to have a separate table per zone type unless each of those tables would have different rows because they store other "checkpoint" information while decommissioning a zone. I don't think we were planning to do this inside the zone table, but I'm not sure.
There was a problem hiding this comment.
The latter we've taken into account by switching to a disposition enum, which we'll store inside bp_omicron_zone.
| .sleds | ||
| .keys() | ||
| .map(|sled_id| { | ||
| // XXX: do we need to handle, or check for, initially expunged |
There was a problem hiding this comment.
I would say no. The only thing we could do here is mark zones expunged if we find them on sleds that are expunged. But that's more of the planner's job. The idea behind the initial blueprint is that if you executed it, it wouldn't do anything. (That's why this is in BlueprintBuilder, which is not supposed to be that smart, and not Planner.)
davepacheco
left a comment
There was a problem hiding this comment.
Nice. (I know this is still a WIP.)
| let sled_info = self.sled_resources(sled_id)?; | ||
|
|
||
| match sled_info.state { | ||
| SledState::Active => {} | ||
| SledState::Decommissioned => { | ||
| return Err(Error::InvalidEnsure { | ||
| zone_tag: ZoneTag::InternalNtp, | ||
| sled_id, | ||
| reason: InvalidEnsureReason::Decommissioned, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| match sled_info.policy { | ||
| SledPolicy::InService { .. } => { | ||
| // Non-provisionable sleds should still get NTP zones. | ||
| } | ||
| SledPolicy::Expunged => { | ||
| return Err(Error::InvalidEnsure { | ||
| zone_tag: ZoneTag::InternalNtp, | ||
| sled_id, | ||
| reason: InvalidEnsureReason::Expunged, | ||
| }); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Seems like we could use a helper for this.
| return Err(Error::InvalidEnsure { | ||
| zone_tag: ZoneTag::InternalNtp, | ||
| sled_id, | ||
| reason: InvalidEnsureReason::Decommissioned, | ||
| }); |
There was a problem hiding this comment.
I wonder if this and the other InvalidEnsure should be PlannerErrors.
| // Check that all the zone IDs exist. It is an internal invariant | ||
| // violation for a zone to be in the list of current sled zones, but | ||
| // not in zones_in_service. |
There was a problem hiding this comment.
Why is that? Eventually we want to be able to have zones that exist but not in service, and it should be possible to expunge them I think.
There was a problem hiding this comment.
Yes -- I was thinking about this last night and what I did here is wrong. Thanks for catching this.
|
|
||
| // Also add the list of expunged Nexus zones to self.expunged_nexus_zones. | ||
| self.expunged_nexus_zones.extend(expunge.iter().filter_map(|z| { | ||
| (z.zone_type.tag() == ZoneTag::Nexus).then_some(z.id) |
There was a problem hiding this comment.
FYI, I think there's already a is_nexus(). (No reason to change this though.)
| // Precondition: we must have checked that it is valid for the zone to be | ||
| // added to the sled. (TODO: actually check this here, or use the type | ||
| // system to statically verify this.) |
There was a problem hiding this comment.
Can you say more about this? What does "valid" mean? Does this panic or return an error or do something bad if this precondition is violated?
| /// error if the sled ID wasn't found. | ||
| /// | ||
| /// Return true in `expunge_fn` to remove a zone, or false to retain it. | ||
| pub fn sled_expunge_zones( |
There was a problem hiding this comment.
Out of curiosity, why not just a sled_expunge_zone(sled_id, zone_id)?
| // We could in principle have a decommissioned sled which has | ||
| // resources allocated to it -- if we do, then we should log this | ||
| // as an error because that is not a legal state, but remove those | ||
| // zones anyway. | ||
| match sled_info.state { | ||
| SledState::Active => {} | ||
| SledState::Decommissioned => { | ||
| error!( | ||
| &log, | ||
| "sled has state Decommissioned, yet has resources allocated to it; \ | ||
| removing them here (sled policy is \"{}\")", sled_info.policy; | ||
| ); | ||
|
|
||
| return Some("decommissioned"); | ||
| } | ||
| } | ||
|
|
||
| match sled_info.policy { | ||
| SledPolicy::InService { .. } => None, | ||
| SledPolicy::Expunged => Some("expunged"), | ||
| } | ||
| } |
There was a problem hiding this comment.
Is it possible to have a sled in state "decommissioned" that's not got policy "expunged"? (If not, I think a lot of this could be simplified -- there'd only be one possible expunge reason and you'd only need to check one field to identify it.)
| // Decommissioned and expunged sleds don't get any services. (These | ||
| // are explicit matches so that when more states are added, this | ||
| // fails to compile.) |
There was a problem hiding this comment.
Similarly I wonder if this could just look at "expunged" now. It feels like only the policy should determine how we plan resources for the sled.
…on enum next to each zone (#5238) This greatly simplifies the handling of the expunged disposition in #5211. It also is more understandable, and less prone to making mistakes. This PR only changes the in-memory representation of what is currently `zones_in_service`. I've skipped doing the DB migration in this PR because it'll be easier to just do it wholesale in #5211. There are some small differences in diff output that I think make sense. But I took a bigger swing at it in #5270, which depends on this.
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
…#5413) Before we added the `display()` method, the way to get nicely formatted blueprints was to use the Debug impl. While working on #5211, I noticed that these callers were still using the old way to do things. Fix this up. I tested this by making the `Debug` impl for `Blueprint` panic, then saw which tests failed. I also realized that we didn't have tests for omdb's blueprints. I believe at the time it wasn't possible to write a test for this because we didn't have an initial blueprint -- but now we do, so include a test. (I also wanted to ensure that the actual blueprint ID was what we expected, so I expanded the scope of `redact_variable` a bit. Introduce `ExtraRedactions` so we can handle both fixed- and variable-length redactions, and use the same logic for UUIDs.)
|
Turns out it's too hard to make this rebase cleanly, so I'm going to start from scratch and copy over bits as they make sense. |
Putting this up for early review. This implements internal bookkeeping for the blueprint builder, as well as database storage for expunged Nexus zones.
This does not implement any of the other things we discussed in Tuesday's meeting -- let's discuss this tomorrow.