Skip to content

Scale ha combined#2

Open
jayunit100 wants to merge 33 commits into
masterfrom
scale-ha-combined
Open

Scale ha combined#2
jayunit100 wants to merge 33 commits into
masterfrom
scale-ha-combined

Conversation

@jayunit100
Copy link
Copy Markdown

@timothysc @rrati merged everything to one branch am testing now.
just FYI any thoughts feel free to put in here !

@timothysc
Copy link
Copy Markdown

Guys we need to finish

  1. Finish cleanup
  2. shift to experimental
  3. rebase
  4. one we are green lets put up a WIP pr so we can get feedback while we give it cycles.

Ideally getting a WIP up tomorrow just to start getting feedback would be good.

@jayunit100
Copy link
Copy Markdown
Author

I think Cleanup is done addressed earlier comments, but have to copy final struct over to scheduler will do that in the am then we can open the PR up to upstream and start rebasing

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this change effect more than just the locking stuff?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Before this change kube ignored the expire events from etcd. I don't know there was anything that used etcd's ttl so it probably wasn't needed before. An expire event is only when a key expires in etcd, so from kube's perspective it really should be handled like a delete. The only difference between a delete and an expire event in etcd is who initiated the key removal. For a delete, it is expressly asked by an external entity vs expire is done by etcd w/o external prompting

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the full process name.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check with decarr where is that recursive delete @...
This is the same iterative problem.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock name needs to be an input option, with the default, so you need to plumb through to the cmdline stuffs

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it? This is the controller-manager. Is there value in having the lock name configurable for the kcm? I was thinking we would want to hard-code the lock name in the processes so there's no chance of a configuration snaffu leading to issues

@jayunit100
Copy link
Copy Markdown
Author

(last thing i did friday, just remembered to add these notes here !) i got some testing done on some on a cluster (you can do this by modifying vagrant scripts to launch 2 masters). I can add the code tomorrow for that. but ...failover didn't seem to happen. monday

  • we need to see what the story is w/ ttls to make sure they are getting propogated properly.
  • I may have to add more logic to the utility class... i think the "update" logic maybe is messed up.

@jayunit100
Copy link
Copy Markdown
Author

interesting. i tested the renewtime and it seems to properly add a TTL to the etcd entries. but i cant see where?

Created-Index: 12
Modified-Index: 76
TTL: 26
Etcd-Index: 77
Raft-Index: 317
Raft-Term: 2
{"kind":"Lock","apiVersion":"v1","metadata":{"name":"ha.cm.lock","namespace":"default","uid":"f565e30b-45f6-11e5-b222-50e549b89d2f","resourceVersion":"73","creationTimestamp":"2015-08-18T22:17:48Z","deletionTimestamp":"2015-08-18T22:20:04Z"},"spec":{"heldby":"andross","duration":30,"atime":"2015-08-18 18:17:48.603475865 -0400 EDT","rtime":"2015-08-18 18:19:39.734959751 -0400 EDT"}}

Comment thread hack/local-up-cluster.sh
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I was on the fence about keeping these changes... but they allow you to test HA in local mode by inducing failover. Iim happy to including them in the final PR, unless folks think its a bad idea. For now while this is WIP its good to have these for regression testing as we change the code base and especially before we rebase.

The way it works: Just tail the logs of controller-manager-1.log and controller-manager-2.log, and you'll see the handoff happen after the time bomb goes off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants