Re: [dhcwg] DHCPv6 Failover

On 5/9/2011 5:07 PM, Tomasz Mrugalski wrote:
> Dear group,
> DHCPv6 failover was mentioned a couple of times on mailing list and
> other places, but it looks like no real work started in that area. I'd
> like to change that.
>
> I'm looking for people, who would like to be involved in DHCPv6 failover
> work. Ted Lemon and John Brzozowski are supporting this effort. If you
> are interested, please respond to the list or contact me directly.

I'd be happy to at least offer a few comments.  I'm coming from the 
perspective of an enterprise network administrator, so I'm sure there are 
plenty of scenarios I'm not considering, but this is at least one set of 
concerns and opinions.

> My high level plan is to write DHCPv6 failover requirements document
> first. This will let us avoid problems encountered with v4 failover
> standardization attempts - it started as relatively simple draft, but
> people kept adding extra features and it became something else than
> originally envisaged. That would be very short document that would
> contain definitions and enumerate goals of this work, what is relevant
> and what is outside of scope.
>
> If you think that having such requirements list as a separate document
> is a bad idea, we could consider incorporating it into the main failover
> draft, but in my opinion having separate documents is cleaner and easier
> to maintain.
>
> Here are initial questions that we should consider, sometimes with my
> first proposed answers.
>
> 1. Primary goal is to define failover protocol, so rough failover
> definition is needed. DHCPv6 failover is an ability of one peer (DHCPv6
> server) to continue serving leases provided by another peer, without
> affecting clients' connectivity.

In my ideal world, failover would imply the opposite of the famous Leslie
Lamport quote - the failure of a machine I didn't even know existed makes no
difference to the machine I'm using.  Serving a peer's leases is obviously a
good start.

> 2. Do we want high availability?

Yes, YES, /YES/.  On a large network like mine (~10k hosts, small support
staff, lots of self-administered student-owned machines), where hardcoding any
large class of device is impractical, if DHCP doesn't work, then nothing 
works.  There are other ways of mitigating this (hardcoding servers, using 
exceptionally long leases), but lack of DHCP services still has a huge 
operational impact.

> 4. Do we want load balancing?

Others with huge networks may want this, but for my money, servers are cheap
and fast enough that it's easy to just throw more iron at a problem.  3GHz
CPUs, 10Gbit network interfaces, and server-class SSDs allow you to process a
silly amount of DHCP transactions without worrying too much about explicit
load balancing.

> 6. Should geographically distributed failover be taken into consideration?

Yes.  For example, I have a handful of semi-remote locations that justify
installing a lower-end DHCP server in the building, paired with the beefier
ones back on the main campus.  This gives me good redundancy in the building
without having to buy a second dedicated server.

> 3. If we want HA/LB/geo-failover, should it be defined in the core spec
> or as a separate document? Keeping them in single document can be easier
> in the beginning, but then can bloat the document quickly.

As I said, I'd prioritize HA over load balancing.  As for geo distribution, 
I'm curious - how do you see that differing from vanilla HA?

> 4. Should failover also cover prefix delegations or addresses only?
> My proposal is both address and prefixes. PD is a core feature of
> DHCPv6, so we can't pretend that it doesn't exist. PD is a must.

For the near future we're only looking at doing addresses, but yeah, ignoring
prefixes now just invites more after-the-fact duct-tape patching later on.

> 5. How many peers should this protocol support?
> My opinion is that 2 is enough. While having several peers would be
> great, the difficulty of synchronizing bigger number of peers than a
> pair may be very difficult.

Going from one to two, and removing a single point of failure, is certainly a 
much larger jump than going from two to three (to four, to five...).  That 
said, I've had cases where I've had less redundancy than I'd prefer because 
I'm limited to only failover pairs.  It's nothing I'm losing any sleep over, 
but the ability to define a pool of fully equivalent masters would make some 
of my design decisions a bit simpler.

Some of the NoSQL databases have been doing a lot of work on multi-master
replication.  Would it be worth using or stealing from one of them, like
Apache Cassandra?

> After we agree on answers to those questions, we may start thinking
> about how should the actual handover protocol will work. Two
> introductory questions for now:
>
> 6. Should we adopt DHCPv4 failover
> (http://tools.ietf.org/id/draft-ietf-dhc-failover-12.txt) as a starting
> point (or at least based on the same concept)? Note that this work
> started in 1997, but was eventually abandoned in 2003, so using the
> latest revision as a starting point may not be the best idea.
>
> 7. Significant part of the failover will be synchronization between
> peers, including transmitting selected/all leases. There is similar
> mechanism used in leasequery (RFC5007) and bulk leasequery (RFC5460). Do
> we want to reuse/extend that or define something new?
>
> Feel free to add other questions to the list.
>
> I'm looking forward to your comments and suggestions,
>
> Tomek Mrugalski
> ISC
> _______________________________________________
> dhcwg mailing list
> dhcwg@ietf.org
> https://www.ietf.org/mailman/listinfo/dhcwg

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
     GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC
