Re: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing

Andre Kostur <akostur@incognito.com> Tue, 20 January 2015 19:43 UTC

MIME-Version: 1.0
In-Reply-To: <489D13FBFA9B3E41812EA89F188F018E1B782FE0@xmb-rcd-x04.cisco.com>
References: <0FE7102D-39A6-4245-A07A-B70C945FAE8F@nominum.com> <489D13FBFA9B3E41812EA89F188F018E1B7828B1@xmb-rcd-x04.cisco.com> <F3B109F2-52EA-4305-BAA2-4DB3C6C4CEDC@nominum.com> <489D13FBFA9B3E41812EA89F188F018E1B782FE0@xmb-rcd-x04.cisco.com>
Date: Tue, 20 Jan 2015 11:43:23 -0800
Message-ID: <CAL10_Bq9vej7L0AN-CF4oZUCqu2jhH4=UmjUqXCUdAi-kJZDkQ@mail.gmail.com>
From: Andre Kostur <akostur@incognito.com>
To: "Bernie Volz (volz)" <volz@cisco.com>
Content-Type: multipart/alternative; boundary="001a11333074e270ed050d1aa69e"
Archived-At: <http://mailarchive.ietf.org/arch/msg/dhcwg/V02dLeVgAhq5mWXk7WBC1iNReoE>
Cc: dhcwg <dhcwg@ietf.org>, Ted Lemon <Ted.Lemon@nominum.com>
Subject: Re: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing
Precedence: list

Thank you for your patience (and thanks to Bernie for the first round of
defense):



> In section 3.2, the text appears to be saying that in a DHCPv6 failover
> setup, a RENEW with a server identifier should be ignored if the load
> balancing algorithm doesn't identify the client as belonging to that
> server.  This is problematic, for two reasons.  First, it leads to the
> client attempting to renew multiple times, resulting in an increased load
> on the server.  Second, a REBIND would presumably be answered by both
> servers, so it wouldn't necessarily correct the problem.

The increased load on the server is small: load balancing happens early in
message processing, and the RENEW is simply dropped.  This additional load
also appears only when one of the servers has failed.  Once that server is
restored, the RENEWs arrive at both servers and the responsible server
answers, unless the DHCPv6 server is configured to allow unicast RENEWs.  In
that case the client would retry between T1 and T2 every 10 minutes
(assuming default timeouts, and after the initial ramp-up through the 10,
20, 40, etc. second timeouts).
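
To make the retry arithmetic concrete, here is a rough sketch of the RENEW
retransmission schedule.  It assumes the RFC 3315 defaults (REN_TIMEOUT of
10 seconds, REN_MAX_RT of 600 seconds) and leaves out the random jitter the
RFC applies to each timeout; the function name is illustrative only.

```python
# Assumed RFC 3315 defaults for Renew retransmission, in seconds.
REN_TIMEOUT = 10   # initial retransmission time (IRT)
REN_MAX_RT = 600   # maximum retransmission time (MRT)


def renew_timeouts(count, irt=REN_TIMEOUT, mrt=REN_MAX_RT):
    """Return the first `count` retransmission timeouts, ignoring jitter."""
    timeouts = []
    rt = irt
    for _ in range(count):
        timeouts.append(rt)
        rt = min(2 * rt, mrt)  # double each time, capped at the maximum
    return timeouts


print(renew_timeouts(8))  # [10, 20, 40, 80, 160, 320, 600, 600]
```

After the seventh attempt the interval stays at the 600 second cap, which is
the "every 10 minutes" figure.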

As Bernie mentioned, the REBIND would only be answered by the responsible
server.

> But this leads to the reason I think this is not ready for publication:
> RFC 3074 never actually specifies under what circumstances load balancing
> is done: it leaves it up to the implementation.  It kind of implies that
> load balancing should be done on all requests.  But in fact it only ever
> mentions DHCPDISCOVER.  And indeed, I just looked at the ISC
> implementation, and it only does load balancing on DHCPDISCOVER.

RFC 3074 defines a Service Transaction as “A set of client-server exchanges
that lead to a server providing or denying some service to a client.”  It
does go on to use DORA as an example of a Service Transaction, but a
REQUEST-RENEW/ACK (or INFORM/ACK) would just as much be a Service
Transaction.  Examples are not definitions.

> Section 2 of this document suggests that load balancing is done for _all_
> client-sourced messages, but that would mean that a REBIND message wouldn't
> get a response from the non-balancing server, which in turn would mean that
> the client's lease would have to expire before it could rebind to the
> correct server.

Correct.  REBINDs do not carry a Server ID, and thus are subject to load
balancing.  And if we are talking about the failover mode, then one of the
two servers is responsible for that device and will answer.  The failure
case here is when the partner servers can see each other but the clients
cannot reach one of the servers.  Then we may be in the limbo state of half
the population being ignored, because the functional server can still see
the isolated server (and thus correctly ignores that half of the
population).  This is fixable by either repairing the network issue or
stopping the isolated server.
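
As an illustration of why exactly one partner answers a REBIND, here is a
minimal sketch of a bucket-based decision.  Everything in it is an
assumption made for the example: the hash is a stand-in (first byte of
SHA-256), not the Pearson hash from RFC 3074, and the even bucket split, the
function names and the sample DUID are hypothetical.

```python
import hashlib


def bucket(duid: bytes) -> int:
    """Map a client DUID to one of 256 buckets.

    Stand-in hash only: first byte of SHA-256, NOT the RFC 3074 hash.
    """
    return hashlib.sha256(duid).digest()[0]


def answers(owned_buckets: set, duid: bytes, partner_reachable: bool) -> bool:
    """Decide whether this server answers a message carrying no Server ID."""
    if not partner_reachable:
        return True  # partner believed down: serve the whole population
    return bucket(duid) in owned_buckets


# Hypothetical even split of the bucket space between the two partners.
SERVER_A = set(range(0, 128))
SERVER_B = set(range(128, 256))

duid = bytes.fromhex("000100011c39cf88080027fe8f95")  # made-up DUID
a = answers(SERVER_A, duid, partner_reachable=True)
b = answers(SERVER_B, duid, partner_reachable=True)
print(a, b)  # one True and one False: a single partner answers
```

The limbo state corresponds to both servers evaluating this with
partner_reachable=True while clients can only reach one of them: the
reachable server keeps ignoring the other server's buckets.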


Note that section 3.2 says that a server receiving a RENEW with the correct
Server Identifier MAY choose to ignore the request.  This leaves a server
implementer free to decide that the population should not rebalance
automatically, and should instead rebalance only as clients eventually come
in with a SOLICIT (or to make this configurable; implementer’s choice).  I
feel that automatic rebalancing is desirable, as one is performing load
balancing for a reason.  If one did not care that the entire population
could stay on one server, why bother with the additional complexity of a
load balanced solution?  Just let the population work with one server, and
let the secondary server only respond when the first is unresponsive.
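
The choice described above could be sketched as a small decision function.
This is only an illustration of the MAY: the function name, its parameters
and the rebalance_on_renew switch are hypothetical and not taken from the
draft.

```python
def handle_renew(server_id_matches: bool,
                 client_is_mine: bool,
                 rebalance_on_renew: bool) -> str:
    """Return "reply" or "drop" for an incoming RENEW.

    client_is_mine: result of the load balancing check for this client.
    rebalance_on_renew: hypothetical configuration switch for the MAY.
    """
    if not server_id_matches:
        return "drop"   # RENEW names a different server
    if client_is_mine:
        return "reply"  # normal case: we own this client
    # Server ID matches, but load balancing assigns the client to the
    # partner.  Dropping pushes the client on to REBIND, so the population
    # rebalances; replying keeps the client here until it next SOLICITs.
    return "drop" if rebalance_on_renew else "reply"


print(handle_renew(True, False, rebalance_on_renew=True))   # drop
print(handle_renew(True, False, rebalance_on_renew=False))  # reply
```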

> And, if the goal is for use with failover, we should wait until failover
> is available (or further along)?

I don’t think it would be necessary to wait for failover to be available,
as this draft is dependent on certain properties of the failover mechanism,
not on a specific implementation.

-- 
Andre Kostur
