Re: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing

"Bernie Volz (volz)" <volz@cisco.com> Thu, 18 December 2014 03:32 UTC

From: "Bernie Volz (volz)" <volz@cisco.com>
To: Ted Lemon <Ted.Lemon@nominum.com>
Thread-Topic: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing
Thread-Index: AQHQGjSrui+8J4NAqkiN107VmBn0eJyUTGIQgAC+FwD//5z4sA==
Date: Thu, 18 Dec 2014 03:31:58 +0000
Message-ID: <489D13FBFA9B3E41812EA89F188F018E1B782FE0@xmb-rcd-x04.cisco.com>
References: <0FE7102D-39A6-4245-A07A-B70C945FAE8F@nominum.com> <489D13FBFA9B3E41812EA89F188F018E1B7828B1@xmb-rcd-x04.cisco.com> <F3B109F2-52EA-4305-BAA2-4DB3C6C4CEDC@nominum.com>
In-Reply-To: <F3B109F2-52EA-4305-BAA2-4DB3C6C4CEDC@nominum.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/dhcwg/NdFQN1skx7sN-BLMr7RQURF66OA
Cc: dhcwg <dhcwg@ietf.org>
Subject: Re: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing

>Only do load balancing on solicits.   The problem isn't actually in this document; it's that this document relies on 3074, which is underspecified.   So this document needs to correct for that.

There's also Information-Request.

And, there's Confirm that is not sent to a specific server and having both respond is probably unnecessary (it may not do any harm since they should give the same answer).

I also think that Rebinds are fine to do load balancing on - at least when failover is present (i.e. that the partner is alive). A client that arrives at Rebind might have done so because the failover partner that was to service it just came back up (i.e., since the last Renew retransmission). Yes, there are isolated cases where this ends up not giving the client service, but in that case other things are broken (such as Relays not forwarding the packets to both servers, and it is best to learn that earlier).

So, we're really back to all the messages that don't specify a server-id option?

>Well, it's pretty clear that the logic of doing load balancing on non-solicit messages hasn't been explored in any depth... :)

Hmm ... we've been doing that for DHCPv4 for many years with it deployed in large SP networks and are also doing it for DHCPv6 - we do more than just DHCPDISCOVER and Solicit. Things have been working quite well (these are with failover and we use information about the failover state when deciding whether to process the received packet or not).

We do not do Renews and don't intend to.


Looking back at the draft, I do see that we are conflating the two different architectures I had outlined earlier and that is probably where things go bad as the rules are a bit different.

In the case where you want a server (or failover partners) to process only a subset of the clients, the rules can be very simple - discard packets that are not in the server's hash bucket assignments. This can be done regardless of the packet type and whether the server-id option is present. Though it also means you only need to apply the logic to packets where the server-id option is not present (since the server-id option will suffice in dropping all other packets not for the server).
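To make the simple rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names bucket_for and should_process are made up, 256 buckets are assumed, and the hash is a stand-in (first byte of SHA-256), not the hash function that 3074 specifies.

```python
import hashlib

NUM_BUCKETS = 256  # assumption: hash values fall in 0..255


def bucket_for(client_id: bytes) -> int:
    """Map a client identifier to a bucket in 0..255.

    Stand-in only: first byte of SHA-256, NOT the hash from RFC 3074.
    """
    return hashlib.sha256(client_id).digest()[0]


def should_process(client_id: bytes, my_buckets: frozenset,
                   has_server_id: bool) -> bool:
    """Return True if this server should keep processing the packet.

    Packets carrying a server-id option are left to the normal
    server-id check, so the bucket filter is skipped for them.
    """
    if has_server_id:
        return True
    return bucket_for(client_id) in my_buckets
```

With two servers holding disjoint halves of the bucket space, exactly one of them keeps any given packet that lacks a server-id option.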

In the case of failover load balancing (i.e., where you want each server to handle "half" of the clients), the rules are different depending on the failover state. In normal failover communication state, load balancing applies; when not in normal communication, load balancing must be disabled.
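The failover variant can be sketched the same way. This is only an illustration: the state names and the function should_answer are hypothetical, and the only rule it encodes is the one stated above (filter by bucket in normal state, answer everything otherwise).

```python
from enum import Enum


class FailoverState(Enum):
    # Hypothetical state labels, used only for this sketch.
    NORMAL = "normal"
    COMMUNICATIONS_INTERRUPTED = "communications-interrupted"
    PARTNER_DOWN = "partner-down"


def load_balancing_enabled(state: FailoverState) -> bool:
    """Load balancing applies only in normal failover communication state."""
    return state is FailoverState.NORMAL


def should_answer(bucket: int, my_buckets: frozenset,
                  state: FailoverState) -> bool:
    """Decide whether this server answers a client hashed to `bucket`."""
    if not load_balancing_enabled(state):
        # Partner may be unreachable: do not drop anything on hash grounds.
        return True
    return bucket in my_buckets
```

For example, a server that owns buckets 0..127 ignores bucket 200 in normal state but answers it once it is no longer in normal communication with its partner.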


And, if the goal is for use with failover, we should wait until failover is available (or further along)?

And, if the goal is general load balancing (not tied to failover), then anything to do with failover (or any redundancy) should be removed.

In the case where there isn't any common database (whether failover or not), you really have to have the client stick with the server that is to service it - there is no redundancy goal. So whether you do the hashing on some or all client packets doesn't really matter. In principle though, it seems that doing those where there is no server-id option is simpler and better as it allows a client that previously communicated with a server (regardless of load balancing) to keep working with that server (if the server is still present). But when that client gets to a packet which is not directed at a specific server, then doing the load balancing is appropriate (Solicit, Information-Request, Rebind, Confirm).

If there is a common backing store (whether failover or otherwise), clients can be serviced by a different server when the server that would normally service them is known not to be responsive - so load balancing should only be done when the partner is known to be up and fully responsive. But it turns out the same packets are load balanced (Solicit, Information-Request, Rebind, Confirm).
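Put together, the choice of which packets to load balance might look like the sketch below. The message type numbers are the DHCPv6 values from RFC 3315; the function name is_load_balanced and the partner_up flag are assumptions for illustration. With no common backing store, partner_up can simply be left at True.

```python
# DHCPv6 message type codes (RFC 3315).
SOLICIT = 1
REQUEST = 3
CONFIRM = 4
RENEW = 5
REBIND = 6
INFORMATION_REQUEST = 11

# The messages named above as candidates for load balancing.
LOAD_BALANCED_TYPES = frozenset(
    {SOLICIT, INFORMATION_REQUEST, REBIND, CONFIRM}
)


def is_load_balanced(msg_type: int, has_server_id: bool,
                     partner_up: bool = True) -> bool:
    """Return True if the hash bucket check should be applied.

    Hypothetical helper: packets directed at a specific server
    (server-id option present) are never load balanced, and nothing
    is load balanced while the partner is not known to be up.
    """
    if has_server_id or not partner_up:
        return False
    return msg_type in LOAD_BALANCED_TYPES
```

Renew and Request carry a server-id option, so they fall out of the check without being listed explicitly.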

- Bernie

-----Original Message-----
From: Ted Lemon [mailto:Ted.Lemon@nominum.com] 
Sent: Wednesday, December 17, 2014 9:49 PM
To: Bernie Volz (volz)
Cc: dhcwg
Subject: Re: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-load-balancing

On Dec 17, 2014, at 5:18 PM, Bernie Volz (volz) <volz@cisco.com> wrote:
> See section 3.1 - a REBIND does not contain a Server ID option so the server that is supposed to respond to this client's hash bucket is the only one that responds.

Right, but that behavior is incorrect.   If a client is sending REBINDs, it might be because the server it's bound to is refusing to answer due to load balancing.  But it might also be because it's bound to the correct server, and that other server is no longer reachable by the client.   In that case, the client will fail to renew its lease, because the backup server won't answer.   Not only that, but in this case the client would simply fail to get a lease until the problem with the unreachable server is corrected.

Of course, Delayed Service is supposed to take care of that, but if Delayed Service is configured, then presumably the server that is refusing service under 3.2 will allow service once the Delayed Service interval has elapsed.   And so the whole thing will have been for naught.

> Also, at least in a failover situation, even if all clients were serviced by only one of the servers (say they all obtained their lease when only one of the servers was running and are happily renewing their leases), what's the harm? Failover doesn't mean you have twice the throughput - since you must have proper server capacity to handle all of the clients from a single server (the partner server may be down for an extended period). And, eventually clients will SOLICIT and then move to the proper server.

Right.

> Is there more than what followed (Section 2 & 3.2) that needs to be "specified"?

Well, it's pretty clear that the logic of doing load balancing on non-solicit messages hasn't been explored in any depth... :)

> Did you have any suggestion for how to deal with this? It was I believe intended to indicate that DHCPREQUEST in v4 could mean multiple things. Note that this is addressing the requirements, not the operation (though it is interesting that this section in 3074 doesn't mention DHCPREQUEST). Perhaps this can just be: "The requirements for DHCPv6 are substantially the same as for DHCPv4."?

Only do load balancing on solicits.   The problem isn't actually in this document; it's that this document relies on 3074, which is underspecified.   So this document needs to correct for that.