[dhcwg] AD review of draft-ietf-dhc-dhcpv6-failover-design

Ted Lemon <ted.lemon@nominum.com> Mon, 03 February 2014 19:30 UTC

From: Ted Lemon <ted.lemon@nominum.com>
Date: Mon, 03 Feb 2014 14:30:12 -0500
Message-ID: <50CD0E29-FAAC-428D-BDE6-D65A8F411F41@nominum.com>
To: draft-ietf-dhc-dhcpv6-failover-design@tools.ietf.org
Cc: "dhcwg@ietf.org WG" <dhcwg@ietf.org>
Subject: [dhcwg] AD review of draft-ietf-dhc-dhcpv6-failover-design

I've got some detailed comments, but the bottom line is that I am not ready to submit this to the IESG.   This is for two reasons.   For the first, see the last few paragraphs of this review.   The other reason is that it appears to be two documents.   One document is trying to explain the reasoning behind the choices made in developing the failover protocol.   The other is trying to specify how the protocol functions, without specifying wire format stuff.

My understanding of the plan was that there would be a requirements document, which is done.   And then there would be this document, which would be a specification of the protocol, but would not specify wire format information.   Finally, there would be a third document, which would specify the wire format.

I don't think it's a problem to do a document that describes the reasoning behind the protocol, and if the wire format document is going to specify the actual protocol, including all the MUSTs and SHOULDs and MAYs, then this document would be fine as an informational document.   However, if the third document will not stand on its own, this document has to be standards track, as it is now.   And in that case, it needs to be chopped down substantially, and the normative language needs to be thought through carefully and systematically, so that it's not scattered all over the place and so that things that ought to be validated by servers are not instead specified as requirements on administrators (e.g. the suggestions below on the CONTACT message and on 5.2 paragraph 3).

I think the right thing to do is make this document informational, get rid of the RFC2119 language and the reference to RFC2119, and do that bit in the third document.   The more I think about this, the more I think that's actually what we said we were going to do originally.   However, if you want to go forward with proposed standard, it's going to require a lot of rework.   Either way is fine with me, but the decision needs to be made.

So, on to the editorial suggestions.   I've tried to limit myself to suggestions that I don't think the RFC editor would make, or that would be easy to get wrong, and of course to questions and comments about the spec itself.

The abstract is a bit wordy.   Suggested edit:

OLD
   DHCPv6 defined in [RFC3315] does not offer server redundancy.  This
   document defines a design for DHCPv6 failover, a mechanism for
   running two servers on the same network with capability for either
   server to take over clients' leases in case of server failure or
   network partition.  This is a DHCPv6 Failover design document, it is
   not a protocol specification document.  It is a second document in a
   planned series of three documents.  DHCPv6 failover requirements are
   specified in [I-D.ietf-dhc-dhcpv6-failover-requirements].  A protocol
   specification document is planned to follow this document.

NEW
   This document describes a design for DHCPv6 failover, a mechanism
   that allows two servers for the same network to communicate so as to
   allow either server to take over clients' leases if the other fails,
   as well as allowing continued operation in the event of a network
   partition.  This document does not describe the wire protocol.

The other text is okay if you really prefer it, but the point of an abstract is to attract the right readers and not attract the wrong ones, not to tell them anything else.   So, e.g., there is no reason to refer to the requirements document or to RFC 3315, because presumably the target audience for this document already knows what DHCPv6 is and where it is defined, and if not they can read the references.

Section 3, first paragraph, "design" is superfluous, edit for clarity:
OLD:
   The failover protocol design provides a means for cooperating DHCPv6
   servers to work together to provide a DHCPv6 service with
   availability that is increased beyond that which could be provided by
   a single DHCPv6 server operating alone.  It is designed to protect
NEW:
   The failover protocol provides a means for cooperating DHCPv6
   servers to work together to provide a DHCPv6 service with
   greater availability than could be provided by
   a single DHCPv6 server operating alone.  It is designed to protect

page 6, second paragraph:
   Due in part to the
   additional overhead required as well as requirements to handle time
   skew between failover partners (See Section 8.1), failover is not
   suitable for leases shorter than 30 seconds.

Does the protocol design guarantee that leases _longer_ than 30 seconds will work, or is this a rule of thumb?

Are sections 3.1 and 3.2 really necessary?   Didn't we have a requirements document that preceded this one?   :)

Section 4, first paragraph:

   are listed in Section 4.2.  All information is sent over the
   connection as typical DHCPv6 messages that convey DHCPv6 options,
   following the format defined in Section 22.1 of [RFC3315].

Why not just say "as DHCPv6 messages" and leave out "typical"?   Perhaps you mean "as defined in section 6 of RFC 3315"?   But I don't think you need to say that; you can just specify the format in the document that describes the wire format.

page 9, paragraph 3, edit for clarity:
OLD:
   Once a server is switched into PARTNER-DOWN (when auto-partner-down
   is used or as a result of administrative action), it can extend
   leases, regardless of the original server that initially granted the
   lease.  In that state server handles leases from its own pool, but
   once its own pool is depleted is also able to serve pool from its
   downed partner.  Some MCLT restrictions no longer apply, but the MCLT
NEW:
   Once a server is switched into PARTNER-DOWN (when auto-partner-down
   is used or as a result of administrative action), it can extend
   leases, regardless of which server initially granted the
   lease.  In that state the server handles leases from its own pool, but
   once its own pool is depleted it is also able to serve leases from the
   pool of its downed partner.  Some MCLT restrictions no longer apply,
   but the MCLT

Also, doesn't this leave out that a server in PARTNER-DOWN can immediately renew leases from the other server's pool, even before its own pool is depleted?

Section 4.2, first paragraph:
   are assigned in this document.  Appropriate implementation details
   will be specified in a separate protocol specification document.  The
   following list enumerates these messages:

You've already said several times that this document doesn't specify the wire format.   I think it's best not to repeat it every time something that has a wire format is discussed for the first time—it feels repetitive to me.   Your call, but that's my advice, FWIW.

Page 11, last paragraph:
   o  CONTACT - The contact message is used by either server to ensure
      that the other server continues to see the connection as
      operational.  It MUST be transmitted periodically over every
      established connection if other message traffic is not flowing,
      and it MAY be sent at any time.

Do you really want to put normative language here?   You aren't describing the flow of state.   Surely there's a better place in the document for this MUST.

Section 5.1, paragraph 7:
   for the CONNECT message from a primary server.  If the secondary
   server doesn't receive a CONNECT message from the primary server in
   an installation dependent amount of time, it MAY drop the connection.

Do you mean "installation dependent" or "administratively configured?"   Or "implementation-dependent"?   Why isn't this just specified as a specific time limit?

paragraph 8:
   Every CONNECT message includes a TLS-request option, and if the
   CONNECTACK message does not reject the CONNECT message and the TLS-
   reply option says TLS MUST be used, then the servers will immediately
   enter into TLS negotiation.

Presumably either server could be configured to require TLS; if so, and if the secondary doesn't say TLS must be used, the primary should drop the connection and flag an error.
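
To illustrate the behavior suggested above, here is a minimal Python sketch of the decision a server might make after the CONNECT/CONNECTACK exchange.   It is only a sketch under assumptions: the names TlsPolicy, ConnectionRejected and check_tls_reply are hypothetical, and the draft does not define a local "require TLS" setting in these terms.

```python
from enum import Enum


class TlsPolicy(Enum):
    """Hypothetical local TLS setting; not taken from the draft."""
    OPTIONAL = "optional"
    REQUIRED = "required"


class ConnectionRejected(Exception):
    """Signals that the connection should be dropped and an error flagged."""


def check_tls_reply(local_policy: TlsPolicy, partner_says_tls_required: bool) -> bool:
    """Return True to start TLS negotiation, False to continue without TLS.

    Raises ConnectionRejected when the local server requires TLS but the
    partner's reply does not say that TLS must be used.
    """
    if partner_says_tls_required:
        return True
    if local_policy is TlsPolicy.REQUIRED:
        raise ConnectionRejected(
            "local policy requires TLS, but the partner did not require it"
        )
    return False
```

A caller would catch ConnectionRejected, close the TCP connection, and log the mismatch so that the administrator can see that the two servers disagree about TLS.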

section 5.2, paragraph 2:
   The failover protocol SHOULD be configured with one failover
   relationship between each pair of failover servers.  In this case
   there is one failover endpoint for that relationship on each failover
   partner.  This failover relationship MUST have a unique name.

Do you mean:
   Any pair of failover servers that have a failover relationship is 
   normally configured with exactly one such relationship.
   In this case there is one failover endpoint for that relationship
   on each failover partner.  This failover relationship MUST have a unique name.

Also, unique in what domain?   The entire internet?   :)

In paragraph 3, I think the normative language is superfluous.   How about:
   There is typically no need for more than one relationship between
   any two servers, but such a configuration is not prohibited, and may make
   sense in some cases.   In such cases, each failover relationship MUST have
   a unique relationship name.

This paragraph also seems a little weird since you said in the previous paragraph that failover relationship names are unique.   That said, an additional problem with the above paragraph is that you're using normative language for what looks like an administrative configuration.   This doesn't make sense—administrators aren't necessarily reading this RFC.   If you want to make a requirement, you should say something like "if a DHCP server configured with two failover endpoints detects that both endpoints have the same name, the DHCP server MUST report the conflict and MUST NOT attempt to establish or accept connections for either failover relationship."
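
As a rough illustration of a server-side check of the kind suggested above, the following Python sketch finds duplicate relationship names among configured endpoints and leaves the conflicting relationships unused.   The Endpoint type, its fields, and the function names are hypothetical; they are assumptions made for the example, not anything defined in the draft.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical configured failover endpoint."""
    relationship_name: str
    partner_address: str


def conflicting_names(endpoints):
    """Return the set of relationship names configured more than once."""
    counts = Counter(ep.relationship_name for ep in endpoints)
    return {name for name, n in counts.items() if n > 1}


def usable_endpoints(endpoints):
    """Report name conflicts and return only the endpoints without one.

    Endpoints whose name is duplicated are all excluded, so the server
    neither initiates nor accepts connections for those relationships.
    """
    bad = conflicting_names(endpoints)
    for name in sorted(bad):
        print(f"error: duplicate failover relationship name {name!r}")
    return [ep for ep in endpoints if ep.relationship_name not in bad]
```

The point of the sketch is that the requirement lands on the server, which can actually be tested against it, rather than on the administrator.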

In 5.2, last paragraph:

   When a connection is received from the partner, the unique failover
   endpoint to which the message is directed is determined solely by the
   IP address of the partner, the relationship-name, and the role of the
   receiving server.

A connection is not a message.   I can't tell what this paragraph is asking the implementor to do.

In section 6: you define proportional allocation and independent allocation in some detail here.   I don't see the point in also defining them in the terminology section, although I understand the impulse.   I would recommend that you delete the entries in the terminology section to avoid this repetition.

Also, I think the reference to dhcpv4-failover is probably unhelpful.   I think it's a normative reference in this particular context, although you currently have it listed as informative.   Until we get that document published, it's a downref, which is something to avoid, and I think is easy to avoid here.

6.1, paragraph 1:
   In this allocation scheme, each server has its own pool of available
   resources.  Remaining available resources are split between the
   primary and secondary servers in a configured proportion.  Note that

I think you mean "in a configured ratio," not "proportion."   See http://www.math.com/school/subject1/lessons/S1U2L2GL.html for explanation.

6.1, paragraph 3:
   A resource will not become owned by the server which allocated it
   initially when it is released or the lease expires because, in
   general, that server will have had to replenish its pool of available
   resources well in advance of any likely lease expirations.  Thus,
   having a particular resource cycle back to the secondary might well
   put the secondary more out of balance with respect to the primary
   instead of enhancing the balance of available addresses or prefixes
   between them.

You've already said that the resource reverts to the primary when it becomes available after having been allocated to a client.   Here you are explaining why.   It is unnecessary to explain why, and the explanation doesn't really make sense, because it's equally true that returning the resource to the secondary might put the pool more in balance.   So I think you should delete this paragraph.   The reader simply doesn't need to know this.

The use of POOLREQ/POOLRESP in 6.1 doesn't make sense.   It's the primary that needs to send POOLREQ, not the secondary.   You've said that the primary is supposed to automatically balance the pools by sending BNDUPD messages moving leases from FREE to FREE_BACKUP when needed.   Really, the best thing to do would be to require the secondary to do the same thing, and get rid of POOLREQ/POOLRESP entirely.   You can deal with the problem of rebalance wars by specifying a hysteresis threshold rather than suggesting it.   You'd also need to make sure that the ratio is the same on both servers in the CONNECT/CONNECTACK exchange.   Also, allowing the primary to try to steal FREE_BACKUP leases seems unnecessarily complicated.   Proactive pushing seems better than proactive stealing.
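
To make the push-only idea concrete, here is a minimal Python sketch of a rebalancing rule with a hysteresis threshold that both servers could run.   It is an illustration under assumptions, not a proposal for exact behavior: the function name, the representation of the ratio as a fraction (local_share), and the representation of the threshold as a fraction of all free leases (hysteresis) are all hypothetical.

```python
def leases_to_push(local_free: int, partner_free: int,
                   local_share: float, hysteresis: float) -> int:
    """Return how many free leases this server should hand to its partner.

    local_share: fraction of all free leases this server should hold
                 (assumed to be agreed in the CONNECT/CONNECTACK exchange).
    hysteresis:  fraction of all free leases; a surplus no larger than this
                 is tolerated, which avoids rebalance wars.
    A server never takes leases; it only pushes its own surplus.
    """
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    if hysteresis < 0.0:
        raise ValueError("hysteresis must not be negative")
    total = local_free + partner_free
    target = int(total * local_share)   # leases this server should hold
    surplus = local_free - target
    if surplus <= total * hysteresis:
        return 0                        # within tolerance, or a deficit
    return surplus
```

With a 50/50 split and a 10% threshold, a server holding 150 of 200 free leases would push 50, while a server holding 110 of 200 would do nothing.   The server with a deficit also does nothing; it waits for its partner to push.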

In 6.2, I see no justification for allowing the PARTNER-DOWN server to allocate out of the other server's pool.

In 6.3, paragraph 2:
   The proportional allocation mechanism is more flexible as it can
   dynamically rebalance available resources between servers.  That
   balance creates an additional burden for the servers and generates

You want to say "that balancing," not "that balance," because it is the balancing process that creates the load—if the pools are balanced, there is no balancing, and hence no load.

BTW, the advice you're giving here seems like a repetition of what was said in the two bullet points in section 6.   It would be good if you could cull those, or chop them down so that they don't duplicate what's said in 6.3.

In paragraph 3:
   between partners simpler.  It also makes recovery easier and
   potential conflict less likely to appear.

I think you mean either "It also makes recovery easier and makes it less likely that the failover pair will wind up in the POTENTIAL-CONFLICT state," or else you mean "It also makes recovery easier and conflict less likely."

In paragraph 4, I think you mean "more constrained" rather than "smaller":
   assign, so there is no need to rebalance.  For the prefix delegation
   mechanism, available resources are typically much smaller, so there

also:
   Independent allocation still may be used, but the implication must be
   well understood.  For example in a network that delegates /64
   prefixes out of a /48 prefix (so there can be up to 65536 prefixes
   delegated) and a 1000 requesting routers, it is safe to use
   independent allocation.

I think you mean here that if the maximum number of requesting routers is fewer than half the number of prefixes, then there should be no problem with using the independent model.   It's good to give examples, but here it might be better to just show the math.
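
The math in question is short.   As a sketch, assuming the prefixes are split evenly between the two servers, and using hypothetical function names:

```python
def delegable_prefixes(pool_len: int, delegated_len: int) -> int:
    """Number of prefixes of length delegated_len inside one pool_len prefix."""
    if not 0 <= pool_len <= delegated_len <= 128:
        raise ValueError("need 0 <= pool_len <= delegated_len <= 128")
    return 2 ** (delegated_len - pool_len)


def independent_allocation_is_safe(pool_len: int, delegated_len: int,
                                   max_routers: int) -> bool:
    """True if the requesting routers number fewer than half the prefixes.

    Assumes an even split, so a single surviving server, holding half the
    prefixes, can still serve every requesting router.
    """
    return 2 * max_routers < delegable_prefixes(pool_len, delegated_len)
```

For the example in the draft, delegable_prefixes(48, 64) is 2^16 = 65536, half of that is 32768, and 1000 requesting routers is well under that limit.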

Of course, you could also safely just delete the last three paragraphs of this section, because what is said here is likely to be obvious to anyone capable of implementing the protocol.

In section 7, "common denominator" is an abbreviation of an English colloquialism where the phrase "least common denominator" is used to mean "basic common subset" or something like that.   I think you can just say "common" and convey the same meaning.   For a reader who is not familiar with this colloquialism, "denominator" is likely to impede understanding.

So, in paragraph two, I'd suggest the following edit:
OLD:
   In order to transmit binding database updates between one server and
   another using the failover protocol, some common denominator binding-
   status values must be defined.  It is not expected that these values
   correspond with any actual implementation of the DHCP protocol in a
   DHCP server, but rather that the binding-status values defined in
   this document should be a common denominator of those in use by many
   DHCP server implementations.
NEW:
   In order to transmit binding database updates between one server and
   another using the failover protocol, some common binding-
   status values must be defined.  It is not expected that these values
   correspond with any actual implementation of the DHCP protocol in a
   DHCP server, but rather that the binding-status values defined in
   this document should be a common subset of those in use by existing
   DHCP server implementations.

Why the MAY here:
   The lease binding-status values defined for the failover protocol are
   listed below.  Unless otherwise noted below, there MAY be client
   information associated with each of these binding-status value.
?

I think what you really want is something like this:
   The lease binding-status values defined for the failover protocol are
   listed below.  As noted below, client
   information may be required to be associated with the lease when it
   has a particular binding-status value.

Section 7, under FREE, for clarity:
OLD:
      available for assignment by the primary server.  Note that on a
      secondary server running in PARTNER-DOWN state, after waiting the
      MCLT, the resource MAY be allocated to a client by the secondary
      server.  Client identification MAY appear and indicates the last
      client to have used this resource as a hint.
NEW:
      available for assignment by the primary server.  On a
      secondary server running in PARTNER-DOWN state, the resource
      is available for allocation after MCLT has passed.  Client
      identification can appear, and indicates which client last used
      this resource.

Similarly under FREE_BACKUP:
OLD:
      secondary server to a client at any time.  Note that the primary
      server running in PARTNER-DOWN state, after waiting the MCLT, the
      resource MAY be allocated to a client by the primary server if
      proportional algorithm was used.  Client identification MAY appear
      and indicates the last client to have used this resource as a
      hint.
NEW:
      secondary server to a client at any time.  On a
      primary server running in PARTNER-DOWN state, the resource
      is available for allocation after MCLT has passed.  Client
      identification can appear, and indicates which client last used
      this resource.

In figure 1, is it possible for a lease in ACTIVE to go to RESET?

The information documented in figure 2 and the paragraph before it is also documented in points 5 and 6.   You should just delete figure 2 and that paragraph; if you feel points 5 and 6 are insufficient, add more information there.

The advice in the paragraph following figure 2 is not advice about the operation of the binding state machine.   It is information about the operation of the pool balancer.   It should be in its own section.   Pool balancing has been discussed elsewhere in the document already.

This kind of redundancy is prevalent throughout the document.   Authors, please take another pass through this document and try to identify cases like the one I described above and isolate the advice given in these cases into sections specific to these cases, and make sure it's not repeated anywhere else in the document.   You can probably reduce the document length by five pages or even ten pages by doing this, and it will be much better organized and easier to read.

I would appreciate it if the working group could do this before I continue reviewing the document.   What you are specifying here is a good protocol, but the document needs more work.

The working group would be within their rights to insist that I continue with the review and not insist on this change, on the basis that the authors do not have the time or enthusiasm to do the work.   If the working group makes this request, I will attempt to honor it, or else try to find some resources to help you.   I want to see this document progress, and there is lots of good work in here.   But it could be so much better with a bit of restructuring and winnowing.