Re: [trill] RtgDir review of draft-ietf-trill-directory-assist-mechanisms-07.txt

Donald Eastlake <d3e3e3@gmail.com> Sat, 16 April 2016 03:12 UTC

MIME-Version: 1.0
In-Reply-To: <57110E19.6050304@joelhalpern.com>
References: <570EB05D.20802@joelhalpern.com> <CAF4+nEHWCs7EOzMFN7HzA92DtdEzsFvFk-4zuzY4MRfeXdA4JA@mail.gmail.com> <57110E19.6050304@joelhalpern.com>
From: Donald Eastlake <d3e3e3@gmail.com>
Date: Fri, 15 Apr 2016 23:12:06 -0400
Message-ID: <CAF4+nEHxnx8NDZAbyVvzdexoGVpA=Z56YJw2HPcr-zh44dYGEQ@mail.gmail.com>
To: "Joel M. Halpern" <jmh@joelhalpern.com>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <http://mailarchive.ietf.org/arch/msg/trill/QZrUm4lMsXiFwgieAW21YrKkLgo>
Cc: "rtg-ads@ietf.org" <rtg-ads@ietf.org>, "rtg-dir@ietf.org" <rtg-dir@ietf.org>, "trill@ietf.org" <trill@ietf.org>, draft-ietf-trill-directory-assist-mechanisms.all@ietf.org

Hi Joel,

On Fri, Apr 15, 2016 at 11:51 AM, Joel M. Halpern <jmh@joelhalpern.com> wrote:
> Thank you Donald.  Points of agreement elided, some responses to try to
> clarify my observations.  I will note that from your comments about 3.1, I
> believe my concerns, now moved to 3.7, are larger, as I had assumed that the
> magic was in some other protocol, and you now say it is not defined there.
>
> Yours,
> Joel
>
> On 4/15/16 11:23 AM, Donald Eastlake wrote:
>>
>> Hi Joel
>>
>> Thanks for your thorough review and comments. See below
>>
>> On Wed, Apr 13, 2016 at 4:47 PM, Joel M. Halpern
>> <jmh@joelhalpern.com> wrote:
>>
> ...
>
>>> Major Issues:
>>> In the state machine transitions in section 2.3.3 for push servers,
>>> it appears that if the event indicating that the server is being
>>> shut down occurs while the server is already Going Stand-By or
>>> Uncompleting, the transitions indicate that this "going down" event
>>> will be lost.  A strict reading of this would seem to mean that the
>>> "go Down" event would need to recur after the timeout condition.
>>> This would seem to be best addressed by a new state "Going-Down"
>>> whose timeout behavior is to move to down state.
>>
>> I understand your point but "going down" and the like are called
>> "events or conditions" in this draft, not just events.
>> The problem with adding a single "Going-Down" state is that
>> transition to that state would lose the information as to whether
>> or not the Push Directory had been advertising that it was pushing
>> complete information. The reason to remember this is that you would
>> want to behave differently if the "going down" condition was
>> revoked before it completed. This information could be preserved in
>> a Boolean pseudo variable but the current style of state machine in
>> this draft avoids such pseudo variables and encodes all of the
>> relevant push directory's state into the state machine state. Thus,
>> I can see three possible responses to your comment:
>>
>> 1) Change wording to emphasize that these "events or conditions" can
>> be conditions that cause a state transition some substantial time
>> after they become true.
>>
>> 2) Add two new states: (1) going down - was complete; (2) going down
>> - was incomplete.
>>
>> 3) Change the style of state machine to admit pseudo variables which
>> can be set and tested as part of the state machinery.
>>
>> Option 1 is just some minor wording changes but adopting either
>> option 2 or 3 involves more extensive changes so I would prefer to
>> avoid them.
>
> From what I have seen, trying to build a state machine with conditions
> rather than events is fraught with problems and tends to lead to errors in
> implementation.  It amounts to hiding pseudo-variables inside the states,
> but not describing them.
> Thus, I would much prefer solution 2, but it is of course up to the WG.

Well, option 2 wouldn't be too hard. Option 3 would probably involve the most
change.
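
To make option 2 concrete, here is a rough sketch of how two "going
down" states keep the complete/incomplete distinction without a Boolean
pseudo variable. All state and event names below are hypothetical
simplifications for illustration, not the states in the draft, and the
choice to return to the prior active state on revocation is an
assumption:

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical, simplified states; the draft's table has more.
    ACTIVE_COMPLETE = auto()
    ACTIVE_INCOMPLETE = auto()
    GOING_DOWN_WAS_COMPLETE = auto()
    GOING_DOWN_WAS_INCOMPLETE = auto()
    DOWN = auto()


class Event(Enum):
    # Hypothetical events, treated here as strict events, not conditions.
    GO_DOWN = auto()
    GO_DOWN_REVOKED = auto()
    TIMEOUT = auto()


# The "was complete" / "was incomplete" split lives in the state itself,
# so no side variable is needed to remember it.
TRANSITIONS = {
    (State.ACTIVE_COMPLETE, Event.GO_DOWN): State.GOING_DOWN_WAS_COMPLETE,
    (State.ACTIVE_INCOMPLETE, Event.GO_DOWN): State.GOING_DOWN_WAS_INCOMPLETE,
    # Assumption: revoking the shutdown returns to the prior active state.
    (State.GOING_DOWN_WAS_COMPLETE, Event.GO_DOWN_REVOKED): State.ACTIVE_COMPLETE,
    (State.GOING_DOWN_WAS_INCOMPLETE, Event.GO_DOWN_REVOKED): State.ACTIVE_INCOMPLETE,
    (State.GOING_DOWN_WAS_COMPLETE, Event.TIMEOUT): State.DOWN,
    (State.GOING_DOWN_WAS_INCOMPLETE, Event.TIMEOUT): State.DOWN,
}


def step(state, event):
    """Return the next state; pairs not listed leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

With this shape the "go down" event is consumed once, and the timeout
alone completes the move to Down.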

> ...
>
>>> Minor Issues:
>>> In section 2.3.3 describing the state transitions for push servers,
>>> there is an event (event 1) described as "the server was Down but
>>> is now Up."  The state transition diagram describes this as being a
>>> valid event that does not change the server's state if the server
>>> is in any state other than "Down." In one sense, this is
>>> reasonable, saying that such an event is harmless.  I would however
>>> expect some sort of logging or administrative notification, as
>>> something in the system is quite confused.
>>
>> Again, I see your point but it seems to me to be a matter of state
>> machine style. Note that the "event" is described as a condition, so
>> from that point of view, it is true anytime the state is other than
>> Down. On the other hand, if you view it as strictly an event, you
>> are left with the question of what to put at the intersection of a
>> state and event in the table when it is impossible for that event to
>> occur in that state. Some people note this with an "N/A" (not
>> applicable) entry. In fact, previous TRILL state diagrams such as in
>> RFC 7177 use "N/A" so it would probably be simplest to change to
>> that for consistency.
>
> I think N/A would be good.

OK.
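
As an illustration of what an "N/A" entry could mean to an implementer,
here is a small sketch in which an impossible state/event pair is
logged rather than silently accepted. The table contents, the
"server_up" event name, and the "Active" next state are hypothetical
placeholders, not taken from the draft:

```python
import logging

log = logging.getLogger("push_directory")

NA = "N/A"  # marks a state/event intersection that should never occur

# Hypothetical fragment of a transition table for the "server was Down
# but is now Up" event; the real table in the draft has more states.
TABLE = {
    ("Down", "server_up"): "Active",  # hypothetical next state
    ("Active", "server_up"): NA,
    ("Going Stand-By", "server_up"): NA,
}


def next_state(state, event):
    """Look up the next state; log and stay put on an N/A entry."""
    entry = TABLE[(state, event)]
    if entry == NA:
        # The event cannot legitimately occur here, so something is
        # confused; record it rather than treating it as harmless.
        log.warning("event %r is N/A in state %r", event, state)
        return state
    return entry
```

Whether an N/A hit is logged, counted, or ignored would be a local
implementation matter.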

> ...
>
>>> Text in section 3.2.2.1 on lifetimes and the information
>>> maintenance in section 3.3 imply that the clients and servers must
>>> maintain a connection. Presumably, this is required already by the
>>> RBridge Channel protocol, and I understand that we should not
>>> repeat the entire protocol here.  It would seem to make the
>>> reader's life MUCH simpler if the text noted that the RBridge
>>> Channel protocol requires that there be a maintained connection
>>> between the client and the server, and that these mechanisms
>>> leverage the presence of that connection.
>>
>> The basic RBridge Channel protocol [RFC7178] is a datagram protocol
>> rather than a connection protocol. So there is no guaranteed
>> continuity of connection between RBridges that have previously
>> exchanged RBridge Channel messages. But connection would only be
>> lost if the network partitions since RBridge Channel messages look
>> like data packets to any transit RBridges and will get forwarded as
>> long as there is any route. Network partition is immediately visible
>> in the link state database to the RBridges at both ends of an
>> RBridge Channel exchange.  Section 3.7 provides that if a Pull
>> Directory is no longer reachable (i.e., RBridge Channel protocol
>> packets would no longer get through), then all pull responses from
>> that Pull Directory MUST be discarded since cache consistency update
>> messages can't get through.
>> Perhaps a reference to Section 3.7 should be added to Section 3.3.
>
> I don't think a reference to 3.7 is sufficient, although it is helpful.
> If the protocol is a datagram protocol, and if it is important to discard
> data from unreachable pull servers, then I think 3.7 NEEDS to say more than
> just ~if you happen to magically figure out you can't reach the server,
> discard data it has given you.~  From the rest of the text, this is an
> important and unspecified protocol mechanism.

Figuring out whether/how you can reach other RBridges is a basic
function of TRILL IS-IS based routing, not something "magical".
Whenever there is a topology change, an RBridge MUST determine routes
to all data reachable RBridges in the new topology. If there was an
RBridge previously reachable but no longer reachable, as would be the
case for all RBridges on the other side of a network partition, this
MUST be noticed so that, for example, all MAC reachability information
associated with each of the no longer reachable RBridges can be discarded.
It does not seem like much of a stretch to believe that an RBridge would
keep track of the Pull Directory or Directories it was using, each of
which will be some other RBridge, and notice when a topology change
makes any of them inaccessible. But I have no problem adding some
wording to make this clearer.
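
A minimal sketch of the bookkeeping I have in mind, assuming the
RBridge can obtain the set of currently reachable RBridges after each
route computation. The class, the method names, and the use of plain
strings for nicknames and cached data are all hypothetical, for
illustration only:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PullCache:
    """Cached pull responses, grouped by the Pull Directory that sent them."""

    by_directory: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def store(self, directory: str, key: str, value: str) -> None:
        # Remember which Pull Directory (an RBridge) supplied each entry.
        self.by_directory.setdefault(directory, {})[key] = value

    def on_topology_change(self, reachable: Set[str]) -> List[str]:
        """Discard all responses from directories that are no longer reachable.

        `reachable` is assumed to come from the normal IS-IS route
        computation that runs after any topology change.
        """
        lost = [d for d in self.by_directory if d not in reachable]
        for d in lost:
            # Cache consistency updates can no longer arrive from d,
            # so everything it supplied has to go.
            del self.by_directory[d]
        return lost
```

For example, after a partition that isolates one directory:

```python
cache = PullCache()
cache.store("RB7", "mac-a", "RB3")
cache.store("RB9", "mac-b", "RB4")
cache.on_topology_change({"RB9", "RB3", "RB4"})  # RB7 entries are discarded
```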

> ...
> In the flooding flag and behavior, (long text elided) I don't think there is
> anything wrong with the intended behavior.  It is just that the very brief
> description of the FL flag leads the reader to an incorrect expectation.
> Yes, it gets sorted out, but that is not good.  What I would suggest is when
> the flag is defined (with whatever name you choose) note that "for the
> qtypes 2,3,and 4, the flag indicates that the server should flood its
> response."

We can work on clarifying the wording.

Thanks,
Donald
=============================
 Donald E. Eastlake 3rd   +1-508-333-2270 (cell)
 155 Beaver Street, Milford, MA 01757 USA
 d3e3e3@gmail.com
