Re: [DNSOP] Spencer Dawkins' Discuss on draft-ietf-dnsop-session-signal-12: (with DISCUSS and COMMENT)

Ted Lemon <mellon@fugue.com> Tue, 31 July 2018 18:41 UTC

From: Ted Lemon <mellon@fugue.com>
Message-Id: <F1CE7C6A-DE44-45F4-8CF4-E49766CFCEA5@fugue.com>
Date: Tue, 31 Jul 2018 14:41:25 -0400
In-Reply-To: <153266600019.24802.9316144897968330271.idtracker@ietfa.amsl.com>
Cc: The IESG <iesg@ietf.org>, draft-ietf-dnsop-session-signal@ietf.org, Tim Wicinski <tjw.ietf@gmail.com>, dnsop-chairs@ietf.org, dnsop@ietf.org
To: Spencer Dawkins <spencerdawkins.ietf@gmail.com>
References: <153266600019.24802.9316144897968330271.idtracker@ietfa.amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/VLC8gikDvrF3lf3N1ILBcWRpOqU>

On Jul 27, 2018, at 12:33 AM, Spencer Dawkins <spencerdawkins.ietf@gmail.com> wrote:
> I really like this document, and think it's headed the right direction. Of
> course I have four pages of comments, because reasons, but the only part I'm
> really confused about is this one ...
> 
> I would have thought that if you end up with a different endpoint because your
> anycast address now resolves differently, the new endpoint would have to have
> shared a lot of state with the previous endpoint, for this to work:
> 
>  When an anycast service is configured on a particular IP address and
>   port, it must be the case that although there is more than one
>   physical server responding on that IP address, each such server can
>   be treated as equivalent.  If a change in network topology causes
>   packets in a particular TCP connection to be sent to an anycast
>   server instance that does not know about the connection, the normal
>   keepalive and TCP connection timeout process will allow for recovery.
> 
> What I would have expected to happen, is that the new endpoint sees a packet
> arrive that's not on a synchronized TCP connection, and immediately responds
> with a RST (reset), rather than the normal keepalive and TCP connection timeout
> process happening. That's also the way I'm reading
> https://tools.ietf.org/html/rfc7828#section-3.6. Is that not the way it's
> working for anycast these days?

I believe this is correct; thanks for pointing it out. I think the fix is straightforward:

  When an anycast service is configured on a particular IP address and
  port, it must be the case that although there is more than one
  physical server responding on that IP address, each such server can
  be treated as equivalent.  If a change in network topology causes
  packets in a particular TCP connection to be sent to an anycast
  server instance that does not know about the connection, the new
  server will automatically terminate the connection with a TCP reset,
  since it will have no record of the connection, and then the client can
  reconnect or stop using the connection, as appropriate.

Does that sound good?
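
To make the client side of that concrete, here is a minimal sketch of "reconnect after a reset". Everything in it is hypothetical: `ReconnectingClient`, the `connect` callable, and `max_attempts` are illustrative names, not anything from the draft. It assumes the reset surfaces as `ConnectionResetError` on a send; a real client could equally see it on a read, and would have to set up any session state again on the new connection, since the new server instance has no record of the old one.

```python
import socket
from typing import Callable


class ReconnectingClient:
    """Hypothetical client wrapper: reconnect when the peer resets the connection."""

    def __init__(self, connect: Callable[[], socket.socket]) -> None:
        self._connect = connect      # assumed factory returning a connected socket
        self._conn = connect()
        self.reconnects = 0

    def send(self, payload: bytes, max_attempts: int = 2) -> None:
        for attempt in range(max_attempts):
            try:
                self._conn.sendall(payload)
                return
            except ConnectionResetError:
                # An instance that does not know this connection answered
                # with a TCP reset; drop the dead socket.
                self._conn.close()
                if attempt == max_attempts - 1:
                    raise            # stop using the connection
                self._conn = self._connect()
                self.reconnects += 1
```

Whether to reconnect or give up is left to the caller through `max_attempts`, matching the "reconnect or stop using the connection, as appropriate" wording.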

> This is a nit, and your answer could be "no", and that's fine, but in some
> places this document uses "DSO keepalive", and in other places, "keepalive"
> with no qualifier. It's likely that less confusion would result if you could
> consistently call this "DSO keepalive", so that it is clearly NOT a TCP
> keepalive. Do the right thing, of course.

Of course. I changed every occurrence of "keepalive traffic" to "DSO keepalive traffic" to avoid confusion with TCP keepalives, and similarly for "DSO Keepalive message". I left "keepalive" alone where the context was obvious (e.g., maintaining keepalive timers). I changed all occurrences of "DSO Keepalive TLV" to just "Keepalive TLV", because this obviously does not refer to the TCP keepalive option. In one paragraph "keepalive traffic" occurred five or six times; there I added "DSO" to the first instance and left the rest alone, since repeating it every time would be unnecessarily clunky.

> Is the expectation that DSO would also be used in DNS over HTTP? I'm reading
> 
>  At the time of publication, DSO is specified only for DNS over TCP
>   [RFC1035] [RFC7766], and for DNS over TLS over TCP [RFC7858].  Any
>   use of DSO over some other connection technology needs to be
>   specified in an appropriate future document.
> 
> and noticing that https://tools.ietf.org/html/draft-ietf-doh-dns-over-https-12
> is currently in IETF Last Call.

No, the transport there is HTTP, not DNS-over-TCP, so I don't see how that could work.

> This next one is well within the "Spencer wouldn't have done it this way, but
> Spencer's not the working group, or the IETF" range, but
> 
>  However, in the typical case a server will not know in advance
>   whether a client supports DSO, so in general, unless it is known in
>   advance by other means that a client does support DSO, a server MUST
>   NOT initiate DSO request messages or DSO unacknowledged messages
>   until a DSO Session has been mutually established by at least one
>   successful DSO request/response exchange initiated by the client, as
>   described below.  Similarly, unless it is known in advance by other
>   means that a server does support DSO, a client MUST NOT initiate DSO
>   unacknowledged messages until after a DSO Session has been mutually
>   established.
> 
> seems fragile, especially in environments where clients can come and go, and
> servers may be addressed using anycast (so I knew in advance that the four
> servers at that anycast address supported DSO, but somebody installed a fifth
> server that does not). Is that unlikely to be a problem?

The idea here is that it would be very unusual for it to make sense to violate the MUST NOT, but we can imagine such situations (e.g., the DNSSD Discovery Relay), so we shouldn't absolutely forbid them. In this case, DSO is being used between two servers that are manually configured. TBH, however, I would be perfectly okay with taking this out: we don't actually use this in the Discovery Relay, and I'm hard pressed to think of a situation where it makes sense.

> I'm sure
> 
>  A single server may support multiple services, including DNS Updates
>   [RFC2136], DNS Push Notifications [I-D.ietf-dnssd-push], and other
>   services, for one or more DNS zones.  When a client discovers that
>   the target server for several different operations is the same target
>   hostname and port, the client SHOULD use a single shared DSO Session
>   for all those operations.  A client SHOULD NOT open multiple
>   connections to the same target host and port just because the names
>   being operated on are different or happen to fall within different
>   zones.  This requirement is to reduce unnecessary connection load on
>   the DNS server.
> 
> is correct from the server side, but perhaps it's also worth noting that using
> multiple TCP connections unnecessarily increases the chances that data
> transfers happen during TCP slow start. If only one or two packets are being
> exchanged, that doesn't matter, but as more packets are exchanged, the
> difference increases, because congestion windows will grow more rapidly if
> fewer connections are used.

I've made this change:

-This requirement is to reduce unnecessary connection load on the DNS server.
+This requirement has two benefits.
+First, it reduces unnecessary connection load on the DNS server.
+Second, it avoids paying the TCP slow start penalty when making subsequent
+connections to the same server.
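
As a rough illustration of the second benefit, here is a back-of-the-envelope sketch. It assumes an idealized slow start (no loss, the congestion window doubling every round trip) and an initial window of 10 segments; both are assumptions made for the arithmetic, not something the draft specifies, and `round_trips` is a hypothetical helper.

```python
def round_trips(segments: int, initial_window: int = 10) -> int:
    """Round trips needed to send `segments` in idealized slow start."""
    cwnd = initial_window
    sent = 0
    rtts = 0
    while sent < segments:
        sent += cwnd      # send a full window this round trip
        cwnd *= 2         # window doubles once the ACKs come back
        rtts += 1
    return rtts


# 140 segments on one warm connection versus 70 on each of two fresh ones.
one_connection = round_trips(140)
two_connections = 2 * round_trips(70)
```

Under these assumptions `one_connection` is 4 round trips and `two_connections` is 6, before counting the extra TCP handshake. Real stacks differ (for example, the window may be reduced again after an idle period), so treat the numbers as a direction, not a measurement.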


> I appreciate the inclusion of 5.4.  DSO Response Generation
> 
> But I've gotta ask. In the last paragraph of that section, I see
> 
>   o  Use a networking API that lets the receiver signal to the TCP
>      implementation that the receiver has received and processed a
>      client request for which it will not be generating any immediate
>      response.  This allows the TCP implementation to operate
>      efficiently in both cases; for requests that generate a response,
>      the TCP ACK, window update, and DSO response are transmitted
>      together in a single TCP segment, and for requests that do not
>      generate a response, the application-layer software informs the
>      TCP implementation that it should go ahead and send the TCP ACK
>      and window update immediately, without waiting for the Delayed ACK
>      timer.  Unfortunately it is not known at this time which (if any)
>      of the widely-available networking APIs currently include this
>      capability.
> 
> I would love to know if there are any widely-available network APIs that
> include this capability, before including this text in a standards-track RFC.
> Do you need help chasing this down?

:)

I'm going to leave this one for Mirja's DISCUSS.

> 
> The text in 6.1.  DSO Session Initiation seems rough to me, for a couple of
> reasons.
> 
>   The client may perform as many DNS operations as it wishes using the
>   newly created DSO Session.  Operations SHOULD be pipelined (i.e., the
> 
> I don't understand why this would be a SHOULD. At least from the client's
> perspective, it's not needed for interoperation.
> 
>   client doesn't need wait for a response before sending the next
>   message).  The server MUST act on messages in the order they are
>   transmitted, but responses to those messages SHOULD be sent out of
>   order when appropriate.
> 
> Is it correct to say that "responses to those messages SHOULD be sent when they
> become available, even if the responses are sent out of order"? If not, I'm
> probably missing what "when appropriate" means.

Good points.   I've made the following change:

 The client may perform as many DNS operations as it wishes using the
-newly created DSO Session. Operations SHOULD be pipelined (i.e., the
-client doesn't need wait for a response before sending the next message).
+newly created DSO Session. When the
+client has multiple messages to send, it SHOULD NOT wait for each response before sending the next message.
+This prevents TCP's delayed acknowledgement algorithm from forcing the
+client into a slow lock-step.
 The server MUST act on messages in the order they are transmitted, but
-responses to those messages SHOULD be sent out of order when appropriate.
+when responses to those messages become available out of order, the server
+SHOULD NOT delay sending available responses in order to respond in order.
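
For what it's worth, here is a small sketch of what that implies for a client: it sends everything up front and pairs responses with requests by DNS message ID, in whatever order they arrive. `match_responses` and the tuple shapes are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def match_responses(requests: List[Tuple[int, str]],
                    responses: Iterable[Tuple[int, str]]) -> Dict[str, str]:
    """Pair each response with its request by message ID, ignoring arrival order."""
    # All requests are already on the wire; nothing waits for a response.
    outstanding = {msg_id: name for msg_id, name in requests}
    results: Dict[str, str] = {}
    for msg_id, answer in responses:
        name = outstanding.pop(msg_id, None)
        if name is None:
            # No matching request. This sketch just skips it; a real
            # client would need a policy for this case.
            continue
        results[name] = answer
    return results
```

For example, requests with IDs 1, 2, 3 answered in the order 3, 1, 2 still end up paired correctly, which is all the client needs when the server does not hold back available responses.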

> I'm a bit mystified by this text in 6.2.  DSO Session Timeouts
> 
>  In the usual case where the inactivity timeout is shorter than the
>   keepalive interval, it is only when a client has a very long-lived,
>   low-traffic, operation that the keepalive interval comes into play,
>   to ensure that a sufficient residual amount of traffic is generated
>   to maintain NAT and firewall state and to assure client and server
>   that they still have connectivity to each other.
> 
> I think the basics are correct - the inactivity timer and (DSO) keepalive
> interval are independent - but I'm struggling to think of a reason to send
> (DSO) keepalives that's NOT tied to maintaining NAT/firewall state, and there's
> a lot of text before the paragraph that mentions NAT/firewall, that talks about
> why either interval might be longer or shorter than the other, without
> considering NAT/firewall. Am I missing something here?
> 
> ... and, now that I keep reading, 6.5.2.  Values for the Keepalive Interval
> does a much better job of explaining how a (DSO) keepalive interval should be
> selected - I think you could reasonably delete most of the text about (DSO)
> keepalive intervals in section 6.2, and at most provide a forward pointer to
> 6.5.2.

I am very sympathetic to the goal of reducing the amount of text in the document, and we certainly could make this change; indeed, I just went and did it. However, having done it, I looked a little closer, and I think the text in 6.2 actually says useful things that the text in 6.5.2 doesn't say; I don't think these sections are redundant with each other. So I wound up not making this change after all.

> (As an aside, I think you probably want to cite
> https://tools.ietf.org/html/bcp142 as the operative recommendation for NAT
> behaviour toward TCP, since https://tools.ietf.org/html/rfc5382 has been
> updated)

Done.   Thanks for catching this.

> I found this text
> 
>  For long-lived DNS Stateful operations (such as a Push Notification
>   subscription [I-D.ietf-dnssd-push] or a Discovery Relay interface
>   subscription [I-D.ietf-dnssd-mdns-relay]), an operation is considered
>   in progress for as long as the operation is active, until it is
>   cancelled.  This means that a DSO Session can exist, with active
>   operations, with no messages flowing in either direction, for far
>   longer than the inactivity timeout, and this is not an error.  This
>   is why there are two separate timers: the inactivity timeout, and the
>   keepalive interval.  Just because a DSO Session has no traffic for an
>   extended period of time does not automatically make that DSO Session
>   "inactive", if it has an active operation that is awaiting events.
> 
> to be extremely helpful, but it's 28 pages into the document. Is there a place
> earlier in the document that describes these timers, where you could place this
> text? Maybe section 3/Terminology isn't the right place, but maybe there is a
> right place toward the front of the document.

I sort of agree, but I'm not sure how to address this.   I added the following in section 6.2:

 The first timeout value, the inactivity timeout, is the maximum time for which
-a client may speculatively keep a DSO Session open in the expectation that
+a client may speculatively keep a DSO Session open with no operations pending
+(e.g., an outstanding DNS Push request)
+in the expectation that
 it may have future requests to send to that server.

This is the first mention of the inactivity timeout, but it's only one page earlier.   Assuming that you still have any remembered state about this, can you talk about where you were confused earlier in the document?
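
In case it helps pin down the distinction, here is a minimal sketch of the two timers as independent checks. `SessionTimers`, its method names, and the idea of tracking "idle" and "silent" durations separately are assumptions made for illustration; they are not taken from the draft.

```python
from dataclasses import dataclass


@dataclass
class SessionTimers:
    """Hypothetical holder for the two independent DSO session timers."""
    inactivity_timeout: float   # seconds
    keepalive_interval: float   # seconds

    def client_should_close(self, idle_for: float, active_operations: int) -> bool:
        # A session with an active long-lived operation is never "inactive",
        # however long it has carried no traffic.
        return active_operations == 0 and idle_for >= self.inactivity_timeout

    def keepalive_due(self, silent_for: float) -> bool:
        # Separate question: has it been quiet long enough that some
        # traffic is needed to show the peer is still reachable?
        return silent_for >= self.keepalive_interval
```

With an inactivity timeout of 15 seconds and a (purely illustrative) keepalive interval of one hour, a session with one active subscription that has been silent for ten minutes is neither due to close nor due for a keepalive.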

> I'm not understanding why the SHOULDs are not MUSTs in this text:
> 
>  If, at any time during the life of the DSO Session, twice the
>   inactivity timeout value (i.e., 30 seconds by default), or five
>   seconds, if twice the inactivity timeout value is less than five
>   seconds, elapses without there being any operation active on the DSO
>   Session, the server SHOULD consider the client delinquent, and SHOULD
>   forcibly abort the DSO Session.
> 
> Perhaps part of my confusion is that I'm not sure what it means to "consider
> the client delinquent", but NOT to "forcibly abort the DSO session". But there
> are several "will forcibly abort"s in section 6.4.2, that sound more like MUST
> than SHOULD.

I agree, and have made this change. (Of course, others may disagree and revert it.)
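
The threshold in the quoted text boils down to "twice the inactivity timeout, but never less than five seconds". A tiny sketch, with `delinquency_threshold` as a hypothetical name:

```python
def delinquency_threshold(inactivity_timeout: float) -> float:
    """Seconds with no active operation before the server aborts the session."""
    # Twice the inactivity timeout, with a floor of five seconds.
    return max(2 * inactivity_timeout, 5.0)
```

With an inactivity timeout of 15 seconds this gives the 30 seconds the text mentions as the default; with a 1-second timeout the five-second floor applies.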

> I don't think the MUST NOT in
> 
>  Normally a server MUST NOT close a DSO Session with a client.  A
>   server only causes a DSO Session to be ended in the exceptional
>   circumstances outlined below.
> 
> is quite right. Given that you have a bulleted list of reasons why a server
> would violate the MUST not immediately following this sentence, you might want
> to say "Normally a server does not close" here.

Yup.    Actually, that text is completely unnecessary.   I've tweaked the paragraph as follows:

-Normally a server MUST NOT close a DSO Session with a client.
-A server only causes a DSO Session to be ended in the exceptional circumstances outlined below.
 In normal operation, closing a DSO Session is the client's responsibility.
 The client makes the determination of when to close a DSO
 Session based on an evaluation of both its own needs,
 and the inactivity timeout value dictated by the server.
+A server only causes a DSO Session to be ended in the exceptional circumstances outlined below.
