Re: [dhcwg] draft-ietf-dhc-dhcpv6-bulk-leasequery in IESG review

>My recommendation would be to just keep the two usages separate and
>avoid saying things that make people think that if TCP fails, you need
>to failover to using UDP. You don't. (An earlier message suggested
>that.)

I think the earlier message was just trying to convey this. I don't
believe the I-D actually says any such thing (I took a quick look and
didn't see anything like this, but perhaps I missed it).

>Well, I would argue that you might well want to think about rate
>limiting UDP lease queries generally. Consider a DOS attack using

Rate limiting is always a good idea (for UDP). This would be an issue
for RFC 5007 (and 4388 for DHCPv4).

The security considerations section of RFC 5007 says:

   Since even trusted access concentrators may generate LEASEQUERY
   requests as a result of activity external to the access concentrator,
   access concentrators SHOULD minimize potential denial-of-service
   attacks on the DHCPv6 servers by minimizing the generation of
   LEASEQUERY messages.  In particular, the access concentrator SHOULD
   employ negative caching (i.e., cache the fact that a particular
   recent query failed to return client data) and address restrictions
   where possible (i.e., don't send a LEASEQUERY message for addresses
   outside the range of the attached broadband access networks).
   Together, these mechanisms limit the access concentrator to
   transmitting one LEASEQUERY message (excluding message retries) per
   legitimate broadband access network address after a reboot event.

   Packet-flooding denial-of-service attacks can result in the
   exhaustion of processing resources, thus preventing the server from
   serving legitimate and regular DHCPv6 clients as well as legitimate
   DHCPv6 LEASEQUERY requestors, denying configurations to legitimate
   DHCPv6 clients as well lease information to legitimate DHCPv6
   LEASEQUERY requestors.  While these attacks are unlikely when only
   communicating with trusted LEASEQUERY requestors, the possibility
   always exists that the trust is misplaced, security techniques are
   compromised, or even trusted requestors can have bugs in them.
   Therefore, techniques for defending against packet-flooding denial of
   service are always a good idea, and they include good perimeter
   security, as mentioned earlier, and rate limiting DHCPv6 traffic by
   relay agents, other network elements, or the server itself.

So, I think this issue is pretty well covered there?
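
As a rough illustration of the negative caching and address restriction
described in that text, here is a minimal Python sketch. The class name,
the negative_ttl value, and the example prefixes are all assumptions made
for illustration; RFC 5007 does not specify any of them.

```python
import ipaddress
import time


class LeasequeryGate:
    """Decides whether an access concentrator should send a LEASEQUERY.

    Hypothetical helper: the name and the negative_ttl default are
    illustrative assumptions, not values taken from RFC 5007.
    """

    def __init__(self, attached_networks, negative_ttl=300.0, clock=time.monotonic):
        self.networks = [ipaddress.ip_network(n) for n in attached_networks]
        self.negative_ttl = negative_ttl
        self.clock = clock
        self._failed = {}  # address -> time the failed query was recorded

    def should_query(self, address):
        addr = ipaddress.ip_address(address)
        # Address restriction: never query for addresses outside the
        # attached broadband access networks.
        if not any(addr in net for net in self.networks):
            return False
        # Negative caching: skip addresses whose recent query failed.
        failed_at = self._failed.get(addr)
        if failed_at is not None:
            if self.clock() - failed_at < self.negative_ttl:
                return False
            del self._failed[addr]  # entry expired, allow a new query
        return True

    def record_failure(self, address):
        """Remember that a query for this address returned no client data."""
        self._failed[ipaddress.ip_address(address)] = self.clock()
```

A caller would check should_query() before sending, and call
record_failure() when the server reports no binding for the address.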

>So one thing I don't quite get. You said earlier that in some cases,
>UDP is insufficient because traffic won't arrive at the CMTS to
>generate the query. Hence, Bulk Query is needed. But here you say,
>Bulk Query isn't deployed yet. Does that mean that current deployed
>technology/standards has real holes (i.e., lead to permanent
>failures)?

We've worked on this issue with the RAAN proposal
(draft-droms-dhc-dhcpv6-agentopt-delegate which has expired). This is
really only an issue with Prefix Delegation (RFC 3633), and when
non-aggregated prefixes are delegated. This probably hasn't been common
yet, since a way to recover these leases is still a missing piece. UDP LQ (RFC
5007) can be used to recover the prefix delegations, but only if traffic
is received that has an address contained by a delegated prefix. But, if
the prefix needs to be advertised before traffic is received ... There's
a catch-22. Hence, one of the key reasons Bulk LQ is required. An
alternative is to use some routing protocol between the requesting and
delegating router.
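
The catch-22 can be shown with a small Python sketch. Everything here is
hypothetical: the BINDINGS table stands in for the delegating server's
state, and query_by_address() stands in for a UDP query by address. The
point is only that a prefix which never sees traffic is never learned.

```python
import ipaddress

# Hypothetical server-side bindings: delegated prefix -> requesting router.
BINDINGS = {
    ipaddress.ip_network("2001:db8:aa00::/56"): "router-a",
    ipaddress.ip_network("2001:db8:bb00::/56"): "router-b",
}


def query_by_address(address):
    """Stand-in for a UDP query by address.

    Returns the delegated prefix containing the address, or None.
    """
    addr = ipaddress.ip_address(address)
    for prefix in BINDINGS:
        if addr in prefix:
            return prefix
    return None


def recover_on_demand(packet_destinations):
    """Learn delegated prefixes only as traffic for them arrives."""
    learned = set()
    for dest in packet_destinations:
        addr = ipaddress.ip_address(dest)
        if any(addr in prefix for prefix in learned):
            continue  # already known, no query needed
        prefix = query_by_address(dest)
        if prefix is not None:
            learned.add(prefix)
    return learned
```

If traffic only ever arrives for addresses under 2001:db8:aa00::/56, the
second prefix stays unknown and cannot be advertised, which is the gap a
bulk query is meant to close.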

I think there really are two key issues for Bulk LQ with respect to the
TCP connections:

1. Requestor (relay agent) - if a request was initiated but responses
are not forthcoming in a "timely" fashion.

There are three primary causes:
A) The server is unable to communicate -- it is down.
B) The network prevents communication.
C) The server is "slow" (busy) and not responding in a timely fashion.

A is clearly the one that is important to detect. But, it is often
difficult to distinguish between the three.

Clearly a timeout is appropriate here as TCP could allow a long period
of time (hours) to go by before reporting a problem. A user would
generally give up (of course, a user would likely use recent experience
to determine whether the wait is appropriate or not).

2. Server - if a connection is open but no outstanding request exists on
it.

After some period of time, a server is again within its right to close
the connection to "conserve" resources.
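
For the server side, a minimal Python sketch of that bookkeeping might
look like the following. The class, its method names, and the idle_limit
default are assumptions for illustration only; no idle period is named
here or in the draft parameters quoted below.

```python
import time


class IdleTracker:
    """Tracks which open connections have no outstanding request.

    Hypothetical helper; idle_limit is an illustrative value.
    """

    def __init__(self, idle_limit=300.0, clock=time.monotonic):
        self.idle_limit = idle_limit
        self.clock = clock
        self._conns = {}  # conn_id -> [outstanding_requests, last_activity]

    def opened(self, conn_id):
        self._conns[conn_id] = [0, self.clock()]

    def request_started(self, conn_id):
        entry = self._conns[conn_id]
        entry[0] += 1
        entry[1] = self.clock()

    def request_finished(self, conn_id):
        entry = self._conns[conn_id]
        entry[0] -= 1
        entry[1] = self.clock()

    def closable(self):
        """Return connections with no outstanding request that have been
        idle for at least idle_limit seconds."""
        now = self.clock()
        return [
            conn_id
            for conn_id, (pending, last) in self._conns.items()
            if pending == 0 and now - last >= self.idle_limit
        ]
```

A connection with a request still in progress is never reported as
closable, however long the request takes.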

I believe the main concern is not about disallowing the above actions,
but what is a reasonable time period to allow TCP "to do its thing" and
avoid re-doing work by reopening a connection and repeating a request
(perhaps with the same result). In particular the following parameters
are the issue:

   Parameter             Default  Description
   ------------------------------------------
   BULK_LQ_CONN_TIMEOUT  30 secs  Bulk Leasequery connection timeout
   BULK_LQ_DATA_TIMEOUT  30 secs  Bulk Leasequery data timeout

I suspect that removing BULK_LQ_CONN_TIMEOUT would be fine as TCP
already handles this (usually with a 90 second timeout). The
BULK_LQ_MAX_RETRY parameter (60 seconds) I don't think should be an
issue -- waiting at most a minute before retrying the connection seems
reasonable, as recovering the data quickly is a goal and thus waiting
too long prevents that.

For BULK_LQ_DATA_TIMEOUT, 30 seconds is likely too short. Making this
several minutes would be far better. Perhaps a value of 5 minutes would
be a reasonable default, with recommendations not to make it too short
as it prevents TCP from doing its thing. While larger values are also
possible, I think if it takes 5 minutes to get data through, something
is very wrong and there are bigger issues (retrying the connection with
the same result is not going to cost a lot every 5 minutes). (If a user
were initiating this, I suspect they would long have given up -- do you
wait 5 minutes for a web page to load??)
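
A minimal Python sketch of how a requestor might apply such a data
timeout on the TCP connection. The function name read_reply() is
hypothetical, and the 300 second default is the value suggested above,
not the draft's 30 seconds. The timer applies to each read, so it
restarts whenever data arrives.

```python
import socket

# Suggested default from the discussion above (5 minutes); an assumption,
# not the value in the draft.
BULK_LQ_DATA_TIMEOUT = 300.0


def read_reply(sock, nbytes, data_timeout=BULK_LQ_DATA_TIMEOUT):
    """Read exactly nbytes from a connected TCP socket.

    Returns the bytes read, or None if no data arrives within
    data_timeout seconds or the peer closes the connection first.
    """
    sock.settimeout(data_timeout)  # applies to each recv() call below
    chunks = []
    remaining = nbytes
    while remaining > 0:
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            return None  # idle for too long; caller decides whether to retry
        if not chunk:
            return None  # peer closed the connection
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

On a None result the caller would close the connection and decide
whether to reconnect and repeat the request.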

- Bernie

-----Original Message-----
From: Thomas Narten [mailto:narten@us.ibm.com] 
Sent: Wednesday, December 17, 2008 5:05 PM
To: Bernie Volz (volz)
Cc: Dhcwg; Mark Stapp (mjs); Lars Eggert; IESG
Subject: Re: [dhcwg] draft-ietf-dhc-dhcpv6-bulk-leasequery in IESG
review

Bernie, thanks again for the long explanation. I have no issue with
the scenario you describe.

> The TCP connection (bulk leasequery) is designed to recover all of the
> leases quickly.

> The UDP leasequery is used when a packet arrives that a router
> (typically CMTS which is both a router and relay agent) has no
> information for (i.e., it doesn't know which client behind which CM is
> using that address). In this case, it asks the DHCP server (via a UDP
> leasequery) about that individual lease (query by address). This
> happens, on demand, when traffic arrives. This is the ONLY recovery
> technique that exists today. The rate at which these requests are
> generated is limited by traffic arrival (and hopefully rate limiting).
> Once information is known (either via TCP bulk LQ or UDP LQ), no
> further queries are needed.

So, the usage of TCP vs. UDP is completely unrelated. That is, you
don't start with TCP and then if that fails, try UDP or anything like
that.

Rather, on system restart, you use TCP, because you are trying to
restore lost state. If, while you are waiting, you happen to get a
packet you don't know what to do with, you issue a UDP query just for
the one address. That seems fine, since this is what you do normally,
independent of whether any TCP query is happening in background.  An
important point here is that there is no need to tie the UDP & TCP
behavior together - they are logically independent, invoked at
mostly unrelated times.

My recommendation would be to just keep the two usages separate and
avoid saying things that make people think that if TCP fails, you need
to failover to using UDP. You don't. (An earlier message suggested
that.)

But, that does again raise the question of what do you do if TCP
"fails"? You presumably do either nothing (in which case terminating
the TCP connection early seems pointless), or you restart the TCP
query (because you still have to restore all that missing state, and
UDP queries can't do that... you only use UDP for specific, narrow
queries). Again, IMO, you should only restart the TCP connection if
you have good reason to think that retrying would help. It would go
against long-standing IETF practice to try to restart more
aggressively than TCP already does. IMO, you need a compelling
argument to go down this path. So I just don't think you need to
terminate a TCP connection early. I don't see how that helps the relay
agent at all.

I would just describe the two usages separately and not tie them
together.

> The general model is that likely BOTH TCP (for bulk and PD) and UDP (on
> demand when no information exists) leasequery would be used. While bulk
> (TCP) is loading the configuration, traffic may arrive and thus generate
> a UDP LQ as well. TCP is only used to seed the information initially --
> after successful bulk LQ, future information is learned by monitoring
> DHCPv6 traffic.

Works for me.

> The UDP LQ are extremely efficient -- after all, the DHCP server is all
> about address management and looking up an address is very easy and
> quick for it.

> If the server is down, what difference does it make if TCP or UDP are
> being used? Sure, traffic is being generated, but it isn't going to be
> answered.

Well, I would argue that you might well want to think about rate
limiting UDP lease queries generally. Consider a DOS attack using
addresses in a packet that the CMTS gets. Do you really want the CMTS
pounding on the server for *each* of those packets? Or, consider when
it gets a train of legitimate TCP packets from a (pre-existing)
BitTorrent connection. Do you want each packet to cause a UDP query
to the server? I would think not... I.e., the same issue happens with
ARP/ND, and they are typically rate limited to prevent storms in this
scenario...
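
One way to picture that kind of limiting is the Python sketch below,
which combines two ideas: at most one outstanding query per address, and
a token bucket capping the overall query rate. The class name and the
rate and burst defaults are illustrative assumptions, not values from
any RFC or draft.

```python
import time


class QueryLimiter:
    """Limits on-demand UDP lease queries.

    Hypothetical helper; rate (queries per second) and burst are
    illustrative defaults.
    """

    def __init__(self, rate=10.0, burst=20, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._tokens = float(burst)
        self._last = clock()
        self._pending = set()  # addresses with a query already in flight

    def allow(self, address):
        """Return True if a query for this address may be sent now."""
        if address in self._pending:
            return False  # a packet train must not trigger repeat queries
        now = self.clock()
        # Refill the bucket for the time elapsed, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens < 1.0:
            return False  # overall rate exceeded, drop or defer
        self._tokens -= 1.0
        self._pending.add(address)
        return True

    def answered(self, address):
        """Call when a reply (or final timeout) for the address arrives."""
        self._pending.discard(address)
```

With this in place, a burst of packets to one unknown address produces a
single query, and a flood across many addresses is capped at the
configured rate.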

> In which case, the relay will not have information and be
> forced to drop packets on the floor. If the relay only relied on TCP, it
> would be forced to wait until potentially ALL data was downloaded (it
> may be the last lease that is the one needed "now"). And, remember that
> bulk LQ does not yet exist -- so UDP is being used today.

So one thing I don't quite get. You said earlier that in some cases,
UDP is insufficient because traffic won't arrive at the CMTS to
generate the query. Hence, Bulk Query is needed. But here you say,
Bulk Query isn't deployed yet. Does that mean that current deployed
technology/standards has real holes (i.e., lead to permanent
failures)?

> Also, remember that this is all between components (relay agents/routers
> and DHCP servers) and not thousands of users on the Internet. Traffic
> provisioning in an operator's network needs to accommodate this.

I hear you, but that argument has the problem that no technology
controls where it gets used in practice, so it's not good for our
protocols to rely on this for proper behavior.

Thomas