Re: [nvo3] Role of ARP/RARP

I agree with your point in general, but the nvo3 solution is not limited to NVEs on hypervisors. It needs to work over already-deployed devices too.
Lucy

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Ivan Pepelnjak
Sent: Tuesday, July 24, 2012 11:09 AM
To: Lucy yong; david.black@emc.com
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Role of ARP/RARP

Lucy,

I don’t care where this WG eventually decides the MAC-to-something encapsulation (or translation) is done, but it will have to be done somewhere.

I’m just saying that there are not that many ToR switches deployed in today’s data centers that could do the encapsulation into L3-something with existing hardware (Q-in-Q or MAC-in-MAC is more common, but out of NVO3 scope), so it might make more sense to solve the problem in hypervisor software.

Ivan

From: Lucy yong [mailto:lucy.yong@huawei.com]
Sent: Tuesday, July 24, 2012 5:51 PM
To: Ivan Pepelnjak; david.black@emc.com
Cc: nvo3@ietf.org
Subject: RE: [nvo3] Role of ARP/RARP

Ivan,

IMO, the charter text means that virtual network isolation should not rely on VLANs, but it does not mean that VLANs cannot be used for virtual access to a virtual network. In fact, this is a common mechanism for virtual access.
Lucy

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Ivan Pepelnjak
Sent: Tuesday, July 24, 2012 10:39 AM
To: david.black@emc.com
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Role of ARP/RARP

I totally agree with you (in principle), but the NVO3 charter says ...

“NVO3 will consider approaches to multi-tenancy that reside at the
network layer rather than using traditional isolation mechanisms
that rely on the underlying layer 2 technology (e.g., VLANs).”

Which, according to my understanding, means using something-in-IP or something-in-MPLS encapsulation, done either in hypervisor or ToR switches. I don’t think there are too many data centers out there (particularly in enterprise environment) where such an idea wouldn’t require a forklift ToR switch upgrade or a replacement of existing hypervisor vSwitch (Open vSwitch is a bit of an exception).

Now, if our focus should be viability of deployment in existing networks (and I totally agree with that idea), it might be best to solve the problem within the hypervisor (with a decent vSwitch) and use the physical hardware just to provide IP transport.

Did I take a wrong turn somewhere to reach this conclusion?

Thank you!
Ivan

From: david.black@emc.com [mailto:david.black@emc.com]
Sent: Tuesday, July 24, 2012 3:03 PM
To: ipepelnjak@gmail.com
Cc: nvo3@ietf.org
Subject: RE: [nvo3] Role of ARP/RARP

Ivan,

> Obviously we’ll have to re-architect the existing “squashed complexity sausage”
> ways of doing things, in which the networking industry tries to implement all
> sorts of kludges to accommodate simplistic VLAN/L2-only hypervisor switches.
> Evidently this approach led us nowhere (or we wouldn’t have this working group)
> and yesterday’s news told us VMware clearly recognized simplistic hypervisor
> switches are not the way to go. While re-architecting the existing mess, we
> just might admit that current bad practices are just that – bad.

That’s a fine “greenfield” or “blank sheet of paper” design approach, but I think
you’re ignoring the “running code” that uses ARP/RARP for data plane learning.
It would be good to have a means of coping with that “running code” while we
transition to this brave new world ...

For clarity, I don’t think this is an either/or situation, but rather one in which
both approaches have merits and both should be pursued.  Specifically, I’d expect
better results from an explicit associate message (e.g., it provides an explicit
check that the interface [VSI or vNIC] that is associating is the one that is
expected based on the pre-associate), and would agree that this should be the
primary design approach.  OTOH, the ability to accommodate existing ARP/RARP usage
as an implicit associate is likely to make what we design easier to incrementally
deploy into existing data centers, especially enterprise data centers, whose admins
tend to show the door to anyone who talks about complete (“forklift”) upgrades ;-).

For IPv6, ND can be used, but I’m not sure how much of that is deployed for this
data-plane learning purpose, and I wouldn’t advocate this unless we have a
significant deployed base of “running code” to deal with.

Also, at the risk of really attracting responses ;-), this simplification of
incremental deployment is a reason to consider ARP/RARP as an additional implicit
associate primitive for IP-only solutions (and see previous discussion about this
L2 approach working well on a single L2 link but less well as the L2 and L3
topology between End Device and NVE gets more complex).

Thanks,
--David

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Ivan Pepelnjak
Sent: Tuesday, July 24, 2012 3:41 AM
To: Black, David; truman@versionsix.org
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Role of ARP/RARP (was: Comments on draft-kompella-nvo3-server2nve)

*** Provided that *** we have a management channel that can send the pre-associate message from the orchestration tool to the NVE, that same management channel can also send the associate message (including the MAC address of the VM). Don’t use two protocols for two steps of the same process, and don’t rely on protocols that (as Truman noted) work for IPv4 only (see also http://tools.ietf.org/html/draft-george-ipv6-support-01).

Obviously we’ll have to re-architect the existing “squashed complexity sausage” ways of doing things, in which the networking industry tries to implement all sorts of kludges to accommodate simplistic VLAN/L2-only hypervisor switches. Evidently this approach led us nowhere (or we wouldn’t have this working group) and yesterday’s news told us VMware clearly recognized simplistic hypervisor switches are not the way to go. While re-architecting the existing mess, we just might admit that current bad practices are just that – bad.

Best,
Ivan

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of david.black@emc.com
Sent: Monday, July 23, 2012 11:49 PM
To: truman@versionsix.org
Cc: nvo3@ietf.org
Subject: [nvo3] Role of ARP/RARP (was: Comments on draft-kompella-nvo3-server2nve)

**Provided that** the pre-associate operation that selects the virtual network
for the VM and sets up access to it (e.g., VLAN between End Device and NVE),
is done via other (suitably secured) means, a gratuitous ARP or RARP on that
pre-configured access can be used to signal the VM’s arrival to use the network
that’s been preconfigured for it (i.e., the associate event).  As Linda noted,
there’s lots of running code that does this, but (as Truman comments) ARP/RARP
cannot be trusted to carry explicit parameters to select a virtual network
(that has to be done via some other means).
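
The split described above can be sketched roughly as follows, in Python. This is only an illustration: the class and method names (`Nve`, `pre_associate`, `on_gratuitous_arp`) are hypothetical and do not come from any draft, and the secured management channel is simply assumed to be whatever calls `pre_associate`. The point it shows is that the virtual network is chosen by the pre-configured access, never by anything carried in the ARP/RARP frame.

```python
from dataclasses import dataclass, field


@dataclass
class Nve:
    # (port, vlan) -> virtual network ID, installed by the pre-associate step.
    access_to_vn: dict = field(default_factory=dict)
    # (vnid, mac) -> port, filled in by associate events.
    attached: dict = field(default_factory=dict)

    def pre_associate(self, port, vlan, vnid):
        """Assumed to be invoked over a suitably secured channel, never by ARP."""
        self.access_to_vn[(port, vlan)] = vnid

    def on_gratuitous_arp(self, port, vlan, src_mac):
        """Treat a gratuitous ARP/RARP as an implicit associate.

        The virtual network is taken from the pre-configured access only;
        no field of the ARP frame is used to select it.
        """
        vnid = self.access_to_vn.get((port, vlan))
        if vnid is None:
            return None  # no pre-associate for this access: ignore the frame
        self.attached[(vnid, src_mac)] = port
        return vnid


nve = Nve()
nve.pre_associate(port=1, vlan=100, vnid=5000)
on_known_access = nve.on_gratuitous_arp(1, 100, "00:11:22:33:44:55")
on_unknown_access = nve.on_gratuitous_arp(2, 100, "00:11:22:33:44:55")
```

Here the first frame arrives on the pre-configured access and is accepted as an associate, while the second arrives on an access with no pre-associate and is ignored.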

Thanks,
--David

From: Truman Boyes [mailto:truman@versionsix.org]
Sent: Monday, July 23, 2012 1:03 PM
To: Ivan Pepelnjak
Cc: Linda Dunbar; Xuxiaohu; Black, David; kireeti.kompella@gmail.com; nvo3@ietf.org
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

On Mon, Jul 23, 2012 at 12:22 PM, Ivan Pepelnjak <ipepelnjak@gmail.com> wrote:
Just because millions of applications misuse a simplistic protocol in a way it was never designed to handle doesn’t make it a good idea. Not to mention the total lack of security.

@Xiaohu: how would you distinguish a gratuitous ARP sent from the hypervisor to indicate a VM move from a gratuitous ARP sent by a VM with a misconfigured IP address or a malicious gratuitous ARP sent by an intruder (physical or virtual)? Unless you can totally control the VM attachment point (= hypervisor switch unless you’re using something like 802.1BR) you cannot trust ARP ... but then if you do control the hypervisor switch, you don’t need ARP.

Ivan

Completely agree. ARP is the wrong protocol for any form of trusted signaling. Additionally, IPv6-only networks would present their own set of challenges if we relied on the IPv4 Address Resolution Protocol.

Truman

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Linda Dunbar
Sent: Monday, July 23, 2012 5:59 PM
To: Xuxiaohu; david.black@emc.com; kireeti.kompella@gmail.com
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Millions of deployed applications already use ARP to signal their presence. The widely deployed vMotion makes VMs in a new location send ARP (RARP) to inform the network of that location. It doesn’t hurt to make use of the messages that applications already send.

My two cents.

Linda Dunbar

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Xuxiaohu
Sent: Wednesday, July 18, 2012 9:48 PM
To: david.black@emc.com; kireeti.kompella@gmail.com
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Does that mean ARP could also be considered as an option for signaling VM attachment/detachment events? For example, a gratuitous ARP packet can be treated as an attachment event by the NVE that receives it via the NVE-TES interface. Meanwhile, for L2VPN (e.g., VPLS) or L3VPN overlay approaches that allow only one next hop for a given MAC route or host route in the forwarding table, a gratuitous ARP packet received from a remote NVE could be treated as a detachment event by the NVE to which the ARP-sending VM was previously attached. Moreover, if a gratuitous ARP packet triggers the NVE that received it via the NVE-TES interface to generate a MAC route or a host route for the ARP-sending VM, the NVE to which that VM was previously attached could, upon receiving that route, also treat it as a detachment event for that VM.
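
For readers who want the inference spelled out, here is a small Python sketch. It parses an Ethernet/IPv4 ARP body, detects a gratuitous ARP (sender and target protocol addresses equal), and maps it to an attach or detach event depending on where the frame arrived. The function names and the `local_macs` set are hypothetical, and the sketch deliberately says nothing about whether the frame can be trusted.

```python
import struct

# Ethernet/IPv4 ARP body: htype, ptype, hlen, plen, oper, sha, spa, tha, tpa.
ARP_FMT = "!HHBBH6s4s6s4s"
ARP_LEN = struct.calcsize(ARP_FMT)  # 28 bytes


def parse_arp(payload: bytes) -> dict:
    """Unpack the fixed-size ARP body that follows the Ethernet header."""
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FMT, payload[:ARP_LEN]
    )
    return {"oper": oper, "sha": sha, "spa": spa, "tha": tha, "tpa": tpa}


def is_gratuitous(arp: dict) -> bool:
    # A gratuitous ARP announces the sender's own address.
    return arp["spa"] == arp["tpa"]


def infer_event(arp: dict, from_local_access: bool, local_macs: set):
    """Return "attach", "detach", or None for one received ARP.

    local_macs is a hypothetical set of MACs this NVE believes are attached.
    """
    if not is_gratuitous(arp):
        return None
    if from_local_access:
        local_macs.add(arp["sha"])
        return "attach"
    if arp["sha"] in local_macs:
        # Same MAC announced from behind a remote NVE: the VM has moved away.
        local_macs.discard(arp["sha"])
        return "detach"
    return None


mac = bytes.fromhex("001122334455")
ip = bytes([10, 0, 0, 7])
garp = struct.pack(ARP_FMT, 1, 0x0800, 6, 4, 1, mac, ip, bytes(6), ip)
```

The same `garp` bytes yield "attach" when received on the local access and "detach" when later received from a remote NVE, which is exactly why the arrival interface, not the frame content, carries the meaning.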

Best regards,
Xiaohu

<skipped>
I have a related consideration based on thinking about this further.  The network
SHOULD NOT rely on dissociate messages always being sent - a server crash at the
wrong point during a VM migration may cause a dissociate to be missed (e.g., the
VM made it to S’, but S crashed before sending the dissociate).  More importantly,
not relying on the dissociate messages (in particular, not having the inter-NVE
control protocol rely on them) helps if one wants to mix hypervisors that support
the attach/detach protocol with (existing) ones that don’t.  For existing hypervisors,
under suitable restrictions and assuming some advance configuration, “associate”
can be inferred from a gratuitous ARP or RARP, but nothing is sent for dissociate.
The inference of “associate” won’t be possible if things have not been set up
to enable the gratuitous ARP or RARP.

Thanks,
--David

From: nvo3-bounces@ietf.org [mailto:nvo3-bounces@ietf.org] On Behalf Of Kireeti Kompella
Sent: Saturday, July 14, 2012 8:16 PM
To: Black, David
Cc: nvo3@ietf.org
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Hi David,

Thanks for your detailed comments!  More inline.
On Fri, Jul 13, 2012 at 1:01 PM, <david.black@emc.com> wrote:
Authors (Kireeti, Yakov and Thomas),

This is a good draft - it looks like a good foundation to focus discussion
around what the server-to-NVE (attach/detach) protocol needs to do.  I
like a lot of the contents - I have a few high level comments and some
more detailed feedback.

Thanks!

(1) This draft starts out dealing with the attach/detach (server-to-NVE)
protocol and then includes some material on the control protocol for
distributing and managing mapping information on the NVEs.  I suggest
focusing the draft on the attach/detach protocol, removing control
protocol discussion (e.g., Section 3), and minimizing assumptions about
the control protocol (see detailed comments for where I think assumptions
could be minimized).  The result should be more general and more useful.

About the control plane: it really concerns me that the control plane discussion has not happened so far (not really).  ARP doesn't scale; neither does flooding.  The goal here is to signal networking parameters: from server (vswitch) to local NVE to remote NVEs to remote servers.  Fine, call the local NVE to remote NVEs part "control plane" -- but that's a critical part of the picture.

What I take from your suggestion is to move the lNVE to rNVE part to a different draft; I buy that, especially if there are other mechanisms for doing this that can plug-in to this server2nve signaling, so that one can mix-and-match server2nve signaling and lNVE2rNVE signaling.  Does that seem reasonable?

(2) Section 2.2.3 on detach is trying to cover at least a couple of use
cases,  VM live migration, and VM removal (e.g., power-off) that probably
want to be separated. The current text really doesn't get the live migration
case right, D.4 comes before D.3 for the power-off case, and I think things
get more complex when the live migration detach functionality is corrected.

More on this below.

(3) Section 2.2.4 appears to assume a specific order of events between
the two servers involved in VM migration.  As those servers are operating
concurrently, that's not a robust assumption, and the NVE functionality
should be specified to not depend on the order of events.

Ordering assumptions weren't intended, so we'll tweak the wording to remove any such implications.

--- Detailed comments by section ---

A pre-disassociate operation is defined in section 2.2.1 but not used
in the rest of the draft.  Is it actually needed?

Good catch!  I'd put that in early, worked out the rest of the details, couldn't figure out a use for it, but forgot to remove it.  I'll remove it.

-- Section 2.2.2

   A.1:  Validate the authentication (if present).  If not, inform the
         provisioning system, log the error, and stop processing the
         associate message.

This step should also include an optional authorization check, as network
policy may limit which NVEs are allowed to participate in which VNs.

Okay.  Authorization locally, or from the provisioning system?  (Or either?)
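
One way to picture A.1 with the added authorization step is the Python sketch below. It rests on several assumptions: the draft text quoted here does not say how authentication is done, so an HMAC-SHA256 tag is used purely as a stand-in, and the message fields (`auth`, `body`, `nve`, `vnid`) and the `authorize` callback are hypothetical. Passing the check as a callback leaves open whether the decision is local or comes from the provisioning system.

```python
import hashlib
import hmac


def check_a1(msg, key, authorize=None, log=None):
    """A.1 sketch: authenticate if a tag is present, then optionally authorize.

    Returns True if processing of the associate message may continue.
    The log list stands in for "inform the provisioning system, log the error".
    """
    log = log if log is not None else []
    tag = msg.get("auth")
    if tag is not None:
        expected = hmac.new(key, msg["body"], hashlib.sha256).hexdigest()
        if not hmac.compare_digest(tag, expected):
            log.append("authentication failed")
            return False
    # Optional policy check: may this NVE participate in this VN?
    if authorize is not None and not authorize(msg["nve"], msg["vnid"]):
        log.append("not authorized")
        return False
    return True


key = b"shared-secret"
body = b"associate vm1"
msg = {
    "nve": "nve1",
    "vnid": 5000,
    "body": body,
    "auth": hmac.new(key, body, hashlib.sha256).hexdigest(),
}
policy = {("nve1", 5000)}  # hypothetical local policy table
ok = check_a1(msg, key, authorize=lambda nve, vnid: (nve, vnid) in policy)
```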

   A.3:  If the VID in the associate message is non-zero, look up <VNID,
         P>.  If the result is zero, or equal to VID, all's well.
         Otherwise, respond to S with an error, and stop processing the
         associate message.

Why is a zero VID lookup result ok for a non-zero VID in the associate
message?

Just means no mapping yet.  With respect to the refcounting suggested below, good place to set it to 1; otherwise increment.

Should the NVE copy the VID from the associate message to the
<VNID,P> entry before responding?

Good point.  Will fix.

   A.5:  Communicate with each rNVE device to advertise the VM's
         addresses, and also to get the addresses of other VMs in the
         DCVPN.  Populate the table with the VM's addresses and
         addresses learned from each rNVE.

This assumes that the control protocol does active propagation of all
address info, and assumes that no other addresses for the VN are present
in the NVE.  Neither of those are good general assumptions, IMHO, and
in particular, lazy evaluation is possible (e.g., load address mappings
on demand to reduce the amount of invalidation traffic caused by
each mapping change).

I'm leery of on-demand/cache-based address mappings and lazy evaluation (love it in general, but not for address mappings).  However, you're right: there may be cases where this is a valid approach.

I'd suggest rephrasing to something like:

   A.5:  Use the overlay control protocol to inform the network of the
         VM's addresses and the VM's association with this NVE.

Something like that.  Will work on text.
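
To make the lazy-evaluation alternative concrete, here is a minimal Python sketch of on-demand mapping. The `resolve` callback stands in for whatever lookup the overlay control protocol would offer; it and the class name are assumptions, not part of the draft. The sketch shows that a mapping change then only needs invalidating at NVEs that actually loaded it.

```python
class OnDemandMappings:
    """Per-NVE cache of (vnid, mac) -> remote NVE, loaded on first use."""

    def __init__(self, resolve):
        self.resolve = resolve  # hypothetical control-plane lookup
        self.cache = {}

    def lookup(self, vnid, mac):
        key = (vnid, mac)
        hit = self.cache.get(key)
        if hit is None:
            hit = self.resolve(vnid, mac)
            if hit is not None:
                self.cache[key] = hit  # misses are not cached
        return hit

    def invalidate(self, vnid, mac):
        # Only needed at NVEs that have cached this mapping.
        self.cache.pop((vnid, mac), None)


directory = {(5000, "m1"): "192.0.2.1"}  # stand-in for the mapping authority
calls = []


def resolve(vnid, mac):
    calls.append((vnid, mac))
    return directory.get((vnid, mac))


nve = OnDemandMappings(resolve)
first = nve.lookup(5000, "m1")
second = nve.lookup(5000, "m1")  # served from the cache, no second resolve
```

The trade-off Kireeti is wary of is visible here: until `invalidate` is called, the NVE keeps forwarding to whatever location it cached.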

-- Section 2.2.3

   D.1:  Validate the authentication (if present).  If not, inform the
         provisioning system, log the error, and stop processing the
         associate message.

Like A.1, this should include an optional authorization check, as some
<VNID,P> -> VID mappings may be statically configured and hence not
permit removal.

Okay, will copy wording from there once we've agreed on it.

   D.2:  If the hold time is non-zero, point the VM's addresses in the
         VNID table to the new location of the VM, if known, or to
         "discard", and start a timer for the period of the hold time.
         If hold time is zero, immediately perform step D.4, then go to
         D.3.

This is where the power-off and migration cases start to interact -
Hold time would be zero for power-off, non-zero for detach.  For migration,
this change potentially races with a change to the VM's addresses received
via the control protocol, so the VM's address may already point somewhere
else if the control protocol did its update before the dissociate (in
which case nothing should be done to those addresses).

Definitely worth looking at again, especially with respect to your comments about the order for migration.

With regard to the race condition, I'll send a separate email on that.
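
A rough Python sketch of D.2 with the guard David describes: if the control protocol has already pointed the VM's address elsewhere, the dissociate leaves it alone. Time is passed in explicitly instead of using real timers. The class and method names are hypothetical, and cancelling the pending delete when a control-protocol update arrives is an assumption of this sketch, not something the draft states.

```python
DISCARD = "discard"


class VnidTable:
    def __init__(self, local_nve):
        self.local_nve = local_nve
        self.next_hop = {}  # mac -> NVE name or DISCARD
        self.expiry = {}    # mac -> time at which the entry is deleted (D.4)

    def associate(self, mac):
        self.next_hop[mac] = self.local_nve
        self.expiry.pop(mac, None)

    def control_update(self, mac, nve):
        # Location learned via the control protocol.
        self.next_hop[mac] = nve
        self.expiry.pop(mac, None)  # assumption: a real location cancels the delete

    def dissociate(self, mac, hold_time, now, new_location=None):
        if self.next_hop.get(mac) != self.local_nve:
            return  # already updated by the control protocol: do nothing
        if hold_time == 0:
            del self.next_hop[mac]  # power-off: perform D.4 immediately
            return
        self.next_hop[mac] = new_location if new_location is not None else DISCARD
        self.expiry[mac] = now + hold_time

    def expire(self, now):
        for mac, deadline in list(self.expiry.items()):
            if now >= deadline:
                del self.expiry[mac]
                self.next_hop.pop(mac, None)


# Control-protocol update arrives before the dissociate.
t1 = VnidTable("nve1")
t1.associate("vm")
t1.control_update("vm", "nve2")
t1.dissociate("vm", hold_time=30, now=0)

# Dissociate arrives first, update later.
t2 = VnidTable("nve1")
t2.associate("vm")
t2.dissociate("vm", hold_time=30, now=0)
t2.control_update("vm", "nve2")
t2.expire(now=100)
```

Under these assumptions both tables end up pointing the VM at `nve2`, whichever message arrived first.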

   D.3:  Set the VID for <VNID, P> as unassigned.  Respond to S saying
         that the operation was successful.

If there are multiple VMs using the VNID on that port, this
"pulls the rug" out from under the others by disabling their forwarding.
This <VNID,P> -> VID mapping needs a reference count of some form, and
corresponding changes would be needed to A.2 and A.3.  Not using a
reference count may be ok under the assumption that the NVE does not
share ports among VMs (or VSIs/vNICs), but that may not be a good
assumption for an external NVE (e.g., in a ToR switch).

Good point!  I'll go with refcounting.
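
The refcounting idea could look something like this in Python. It is a sketch only: `VidMap` and its methods are hypothetical names, and it assumes every associate carries a non-zero VID, since the quoted A.3 text does not cover the zero case. The first associate sets the count to 1 and later ones increment it, as suggested earlier in this message.

```python
class VidMap:
    """<VNID, P> -> VID mapping with a reference count per entry."""

    def __init__(self):
        self.vid = {}   # (vnid, port) -> VID
        self.refs = {}  # (vnid, port) -> number of attached VMs

    def associate(self, vnid, port, vid):
        # Assumes vid is non-zero (see the note above).
        key = (vnid, port)
        current = self.vid.get(key, 0)  # 0 means "no mapping yet"
        if current not in (0, vid):
            return False  # A.3: conflicting VID, respond with an error
        self.vid[key] = vid
        self.refs[key] = self.refs.get(key, 0) + 1
        return True

    def dissociate(self, vnid, port):
        key = (vnid, port)
        if key not in self.refs:
            return False
        self.refs[key] -= 1
        if self.refs[key] == 0:
            # D.3: last user gone, so the VID becomes unassigned.
            del self.refs[key]
            del self.vid[key]
        return True


m = VidMap()
m.associate(5000, "p1", 100)  # first VM on this port
m.associate(5000, "p1", 100)  # second VM sharing <VNID, P>
m.dissociate(5000, "p1")      # one VM leaves; the mapping must survive
```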

   D.4:  When the hold timer expires, delete the VM's addresses from the
         VNID table.  Delete any VM-specific network policies associated
         with any of the VM addresses.  If the VNID table is empty after
         deleting the VM's addresses, optionally delete the table and
         any network policies for the VNID.

Well, that's the right thing to do in the power-off case, but not
when the VM has moved and there are other VMs on this NVE (possibly even
the same port) that still need to communicate with the moved VM.  Also,
the power-off case needs to include (at least optionally) informing the
control protocol of the withdrawal of the VM's addresses.

See separate email.

As noted in (2) above, I think it would be clearer if there were separate
versions of 2.2.3 for the migration departure and power-down use cases.

Perhaps.  Let's get the semantics right first, then see if there are common elements or not.

-- Section 2.2.4

   M.3:  S then gets a request to terminate the VM on S.

   M.4:  Finally, S' gets a request to start up the VM on S'.

Not exactly ;-).

Terminating the VM on S (and destroying its state) before confirming
its startup on S' risks losing the VM entirely if something goes wrong
on S'.

Interesting point.  However, if the VM starts on S' without first being stopped on S, then (for some time) both S and S' are running, and I'd think that the results would be unpredictable, especially if the VM is just about to engage in some I/O.  However, I'll bow to those who've implemented VM migration and know what they're doing.  Perhaps the VM is paused on S, started on S'; if that's successful, the VM is destroyed on S, otherwise the migration is aborted and the VM is continued on S.  I'd like to know, as this affects the "tentative address changes" you talk about below, and dealing with migration abort.

This level of detail isn't necessary - from the point of view
of the network:
- Startup on S' generates an associate request to the NVE for S'.
- The dissociate request from S to its NVE may occur before or after
        that S' associate request
- The dissociate request from S to its NVE may occur before or after
        control protocol propagation of the results of the S' associate
        request to the NVE for S.
The server-to-NVE functionality should be specified to operate properly
independent of the order of these events.

Agreed.  Separate email thread to work this out.
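
As a sanity check on the order-independence requirement, the Python sketch below enumerates every causally valid ordering of the three events listed above and confirms that a simple guard rule reaches the same final state. The model is deliberately tiny and hypothetical: one VM, two NVEs, and the assumption that the control-protocol update does eventually arrive.

```python
from itertools import permutations

EVENTS = ["associate_S'", "propagate", "dissociate_S"]


def run(order):
    """Replay one event ordering; return what each NVE ends up believing."""
    nve_s = {"vm": "NVE-S"}  # the NVE for S starts with the VM attached locally
    nve_s_prime = {}
    for event in order:
        if event == "associate_S'":
            nve_s_prime["vm"] = "NVE-S'"
        elif event == "propagate":
            # Control protocol delivers the S' associate result to the NVE for S.
            nve_s["vm"] = "NVE-S'"
        elif event == "dissociate_S":
            # Guard: only touch the entry if it still points at the local NVE.
            if nve_s["vm"] == "NVE-S":
                nve_s["vm"] = "discard"
    return nve_s["vm"], nve_s_prime["vm"]


# The update cannot be propagated before the associate that causes it.
valid_orders = [
    order
    for order in permutations(EVENTS)
    if order.index("associate_S'") < order.index("propagate")
]
outcomes = {run(order) for order in valid_orders}
```

With the guard in place `outcomes` collapses to a single state; removing the guard makes the ordering in which the dissociate arrives last end in "discard" instead.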

   PA.5:  Communicate with each rNVE device to advertise the VM's
      addresses but as non-preferred destinations(*).  Also get the
      addresses of other VMs in the DCVPN.  Populate the table with the
      VM's addresses and addresses learned from each rNVE.

That assumes aggressive push of the new address information by the
control protocol directly to the rNVEs - while a control protocol
may choose to do that, it's not strictly necessary and the interaction
may not be directly between the lNVE and the rNVEs.  Generalizing in
a fashion similar to A.5, I'd suggest something like:

   PA.5:  The overlay control protocol may be used to inform the
      network of the forthcoming change to the VM's addresses
      that will occur when the VM is associated with this NVE.

Okay, something like.

If this is done, withdrawal of the tentative address changes
needs to be discussed, as VM migrations can abort for a variety
of reasons (e.g., S' may crash during the copy).  This PA.5
step can be skipped for a control protocol that only does on-demand
provisioning of the address mapping information.

Interesting thought.  Will follow up once we get the migration "right" (for some value of right).
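
A small Python sketch of what "tentative address changes" and their withdrawal might look like from a remote NVE's point of view. Everything here is an assumption made for illustration (the `RemoteView` class, its method names, and the rule that a tentative location is never used for forwarding); the draft text quoted above does not define these semantics.

```python
class RemoteView:
    """What one remote NVE knows about the location of a single VM."""

    def __init__(self, current):
        self.current = current  # location used for forwarding
        self.tentative = None   # announced by PA.5, not yet in use

    def pre_associate(self, nve):
        # PA.5: the VM is expected to show up behind this NVE.
        self.tentative = nve

    def associate(self, nve):
        # Migration completed: the new location takes over.
        self.current = nve
        self.tentative = None

    def withdraw(self, nve):
        # Migration aborted (e.g., S' crashed during the copy).
        if self.tentative == nve:
            self.tentative = None

    def next_hop(self):
        return self.current


aborted = RemoteView(current="NVE-S")
aborted.pre_associate("NVE-S'")
aborted.withdraw("NVE-S'")

completed = RemoteView(current="NVE-S")
completed.pre_associate("NVE-S'")
completed.associate("NVE-S'")
```

In the aborted case forwarding never leaves `NVE-S`; in the completed case it switches only when the associate arrives.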

-- Section 3

This appears to be entirely about the control protocol and (IMHO)
doesn't fit well with the rest of the draft.

Will discuss putting this in a separate draft with co-authors.

Thanks again for the detailed comments!
Kireeti.

Thanks,
--David
----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA  01748
+1 (508) 293-7953             FAX: +1 (508) 293-7786
david.black@emc.com        Mobile: +1 (978) 394-7754
----------------------------------------------------

_______________________________________________
nvo3 mailing list
nvo3@ietf.org
https://www.ietf.org/mailman/listinfo/nvo3

--
Kireeti
