Re: [nvo3] Review of draft-ietf-nvo3-framework-02

Thank you for such a thorough review, Thomas. Speaking personally, I 
find myself in agreement with most of your comments.

But as WG co-chair, I'd really like to hear from the authors and other 
NVO3 contributors. Folks, please take time to read and comment on 
Thomas' review below (or at 
http://www.ietf.org/mail-archive/web/nvo3/current/msg02072.html).

Cheers,
-Benson

On 2/15/13 5:33 PM, Thomas Narten wrote:
> Below is a detailed review of the framework document.  Pretty much all
> of these are editorial -- I don't think I really have any issue with
> the substance. But there are a lot of suggestions for clarifying text
> and tightening up the terminology, language, etc.
>
> Thomas
>
> High level: what happened to adding text describing the "oracle"
> model, where there was a clear separation of the NVE, the oracle (and
> the notion of a federated oracle). as well as separate protocosl for
> the inter/intra-oracle control vs. server-to-nve control? Per the
> September interim meeting, this needs to be added.
>
>         [OVCPREQ] describes the requirements for a control plane protocol
>         required by overlay border nodes to exchange overlay mappings.
>
> The above document has been split into 2 documents. I think both
> should be referenced. Also, I think it would be good to not combine
> the two problem areas in the 2 drafts into one generic "control plane
> protocol" reference. They are very different problems with no overlap,
> and I think when folk talk about a "control plane", they are really
> referring to problem in draft-kreeger-nvo3-hypervisor-nve-cp-00.txt.
>
>>      1.2. General terminology
> Nit: On terminology, how about using consistent syntax/format of terms
> with abbreviation in parenthesis following the term itself, e.g.,
> something like
>
> OLD:
>
>         NVE: Network Virtualization Edge.
>
> NEW:
>        Network Virtualization Edge (NVE): ...
>
> Add the following (other documents use this term and we should define
> it here for use throughout NVO3):
>
>     Closed User Group (CUG): Another term for Virtual Network.
>
>
>>         NVE: Network Virtualization Edge. It is a network entity that sits
>>         on the edge of the NVO3 network. It implements network
>>         virtualization functions that allow for L2 and/or L3 tenant
>>         separation and for hiding tenant addressing information (MAC and IP
>>         addresses). An NVE could be implemented as part of a virtual switch
>>         within a hypervisor, a physical switch or router, or a network
>>         service appliance.
> Could be improved. How about:
>
>         NVO3 Network: on overlay network that provides an L2 or L3
>         service to Tenant Systems over an L3 underlay network, using
>         the architecture and protocols as defined by the NVO3 Working
>         Group.
>
>         Network Virtualization Edge (NVE). An NVE is the network entity
>         that implements network virtualization functions and sits on
>         the boundary between an NVO3 network and an underlying network.
>         The network-facing side of the NVE uses the underlying L3
>         network to tunnel frames to and from other NVEs. The
>         server-facing side of the NVE sends and receives Ethernet
>         Frames to and from individual Tenant Systems.  An NVE could be
>         implemented as part of a virtual switch within a hypervisor, a
>         physical switch or router, a Network Service Appliance, or be
>         split across multiple devices.
>
>
>>         VN: Virtual Network. This is a virtual L2 or L3 domain that belongs
>>         to a tenant.
> Better
>
>         Virtual Network (VN). A virtual network is a logical
>         abstraction of a physical network that provides network
>         services to a set of Tenant Systems.  To Tenant Systems, a
>         virtual network looks like a normal network (i.e., providing
>         unrestricted ethernet or L3 service), except that the only end
>         stations connected to the virtual network are those belonging
>         to a tenant's specific virtual network.
>     
>>         VNI: Virtual Network Instance. This is one instance of a virtual
>>         overlay network. It refers to the state maintained for a given VN on
>>         a given NVE. Two Virtual Networks are isolated from one another and
>>         may use overlapping addresses.
> Better
>
>         Virtual Network Instance (VNI): An specific instance of a
>         Virtual Network.
>
> (no need to say more)
>
>>         Virtual Network Context or VN Context: Field that is part of the
>>         overlay encapsulation header which allows the encapsulated frame to
>>         be delivered to the appropriate virtual network endpoint by the
>>         egress NVE. The egress NVE uses this field to determine the
>>         appropriate virtual network context in which to process the packet.
>>         This field MAY be an explicit, unique (to the administrative domain)
>>         virtual network identifier (VNID) or MAY express the necessary
>>         context information in other ways (e.g., a locally significant
>>         identifier).
> Better:
>
>         Virtual Network Context (VN Context): Field in overlay
>         encapsulation header that identifies the specific VN the packet
>         belongs to. The egress NVE uses the VN Context to deliver the
>         packet to the correct Tenant System.  The VM Context can be a
>         locally significant identifier having meaning only in
>         conjunction with additional information, such as the
>         destination NVE address. Alternatively, the VM Context can have
>         broader scope, e.g., be unique across the entire NVO3 network.
>         
>
>>         VNID:  Virtual Network Identifier. In the case where the VN context
>>         identifier has global significance, this is the ID value that is
>>         carried in each data packet in the overlay encapsulation that
>>         identifies the Virtual Network the packet belongs to.
> A VNID definition by itself doesn't seem all that helpful. (This term
> came from early on when some of us assumed that the Context ID always
> had non-local significance.)
>
> I think we may want to define some additional terms here. VNID by
> itself is not sufficient. Have a look at the terms "VN Alias", "VN
> Name" and "VN ID" in
> draft-kreeger-nvo3-hypervisor-nve-cp-00.txt. Those (or similar) terms
> should probably be moved into the framework document.
>
>>         Underlay or Underlying Network: This is the network that provides
>>         the connectivity between NVEs. The Underlying Network can be
>>         completely unaware of the overlay packets. Addresses within the
>>         Underlying Network are also referred to as "outer addresses" because
>>         they exist in the outer encapsulation. The Underlying Network can
>>         use a completely different protocol (and address family) from that
>>         of the overlay.
> We should say that for NVo3, the underlay is assumed to be IP
>
> Better:
>
>         Underlay or Underlying Network: the network that provides the
>         connectivity between NVEs and over which NVO3 packets are
>         tunneled. The Underlay Network does not need to be aware that
>         it is carrying NVO3 packets. Addresses on the Underlay Network
>         appear as "outer addresses" in encapsulated NVO3 packets. In
>         general, the Underlay Network can use a completely different
>         protocol (and address family) from that of the overlay. In the
>         case of NVO3, the underlay network will always be IP.
>
>>         Data Center (DC): A physical complex housing physical servers,
>>         network switches and routers, network service sppliances and
>>         networked storage. The purpose of a Data Center is to provide
>>         application, compute and/or storage services. One such service is
>>         virtualized infrastructure data center services, also known as
>>         Infrastructure as a Service.
> Should we add defn. for "network service appliance" ??
>
>>         Virtual Data Center or Virtual DC: A container for virtualized
>>         compute, storage and network services. Managed by a single tenant, a
>>         Virtual DC can contain multiple VNs and multiple Tenant Systems that
>>         are connected to one or more of these VNs.
> Tenant manages what is in the VDC. Network Admin manages all aspects
> of mapping the virtual components to physical components.
>
> Better:
>
>         Virtual Data Center (Virtual DC): A container for virtualized
>         compute, storage and network services. A Virtual DC is
>         associated with a single tenant, and can contain multiple VNs
>         and Tenant Systems connected to one or more of these VNs.
>
>>         VM: Virtual Machine. Several Virtual Machines can share the
>>         resources of a single physical computer server using the services of
>>         a Hypervisor (see below definition).
> Better (taken/adapted from RFC6820):
>
>     Virtual machine (VM):  A software implementation of a physical
>        machine that runs programs as if they were executing on a
>        physical, non-virtualized machine.  Applications (generally) do
>        not know they are running on a VM as opposed to running on a
>        "bare" host or server, though some systems provide a
>        paravirtualization environment that allows an operating systems or
>        application to be aware of the presences of virtualization for
>        optimization purposes.
>
>>         Hypervisor: Server virtualization software running on a physical
>>         compute server that hosts Virtual Machines. The hypervisor provides
>>         shared compute/memory/storage and network connectivity to the VMs
>>         that it hosts. Hypervisors often embed a Virtual Switch (see below).
> Compute server is not defined and the term isn't used elsewhere in the
> document. How about:
>
>         Hypervisor: Software running on a server that allows multiple
>         VMs to run on the same physical server. The hypervisor provides
>         shared compute/memory/storage and network connectivity to the
>         VMs that it hosts. Hypervisors often embed a Virtual Switch
>         (see below).
>
> Also, add (for completeness)
>
>        Server: A physical end host machine that runs user
>        applications. A standalone (or "bare metal") server runs a
>        conventional operating system hosting a single tenant
>        application. A virtualized server runs a hypervisor supporting
>        one or more VMs.
>         
>
>>         Virtual Switch: A function within a Hypervisor (typically
>>         implemented in software) that provides similar services to a
>>         physical Ethernet switch.  It switches Ethernet frames between VMs
>>         virtual NICs within the same physical server, or between a VM and a
>>         physical NIC card connecting the server to a physical Ethernet
>>         switch or router. It also enforces network isolation between VMs
>>         that should not communicate with each other.
> slightly better:
>
>         Virtual Switch (vSwitch): A function within a Hypervisor
>         (typically implemented in software) that provides similar
>         services to a physical Ethernet switch.  A vSwitch forwards
>         Ethernet frames between VMs running on the same server, or
>         between a VM and a physical NIC card connecting the server to a
>         physical Ethernet switch. A vSwitch also enforces network
>         isolation between VMs that by policy are not permitted to
>         communicate with each other (e.g., by honoring VLANs).
>
>>         Tenant: In a DC, a tenant refers to a customer that could be an
>>         organization within an enterprise, or an enterprise with a set of DC
>>         compute, storage and network resources associated with it.
> Better:
>
>         Tenant: The customer using a virtual network and any associated
>         resources (e.g., compute, storage and network).  A tenant could
>         be an enterprise, a department, an organization within an
>         enterprise, etc.
>
>
>>         Tenant System: A physical or virtual system that can play the role
>>         of a host, or a forwarding element such as a router, switch,
>>         firewall, etc. It belongs to a single tenant and connects to one or
>>         more VNs of that tenant.
> Better:
>
>         Tenant System: A physical or virtual host associated with a specific
>         tenant.  A Tenant System can play the role of a host, or a
>         forwarding element such as a router, switch, firewall, etc. A
>         Tenant System is associated with a specific tenant and connects
>         to one or more of the tenant's VNs.
>
>>         End device: A physical system to which networking service is
>>         provided. Examples include hosts (e.g. server or server blade),
>>         storage systems (e.g., file servers, iSCSI storage systems), and
>>         network devices (e.g., firewall, load-balancer, IPSec gateway). An
>>         end device may include internal networking functionality that
>>         interconnects the device's components (e.g. virtual switches that
>>         interconnect VMs running on the same server). NVE functionality may
>>         be implemented as part of that internal networking.
> Better:
>
>          End device: A physical device that connects directly to the
>          data center Underlay Network. An End Device is administered by
>          the data center operator rather than a tenant and is part of
>          data center infrastructure. An End Device may implement NVO3
>          technology in support of NVO3 functions. Contrast with Tenant
>          System, which is only connected to a Virtual Network.
>          Examples include hosts (e.g. server or server blade), storage
>          systems (e.g., file servers, iSCSI storage systems), and
>          network devices (e.g., firewall, load-balancer, IPSec
>          gateway).
>
>
>
>>         ELAN: MEF ELAN, multipoint to multipoint Ethernet service
> I'd suggest dropping these terms. They are barely used and are not
> critical to understanding the framework
>
>>         EVPN: Ethernet VPN as defined in [EVPN]
> Remove. These terms are not used elswhere in the document.
>
>>      1.3. DC network architecture
>>
>>         A generic architecture for Data Centers is depicted in Figure 1:
>>
>>                                      ,---------.
>>                                    ,'           `.
>>                                   (  IP/MPLS WAN )
>>                                    `.           ,'
>>                                      `-+------+'
>>                                   +--+--+   +-+---+
>>                                   |DC GW|+-+|DC GW|
>>                                   +-+---+   +-----+
>>                                      |       /
>>                                      .--. .--.
>>                                    (    '    '.--.
>>                                  .-.' Intra-DC     '
>>                                 (     network      )
>>                                  (             .'-'
>>                                   '--'._.'.    )\ \
>>                                   / /     '--'  \ \
>>                                  / /      | |    \ \
>>                            +---+--+   +-`.+--+  +--+----+
>>                            | ToR  |   | ToR  |  |  ToR  |
>>                            +-+--`.+   +-+-`.-+  +-+--+--+
>>                             /     \    /    \   /       \
>>                          __/_      \  /      \ /_       _\__
>>                   '--------'   '--------'   '--------'   '--------'
>>                   :  End   :   :  End   :   :  End   :   :  End   :
>>                   : Device :   : Device :   : Device :   : Device :
>>                   '--------'   '--------'   '--------'   '--------'
>>                                            
>>                   Figure 1 : A Generic Architecture for Data Centers
> The above is not necessarily what a DC looks like. ARMD went through
> this already, and there are many different data center network types.
> For example, the above doesn't allow for a chassis with an embedded
> switch between the End Device and ToR.
>
> This picture should be generalized to use terms like "access layer",
> and "aggregation layer" rather than specific terms like ToR. Have a
> look at RFC 6820.
>
> Feel free to grab text or just point there.
>
>>         An example of multi-tier DC network architecture is presented in
>>         this figure. It provides a view of physical components inside a DC.
> s/this figure/Figure 1/
>
>>         A cloud network is composed of intra-Data Center (DC) networks and
>>         network services, and inter-DC network and network connectivity
>>         services. Depending upon the scale, DC distribution, operations
>>         model, Capex and Opex aspects, DC networking elements can act as
>>         strict L2 switches and/or provide IP routing capabilities, including
>>         service virtualization.
> Do we really need to use the term "cloud network" and say what a
> "cloud network" is? The term does not seem to be used elsewhere in the
> document...
>
>>         In some DC architectures, it is possible that some tier layers
>>         provide L2 and/or L3 services, are collapsed, and that Internet
>>         connectivity, inter-DC connectivity and VPN support are handled by a
>>         smaller number of nodes. Nevertheless, one can assume that the
>>         functional blocks fit in the architecture above.
> Per above, see how the ARMD document handled this.
>
>>      1.4. Tenant networking view
>>
>>         The DC network architecture is used to provide L2 and/or L3 service
>>         connectivity to each tenant. An example is depicted in Figure 2:
>>          
>>
>>                           +----- L3 Infrastructure ----+
>>                           |                            |
>>                        ,--+--.                      ,--+--.
>>                  .....( Rtr1  )......              ( Rtr2  )
>>                  |     `-----'      |               `-----'
>>                  |     Tenant1      |LAN12      Tenant1|
>>                  |LAN11         ....|........          |LAN13
>>            ..............        |        |     ..............
>>               |        |         |        |       |        |
>>              ,-.      ,-.       ,-.      ,-.     ,-.      ,-.
>>             (VM )....(VM )     (VM )... (VM )   (VM )....(VM )
>>              `-'      `-'       `-'      `-'     `-'      `-'
>>                                            
>>              Figure 2 : Logical Service connectivity for a single tenant
>>
>>         In this example, one or more L3 contexts and one or more LANs (e.g.,
>>         one per application type) are assigned for DC tenant1.
> This picture is unclear. what does "tenant 1" cover in the picture?
> What is an "L3 context"? I would assume it needs to refer to specific
> VMs too...
>
>>         For a multi-tenant DC, a virtualized version of this type of service
>>         connectivity needs to be provided for each tenant by the Network
>>         Virtualization solution.
> I would assume  NVO3 only cares about the multi-tenant case. Which
> makes me wonder what the previous example is supposed to show. A
> single tenant case? What does that mean?
>
>>      2. Reference Models
>>
>>      2.1. Generic Reference Model
>>
>>         The following diagram shows a DC reference model for network
>>         virtualization using L3 (IP/MPLS) overlays where NVEs provide a
>>         logical interconnect between Tenant Systems that belong to a
>>         specific tenant network.
>>
> Should the above say "that belong to a specific tenant's Virtual
> Network"?
>
>>       
>>               +--------+                                    +--------+
>>               | Tenant +--+                            +----| Tenant |
>>               | System |  |                           (')   | System |
>>               +--------+  |    ...................   (   )  +--------+
>>                           |  +-+--+           +--+-+  (_)
>>                           |  | NV |           | NV |   |
>>                           +--|Edge|           |Edge|---+
>>                              +-+--+           +--+-+
>>                              / .                 .
>>                             /  .   L3 Overlay +--+-++--------+
>>               +--------+   /   .    Network   | NV || Tenant |
>>               | Tenant +--+    .              |Edge|| System |
>>               | System |       .    +----+    +--+-++--------+
>>               +--------+       .....| NV |........
>>                                     |Edge|
>>                                     +----+
>>                                       |
>>                                       |
>>                             =====================
>>                               |               |
>>                           +--------+      +--------+
>>                           | Tenant |      | Tenant |
>>                           | System |      | System |
>>                           +--------+      +--------+
> s/NV Edge/NVE/ for consistency
>
>>       
>>            Figure 3 : Generic reference model for DC network virtualization
>>                             over a Layer3 infrastructure
>>
>>         A Tenant System can be attached to a Network Virtualization Edge
>>         (NVE) node in several ways:
> Each of these ways should be clearly labeled in Figure3.
>
>>           - locally, by being co-located in the same device
> add something like: (e.g., as part of the hypervisor)
>
>>           - remotely, via a point-to-point connection or a switched network
>>           (e.g., Ethernet)
>>
>>         When an NVE is local, the state of Tenant Systems can be provided
>>         without protocol assistance. For instance, the operational status of
>>         a VM can be communicated via a local API. When an NVE is remote, the
>>         state of Tenant Systems needs to be exchanged via a data or control
>>         plane protocol, or via a management entity.
> Better:
>
>         When an NVE is co-located with a Tenant System, communication
>         and synchronization between the TS and NVE takes place via
>         software (e.g., using an internal API). When an NVE and TS are
>         separated by an access link, interaction and synchronization
>         between an NVE and TS require an explicit data plane, control
>         plane, or management protocol.
>
>>         The functional components in Figure 3 do not necessarily map
>>         directly with the physical components described in Figure 1.
>>
>>         For example, an End Device can be a server blade with VMs and
>>         virtual switch, i.e. the VM is the Tenant System and the NVE
>>         functions may be performed by the virtual switch and/or the
>>         hypervisor. In this case, the Tenant System and NVE function are co-
>>         located.
>>
>>         Another example is the case where an End Device can be a traditional
>>         physical server (no VMs, no virtual switch), i.e. the server is the
>>         Tenant System and the NVE function may be performed by the ToR.
> We should not use the term "ToR" here. We should be more generic and
> say something like the "attached switch" or "access switch".
>
>>         The NVE implements network virtualization functions that allow for
>>         L2 and/or L3 tenant separation and for hiding tenant addressing
>>         information (MAC and IP addresses), tenant-related control plane
>>         activity and service contexts from the underlay nodes.
> We should probably  define "tenant separation" earlier in the document and then
> just refer to that defintion. Add something like the following to the
> defintions?:
>
>      Tenant Separation: Tenant Separation refers to isolating traffic
>      of different tenants so that traffic from one tenant is not
>      visible to or delivered to another tenant, except when allowed by
>      policy. Tenant Separation also refers to address space separation,
>      whereby different tenants use the same address space for different
>      virtual networks without conflict.
>
>
>>      2.2. NVE Reference Model
>>         One or more VNIs can be instantiated on an NVE. Tenant Systems
>>         interface with a corresponding VNI via a Virtual Access Point
>>         (VAP).
> Define VAP in the terminology section.
>
>>         An overlay module that provides tunneling overlay functions (e.g.,
>>         encapsulation and decapsulation of tenant traffic from/to the tenant
>>         forwarding instance, tenant identification and mapping, etc), as
>>         described in figure 4:
> Doesn't quite parse. Better:
>
>          An overlay module on the NVE provides tunneling overlay
>          functions (e.g., encapsulation and decapsulation of tenant
>          traffic from/to the tenant forwarding instance, tenant
>          identification and mapping, etc), as described in figure 4:
>
>>                            +------- L3 Network ------+
>>                            |                         |
>>                            |       Tunnel Overlay    |
>>               +------------+---------+       +---------+------------+
>>               | +----------+-------+ |       | +---------+--------+ |
>>               | |  Overlay Module  | |       | |  Overlay Module  | |
>>               | +---------+--------+ |       | +---------+--------+ |
>>               |           |VN context|       | VN context|          |
>>               |           |          |       |           |          |
>>               |  +--------+-------+  |       |  +--------+-------+  |
>>               |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  |
>>          NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
>>               |    |   VAPs     |    |       |    |    VAPs   |     |
>>               +----+------------+----+       +----+-----------+-----+
>>                    |            |                 |           |
>>             -------+------------+-----------------+-----------+-------
>>                    |            |     Tenant      |           |
>>                    |            |   Service IF    |           |
>>                   Tenant Systems                 Tenant Systems
>>       
>>                    Figure 4 : Generic reference model for NV Edge
>>
>>         Note that some NVE functions (e.g., data plane and control plane
>>         functions) may reside in one device or may be implemented separately
>>         in different devices. For example, the NVE functionality could
>>         reside solely on the End Devices, or be distributed between the End
>>         Devices and the ToRs. In the latter case we say that the End Device
>>         NVE component acts as the NVE Spoke, and ToRs act as NVE hubs.
>>         Tenant Systems will interface with VNIs maintained on the NVE
>>         spokes, and VNIs maintained on the NVE spokes will interface with
>>         VNIs maintained on the NVE hubs.
> Don't always assume "ToR".
>
> Also, in Figure 4, "VN context' is listed as if it were a component or
> something. But VN context is  previously defined as a field in the
> overlay header. Is this something different? If so, we should use a
> different term (to avoid confusion).
>
>>      2.3. NVE Service Types
>>
>>         NVE components may be used to provide different types of virtualized
>>         network services. This section defines the service types and
>>         associated attributes. Note that an NVE may be capable of providing
>>         both L2 and L3 services.
>>      2.3.1. L2 NVE providing Ethernet LAN-like service
>>
>>         L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet based
> drop the "(ELAN)" abbreviation. (its not needed and is not used
> anywhere else).
>
>>         multipoint service where the Tenant Systems appear to be
>>         interconnected by a LAN environment over a set of L3 tunnels. It
>>         provides per tenant virtual switching instance with MAC addressing
>>         isolation and L3 (IP/MPLS) tunnel encapsulation across the underlay.
>>
>>      2.3.2. L3 NVE providing IP/VRF-like service
>>
>>         Virtualized IP routing and forwarding is similar from a service
>>         definition perspective with IETF IP VPN (e.g., BGP/MPLS IPVPN
>>         [RFC4364] and IPsec VPNs). It provides per tenant routing instance
> should provide an RFC reference for IPsec VPNs too...
>
> s/per tenant/per-tenant/
>
>>         with addressing isolation and L3 (IP/MPLS) tunnel encapsulation
>>         across the underlay.
>>
>>      3. Functional components
>>
>>         This section decomposes the Network Virtualization architecture into
>>         functional components described in Figure 4 to make it easier to
>>         discuss solution options for these components.
>>
>>      3.1. Service Virtualization Components
>>
>>      3.1.1. Virtual Access Points (VAPs)
>>
>>         Tenant Systems are connected to the VNI Instance through Virtual
>>         Access Points (VAPs).
>>
>>         The VAPs can be physical ports or virtual ports identified through
>>         logical interface identifiers (e.g., VLAN ID, internal vSwitch
>>         Interface ID coonected to a VM).
> s/coonected/connected/
>
>>      3.1.2. Virtual Network Instance (VNI)
>>
>>         The VNI represents a set of configuration attributes defining access
>>         and tunnel policies and (L2 and/or L3) forwarding functions.
>>         Per tenant FIB tables and control plane protocol instances are used
>>         to maintain separate private contexts between tenants. Hence tenants
>>         are free to use their own addressing schemes without concerns about
>>         address overlapping with other tenants.
> Not exactly. The VNI is a VN Instance. Implementing a VNI requires a
> bunch of stuff. Also, here I think it is better to talk about VNs than
> tenants.
>
> Also, in reading through the doc, we say over and over again that you
> get address space separation, etc. No need to repeat this all the time!
>
> Better:
>
>          A VNI is a specific VN instance. Associated with each VNI is a
>          set of metat data necessary to implement the specific VN
>          service. For example, a per-VN forwarding or mapping table is
>          needed deliver traffic to other members of the VN and ensure
>          tenant separation between between different VNs.
>
>>      3.1.3. Overlay Modules and VN Context
>>
>>         Mechanisms for identifying each tenant service are required to allow
>>         the simultaneous overlay of multiple tenant services over the same
>>         underlay L3 network topology. In the data plane, each NVE, upon
>>         sending a tenant packet, must be able to encode the VN Context for
>>         the destination NVE in addition to the L3 tunnel information (e.g.,
>>         source IP address identifying the source NVE and the destination IP
>>         address identifying the destination NVE, or MPLS label). This allows
>>         the destination NVE to identify the tenant service instance and
>>         therefore appropriately process and forward the tenant packet.
>>
>>         The Overlay module provides tunneling overlay functions: tunnel
>>         initiation/termination, encapsulation/decapsulation of frames from
>>         VAPs/L3 Backbone and may provide for transit forwarding of IP
> s/L3 Backbone/L3 underlay
>
>>         traffic (e.g., transparent tunnel forwarding).
> What is this "transit forwarding"?
>
>>         In a multi-tenant context, the tunnel aggregates frames from/to
>>         different VNIs. Tenant identification and traffic demultiplexing are
>>         based on the VN Context identifier (e.g., VNID).
> Let's drop use of VNID here (since IDs can be locally significant too).
>
>>         The following approaches can been considered:
>>
>>            o One VN Context per Tenant: A globally unique (on a per-DC
>>              administrative domain) VNID is used to identify the related
>>              Tenant instances. An example of this approach is the use of
>>              IEEE VLAN or ISID tags to provide virtual L2 domains.
> I think this is off. We are mixing "tenant" and "VN". A tenant can
> have multiple different VNIs associated with it.  Each of those VNIs
> uses different VN Contexts. Thus, the VN Context != Tenant.
>
>>            o One VN Context per VNI: A per-tenant local value is
>>              automatically generated by the egress NVE and usually
>>              distributed by a control plane protocol to all the related
>>              NVEs. An example of this approach is the use of per VRF MPLS
>>              labels in IP VPN [RFC4364].
> This seems off. There could be a different VN Context for each NVE, so
> its not "per VNI".
>
>>            o One VN Context per VAP: A per-VAP local value is assigned and
>>              usually distributed by a control plane protocol. An example of
>>              this approach is the use of per CE-PE MPLS labels in IP VPN
>>              [RFC4364].
>>
>>         Note that when using one VN Context per VNI or per VAP, an
>>         additional global identifier may be used by the control plane to
>>         identify the Tenant context.
> need a name for that global identifier term and put it in the
> terminology section.
>
>>      3.1.4. Tunnel Overlays and Encapsulation options
>>
>>         Once the VN context identifier is added to the frame, a L3 Tunnel
> When using term Context Identifier, capitalize it.
>
>
>>         encapsulation is used to transport the frame to the destination NVE.
>>         The backbone devices do not usually keep any per service state,
>>         simply forwarding the frames based on the outer tunnel
>>         header.
> don't use "backbone devices" term. use "underlay devices"?
>
>>         Different IP tunneling options (e.g., GRE, L2TP, IPSec) and MPLS
>>         tunneling options (e.g., BGP VPN, VPLS) can be used.
>>
>>      3.1.5. Control Plane Components
> This section should be expanded to show the different problem areas
> for the control plane (specifically server-to-NVE and NVE-to-oracle).
>
>>         Control plane components may be used to provide the following
>>         capabilities:
>>
>>            . Auto-provisioning/Service discovery
>>
>>            . Address advertisement and tunnel mapping
>>
>>            . Tunnel management
> The above really should be expanded a bit. E.g., what does "auto
> provisioning" refer to? I don't think a lot is needed, but a sentence
> or two per bullet point would help. Also, there are 3 bullet points
> here, but there are 4 subsections that follow, and they do not match
> the above bullet points.
>
>>         A control plane component can be an on-net control protocol
>>         implemented on the NVE or a management control entity.
> What is "on-net protocol"
>
>>      3.1.5.1. Distributed vs Centralized Control Plane
>>
>>         A control/management plane entity can be centralized or distributed.
>>         Both approaches have been used extensively in the past. The routing
>>         model of the Internet is a good example of a distributed approach.
>>         Transport networks have usually used a centralized approach to
>>         manage transport paths.
> What is a "transport network"? I'm not sure what these are and why
> they are "usually"  use a centrallized network.
>
>>         It is also possible to combine the two approaches i.e. using a
>>         hybrid model. A global view of network state can have many benefits
>>         but it does not preclude the use of distributed protocols within the
>>         network. Centralized controllers provide a facility to maintain
>>         global state, and distribute that state to the network which in
>>         combination with distributed protocols can aid in achieving greater
>>         network efficiencies, and improve reliability and robustness. Domain
>>         and/or deployment specific constraints define the balance between
>>         centralized and distributed approaches.
>>
>>         On one hand, a control plane module can reside in every NVE. This is
>>         how routing control plane modules are implemented in routers. At the
>>         same time, an external controller can manage a group of NVEs via an
>>         agent in each NVE. This is how an SDN controller could communicate
>>         with the nodes it controls, via OpenFlow [OF] for instance.
> Expand SDN on first usage...
>
>>         In the case where a logically centralized control plane is
>>         preferred, the controller will need to be distributed to more than
>>         one node for redundancy and scalability in order to manage a large
>>         number of NVEs. Hence, inter-controller communication is necessary
>>         to synchronize state among controllers. It should be noted that
>>         controllers may be organized in clusters. The information exchanged
>>         between controllers of the same cluster could be different from the
>>         information exchanged across clusters.
> This section  does not really capture the oracle discussion we had at
> the interim meeting.
>
>>      3.1.5.2. Auto-provisioning/Service discovery
>>
>>         NVEs must be able to identify the appropriate VNI for each Tenant
>>         System. This is based on state information that is often provided by
>>         external entities. For example, in an environment where a VM is a
>>         Tenant System, this information is provided by compute management
>>         systems, since these are the only entities that have visibility of
>>         which VM belongs to which tenant.
> Above. might be better to say this is provided by the vm orchestration
> system (and maybe define this term in terminology section?)
>
>>      3.1.5.3. Address advertisement and tunnel mapping
>>
>>         As traffic reaches an ingress NVE, a lookup is performed to
>>         determine which tunnel the packet needs to be sent to. It is then
> s/sent to/sent on/ ??
>
>>         encapsulated with a tunnel header containing the destination
>>         information (destination IP address or MPLS label) of the egress
>>         overlay node. Intermediate nodes (between the ingress and egress
>>         NVEs) switch or route traffic based upon the outer destination
>>         information.
> Use active voice above? And say which entity it is that does the work?
> E.g.
>
>          As traffic reaches an ingress NVE, the NVE performs a lookup
>          to determine which remote NVE the packet should be sent
>          to. The NVE [or Overlay Module???] then adds a tunnel
>          encapsulation header containing the destination information
>          (destination IP address or MPLS label) of the egress NVE as
>          well as an appropriate Context ID.  Nodes on the underlay
>          network (between the ingress and egress NVEs) forward traffic
>          based solely on the outer destination header information.
>
>>         One key step in this process consists of mapping a final destination
>>         information to the proper tunnel. NVEs are responsible for
>>         maintaining such mappings in their forwarding tables. Several ways
>>         of populating these tables are possible: control plane driven,
>>         management plane driven, or data plane driven.
> Better:
>
>          A key step in the above process consists of identifying the
>          destination NVE the packet is to be tunneled to. NVEs are
>          responsible for maintaining a set of forwarding or mapping
>          tables that hold the bindings between between destination VM
>          and egress NVE addresses. Several ways of populating these
>          tables are possible: control plane driven, management plane
>          driven, or data plane driven.
>
>>      3.2. Multi-homing
>>
>>         Multi-homing techniques can be used to increase the reliability of
>>         an nvo3 network. It is also important to ensure that physical
>>         diversity in an nvo3 network is taken into account to avoid single
>>         points of failure.
>>
>>         Multi-homing can be enabled in various nodes, from tenant systems
>>         into TORs, TORs into core switches/routers, and core nodes into DC
>>         GWs.
>>
>>         The nvo3 underlay nodes (i.e. from NVEs to DC GWs) rely on IP
>>         routing as the means to re-route traffic upon failures and/or ECMP
>>         techniques or on MPLS re-rerouting capabilities.
>>
>>         When a tenant system is co-located with the NVE on the same end-
>>         system, the tenant system is single homed to the NVE via a vport
>>         that is virtual NIC (vNIC). When the end system and the NVEs
>> are
> vport/vNIC terminology is not defined (and is somewhat proprietary).
>
> And shouldn't the VAP terminology be used here?
>
>>         separated, the end system is connected to the NVE via a logical
>>         Layer2 (L2) construct such as a VLAN. In this latter case, an end
>>         device or vSwitch on that device could be multi-homed to various
>>         NVEs. An NVE may provide an L2 service to the end system or a l3
>>         service. An NVE may be multi-homed to a next layer in the DC at
>>         Layer2 (L2) or Layer3 (L3). When an NVE provides an L2 service and
>>         is not co-located with the end system, techniques such as Ethernet
>>         Link Aggregation Group (LAG) or Spanning Tree Protocol (STP) can be
>>         used to switch traffic between an end system and connected
>>         NVEs without creating loops. Similarly, when the NVE provides L3
>>         service, similar dual-homing techniques can be used. When the NVE
>>         provides a L3 service to the end system, it is possible that no
>>         dynamic routing protocol is enabled between the end system and the
>>         NVE. The end system can be multi-homed to multiple physically-
>>         separated L3 NVEs over multiple interfaces. When one of the
>>         links connected to an NVE fails, the other interfaces can be used to
>>         reach the end system.
> The above seemt to talk about what I'll call a "distributed NVE
> model", where a TES is connected to more than one NVE (or the one NVE
> that is somehow distributed). If this document is going to talk about
> that as a possibility in the context of multihoming, I think we need
> to talk more generally about a TES connected to more than one
> TES. This document doesn't really talk about that at all.
>
> Personally, I'm not sure we should go here. A lot of complexity that I
> suspect is not worth the cost. I'd suggest sticking with a single NVE
> per TES.
>
>>         External connectivity out of an nvo3 domain can be handled by two or
>>         more nvo3 gateways. Each gateway is connected to a different domain
>>         (e.g. ISP), providing access to external networks such as VPNs or
>>         the Internet. A gateway may be connected to two nodes. When a
>>         connection to an upstream node is lost, the alternative connection
>>         is used and the failed route withdrawn.
> For external multihoming, there is no reason to say they are connected
> to "different domains". You just want redundancy.
>
> Actually, I'm not sure what point the above is trying to highlight. Is
> this a generic requirement for multihoming out of a DC that happens to
> be running NVO3 internally? If so, I don't think we need to say
> that. It's a given and mostly out of scope.
>
> If we are talking just about "nvo3 gateways", I presume that means
> getting in and out of a specific VN. For that, you just need
> multihoming for redundancy, and there is no need to talk about
> connecting those gateways to "different domains".
>
>>      3.3. VM Mobility
>>
>>         In DC environments utilizing VM technologies, an important feature
>>         is that VMs can move from one server to another server in the same
>>         or different L2 physical domains (within or across DCs) in a
>>         seamless manner.
>>
>>         A VM can be moved from one server to another in stopped or suspended
>>         state ("cold" VM mobility) or in running/active state ("hot" VM
>>         mobility). With "hot" mobility, VM L2 and L3 addresses need to be
>>         preserved. With "cold" mobility, it may be desired to preserve VM L3
>>         addresses.
>>
>>         Solutions to maintain connectivity while a VM is moved are necessary
>>         in the case of "hot" mobility. This implies that transport
>>         connections among VMs are preserved and that ARP caches are updated
>>         accordingly.
>>
>>         Upon VM mobility, NVE policies that define connectivity among VMs
>>         must be maintained.
>>
>>         Optimal routing during VM mobility is also an important aspect to
>>         address. It is expected that the VM's default gateway be as close as
>>         possible to the server hosting the VM and triangular routing be
>>         avoided.
> What is meant by "triangular routing" above? Specifically, how is this
> a result of mobility vs. a general requirement?
>
>>      3.4. Service Overlay Topologies
>>
>>         A number of service topologies may be used to optimize the service
>>         connectivity and to address NVE performance limitations.
>>
>>         The topology described in Figure 3 suggests the use of a tunnel mesh
>>         between the NVEs where each tenant instance is one hop away from a
>>         service processing perspective. Partial mesh topologies and an NVE
>>         hierarchy may be used where certain NVEs may act as service transit
>>         points.
>>
>>      4. Key aspects of overlay networks
>>
>>         The intent of this section is to highlight specific issues that
>>         proposed overlay solutions need to address.
>>
>>      4.1. Pros & Cons
>>
>>         An overlay network is a layer of virtual network topology on top of
>>         the physical network.
>>
>>         Overlay networks offer the following key advantages:
>>
>>            o Unicast tunneling state management and association with tenant
>>              systems reachability are handled at the edge of the network.
>>              Intermediate transport nodes are unaware of such state. Note
>>              that this is not the case when multicast is enabled in the core
>>              network.
> The comment about multicast needs expansion. Multicast (in the
> underlay)  is an underlay issue and has nothing to do with
> overlays. If Tenant traffic is mapped into multicast service on the
> underlay, then there is a connection. If the latter is what is meant,
> please add text to that effect.
>
>>            o Tunneling is used to aggregate traffic and hide tenant
>>              addresses from the underkay network, and hence offer the
>>              advantage of minimizing the amount of forwarding state required
>>              within the underlay network
>>
>>            o Decoupling of the overlay addresses (MAC and IP) used by VMs
>>              from the underlay network. This offers a clear separation
>>              between addresses used within the overlay and the underlay
>>              networks and it enables the use of overlapping addresses spaces
>>              by Tenant Systems
>>
>>            o Support of a large number of virtual network identifiers
>>
>>         Overlay networks also create several challenges:
>>
>>            o Overlay networks have no controls of underlay networks and lack
>>              critical network information
> Is some text missing from the above bullet?
>
>>                 o Overlays typically probe the network to measure link or
>>                   path properties, such as available bandwidth or packet
>>                   loss rate. It is difficult to accurately evaluate network
>>                   properties. It might be preferable for the underlay
>>                   network to expose usage and performance
>>                 information.
> I don't follow the above. Isn't the above a true statement for a host
> connected to an IP network as well?
>
>>            o Miscommunication or lack of coordination between overlay and
>>              underlay networks can lead to an inefficient usage of network
>>              resources.
> Might be good to give an example.
>
>>            o When multiple overlays co-exist on top of a common underlay
>>              network, the lack of coordination between overlays can lead to
>>              performance issues.
> Can you give examples? and how is this different from what we have
> today with different hosts "not coordinating"  when they use the
> network?
>
>>            o Overlaid traffic may not traverse firewalls and NAT
>              devices.
>
> Explain why this is a challenge. I'd argue that is the point. if a FW
> is needed, it could be part of the overlay.
>
>>            o Multicast service scalability. Multicast support may be
>>              required in the underlay network to address for each tenant
>>              flood containment or efficient multicast handling. The underlay
>>              may be also be required to maintain multicast state on a per-
>>              tenant basis, or even on a per-individual multicast flow of a
>>              given tenant.
>>
>>            o Hash-based load balancing may not be optimal as the hash
>>              algorithm may not work well due to the limited number of
>>              combinations of tunnel source and destination addresses. Other
>>              NVO3 mechanisms may use additional entropy information than
>>              source and destination addresses.
>>      4.2. Overlay issues to consider
>>
>>      4.2.1. Data plane vs Control plane driven
>>
>>         In the case of an L2NVE, it is possible to dynamically learn MAC
>>         addresses against VAPs.
> rewrite? what does it mean to "learn MAC addresses against VAPs".
>
>> It is also possible that such addresses be
>>         known and controlled via management or a control protocol for both
>>         L2NVEs and L3NVEs.
>>
>>         Dynamic data plane learning implies that flooding of unknown
>>         destinations be supported and hence implies that broadcast and/or
>>         multicast be supported or that ingress replication be used as
>>         described in section 4.2.3. Multicasting in the underlay network for
>>         dynamic learning may lead to significant scalability limitations.
>>         Specific forwarding rules must be enforced to prevent loops from
>>         happening. This can be achieved using a spanning tree, a shortest
>>         path tree, or a split-horizon mesh.
>>
>>         It should be noted that the amount of state to be distributed is
>>         dependent upon network topology and the number of virtual machines.
>>         Different forms of caching can also be utilized to minimize state
>>         distribution between the various elements. The control plane should
>>         not require an NVE to maintain the locations of all the tenant
>>         systems whose VNs are not present on the NVE. The use of a control
>>         plane does not imply that the data plane on NVEs has to maintain all
>>         the forwarding state in the control plane.
>>
>>      4.2.2. Coordination between data plane and control plane
>>
>>         For an L2 NVE, the NVE needs to be able to determine MAC addresses
>>         of the end systems connected via a VAP. This can be achieved via
>>         dataplane learning or a control plane. For an L3 NVE, the NVE needs
>>         to be able to determine IP addresses of the end systems connected
>>         via a VAP.
> Better:
>
>          For an L2 NVE, the NVE needs to be able to determine MAC
>          addresses of the end systems connected via a VAP. For an L3
>          NVE, the NVE needs to be able to determine IP addresses of the
>          end systems connected via a VAP. In both cases, this can be
>          achieved via dataplane learning or a control plane.
>
>>         In both cases, coordination with the NVE control protocol is needed
>>         such that when the NVE determines that the set of addresses behind a
>>         VAP has changed, it triggers the local NVE control plane to
>>         distribute this information to its peers.
>>
>>      4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic
>>
>>         There are two techniques to support packet replication needed for
>>         broadcast, unknown unicast and multicast:
> s/for broadcast/for tenant broadcast/
>
>>            o Ingress replication
>>
>>            o Use of underlay multicast trees
> draft-ghanwani-nvo3-mcast-issues-00.txt describes a third technique.
>
>>         There is a bandwidth vs state trade-off between the two approaches.
>>         Depending upon the degree of replication required (i.e. the number
>>         of hosts per group) and the amount of multicast state to maintain,
>>         trading bandwidth for state should be considered.
>>
>>         When the number of hosts per group is large, the use of underlay
>>         multicast trees may be more appropriate. When the number of hosts is
>>         small (e.g. 2-3), ingress replication may not be an issue.
>>
>>         Depending upon the size of the data center network and hence the
>>         number of (S,G) entries, but also the duration of multicast flows,
>>         the use of underlay multicast trees can be a challenge.
>>
>>         When flows are well known, it is possible to pre-provision such
>>         multicast trees. However, it is often difficult to predict
>>         application flows ahead of time, and hence programming of (S,G)
>>         entries for short-lived flows could be impractical.
>>
>>         A possible trade-off is to use in the underlay shared multicast
>>         trees as opposed to dedicated multicast trees.
>>      4.2.4. Path MTU
>>
>>         When using overlay tunneling, an outer header is added to the
>>         original frame. This can cause the MTU of the path to the egress
>>         tunnel endpoint to be exceeded.
>>
>>         In this section, we will only consider the case of an IP overlay.
>>
>>         It is usually not desirable to rely on IP fragmentation for
>>         performance reasons. Ideally, the interface MTU as seen by a Tenant
>>         System is adjusted such that no fragmentation is needed. TCP will
>>         adjust its maximum segment size accordingly.
>>
>>         It is possible for the MTU to be configured manually or to be
>>         discovered dynamically. Various Path MTU discovery techniques exist
>>         in order to determine the proper MTU size to use:
>>
>>            o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]
>>
>>                 o
>>                  Tenant Systems rely on ICMP messages to discover the MTU of
>>                   the end-to-end path to its destination. This method is not
>>                   always possible, such as when traversing middle boxes
>>                   (e.g. firewalls) which disable ICMP for security reasons
>>
>>            o Extended MTU Path Discovery techniques such as defined in
>>              [RFC4821]
>>
>>         It is also possible to rely on the overlay layer to perform
>>         segmentation and reassembly operations without relying on the Tenant
>>         Systems to know about the end-to-end MTU. The assumption is that
>>         some hardware assist is available on the NVE node to perform such
>>         SAR operations. However, fragmentation by the overlay layer can lead
> expand "SAR"
>
>>         to performance and congestion issues due to TCP dynamics and might
>>         require new congestion avoidance mechanisms from then underlay
>>         network [FLOYD].
>>
>>         Finally, the underlay network may be designed in such a way that the
>>         MTU can accommodate the extra tunneling and possibly additional nvo3
>>         header encapsulation overhead.
>>
>>      4.2.5. NVE location trade-offs
>>
>>         In the case of DC traffic, traffic originated from a VM is native
>>         Ethernet traffic. This traffic can be switched by a local virtual
>>         switch or ToR switch and then by a DC gateway. The NVE function can
>>         be embedded within any of these elements.
>>
>>         There are several criteria to consider when deciding where the NVE
>>         function should happen:
>>
>>            o Processing and memory requirements
>>
>>                o Datapath (e.g. lookups, filtering,
>>                   encapsulation/decapsulation)
>>
>>                o Control plane processing (e.g. routing, signaling, OAM) and
>>                   where specific control plane functions should be enabled
> missing closing ")"
>
>>            o FIB/RIB size
>>
>>            o Multicast support
>>
>>                o Routing/signaling protocols
>>
>>                o Packet replication capability
>>
>>                o Multicast FIB
>>
>>            o Fragmentation support
>>
>>            o QoS support (e.g. marking, policing, queuing)
>>
>>            o Resiliency
>>
>>      4.2.6. Interaction between network overlays and underlays
>>
>>         When multiple overlays co-exist on top of a common underlay network,
>>         resources (e.g., bandwidth) should be provisioned to ensure that
>>         traffic from overlays can be accommodated and QoS objectives can be
>>         met. Overlays can have partially overlapping paths (nodes and
>>         links).
>>
>>         Each overlay is selfish by nature. It sends traffic so as to
>>         optimize its own performance without considering the impact on other
>>         overlays, unless the underlay paths are traffic engineered on a per
>>         overlay basis to avoid congestion of underlay resources.
>>
>>         Better visibility between overlays and underlays, or generally
>>         coordination in placing overlay demand on an underlay network, can
>>         be achieved by providing mechanisms to exchange performance and
>>         liveliness information between the underlay and overlay(s) or the
>>         use of such information by a coordination system. Such information
>>         may include:
>>
>>            o Performance metrics (throughput, delay, loss, jitter)
>>
>>            o Cost metrics
>>
>>      5. Security Considerations
>>
>>         Nvo3 solutions must at least consider and address the following:
>>
>>            . Secure and authenticated communication between an NVE and an
>>              NVE management system.
>>
>>            . Isolation between tenant overlay networks. The use of per-
>>              tenant FIB tables (VNIs) on an NVE is essential.
>>
>>            . Security of any protocol used to carry overlay network
>>              information.
>>
>>            . Avoiding packets from reaching the wrong NVI, especially during
>>              VM moves.
>>
>>
> _______________________________________________
> nvo3 mailing list
> nvo3@ietf.org
> https://www.ietf.org/mailman/listinfo/nvo3