[nvo3] Review of draft-ietf-nvo3-framework-02

Thomas Narten <narten@us.ibm.com> Fri, 15 February 2013 22:34 UTC

From: Thomas Narten <narten@us.ibm.com>
To: nvo3@ietf.org
Date: Fri, 15 Feb 2013 17:33:59 -0500
Subject: [nvo3] Review of draft-ietf-nvo3-framework-02

Below is a detailed review of the framework document.  Pretty much all
of these are editorial -- I don't think I really have any issue with
the substance. But there are a lot of suggestions for clarifying text
and tightening up the terminology, language, etc.

Thomas

High level: what happened to adding text describing the "oracle"
model, where there was a clear separation of the NVE and the oracle
(and the notion of a federated oracle), as well as separate protocols
for the inter/intra-oracle control vs. server-to-NVE control? Per the
September interim meeting, this needs to be added.

       [OVCPREQ] describes the requirements for a control plane protocol 
       required by overlay border nodes to exchange overlay mappings. 

The above document has been split into 2 documents. I think both
should be referenced. Also, I think it would be good to not combine
the two problem areas in the 2 drafts into one generic "control plane
protocol" reference. They are very different problems with no overlap,
and I think when folks talk about a "control plane", they are really
referring to the problem in draft-kreeger-nvo3-hypervisor-nve-cp-00.txt.

>     1.2. General terminology

Nit: On terminology, how about using a consistent syntax/format for
terms, with the abbreviation in parentheses following the term itself,
e.g., something like

OLD:

       NVE: Network Virtualization Edge. 

NEW:
      Network Virtualization Edge (NVE): ...

Add the following (other documents use this term and we should define
it here for use throughout NVO3):

   Closed User Group (CUG): Another term for Virtual Network.


>        NVE: Network Virtualization Edge. It is a network entity that sits 
>        on the edge of the NVO3 network. It implements network 
>        virtualization functions that allow for L2 and/or L3 tenant 
>        separation and for hiding tenant addressing information (MAC and IP 
>        addresses). An NVE could be implemented as part of a virtual switch 
>        within a hypervisor, a physical switch or router, or a network 
>        service appliance.

Could be improved. How about:

       NVO3 Network: an overlay network that provides an L2 or L3
       service to Tenant Systems over an L3 underlay network, using
       the architecture and protocols as defined by the NVO3 Working
       Group.

       Network Virtualization Edge (NVE): An NVE is the network entity
       that implements network virtualization functions and sits on
       the boundary between an NVO3 network and an underlying network.
       The network-facing side of the NVE uses the underlying L3
       network to tunnel frames to and from other NVEs. The
       server-facing side of the NVE sends and receives Ethernet
       Frames to and from individual Tenant Systems.  An NVE could be
       implemented as part of a virtual switch within a hypervisor, a
       physical switch or router, a Network Service Appliance, or be
       split across multiple devices.


>        VN: Virtual Network. This is a virtual L2 or L3 domain that belongs 
>        to a tenant.

Better

       Virtual Network (VN): A virtual network is a logical
       abstraction of a physical network that provides network
       services to a set of Tenant Systems.  To Tenant Systems, a
       virtual network looks like a normal network (i.e., providing
       unrestricted Ethernet or L3 service), except that the only end
       stations connected to the virtual network are those belonging
       to a tenant's specific virtual network.
   
>        VNI: Virtual Network Instance. This is one instance of a virtual 
>        overlay network. It refers to the state maintained for a given VN on 
>        a given NVE. Two Virtual Networks are isolated from one another and 
>        may use overlapping addresses.

Better

       Virtual Network Instance (VNI): A specific instance of a
       Virtual Network.

(no need to say more)

>        Virtual Network Context or VN Context: Field that is part of the 
>        overlay encapsulation header which allows the encapsulated frame to 
>        be delivered to the appropriate virtual network endpoint by the 
>        egress NVE. The egress NVE uses this field to determine the 
>        appropriate virtual network context in which to process the packet. 
>        This field MAY be an explicit, unique (to the administrative domain) 
>        virtual network identifier (VNID) or MAY express the necessary 
>        context information in other ways (e.g., a locally significant 
>        identifier).

Better:

       Virtual Network Context (VN Context): Field in overlay
       encapsulation header that identifies the specific VN the packet
       belongs to. The egress NVE uses the VN Context to deliver the
       packet to the correct Tenant System.  The VN Context can be a
       locally significant identifier having meaning only in
       conjunction with additional information, such as the
       destination NVE address. Alternatively, the VN Context can have
       broader scope, e.g., be unique across the entire NVO3 network.
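
(As an aside, it might help readers to see where the VN Context sits
relative to the rest of an encapsulated packet. A rough sketch along
these lines -- field names are purely illustrative, not a proposed
wire format:)

    # Purely illustrative: logical layering of an NVO3 encapsulated packet.
    # Field names/types are hypothetical, not a proposed wire format.
    from dataclasses import dataclass

    @dataclass
    class EncapsulatedPacket:
        outer_src_ip: str    # underlay address of the ingress NVE
        outer_dst_ip: str    # underlay address of the egress NVE
        vn_context: int      # VN Context: locally or globally scoped value
        inner_frame: bytes   # original tenant frame, carried unmodified

    # The egress NVE uses vn_context (possibly together with outer_src_ip)
    # to select the right VNI, then delivers inner_frame to the Tenant System.
    pkt = EncapsulatedPacket("192.0.2.1", "192.0.2.2", 5001, b"\x00" * 64)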
       

>        VNID:  Virtual Network Identifier. In the case where the VN context 
>        identifier has global significance, this is the ID value that is 
>        carried in each data packet in the overlay encapsulation that 
>        identifies the Virtual Network the packet belongs to. 

A VNID definition by itself doesn't seem all that helpful. (This term
came from early on when some of us assumed that the Context ID always
had non-local significance.)

I think we may want to define some additional terms here. VNID by
itself is not sufficient. Have a look at the terms "VN Alias", "VN
Name" and "VN ID" in
draft-kreeger-nvo3-hypervisor-nve-cp-00.txt. Those (or similar) terms
should probably be moved into the framework document.

>        Underlay or Underlying Network: This is the network that provides 
>        the connectivity between NVEs. The Underlying Network can be 
>        completely unaware of the overlay packets. Addresses within the 
>        Underlying Network are also referred to as "outer addresses" because 
>        they exist in the outer encapsulation. The Underlying Network can 
>        use a completely different protocol (and address family) from that 
>        of the overlay.

We should say that for NVO3, the underlay is assumed to be IP.

Better:

       Underlay or Underlying Network: the network that provides the
       connectivity between NVEs and over which NVO3 packets are
       tunneled. The Underlay Network does not need to be aware that
       it is carrying NVO3 packets. Addresses on the Underlay Network
       appear as "outer addresses" in encapsulated NVO3 packets. In
       general, the Underlay Network can use a completely different
       protocol (and address family) from that of the overlay. In the
       case of NVO3, the underlay network will always be IP.

>        Data Center (DC): A physical complex housing physical servers, 
>        network switches and routers, network service sppliances and 
>        networked storage. The purpose of a Data Center is to provide 
>        application, compute and/or storage services. One such service is 
>        virtualized infrastructure data center services, also known as 
>        Infrastructure as a Service.

Should we add defn. for "network service appliance" ??

>        Virtual Data Center or Virtual DC: A container for virtualized 
>        compute, storage and network services. Managed by a single tenant, a 
>        Virtual DC can contain multiple VNs and multiple Tenant Systems that 
>        are connected to one or more of these VNs.

Tenant manages what is in the VDC. Network Admin manages all aspects
of mapping the virtual components to physical components.

Better:

       Virtual Data Center (Virtual DC): A container for virtualized
       compute, storage and network services. A Virtual DC is
       associated with a single tenant, and can contain multiple VNs
       and Tenant Systems connected to one or more of these VNs.

>        VM: Virtual Machine. Several Virtual Machines can share the 
>        resources of a single physical computer server using the services of 
>        a Hypervisor (see below definition).

Better (taken/adapted from RFC6820):

   Virtual machine (VM):  A software implementation of a physical
      machine that runs programs as if they were executing on a
      physical, non-virtualized machine.  Applications (generally) do
      not know they are running on a VM as opposed to running on a
      "bare" host or server, though some systems provide a
      paravirtualization environment that allows an operating system or
      application to be aware of the presence of virtualization for
      optimization purposes.

>        Hypervisor: Server virtualization software running on a physical 
>        compute server that hosts Virtual Machines. The hypervisor provides 
>        shared compute/memory/storage and network connectivity to the VMs 
>        that it hosts. Hypervisors often embed a Virtual Switch (see below). 

Compute server is not defined and the term isn't used elsewhere in the
document. How about:

       Hypervisor: Software running on a server that allows multiple
       VMs to run on the same physical server. The hypervisor provides
       shared compute/memory/storage and network connectivity to the
       VMs that it hosts. Hypervisors often embed a Virtual Switch
       (see below).

Also, add (for completeness)

      Server: A physical end host machine that runs user
      applications. A standalone (or "bare metal") server runs a
      conventional operating system hosting a single tenant
      application. A virtualized server runs a hypervisor supporting
      one or more VMs.
       

>        Virtual Switch: A function within a Hypervisor (typically 
>        implemented in software) that provides similar services to a 
>        physical Ethernet switch.  It switches Ethernet frames between VMs 
>        virtual NICs within the same physical server, or between a VM and a 
>        physical NIC card connecting the server to a physical Ethernet 
>        switch or router. It also enforces network isolation between VMs 
>        that should not communicate with each other. 

slightly better:

       Virtual Switch (vSwitch): A function within a Hypervisor
       (typically implemented in software) that provides similar
       services to a physical Ethernet switch.  A vSwitch forwards
       Ethernet frames between VMs running on the same server, or
       between a VM and a physical NIC card connecting the server to a
       physical Ethernet switch. A vSwitch also enforces network
       isolation between VMs that by policy are not permitted to
       communicate with each other (e.g., by honoring VLANs).

> 
>        Tenant: In a DC, a tenant refers to a customer that could be an 
>        organization within an enterprise, or an enterprise with a set of DC 
>        compute, storage and network resources associated with it.

Better:

       Tenant: The customer using a virtual network and any associated
       resources (e.g., compute, storage and network).  A tenant could
       be an enterprise, a department, an organization within an
       enterprise, etc.


>        Tenant System: A physical or virtual system that can play the role 
>        of a host, or a forwarding element such as a router, switch, 
>        firewall, etc. It belongs to a single tenant and connects to one or 
>        more VNs of that tenant.

Better:

       Tenant System: A physical or virtual host associated with a
       specific tenant.  A Tenant System can play the role of a host,
       or a forwarding element such as a router, switch, firewall,
       etc., and connects to one or more of the tenant's VNs.

>        End device: A physical system to which networking service is 
>        provided. Examples include hosts (e.g. server or server blade), 
>        storage systems (e.g., file servers, iSCSI storage systems), and 
>        network devices (e.g., firewall, load-balancer, IPSec gateway). An 
>        end device may include internal networking functionality that 
>        interconnects the device's components (e.g. virtual switches that 
>        interconnect VMs running on the same server). NVE functionality may 
>        be implemented as part of that internal networking. 

Better:

        End device: A physical device that connects directly to the
        data center Underlay Network. An End Device is administered by
        the data center operator rather than a tenant and is part of
        data center infrastructure. An End Device may implement NVO3
        technology in support of NVO3 functions. Contrast with Tenant
        System, which is only connected to a Virtual Network.
        Examples include hosts (e.g. server or server blade), storage
        systems (e.g., file servers, iSCSI storage systems), and
        network devices (e.g., firewall, load-balancer, IPSec
        gateway).



> 
>        ELAN: MEF ELAN, multipoint to multipoint Ethernet service  

I'd suggest dropping these terms. They are barely used and are not
critical to understanding the framework.

> 
>        EVPN: Ethernet VPN as defined in [EVPN] 

Remove. These terms are not used elsewhere in the document.

>     1.3. DC network architecture 
> 
>        A generic architecture for Data Centers is depicted in Figure 1:  
> 
>                                     ,---------. 
>                                   ,'           `. 
>                                  (  IP/MPLS WAN ) 
>                                   `.           ,' 
>                                     `-+------+' 
>                                  +--+--+   +-+---+ 
>                                  |DC GW|+-+|DC GW| 
>                                  +-+---+   +-----+ 
>                                     |       / 
>                                     .--. .--. 
>                                   (    '    '.--. 
>                                 .-.' Intra-DC     ' 
>                                (     network      ) 
>                                 (             .'-' 
>                                  '--'._.'.    )\ \ 
>                                  / /     '--'  \ \ 
>                                 / /      | |    \ \ 
>                           +---+--+   +-`.+--+  +--+----+ 
>                           | ToR  |   | ToR  |  |  ToR  | 
>                           +-+--`.+   +-+-`.-+  +-+--+--+ 
>                            /     \    /    \   /       \  
>                         __/_      \  /      \ /_       _\__ 
>                  '--------'   '--------'   '--------'   '--------' 
>                  :  End   :   :  End   :   :  End   :   :  End   : 
>                  : Device :   : Device :   : Device :   : Device : 
>                  '--------'   '--------'   '--------'   '--------' 
>                                           
>                  Figure 1 : A Generic Architecture for Data Centers 

The above is not necessarily what a DC looks like. ARMD went through
this already, and there are many different data center network types.
For example, the above doesn't allow for a chassis with an embedded
switch between the End Device and ToR.

This picture should be generalized to use terms like "access layer",
and "aggregation layer" rather than specific terms like ToR. Have a
look at RFC 6820.

Feel free to grab text or just point there. 

>        An example of multi-tier DC network architecture is presented in 
>        this figure. It provides a view of physical components inside a DC.  

s/this figure/Figure 1/

>        A cloud network is composed of intra-Data Center (DC) networks and 
>        network services, and inter-DC network and network connectivity 
>        services. Depending upon the scale, DC distribution, operations 
>        model, Capex and Opex aspects, DC networking elements can act as 
>        strict L2 switches and/or provide IP routing capabilities, including 
>        service virtualization.

Do we really need to use the term "cloud network" and say what a
"cloud network" is? The term does not seem to be used elsewhere in the
document...

>        In some DC architectures, it is possible that some tier layers 
>        provide L2 and/or L3 services, are collapsed, and that Internet 
>        connectivity, inter-DC connectivity and VPN support are handled by a 
>        smaller number of nodes. Nevertheless, one can assume that the 
>        functional blocks fit in the architecture above.

Per above, see how the ARMD document handled this.

>     1.4. Tenant networking view 
> 
>        The DC network architecture is used to provide L2 and/or L3 service 
>        connectivity to each tenant. An example is depicted in Figure 2: 
>         
> 
>                          +----- L3 Infrastructure ----+           
>                          |                            |           
>                       ,--+--.                      ,--+--. 
>                 .....( Rtr1  )......              ( Rtr2  ) 
>                 |     `-----'      |               `-----' 
>                 |     Tenant1      |LAN12      Tenant1| 
>                 |LAN11         ....|........          |LAN13 
>           ..............        |        |     .............. 
>              |        |         |        |       |        | 
>             ,-.      ,-.       ,-.      ,-.     ,-.      ,-. 
>            (VM )....(VM )     (VM )... (VM )   (VM )....(VM ) 
>             `-'      `-'       `-'      `-'     `-'      `-' 
>                                           
>             Figure 2 : Logical Service connectivity for a single tenant 
> 
>        In this example, one or more L3 contexts and one or more LANs (e.g., 
>        one per application type) are assigned for DC tenant1.  

This picture is unclear. What does "tenant 1" cover in the picture?
What is an "L3 context"? I would assume it needs to refer to specific
VMs too...

>        For a multi-tenant DC, a virtualized version of this type of service 
>        connectivity needs to be provided for each tenant by the Network 
>        Virtualization solution.  

I would assume NVO3 only cares about the multi-tenant case, which
makes me wonder what the previous example is supposed to show. A
single tenant case? What does that mean?

> 
>     2. Reference Models 
> 
>     2.1. Generic Reference Model 
> 
>        The following diagram shows a DC reference model for network 
>        virtualization using L3 (IP/MPLS) overlays where NVEs provide a 
>        logical interconnect between Tenant Systems that belong to a 
>        specific tenant network. 
> 

Should the above say "that belong to a specific tenant's Virtual
Network"?

>      
>              +--------+                                    +--------+ 
>              | Tenant +--+                            +----| Tenant | 
>              | System |  |                           (')   | System | 
>              +--------+  |    ...................   (   )  +--------+ 
>                          |  +-+--+           +--+-+  (_)     
>                          |  | NV |           | NV |   | 
>                          +--|Edge|           |Edge|---+ 
>                             +-+--+           +--+-+ 
>                             / .                 .  
>                            /  .   L3 Overlay +--+-++--------+ 
>              +--------+   /   .    Network   | NV || Tenant | 
>              | Tenant +--+    .              |Edge|| System | 
>              | System |       .    +----+    +--+-++--------+ 
>              +--------+       .....| NV |........    
>                                    |Edge| 
>                                    +----+ 
>                                      |       
>                                      | 
>                            ===================== 
>                              |               | 
>                          +--------+      +--------+ 
>                          | Tenant |      | Tenant | 
>                          | System |      | System | 
>                          +--------+      +--------+ 

s/NV Edge/NVE/ for consistency

>      
>           Figure 3 : Generic reference model for DC network virtualization 
>                            over a Layer3 infrastructure 
> 
>        A Tenant System can be attached to a Network Virtualization Edge 
>        (NVE) node in several ways:

Each of these ways should be clearly labeled in Figure 3.

> 
>          - locally, by being co-located in the same device

add something like: (e.g., as part of the hypervisor)

> 
>          - remotely, via a point-to-point connection or a switched network 
>          (e.g., Ethernet) 
> 
>        When an NVE is local, the state of Tenant Systems can be provided 
>        without protocol assistance. For instance, the operational status of 
>        a VM can be communicated via a local API. When an NVE is remote, the 
>        state of Tenant Systems needs to be exchanged via a data or control 
>        plane protocol, or via a management entity.

Better:

       When an NVE is co-located with a Tenant System, communication
       and synchronization between the TS and NVE takes place via
       software (e.g., using an internal API). When an NVE and TS are
       separated by an access link, interaction and synchronization
       between an NVE and TS require an explicit data plane, control
       plane, or management protocol.

> 
>        The functional components in Figure 3 do not necessarily map 
>        directly with the physical components described in Figure 1. 
> 
>        For example, an End Device can be a server blade with VMs and 
>        virtual switch, i.e. the VM is the Tenant System and the NVE 
>        functions may be performed by the virtual switch and/or the 
>        hypervisor. In this case, the Tenant System and NVE function are co-
>        located. 
> 
>        Another example is the case where an End Device can be a traditional 
>        physical server (no VMs, no virtual switch), i.e. the server is the 
>        Tenant System and the NVE function may be performed by the ToR. 

We should not use the term "ToR" here. We should be more generic and
say something like the "attached switch" or "access switch". 

> 
>        The NVE implements network virtualization functions that allow for 
>        L2 and/or L3 tenant separation and for hiding tenant addressing 
>        information (MAC and IP addresses), tenant-related control plane 
>        activity and service contexts from the underlay nodes.

We should probably define "tenant separation" earlier in the document and then
just refer to that definition. Add something like the following to the
definitions?:

    Tenant Separation: Tenant Separation refers to isolating traffic
    of different tenants so that traffic from one tenant is not
    visible to or delivered to another tenant, except when allowed by
    policy. Tenant Separation also refers to address space separation,
    whereby different tenants use the same address space for different
    virtual networks without conflict.


>     2.2. NVE Reference Model

> 
>        One or more VNIs can be instantiated on an NVE. Tenant Systems 
>        interface with a corresponding VNI via a Virtual Access Point
>        (VAP).

Define VAP in the terminology section.

>        An overlay module that provides tunneling overlay functions (e.g., 
>        encapsulation and decapsulation of tenant traffic from/to the tenant 
>        forwarding instance, tenant identification and mapping, etc), as 
>        described in figure 4:

Doesn't quite parse. Better:

        An overlay module on the NVE provides tunneling overlay
        functions (e.g., encapsulation and decapsulation of tenant
        traffic from/to the tenant forwarding instance, tenant
        identification and mapping, etc), as described in figure 4:

> 
>                           +------- L3 Network ------+ 
>                           |                         | 
>                           |       Tunnel Overlay    | 
>              +------------+---------+       +---------+------------+ 
>              | +----------+-------+ |       | +---------+--------+ | 
>              | |  Overlay Module  | |       | |  Overlay Module  | | 
>              | +---------+--------+ |       | +---------+--------+ | 
>              |           |VN context|       | VN context|          | 
>              |           |          |       |           |          | 
>              |  +--------+-------+  |       |  +--------+-------+  | 
>              |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  | 
>         NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2 
>              |    |   VAPs     |    |       |    |    VAPs   |     | 
>              +----+------------+----+       +----+-----------+-----+ 
>                   |            |                 |           | 
>            -------+------------+-----------------+-----------+------- 
>                   |            |     Tenant      |           | 
>                   |            |   Service IF    |           | 
>                  Tenant Systems                 Tenant Systems 
>      
>                   Figure 4 : Generic reference model for NV Edge 
> 
>        Note that some NVE functions (e.g., data plane and control plane 
>        functions) may reside in one device or may be implemented separately 
>        in different devices. For example, the NVE functionality could 
>        reside solely on the End Devices, or be distributed between the End 
>        Devices and the ToRs. In the latter case we say that the End Device 
>        NVE component acts as the NVE Spoke, and ToRs act as NVE hubs. 
>        Tenant Systems will interface with VNIs maintained on the NVE 
>        spokes, and VNIs maintained on the NVE spokes will interface with 
>        VNIs maintained on the NVE hubs.

Don't always assume "ToR".

Also, in Figure 4, "VN context" is listed as if it were a component or
something. But VN context is previously defined as a field in the
overlay header. Is this something different? If so, we should use a
different term (to avoid confusion).

>     2.3. NVE Service Types 
> 
>        NVE components may be used to provide different types of virtualized 
>        network services. This section defines the service types and 
>        associated attributes. Note that an NVE may be capable of providing 
>        both L2 and L3 services. 

>     2.3.1. L2 NVE providing Ethernet LAN-like service 
> 
>        L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet based 

Drop the "(ELAN)" abbreviation (it's not needed and is not used
anywhere else).

>        multipoint service where the Tenant Systems appear to be 
>        interconnected by a LAN environment over a set of L3 tunnels. It 
>        provides per tenant virtual switching instance with MAC addressing 
>        isolation and L3 (IP/MPLS) tunnel encapsulation across the underlay. 
> 
>     2.3.2. L3 NVE providing IP/VRF-like service 
> 
>        Virtualized IP routing and forwarding is similar from a service 
>        definition perspective with IETF IP VPN (e.g., BGP/MPLS IPVPN 
>        [RFC4364] and IPsec VPNs). It provides per tenant routing instance 

should provide an RFC reference for IPsec VPNs too...

s/per tenant/per-tenant/

>        with addressing isolation and L3 (IP/MPLS) tunnel encapsulation 
>        across the underlay. 
> 
>     3. Functional components 
> 
>        This section decomposes the Network Virtualization architecture into 
>        functional components described in Figure 4 to make it easier to 
>        discuss solution options for these components. 
> 
>     3.1. Service Virtualization Components 
> 
>     3.1.1. Virtual Access Points (VAPs) 
> 
>        Tenant Systems are connected to the VNI Instance through Virtual 
>        Access Points (VAPs).  
> 
>        The VAPs can be physical ports or virtual ports identified through 
>        logical interface identifiers (e.g., VLAN ID, internal vSwitch 
>        Interface ID coonected to a VM).

s/coonected/connected/

> 
>     3.1.2. Virtual Network Instance (VNI) 
> 
>        The VNI represents a set of configuration attributes defining access 
>        and tunnel policies and (L2 and/or L3) forwarding functions.
>        Per tenant FIB tables and control plane protocol instances are used 
>        to maintain separate private contexts between tenants. Hence tenants 
>        are free to use their own addressing schemes without concerns about 
>        address overlapping with other tenants.

Not exactly. The VNI is a VN Instance. Implementing a VNI requires a
bunch of stuff. Also, here I think it is better to talk about VNs than
tenants.

Also, in reading through the doc, we say over and over again that you
get address space separation, etc. No need to repeat this all the time!

Better:

        A VNI is a specific VN instance. Associated with each VNI is a
        set of metadata necessary to implement the specific VN
        service. For example, a per-VN forwarding or mapping table is
        needed to deliver traffic to other members of the VN and to
        ensure tenant separation between different VNs.
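
(If an illustration would help here, something like the following
sketch conveys the per-VNI state; the names are hypothetical, not
taken from the draft:)

    # Hypothetical sketch of the per-VNI state described above. Each VNI
    # keeps its own mapping table, so two VNs can use overlapping tenant
    # addresses without conflict.
    class VNI:
        def __init__(self, vn_context):
            self.vn_context = vn_context
            self.mapping_table = {}  # tenant dest address -> egress NVE address

        def learn(self, tenant_addr, egress_nve):
            self.mapping_table[tenant_addr] = egress_nve

        def lookup(self, tenant_addr):
            return self.mapping_table.get(tenant_addr)  # None => unknown dest

    # The same tenant MAC maps independently in two different VNIs.
    vni_a, vni_b = VNI(vn_context=10), VNI(vn_context=20)
    vni_a.learn("00:00:5e:00:53:01", "192.0.2.2")
    vni_b.learn("00:00:5e:00:53:01", "192.0.2.7")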

>     3.1.3. Overlay Modules and VN Context 
> 
>        Mechanisms for identifying each tenant service are required to allow 
>        the simultaneous overlay of multiple tenant services over the same 
>        underlay L3 network topology. In the data plane, each NVE, upon 
>        sending a tenant packet, must be able to encode the VN Context for 
>        the destination NVE in addition to the L3 tunnel information (e.g., 
>        source IP address identifying the source NVE and the destination IP 
>        address identifying the destination NVE, or MPLS label). This allows 
>        the destination NVE to identify the tenant service instance and 
>        therefore appropriately process and forward the tenant packet.  
> 
>        The Overlay module provides tunneling overlay functions: tunnel 
>        initiation/termination, encapsulation/decapsulation of frames from 
>        VAPs/L3 Backbone and may provide for transit forwarding of IP 

s/L3 Backbone/L3 underlay

>        traffic (e.g., transparent tunnel forwarding).

What is this "transit forwarding"?

>        In a multi-tenant context, the tunnel aggregates frames from/to 
>        different VNIs. Tenant identification and traffic demultiplexing are 
>        based on the VN Context identifier (e.g., VNID).

Let's drop use of VNID here (since IDs can be locally significant too).

> 
>        The following approaches can been considered: 
> 
>           o One VN Context per Tenant: A globally unique (on a per-DC 
>             administrative domain) VNID is used to identify the related 
>             Tenant instances. An example of this approach is the use of 
>             IEEE VLAN or ISID tags to provide virtual L2 domains.

I think this is off. We are mixing "tenant" and "VN". A tenant can
have multiple different VNIs associated with it.  Each of those VNIs
uses different VN Contexts. Thus, the VN Context != Tenant.

> 
>           o One VN Context per VNI: A per-tenant local value is 
>             automatically generated by the egress NVE and usually 
>             distributed by a control plane protocol to all the related 
>             NVEs. An example of this approach is the use of per VRF MPLS 
>             labels in IP VPN [RFC4364].

This seems off. There could be a different VN Context for each NVE, so
it's not "per VNI".

>           o One VN Context per VAP: A per-VAP local value is assigned and 
>             usually distributed by a control plane protocol. An example of 
>             this approach is the use of per CE-PE MPLS labels in IP VPN 
>             [RFC4364].  
> 
>        Note that when using one VN Context per VNI or per VAP, an 
>        additional global identifier may be used by the control plane to 
>        identify the Tenant context.

need a name for that global identifier term and put it in the
terminology section.
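
To make the distinction concrete (and to motivate naming that global
identifier), here is a rough sketch of how an egress NVE might resolve
the two flavors of VN Context; values are purely illustrative:

    # Illustrative only: two ways an egress NVE could resolve a VN Context.
    # With a globally scoped value, the context alone selects the VNI.
    GLOBAL_VNID_TO_VNI = {5001: "vni-red", 5002: "vni-blue"}

    # With a locally significant value, the egress NVE consults the table of
    # values it previously allocated and advertised via the control plane.
    LOCAL_CONTEXT_TO_VNI = {17: "vni-red", 18: "vni-green"}

    def resolve_vni(vn_context, globally_scoped):
        table = GLOBAL_VNID_TO_VNI if globally_scoped else LOCAL_CONTEXT_TO_VNI
        return table.get(vn_context)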

> 
>     3.1.4. Tunnel Overlays and Encapsulation options 
> 
>        Once the VN context identifier is added to the frame, a L3 Tunnel 

When using term Context Identifier, capitalize it.


>        encapsulation is used to transport the frame to the destination NVE. 
>        The backbone devices do not usually keep any per service state, 
>        simply forwarding the frames based on the outer tunnel
>        header.

Don't use the "backbone devices" term; use "underlay devices"?

> 
>        Different IP tunneling options (e.g., GRE, L2TP, IPSec) and MPLS 
>        tunneling options (e.g., BGP VPN, VPLS) can be used. 
> 
>     3.1.5. Control Plane Components

This section should be expanded to show the different problem areas
for the control plane (specifically server-to-NVE and NVE-to-oracle).

> 
>        Control plane components may be used to provide the following 
>        capabilities: 
> 
>           . Auto-provisioning/Service discovery 
> 
>           . Address advertisement and tunnel mapping 
> 
>           . Tunnel management

The above really should be expanded a bit. E.g., what does "auto
provisioning" refer to? I don't think a lot is needed, but a sentence
or two per bullet point would help. Also, there are 3 bullet points
here, but there are 4 subsections that follow, and they do not match
the above bullet points.

> 
>        A control plane component can be an on-net control protocol 
>        implemented on the NVE or a management control entity.

What is an "on-net control protocol"?

> 
>     3.1.5.1. Distributed vs Centralized Control Plane 
> 
>        A control/management plane entity can be centralized or distributed. 
>        Both approaches have been used extensively in the past. The routing 
>        model of the Internet is a good example of a distributed approach. 
>        Transport networks have usually used a centralized approach to 
>        manage transport paths.

What is a "transport network"? I'm not sure what these are or why
they "usually" use a centralized approach.

>        It is also possible to combine the two approaches i.e. using a 
>        hybrid model. A global view of network state can have many benefits 
>        but it does not preclude the use of distributed protocols within the 
>        network. Centralized controllers provide a facility to maintain 
>        global state, and distribute that state to the network which in 
>        combination with distributed protocols can aid in achieving greater 
>        network efficiencies, and improve reliability and robustness. Domain 
>        and/or deployment specific constraints define the balance between 
>        centralized and distributed approaches. 
> 
>        On one hand, a control plane module can reside in every NVE. This is 
>        how routing control plane modules are implemented in routers. At the 
>        same time, an external controller can manage a group of NVEs via an 
>        agent in each NVE. This is how an SDN controller could communicate 
>        with the nodes it controls, via OpenFlow [OF] for instance.

Expand SDN on first usage...

> 
>        In the case where a logically centralized control plane is 
>        preferred, the controller will need to be distributed to more than 
>        one node for redundancy and scalability in order to manage a large 
>        number of NVEs. Hence, inter-controller communication is necessary 
>        to synchronize state among controllers. It should be noted that 
>        controllers may be organized in clusters. The information exchanged 
>        between controllers of the same cluster could be different from the 
>        information exchanged across clusters.

This section  does not really capture the oracle discussion we had at
the interim meeting.

>     3.1.5.2. Auto-provisioning/Service discovery 
> 
>        NVEs must be able to identify the appropriate VNI for each Tenant 
>        System. This is based on state information that is often provided by 
>        external entities. For example, in an environment where a VM is a 
>        Tenant System, this information is provided by compute management 
>        systems, since these are the only entities that have visibility of 
>        which VM belongs to which tenant.

Above: it might be better to say this is provided by the VM orchestration
system (and maybe define this term in the terminology section?)

>     3.1.5.3. Address advertisement and tunnel mapping 
> 
>        As traffic reaches an ingress NVE, a lookup is performed to 
>        determine which tunnel the packet needs to be sent to. It is then 

s/sent to/sent on/ ??

>        encapsulated with a tunnel header containing the destination 
>        information (destination IP address or MPLS label) of the egress 
>        overlay node. Intermediate nodes (between the ingress and egress 
>        NVEs) switch or route traffic based upon the outer destination 
>        information.

Use active voice above? And say which entity it is that does the work?
E.g.

        As traffic reaches an ingress NVE, the NVE performs a lookup
        to determine which remote NVE the packet should be sent
        to. The NVE [or Overlay Module???] then adds a tunnel
        encapsulation header containing the destination information
        (destination IP address or MPLS label) of the egress NVE as
        well as an appropriate Context ID.  Nodes on the underlay
        network (between the ingress and egress NVEs) forward traffic
        based solely on the outer destination header information.

>        One key step in this process consists of mapping a final destination 
>        information to the proper tunnel. NVEs are responsible for 
>        maintaining such mappings in their forwarding tables. Several ways 
>        of populating these tables are possible: control plane driven, 
>        management plane driven, or data plane driven.  

Better:

        A key step in the above process consists of identifying the
        destination NVE the packet is to be tunneled to. NVEs are
        responsible for maintaining a set of forwarding or mapping
        tables that hold the bindings between destination VM
        and egress NVE addresses. Several ways of populating these
        tables are possible: control plane driven, management plane
        driven, or data plane driven.

>     3.2. Multi-homing 
> 
>        Multi-homing techniques can be used to increase the reliability of 
>        an nvo3 network. It is also important to ensure that physical 
>        diversity in an nvo3 network is taken into account to avoid single 
>        points of failure. 
> 
>        Multi-homing can be enabled in various nodes, from tenant systems 
>        into TORs, TORs into core switches/routers, and core nodes into DC 
>        GWs. 
> 
>        The nvo3 underlay nodes (i.e. from NVEs to DC GWs) rely on IP 
>        routing as the means to re-route traffic upon failures and/or ECMP 
>        techniques or on MPLS re-rerouting capabilities. 
> 
>        When a tenant system is co-located with the NVE on the same end-
>        system, the tenant system is single homed to the NVE via a vport 
>        that is virtual NIC (vNIC). When the end system and the NVEs
> are

vport/vNIC terminology is not defined (and is somewhat proprietary).

And shouldn't the VAP terminology be used here?

>        separated, the end system is connected to the NVE via a logical 
>        Layer2 (L2) construct such as a VLAN. In this latter case, an end 
>        device or vSwitch on that device could be multi-homed to various 
>        NVEs. An NVE may provide an L2 service to the end system or a l3 
>        service. An NVE may be multi-homed to a next layer in the DC at 
>        Layer2 (L2) or Layer3 (L3). When an NVE provides an L2 service and 
>        is not co-located with the end system, techniques such as Ethernet 
>        Link Aggregation Group (LAG) or Spanning Tree Protocol (STP) can be 
>        used to switch traffic between an end system and connected 
>        NVEs without creating loops. Similarly, when the NVE provides L3 
>        service, similar dual-homing techniques can be used. When the NVE 
>        provides a L3 service to the end system, it is possible that no 
>        dynamic routing protocol is enabled between the end system and the 
>        NVE. The end system can be multi-homed to multiple physically-
>        separated L3 NVEs over multiple interfaces. When one of the 
>        links connected to an NVE fails, the other interfaces can be used to 
>        reach the end system.

The above seems to talk about what I'll call a "distributed NVE
model", where a TES is connected to more than one NVE (or to one NVE
that is somehow distributed). If this document is going to talk about
that as a possibility in the context of multihoming, I think we need
to talk more generally about a TES connected to more than one
NVE. This document doesn't really talk about that at all.

Personally, I'm not sure we should go here. A lot of complexity that I
suspect is not worth the cost. I'd suggest sticking with a single NVE
per TES.

>        External connectivity out of an nvo3 domain can be handled by two or 
>        more nvo3 gateways. Each gateway is connected to a different domain 
>        (e.g. ISP), providing access to external networks such as VPNs or 
>        the Internet. A gateway may be connected to two nodes. When a 
>        connection to an upstream node is lost, the alternative connection 
>        is used and the failed route withdrawn.

For external multihoming, there is no reason to say they are connected
to "different domains". You just want redundancy.

Actually, I'm not sure what point the above is trying to highlight. Is
this a generic requirement for multihoming out of a DC that happens to
be running NVO3 internally? If so, I don't think we need to say
that. It's a given and mostly out of scope.

If we are talking just about "nvo3 gateways", I presume that means
getting in and out of a specific VN. For that, you just need
multihoming for redundancy, and there is no need to talk about
connecting those gateways to "different domains".

>     3.3. VM Mobility 
> 
>        In DC environments utilizing VM technologies, an important feature 
>        is that VMs can move from one server to another server in the same 
>        or different L2 physical domains (within or across DCs) in a 
>        seamless manner. 
> 
>        A VM can be moved from one server to another in stopped or suspended 
>        state ("cold" VM mobility) or in running/active state ("hot" VM 
>        mobility). With "hot" mobility, VM L2 and L3 addresses need to be 
>        preserved. With "cold" mobility, it may be desired to preserve VM L3 
>        addresses. 
> 
>        Solutions to maintain connectivity while a VM is moved are necessary 
>        in the case of "hot" mobility. This implies that transport 
>        connections among VMs are preserved and that ARP caches are updated 
>        accordingly. 
> 
>        Upon VM mobility, NVE policies that define connectivity among VMs 
>        must be maintained. 
> 
>        Optimal routing during VM mobility is also an important aspect to 
>        address. It is expected that the VM's default gateway be as close as 
>        possible to the server hosting the VM and triangular routing be 
>        avoided.

What is meant by "triangular routing" above? Specifically, how is this
a result of mobility vs. a general requirement?

>     3.4. Service Overlay Topologies 
> 
>        A number of service topologies may be used to optimize the service 
>        connectivity and to address NVE performance limitations.  
> 
>        The topology described in Figure 3 suggests the use of a tunnel mesh 
>        between the NVEs where each tenant instance is one hop away from a 
>        service processing perspective. Partial mesh topologies and an NVE 
>        hierarchy may be used where certain NVEs may act as service transit 
>        points. 
> 
>     4. Key aspects of overlay networks 
> 
>        The intent of this section is to highlight specific issues that 
>        proposed overlay solutions need to address. 
> 
>     4.1. Pros & Cons 
> 
>        An overlay network is a layer of virtual network topology on top of 
>        the physical network.  
> 
>        Overlay networks offer the following key advantages: 
> 
>           o Unicast tunneling state management and association with tenant 
>             systems reachability are handled at the edge of the network. 
>             Intermediate transport nodes are unaware of such state. Note 
>             that this is not the case when multicast is enabled in the core 
>             network.

The comment about multicast needs expansion. Multicast (in the
underlay)  is an underlay issue and has nothing to do with
overlays. If Tenant traffic is mapped into multicast service on the
underlay, then there is a connection. If the latter is what is meant,
please add text to that effect.

>           o Tunneling is used to aggregate traffic and hide tenant 
>             addresses from the underkay network, and hence offer the 
>             advantage of minimizing the amount of forwarding state required 
>             within the underlay network 
> 
>           o Decoupling of the overlay addresses (MAC and IP) used by VMs 
>             from the underlay network. This offers a clear separation 
>             between addresses used within the overlay and the underlay 
>             networks and it enables the use of overlapping addresses spaces 
>             by Tenant Systems 
> 
>           o Support of a large number of virtual network identifiers 
> 
>        Overlay networks also create several challenges: 
> 
>           o Overlay networks have no controls of underlay networks and lack 
>             critical network information

Is some text missing from the above bullet?

> 
>                o Overlays typically probe the network to measure link or 
>                  path properties, such as available bandwidth or packet 
>                  loss rate. It is difficult to accurately evaluate network 
>                  properties. It might be preferable for the underlay 
>                  network to expose usage and performance
>                information.

I don't follow the above. Isn't the above a true statement for a host
connected to an IP network as well?

>           o Miscommunication or lack of coordination between overlay and 
>             underlay networks can lead to an inefficient usage of network 
>             resources.

Might be good to give an example.

>           o When multiple overlays co-exist on top of a common underlay 
>             network, the lack of coordination between overlays can lead to 
>             performance issues.

Can you give examples? And how is this different from what we have
today with different hosts "not coordinating" when they use the
network?

>           o Overlaid traffic may not traverse firewalls and NAT
>             devices.

Explain why this is a challenge. I'd argue that is the point: if a FW
is needed, it could be part of the overlay.

>           o Multicast service scalability. Multicast support may be 
>             required in the underlay network to address for each tenant 
>             flood containment or efficient multicast handling. The underlay 
>             may be also be required to maintain multicast state on a per- 
>             tenant basis, or even on a per-individual multicast flow of a 
>             given tenant. 
> 
>           o Hash-based load balancing may not be optimal as the hash 
>             algorithm may not work well due to the limited number of 
>             combinations of tunnel source and destination addresses. Other 
>             NVO3 mechanisms may use additional entropy information than 
>             source and destination addresses. 
>     4.2. Overlay issues to consider 
> 
>     4.2.1. Data plane vs Control plane driven 
> 
>        In the case of an L2NVE, it is possible to dynamically learn MAC 
>        addresses against VAPs.

Rewrite? What does it mean to "learn MAC addresses against VAPs"?

> It is also possible that such addresses be 
>        known and controlled via management or a control protocol for both 
>        L2NVEs and L3NVEs.  
> 
>        Dynamic data plane learning implies that flooding of unknown 
>        destinations be supported and hence implies that broadcast and/or 
>        multicast be supported or that ingress replication be used as 
>        described in section 4.2.3. Multicasting in the underlay network for 
>        dynamic learning may lead to significant scalability limitations. 
>        Specific forwarding rules must be enforced to prevent loops from 
>        happening. This can be achieved using a spanning tree, a shortest 
>        path tree, or a split-horizon mesh. 
> 
>        It should be noted that the amount of state to be distributed is 
>        dependent upon network topology and the number of virtual machines. 
>        Different forms of caching can also be utilized to minimize state 
>        distribution between the various elements. The control plane should 
>        not require an NVE to maintain the locations of all the tenant 
>        systems whose VNs are not present on the NVE. The use of a control 
>        plane does not imply that the data plane on NVEs has to maintain all 
>        the forwarding state in the control plane. 
> 
>     4.2.2. Coordination between data plane and control plane 
> 
>        For an L2 NVE, the NVE needs to be able to determine MAC addresses 
>        of the end systems connected via a VAP. This can be achieved via 
>        dataplane learning or a control plane. For an L3 NVE, the NVE needs 
>        to be able to determine IP addresses of the end systems connected 
>        via a VAP.

Better:

        For an L2 NVE, the NVE needs to be able to determine MAC
        addresses of the end systems connected via a VAP. For an L3
        NVE, the NVE needs to be able to determine IP addresses of the
        end systems connected via a VAP. In both cases, this can be
        achieved via dataplane learning or a control plane.
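
If it helps, dataplane learning here is essentially the following
(hypothetical sketch, names invented):

    # Hypothetical sketch of data plane learning on a VAP, per the text above.
    class Vap:
        def __init__(self):
            self.learned = set()  # MACs (L2 NVE) or IP addresses (L3 NVE)

        def frame_received(self, src_addr, notify_control_plane):
            # If the set of addresses behind this VAP changes, tell the local
            # control plane so it can distribute the update to peer NVEs
            # (the coordination discussed in the next paragraph).
            if src_addr not in self.learned:
                self.learned.add(src_addr)
                notify_control_plane(src_addr)

    vap = Vap()
    vap.frame_received("00:00:5e:00:53:01", lambda addr: print("advertise", addr))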

>        In both cases, coordination with the NVE control protocol is needed 
>        such that when the NVE determines that the set of addresses behind a 
>        VAP has changed, it triggers the local NVE control plane to 
>        distribute this information to its peers. 
> 
>     4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic 
> 
>        There are two techniques to support packet replication needed for 
>        broadcast, unknown unicast and multicast:

s/for broadcast/for tenant broadcast/

>           o Ingress replication 
> 
>           o Use of underlay multicast trees

draft-ghanwani-nvo3-mcast-issues-00.txt describes a third technique.
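
For what it's worth, the difference between the two techniques listed
in the draft is easy to sketch (illustrative only, hypothetical
helpers):

    # Illustrative only: the two replication options named in the draft.
    def ingress_replicate(frame, remote_nves, send_unicast):
        # N unicast copies: no multicast state in the underlay, but ingress
        # bandwidth grows with the number of remote NVEs in the VN.
        for nve_ip in remote_nves:
            send_unicast(nve_ip, frame)

    def underlay_multicast(frame, vn_group_addr, send_multicast):
        # One copy into an underlay multicast tree: cheap at the ingress,
        # but the underlay must maintain per-group (S,G) or (*,G) state.
        send_multicast(vn_group_addr, frame)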

>        There is a bandwidth vs state trade-off between the two approaches. 
>        Depending upon the degree of replication required (i.e. the number 
>        of hosts per group) and the amount of multicast state to maintain, 
>        trading bandwidth for state should be considered. 
> 
>        When the number of hosts per group is large, the use of underlay 
>        multicast trees may be more appropriate. When the number of hosts is 
>        small (e.g. 2-3), ingress replication may not be an issue. 
> 
>        Depending upon the size of the data center network and hence the 
>        number of (S,G) entries, but also the duration of multicast flows, 
>        the use of underlay multicast trees can be a challenge. 
> 
>        When flows are well known, it is possible to pre-provision such 
>        multicast trees. However, it is often difficult to predict 
>        application flows ahead of time, and hence programming of (S,G) 
>        entries for short-lived flows could be impractical. 
> 
>        A possible trade-off is to use in the underlay shared multicast 
>        trees as opposed to dedicated multicast trees. 

>     4.2.4. Path MTU 
> 
>        When using overlay tunneling, an outer header is added to the 
>        original frame. This can cause the MTU of the path to the egress 
>        tunnel endpoint to be exceeded.  
> 
>        In this section, we will only consider the case of an IP overlay. 
> 
>        It is usually not desirable to rely on IP fragmentation for 
>        performance reasons. Ideally, the interface MTU as seen by a Tenant 
>        System is adjusted such that no fragmentation is needed. TCP will 
>        adjust its maximum segment size accordingly. 
> 
>        It is possible for the MTU to be configured manually or to be 
>        discovered dynamically. Various Path MTU discovery techniques exist 
>        in order to determine the proper MTU size to use: 
> 
>           o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981] 
> 
>                o 
>                 Tenant Systems rely on ICMP messages to discover the MTU of 
>                  the end-to-end path to its destination. This method is not 
>                  always possible, such as when traversing middle boxes 
>                  (e.g. firewalls) which disable ICMP for security reasons 
> 
>           o Extended MTU Path Discovery techniques such as defined in 
>             [RFC4821] 
> 
>        It is also possible to rely on the overlay layer to perform 
>        segmentation and reassembly operations without relying on the Tenant 
>        Systems to know about the end-to-end MTU. The assumption is that 
>        some hardware assist is available on the NVE node to perform such 
>        SAR operations. However, fragmentation by the overlay layer can lead 

expand "SAR"

>        to performance and congestion issues due to TCP dynamics and might 
>        require new congestion avoidance mechanisms from then underlay 
>        network [FLOYD]. 
> 
>        Finally, the underlay network may be designed in such a way that the 
>        MTU can accommodate the extra tunneling and possibly additional nvo3 
>        header encapsulation overhead. 
> 
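
One more thought on this section: a small worked example of the MTU
headroom might help readers, e.g. (numbers purely illustrative and not
tied to any particular encapsulation):

    # Illustrative arithmetic only; real overhead depends on the encapsulation.
    underlay_ip_mtu = 1500                   # IP MTU of the underlay path
    outer_ip, outer_udp, ovl_hdr = 20, 8, 8  # hypothetical IP/UDP-based overlay header
    max_inner_frame = underlay_ip_mtu - (outer_ip + outer_udp + ovl_hdr)  # 1464 bytes
    tenant_if_mtu = max_inner_frame - 14     # minus inner Ethernet header -> 1450
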
>     4.2.5. NVE location trade-offs  
> 
>        In the case of DC traffic, traffic originated from a VM is native 
>        Ethernet traffic. This traffic can be switched by a local virtual 
>        switch or ToR switch and then by a DC gateway. The NVE function can 
>        be embedded within any of these elements. 
> 
>        There are several criteria to consider when deciding where the NVE 
>        function should happen: 
> 
>           o Processing and memory requirements 
> 
>               o Datapath (e.g. lookups, filtering, 
>                  encapsulation/decapsulation) 
> 
>               o Control plane processing (e.g. routing, signaling, OAM) and 
>                  where specific control plane functions should be enabled 

missing closing ")"

>           o FIB/RIB size 
> 
>           o Multicast support 
> 
>               o Routing/signaling protocols 
> 
>               o Packet replication capability 
> 
>               o Multicast FIB 
> 
>           o Fragmentation support 
> 
>           o QoS support (e.g. marking, policing, queuing)  
> 
>           o Resiliency 
> 
>     4.2.6. Interaction between network overlays and underlays 
> 
>        When multiple overlays co-exist on top of a common underlay network, 
>        resources (e.g., bandwidth) should be provisioned to ensure that 
>        traffic from overlays can be accommodated and QoS objectives can be 
>        met. Overlays can have partially overlapping paths (nodes and 
>        links). 
> 
>        Each overlay is selfish by nature. It sends traffic so as to 
>        optimize its own performance without considering the impact on other 
>        overlays, unless the underlay paths are traffic engineered on a per 
>        overlay basis to avoid congestion of underlay resources. 
> 
>        Better visibility between overlays and underlays, or generally 
>        coordination in placing overlay demand on an underlay network, can 
>        be achieved by providing mechanisms to exchange performance and 
>        liveliness information between the underlay and overlay(s) or the 
>        use of such information by a coordination system. Such information 
>        may include: 
> 
>           o Performance metrics (throughput, delay, loss, jitter) 
> 
>           o Cost metrics 
> 
>     5. Security Considerations 
> 
>        Nvo3 solutions must at least consider and address the following: 
> 
>           . Secure and authenticated communication between an NVE and an 
>             NVE management system. 
> 
>           . Isolation between tenant overlay networks. The use of per-
>             tenant FIB tables (VNIs) on an NVE is essential. 
> 
>           . Security of any protocol used to carry overlay network 
>             information. 
> 
>           . Avoiding packets from reaching the wrong NVI, especially during 
>             VM moves. 
> 
>