Re: [RTG-DIR] Rtgdir telechat review of draft-ietf-anima-reference-model-07

Brian E Carpenter <brian.e.carpenter@gmail.com> Thu, 30 August 2018 02:06 UTC

To: "Michael H. Behringer" <michael.h.behringer@gmail.com>, Christian Hopps <chopps@chopps.org>, rtg-dir@ietf.org
Cc: anima@ietf.org, draft-ietf-anima-reference-model.all@ietf.org
References: <153529941582.11902.1347468414499836311@ietfa.amsl.com> <6288ec99-fbf6-e2e0-32c3-e402c19fdecd@gmail.com> <82183cb3-eefc-a22c-dcfa-d412733d933b@gmail.com>
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
Message-ID: <0bb46db0-24f4-5396-9180-2adf20186233@gmail.com>
Date: Thu, 30 Aug 2018 14:06:06 +1200
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtg-dir/ZS5meS0iUoZSVfi8bc7sGU-Y73I>

On 2018-08-29 22:39, Michael H. Behringer wrote:
> Christian, thanks for the review, my comments inline...
> 
> On 26/08/2018 22:57, Brian E Carpenter wrote:
>> (Ccs trimmed)
>>
>> Christian,
>>
>> Thanks for this careful review. I'll comment here on the larger issues:
>>
>> On 2018-08-27 04:03, Christian Hopps wrote:
>> ...
>>> Minor Major Issues:
>>>
>>> - Virtualization is mentioned once in "4.2 addressing" section. To quote:
>>>
>>>    TEXT: "Support for virtualization: Autonomic Nodes may support Autonomic
>>>    Service Agents in different virtual machines or containers. The addressing
>>>    scheme should support this architecture."
>>>
>>>    The special casing of VM/containers here seems to indicate that virtual
>>>    devices are not "1st class citizens" in an autonomic network. In particular I
>>>    could easily imagine virtual machines being full blown autonomic nodes
>>>    themselves. Assuming the intent is not to restrict virtual devices in this
>>>    manner, something needs to be said (somewhere) to make that clear.
>> I don't think that was the intention. We haven't really explored this in detail,
>> but I can certainly imagine a deployment (for example) where each tenant in
>> a data centre has its own virtual autonomic network, and the underlying physical
>> network is also autonomic. Since the ACP is expected to be implemented as
>> a VRF, you could even argue that every autonomic network is virtual.
>>
>> So, yes, we can reword this.
> 
> To add to Brian: I agree there was no intention to "downgrade" 
> virtualization. Nor am I aware of any text that indicates that, also not 
> what you quote above. We didn't mean this to say "this is not 
> recommended", only "we haven't explored / documented that further". I am 
> convinced that the Autonomic architecture will be all over data centers 
> one day, so OBVIOUSLY virtualization is important.
> 
> Happy to re-word, but: What do you suggest we change / add? I'm really 
> not clear...

A modest suggestion, on the basis that if Christian read the text differently
from the authors' intentions, other people might do so as well.

Support for virtualization: Autonomic functions can exist either at the
level of the physical network and physical devices, or at the level
of virtual machines, containers and networks. In particular, Autonomic
Nodes may support Autonomic Service Agents in virtual entities. The
infrastructure, including the addressing scheme, should be able to
support this architecture.

(Comment: it isn't just addressing. There are software design
implications too, but that is out of scope here, I think.)

>>> - Robust programming techniques. I think the intention here is to say that the
>>>    design of ASAs must have robustness as a top design principle. I think in
>>>    doing that it should talk about what being robust means; however, it should
>>>    not be talking about how to accomplish that as there are multiple ways to
>>>    achieve this goal.
>>>
>>>    In particular I feel saying that restarting is the *last* thing an ASA should
>>>    do is way overreaching into engineering the solution rather than specifying
>>>    the requirement. Indeed plenty of people think that overly complex recovery
>>>    mechanisms that try everything under the sun to *not* restart often have more
>>>    bugs and are less robust than KISS solutions that "fail" simply but recover
>>>    quickly with minimal or no disruption.
>>>
>>>    I feel this section reads a bit more like someone's idea of how to design a
>>>    robust system instead of talking about what robust means which is the intent I
>>>    believe.
>>>
>>>    Perhaps better is just to focus on robust design ideas (some are already
>>>    stated in the text):
>>>
>>>    - must deal with discovery and negotiation failure as routine.
>>>    - recovering from failures should be minimally disruptive.
>>>    - must not leak resources.
>>>    - must monitor for and deal with hung code.
>>>    - must include security analysis
>> OK. Since I drafted that text, I will leave the document editor to fix
>> it. (Some of the detail probably belongs in another draft specifically
>> about ASAs, which I am editing.)
> 
> Brian: Haha, that's called "passing on the ball" I believe... :-)

Good catch!

> Christian: With your input above, I suggest to reword that paragraph to:
> 
> Since autonomic systems must be self-repairing, it is of great
> importance that ASAs are coded using robust programming techniques.
> All run-time error conditions must be caught, leading to suitable
> minimally disruptive recovery actions (marked >>> minimally disruptive <<<
> as new text), >>> also considering a complete restart of the ASA.<<<
> 
> The other bullets are covered in the text.

OK for me.
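
(Illustration only, not text proposed for the draft.) A minimal Python
sketch of the reworded requirement: every run-time error is caught, a
minimally disruptive recovery is tried first, and a complete restart of
the ASA remains an available option. The names supervise, light_recovery
and restart are hypothetical, and the ordering of recovery actions is an
assumption; the thread deliberately leaves that policy to the implementer.

```python
from typing import Callable, List


def supervise(step: Callable[[], None],
              light_recovery: Callable[[Exception], bool],
              restart: Callable[[], None],
              iterations: int) -> List[str]:
    """Run `step` repeatedly and never let a run-time error escape.

    light_recovery(exc) returns True if it handled the error with
    minimal disruption; otherwise the (hypothetical) ASA is restarted.
    """
    log: List[str] = []
    for _ in range(iterations):
        try:
            step()
            log.append("ok")
        except Exception as exc:  # catch every run-time error condition
            if light_recovery(exc):
                log.append("recovered")
            else:
                restart()  # a complete restart is a legitimate outcome
                log.append("restarted")
    return log
```

An implementation that prefers a simple "fail fast and restart" design
could pass a light_recovery that always returns False.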

>>> - 7.4: When text talks about feedback loop, it mentions "allow the intervention"
>>>    of human admin or control system; however, it then describes the feedback loop
>>>    as presenting default actions and allowing for override. This is fine, but it
>>>    seems to leave out the common case where something is misbehaving and would
>>>    not be presenting any choices to the administrator (using the feedback loop),
>>>    so the admin must forcefully intervene.
>> Yes. I think the word "feedback" is a bad choice. For engineers raised on
>> Nyquist diagrams it is part of a closed loop; for other people it means
>> feedback to humans. The text needs clarifying.
> Christian: Right. I think this may be clearer when we distinguish (more 
> explicitly than is the case now) that there are two different systems 
> involving the NOC:
> 
> 1 - a closed loop, called feedback loop at the moment.
> 2 - a unidirectional error message.
> 
> 1 works like this: Node detects abnormal condition, informs NOC "here is 
> what I see, I will take recovery X at time Y to resolve this". The NOC 
> can then do nothing, or override this action.
> 
> 2 is a simple report, with no suggestion what to do, and no default 
> recovery action. It's the NOC engineer's job to figure out what to do. I 
> think this is the "common case" you refer to. Put differently: A loop as 
> in (1) REQUIRES some form of reaction by the node, otherwise it doesn't 
> belong here.
> 
> Section 7.4 intends to cover case 1 only.
> In my mind, case 2 is like current notification mechanisms (syslog etc), 
> part of "traditional management", thus not covered in this draft.
> 
> So maybe the solution is to point out what I say above: If there is no 
> autonomic recovery option ("not presenting any choices" in your text), 
> it falls into the traditional, non-autonomic scenario. And the draft is 
> clear that such traditional mechanisms will co-exist for a long time still.
> 
> Proposal:
> "
>     Feedback loops are required in an autonomic network to allow the
>     intervention of a human administrator or central control systems,
>     while maintaining a default behaviour.  Through a feedback loop an
>     administrator ***must*** be prompted with a default action, and has the
>     possibility to acknowledge or override the proposed default action.
>     *** Uni-directional notifications to the NOC, that do not propose
>     any default action, and do not allow an override as part of the
>     transaction are considered like traditional notification services,
>     such as syslog. They are expected to co-exist with autonomic methods,
>     but are not covered in this draft.
> "
> 
> Would that be clear?

Again, OK for me.
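
(Illustration only, not text proposed for the draft.) A minimal Python
sketch of case 1 as Michael describes it: the node reports a condition,
announces recovery X to be taken at time Y, and the NOC may acknowledge,
override, or do nothing. The names Proposal, noc_override and resolve are
hypothetical, and it is an assumption of this sketch that an override or
acknowledgement takes effect immediately rather than at the deadline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    condition: str                  # what the node observed
    default_action: str             # recovery X the node intends to take
    deadline: float                 # time Y, in seconds on the node's clock
    override: Optional[str] = None  # action supplied by the NOC, if any
    acknowledged: bool = False      # NOC explicitly accepted the default


def noc_override(p: Proposal, action: str, now: float) -> bool:
    """Replace the default action; only accepted before the deadline."""
    if now >= p.deadline:
        return False
    p.override = action
    return True


def resolve(p: Proposal, now: float) -> Optional[str]:
    """Return the action to execute now, or None if the node still waits."""
    if p.override is not None:
        return p.override
    if p.acknowledged or now >= p.deadline:
        return p.default_action
    return None
```

A report with no default_action and no way to call noc_override would be
case 2, the uni-directional notification that the proposal leaves to
traditional mechanisms.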

>>> Minor Issues:
>>>
>>> - 6.1 TEXT: "It must be possible to run ASAs as non-privileged (user space)
>>>    processes except for those (such as the infrastructure ASAs) that necessarily
>>>    require kernel privilege. Also, it is highly desirable that ASAs can be
>>>    dynamically loaded on a running node."
>>>
>>>    ISSUE: Discussing implementation details like user-space, kernel privilege and
>>>    dynamic loading seems unnecessary and outside the scope of this document. Does
>>>    this document care if I implement my ASA on a real-time architecture with no
>>>    "user space" etc..?
>> Fair enough. See my above comment re robustness.
> 
> I see where you're coming from, and sort of agree, and don't :-) 
> Formally you're absolutely right, implementation details don't belong 
> here. However, we had a large number of exchanges on user vs kernel 
> space, which had design consequences. I feel it would be useful to leave 
> this point in the document.

Yes, because we want to insist that ASAs will be written not by kernel
programmers but ordinary humans ;-).

Whatever happens to the text here, the idea is that the complete
discussion of ASAs will be in draft-carpenter-anima-asa-guidelines,
assuming that the WG adopts it.

Regards
    Brian

>> I'll leave the rest of your comments to the document editor.
>>
>> Regards
>>      Brian
>>
>>> - 4.6 Why call out global routing and overlay networks in particular? Is the
>>>    real intention to just say that the ACP implementation is not restricted to
>>>    any specific type of networking?
> I see that this can cause confusion. It also dates back to a looong 
> discussion on what the ACP should look like. Frankly, I think we have 
> all converged on the overlay model. So while the global routing table 
> model is theoretically an option, we should probably drop it here. 
> Suggestion:
> 
> "
>     The "Autonomic Control Plane" carries the control protocols in an
>     autonomic network.  In the architecture described here, it is
>     implemented as an overlay network.  The document "An
>     Autonomic Control Plane" ([I-D.ietf-anima-autonomic-control-plane])
>     describes the implementation details suggested here.  See
>     [I-D.ietf-anima-stable-connectivity] for use cases for the ACP.
> "
>>>
>>> - TEXT: 6.3.1.2 "on a given LAN"
>>>
>>>    NIT: Everyone knows what a LAN is; however, I wonder if the text should be
>>>    more generic and actually describe what it really requires here which is a
>>>    broadcast or multicast network?
> replace with "... on a given broadcast or multicast network." OK?
>>>
>>> Questions/Comments:
>>>
>>> - QUESTION: IoT and node requirements. There are a couple of node ASA
>>>    requirements. I found myself wondering if very simple IoT things like
>>>    thermostats might ever be an AN and, if so, did they all really need to
>>>    have join assistant ASAs? It
>>>    could be that the answer is "Yes, they do or they can't be nodes". I was just
>>>    curious.
> We had that discussion a lot. 3.1 states that "At a later stage
>     ANIMA may define a scope for constrained nodes with a reduced ANI and
>     well-defined minimal functionality.  They are currently out of scope."
> 
> All sorts of things may happen: You may have devices that are hard-coded 
> to run in a domain, so they don't need to join (very specific cases 
> only, IMO). Or they have a SIM card (or similar), and don't need 
> enrolment (much more likely to happen). Many more... :-)
>>>
>>> - COMMENT: For the types of ASAs: simple (run anywhere), complex (resource
>>>    restricted), and infra (run everywhere), I was reminded of Kubernetes/cloud
>>>    orchestration, and the concept of DaemonSets (pods that run everywhere) and
>>>    Deployments (pods that can run anywhere, possibly be scaled replicated, and
>>>    may also have requirements that restrict where they can run). I imagine that
>>>    folks in Anima have also looked at this, but if not it would be good to as
>>>    they seem to be solving very similar problems.
> Will do! Good point!
>>> Nits:
>>>
>>> - TEXT: 3.2 "However, the information is tracked independently of the status of
>>>    the peer nodes; specifically, it contains information about non-enrolled
>>>    nodes, nodes of the same and other domains. "
>>>
>>>    QUESTION: What are peer nodes? Is this another name for adjacent nodes? If so
>>>    "s/peer/adjacent/".
> We need to keep "peer" here. The adjacency table may have peers that are 
> not adjacent.
>>>
>>> - TEXT: 3.3.1 "enrols"
>>>    CHANGE: "enrolls"
> Being a non-native English speaker, I believe this is US vs GB English, 
> and I leave others to sort that :-)
>>> - TEXT: 3.3.3 "In this state, the autonomic node has at least one ACP channel to
>>>    another device. It can participate in further autonomic transactions, such as
>>>    starting autonomic service agents. For example it must now enable the join
>>>    assistant ASA, to help other devices to join the domain.
>>>
>>>    NIT: "For example foo" is not a sentence on its own, also "It" is not a good
>>>    subject as there are multiple nouns in the previous sentence that could serve
>>>    as antecedents.
>>>
>>>    SUGGEST: 3.3.3 "In this state, the autonomic node has at least one ACP channel
>>>    to another device. The node can now participate in further autonomic
>>>    transactions, such as starting autonomic service agents (e.g., it must now
>>>    enable the join assistant ASA, to help other devices to join the domain).
> Will do, thanks.
>>>
>>> - TEXT: 4.1 "Names are typically assigned by a Registrar at bootstrap time and
>>>    persistent over the lifetime of the device."
>>>
>>>    NIT: s/persistent/and persist/
> I leave that to the RFC Editor to decide. To me our version is not 
> wrong... Again, not native ....
>>>
>>> - TEXT: "Out of scope are addressing approaches for the data plane of the
>>>    network, which may be configured and managed in the traditional way, or
>>>    negotiated as a service of an ASA. One use case for such an autonomic function
>>>    is described in [I-D.ietf-anima-prefix-management]."
>>>
>>> - NIT: Sounds sort of Yoda-like, and the compounding makes things less clear.
> Yoda I like! :-)
>>>    SUGGEST: "Addressing approaches for the data plane of the network are outside
>>>    the scope of this document. These addressing approaches may be configured and
>>>    managed in the traditional way, or negotiated as a service of an ASA. One use
>>>    case for such an autonomic function is described in
>>>    [I-D.ietf-anima-prefix-management]."
> Will change.
>>>
>>> - TEXT: 6.1: "Following an initial discovery phase, the device properties and
>>>    those of its neighbors are the foundation of the behavior of a specific
>>>    device. A device and its ASAs have no pre-configuration for the particular
>>>    network in which they are installed."
>>>
>>>    NIT: Why suddenly lose the "node" abstraction and start talking about devices
>>>    here? I think it continues to work well to say "node" (e.g., "node
>>>    properties", "specific node" and "A node and its ASAs...").
> OK, will change to "node" - you're right.
>>>
>>> - TEXT: 6.2 "install ASA: copy the ASA code onto the host and start it,"
>>>    NIT: "s/host/node/"
>>>
> OK.
> 
> Thanks for the thorough review Christian! I'll wait to see if anyone has an
> issue with those suggestions and, if not, edit the draft accordingly.
> 
> thanks!
> Michael
> 
>