Re: [Anima-bootstrap] BRSKI State Machine

Thanks for the detailed review notes! They are much appreciated and very timely. I’ll be spending time this week addressing them. 

Responding to the higher-level discussion inline:

> On Oct 14, 2016, at 8:42 AM, Michael Behringer (mbehring) <mbehring@cisco.com> wrote:
> 
> Hi Folks, 
> 
> You know that I'm doing a complete thorough top-to-bottom review on the brski draft, but I'm only half-way through right now. (Yes, I'm taking it seriously ;-) 
> 
> I'm bringing forward here a single topic that I think is fairly important, so that we can start discussion about that. And that is the state machine. My high-level observation is that I think the draft isn't precise enough yet to allow for independent, interoperable implementations. There are too many "loose ends". 
> 
> So, I started looking through the state machine (figure 3), and thought this through in more detail. 
> 
> * First of all, one thing isn't coming out clearly (it's there, but somehow not obvious at all): We have three "paths" through the algorithm, and it is the *pledge* that has "hard coded" which paths we're taking: 
> 
> 1) join any domain (first come first join)  
>   --> No MASA required

For the record: I consider this a security vulnerability, but accept that it will take a number of high-profile attacks before folks come around to agreeing with me. ;) I recommend against this. 

> 2) require audit token 
>   --> MASA required, audit mode
> 3) require authentication token 
>   --> MASA required, ownership tracking mode
> 
> [I really hope we agree on that!!!]

Agreed. 

That said, #2 and #3 could be seen as a single path with slightly different information in the message from the MASA server, but we’d quickly be in the weeds of the message format if we got into that here. 
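
As a rough illustration of the three paths listed above, here is a minimal Python sketch of how a pledge's hard-coded mode might be represented. The names EnrollmentMode, masa_mode and requires_masa are hypothetical and do not come from the draft; the sketch only encodes which paths need a MASA and in which mode.

```python
from enum import Enum
from typing import Optional


class EnrollmentMode(Enum):
    """Hypothetical labels for the three paths; not names from the draft."""
    JOIN_ANY = 1              # join any domain (first come, first join)
    AUDIT_TOKEN = 2           # require audit token
    AUTHENTICATION_TOKEN = 3  # require authentication token


# MASA mode per path, as listed in the thread; None means no MASA is required.
_MASA_MODE = {
    EnrollmentMode.JOIN_ANY: None,
    EnrollmentMode.AUDIT_TOKEN: "audit",
    EnrollmentMode.AUTHENTICATION_TOKEN: "ownership tracking",
}


def masa_mode(mode: EnrollmentMode) -> Optional[str]:
    """Return the MASA mode for a path, or None if no MASA is involved."""
    return _MASA_MODE[mode]


def requires_masa(mode: EnrollmentMode) -> bool:
    """Only the first-come-first-join path works without a MASA."""
    return masa_mode(mode) is not None
```

Treating #2 and #3 as a single path would amount to collapsing the two token modes into one member and carrying the difference in the MASA response instead.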

> This needs to come out much more clearly. Should this "hard coded" behaviour be changeable under certain conditions? (Don't think so, but...) 
> The knee-jerk reaction would be to put this under 3.1, but I think it's more important than that! It should be explained very early, somewhere in 1), maybe in  1.2. Happy to write up some text if the team wants me to (and if we agree ;-) 
> 
> * When you try to do a state machine with figure 3, there are a few things that don't quite gel. Main points are: 
> 
> - "Identity" isn't really a state in itself. I would argue a pledge USES its identity in the next step. 

From a protocol perspective the pledge completes authentication as part of the TLS handshake and only after that is complete does it ‘request join’. So I called these distinct states. I don’t feel strongly about it though and am open to combining these states. 

> - I think we need to bring out more strongly that the state machine needs to track peer and domain. Because, if there is a failure, the pledge should, depending on the failure of course, not try the same domain again, and probably not the same peer either. This isn't coming out today. 
> In fact, this is why I liked the "adjacency table" so much that I presented in Berlin (and before): Because there you see much more clearly that, if enrolment fails with peer x, you may just move to the next one. As mentioned it's all there, but to a new reader this won't come out clearly, I'm afraid.

Yeah, I can see your point that this is buried in the text of 3.1.1, where it is implied that there is a list of “services returned during each query” and that on failure the list processing “picks up where it left off”, but that’s pretty subtle. 
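
To make the peer/domain tracking less subtle, here is a minimal Python sketch of the idea, under stated assumptions: candidates are (peer, domain) pairs from discovery, a failure always rules out that peer, and it rules out the whole domain only when the caller says so. CandidateTracker and its methods are hypothetical names, not anything defined in the draft.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Set, Tuple

Candidate = Tuple[str, str]  # (peer, domain); an assumed representation


@dataclass
class CandidateTracker:
    """Remembers which peers and domains already failed for this pledge."""
    failed_peers: Set[str] = field(default_factory=set)
    failed_domains: Set[str] = field(default_factory=set)

    def record_failure(self, peer: str, domain: str,
                       reject_domain: bool = False) -> None:
        # A failed peer is never retried; the domain is excluded only
        # when the failure applies to the domain as a whole.
        self.failed_peers.add(peer)
        if reject_domain:
            self.failed_domains.add(domain)

    def remaining(self, candidates: Iterable[Candidate]) -> Iterator[Candidate]:
        """Yield candidates in order, skipping failed peers and domains."""
        for peer, domain in candidates:
            if peer in self.failed_peers or domain in self.failed_domains:
                continue
            yield peer, domain
```

For example, after a failure with one peer the pledge simply moves to the next entry, and after a domain-level rejection every remaining peer in that domain is skipped.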

> - We may want a "reason for rejection" if the domain rejects a device (for all negative cases). In some cases, it could be a "wait a minute, I'm currently overloaded", in others "we don't like you in this domain", or "your enrolment mode (see first point) is not acceptable". 
> In "real life" this would allow some visual feedback at the install site, so that the engineer knows whether he should wait or can go. 
> [note: there may be security reasons to NOT give a reason for rejection, need to think more about this]

I think here we need to provide information about what happened. This is why s5.4 exists: to have the pledge send telemetry back to the network that attempted bootstrapping. 

But note this is from the pledge to the domain. The device is assumed to be headless/zero-touch, etc., so I wasn’t thinking in terms of sending error messages to it. I’m open to doing so, though.
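
If we did send a reason to the pledge, a minimal Python sketch of how it might react could look like this. The three reasons are the examples quoted above; the RejectReason and PledgeAction names and the mapping from reason to action are assumptions for illustration, and nothing like this is defined in the draft today.

```python
from enum import Enum


class RejectReason(Enum):
    """The three example reasons from the thread; hypothetical names."""
    OVERLOADED = "wait a minute, I'm currently overloaded"
    NOT_WELCOME = "we don't like you in this domain"
    MODE_NOT_ACCEPTABLE = "your enrolment mode is not acceptable"


class PledgeAction(Enum):
    """What the pledge (or the engineer on site) does next; assumed."""
    WAIT_AND_RETRY = "wait"
    MOVE_ON = "go"


def action_for(reason: RejectReason) -> PledgeAction:
    # Assumption: only a temporary overload is worth waiting for;
    # the other rejections will not change by retrying the same domain.
    if reason is RejectReason.OVERLOADED:
        return PledgeAction.WAIT_AND_RETRY
    return PledgeAction.MOVE_ON
```

The security question raised above still applies: a domain that prefers not to reveal why it rejected a device could simply return no reason at all.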

> - I didn't quite like "imprint" as a state either. To me, the next logical state was "validation". see attached ppt for more details. But bottom line, we need to reflect the 3 "paths" through the algorithm here again. 

“validation” is a fine thing to call that state. 

> 
> - And finally, I suggest we rename "being managed" to "enrolled". Reason is: I'm also drawing up a complete state machine for an ANIMA node, and there I think the main "transition point" between BRSKI and ACP is when the device is "enrolled". Thus I suggest to call the final state in BRSKI "Enrolled", and the first one in ACP the same. (Besides, "being managed" doesn't sound right when we're talking about a fully autonomic device.)

I think there is a distinction between “obtaining an identity on the domain” and “what I do after I have an identity to be engaged with the domain”. So there are two states here. But yes, “being managed” could be “on the domain” or something. 
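
Pulling the naming discussion together, here is a minimal Python sketch of the pledge states with the names floated in this thread. It is an illustration only: the DISCOVER entry state, the strictly linear ordering, and the rule that any failure returns the pledge to discovery are all assumptions, not text from the draft.

```python
from enum import Enum, auto


class State(Enum):
    """State names as discussed in this thread; the ordering is assumed."""
    DISCOVER = auto()      # assumed entry state: find a proxy/registrar
    IDENTITY = auto()      # authentication during the TLS handshake
    REQUEST_JOIN = auto()  # sent only after the handshake completes
    VALIDATION = auto()    # called "imprint" in figure 3
    ENROLLED = auto()      # obtained an identity on the domain
    ON_DOMAIN = auto()     # "being managed" in figure 3

_ORDER = list(State)  # Enum iteration preserves definition order


def next_state(state: State, ok: bool = True) -> State:
    """Advance one step on success; fall back to discovery on failure."""
    if not ok:
        # Assumption: a failed step sends the pledge back to discovery
        # so it can try the next peer or domain.
        return State.DISCOVER
    index = _ORDER.index(state)
    return _ORDER[min(index + 1, len(_ORDER) - 1)]
```

Merging IDENTITY into REQUEST_JOIN, as suggested earlier, would just remove one member; keeping ENROLLED and ON_DOMAIN separate reflects the two-state distinction above.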

> 
> In the attached ppt I made those few changes, and I marked with a red star, where I think we need more work before any last call, apart from what  I already mentioned: 
> 
> - we need to specify precisely the discovery method, with mDNS field names, and other details. In my head we're using mDNS here, and I *think* we agreed on that? 

Yes, with the understanding that the proxy-to-registrar connection SHOULD be discovered using GRASP for ACP devices. 

> But, we'll need the same method also for the ACP draft: When both nodes have a certificate, they need to discover each other as well. 
> I've been haggling with Toerless about this :-)   I think we should take the mDNS insecure discovery into a separate, new draft.

I don’t follow. mDNS simply *is* insecure. This is important since we can’t establish a secure discovery yet. 

> This is likely very short, BUT: I think it doesn't really belong in the BRSKI draft (specifically if we use BRSKI also for non-ANIMA environments), neither in the ACP draft (because we also need it in BRSKI). Having a separate draft would be very clean. However I understand (when pushed hard) we may not want to do this for admin reasons. 
> Alternatively, we specify the discovery in the ACP draft, and BRSKI refers to it. I like this less, but will not scream murder if others insist. 

I think discovery of the proxy must be in this draft. I’m happy to move the proxy’s discovery of the registrar to another draft, but I think it’s OK to recommend GRASP for that connection, so I don’t see a problem with that. 

- max

> 
> So much for now. Still on the full review, but this is pretty high level, and pretty fundamental. Happy to help with text and/or ASCII art if we decide to take on some of these points. 
> 
> Michael
> 
> 
> <brski state machine.pptx>