Re: [Anima-bootstrap] BRSKI State Machine

> -----Original Message-----
> From: Max Pritikin (pritikin)
> Sent: 18 October 2016 01:05
> To: Michael Behringer (mbehring) <mbehring@cisco.com>
> Cc: anima-bootstrap@ietf.org
> Subject: Re: [Anima-bootstrap] BRSKI State Machine
> 
> Thanks for the detailed review notes! They are much appreciated and very
> timely. I’ll be spending time this week addressing them.
> 
> Responding to the higher level discussion inline,
> 
> > On Oct 14, 2016, at 8:42 AM, Michael Behringer (mbehring)
> <mbehring@cisco.com> wrote:
> >
> > Hi Folks,
> >
> > You know that I'm doing a complete thorough top-to-bottom review on
> > the brski draft, but I'm only half-way through right now. (Yes, I'm
> > taking it seriously ;-)
> >
> > I'm bringing forward here a single topic that I think is fairly important, so
> that we can start discussion about that. And that is the state machine. My
> high-level observation is that I think the draft isn't precise enough yet to
> allow for independent, interoperable implementations. There are too many
> "loose ends".
> >
> > So, I started looking through the state machine (figure 3), and thought this
> through in more detail.
> >
> > * First of all, one thing isn't coming out clearly (it's there, but somehow not
> obvious at all): We have three "paths" through the algorithm, and it is the
> *pledge* that has "hard coded" which paths we're taking:
> >
> > 1) join any domain (first come first join)
> >   --> No MASA required
> 
> For the record: I consider this a security vulnerability but accept that it will
> take a number of high profile attacks before folks come around to agreeing
> with me. ;) I recommend against this.
> 
> > 2) require audit token
> >   --> MASA required, audit mode
> > 3) require authentication token
> >   --> MASA required, ownership tracking mode
> >
> > [I really hope we agree on that!!!]
> 
> Agreed.
> 
> Where #2 and #3 could be seen as a single path with slightly different
> information in the message from the MASA server; but we’d quickly be
> into the weeds of the msg format if we get into that here.

So this discussion is important to get right before we go to last call on this document, since a lot of the details depend on those high-level choices. I suggest we try to get consensus on how to deal with those 3 cases asap.

I agree with your observation on 1. We have to make a conscious choice between a) being less secure but quickly deployable, or b) being more secure but requiring more vendor support before being able to "fly". My point is, we're not really making this choice today; while we sort of try to steer away from 1), we really allow everything. If we really want to disallow option 1, we should say so. If not, my feeling after reading the whole thing is: we MUST explain all options very clearly in parallel, else there are too many variants and loose ends.

I'm not sure we can combine 2 and 3 into a single case, although it would be desirable of course. 

In 2) the pledge MUST create and send a nonce (this is explained in 5.1). You cannot audit without a nonce. Freshness is a MUST for audit events. 
In 3) the nonce is optional. And we have a deployment model where we request all authorization tokens up front (potentially at time of purchase), so this will be used. And it is the pledge who decides whether to send a nonce or not, and if it does, the response MUST be fresh. (correct me if I'm wrong). So, really, we actually even have case 3a) and 3b), and we MUST explain that in the "behaviour of a new device". We need to be very clear and explicit... 

Therefore, I think, we probably must separate both cases in the draft.... Need to think more. It would be GREAT if we could combine them, as you suggested... 
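
To make the cases concrete, here is a minimal sketch of how the hard-coded pledge mode and the nonce rules above could be written down. All names (PledgeMode, masa_required, nonce_policy, response_must_be_fresh) are illustrative assumptions and do not come from the draft. The nonce handling for case 1 is not discussed above, so the sketch simply treats it as "not applicable".

```python
from enum import Enum


class PledgeMode(Enum):
    """Hypothetical names for the three hard-coded pledge paths."""
    JOIN_ANY = 1      # 1) join any domain, no MASA required
    AUDIT_TOKEN = 2   # 2) MASA required, audit mode
    AUTH_TOKEN = 3    # 3) MASA required, ownership tracking mode


def masa_required(mode: PledgeMode) -> bool:
    # Only the "first come first join" path works without a MASA.
    return mode is not PledgeMode.JOIN_ANY


def nonce_policy(mode: PledgeMode) -> str:
    # 2) nonce is a MUST; 3) nonce is optional (cases 3a and 3b).
    if mode is PledgeMode.AUDIT_TOKEN:
        return "required"
    if mode is PledgeMode.AUTH_TOKEN:
        return "optional"
    return "not applicable"  # assumption: case 1 has no MASA token at all


def response_must_be_fresh(mode: PledgeMode, nonce_sent: bool) -> bool:
    # Audit mode without a nonce is not a valid request.
    if mode is PledgeMode.AUDIT_TOKEN and not nonce_sent:
        raise ValueError("audit mode requires a nonce")
    if mode is PledgeMode.JOIN_ANY:
        return False  # assumption: nothing from the MASA to be fresh
    # If the pledge chose to send a nonce, the response MUST be fresh.
    return nonce_sent
```

Written this way, 2) and 3) differ only in the nonce policy, which is roughly what a combined case would have to capture.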

> > This needs to come out much more clearly. Should this "hard coded"
> > behaviour be changeable under certain conditions? (Don't think so,
> > but...) The knee-jerk reaction would be to put this under 3.1, but I
> > think it's more important than that! It should be explained very
> > early, somewhere in 1), maybe in  1.2. Happy to write up some text if
> > the team wants me to (and if we agree ;-)
> >
> > * When you try to do a state machine with figure 3, there are a few things
> that don't quite gel. Main points are:
> >
> > - "Identity" isn't really a state in itself. I would argue a pledge USES its
> identity in the next step.
> 
> From a protocol perspective the pledge completes authentication as part of
> the TLS handshake and only after that is complete does it ‘request join’. So I
> called these distinct states. I don’t feel strongly about it though and am open
> to combining these states.

But it is a single TLS connection? (I'm not yet entirely clear on the latest protocol discussions)

I think: If we really have two separate TLS connections, then we can split; but if it's just different transactions within a single TLS connection, I would call it one state. 

> > - I think we need to bring out more strongly that the state machine needs
> to track peer and domain. Because, if there is a failure, the pledge should,
> depending on the failure of course, not try the same domain again, and
> probably not the same peer either. This isn't coming out today.
> > In fact, this is why I liked the "adjacency table" so much that I presented in
> Berlin (and before): Because there you see much clearer that, if enrolment
> fails with peer x, you may just move to the next one. As mentioned it's all
> there, but to a new reader this won't come out clearly, I'm afraid.
> 
> Yeah, I can see your point that this is buried in the text of 3.1.1 where it is
> implied that there is a list of “services returned during each query” and in
> failure the list processing “picks up where it left off” but that’s pretty subtle.

Yes, I think we could be more explicit; discovery finds a set of nodes with different characteristics, and we need to explain how to navigate through this set. 
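
As a rough illustration of what "navigating through this set" could mean, here is a minimal sketch that walks the discovered candidates and remembers which peers and domains have already failed. The Candidate type, the try_candidates function and the result strings ("enrolled", "domain_rejected", anything else treated as a peer failure) are assumptions made for this example only; the draft does not define them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass(frozen=True)
class Candidate:
    peer: str    # hypothetical identifier of the discovered peer/proxy
    domain: str  # hypothetical identifier of the domain behind it


def try_candidates(
    candidates: Iterable[Candidate],
    attempt: Callable[[Candidate], str],
) -> Optional[Candidate]:
    """Try each candidate in order, skipping known-bad peers and domains."""
    failed_peers: Set[str] = set()
    rejected_domains: Set[str] = set()
    for cand in candidates:
        if cand.peer in failed_peers or cand.domain in rejected_domains:
            continue  # do not retry a peer or domain that already failed
        result = attempt(cand)
        if result == "enrolled":
            return cand
        if result == "domain_rejected":
            rejected_domains.add(cand.domain)  # same domain: don't try again
        else:
            failed_peers.add(cand.peer)        # same peer: move to the next
    return None  # list exhausted without enrolment
```

The point of tracking both sets is the one made above: depending on the failure, the pledge should not try the same domain again, and probably not the same peer either.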

> > - We may want a "reason for rejection" if the domain rejects a device (for
> all negative cases). In some case, it could be a "wait a minute, I'm currently
> overloaded", in others "we don't like you in this domain", or "your enrolment
> mode (see first point) is not acceptable".
> > In "real life" this would allow some visual feedback at the install site, so that
> the engineer knows whether he should wait or can go.
> > [note: there may be security reasons to NOT give a reason for
> > rejection, need to think more about this]
> 
> I think here we need to provide information about what happened. This is
> why s5.4 exists to have the pledge send telemetry back to the network that
> attempted bootstrapping.
> 
> But note this is from the pledge to the domain. The device is assumed to be
> headless/zero-touch etc so I wasn’t thinking in terms of sending error
> messages to it. I’m open to doing so though.

Indeed. I think we can probably find security reasons why we don't want too much information back to the pledge, ideally just yes/no. 

However, I think there are three cases an installing person needs to distinguish: 
green: device is enrolled; you can go!
red: device was rejected; stick a big label on the device "rejected" and send to NOC. Try another device, or go home.  
yellow: there is a temporary issue; wait. 

And we should at least provide that level of information back to the pledge. 
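
A minimal sketch of that three-level feedback, assuming the registrar's detailed reasons are collapsed into green/yellow/red before anything reaches the pledge. The outcome strings and the names InstallerSignal and signal_for are hypothetical, chosen only to mirror the rejection examples mentioned earlier in this thread.

```python
from enum import Enum


class InstallerSignal(Enum):
    GREEN = "enrolled"
    YELLOW = "temporary issue, wait"
    RED = "rejected"


# Hypothetical outcome labels; only the coarse signal would go to the pledge.
_SIGNAL_FOR_OUTCOME = {
    "enrolled": InstallerSignal.GREEN,
    "overloaded": InstallerSignal.YELLOW,
    "not_wanted_in_domain": InstallerSignal.RED,
    "enrolment_mode_not_acceptable": InstallerSignal.RED,
}


def signal_for(outcome: str) -> InstallerSignal:
    """Collapse a detailed outcome into the signal shown at the install site."""
    return _SIGNAL_FOR_OUTCOME[outcome]
```

Keeping the detailed reason on the domain side and exposing only the colour would fit the concern about not giving too much information back to the pledge.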

> > - I didn't quite like "imprint" as a state either. To me, the next logical state
> was "validation". see attached ppt for more details. But bottom line, we
> need to reflect the 3 "paths" through the algorithm here again.
> 
> 
> “validation” is a fine thing to call that state.
> 
> >
> > - And finally, I suggest we rename "being managed" to "enrolled".
> > Reason is: I'm also drawing up a complete state machine for an ANIMA
> > node, and there I think the main "transition points" between BRSKI and
> > ACP is when the device is "enrolled". Thus I suggest to call the final
> > state in BRSKI "Enrolled", and the first one in ACP the same.
> > (Besides, "being managed" doesn't sound right when we're talking a
> > fully autonomic device.)
> 
> I think there is a distinction between “obtaining an identity on the domain”
> and “what i do after I have an identity to be engaged with the domain”. So
> there are two states here. But yes, “being managed” could be “on the
> domain” or something.

Agree, but I think "what I do after I have an identity" is out of scope for this doc. (we have some informal section on this, 4.5, which is probably useful). 

> > In the attached ppt I made those few changes, and I marked with a red
> star, where I think we need more work before any last call, apart from what  I
> already mentioned:
> >
> > - we need to specify precisely the discovery method, with mDNS field
> names, and other details. In my head we're using mDNS here, and I *think*
> we agreed on that?
> 
> yes. with understanding that the proxy to registrar SHOULD be discovered
> using GRASP for ACP devices.

And that needs to be written somewhere in BRSKI, agreed? We need a section on how the proxy discovers a Registrar; Registrar selection, etc. Sigh... More work!

> > But, we'll need the same method also for the ACP draft: When both nodes
> have a certificate, they need to discover each other as well.
> > I've been haggling with Toerless about this :-)   I think we should take the
> mDNS insecure discovery into a separate, new draft.
> 
> I don’t follow. mDNS simply *is* insecure. This is important since we can’t
> establish a secure discovery yet.

That wasn't my point; I wanted to say that we need to specify exactly which discovery methods are used in BRSKI and ACP.  

> > This is likely very short, BUT: I think it doesn't really belong in the BRSKI
> draft (specifically if we use BRSKI also for non-ANIMA environments), neither
> in the ACP draft (because we also need it in BRSKI). Having a separate draft
> would be very clean. However I understand (when pushed hard) we may not
> want to do this for admin reasons.
> > Alternatively, we specify the discovery in the ACP draft, and BRSKI refers to
> it. I like this less, but will not scream murder if others insist.
> 
> I think discovery of the proxy must be in this draft. I’m happy to move the
> proxy’s discovery of the registrar to another draft but I think it’s ok to
> recommend GRASP for that connection so I don’t see a problem with that.

I suggest we leave that in the BRSKI draft. 

Thanks for your quick response! 
Michael

> - max
> 
> >
> > So much for now. Still on the full review, but this is pretty high level, and
> pretty fundamental. Happy to help with text and/or ASCII art if we decide to
> take on some of these points.
> >
> > Michael
> >
> >
> > <brski state machine.pptx>
> > _______________________________________________
> > Anima-bootstrap mailing list
> > Anima-bootstrap@ietf.org
> > https://www.ietf.org/mailman/listinfo/anima-bootstrap