Re: [Ipoverib] Comments on draft-ietf-ipoib-link-multicast-04.txt

Vivek Kashyap <vivk@us.ibm.com> Wed, 07 May 2003 16:01 UTC

Subject: Re: [Ipoverib] Comments on draft-ietf-ipoib-link-multicast-04.txt
To: "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
Cc: ipoverib@ietf.org, ipoverib-admin@ietf.org, Roy Brabson <rbrabson@us.ibm.com>
Message-ID: <OFFFA16205.D161C7A5-ON88256D1F.00557D51@us.ibm.com>
From: Vivek Kashyap <vivk@us.ibm.com>
Date: Wed, 07 May 2003 09:01:04 -0700

See below between <VK>

--
Vivek Kashyap
Linux Technology Center, IBM
vivk@us.ibm.com
kashyapv@us.ibm.com
Ph: 503 578 3422 T/L: 775 3422



"H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
Sent by: ipoverib-admin@ietf.org
05/06/03 21:55
Please respond to "H.K. Jerry Chu"

To:       Roy Brabson/Raleigh/IBM@IBMUS
cc:       ipoverib@ietf.org
Subject:  Re: [Ipoverib] Comments on draft-ietf-ipoib-link-multicast-04.txt




>> >- Section 9.0 indicates that "The packet SHOULD be forwarded to locally
>> >  connected routers.  [snip] The specific mechanism for a sender to
>> >  forward packets to routers are left to implementations".  At a
>> >  minimum, I think the exact implementation needs to be specified.
>>
>> Why? (The draft strives to avoid dictating implementation details that
>> do not affect interoperability.)
>
>For interoperability.  The requirement is the host and router must agree
>on using a common mechanism for exchanging the IP multicast packet.
>Implementations are free to choose a mechanism as long as all other
>implementations with which they communicate also choose to implement the
>same mechanism.  I don't think this is an area where we can or should punt
>- a minimal, common approach needs to be mandated such that all IPoIB
>nodes implement that mechanism so that any two IPoIB nodes will
>interoperate.

I still don't see any problem you may have in mind. As long as a sender
gets the pkt to the router, why does it matter how it gets there? E.g.,
routers will listen to both the broadcast and the all-router multicast
addresses. Does it matter which address a pkt is sent to?

<VK> As I noted in my previous mail, the onus is on the router. The
router is required to listen for multicast packets on both: a) the
all-router MGID and b) the broadcast-GID.

We could make that explicit in the draft. The second option, which
Roy apparently favours, is to specify that the nodes must, if the relevant
MGID doesn't exist, send the packet to the all-router MGID. This option
is fine, but it then adds another requirement:

   All IPoIB nodes MUST request creation reports for the all-router MGID
   whether or not the corresponding IP layer joins the all-router IP
   address.

If the fall-back is the broadcast-GID then the end-nodes do not have to do
anything special.

The current option allows the sender to choose between taking the simple
route of 'fall-back to broadcast GID' vs. 'must join all-router MGID'.

I am ok with either case and if the 'must join all-router MGID' case is
preferable then we can specify that instead.
<VK>
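
The two fall-back choices discussed above can be put side by side in a
small sketch. This is only an illustration under stated assumptions, not
text from the draft: the function name, the string-valued group
identifiers and the `fallback` parameter are all hypothetical, and group
existence is modelled as membership in a plain set rather than a query to
the subnet administrator.

```python
from typing import Optional, Set


def pick_destination(target_mgid: str,
                     existing_groups: Set[str],
                     all_router_mgid: str,
                     broadcast_gid: str,
                     fallback: str = "broadcast") -> Optional[str]:
    """Choose the IB group for an outgoing IP multicast packet.

    Hypothetical helper: returns the group to send to, or None when
    there is nobody to send to.
    """
    # Normal case: the MGID mapped from the IP group already exists.
    if target_mgid in existing_groups:
        return target_mgid

    if fallback == "all-router":
        # 'must join all-router MGID' option: if that group does not
        # exist, no multicast router is on the link, so nothing is sent.
        if all_router_mgid in existing_groups:
            return all_router_mgid
        return None

    # 'fall-back to broadcast GID' option: routers are assumed to be
    # listening on the broadcast group, so the sender does nothing special.
    return broadcast_gid
```

With the broadcast fall-back the sender needs no knowledge of the
all-router group at all; with the all-router fall-back it must track
whether that group exists, which is the extra requirement noted above.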

>
>> >  The
>> >  text suggests using all-router multicast group or the broadcast group.
>> >  If a router is required to create the all-router multicast group (see
>> >  the previous comment), then wouldn't it be sufficient to check for the
>> >  existence of the all-router multicast group and, if it exists, send to
>> >  that?  If it doesn't exist, then there can't be any multicast routers
>> >  on the link, and since there are no on-link listeners there isn't
>> >  any need to send the packet.
>>
>> Doesn't the code snippet following 9.B match pretty much what you
>> described above?
>
>Yes, it does, and if the text matched the code snippet then I wouldn't
>have an issue.  But the text which supports the code snippet doesn't.

I can remove the text if it causes any confusion. I thought it is very
obvious if no router exists then there is no need to forward (as
illustrated by the code snippet).

>The
>text basically says you could send to the all-router multicast group OR
>the broadcast group.  And either should work, as long as the receiving
>IPoIB node MUST accept the IP multicast packets regardless of the IB group
>over which they are received (I'm not sure I've seen this explicitly
>stated in the draft, but I may have missed it).

I thought this is required by IP already. But please tell me if I'm
wrong.

>But do we really need to
>have multiple mechanisms to accomplish the same result?  Wouldn't we be
>better off just choosing one and going with that?  And how are we going to
>find multiple interoperable implementations for each option when the draft
>puts no limits on the approaches which can be taken?

The multiple mechanisms were not there in the earlier draft, but were
requested later by my co-author. I personally don't see any harm with
it but I don't mind taking it out if Vivek agrees with you.

<VK>
I don't remember; I could have, on the reasoning described above.

I think we can take it out and replace it with the condition that the
packets are sent to the all-router MGID AND all IPoIB nodes (because any
can be a sender) need to request creation reports for the all-router MGID
and then IB_Join it.
<VK>
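
A rough sketch of the single-mechanism rule proposed above: every node
tracks creation and deletion reports, joins the all-router group when it
appears, and otherwise has nowhere to send. The class and method names are
hypothetical, and the `joined` set merely stands in for an IB_Join as
"SendOnlyNonMember"; no real InfiniBand interface is called.

```python
from typing import Optional, Set


class SenderGroupCache:
    """Hypothetical per-node cache driven by group creation/deletion reports."""

    def __init__(self, all_router_mgid: str) -> None:
        self.all_router_mgid = all_router_mgid
        self.active_targets: Set[str] = set()  # MGIDs this node sends to
        self.joined: Set[str] = set()          # stand-in for SendOnlyNonMember joins

    def on_creation_report(self, mgid: str) -> None:
        # Join the all-router group unconditionally, and any group we
        # are currently sending to, as soon as it is created.
        if mgid == self.all_router_mgid or mgid in self.active_targets:
            self.joined.add(mgid)

    def on_deletion_report(self, mgid: str) -> None:
        # The group is gone; later packets fall back to the routers.
        self.joined.discard(mgid)

    def destination(self, target_mgid: str) -> Optional[str]:
        self.active_targets.add(target_mgid)
        if target_mgid in self.joined:
            return target_mgid
        if self.all_router_mgid in self.joined:
            return self.all_router_mgid
        return None  # no listener and no router on the link
```

The sketch makes the cost of the rule visible: the sender needs the report
subscription even when its own IP layer never joins the all-router address.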

>> >
>> >  And shouldn't the IPoIB sender subscribe to the IB multicast group
>> >  creation events using a wildcard MGID so that it can
>> >  "SendOnlyNonMember" join the group if a listener appears on the local
>> >  IB subnet, or "SendOnlyNonMember" join the all-routers group if a
>> >  router appears on-link and creates the all-routers multicast group?
>> >  And doesn't it need to subscribe to the IB multicast group deletion
>> >  events so that it knows when to switch back to sending to the
>> >  all-router multicast group and/or stop sending?
>>
>> Aren't you saying the same thing (more or less) as what is described in
>> the section following the code snippet?
>
>You are right - I must have missed the text.  But you need to change the
>"should" to a "MUST" in that paragraph.

Since the caching part is a "should", the subscription part ought to be
a "should" too, right?

>
>> >
>> >  But I'm wondering why a node needs to do this at all.  Wouldn't it
>> >  make more sense for the IPoIB node to instead create the multicast
>> >  group and join the group so that IPoIB multicast routers can join the
>> >  group?  Yes, the IPoIB sender will need to create the group as a
>> >  "FullMember", but this seems to provide a simpler mapping of IP
>> >  multicast to IB multicast.
>>
>> While what you suggested may simplify the procedure, it essentially
>> removes the difference between a multicast sender and a multicast
>> receiver, which is wrong. In the IP multicast model a multicast sender
>> doesn't need to be a member of a multicast group to send. The draft has
>> tried to keep the same distinction.
>
>However, you are mapping the IP multicast model to the IB multicast model,
>and the two are not the same.  The reality is a sender must join an IB
>multicast group in order to send to it (albeit as a "SendOnlyNonMember"),

This is the big difference that was debated in IBTA.

>and this step does not exist in the IP model.  I imagine that the vast
>majority of the code/complexity of an IPoIB implementation is going to be
>dealing with IB multicasting.  I was hoping there was some way to reduce
>both at an IPoIB host.  Maybe it is just wishful thinking.

I actually have the same philosophy as well. I'm always against complexity
unless it's fully justified. But IB multicast just works differently than
IP multicast. We have gone through a number of alternatives and the
current one seemed to be a good balance between complexity and correctness.
I don't remember all the tradeoffs but if we simply require a sender
to be a full-member, how would one know when to leave a group, when to
delete a group..., etc. Also wouldn't it enable anyone to write a silly
sender program to try to send to a million different addresses and cause
all the IB multicast resources to be used up? (The join call by
IP_ADD_MEMBERSHIP at least places some resource limitation.)

<VK> Yes, a major consideration has been the limited multicast resources
available in IB.  There were other issues too, such as the case I described
a little while back: there is a very large chance of the router not
receiving the packet when the sender creates the group, due to the possible
delay in the report reaching the router, the router IB_joining the group, etc.
<VK>

>
>By the way, after thinking about this a little more, I'm not sure that
>creating a group as I suggested makes sense in all cases.  For instance,
>it is only useful for an IP multicast group which has a scope greater than
>link-local.  For link-local scope, only nodes directly attached to the
>IPoIB link are able to receive the packet.  Creating the group for the
>purpose of forwarding to an IPoIB multicast router is simply wasteful, as
>the IPoIB multicast router will not be able to forward the packet to
>another IP link.
>
>A similar optimization can also be taken with the current approach
>described in this draft - for an IP multicast group with link-local scope,
>if the IB MGID for that IP multicast group does not exist then an IPoIB
>node really doesn't need to send the multicast packet to the all-router
>MGID or broadcast GID.

Correct. This level of detail can arguably be left to implementation
because it doesn't affect interoperability. But I think this optimization
is simple and useful enough that I'll add a note to the draft.

<VK> Agreed. <VK>
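
The link-local optimization agreed on above amounts to a scope check before
the router fall-back. The sketch below is one way to express it, assuming
the usual address conventions (IPv4 224.0.0.0/24 is the link-local
multicast block; in IPv6 the low four bits of the second address byte carry
the scope, with 2 meaning link-local). The function names are hypothetical
and nothing here is taken from the draft's own code snippet.

```python
import ipaddress

# IPv4 local network control block: never forwarded off-link.
IPV4_LINK_LOCAL_MCAST = ipaddress.ip_network("224.0.0.0/24")


def is_link_local_multicast(addr: str) -> bool:
    """Return True if addr is an IP multicast address with link-local scope."""
    ip = ipaddress.ip_address(addr)
    if not ip.is_multicast:
        return False
    if ip.version == 4:
        return ip in IPV4_LINK_LOCAL_MCAST
    # IPv6: scope is the low nibble of the second byte (ff0X::/16).
    return (ip.packed[1] & 0x0F) == 0x02


def needs_router_fallback(addr: str, target_group_exists: bool) -> bool:
    """Decide whether a packet must be sent to the routers' fall-back group."""
    if target_group_exists:
        return False  # the packet goes to the group's own MGID
    # A router cannot forward link-local traffic, so skip the send.
    return not is_link_local_multicast(addr)
```

For example, a packet for 224.0.0.251 with no corresponding MGID is simply
dropped by the sender, while one for 239.1.1.1 still goes to the routers.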

>
>> >- Section 9.0 also indicates that "A multicast sender must join the
>> >  target multicast group as a "SendOnlyNonMember" before outgoing
>> >  messages from it can be successfully routed." This text should be
>> >  clarified to indicate this is only true if the node has not already
>> >  joined the group as either a "FullMember" or "SendOnlyNonMember". The
>> >  If-Then-Else logic below correctly shows this behavior, but this
>> >  paragraph does not reflect that behavior.
>>
>> Although your comment is correct, I'd argue this kind of detail can
>> be found in the IBA spec. It has nothing to do with IPoIB in particular,
>> so the question becomes how much IBA info we need to repeat in an IPoIB
>> RFC. Since the code snippet already covers this correctly, if you still
>> believe this needs to be clarified I can certainly add more text to it.
>
>My feeling is that it is enough to describe HOW an IPoIB node will USE the
>IBA services, and this is one of those cases.  I would prefer to see the text
>clarified if you do another revision of the draft, but given the code
>snippet describes the correct behavior I wouldn't hold this draft up just
>to clarify this paragraph.

Ok.

>
>> >- Section 9.0 indicates that "For the rest of attributes, it is
>> >  recommended the same values from the all-node multicast/broadcast
>> >  group be used." Should this instead say "For the rest of the
>> >  attributes, the same values from the all-node multicast/broadcast
>> >  group SHOULD be used." Regardless of the wording, this text seems to
>> >  contradict Section 7.0 which states a node must "use the rest of
>> >  the link attributes associated with the group for all future
>> >  communication on the link."  If Section 7.0 is correct, should this be
>> >  a MUST instead of a SHOULD?  And, if not, under what circumstances
>> >  would a node choose to not use the same values?
>>
>> 7.0 describes the attributes used by a node when conducting unicast
>> communication. 9.0 describes the attributes used to create a multicast
>> group, i.e., for multicast communication. So the question is:
>> Must path attributes used for unicast match with those for multicast?
>>
>> The flexibility allowed in the multicast case in the current draft is
>> deliberate. I don't know in reality if this is useful, or even good to
>> have. (Comments are welcome.) My design principle has been not to
>> make something a requirement unless it is needed for interoperability.
>
>Flexibility is often times good, but so is describing a minimal set of
>functions necessary for interoperability.  If a node selects link
>attributes when creating an MGID which do not match the broadcast GID's
>attributes, then it is possible that not all nodes which are members of
>the IPoIB link will be able to support those attributes.  If this happens,
>then the two nodes will not be able to communicate using the IP multicast
>group mapped to the MGID.  All of a sudden, an IPoIB node can't send an
>ARP request or Neighbor Solicitation to another node.

I think some of this "deliberate" flexibility (or ambiguity) came from
the ambiguity of IB itself in some of the multicast attributes such as
SL, TClass, etc. It is not clear to me if they must or even can
follow the ones for unicast (i.e. described by the broadcast group)
in all cases.

>
>Perhaps it would be better to state that a node which creates a multicast
>group MUST use the same values for the attributes as the all-node
>multicast/broadcast group, but a node joining a multicast group must not
>verify the values match.  If the node joining the group can't support the
>values for the multicast group, then it should log the error.  That way, a
>future RFC could describe why/when/how a node would choose to use
>different values for the link attributes but still be able to interoperate
>with IPoIB nodes.

I'll leave the above for IB folks to comment.
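
For reference, Roy's suggested rule (creator copies the broadcast group's
attributes; joiner does not compare them, only logs when it cannot support
them) could look roughly like the sketch below. Everything in it is an
assumption made for illustration: the `GroupAttrs` fields are a small
hypothetical subset of the group attributes, and "cannot support" is
modelled only as an MTU larger than the local port allows.

```python
import logging
from dataclasses import dataclass, replace

log = logging.getLogger(__name__)


@dataclass(frozen=True)
class GroupAttrs:
    """Hypothetical subset of the attributes carried by an IB multicast group."""
    mtu: int
    sl: int
    tclass: int


def attrs_for_new_group(broadcast: GroupAttrs) -> GroupAttrs:
    # Creator side: reuse the broadcast group's values unchanged.
    return replace(broadcast)


def can_use_group(group: GroupAttrs, local_max_mtu: int) -> bool:
    # Joiner side: no comparison against the broadcast group; only check
    # what this node can actually support, and log if it cannot.
    if group.mtu > local_max_mtu:
        log.error("cannot support group MTU %d (local max %d)",
                  group.mtu, local_max_mtu)
        return False
    return True
```

Keeping the joiner's check independent of the broadcast group's values is
what would leave room for a later document to allow differing attributes.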

Jerry

>
>Roy

_______________________________________________
IPoverIB mailing list
IPoverIB@ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib


