Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)

marcelo bagnulo braun <marcelo@it.uc3m.es> Wed, 08 March 2006 12:08 UTC

Envelope-to: shim6-data@psg.com
Delivery-date: Wed, 08 Mar 2006 12:09:36 +0000
Mime-Version: 1.0 (Apple Message framework v623)
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Message-Id: <dddee513ba0be7a86d6b9921ebf3a76b@it.uc3m.es>
Content-Transfer-Encoding: quoted-printable
Cc: shim6-wg <shim6@psg.com>
From: marcelo bagnulo braun <marcelo@it.uc3m.es>
Subject: Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)
Date: Wed, 08 Mar 2006 14:08:49 +0200
To: Igor Gashinsky <igor@gashinsky.net>

On 08/03/2006, at 10:28, Igor Gashinsky wrote:

> Hi Marcelo,
>
> 	My comments are in-line... sorry for the late reply, but I've been
> traveling too much lately...
>
> :: On 01/03/2006, at 10:10, Igor Gashinsky wrote:
> :: So the effort for this case, IMHO, is put into enabling the
> :: capability of establishing new sessions after an outage rather than
> :: into preserving established connections. Does this make sense to you?
>
> This makes a lot of sense, provided this happens under the hood of the
> application (i.e. the web browser in this case). So, right now, for
> example, if a client is pulling down a web page, gets the HTML, and in
> the middle of downloading the .gif/.jpg his session dies (i.e. TCP
> RST), the .jpg that the client was in the middle of transferring will
> get that ugly red "X". (Most browsers, right now, will not retry to
> get the object again, and will just show it as unavailable.) This
> issue is deemed important enough that most large content providers are
> spending an inordinate amount of money on load balancers with active
> session sync to try to prevent that from happening in the event of a
> load balancer fail-over. So, if application behavior could be changed
> to say "if shim6 fail-over is possible, and the connection just died
> (for any definition of died), then attempt to re-establish the
> connection through the shim, and then re-get the failed object", that
> would go a long way toward making this kind of fail-over better.
>
>

This is possible with the shim6 protocol, since it supports unreachable
ULIDs when establishing the shim context, so I guess this would be OK.
A couple of additional elements are probably needed: an extended API to
allow the apps to tell this to the shim (you probably also want to
inform the shim which locator is not working), and the shim needs to
remember the alternative locators obtained from the DNS even if there
is no shim context yet, in order to have a clue about which alternative
address to use (the other option is to perform a reverse lookup to
retrieve those -- see the thread with Erik for more about this point).
But in any case, I think all these issues are easily solvable.

But I have an additional question about this point: if the application
is the one that will determine that there is a problem and will ask the
shim to establish a context (which is OK, no problem here), wouldn't
the application be better off simply retrying with alternative locators
by itself, rather than asking the shim to do it?
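To make the comparison concrete, here is a minimal sketch of that application-level alternative: a client that walks the address list returned by getaddrinfo() and retries on the next locator when one fails. This is a hypothetical illustration of the app doing the retry itself, not a shim6 API.

```python
import socket

def fetch_with_retry(host, port, request, timeout=2.0):
    """Try each address the DNS returns in turn; on failure, move on
    to the next locator instead of giving up. (Hypothetical sketch of
    the application retrying by itself, without shim involvement.)"""
    last_err = None
    for family, type_, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            with socket.socket(family, type_, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)          # may fail on a dead locator
                s.sendall(request)       # re-sent from scratch on each retry
                chunks = []
                while data := s.recv(4096):
                    chunks.append(data)
                return b"".join(chunks)  # transfer completed on this locator
        except OSError as e:
            last_err = e                 # remember the error, try next address
    raise last_err or OSError("no addresses for %s" % host)
```

Note that this restarts the whole transfer on the new locator, whereas a shim context could in principle preserve it; that is exactly the trade-off in question.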

> The difference with shim6, instead of v4, is that in the v4 world the
> connection wouldn't die, it would just hang for the duration of
> convergence (provided convergence is fast enough, which it normally
> is), and then continue on its merry way with new TCP windows. In
> shim6, if the client[ip1]-server connection goes down, re-establishing
> to client[ip2]-server would not be "hitless" (i.e. the session would
> die), and to solve that problem we are back at either keeping an
> inordinate amount of state on the webservers (which is not very
> realistic), a shift in the way people write applications (which, in my
> opinion, is preferred, but a *very* hard problem to solve), or somehow
> figuring out how to hide this in the stack with a minimal performance
> hit (let's say a sub-1% memory hit) when you have 30k+ simultaneous
> connections per server...

Well, if you use the shim approach that you suggest above, the server
does not have to store any shim state while things are going fine, and
if a client detects a problem it can trigger the creation of the shim
context from the client to the server. At this point, the server will
need some shim state, but only for those connections that have failed
(of course, if one of the links to the server went down, then all the
clients connecting through that link will attempt to create shim
state).

I guess this could be a reasonable trade-off between state on the
server and response time when outages occur.
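The deferred-state idea can be sketched roughly as follows. All names here are invented for illustration, and real shim6 state would also carry context tags, locator sets, and security material; this only shows that healthy connections cost the server nothing.

```python
class ShimStateTable:
    """Toy model of deferred shim state on a server: nothing is stored
    for a connection until the peer actually requests a context after a
    failure. (Illustrative only, not the shim6 wire protocol.)"""

    def __init__(self):
        # peer ULID -> list of alternate locators; empty in the common case
        self._contexts = {}

    def on_context_request(self, peer_ulid, peer_locators):
        # State is created lazily, only for peers that reported a failure.
        self._contexts[peer_ulid] = list(peer_locators)
        return True

    def locators_for(self, peer_ulid):
        # Healthy connections never appear here.
        return self._contexts.get(peer_ulid, [])

    def state_count(self):
        # Proportional to failed connections, not to total connections.
        return len(self._contexts)
```

The point of the sketch is the scaling argument above: with 30k+ simultaneous connections per server, the table stays empty until an outage, and even then it only grows by the number of affected clients.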

>
> :: > 3) While TE has been discussed at length already, it is something
> :: > which is absolutely required for a content provider to deploy
> :: > shim6. There has been quite a bit of talk about what TE is used
> :: > for, but it seems that few people recognize it as a way of
> :: > expressing "business/financial policies". For example, in the v4
> :: > world, the (multi-homed) end-user may be visible via both a
> :: > *paid* transit path (say UUNET) and a *free* peering link (say
> :: > Cogent), and I would wager that most content providers would
> :: > choose the free link (even if performance on that link is (not
> :: > hugely) worse). That capability all but disappears in the v6
> :: > world if the client ID was sourced from their UUNET IP address
> :: > (since that's who they chose to use for outbound traffic), and
> :: > the (web) server does not know that that locator also corresponds
> :: > to a Cogent IP (which they can reach for free).
> ::
> :: I fail to understand the example that you are presenting here...
> ::
> :: Are you considering the case where both the client and the server
> :: are multihomed to Cogent and UUNET?
> :: Something like
> ::
> ::   UUNET
> ::  /     \
> :: C       S
> ::  \     /
> ::   Cogent
>
> Yes, but now imagine that the "C" in this case is a client using
> shim6 with multiple IPs, and the server is in IPv6 PI space. Also, if
> it weren't in PI space, the connection to the server *can* be
> influenced via SRV (although that's trying to shoehorn DNS into where
> perhaps it shouldn't go -- since now the DNS server needs to be aware
> of link-state in the network to determine if the UUNET/Cogent
> connections are even up, and for a sufficiently large "S", that could
> be tens, or even hundreds, of links, which presents a very
> interesting scaling problem for DNS (even more interesting is that
> most large content providers are actually in the thousands, and
> that's why they can get PI space -- they are effectively (at least) a
> tier-2 ISP)). But, back to the example at hand: for the sake of this
> example, let's say that the UUNET port is $20/Mbps, and the Cogent
> port is an SFI (free) peer. So, the client (with IPs of IP-uunet and
> IP-cogent) picks IP-uunet (because they want to use their UUNET
> connection outbound) to initiate a connection to the server. The
> problem now comes from the fact that the server, when replying to the
> client, is unaware that the IP-cogent IP is associated with the
> client (since the shim layer has not kicked in on initial connect)
> and will have to send traffic through the very expensive UUNET port.

That I don't follow.

Suppose that the server has v6 PI addresses, which for very big sites
makes sense IMHO.

The server can send traffic with a destination address belonging to
UUNET through Cogent, right? I mean, I am assuming that UUNET and
Cogent have connectivity that is not through S.

I mean, the client can choose to use the IP from UUNET (that is his
choice and he has the right to do so, because he is paying for it).
This choice affects the ISP used to get _to_ the client, and it
shouldn't determine the ISP used to get to the server.

So in this case the traffic would flow:
- From the client to the Internet through UUNET
- From the Internet to the server through Cogent

Agree?

Now the problem is when the server also has PA blocks.

In this case, the destination address selected by the client will
determine the ISP of the server.

Without the shim, the server doesn't have many options; basically, what
he could do is use the DNS to prioritize the Cogent addresses.
With the shim, the server can rehome any communication that is using
UUNET addresses to Cogent and start using Cogent locators. This of
course does not prevent the client from continuing to use the UUNET
destination addresses. In this case, the server can inform the client
about his preferences using a shim protocol option, but even then the
client can prefer something other than what is expressed by S in the
preferences. In any case, in this model, each end can always choose the
path used to send packets. I guess that in IPv4 it is somewhat
different, because the decision belongs to the intermediate ASes, which
are the ones that can select which path to use (note that in this case,
it is not S who is in charge of selecting the incoming path either).

> With v4, on the other hand, the router was aware that the client is
> reachable via both Cogent and UUNET, and could have had a localpref
> configured that would just say "anything reachable over Cogent, use
> Cogent". One way to fix that would be to do a shim6 init in the 3-way
> handshake, but the problem then becomes that *every* "S" would have
> to have a complete routing table, and basically perform the logic
> that is done in today's routers.

Why is that?

I mean, if S prefers Cogent, all he has to do is:
- In the PI case, route its outgoing packets through Cogent and do the
same v4 BGP magic to direct incoming packets through Cogent.
- In the PA case, always use Cogent addresses and try to convince the
clients to use the server's IP address from the Cogent prefix (through
SRV and/or shim preferences).
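For the PA case, the SRV part could look something like the following hypothetical zone fragment (the `www-cogent`/`www-uunet` names and the 2001:db8 documentation-prefix addresses are invented for illustration). Per RFC 2782, clients try the lower-priority-value target first, so the Cogent-prefix address is preferred and the UUNET one is the fallback:

```
; Hypothetical zone fragment: lower SRV priority value = preferred target.
_http._tcp.example.com.  IN SRV 10 0 80 www-cogent.example.com.  ; preferred
_http._tcp.example.com.  IN SRV 20 0 80 www-uunet.example.com.   ; fallback
www-cogent.example.com.  IN AAAA 2001:db8:1::1   ; address from the Cogent PA prefix
www-uunet.example.com.   IN AAAA 2001:db8:2::1   ; address from the UUNET PA prefix
```

This only expresses a static preference, of course; it says nothing about whether the Cogent link is currently up, which is the DNS scaling concern raised above.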

> Obviously, running Zebra with full routes on a server is a
> non-trivial performance hit, and multiply that out by the number of
> servers, and it gets very expensive, very fast. All to regain
> capabilities we have right now in IPv4 for free...
>
> Now, of course, the "so-called easy" answer would be "let's introduce
> a routing-policy middleware box that would handle that part". That
> box would have the full routing tables and the site policies, and
> when queried with "I'm server X, and this is the client and all his
> locators, which one do I use?" it would spit back an answer to that
> server that would be a fully informed decision, and the TE problem
> becomes mostly solved. I say
>

But there seem to be (at least :-) two different problems here:
- One: what are the TE capabilities available with the PA addressing
model + the shim tool? This is about what can be done in this case.
- Two: who is in control of these capabilities and how are they
managed, i.e. who controls the policy and who manages the devices that
are in control of the policy? Is it possible to have centralized
policy management? Is it possible to enforce the usage of the policy
(at least within the multihomed site)?

I guess that before we were considering problem one, and now the
second one...

This server idea that you are considering was presented by Cedric de
Launois a while ago, in a work called NAROS.

Another option is what we are discussing below: using a DHCP/RA option
to distribute the policy information among the hosts.

Yet another option is to move to a scheme based on rewriting source
prefixes.

Or a combination of those.

> "mostly", because now there are these pesky issues of: a) do I trust
> that the server is going to abide by this decision (either hacked, or
> a box outside of my administrative control, yet within the scope of
> my network control); b) how do transit ISPs "influence" that decision
> (at some point I cross their network, and they should be able to
> control how the packets are flowing through their network); c) how do
> I verify that their "influencing" doesn't negate mine, and is
> legitimate; d) how much "lag" does it introduce into every session
> establishment, and is it acceptable; e) can this proxy scale to the
> number of queries fired at it, and the real-time computations that
> would have to happen on each one (since we can't precompute the
> answers); and finally, is it *really* more cost-effective than doing
> all this in routers?
>
> So far, I'd rather pay for bigger routers...
>
> :: I mean, in this case, the selection of the server's provider is
> :: determined by the server's address, not by the client's address,
> :: right? The server can influence such a decision using SRV records
> :: in the DNS, but I'm not sure yet if this is the case you are
> :: considering.
>
> See above about the difficulties of scaling DNS to meet this goal...
>

But the problem with the DNS that you considered above is about making
the DNS publish information that reflects the state of the links.
This indeed seems very difficult, especially because of cached
information and so on. But as far as I know, no one is proposing this.
The idea is to use SRV records to express policy, and a not very
dynamic one at that. I mean, you can express that roughly 30% of the
communications need to use a given address and the others the other
address, and so on, but the idea is not to have the DNS reflect the
state of the network.
Actually, it may happen that some of the addresses in the DNS are down.
In this case, the idea is to let the hosts detect this and retry using
alternative addresses. Whether this retry is visible to the apps or not
is still an open issue.
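The "30% of communications" style of policy maps onto the SRV weight field: among records of equal priority, RFC 2782 has clients pick a target at random, proportionally to its weight. A minimal sketch of that selection, with made-up record data:

```python
import random

def pick_srv_target(records):
    """records: list of (priority, weight, target) tuples, as in an
    SRV RRset. Lowest priority value wins; within that priority class,
    choose a target at random, proportionally to the weights
    (RFC 2782 style selection, simplified)."""
    best = min(priority for priority, _, _ in records)
    candidates = [(w, t) for p, w, t in records if p == best]
    total = sum(w for w, _ in candidates)
    point = random.uniform(0, total)
    for weight, target in candidates:
        point -= weight
        if point <= 0:
            return target
    return candidates[-1][1]  # guard against floating-point edge cases

# Example data (hypothetical names): ~70% of selections go to the
# Cogent-side name, ~30% to the UUNET-side one.
records = [(10, 70, "www-cogent.example.com"),
           (10, 30, "www-uunet.example.com")]
```

Since resolvers cache the RRset, this split is enforced statistically across many clients rather than per-link-state, which is exactly the "not very dynamic" policy described above.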

> :: > This change alone would add millions to the bandwidth bills of
> :: > said content providers, and, well, reduce the likelihood of
> :: > adoption of the protocol by them. Now, if the shim6 init takes
> :: > place in the 3-way handshake process, then the servers
> :: > "somewhat" know what all the possible paths to reach that
> :: > locator are, but they would then need some sort of policy server
> :: > telling them whom to talk to on what IP, and that's something
> :: > which will simply not scale for 100K+ machines.
> :: >
> ::
> :: I am not sure I understand the scaling problem here.
> :: Suppose that you are using a DHCP option for distributing the
> :: shim6 preferences of the RFC 3484 policy table; are you saying
> :: that DHCP does not scale for 100K+ machines? Or is there something
> :: else other than DHCP that
>
> Well, first, show me a content provider who thinks that DHCP scales
> for a datacenter (other than initial pxeboot/kickstart/jumpstart,
> whatever), but that aside, running zebra/quagga + synchronizing
> policy updates among 100K+ machines simply does not scale
> (operationally).
>

So, you are considering here the case where the policy is changed
according to the state of the network, right?
So BGP information is used as feedback to the TE decision, is that
correct?
Is this possible today? How is it done? Could you provide an example of
how you use this dynamic TE setting?


> :: > 4) As has also been discussed before, the initial connect time
> :: > has to be *very* low. If anything takes longer than 4-5 seconds,
> :: > end-users have a funny way of clicking "stop" in their browser,
> :: > deeming that "X is down, let me try Y", which is usually not a
> :: > very acceptable scenario :-) So, whatever methodology we use to
> :: > do the initial set-up has to account for that, and be able to
> :: > get a connection that is actually starting to do something in
> :: > under 2 seconds, along with figuring out which source-IP and
> :: > dest-IP pairs can actually talk to each other.
> ::
> :: As I mentioned above, we are working on mechanisms other than the
> :: shim6 protocol itself that can be used for establishing new
> :: communications through outages.
> ::
> :: You can find some work in this area in
> ::
> :: ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-bagnulo-ipv6-rfc3484-update-00.txt
>
> It's a fairly good idea for negotiating which SRC and DEST IPs to
> pick, but it has to happen *fast* (i.e. sub-2 seconds), or the
> end-users will lose patience and declare the site dead. Perhaps
> racing SYNs?

Yes, this is an option, and it is nice because you not only get to
detect which ones are actually working but also get to pick the
fastest one. But clearly there is the cost of the additional SYNs you
send, which is basically overhead... would you be willing to pay for
these multiple SYNs?
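A rough sketch of what racing SYNs looks like from the host's side, at the connection level rather than in the shim (the function name and the thread-based approach are just one way to illustrate it):

```python
import socket
import concurrent.futures

def race_syns(candidates, timeout=2.0):
    """Connect to every candidate (host, port) pair in parallel and
    keep whichever handshake completes first, closing the losers. The
    extra SYNs are pure overhead, but you learn which locator pairs
    work and get the fastest one. (Illustrative sketch only, not the
    shim6 mechanism itself.)"""
    def try_connect(hp):
        return hp, socket.create_connection(hp, timeout=timeout)

    winner = None
    with concurrent.futures.ThreadPoolExecutor(len(candidates)) as pool:
        futures = [pool.submit(try_connect, hp) for hp in candidates]
        for fut in concurrent.futures.as_completed(futures):
            try:
                hp, sock = fut.result()
            except OSError:
                continue              # this locator pair is unreachable
            if winner is None:
                winner = (hp, sock)   # first handshake to finish wins
            else:
                sock.close()          # close the losing connections
    if winner is None:
        raise OSError("no candidate address was reachable")
    return winner
```

A similar race, with staggered starts to reduce the wasted SYNs, was later standardized for dual-stack hosts as Happy Eyeballs.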

>
> Now, I'm not saying that all these problems can't be solved for
> people to consider shim6 a viable solution, but so far they aren't
> solved, and until they are, I just don't see recommending to my
> employer to take shim6 seriously,

I may well agree with you here, but remember that we are still defining
the protocol :-)


I guess the point here is how we can manage to provide a solution that
fits the sites' requirements, hence your feedback is very valuable.

Regards, marcelo


> since it seems like all it's going to do is move the costs
> elsewhere, and quite possibly increase them quite a bit in the
> process...
>
> -igor