[Stackevo] Thoughts on draft-thaler-appsawg-multi-transport-uris-02

Martin Thomson <martin.thomson@gmail.com> Mon, 13 August 2018 07:16 UTC

MIME-Version: 1.0
From: Martin Thomson <martin.thomson@gmail.com>
Date: Mon, 13 Aug 2018 17:16:42 +1000
Message-ID: <CABkgnnWNh2VmrA6SwhLVonV=z0Pt86AZ7QFBC7hCqPuG+Ur_Tg@mail.gmail.com>
To: Stackevo <stackevo@iab.org>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/stackevo/LZRlTjFBOWtzUSB4KMrChaRdxxQ>
Subject: [Stackevo] Thoughts on draft-thaler-appsawg-multi-transport-uris-02
Precedence: list

This is a really hard problem to talk about.  I think that this draft takes a
few important steps, but I found that it struggled a little with focus and
clarity at a few points.  I also struggled to produce this review, so I
apologize for how rough it is.

The primary concern I have is that it doesn't really recognize the full extent
of the role that URLs can take in the processes they support.  Concentrating
more on that aspect might help us more clearly distinguish between use cases for
different URI schemes and provide more concrete advice about how to design URLs
that work for those use cases.

Right now, URL design seems to be completely ad hoc and arbitrary.  I think that
the CoAP schemes are a great example of a case where a scheme is minted without
understanding the usage context properly.  Using the schemes we know as a model
is easy, but I think that this pattern of cargo-culting has made the problem
worse and not better.  For instance, HTTP is not a great role-model, though it
has probably reached something of a comfortable equilibrium.

At some point, this work probably needs to feed into a URI-bis that better
serves non-HTTP uses.  In writing this up, I started to think that RFC 3986
isn't really a great baseline.  (In case it's not obvious, this is not me
volunteering to lead that effort, though I would happily contribute some
effort.)


# Some Easy Points

I wonder if this should talk about URLs rather than URIs.  That is, the use of
identifiers that are not also locators isn't really in scope.  The use of URIs
as a general purpose identifier is probably too broad for a context where you
are taking a string as input to a protocol.

The use of "transports" here is easy to grasp, but it risks missing some of the
critical usage models.  This really needs to concern itself with the entirety of
the resolution mechanism, whatever that is.  I would use the word "protocol"
rather than "transport", because these usages refer to an entire stack, which
might - as I point out below - encompass discovery mechanisms in addition to
simply switching at one layer of the stack.

For instance, it is possible to use HTTP to access a CoAP URL, even if that
might not work in all the ways that you might expect.  In fact, some still hold
hopes for HTTP being usable for URIs of all schemes (just try sending a mailto:
URI to a web server and see how far that gets you...).

((At some point we should probably start to discuss the notion of a limited
vocabulary with respect to what you can *do* with a URL.  That is, a lot of
protocols assume that an undecorated reference means "GET" when no other context
is provided.  Is there some cross-scheme interpretation of URL "resolution", or
is that best left to the context in which a URL appears? Of course, that doesn't
belong here because it would explode the scope.  I note here that SIP (RFC 3261)
defines a "method" parameter, which suggests that sometimes this intent is not
inherited from context.))


# The Meat

I wonder if the focus here shouldn't be on the URL as a protocol element.  What
information can be bound into context (i.e., the scheme), what needs to be
explicit, and what can be recovered using other mechanisms once the protocol
interactions are bootstrapped.  After all, a URL serves primarily as a
bootstrapping mechanism.

Structurally that means thinking of a URL as a container of protocol
information.  The scheme establishes a context, which then defines the
structure.  Most URLs also include an authority using a common syntax, though
that part is convention as much as it is standardized.  The remainder of the URL
is information that is interpreted in some scheme-specific fashion.  Some of
that other information is designed to be passed to the authority, other
information is designed for consumption by the user, and that doesn't have to be
exclusive.
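
As a rough illustration of the "container" view, the sketch below splits a few
URLs into scheme, authority, and a scheme-specific remainder using Python's
standard urllib.parse.  The function name `describe` and the example URLs are
assumptions made up for this note, not anything from the draft.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the scheme, the authority, and everything else."""
    parts = urlsplit(url)
    remainder = parts.path
    if parts.query:
        remainder += "?" + parts.query
    if parts.fragment:
        remainder += "#" + parts.fragment
    return {
        "scheme": parts.scheme,      # establishes the context
        "authority": parts.netloc,   # empty when the scheme has no "//" part
        "remainder": remainder,      # interpreted in a scheme-specific way
    }


# Hypothetical examples: only the first has a generic-syntax authority.
for example in ("https://example.com:8443/a/b?x=1",
                "mailto:user@example.com",
                "sip:alice@example.com;transport=tcp"):
    print(describe(example))
```

Only the first URL has an authority in the generic syntax.  For the other two,
whatever names the responsible party is buried in the scheme-specific remainder.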

The realization we had with HTTP was that the combination of scheme and
authority was the most important part to preserve (that is, the notion of an
origin as scheme+host+port as defined in RFC 6454).  You can use any number of
protocols to talk to a server, but the scheme identifies a specific set of
semantics and the authority establishes the identity of the entity that can
answer to those semantics.
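
A minimal sketch of the origin idea, assuming only the http and https default
ports and ignoring the finer points of RFC 6454 (IDNA, opaque origins).  The
helper name `origin` is hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: only these two schemes matter for the illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return a (scheme, host, port) tuple in the spirit of RFC 6454."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


# Same origin: the path differs and the default port is made explicit.
print(origin("https://Example.com/a") == origin("https://example.com:443/b"))
# Different origin: the scheme alone is enough to separate them.
print(origin("http://example.com/a") == origin("https://example.com/a"))
```

Everything after the authority is ignored, which is the point: the tuple names
the semantics and who answers for them, not the protocol used to get there.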

The discussion about the 's' in 'https://' and other such things is buried in
Section 3.3, but I think that it's foundational.  The idea of baking
expectations regarding authentication into identifiers was an ugly necessity in
HTTPS, but it had the regrettable effect of forking the identifier space at the
root.  Most in the community simultaneously hate this split and recognize how it
was necessary.  We're nudging ever closer to repairing this by gradually turning
'http://' into a ghetto, but it has been extraordinarily difficult.

As an aside, it seems like this is one reason the choices in CoAP generate
negative reactions.  The level of similarity makes it seem like a giant mistake
to mint coap and coaps (I still think that the former is an error). To then
double down with so many others, when it took so much hard effort to get HTTP as
far as it has, seems crazy coming from that frame of reference.  However, a
coap:// URL serves a very different purpose to an HTTP URL, so while maybe this
is unaesthetic, it's not an inherently bad choice.

HTTP is moving toward a more abstract conceptualization of the role of the URL.
A single URL is a common identifier, but it does not directly identify
protocols, except to the extent that discovery of alternative protocols is
necessarily opportunistic.  (More on this later.)  The URL remains a critical
protocol element, but its binding to a particular protocol stack is far more
loose.  And while that might be suitable for HTTP, some of that is because of
its position as a mature and well-established protocol.  Other protocols might
have different constraints.

For instance, the question about how to signal choice of transport in CoAP
should have suggested a bigger question as to whether a URL format in the style
of HTTP was entirely appropriate for use with CoAP.  If we accept the abstract
notion of scheme+authority as being a goal, then it makes it seem like the CoAP
choices were poor.  However, if you instead admit the possibility that a URL
might be used for different purposes, then there are different questions to ask.

For CoAP, the notion of scheme+authority is one that might suggest a similar
design to HTTP, but that cargo-culting ignored the true needs of the protocol,
where avoiding uncertainty about protocol is important.  Servers in CoAP can't
really afford to support multiple protocols.  The uses that rely on URLs really
benefit from a discovery process, which implies additional complexity and
latency for common operations.  Even an opportunistic discovery like with HTTP
might not be feasible, because that design implies support for multiple
protocols.

That suggests a different taxonomy to the one you describe:

1. Implicit discovery - in this, the URL has a default protocol, but alternative
   protocols might be discovered.  This allows a URI scheme that defines just
   one option to develop new means of reaching the same resource, implying that
   fixed identification isn't necessarily a terminal condition.  Implicit
   discovery admits the possibility of inline upgrades (HTTP Upgrade) and
   opportunistic upgrade techniques like Alternative Services (RFC 7838).  The
   worst drawback of this scheme is that it implies indefinite support of the
   default protocol in order to both support this discovery process and
   provide continuity in cases where discovery fails.  You also can't change
   expectations about semantics (the http/https problem).

2. Explicit discovery - a URL is a more abstract concept that requires the use
   of a discovery step.  Of course, this isn't fundamentally any different to
   having a single protocol if you consider the discovery process to be part of
   that protocol.  The drawback of having this sort of indirection is that it
   naturally delays the resolution process.

3. Definitive protocol identification - the URL contains an explicit protocol
   identifier.  This is likely better suited to uses where the role of a URL is
   entirely functional and the lack of ambiguity is a feature: for instance, a
   URL that is used to configure a database server (postgresql://) or contact a
   CoAP endpoint (coaps://).

   This leaves the question of how to identify the protocol in the URL.  We
   currently have a real mess of options across different URI schemes and
   protocols.  You could even imagine providing a menu of options in the same
   URI with the user of that URI being able to choose.  That is more flexible,
   but more user-hostile.  (Maybe this is what is implied by Section 3.2 ... ?)
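
One way to see the difference between the three options is to write down what a
client does first in each case.  The sketch below is purely illustrative: the
`Strategy` enum, the `resolution_steps` function, and the step labels are all
made up here, and a real client would of course do far more.

```python
from enum import Enum


class Strategy(Enum):
    IMPLICIT_DISCOVERY = 1   # default protocol, alternatives may turn up
    EXPLICIT_DISCOVERY = 2   # discovery is a mandatory first step
    DEFINITIVE = 3           # the URL names the protocol outright


def resolution_steps(strategy, url_protocol, alternatives=()):
    """List, in order, what a client does to start talking to the resource."""
    if strategy is Strategy.DEFINITIVE:
        # No ambiguity: whatever the URL names is what gets used.
        return ["connect:" + url_protocol]
    if strategy is Strategy.EXPLICIT_DISCOVERY:
        # Discovery sits on the critical path and adds its latency.
        if not alternatives:
            raise LookupError("discovery returned no usable protocol")
        return ["discover", "connect:" + alternatives[0]]
    # Implicit discovery: the default protocol must keep working forever,
    # and a better option is only used if one happens to be advertised.
    steps = ["connect:" + url_protocol]
    if alternatives:
        steps.append("switch:" + alternatives[0])
    return steps


print(resolution_steps(Strategy.IMPLICIT_DISCOVERY, "h1", ["h2"]))
print(resolution_steps(Strategy.EXPLICIT_DISCOVERY, "abstract", ["proto-x"]))
print(resolution_steps(Strategy.DEFINITIVE, "coaps"))
```

The implicit case always begins with the default protocol, which is why that
protocol can never be retired; the explicit case always begins with "discover",
which is where the extra latency comes from.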

Not sure where happy eyeballs fits in this taxonomy.  It's sort of implicit, but
then you don't need to support an IPv4 endpoint indefinitely.  In that way, it
is also a retrofitting of an explicit discovery step.

If you step back, this list really reduces to a question of how the respective
protocols start.  Obviously, there is a bunch of information in a URL that is
essential to a protocol interaction.  A URL that you decide to actuate
represents the start of a protocol interaction and the question is how much
information needs to be present in the URL to successfully bootstrap that
process.

Or, differently stated:

   How much information about the resolution mechanism is appropriate to include
   in a URL?  If information is included, what is the best means for providing
   that information?

I also think that it would be useful to ask another question, the answer to
which is partly in the above taxonomy:

   How do we manage when the information present is limited, either because the
   URL already exists, or because we have constraints that encourage the
   limitation of information that we express in a URI?

Or:

   Is the idea of identifying only in the abstract (i.e., by scheme and
   authority) and assuming a single protocol, a good choice for every protocol?


Protocol Design

The question of URL design as protocol design is one that cannot be emphasized
firmly enough.  I suspect that most of what we have today we arrived at by a
series of accidents.  I'm not sure that I believe that we will do better in
future, but that's why I think that documents like this can do a lot to help.

The next question is then the performance and usability characteristics that
result from these choices.

For a protocol that decides to embed additional information, that could
externalize some of its costs in terms of usability.  A sip: URI with transport
parameters is fairly user-hostile, but that form is OK when used internally
within the protocol (less so when it leaks, and configuring a proxy on a SIP UA is
awful).  A postgresql: URL can happily include that sort of information either
because it is copied and pasted more often than typed, or because it is intended
for use by experts.
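
To show what "embedding additional information" looks like in practice, here is
a simplified sketch that pulls the URI parameters out of a SIP URI.  It assumes
the common `sip:user@host;name=value` shape and does not handle everything RFC
3261 allows (for example, parameters inside the user part or escaped
characters).  The function name and example address are made up.

```python
def sip_uri_params(uri):
    """Split a SIP URI into (scheme, address, parameters).  Simplified."""
    scheme, _, rest = uri.partition(":")
    rest, _, _headers = rest.partition("?")   # drop any ?headers part
    address, *raw_params = rest.split(";")
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        params[name.lower()] = value
    return scheme.lower(), address, params


# Hypothetical URI carrying both a transport and a method parameter.
print(sip_uri_params("sip:alice@example.com;transport=tcp;method=REGISTER"))
```

Everything after the first semicolon is instruction to the protocol rather than
identification, which is why this form reads poorly to a person but works well
enough between SIP elements.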

A protocol that involves a mandatory discovery step (as opposed to one that is
performed opportunistically, in parallel to some sort of default mode), adds the
latency of that discovery to resolution.  This can complicate the protocol, but
it provides opportunities for defining new means of accessing resources without
changing the scheme.


Retrofitting Into Existing Schemes

Happy Eyeballs retrofits a discovery step into a lot of protocols.  Running
parallel name resolution and connection establishment processes means that you
don't have to support both stacks on the server; you shift that responsibility
to the client.  Racing means that you don't have to pay a latency tax for the
new feature.  It's a neat hack, but rather expensive.
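
The racing idea can be sketched without touching the network.  The code below
simulates connection attempts with asyncio sleeps; the names `attempt` and
`race`, the latencies, and the fixed stagger are all assumptions for
illustration.  It is much cruder than what RFC 8305 describes (for instance, it
does not start the next attempt early when one fails).

```python
import asyncio


async def attempt(name, latency, ok=True):
    """Simulated connection attempt: wait, then succeed or fail."""
    await asyncio.sleep(latency)
    if not ok:
        raise ConnectionError(name)
    return name


async def race(candidates, stagger=0.05):
    """Start one attempt per candidate, `stagger` seconds apart, and return
    the name of the first one to succeed.  Candidates are (name, latency, ok)
    tuples in order of preference."""

    async def delayed(index, candidate):
        await asyncio.sleep(index * stagger)
        return await attempt(*candidate)

    tasks = [asyncio.ensure_future(delayed(i, c))
             for i, c in enumerate(candidates)]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except ConnectionError:
                continue   # that attempt failed, wait for the next to finish
        raise ConnectionError("all attempts failed")
    finally:
        for task in tasks:
            task.cancel()  # abandon the losers; this is the expensive part


# The preferred (first) candidate is slow, so the second one wins.
print(asyncio.run(race([("ipv6", 0.2, True), ("ipv4", 0.01, True)])))
```

The cancelled tasks in the `finally` block are the cost of the hack: every
losing attempt is work that was started and then thrown away.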

.onion and friends retrofit a parallel name resolution and authority system into
the authority component by stealing a label.  There's ongoing discussion about
this with the IAB and IESG (the last state I have is that there is renewed
enthusiasm for reserving a label at the top level for alternative resolution
contexts: .arc).  The parallel here is in stealing multiple scheme names for
different protocol stacks.

.local is another such land-grab that has problematic implications, the primary
one being that it explicitly renders the notion of authority unusable for names
in that tree.
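
"Stealing a label" amounts to switching resolution mechanism on the last label
of the name.  A toy sketch, assuming just the two labels mentioned here and
made-up context names ("tor", "mdns", "dns"):

```python
# Assumption: a hand-picked table, not a registry of special-use names.
SPECIAL_LABELS = {"onion": "tor", "local": "mdns"}


def resolution_context(hostname):
    """Pick a resolution mechanism from the rightmost label of a host name."""
    last_label = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return SPECIAL_LABELS.get(last_label, "dns")


for name in ("example.onion", "printer.local", "www.example.com"):
    print(name, "->", resolution_context(name))
```

Nothing in the URL's scheme changes, yet the name is resolved (and its
authority established, or not) by an entirely different system.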


Some more nit-picky things:

Section 1.

I think that the point on comparison is only really valuable as a footnote.
That https://authority.example/a and https://authority.example/b/../a are
equivalent isn't really that useful an observation.  It might also be the case
that /c is equivalent to those two, but that's something that is only really
apparent to the authority.  And that doesn't cover changes in authority or
protocol that might still result in a resource that is considered equivalent by
its authority.

Factual error: a URI with the https:// scheme is never equivalent to one with an
http:// scheme.  Though the same content might eventually manifest, the notion
of the authority is not the same.  This is also an error with the sip/sips
comparison, and it is at the root of the problems in reconciling sip and sips.
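
Both points in this section can be shown with a small comparison function.  This
is a sketch only: `remove_dot_segments` is a simplified take on the RFC 3986
algorithm, `equivalent` is a made-up helper, and neither handles percent-encoding
or default ports.

```python
from urllib.parse import urlsplit


def remove_dot_segments(path):
    """Resolve "." and ".." in an absolute path.  Simplified from RFC 3986."""
    segments = path.split("/")
    out = []
    for segment in segments:
        if segment == "..":
            if len(out) > 1:      # never pop the leading empty segment
                out.pop()
        elif segment != ".":
            out.append(segment)
    if segments[-1] in (".", ".."):
        out.append("")            # keep the trailing slash in these cases
    return "/".join(out)


def equivalent(first, second):
    """Syntactic comparison: scheme, authority, and normalized path."""
    a, b = urlsplit(first), urlsplit(second)
    return ((a.scheme, a.netloc.lower(), remove_dot_segments(a.path)) ==
            (b.scheme, b.netloc.lower(), remove_dot_segments(b.path)))


# Equivalent after normalization, which is true but not very interesting.
print(equivalent("https://authority.example/a",
                 "https://authority.example/b/../a"))
# Never equivalent: the scheme differs, so the authority is not the same.
print(equivalent("http://authority.example/a",
                 "https://authority.example/a"))
```

A syntactic check like this cannot say whether /c is the same resource as /a;
only the authority knows that.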

Section 2.

A better way to present this issue might be to recognize that a URL might
represent a resource for any number of purposes and at any layer in a stack.
Insert our earlier stackevo discussion around "what is an endpoint", because a
URL is intended to identify an endpoint, and the view of what an endpoint is
depends on context, even for the same type of URL.  For instance, HTTP URLs are
often used at the top of the stack, in a context where higher layers don't
exist.  Some web URLs fit that description, but there are also uses of HTTP URLs
that are only part of another protocol.

My thesis here is that how a URL is consumed is critical to the
recommendations we might make.  For instance, SIP URLs are used in multiple ways
and they are often decorated with all sorts of information about how to reach
the destination.  On the other hand, TFTP URLs are - as stated - pretty limited
in expressiveness.

Section 3.

The role of discovery mechanisms here is a reasonable breakdown of the options,
but I'm finding it hard to connect this with the requirements that might
motivate a particular choice.  If our ultimate goal is to identify some
scenarios in which we can give concrete advice, then this might not be the best
structure to support that goal.

Section 3.2.

The note here about "sometimes assumed that multiple transport protocols would
use the same port number" makes me wonder what mechanism might be deployed to
support this.  In other words, I think that this is a bit of a pipe-dream in RFC
6335.

That note and the discussion about parallel port registrations are a distraction
here.  Similarly, the final note about ephemeral ports doesn't really help the
case here and could be removed.  The important point is that because port
numbers (or other elements of the manner of resolving a locator) could be
different upon each instantiation of a service, this adds the necessity to
signal that information using the URL.  That isn't really compatible with the
philosophy of <https://www.w3.org/Provider/Style/URI.html>.

That suggests either that use of ephemeral ports would only be applicable in
cases where the stability of a reference is not critical, or where there is a
discovery process such that the port number is not in the URL.  That's the key
point this makes.  I don't find that the talk about RFC 6335 helps with this
point.  Though the overall discussion of the role of discovery is interesting,
it's a badly dated concept and might be better stated with new words.
