Re: Comments on draft-seantek-rdf-urn-00

Sean Leonard <dev+ietf@seantek.com> Thu, 13 November 2014 16:19 UTC

To: "Dale R. Worley" <worley@ariadne.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/urn-nid/BkYNEnm1lUaXDKjU4-PMGT_QIbo
Cc: urn-nid@ietf.org

Once again thanks for the detailed comments. I will try to focus on a couple of initial, key points.

On Nov 12, 2014, at 4:37 PM, Dale R. Worley <worley@ariadne.com> wrote:

> (Many of these comments apply to draft-seantek-xmlns-urn-00 as well.)
> 
>   1.  Introduction
> 
>   The Resource Description Framework [RDF] is a framework for
>   representing information in the web.  RDF contains nodes that are
>   identified by URI references.  The URI reference is basically an
>   opaque string with semantics applied onto it by the RDF standard; RDF
>   applications are not required or expected to dereference the URI.
> 
> You almost certainly mean "URI" not "URI reference" ("RDF contains
> nodes that are identified by URIs.") -- a "URI reference" is an
> appearance of a URI in a particular place, in the same way a footnote
> in a book is a reference.  There can be many URI references to the
> same URI.

When writing draft-00 I used the RDF 1.0 Concepts and Abstract Syntax spec <http://www.w3.org/TR/2004/REC-rdf-concepts-20040210> (from 2004).

In that spec, the term “URI reference” (also “RDF URI reference”) is used prominently. Basically it just means URI in the RFC 3986 sense. Probably the reason why it is called “URI reference” is because in RDF parlance, 

I note that in the newer RDF 1.1 Concepts and Abstract Syntax spec <http://www.w3.org/TR/rdf11-concepts/> (2014), the term “RDF URI reference” has been replaced with “IRI”. For example:
***
1.0:
6.1 RDF Triples

An RDF triple contains three components:

	• the subject, which is an RDF URI reference or a blank node
	• the predicate, which is an RDF URI reference
	• the object, which is an RDF URI reference, a literal or a blank node


1.1:
3.1 Triples

An RDF triple consists of three components:

	• the subject, which is an IRI or a blank node
	• the predicate, which is an IRI
	• the object, which is an IRI, a literal or a blank node

***
The definition in 1.0 is closely aligned with the URN syntax [RFC 2141]:

A URI reference within an RDF graph (an RDF URI reference) is a Unicode string [UNICODE] that:

	• does not contain any control characters ( #x00 - #x1F, #x7F-#x9F)
	• and would produce a valid URI character sequence (per RFC2396 [URI], sections 2.1) representing an absolute URI with optional fragment identifier when subjected to the encoding described below. 
The encoding consists of:

	• encoding the Unicode string as UTF-8 [RFC-2279], giving a sequence of octet values.
	• %-escaping octets that do not correspond to permitted US-ASCII characters.

Compare with 1.1:

An IRI (Internationalized Resource Identifier) within an RDF graph is a Unicode string [UNICODE] that conforms to the syntax defined in RFC 3987 [RFC3987].

IRIs in the RDF abstract syntax must be absolute, and may contain a fragment identifier.

IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of [RFC3987]. Further normalization must not be performed when comparing IRIs for equality.

***
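
To make the contrast between the two quoted definitions concrete, here is a minimal Python sketch. It is not taken from either spec: the function names, the PERMITTED character set (an approximation of the RFC 2396 reserved and mark characters, plus "%" and "#"), and the urn:example: names are assumptions for illustration, and the 1.1 side is reduced to plain string equality without validating RFC 3987 syntax.

```python
from urllib.parse import quote

# Assumption: characters left unescaped approximate the RFC 2396 reserved and
# mark characters, plus "%" (existing escapes) and "#" (fragment delimiter).
# quote() always leaves ASCII letters, digits and "_.-~" unescaped.
PERMITTED = ";/?:@&=+$,!*'()%#"


def has_control_chars(s: str) -> bool:
    """True if s contains a character in #x00-#x1F or #x7F-#x9F."""
    return any(ord(c) <= 0x1F or 0x7F <= ord(c) <= 0x9F for c in s)


def rdf10_encode(uri_ref: str) -> str:
    """RDF 1.0 style: UTF-8 encode, then %-escape octets that are not permitted."""
    if has_control_chars(uri_ref):
        raise ValueError("control characters are not allowed")
    return quote(uri_ref, safe=PERMITTED, encoding="utf-8")


def rdf11_iri_equal(a: str, b: str) -> bool:
    """RDF 1.1 style: simple string comparison, no further normalization."""
    return a == b


# Hypothetical example names, for illustration only.
print(rdf10_encode("urn:example:caf\u00e9#x"))  # urn:example:caf%C3%A9#x
print(rdf11_iri_equal("urn:example:caf\u00e9", "urn:example:caf%C3%A9"))  # False
```

Under the 1.1 rule the escaped and unescaped spellings are simply different IRIs, which is part of why switching specs is not a purely editorial change.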

From what I understand, there is a history of less than goodness with regard to RFC 3987…so I tried to avoid the whole topic by sticking with the RDF 1.0 spec.

I am happy to do what makes the most sense here, however, in light of the history.

> [...]
> Probably this section only intends to define the RDF URN NSS, leaving
> implicit the syntax of the full RDF URN.  That should be clarified by
> providing an explicit production for <rdf-urn>.
> 
>   When encoded in a URN, Unicode code points beyond U+007F
>   are encoded as percent-encoded UTF-8. Conveniently, all XML name
>   characters in the US-ASCII range are in the [RFC3986] unreserved set.
> 
> Describing the syntax of the NSS by specifying a set of Unicode
> strings and then an encoding to be applied to that set of strings to
> produce the URNs is formally correct but puts a burden on an
> implementer.  It would be better if that aspect of the syntax was also
> described as a combination of an informal description of the intention
> and a complete and correct ABNF.

> […]
> 
> The second and third sentence describes one-, two-, and three-
> character "names".  It's not clear what "name" means here.  By
> default, I expect it to be the same as "URN", but of course all URNs
> have at least 7 characters.  So perhaps "name" means "NSS".  But the
> syntax definition restricts NSSs to have at least 4 characters.
> 

This series of concerns relates directly to the UTF-8 encoding issue.

Based on my understanding (which could well be flawed; I seek education on the matter) of URI, IRI, URN, etc., there are many different rules and heuristics that define “equivalence” and “transformation”. For example, “lexical equivalence” of a URN is not the same as “equivalence” of a URI. In fact RFC 3986 (Section 6.2) defines several different comparison methods in what it calls a “comparison ladder”, and also freely admits that: “We use the terms "different" and "equivalent" to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence.”

My inference is that lexical equivalence of a URN is simply a different kind of equivalence relationship than that described in RFC 3986. It is one of the “application-dependent versions” (if you were to shoehorn it in).
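
As an illustration of how narrow that relationship is, here is a minimal Python sketch of RFC 2141 lexical equivalence: only the leading "urn:" token, the NID, and the hex digits of %-escapes are case-normalized before an exact comparison. The function names are mine, and the sketch assumes well-formed input with no fragment; the sample URNs are the ones RFC 2141 itself uses as examples.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def urn_lexical_key(urn: str) -> str:
    """Normalize the parts that RFC 2141 treats as case-insensitive."""
    prefix, nid, nss = urn.split(":", 2)
    if prefix.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # Only the hex digits of %-escapes are case-normalized; the rest of the
    # NSS is compared exactly as written.
    nss = _ESCAPE.sub(lambda m: m.group(0).upper(), nss)
    return "urn:" + nid.lower() + ":" + nss


def urn_lexically_equivalent(a: str, b: str) -> bool:
    return urn_lexical_key(a) == urn_lexical_key(b)


print(urn_lexically_equivalent("URN:foo:a123,456", "urn:FOO:a123,456"))          # True
print(urn_lexically_equivalent("urn:foo:a123,456", "urn:foo:A123,456"))          # False
print(urn_lexically_equivalent("urn:foo:a123%2C456", "URN:FOO:a123%2c456"))      # True
print(urn_lexically_equivalent("urn:foo:a123,456", "urn:foo:a123%2C456"))        # False
```

Note that the last pair is not equivalent even though "%2C" decodes to ",": unlike the percent-encoding normalization rung of the RFC 3986 ladder, this comparison never decodes escapes.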

When I wrote these drafts, I had in mind (and should have stated more explicitly) distinctions between:
<<<
the name and the resource: the name is an abstract identifier that is unique etc. in the space of names for a set of resources; the resource is an abstract thingamajig of interest to applications on the Internet
 (I have provided more commentary about this on urn@ietf.org)

the Name, which (in the case of RDF URNs, at least) is a Unicode string that complies mostly with the XML Name production (with the changes discussed in this document)

the NSS, which is the sequence of characters that is syntactically compliant with RFC 2141 (URN Syntax), with an obvious relationship to the Name

the URN, which is the NID + NSS, and has “URN semantics” (whatever that is—basically the stuff about being persistent, unique, and resolvable with a URN resolver)

the urn: URI, which is an RFC 3986-compliant URI that looks like <urn:xmlns:acme:foodlv2#bar>.
<<<

With these distinctions (vaguely) in mind, I felt that the term “lexical equivalence” is primarily relevant to the NSS and URN concepts, not to the urn: URI concept. Thus the urn: URI is a syntactic way to adapt a URN to a URI world, viz.:

urn:<URN>[#REF]

i.e., the fragment is not part of the URN, although it is part of the urn: URI. The fragment depends on the resource, the resource is defined by the preceding part of the URI, and in the case of URNs, the resource is this abstract thingamajig of interest to applications on the Internet.
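
A minimal Python sketch of that layering, assuming the model described above (URN = NID + NSS, with the fragment belonging only to the urn: URI). The type and function names are hypothetical, and no NID or NSS syntax checking is attempted.

```python
from typing import NamedTuple, Optional


class UrnUri(NamedTuple):
    nid: str                 # namespace identifier, e.g. "xmlns"
    nss: str                 # namespace-specific string
    fragment: Optional[str]  # part of the urn: URI, not of the URN

    @property
    def urn(self) -> str:
        """The URN proper in this model: NID + NSS, without the fragment."""
        return self.nid + ":" + self.nss


def split_urn_uri(uri: str) -> UrnUri:
    """Split urn:<URN>[#REF] into NID, NSS and optional fragment."""
    body, sep, fragment = uri.partition("#")
    scheme, nid, nss = body.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a urn: URI: " + uri)
    return UrnUri(nid, nss, fragment if sep else None)


parts = split_urn_uri("urn:xmlns:acme:foodlv2#bar")
print(parts.urn)       # xmlns:acme:foodlv2
print(parts.fragment)  # bar
```

With this split, lexical equivalence would be applied to the NID and NSS only, and the fragment would be compared (or interpreted) separately.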

Furthermore, the URN consists only of US-ASCII characters, but the Name is a Unicode string. This provides a path to transform or use non-ASCII characters in the RDF URN…which I guess, when not percent-encoded, would make it an RDF IRN (Internationalized Resource Name?), which is compatible with RDF 1.1. That approaches the twilight zone because, among other disgruntlements, such an RDF IRN in RDF 1.1 could be transformed back to an RDF URN in RDF 1.0 (for analysis and comparison purposes), but it’s not obvious whether a URN/URI/IRI parser needs to treat the percent-encoded characters as irrelevant to lexical equivalence. I just saw all of this and said… “let’s deal with that in a later draft…”.

Consider the homograph problem, such as “é” (e with acute). In Unicode there are two code point sequences that can produce such a grapheme: U+00E9 and U+0065 U+0301 (e + combining acute). XML says that Names SHOULD use Normalization Form C, but it doesn’t require that. In any event there are plenty of examples of homographs where picking a normalization form doesn’t help.

In a [RFC 2141] URN, the code sequences need to be converted to %C3%A9 and e%CC%81 respectively (percent-encoded UTF-8). The distinction is patently obvious. But when you allow non-ASCII characters directly (as RDF 1.1 makes more explicit, in any event) the distinction is lost on display, and could well be lost in transcription. This is the origin of the IANA registration process that is supposed to look at confusingly similar names in an automated fashion. It is the same problem that plagues IDNA registrations in the DNS.
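
A short Python sketch of the two spellings of “é”, using only the standard library. The pct helper is a hypothetical name, and the Latin/Cyrillic pair is my own illustration of a homograph that normalization does not resolve.

```python
import unicodedata
from urllib.parse import quote


def pct(s: str) -> str:
    """Percent-encode the UTF-8 octets of s, leaving unreserved ASCII as-is."""
    return quote(s, safe="", encoding="utf-8")


precomposed = "\u00e9"   # e with acute, one code point
decomposed = "e\u0301"   # e followed by combining acute

print(pct(precomposed))  # %C3%A9
print(pct(decomposed))   # e%CC%81

# Normalization Form C maps the decomposed spelling onto the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A homograph that NFC does not help with: Latin "a" vs. Cyrillic "a".
latin_a = "a"
cyrillic_a = "\u0430"
print(unicodedata.normalize("NFC", cyrillic_a) == latin_a)  # False
print(pct(cyrillic_a))  # %D0%B0
```

In percent-encoded form every one of these is visibly distinct; rendered directly, the first pair and the second pair are each indistinguishable in most fonts.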

We could avoid the problem entirely by only using characters in the unreserved set (which conveniently corresponds mostly to the US-ASCII range of XML Name). But then we lose the Unicode goodness. I don’t know which is worse, but having seen all of this, I opted in draft-00 to include non-ASCII support, to start a conversation about it.

Sean