Re: [urn] gbs Name space identifier

Philip R Brenan <philiprbrenan@gmail.com> Mon, 21 October 2019 16:35 UTC

MIME-Version: 1.0
References: <CALhwFRmk_XzXHdpQXCDCp95cUGEpd9tmLTi4yjg+AAtpNhPg9g@mail.gmail.com> <87mue0m8sb.fsf@hobgoblin.ariadne.com>
In-Reply-To: <87mue0m8sb.fsf@hobgoblin.ariadne.com>
From: Philip R Brenan <philiprbrenan@gmail.com>
Date: Mon, 21 Oct 2019 17:35:11 +0100
Message-ID: <CALhwFR=LKzGK2So6S1LfHroFLFi0H12WevEzNJC_qTMsZtr0+w@mail.gmail.com>
To: "Dale R. Worley" <worley@ariadne.com>, "Hakala, Juha E" <juha.hakala@helsinki.fi>, urn@ietf.org, lars.svensson@web.de
Content-Type: multipart/mixed; boundary="00000000000078fd5105956e45bc"
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/racjNrSgLDjESJP5xHk3KovIkFo>
Subject: Re: [urn] gbs Name space identifier
Precedence: list

Hi *Dale*:

Thank you for raising these important issues.  I have updated the text of the
proposal (attached) in light of your helpful comments.

I have corrected the error made on my part regarding MD5 collisions that you
kindly noted. It now reads:

    Equivalence is determined by comparing (ignoring case) just the <B>
    components of the two topics to be compared.  If they are equal the two
    topics are considered to be equal, even if, as a result of an MD5
    collision, the content of the underlying documents is in fact different.
    The characteristics of the MD5 sum make such an occurrence extremely
    unlikely; see, for example, Mead:

    "Unique File Identification in the National Software
     Reference Library"

    at:

    https://www.nist.gov/sites/default/files/draft-060530.pdf

    Authors who have concerns over the possible impact of an MD5 collision
    on their work should not use this name space.
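
A minimal sketch of the equivalence rule quoted above, assuming the layout
urn:gbs:<T>:<G>:<B> seen in the example URN later in this message. The helper
names b_component and equivalent are hypothetical and are not part of the
proposal.

```python
def b_component(urn: str) -> str:
    """Return the <B> component (the last colon-separated field), lowercased."""
    # Assumption: the name has exactly the layout urn:gbs:<T>:<G>:<B>.
    parts = urn.split(":")
    if len(parts) != 5 or parts[0].lower() != "urn" or parts[1].lower() != "gbs":
        raise ValueError(f"not a urn:gbs name: {urn!r}")
    return parts[4].lower()


def equivalent(urn_a: str, urn_b: str) -> bool:
    """Treat two topics as equal when their <B> components match, ignoring case."""
    return b_component(urn_a) == b_component(urn_b)
```

Note that the <T> and <G> components play no part in the comparison, so two
names with different <G> text but the same <B> value compare as equal.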

My rationale for this change is as follows:

All URN name spaces that do not contain the content of the actual topic in
question are, in effect, digests that seek to use a finite string to name one
topic drawn from a potentially unbounded set of topics. For each such digest
there must then be an unbounded number of other topics that could also yield
that digest.  Consequently, in theory, all such name spaces suffer from
collisions. In practice, only a finite number of topics need to be named, so
the design of a URN name space becomes one of engineering: is the collision
rate sufficiently low to be acceptable in the proposed application?
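
The engineering question can be put in rough numbers with the standard
birthday approximation, p = 1 - exp(-n(n-1)/(2N)), for n topics and a digest
space of size N. The sketch below assumes digests behave like uniformly random
values and takes n = 1e10 from the topic-space estimate given later in this
message. It covers accidental collisions only and says nothing about
deliberately constructed ones.

```python
import math


def accidental_collision_probability(n_topics: float, digest_bits: int) -> float:
    """Birthday approximation: chance that any two of n_topics share a digest."""
    space = 2.0 ** digest_bits
    exponent = n_topics * (n_topics - 1) / (2.0 * space)
    # -expm1(-x) evaluates 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


for name, bits in [("CRC-32", 32), ("MD5", 128), ("SHA-256", 256)]:
    p = accidental_collision_probability(1e10, bits)
    print(f"{name:8s} {bits:3d} bits  p = {p:.3e}")
```

Under these assumptions a 32-bit digest is certain to collide at this scale,
while a 128-bit digest leaves a probability of order 1e-19.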

In developing this proposal I initially considered various incremental
numbering schemes. These schemes were deemed failures as they required
collections of documents to communicate in real time to avoid name collisions.

I then considered the CRC-32, MD4, MD5 and SHA-2 series of hash functions. Of
these, MD5 was the shortest digest that appeared to offer an acceptable
collision rate, following the study done by Mead:
https://www.nist.gov/sites/default/files/draft-060530.pdf

As the URN identifying a document is likely to be frequently embedded in XML
and to appear in reports, presentations, program stack tracebacks,
spreadsheets, web pages, database indices etc., a shorter digest is preferable
to a longer one as long as the collision rate is acceptable.

MD5 is a useful compromise: its names are short enough to be used in a variety
of applications and contexts by both humans and computers, while its collision
rate remains low enough. CRC-32 and MD4 were deemed too unreliable.  SHA-2 was
rejected on the grounds that the minimal improvement in reliability obtained
in the proposed application space failed to justify the extra overhead induced
by the longer digests.

Consequently, MD5 was chosen as the appropriate digest based on operational
experience obtained after trying a number of other possibilities.  If, in the
light of further operational experience, it becomes apparent that some other
digest would be more appropriate, the <T> component provides space within which
to adopt the new digest over time.

May I therefore urge that MD5 is an appropriate choice for now and that
adoption of this particular proposal should not be made dependent on finding
a perfect digest?

In Security and Privacy, following your suggestion, I have clarified that
access to the content named by the URN is required to check the validity of the
URN as follows:

   Access to the content of the topic named by the URN is required to check the
   validity of the URN. The validity of the URN for a topic can be checked as
   follows: ...

May I propose that the need to access the content of a URL or URN to fully
validate the URI is a common feature of all name spaces, as evinced by the
existence of HTTP code 404, and thus should not, in and of itself, impede the
adoption of this particular proposal?

On Inter-operability, following your suggestion, I have clarified why case is
often immaterial in anticipated operations on this name space as follows:

   For many computations, the case of the letters in the URN is immaterial and
   can be safely ignored because only the <B> component is authoritative:
   although the preferred representation of the <B> component is in lowercase
   to minimize its visual impact on human readers, the lowercase form can
   easily be recovered from a mixed case form making the <B> component, in
   effect, insensitive to case.

   The MD5 sum represented by the <B> component can undergo degradation in
   copying, storage or transmission yet still be recoverable by querying
   significant collections of topics for topics with similar <B> components
   given that the anticipated size of the topic space is of order 1e10 versus
   an MD5 space size of order 1e38.
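
One way such a recovery could work is a nearest-match search by Hamming
distance. This is only a sketch: it assumes degradation substitutes characters
without changing the length, and it uses a small in-memory list (the
hypothetical known_b) in place of a real collection. The names hamming and
recover, and the max_errors threshold, are illustrative choices that do not
come from the proposal.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def recover(degraded: str, known_b: List[str], max_errors: int = 4) -> Optional[str]:
    """Return the single known <B> value within max_errors of degraded, else None."""
    degraded = degraded.lower()
    close = [b for b in known_b
             if len(b) == len(degraded) and hamming(b, degraded) <= max_errors]
    # Refuse to guess when no candidate, or more than one, is close enough.
    return close[0] if len(close) == 1 else None
```

Because the topic space is so sparse relative to the digest space, two
unrelated <B> values are very unlikely to lie within a few characters of each
other, which is what makes a unique nearest match plausible.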

   Ideally, the <G> component should be presented in the case designated by the
   original author of the topic.  In cases where this is not possible, the <G>
   component can undergo degradation and still remain useful, for example: high
   degradation rates in the <G> component have been noticed when the <G>
   component is spoken out loud by people collaborating in a shared work space
   on fewer than 1e2 topics. The <G> component has a space size of at least 1e6
   as evinced by the number of Wikipedia articles in English:

   https://www.wikipedia.org/

   Operational experience has confirmed, so far, that the <G> component is
   capable of tolerating the significant degradation of case and spelling that
   normally occurs in human speech.

   If the <G> component of two topics is intentionally identical or identical
   after degradation then the identity of a specific topic can be confirmed by
   saying the first few characters of its <B> component using a phonetic
   alphabet, such as the one used by NATO:

   https://en.wikipedia.org/wiki/NATO_phonetic_alphabet

   It is anticipated that names from the proposed name space will be embedded
   in XML topic references, file names, URL queries and commands entered via
   the command line.  XML is sensitive to spaces and the following characters:

     <>'"=&

   File systems are often sensitive to file names containing:

     :/\.

   The query portion of a URL is sensitive to:

     &=#

   The command line is sensitive to spaces and:

     "'\$()[]{}

   The <T>, <G>, <B> components avoid these characters to facilitate
   inter-operability between these systems.
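
A small sketch of how a candidate component could be checked against the
character lists above. The constant and function names are hypothetical, and
the check covers only the characters listed in this message, not the full
grammar of the components.

```python
# Characters listed above as significant to each kind of system.
XML_CHARS = "<>'\"=&"
FILE_CHARS = ":/\\."
QUERY_CHARS = "&=#"
SHELL_CHARS = "\"'\\$()[]{}"

# Spaces are significant to both XML and the command line.
SENSITIVE = set(XML_CHARS + FILE_CHARS + QUERY_CHARS + SHELL_CHARS + " ")


def avoids_sensitive(component: str) -> bool:
    """True when the component contains none of the listed characters."""
    return not (set(component) & SENSITIVE)
```

The colons that separate the components of the full URN are deliberately not
part of any component, so the check is applied per component rather than to
the whole name.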

   To facilitate the construction of file names and URLs containing references
   to topics named by this proposed name space, the formal name assigned by
   this proposal may be intentionally degraded by omitting the words 'urn' and
   'gbs', omitting the <T> component, replacing the colon between the <G> and
   <B> components with an underscore and adding a file name extension if the
   formal URN can be reliably recovered from the degraded version in the
   context within which the degraded version is being used. In such a context,
   a formal URN:

 urn:gbs:dita:Introduction_to_the_GB_Standard:dddb7e2c29d2c8b9d87187fdf52a2702

   may be intentionally degraded to:

     Introduction_to_the_GB_Standard_dddb7e2c29d2c8b9d87187fdf52a2702.xml

   Other acceptable degradations will be published as updates to this document.
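
The intentional degradation to a file name, and its reversal, can be sketched
as follows. The function names degrade and restore are hypothetical. The
sketch assumes the layout urn:gbs:<T>:<G>:<B> and that the surrounding context
supplies the <T> value (taken here to be "dita", as in the example).

```python
def degrade(urn: str, extension: str = "xml") -> str:
    """Turn urn:gbs:<T>:<G>:<B> into the file name <G>_<B>.<extension>."""
    _urn, _gbs, _t, g, b = urn.split(":")
    return f"{g}_{b}.{extension}"


def restore(file_name: str, t: str = "dita") -> str:
    """Rebuild the formal URN from <G>_<B>.<extension>, with <T> from context."""
    # The components contain no '.', so the last '.' starts the extension.
    stem = file_name.rsplit(".", 1)[0]
    # <B> is hexadecimal and holds no '_', so the last '_' separates <G> from <B>.
    g, b = stem.rsplit("_", 1)
    return f"urn:gbs:{t}:{g}:{b}"
```

Splitting from the right matters here because the <G> component may itself
contain underscores.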

   Dita topics that do not contain ASCII characters suitable for constructing
   the <G> component will be accommodated by adding a new value to the list of
   values accepted by the <T> component and specifying the corresponding
   algorithm for computing the <G> component in an update to this document.

During the development of the proposed name space I experimented with
alternative representations of the <B> component, specifically: representing it
in base64 and in decimal.  Base64 had the disadvantage that the predominance of
letters over numbers made it harder for humans to separate it from the <G>
component.  The decimal representation overcame this problem but at the expense
of a longer <B> component.  Just as MD5 seems to offer a useful balance between
length and collision rate in computations on the <B> component, representation
of the <B> component in lowercase hexadecimal seems to offer a useful balance
between length and the speed at which humans can separate it from the <G>
component.  If, as a result of further operational experience, it becomes
apparent that a better representation of the <B> component is available then
the <T> component allows names within the proposed name space to be upgraded
over time.
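
For comparison, the three representations of a 16-byte MD5 digest can be
produced with the Python standard library. The input bytes below are an
arbitrary placeholder and not a real topic.

```python
import base64
import hashlib

digest = hashlib.md5(b"example topic content").digest()   # 16 raw bytes

as_hex = digest.hex()                                  # always 32 characters
as_base64 = base64.b64encode(digest).decode("ascii")   # 24 characters, ending "=="
as_decimal = str(int.from_bytes(digest, "big"))        # at most 39 digits

for label, text in [("hex", as_hex), ("base64", as_base64), ("decimal", as_decimal)]:
    print(f"{label:8s} {len(text):2d}  {text}")
```

Standard base64 can also emit '/' and '=', both of which appear in the lists
of characters that file systems and URL queries are sensitive to.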

May I therefore represent that the proposed name space is usefully resilient to
changes in case and the other degradations likely to occur during usage while
maintaining a usable balance between the needs of humans versus computers?

In summary: may I urge that the combination of the <G> and <B> components in
the proposed name space represents a useful advance in the art of naming Dita
topics and that consequently the proposed name space can be safely approved?

On Thu, Oct 17, 2019 at 2:59 AM Dale R. Worley <worley@ariadne.com> wrote:

> I see the new version as an enormous improvement.  One aspect that
> you've made clear is that for any given <B> value, there can be only one
> <G> value (although you cannot compute <G> from <B> unless you happen to
> have a copy of the topic to hand).
>
> As has been noted, URNs are *names* while URLs are *locators*, and URNs
> do not require a resolution process to locate the named resource.
>
> However, I think there are some aspects of the exposition that should be
> improved:
>
> Under "Resolution", it says:
>
>     Equivalence is determined by comparing (ignoring case) the <B>
> components
>     of the two topics to be compared.  If they are equal the two topics are
>     considered to be equal. Otherwise they are considered to be unequal
> even if
>     the underlying content is in fact identical. The characteristics of
> the MD5
>     sum ensure that only a small number of topics will be unnecessarily
>     duplicated as a result of such false positive equivalences.
>
> I do not see how "the underlying content is in fact identical" without
> generating the same <G> and <B>.  I suspect you mean that there can be
> two topics that have no *meaningful* difference but nonetheless have
> different <B> or <G>.
>
> Under "Security and Privacy", it says:
>
>    The validity of the URN can be checked as follows:
>
> However, these actions can only be done if one possesses not just the
> URN but also the content it refers to.  Usually "validity checking"
> names a procedure that only requires the URN as input.  So it would help
> if you expanded this sentence.  Perhaps, "The validity of the URN for a
> topic can be checked as follows:"
>
> Under "Inter-operability", it says:
>
>    The case of the letters chosen is immaterial and can be safely ignored
> in
>    all computations on the proposed URN as only the <B> component is used
> for
>    comparisons.
>
> The case of letters can be ignored for comparing URNs, but when copying
> or transmitting URNs, case has to be preserved, because only lower-case
> letters are valid in the <B> part of the URN, and the case of the
> letters in the <G> part must match that of the text from which it was
> derived -- so if you send the URN to another process, you have to get
> the case right.  So you want to narrow this statement.  Perhaps, "For
> many computations, the case of the letters in the URN is immaterial and
> can be safely ignored..."
>
> Dale
>

-- 
Thanks,

Phil <https://opentokrtc.com/room/phil>

Philip R Brenan <https://opentokrtc.com/room/phil>

Attachment: gbStandardUrnRegistration.txt
