Re: Message-ID draft, comments please

Graham Klyne <GK@dial.pipex.com> Tue, 08 June 1999 12:25 UTC

Received: from CS.UTK.EDU (CS.UTK.EDU [128.169.94.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id IAA24108 for <drums-archive@odin.ietf.org>; Tue, 8 Jun 1999 08:25:46 -0400 (EDT)
Received: from localhost (daemon@localhost) by CS.UTK.EDU with SMTP (cf v2.9s-UTK) id IAA11123; Tue, 8 Jun 1999 08:25:20 -0400 (EDT)
Received: by cs.cs.utk.edu (bulk_mailer v1.12); Tue, 8 Jun 1999 08:25:18 -0400
Received: by CS.UTK.EDU (cf v2.9s-UTK) id IAA11105; Tue, 8 Jun 1999 08:25:16 -0400 (EDT)
Received: from pegasus.group5.co.uk (LOCALHOST.cs.utk.edu [127.0.0.1]) by CS.UTK.EDU with ESMTP (cf v2.9s-UTK) id IAA11054; Tue, 8 Jun 1999 08:24:59 -0400 (EDT)
Received: from pegasus.group5.co.uk (193.128.238.226 -> mailhost.group5.co.uk) by CS.UTK.EDU (smtpshim v1.0); Tue, 8 Jun 1999 08:25:03 -0400
Received: from GK-Portable (unverified [62.188.142.229]) by pegasus.group5.co.uk (Rockliffe SMTPRA 2.1.5) with SMTP id <B0000814864@pegasus.group5.co.uk>; Tue, 08 Jun 1999 13:16:08 +0100
Message-Id: <3.0.32.19990608121054.017790b0@pop.dial.pipex.com>
X-Sender: maiw03@pop.dial.pipex.com
X-Mailer: Windows Eudora Pro Version 3.0 (32)
Date: Tue, 08 Jun 1999 13:24:42 +0100
To: Matt Curtin <cmcurtin@interhack.net>
From: Graham Klyne <GK@dial.pipex.com>
Subject: Re: Message-ID draft, comments please
Cc: drums@cs.utk.edu
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
List-Unsubscribe: <mailto:drums-request@cs.utk.edu?Subject=unsubscribe>

Matt,

I think a document of this kind is useful to have.  Also, I think having a
document that takes account of news and mail system idiosyncrasies is a
Good Thing.  I have a few comments on this one:

At 18:11 07/06/99 -0400, Matt Curtin wrote:
>	      Recommendations for generating Message IDs

[...]
>2. Message-ID formatting

[...]
>Domain MUST be a reasonable, resolvable fully-qualified domain name
>(FQDN).  It is noteworthy that it is not always possible for a host to
>determine its FQDN and sometimes site policy dictates that internal
>hostnames not be exposed to the public Internet.  In these cases,
>Message-ID generators MUST NOT use an unqualified hostname.

I don't see that the FQDN needs to be resolvable.  I agree that it must be
properly administratively scoped, but (I think) "resolvable" implies that a
SND entry for the name must exist.  For example, the suggestion of using
something like (say) "<uuid>@id.microsoft.com" seems entirely reasonable.

I'll take a stab at an alternative to the 1st sentence above:

# Domain MUST be a reasonable, fully-qualified domain name (FQDN) within a
# DNS zone for which an authoritative DNS server can be determined.

[...]
>3.2. "Local part"
>

[...]
>3.2.2. Using a psuedorandom number generator
>
>One could take 64 bits from a good, well-seeded pseudorandom number
>generator [PRNG] in order to increase significantly the uniqueness of
>the Message-ID.  The advantage of this method is that it is fast and
>effective.  The disadvantage is that not all pseudorandom number
>generators are created equal: just because you've generated a 64-bit
>random number does not mean that you have actually gotten 64 bits of
>randomness.  The quality of your random number generator is important.

Good points, but...

>As long as the probability of the same number being generated twice is
>smaller than the probability that the machine will mis-execute an
>instruction, your job is done.

... I'm uneasy with statements like this which appeal to some unknown event
that seems to have little relationship with the problem at hand.

I think that stochastic algorithms that do not guarantee 100% perfect
results need to be presented carefully as such.  To this end, I applaud the
inclusion of an appendix that deals with actual probabilities (more of that
later).

The criterion you give is not exactly one that is demonstrably satisfied by
some code.  A possible alternative criterion (which admitedly is also a bit
vague) is to present the probability of failure due to random ID collisions
as being very small in relation to other possible failure modes in the
system as a whole.  I imagine that statistics for such system failures
could be found.  Also, it may be that the consequences of failure may need
to be taken into account:  is this ID the basis for correct operation of a
safety-critical system, non-repudiation of a high-value monetary transfer,
or a system diagnostic tool?


>3.2.3. Using a hash
>
>Another approach would be to generate a hash of the message and use
>that after the timestamp.  If this is done well, this can also
>significantly reduce the opportunity for collision, and will generate
>a value that is relatively unique.  Note that, in practice, this is
>more difficult than it sounds.  It is recommended that a
>cryptographically secure hash function [SHA1, MD5] be used, as
>others, such as CRC, are likely to have higher instances of collision.

It might be worth also noting that a hash-based identifier might reduce the
potential  for insertion of or replacement by fake identifiers.


>3.3. Bringing it all together

[...]
>  3. Generate additional data to prevent Message-ID collision on two
>     messages processed by the same host at the same clock-tick.
>     Suggestions for sources of these additional data are described
>     in section 3.2.  We recommend a good 64-bit random number.

Or a hash (MD5, say) of the message?

>  4. For each of the numbers generated in steps 2 and 3, convert the
>     numbers to base 36 (using the characters 0-9 and A-Z), and write
>     the numbers, with a "." between them.
>
>     (Base36 is recommended in order to reduce the overall length of
>     the message ID; economy in message ID length is a good thing,
>     given how prevalent they are.)

Is there an RFC-series document describing construction of Base36 numbers?


>Appendix A: Probability of Message-ID Collision
>
>Let's suppose that a given Internet Service Provider has one million
>customers who will all have the same hostname on their messages.  Now,
>let's suppose that each one of those customers generates one thousand
>messages per day.  Assume a timestamp as part of the message ID that
>includes the time to the nearest second.  There are 86,400 seconds in
>a day, making the probability that a given customer will be sending a
>message at a given second about 0.012. There will be, on average,
>12,000 customers generating a message ID at the same time to the limit
>of accuracy of the clock.

I think the statistics here are rather suspect, since they assume that
messages are sent (statistically) independently of the time-of-day.  I
would suggest that is unlikely.  I think it would be better to _assume_ a
maximum number of messages prepared per second and perform statistical
calculations on that basis;  Then the formula noted below would apply.
Better still, I'd assume a range of values that go well beyond current
expectations so that some idea of the limits within which the mechanism
will provide the expected behaviour can be seen.

(When I did similar calculations for collision probability of 128-bit
hashes, I was surprised to find that the probability of collision
approaches unity when just 1 in 10^18 of the value space is used -- see
below).

>A truly random 64 bit number added to the Message-ID means that the
>probability of a collision is incredibly miniscule.  Specifically, the
>probability of collosion per per day is:
>
>      (10^9)!        (10^9)!
>    -----------  x  ---------
>    (24x60x60)!     (2^{64})!

That formula looks suspect to me.  (Given that I suggest a slightly
different approach above, I've not tried to fully work it out).
 
>As soon as we find a way to calculate that, we'll spell it out.

I have set up some similar calculations for 128-bit MD5 hashes that are to
be described and tabulated in a future revision of
<draft-ietf-conneg-feature-hash>.  This is a summary of my results:

    Number of          Probability of two
    hash values        hash values the same
    -----------        --------------------                       
      1                  0
      2                  3E-39
     10                  1E-37
      1E3                1E-33
      1E6                1E-27
      1E9                1E-21
      1E12               1E-15
      1E15               1E-9
      1E18               1E-3

The above probability computations are approximate, being 
performed using logarithms of a Gamma function approximation 
by Lanczos.  The probability formula is 'P=1-(m!/((m-
n)! m^n))', where 'm' is the total number of possible hash 
values (2^128) and 'n' is the number of values in use.

The calculations were repeated using a range of different precisions (up to
several thousand digits, and using a more accurate LogGamma function than
that cited), and also some graphical trend analysis was performed, to be
confident that the relatively large probabilities at the bottom of the
table were not being caused by numerical rounding errors.

#g

------------
Graham Klyne
(GK@ACM.ORG)