Re: Message-ID draft, comments please
Graham Klyne <GK@dial.pipex.com> Tue, 08 June 1999 12:25 UTC
Received: from CS.UTK.EDU (CS.UTK.EDU [128.169.94.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id IAA24108 for <drums-archive@odin.ietf.org>; Tue, 8 Jun 1999 08:25:46 -0400 (EDT)
Received: from localhost (daemon@localhost) by CS.UTK.EDU with SMTP (cf v2.9s-UTK) id IAA11123; Tue, 8 Jun 1999 08:25:20 -0400 (EDT)
Received: by cs.cs.utk.edu (bulk_mailer v1.12); Tue, 8 Jun 1999 08:25:18 -0400
Received: by CS.UTK.EDU (cf v2.9s-UTK) id IAA11105; Tue, 8 Jun 1999 08:25:16 -0400 (EDT)
Received: from pegasus.group5.co.uk (LOCALHOST.cs.utk.edu [127.0.0.1]) by CS.UTK.EDU with ESMTP (cf v2.9s-UTK) id IAA11054; Tue, 8 Jun 1999 08:24:59 -0400 (EDT)
Received: from pegasus.group5.co.uk (193.128.238.226 -> mailhost.group5.co.uk) by CS.UTK.EDU (smtpshim v1.0); Tue, 8 Jun 1999 08:25:03 -0400
Received: from GK-Portable (unverified [62.188.142.229]) by pegasus.group5.co.uk (Rockliffe SMTPRA 2.1.5) with SMTP id <B0000814864@pegasus.group5.co.uk>; Tue, 08 Jun 1999 13:16:08 +0100
Message-Id: <3.0.32.19990608121054.017790b0@pop.dial.pipex.com>
X-Sender: maiw03@pop.dial.pipex.com
X-Mailer: Windows Eudora Pro Version 3.0 (32)
Date: Tue, 08 Jun 1999 13:24:42 +0100
To: Matt Curtin <cmcurtin@interhack.net>
From: Graham Klyne <GK@dial.pipex.com>
Subject: Re: Message-ID draft, comments please
Cc: drums@cs.utk.edu
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
List-Unsubscribe: <mailto:drums-request@cs.utk.edu?Subject=unsubscribe>
Matt, I think a document of this kind is useful to have. Also, I think having a document that takes account of news and mail system idiosyncrasies is a Good Thing. I have a few comments on this one: At 18:11 07/06/99 -0400, Matt Curtin wrote: > Recommendations for generating Message IDs [...] >2. Message-ID formatting [...] >Domain MUST be a reasonable, resolvable fully-qualified domain name >(FQDN). It is noteworthy that it is not always possible for a host to >determine its FQDN and sometimes site policy dictates that internal >hostnames not be exposed to the public Internet. In these cases, >Message-ID generators MUST NOT use an unqualified hostname. I don't see that the FQDN needs to be resolvable. I agree that it must be properly administratively scoped, but (I think) "resolvable" implies that a SND entry for the name must exist. For example, the suggestion of using something like (say) "<uuid>@id.microsoft.com" seems entirely reasonable. I'll take a stab at an alternative to the 1st sentence above: # Domain MUST be a reasonable, fully-qualified domain name (FQDN) within a # DNS zone for which an authoritative DNS server can be determined. [...] >3.2. "Local part" > [...] >3.2.2. Using a psuedorandom number generator > >One could take 64 bits from a good, well-seeded pseudorandom number >generator [PRNG] in order to increase significantly the uniqueness of >the Message-ID. The advantage of this method is that it is fast and >effective. The disadvantage is that not all pseudorandom number >generators are created equal: just because you've generated a 64-bit >random number does not mean that you have actually gotten 64 bits of >randomness. The quality of your random number generator is important. Good points, but... >As long as the probability of the same number being generated twice is >smaller than the probability that the machine will mis-execute an >instruction, your job is done. ... I'm uneasy with statements like this which appeal to some unknown event that seems to have little relationship with the problem at hand. I think that stochastic algorithms that do not guarantee 100% perfect results need to be presented carefully as such. To this end, I applaud the inclusion of an appendix that deals with actual probabilities (more of that later). The criterion you give is not exactly one that is demonstrably satisfied by some code. A possible alternative criterion (which admitedly is also a bit vague) is to present the probability of failure due to random ID collisions as being very small in relation to other possible failure modes in the system as a whole. I imagine that statistics for such system failures could be found. Also, it may be that the consequences of failure may need to be taken into account: is this ID the basis for correct operation of a safety-critical system, non-repudiation of a high-value monetary transfer, or a system diagnostic tool? >3.2.3. Using a hash > >Another approach would be to generate a hash of the message and use >that after the timestamp. If this is done well, this can also >significantly reduce the opportunity for collision, and will generate >a value that is relatively unique. Note that, in practice, this is >more difficult than it sounds. It is recommended that a >cryptographically secure hash function [SHA1, MD5] be used, as >others, such as CRC, are likely to have higher instances of collision. It might be worth also noting that a hash-based identifier might reduce the potential for insertion of or replacement by fake identifiers. >3.3. Bringing it all together [...] > 3. Generate additional data to prevent Message-ID collision on two > messages processed by the same host at the same clock-tick. > Suggestions for sources of these additional data are described > in section 3.2. We recommend a good 64-bit random number. Or a hash (MD5, say) of the message? > 4. For each of the numbers generated in steps 2 and 3, convert the > numbers to base 36 (using the characters 0-9 and A-Z), and write > the numbers, with a "." between them. > > (Base36 is recommended in order to reduce the overall length of > the message ID; economy in message ID length is a good thing, > given how prevalent they are.) Is there an RFC-series document describing construction of Base36 numbers? >Appendix A: Probability of Message-ID Collision > >Let's suppose that a given Internet Service Provider has one million >customers who will all have the same hostname on their messages. Now, >let's suppose that each one of those customers generates one thousand >messages per day. Assume a timestamp as part of the message ID that >includes the time to the nearest second. There are 86,400 seconds in >a day, making the probability that a given customer will be sending a >message at a given second about 0.012. There will be, on average, >12,000 customers generating a message ID at the same time to the limit >of accuracy of the clock. I think the statistics here are rather suspect, since they assume that messages are sent (statistically) independently of the time-of-day. I would suggest that is unlikely. I think it would be better to _assume_ a maximum number of messages prepared per second and perform statistical calculations on that basis; Then the formula noted below would apply. Better still, I'd assume a range of values that go well beyond current expectations so that some idea of the limits within which the mechanism will provide the expected behaviour can be seen. (When I did similar calculations for collision probability of 128-bit hashes, I was surprised to find that the probability of collision approaches unity when just 1 in 10^18 of the value space is used -- see below). >A truly random 64 bit number added to the Message-ID means that the >probability of a collision is incredibly miniscule. Specifically, the >probability of collosion per per day is: > > (10^9)! (10^9)! > ----------- x --------- > (24x60x60)! (2^{64})! That formula looks suspect to me. (Given that I suggest a slightly different approach above, I've not tried to fully work it out). >As soon as we find a way to calculate that, we'll spell it out. I have set up some similar calculations for 128-bit MD5 hashes that are to be described and tabulated in a future revision of <draft-ietf-conneg-feature-hash>. This is a summary of my results: Number of Probability of two hash values hash values the same ----------- -------------------- 1 0 2 3E-39 10 1E-37 1E3 1E-33 1E6 1E-27 1E9 1E-21 1E12 1E-15 1E15 1E-9 1E18 1E-3 The above probability computations are approximate, being performed using logarithms of a Gamma function approximation by Lanczos. The probability formula is 'P=1-(m!/((m- n)! m^n))', where 'm' is the total number of possible hash values (2^128) and 'n' is the number of values in use. The calculations were repeated using a range of different precisions (up to several thousand digits, and using a more accurate LogGamma function than that cited), and also some graphical trend analysis was performed, to be confident that the relatively large probabilities at the bottom of the table were not being caused by numerical rounding errors. #g ------------ Graham Klyne (GK@ACM.ORG)
- Message-ID draft, comments please Matt Curtin
- Re: Message-ID draft, comments please Graham Klyne
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Dave Sill
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Graham Klyne
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Paul Overell
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Graham Klyne
- Re: Message-ID draft, comments please Matt Curtin
- Re: Message-ID draft, comments please Matt Curtin
- Re: Message-ID draft, comments please Robert Elz
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Matt Curtin
- Re: Message-ID draft, comments please Robert Elz
- Re: Message-ID draft, comments please Brad Knowles
- Re: Message-ID draft, comments please Graham Klyne