[rfc-i] Draft: Representation of Unicode and UTF-8 characters

henrik at levkowetz.com (Henrik Levkowetz) Sat, 15 May 2004 02:16 UTC

From: "henrik at levkowetz.com"
Date: Sat, 15 May 2004 02:16:06 -0000
Subject: [rfc-i] Draft: Representation of Unicode and UTF-8 characters
In-Reply-To: <Pine.BSF.4.58.0405142140220.3042@measurement-factory.com>
References: <40A56D25.9080307@cs.columbia.edu> <Pine.BSF.4.58.0405142140220.3042@measurement-factory.com>
Message-ID: <20040515111515.72994d20.henrik@levkowetz.com>

Hi Alex,

Friday 14 May 2004, Alex Rousskov wrote:
> > - The literal < uses the Unicode rendition <U+003C> in those cases
> > where this can be misinterpreted, i.e., where the open angle bracket
> > is followed by U+ or a hex digit.
> 
> Could somebody with direct access to a large RFC repository please
> scan it for "<U\+" and "<[A-H0-9]" patterns? I wonder what is the
> probability that an unaware RFC uses the above convention [for
> something else]?

"<U+" occurs on 7 lines in RFCs, all in the same one (RFC 3454), and all
using it to indicate Unicode.

"<[A-H0-9][A-H0-9]( [A-H0-9][A-H0-9])*>" occurs on 444 lines in RFCs,
most of which use it to indicate carriage advance, form feed, or HTML
markup (e.g. <H1>).  All of those RFCs are numbered below 1000.

	Henrik




---- Rawer data -------------------------------------------------------

$ grep "<U+" rfc*.txt
rfc3454.txt:   "hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
rfc3454.txt:   example,"<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
rfc3454.txt:   Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
rfc3454.txt:   Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
rfc3454.txt:   definitions; Latin digits (<U+0030> through <U+0039>) are examples of
rfc3454.txt:   Note that requirement 3 prohibits strings such as <U+0627><U+0031>
rfc3454.txt:   ("aleph 1") but allows strings such as <U+0627><U+0031><U+0628>
 

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>"  rfc*.txt | wc -l
    444
 
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>"  rfc*.txt | grep "<CA>" | wc -l
    238
 
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>"  rfc*.txt | grep "<H[0-9]>" | wc -l
     60

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>"  rfc*.txt | grep "<FF>" | wc -l
     28
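
The per-token counts above could also be gathered in one pass. The
following is only a sketch, not the commands actually run for this
message: the function name tally_tokens is made up for illustration, and
it assumes a grep that supports the -o and -h options (GNU and BSD grep
do). Note that it counts occurrences of each token rather than matching
lines, so its numbers can differ from the line counts above.

```shell
# Hypothetical helper: tally every bracketed token matched by the same
# pattern, most frequent first. Counts occurrences, not lines.
tally_tokens() {
    # LC_ALL=C keeps the [A-H] range limited to ASCII upper-case letters.
    LC_ALL=C grep -oh -- '<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>' "$@" |
        sort |
        uniq -c |
        sort -rn
}
```

It would be invoked as "tally_tokens rfc*.txt" from the directory holding
the RFC text files.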

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>"  rfc*.txt | tail
rfc732.txt:    <17><IAC><SE>Telephone number:
rfc732.txt:    <IAC><SB><DET><MOVE CURSOR><32><4><IAC><SE>
rfc732.txt:    <24><IAC><SE>Social Security Number:
rfc732.txt:    <0><11><IAC><SE>                         [Establish a field that
rfc732.txt:    <IAC><SB><DET><MOVE CURSOR><32><5><IAC><SE>
rfc732.txt:  Intensity=1><0><29><IAC><SE>
rfc732.txt:    <IAC><GA>
rfc765.txt:            (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
rfc929.txt:                  4 Format effectors  <BS> <CR> <LF> <FF> <HT> <VT>
rfc959.txt:               (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
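
The tail output only shows the last few matching lines. To check the
highest RFC number among the matching files more directly, something
like the following might be used. It is a sketch under assumptions: the
function name highest_matching_rfc is invented here, and the files are
assumed to be named rfcNNNN.txt as in the listings above.

```shell
# Hypothetical helper: print the highest RFC number among the files that
# contain at least one match for the bracketed-token pattern.
highest_matching_rfc() {
    # -l lists each matching file name once; sed keeps only the number.
    LC_ALL=C grep -l -- '<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>' "$@" |
        sed 's/.*rfc\([0-9][0-9]*\)\.txt$/\1/' |
        sort -n |
        tail -n 1
}
```

Called as "highest_matching_rfc rfc*.txt", it prints a single number, or
nothing if no file matches.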