[rfc-i] Draft: Representation of Unicode and UTF-8 characters
henrik at levkowetz.com (Henrik Levkowetz) Sat, 15 May 2004 02:16 UTC
From: "henrik at levkowetz.com"
Date: Sat, 15 May 2004 02:16:06 -0000
Subject: [rfc-i] Draft: Representation of Unicode and UTF-8 characters
In-Reply-To: <Pine.BSF.4.58.0405142140220.3042@measurement-factory.com>
References: <40A56D25.9080307@cs.columbia.edu> <Pine.BSF.4.58.0405142140220.3042@measurement-factory.com>
Message-ID: <20040515111515.72994d20.henrik@levkowetz.com>
Hi Alex,

Friday 14 May 2004, Alex Rousskov wrote:
> > - The literal < uses the Unicode rendition <U+003C> in those cases
> >   where this can be misinterpreted, i.e., where the open angle bracket
> >   is followed by U+ or a hex digit.
>
> Could somebody with direct access to a large RFC repository please
> scan it for "<U\+" and "<[A-H0-9]" patterns? I wonder what is the
> probability that an unaware RFC uses the above convention [for
> something else]?

"<U+" occurs on 7 lines in RFCs, all in the same one (RFC 3454), and
all using it to indicate Unicode.

"<[A-H0-9][A-H0-9]( [A-H0-9][A-H0-9])*>" occurs on 444 lines in RFC's,
most of which use it to indicate carriage advance, form feed, or html
markup (e.g. <H1>). All those RFC's are numbered below 1000.

Henrik

---- Rawer data -------------------------------------------------------

$ grep "<U+" rfc*.txt
rfc3454.txt: "hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
rfc3454.txt: example,"<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
rfc3454.txt: Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
rfc3454.txt: Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
rfc3454.txt: definitions; Latin digits (<U+0030> through <U+0039>) are examples of
rfc3454.txt: Note that requirement 3 prohibits strings such as <U+0627><U+0031>
rfc3454.txt: ("aleph 1") but allows strings such as <U+0627><U+0031><U+0628>

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | wc -l
444

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<CA>" | wc -l
238

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<H[0-9]>" | wc -l
60

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<FF>" | wc -l
28

$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | tail
rfc732.txt: <17><IAC><SE>Telephone number:
rfc732.txt: <IAC><SB><DET><MOVE CURSOR><32><4><IAC><SE>
rfc732.txt: <24><IAC><SE>Social Security Number:
rfc732.txt: <0><11><IAC><SE> [Establish a field that
rfc732.txt: <IAC><SB><DET><MOVE CURSOR><32><5><IAC><SE>
rfc732.txt: Intensity=1><0><29><IAC><SE>
rfc732.txt: <IAC><GA>
rfc765.txt: (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
rfc929.txt: 4 Format effectors <BS> <CR> <LF> <FF> <HT> <VT>
rfc959.txt: (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
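
For readers who would rather tally the matches in one pass than chain grep calls, the scan in the message above could be approximated in Python roughly as follows. This is only a sketch, not what was run for the message: the names scan, UNICODE_RE and PAIR_RE are made up for illustration, and the character class [A-H0-9] is kept exactly as in the grep commands even though hex digits stop at F.

```python
import re
from collections import Counter

# The two grep patterns from the message, written as Python regexes.
# [A-H0-9] is copied as-is from the original commands (assumption: the
# same class is wanted here, although hex digits only run to F).
UNICODE_RE = re.compile(r"<U\+")
PAIR_RE = re.compile(r"<[A-H0-9][A-H0-9](?: [A-H0-9][A-H0-9])*>")


def scan(lines):
    """Count matching lines per pattern and tally the bracketed tokens."""
    unicode_lines = 0
    pair_lines = 0
    tokens = Counter()
    for line in lines:
        if UNICODE_RE.search(line):
            unicode_lines += 1
        found = PAIR_RE.findall(line)
        if found:
            pair_lines += 1
            tokens.update(found)
    return unicode_lines, pair_lines, tokens


# A few sample lines copied from the grep output in the message.
sample = [
    '"hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.',
    "<IAC><GA>",
    "(i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer",
    "4 Format effectors <BS> <CR> <LF> <FF> <HT> <VT>",
]
u, p, tokens = scan(sample)
print(u, p, tokens.most_common(3))
```

To run it over real files, any iterable of lines can be passed to scan(), for instance an open file object. The token tally plays the role of the follow-up `grep "<CA>"`, `grep "<H[0-9]>"` and `grep "<FF>"` counts, with one difference: it counts occurrences of each token, not lines containing it.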