[xml2rfc] Transformations of non-ASCII characters

carl at media.org (Carl Malamud) Tue, 14 February 2006 20:38 UTC

From: "carl at media.org"
Date: Tue, 14 Feb 2006 20:38:33 +0000
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <200602150434.k1F4YWML015409@bulk.resource.org>
X-Date: Tue Feb 14 20:38:33 2006

John -

Is there an appropriate tr of U+00F8 into ASCII that
is non-language dependent?  (E.g., would "o" make more sense)?

The code does (or at least used to the last time I read it)
a simple 'if you see character "x" substitute "y"' algorithm
and doesn't take into account the intended language.
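
A minimal Python sketch of that kind of table-driven substitution, for
illustration only. It is not taken from the xml2rfc source: the names
SUBSTITUTIONS and to_ascii are hypothetical, and the table holds only
the two mappings reported in the message quoted below.

```python
# Hypothetical per-character substitution table (an assumption, not the
# actual xml2rfc table). Each entry is applied regardless of language.
SUBSTITUTIONS = {
    "\u00f6": "oe",  # LATIN SMALL LETTER O WITH DIAERESIS
    "\u00f8": "oe",  # LATIN SMALL LETTER O WITH STROKE
}


def to_ascii(text: str) -> str:
    """Replace each character found in the table; pass the rest through."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)
```

Because the lookup is keyed on the character alone, a German and a
Swedish word containing U+00F6 get the same "oe" whether or not that
suits the language.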

Carl

> Hi.
> 
> I just tried pushing a proto-Internet Draft through the current
> online version of XML2RFC.   It is a discussion of
> internationalization issues and contains some non-ASCII
> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  My hope was
> that I could produce HTML and, ultimately, PDF versions with the
> right characters in them.  I expected complaints from the
> processor for the text version, which would be a fine solution.
> 
> Instead, these characters are apparently quietly (no errors,
> warnings, or comments) converted in the text version to the
> sequence "oe".  That is a problem because, while U+00F6 can
> always reasonably be converted to "oe" if the language being
> used is German, the conversion is not appropriate for many of
> the other languages in which that character might appear.  And
> while I'm not an expert on the languages that use U+00F8, my
> impression is that conversion from it to "oe" is almost never
> appropriate.
> 
> Now that I know the problem, I can go back and hand-patch the
> ASCII text version, but it seems to me that this is, however
> well-intentioned, a trap for the unwary.
> 
>      john
> 
> _______________________________________________
> xml2rfc mailing list
> xml2rfc@lists.xml.resource.org
> http://lists.xml.resource.org/mailman/listinfo/xml2rfc
> 
>From lars.eggert at netlab.nec.de  Wed Feb 15 09:57:48 2006
From: lars.eggert at netlab.nec.de (Lars Eggert)
Date: Wed Feb 15 00:58:12 2006
Subject: [xml2rfc] only first author in bibliography?
In-Reply-To: <ed6d469d0602141329t1f45f059v857d2475918f3d0d@mail.gmail.com>
References: <43E89784.8020009@dial.pipex.com>
	<537D4B48-8C7E-455A-8197-9CAC88CD8493@dbc.mtview.ca.us>
	<43E8B2CE.6000806@dial.pipex.com>
	<E56ECC57-3D17-471A-BEB4-37BF7FE19493@dbc.mtview.ca.us>
	<8751BDF8-3F77-4297-8D14-0F06381A860E@netlab.nec.de>
	<8DF288D3-60B4-4A79-9F1C-3CA5BC01CB6F@netlab.nec.de>
	<ed6d469d0602141329t1f45f059v857d2475918f3d0d@mail.gmail.com>
Message-ID: <BDBB103D-449A-41ED-9B1A-30394CC4C452@netlab.nec.de>

Hi,

On Feb 14, 2006, at 22:29, Bill Fenner wrote:
> On 2/14/06, Lars Eggert <lars.eggert@netlab.nec.de> wrote:
>> Some further checking has revealed that this is only the case for
>> some references. Still, it'd be good to have full authorship
>> information for all refs.

Sure: the XML for draft-ietf-hip-base-04 has only Bob Moskowitz as  
author, while the draft has four authors.

> Is the amount of information in the version
> of the ref at http://rtg.ietf.org/~fenner/ietf/xml/bibxml3/ any
> different?

No.

> In general it's the secretariat's data (e.g., 1id-abstracts.txt) that
> is missing the desired info.

True in this case, too. 1id-abstracts.txt only lists Bob.

This seems to be a pretty common problem. A quick look at some other  
HIP IDs also shows missing authors in 1id-abstracts.txt:

	draft-ietf-hip-mm-02
	draft-irtf-hiprg-nat-01
	draft-ietf-hip-registration-01

(CC'ing the secretariat for this reason.)

Lars
-- 
Lars Eggert                                     NEC Network Laboratories


>From henrik at levkowetz.com  Wed Feb 15 10:29:30 2006
From: henrik at levkowetz.com (Henrik Levkowetz)
Date: Wed Feb 15 01:29:46 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
References: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <43F2F47A.2010307@levkowetz.com>



on 2006-02-15 01:21 John C Klensin said the following:
> Hi.
> 
> I just tried pushing a proto-Internet Draft through the current
> online version of XML2RFC.   It is a discussion of
> internationalization issues and contains some non-ASCII
> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  My hope was
> that I could produce HTML and, ultimately, PDF versions with the
> right characters in them.  I expected complaints from the
> processor for the text version, which would be a fine solution.
> 
> Instead, these characters are apparently quietly (no errors,
> warnings, or comments) converted in the text version to the
> sequence "oe".  That is a problem because, while U+00F6 can
> always reasonably be converted to "oe" if the language being
> used is German, the conversion is not appropriate for many of
> the other languages in which that character might appear.  And
> while I'm not an expert on the languages that use U+00F8, my
> impression is that conversion from it to "oe" is almost never
> appropriate.

As a native speaker of Norwegian, which uses o with stroke, and
Swedish, which uses o with diaeresis, I have to agree.  In Norwegian
you can use "oe" as an oddball fallback mode if you're stuck with
a typewriter with only ASCII symbols, but I'd only expect that
from a nerd, not as a general occurrence.  Most people would
prefer to backspace and add a slash over the o.  In Swedish, most
people seem to prefer simply dropping the diaeresis if there is
absolutely no way of getting the proper character.  In neither
language is the 'oe' form accepted in the same way it is in
German today.

> Now that I know the problem, I can go back and hand-patch the
> ASCII text version, but it seems to me that this is, however
> well-intentioned, a trap for the unwary.

I believe I agree.


	Henrik



>      john
>From henrik at levkowetz.com  Wed Feb 15 10:36:03 2006
From: henrik at levkowetz.com (Henrik Levkowetz)
Date: Wed Feb 15 01:36:20 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <200602150434.k1F4YWML015409@bulk.resource.org>
References: <200602150434.k1F4YWML015409@bulk.resource.org>
Message-ID: <43F2F603.6080302@levkowetz.com>


on 2006-02-15 05:34 Carl Malamud said the following:
> John -
> 
> Is there an appropriate tr of U+00F8 into ASCII that
> is non-language dependent?  (E.g., would "o" make more sense)?

I'd say no, based on knowledge of Norwegian, Swedish and German.
"oe" is pretty clearly right for German, somewhat doubtful for
Norwegian and more so for Swedish.


	Henrik

> The code does (or at least used to the last time I read it)
> a simple 'if you see character "x" substitute "y"' algorithm
> and doesn't take into account the intended language.
> 
> 
> Carl
> 
>> Hi.
>> 
>> I just tried pushing a proto-Internet Draft through the current
>> online version of XML2RFC.   It is a discussion of
>> internationalization issues and contains some non-ASCII
>> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
>> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  My hope was
>> that I could produce HTML and, ultimately, PDF versions with the
>> right characters in them.  I expected complaints from the
>> processor for the text version, which would be a fine solution.
>> 
>> Instead, these characters are apparently quietly (no errors,
>> warnings, or comments) converted in the text version to the
>> sequence "oe".  That is a problem because, while U+00F6 can
>> always reasonably be converted to "oe" if the language being
>> used is German, the conversion is not appropriate for many of
>> the other languages in which that character might appear.  And
>> while I'm not an expert on the languages that use U+00F8, my
>> impression is that conversion from it to "oe" is almost never
>> appropriate.
>> 
>> Now that I know the problem, I can go back and hand-patch the
>> ASCII text version, but it seems to me that this is, however
>> well-intentioned, a trap for the unwary.
>> 
>>      john
>> 
>From julian.reschke at gmx.de  Wed Feb 15 10:56:21 2006
From: julian.reschke at gmx.de (Julian Reschke)
Date: Wed Feb 15 01:59:07 2006
Subject: XML parsing, was: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
References: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <43F2FAC5.5030708@gmx.de>

Hi,

slightly related:

the document editor of a working group I'm active in just managed to 
submit a document that doesn't even parse in a compliant XML parser, yet 
was accepted by the online version of xml2rfc.

What happened was that a "u umlaut" ("ü") was added in the source file,
but the encoding wasn't properly declared, so that a conforming XML 
parser rejects the file based on character encoding problems already.

xml2rfc doesn't use a conforming parser, but then silently translated 
the umlaut to "ue", which appeared like "just the right thing" to the 
author, thus the problem went undetected.

If, at any point in time, we want rfc2629 to become an input format for
the RFC Editor, we *really* need to be sure that documents accepted by
xml2rfc at a minimum pass an XML well-formedness test. If this means
adding another pass running the document through the system's XML parser
just for checking purposes, so be it.
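
As a rough sketch of what such an extra checking pass could look like
(an assumption for illustration, not part of xml2rfc), the following
uses the expat parser from the Python standard library. The function
name check_wellformed is hypothetical. It tests well-formedness only
and does no DTD validation.

```python
import xml.parsers.expat


def check_wellformed(data: bytes):
    """Return None if data is well-formed XML, else a short error string."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Second argument True marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return f"line {err.lineno}, column {err.offset}: {err}"
    return None
```

Fed raw bytes, a parser of this kind rejects a lone Latin-1 umlaut byte
in a file with no encoding declaration, because the input is then read
as UTF-8. The same bytes pass once the file declares
encoding="iso-8859-1".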

Best regards, Julian
>From carl at media.org  Wed Feb 15 02:01:51 2006
From: carl at media.org (Carl Malamud)
Date: Wed Feb 15 02:02:41 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <43F2F603.6080302@levkowetz.com>
Message-ID: <200602151001.k1FA1p0V017433@bulk.resource.org>

> on 2006-02-15 05:34 Carl Malamud said the following:
> > John -
> > 
> > Is there an appropriate tr of U+00F8 into ASCII that
> > is non-language dependent?  (E.g., would "o" make more sense)?
> 
> I'd say no, based on knowledge of Norwegian, Swedish and German.
> "oe" is pretty clearly right for German, somewhat doubtful for
> Norwegian and more so for Swedish.
> 
> 
> 	Henrik

Hmmm ... I suspect you're not going to see a win on this one.
The mapping effort to do every language to ASCII would be fairly
strenuous so you'll probably need to pick one mapping unless
there's a major rethink of xml2rfc on internationalization (and
the attendant intellectual heavy lifting by the office of the
rfc editor on how this should be done).

Carl
>From swb at employees.org  Wed Feb 15 09:12:14 2006
From: swb at employees.org (Scott W Brim)
Date: Wed Feb 15 06:12:43 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <43F2F603.6080302@levkowetz.com>
References: <200602150434.k1F4YWML015409@bulk.resource.org>
	<43F2F603.6080302@levkowetz.com>
Message-ID: <43F336BE.3040205@employees.org>

On 02/15/2006 04:36 AM, Henrik Levkowetz allegedly wrote:
> on 2006-02-15 05:34 Carl Malamud said the following:
>> John -
>>
>> Is there an appropriate tr of U+00F8 into ASCII that
>> is non-language dependent?  (E.g., would "o" make more sense)?
> 
> I'd say no, based on knowledge of Norwegian, Swedish and German.
> "oe" is pretty clearly right for German, somewhat doubtful for
> Norwegian and more so for Swedish.
> 
> 
> 	Henrik

I was surprised the other day in the Olympic biathlon coverage when they
showed Bjoerndallen instead of Bjørndallen.  Apparently some community
does have a precedent for converting U+00f8 to "oe".
>From john+xml at jck.com  Wed Feb 15 09:45:58 2006
From: john+xml at jck.com (John C Klensin)
Date: Wed Feb 15 06:46:05 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <200602150434.k1F4YWML015409@bulk.resource.org>
References: <200602150434.k1F4YWML015409@bulk.resource.org>
Message-ID: <800C472850A5514C00F35539@p3.JCK.COM>



--On Tuesday, 14 February, 2006 20:34 -0800 Carl Malamud
<carl@media.org> wrote:

> John -
> 
> Is there an appropriate tr of U+00F8 into ASCII that
> is non-language dependent?  (E.g., would "o" make more sense)?

Short answer: nope, it is language (and local
convention)-dependent and one can't get this right in anything
resembling the general case.  Henrik's note covers just about
all I could usefully say about the specific example but, in the
general case, it does not make much more sense to map U+00F8
into "o" than it would to map, e.g., U+03A1 into U+0050 (after
all, they look pretty much alike) or, if I recall, U+30A9 into
"o" (maybe similar sounds).   There is no way to win at that
game; non-ASCII characters simply have to be treated as
exception/error characters when going to RFC text.
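
A small Python sketch, offered only as an illustration and not as
anything proposed in this thread, shows one way a purely mechanical
"strip it down to the base letter" rule fails to treat these characters
uniformly. It uses the standard unicodedata module. The helper name
strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(ch: str) -> str:
    """Decompose ch (NFKD) and drop any combining marks that result."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# The four characters mentioned in this thread.
for ch in ("\u00f6", "\u00f8", "\u03a1", "\u30a9"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {strip_marks(ch)!r}")
```

U+00F6 decomposes into "o" plus a combining diaeresis, so it comes back
as "o". U+00F8 has no decomposition in the Unicode data, so it comes
back unchanged, as do U+03A1 and U+30A9. Unicode normalization
therefore offers no language-neutral ASCII fallback either.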
 
> The code does (or at least used to the last time I read it)
> a simple 'if you see character "x" substitute "y"' algorithm
> and doesn't take into account the intended language.

Since we don't have a "language" directive, that is what I would
have expected.  But, since a language directive would get us
into even more trouble, IMO, I think this needs to be viewed as
hopeless  and the well-intentioned patches taken out.   It
_might_ be helpful to permit, via a directive, the UTF-8 that
goes in to just come out (as UTF-8 Text, rather than ASCII text)
but, for the current state of the I-D and RFC cases, the
presence of non-ASCII characters should almost certainly lead to
warnings (at least), not fix-ups.
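
A minimal sketch of the "warn, don't fix up" behaviour, in Python and
under stated assumptions: the function name find_non_ascii and the
report format are hypothetical, and nothing here reflects how xml2rfc
is actually structured.

```python
import unicodedata


def find_non_ascii(text: str):
    """Return (line, column, code point, name) for each non-ASCII character.

    The text itself is left untouched; the caller decides whether to
    print warnings or to stop with an error.
    """
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((lineno, col, f"U+{ord(ch):04X}", name))
    return findings
```

A caller could print one warning per entry for the text output and
still pass the original characters through to HTML.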

best,
     john