[xml2rfc] Transformations of non-ASCII characters
carl at media.org (Carl Malamud) Tue, 14 February 2006 20:38 UTC
From: "carl at media.org"
Date: Tue, 14 Feb 2006 20:38:33 +0000
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <200602150434.k1F4YWML015409@bulk.resource.org>
X-Date: Tue Feb 14 20:38:33 2006
John -

Is there an appropriate tr of U+00F8 into ASCII that
is non-language dependent? (E.g., would "o" make more sense)?

The code does (or at least used to the last time I read it)
a simple 'if you see character "x" substitute "y"' algorithm
and doesn't take into account the intended language.

Carl

> Hi.
>
> I just tried pushing a proto-Internet Draft through the current
> online version of XML2RFC. It is a discussion of
> internationalization issues and contains some non-ASCII
> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). My hope was
> that I could produce HTML and, ultimately, PDF versions with the
> right characters in them. I expected complaints from the
> processor for the text version, which would be a fine solution.
>
> Instead, these characters are apparently quietly (no errors,
> warnings, or comments) converted in the text version to the
> sequence "oe". That is a problem because, while U+00F6 can
> always reasonably be converted to "oe" if the language being
> used is German, the conversion is not appropriate for many of
> the other languages in which that character might appear. And
> while I'm not an expert on the languages that use U+00F8, my
> impression is that conversion from it to "oe" is almost never
> appropriate.
>
> Now that I know the problem, I can go back and hand-patch the
> ASCII text version, but it seems to me that this is, however
> well-intentioned, a trap for the unwary.
>
> john
>
> _______________________________________________
> xml2rfc mailing list
> xml2rfc@lists.xml.resource.org
> http://lists.xml.resource.org/mailman/listinfo/xml2rfc
>

>From lars.eggert at netlab.nec.de Wed Feb 15 09:57:48 2006
From: lars.eggert at netlab.nec.de (Lars Eggert)
Date: Wed Feb 15 00:58:12 2006
Subject: [xml2rfc] only first author in bibliography?
In-Reply-To: <ed6d469d0602141329t1f45f059v857d2475918f3d0d@mail.gmail.com>
References: <43E89784.8020009@dial.pipex.com>
 <537D4B48-8C7E-455A-8197-9CAC88CD8493@dbc.mtview.ca.us>
 <43E8B2CE.6000806@dial.pipex.com>
 <E56ECC57-3D17-471A-BEB4-37BF7FE19493@dbc.mtview.ca.us>
 <8751BDF8-3F77-4297-8D14-0F06381A860E@netlab.nec.de>
 <8DF288D3-60B4-4A79-9F1C-3CA5BC01CB6F@netlab.nec.de>
 <ed6d469d0602141329t1f45f059v857d2475918f3d0d@mail.gmail.com>
Message-ID: <BDBB103D-449A-41ED-9B1A-30394CC4C452@netlab.nec.de>

Hi,

On Feb 14, 2006, at 22:29, Bill Fenner wrote:
> On 2/14/06, Lars Eggert <lars.eggert@netlab.nec.de> wrote:
>> Some further checking has revealed that this is only the case for
>> some references. Still, it'd be good to have full authorship
>> information for all refs.

Sure: the XML for draft-ietf-hip-base-04 has only Bob Moskowitz as
author, while the draft has four authors.

> Is the amount of information in the version
> of the ref at http://rtg.ietf.org/~fenner/ietf/xml/bibxml3/ any
> different?

No.

> In general it's the secretariat's data (e.g., 1id-abstracts.txt) that
> is missing the desired info.

True in this case, too. 1id-abstracts.txt only lists Bob.

This seems to be a pretty common problem. A quick look at some other
HIP IDs also shows missing authors in 1id-abstracts.txt:

draft-ietf-hip-mm-02
draft-irtf-hiprg-nat-01
draft-ietf-hip-registration-01

(CC'ing the secretariat for this reason.)

Lars
--
Lars Eggert NEC Network Laboratories

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3686 bytes
Desc: not available
Url : http://drakken.dbc.mtview.ca.us/pipermail/xml2rfc/attachments/20060215/3229d5c2/smime.bin

>From henrik at levkowetz.com Wed Feb 15 10:29:30 2006
From: henrik at levkowetz.com (Henrik Levkowetz)
Date: Wed Feb 15 01:29:46 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
References: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <43F2F47A.2010307@levkowetz.com>

on 2006-02-15 01:21 John C Klensin said the following:
> Hi.
>
> I just tried pushing a proto-Internet Draft through the current
> online version of XML2RFC. It is a discussion of
> internationalization issues and contains some non-ASCII
> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). My hope was
> that I could produce HTML and, ultimately, PDF versions with the
> right characters in them. I expected complaints from the
> processor for the text version, which would be a fine solution.
>
> Instead, these characters are apparently quietly (no errors,
> warnings, or comments) converted in the text version to the
> sequence "oe". That is a problem because, while U+00F6 can
> always reasonably be converted to "oe" if the language being
> used is German, the conversion is not appropriate for many of
> the other languages in which that character might appear. And
> while I'm not an expert on the languages that use U+00F8, my
> impression is that conversion from it to "oe" is almost never
> appropriate.

As a native speaker of Norwegian, which uses o with stroke, and
Swedish, which uses o with diaeresis, I have to agree.

In Norwegian you can use "oe" as an oddball fallback mode if you're
stuck with a typewriter with only ASCII symbols, but I'd only expect
that from a nerd, not as a general occurrence. Most people would
prefer to backspace and add a slash over the o.
In Swedish, most people seem to prefer simply dropping the diaeresis
if there is absolutely no way of getting the proper character.

In neither language is the 'oe' form accepted in the same way it is
in German today.

> Now that I know the problem, I can go back and hand-patch the
> ASCII text version, but it seems to me that this is, however
> well-intentioned, a trap for the unwary.

I believe I agree.

Henrik

> john

>From henrik at levkowetz.com Wed Feb 15 10:36:03 2006
From: henrik at levkowetz.com (Henrik Levkowetz)
Date: Wed Feb 15 01:36:20 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <200602150434.k1F4YWML015409@bulk.resource.org>
References: <200602150434.k1F4YWML015409@bulk.resource.org>
Message-ID: <43F2F603.6080302@levkowetz.com>

on 2006-02-15 05:34 Carl Malamud said the following:
> John -
>
> Is there an appropriate tr of U+00F8 into ASCII that
> is non-language dependent? (E.g., would "o" make more sense)?

I'd say no, based on knowledge of Norwegian, Swedish and German.
"oe" is pretty clearly right for German, somewhat doubtful for
Norwegian and more so for Swedish.

Henrik

> The code does (or at least used to the last time I read it)
> a simple 'if you see character "x" substitute "y"' algorithm
> and doesn't take into account the intended language.
>
>
> Carl
>
>> Hi.
>>
>> I just tried pushing a proto-Internet Draft through the current
>> online version of XML2RFC. It is a discussion of
>> internationalization issues and contains some non-ASCII
>> characters, notably U+00F8 (LATIN SMALL LETTER O WITH STROKE)
>> and U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). My hope was
>> that I could produce HTML and, ultimately, PDF versions with the
>> right characters in them. I expected complaints from the
>> processor for the text version, which would be a fine solution.
>>
>> Instead, these characters are apparently quietly (no errors,
>> warnings, or comments) converted in the text version to the
>> sequence "oe".
>> That is a problem because, while U+00F6 can
>> always reasonably be converted to "oe" if the language being
>> used is German, the conversion is not appropriate for many of
>> the other languages in which that character might appear. And
>> while I'm not an expert on the languages that use U+00F8, my
>> impression is that conversion from it to "oe" is almost never
>> appropriate.
>>
>> Now that I know the problem, I can go back and hand-patch the
>> ASCII text version, but it seems to me that this is, however
>> well-intentioned, a trap for the unwary.
>>
>> john
>>
>> _______________________________________________
>> xml2rfc mailing list
>> xml2rfc@lists.xml.resource.org
>> http://lists.xml.resource.org/mailman/listinfo/xml2rfc
>>
> _______________________________________________
> xml2rfc mailing list
> xml2rfc@lists.xml.resource.org
> http://lists.xml.resource.org/mailman/listinfo/xml2rfc
>

>From julian.reschke at gmx.de Wed Feb 15 10:56:21 2006
From: julian.reschke at gmx.de (Julian Reschke)
Date: Wed Feb 15 01:59:07 2006
Subject: XML parsing, was: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <5F92589D8AF5243E5399447D@p3.JCK.COM>
References: <5F92589D8AF5243E5399447D@p3.JCK.COM>
Message-ID: <43F2FAC5.5030708@gmx.de>

Hi,

slightly related: the document editor of a working group I'm active in
just managed to submit a document that doesn't even parse in a
compliant XML parser, yet was accepted by the online version of
xml2rfc.

What happened was that a "u umlaut" ("ü") was added in the source
file, but the encoding wasn't properly declared, so that a conforming
XML parser rejects the file based on character encoding problems
already. xml2rfc doesn't use a conforming parser, but then silently
translated the umlaut to "ue", which appeared like "just the right
thing" to the author, thus the problem went undetected.
If at any point in time we want rfc2629 to become an input format for
the RFC Editor, we *really* need to be sure that documents accepted by
xml2rfc at a minimum pass an XML wellformedness test. If this means
adding another pass running the document through the system's XML
parser just for checking purposes, so be it.

Best regards, Julian

>From carl at media.org Wed Feb 15 02:01:51 2006
From: carl at media.org (Carl Malamud)
Date: Wed Feb 15 02:02:41 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <43F2F603.6080302@levkowetz.com>
Message-ID: <200602151001.k1FA1p0V017433@bulk.resource.org>

> on 2006-02-15 05:34 Carl Malamud said the following:
> > John -
> >
> > Is there an appropriate tr of U+00F8 into ASCII that
> > is non-language dependent? (E.g., would "o" make more sense)?
>
> I'd say no, based on knowledge of Norwegian, Swedish and German.
> "oe" is pretty clearly right for German, somewhat doubtful for
> Norwegian and more so for Swedish.
>
>
> Henrik

Hmmm ... I suspect you're not going to see a win on this one. The
mapping effort to do every language to ASCII would be fairly strenuous
so you'll probably need to pick one mapping unless there's a major
rethink of xml2rfc on internationalization (and the attendant
intellectual heavy lifting by the office of the rfc editor on how this
should be done).

Carl

>From swb at employees.org Wed Feb 15 09:12:14 2006
From: swb at employees.org (Scott W Brim)
Date: Wed Feb 15 06:12:43 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <43F2F603.6080302@levkowetz.com>
References: <200602150434.k1F4YWML015409@bulk.resource.org> <43F2F603.6080302@levkowetz.com>
Message-ID: <43F336BE.3040205@employees.org>

On 02/15/2006 04:36 AM, Henrik Levkowetz allegedly wrote:
> on 2006-02-15 05:34 Carl Malamud said the following:
>> John -
>>
>> Is there an appropriate tr of U+00F8 into ASCII that
>> is non-language dependent? (E.g., would "o" make more sense)?
>
> I'd say no, based on knowledge of Norwegian, Swedish and German.
> "oe" is pretty clearly right for German, somewhat doubtful for
> Norwegian and more so for Swedish.
>
>
> Henrik

I was surprised the other day in the Olympic biathlon coverage when
they showed Bjoerndallen instead of Bjørndallen. Apparently some
community does have a precedent for converting U+00f8 to "oe".

>From john+xml at jck.com Wed Feb 15 09:45:58 2006
From: john+xml at jck.com (John C Klensin)
Date: Wed Feb 15 06:46:05 2006
Subject: [xml2rfc] Transformations of non-ASCII characters
In-Reply-To: <200602150434.k1F4YWML015409@bulk.resource.org>
References: <200602150434.k1F4YWML015409@bulk.resource.org>
Message-ID: <800C472850A5514C00F35539@p3.JCK.COM>

--On Tuesday, 14 February, 2006 20:34 -0800 Carl Malamud <carl@media.org> wrote:

> John -
>
> Is there an appropriate tr of U+00F8 into ASCII that
> is non-language dependent? (E.g., would "o" make more sense)?

Short answer: nope, it is language (and local convention)-dependent
and one can't get this right in anything resembling the general case.

Henrik's note covers just about all I could usefully say about the
specific example but, in the general case, it does not make much more
sense to map U+00F8 into "o" than it would to map, e.g., U+03A1 into
U+0050 (after all, they look pretty much alike) or, if I recall,
U+30A9 into "o" (maybe similar sounds). There is no way to win at that
game; non-ASCII characters simply have to be treated as
exception/error characters when going to RFC text.

> The code does (or at least used to the last time I read it)
> a simple 'if you see character "x" substitute "y"' algorithm
> and doesn't take into account the intended language.

Since we don't have a "language" directive, that is what I would have
expected. But, since a language directive would get us into even more
trouble, IMO, I think this needs to be viewed as hopeless and the
well-intentioned patches taken out.
It _might_ be helpful to permit, via a directive, the UTF-8 that goes
in to just come out (as UTF-8 Text, rather than ASCII text) but, for
the current state of the I-D and RFC cases, the presence of non-ASCII
characters should almost certainly lead to warnings (at least), not
fix-ups.

best,
john
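
Two suggestions from the messages above (warn on non-ASCII characters instead of substituting them, and run the source through a conforming XML parser before accepting it) could be sketched roughly as follows. This is an illustrative sketch only, not xml2rfc code: the function names `find_non_ascii` and `is_well_formed` are hypothetical, and Python's standard-library `xml.etree.ElementTree` parser is assumed here as a stand-in for "the system's XML parser".

```python
import unicodedata
import xml.etree.ElementTree as ET


def find_non_ascii(text):
    """Return (line, column, code point, name) for each non-ASCII character.

    Hypothetical helper: it only reports characters, it never rewrites them.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                findings.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return findings


def is_well_formed(xml_bytes):
    """Return True if the standard-library XML parser accepts the raw bytes.

    Hypothetical helper: a stand-in for a separate well-formedness pass.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    sample = "o with stroke: \u00f8\no with diaeresis: \u00f6"
    for line_no, col_no, code, name in find_non_ascii(sample):
        print(f"warning: line {line_no}, column {col_no}: {code} {name}")
```

A caller could emit one warning per tuple returned by `find_non_ascii` and decline to produce ASCII text output when `is_well_formed` returns False, leaving any transliteration decision to the author.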
- [xml2rfc] Transformations of non-ASCII characters John C Klensin
- [xml2rfc] Transformations of non-ASCII characters Carl Malamud
- [xml2rfc] Transformations of non-ASCII characters Henrik Levkowetz
- [xml2rfc] Transformations of non-ASCII characters Clive D.W. Feather
- [xml2rfc] Re: XML parsing, was: Transformations o… Julian Reschke
- [xml2rfc] Transformations of non-ASCII characters Dave Cridland
- [xml2rfc] Re: Transformations of non-ASCII charac… Stephane Bortzmeyer
- [xml2rfc] Re: Transformations of non-ASCII charac… Carl Malamud
- [xml2rfc] Re: Transformations of non-ASCII charac… Julian Reschke
- [xml2rfc] Re: Transformations of non-ASCII charac… Frank Ellermann
- [xml2rfc] Re: Transformations of non-ASCII charac… Julian Reschke
- [xml2rfc] Re: Transformations of non-ASCII charac… Frank Ellermann