Re: [Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

Sandy Ginoza <sginoza@amsl.com> Wed, 10 August 2022 21:45 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\))
From: Sandy Ginoza <sginoza@amsl.com>
In-Reply-To: <227a71bb-29af-38fa-8f0d-a6b3ed8ac366@nostrum.com>
Date: Wed, 10 Aug 2022 14:45:33 -0700
Cc: "xml-sg-cmt@ietf.org" <xml-sg-cmt@ietf.org>, Jean Mahoney <jmahoney@amsl.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <9A0407B8-88A4-4938-B577-9D833B47A7F2@amsl.com>
References: <ietf-tools/bibxml-service/issues/261/1210634090@github.com> <57596c04-18d8-1628-0967-96db9f3e32bb@amsl.com> <46931C92-61FD-4F1A-A899-F30F3342EE75@amsl.com> <227a71bb-29af-38fa-8f0d-a6b3ed8ac366@nostrum.com>
To: Robert Sparks <rjsparks@nostrum.com>
X-Mailer: Apple Mail (2.3654.120.0.1.13)
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml-sg-cmt/rQfgxERxa4iOUudoOaCfJYS4GaE>
Subject: Re: [Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

Great - thanks!

> On Aug 10, 2022, at 12:06 PM, Robert Sparks <rjsparks@nostrum.com> wrote:
> 
> 
> On 8/10/22 1:43 PM, Sandy Ginoza wrote:
>> Hi CMT,
>> 
>> We previously talked about possibly a) allowing non-ASCII everywhere and b) normalizing the XML.  Based on our discussion yesterday, I believe these items may eventually be handled by the RSWG.  In the meantime, we intend to request that bib.ietf.org use ASCII equivalents in <refcontent>.  We believe this mostly impacts punctuation (e.g., em dash, smart quotes) at this time, so as Jean notes below, using ASCII equivalents would be consistent with current practice.
> This is how I plan to steer bib.ietf.org
>> 
>> Thanks,
>> Sandy
>> 
>>> On Aug 10, 2022, at 6:38 AM, Jean Mahoney <jmahoney@amsl.com> wrote:
>>> 
>>> Hi Sandy,
>>> 
>>> Please see below. The developers are suggesting either using <u> elements or replacing Unicode with ASCII equivalents to handle non-ASCII in reference elements that can't contain it (like <refcontent>). I lean toward using ASCII equivalents because that's what we've been doing internally, but let me know what you think.
>>> 
>>> Thanks!
>>> Jean
>>> 
>>> 
>>> -------- Forwarded Message --------
>>> Subject:	Re: [ietf-tools/bibxml-service] bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)
>>> Date:	Wed, 10 Aug 2022 05:54:26 -0700
>>> From:	Ronald Tse <notifications@github.com>
>>> Reply-To:	ietf-tools/bibxml-service <reply+ACENAES2O64SVD7FLBQVIV6BADOQFEVBNHHE66NMB4@reply.github.com>
>>> To:	ietf-tools/bibxml-service <bibxml-service@noreply.github.com>
>>> CC:	Jean Mahoney <jmahoney@amsl.com>, Mention <mention@noreply.github.com>
>>> 
>>> 
>>> Thank you @ajeanmahoney for raising this.
>>> 
>>> DOI CrossRef data allows Unicode, so in order to achieve the strict subset of XML elements, the BibXML Service will be required to "normalize" or "strip" non-compliant characters.
>>> 
>>> According to your link, these elements can contain Unicode directly:
>>> 
>>> 	• <author>, <contact>, <organization>, <street>, <city>, <region>, <country> and <email>
>>> However, <refcontent> does not support Unicode.
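
As an illustration of the check behind this issue, here is a minimal sketch that flags <refcontent> elements containing non-ASCII characters. It is not the BibXML Service's actual code: the function name find_non_ascii_refcontent and the sample reference are hypothetical, and only Python's standard xml.etree.ElementTree module is assumed.

```python
import xml.etree.ElementTree as ET


def find_non_ascii_refcontent(xml_text):
    """Return a list of (text, offending_chars) for each <refcontent>
    element whose text contains at least one non-ASCII character."""
    root = ET.fromstring(xml_text)
    findings = []
    # iter() walks the whole tree, including the root element itself.
    for elem in root.iter("refcontent"):
        text = "".join(elem.itertext())
        bad = sorted({ch for ch in text if ord(ch) > 127})
        if bad:
            findings.append((text, bad))
    return findings


# Hypothetical sample: the em dash (U+2014) is written as an escape.
sample = (
    '<reference anchor="EXAMPLE">'
    "<front><title>Example</title></front>"
    "<refcontent>Journal of Examples \u2014 Vol. 1</refcontent>"
    "</reference>"
)

for text, bad in find_non_ascii_refcontent(sample):
    # Report the offending code points in U+XXXX form.
    print(text, [f"U+{ord(ch):04X}" for ch in bad])
```

A check like this only reports the problem; which fix to apply (wrapping or ASCII substitution) is the question discussed in the rest of the thread.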
>>> 
>>> There are two ways to handle this.
>>> 
>>> @ajeanmahoney any thoughts?
>>> 
>>> Wrapping with <u> element
>>> 
>>> According to the link:
>>> 
>>> Other than in the restricted elements, non-ASCII characters must be wrapped by the <u> element with the format attribute specifying how it is represented.
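
A rough sketch of what the wrapping option could look like, under stated assumptions: the helper name wrap_non_ascii is hypothetical, and the format value "lit-name-num" is used here on the assumption that it is an acceptable value for the <u> format attribute. The sketch does not check whether the schema permits <u> inside any particular element.

```python
import re
from xml.sax.saxutils import escape

# One or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def wrap_non_ascii(text, fmt="lit-name-num"):
    """Return XML-escaped text in which every run of non-ASCII
    characters is wrapped in a <u> element.

    fmt is passed through unchanged as the format attribute; the
    default is an assumption, not a value taken from the thread.
    """
    parts = []
    last = 0
    for match in NON_ASCII_RUN.finditer(text):
        # Escape the ASCII text before the run (&, <, >).
        parts.append(escape(text[last:match.start()]))
        parts.append(f'<u format="{fmt}">{match.group()}</u>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(wrap_non_ascii("Journal of Examples \u2014 Vol. 1"))
```

Wrapping keeps the original character in the XML, at the cost of more verbose markup than the ASCII substitution described next.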
>>> 
>>> Sanitization of symbols
>>> 
>>> We could implement a sanitization mapping for converting Unicode characters to their equivalents in ASCII, e.g.:
>>> 
>>> "—": "-"
>>> "’": "'"
>>> 
>>> ...
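
The mapping idea above can be sketched as follows. Only the first two entries (em dash and right single quote) come from the thread; the other entries, the names ASCII_EQUIVALENTS and sanitize, and the choice to report unmapped characters rather than drop them are assumptions made for illustration.

```python
# Unicode punctuation -> ASCII equivalent.
ASCII_EQUIVALENTS = {
    "\u2014": "-",   # em dash (from the thread)
    "\u2019": "'",   # right single quotation mark (from the thread)
    "\u2018": "'",   # left single quotation mark (assumed)
    "\u201C": '"',   # left double quotation mark (assumed)
    "\u201D": '"',   # right double quotation mark (assumed)
    "\u2013": "-",   # en dash (assumed)
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def sanitize(text):
    """Replace mapped characters and return (cleaned_text, leftover).

    leftover lists any non-ASCII characters that have no mapping, so
    they can be reviewed by a person instead of being silently lost.
    """
    cleaned = text.translate(_TABLE)
    leftover = sorted({ch for ch in cleaned if ord(ch) > 127})
    return cleaned, leftover


cleaned, leftover = sanitize("IEEE Trans. \u2014 \u201CNetworks\u201D")
print(cleaned)    # IEEE Trans. - "Networks"
print(leftover)   # []
```

Returning the unmapped characters separately is one possible design choice; it keeps the mapping small (punctuation only, as the thread suggests) while making gaps in it visible.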
>>> 