[Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

Sandy Ginoza <sginoza@amsl.com> Wed, 10 August 2022 18:43 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\))
From: Sandy Ginoza <sginoza@amsl.com>
In-Reply-To: <57596c04-18d8-1628-0967-96db9f3e32bb@amsl.com>
Date: Wed, 10 Aug 2022 11:43:22 -0700
Cc: Jean Mahoney <jmahoney@amsl.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <46931C92-61FD-4F1A-A899-F30F3342EE75@amsl.com>
References: <ietf-tools/bibxml-service/issues/261/1210634090@github.com> <57596c04-18d8-1628-0967-96db9f3e32bb@amsl.com>
To: "xml-sg-cmt@ietf.org" <xml-sg-cmt@ietf.org>
X-Mailer: Apple Mail (2.3654.120.0.1.13)
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml-sg-cmt/5oi27-ORMjU4NXSTyTWmwtLFTno>
Subject: [Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

Hi CMT,

We previously talked about possibly a) allowing non-ASCII everywhere and b) normalizing the XML.  Based on our discussion yesterday, I believe these items may eventually be handled by the RSWG.  In the meantime, we intend to request that bib.ietf.org use ASCII equivalents in <refcontent>.  We believe this mostly impacts punctuation (e.g., em dash, smart quotes) at this time, so as Jean notes below, using ASCII equivalents would be consistent with current practice.  

Thanks,
Sandy 

> On Aug 10, 2022, at 6:38 AM, Jean Mahoney <jmahoney@amsl.com> wrote:
> 
> Hi Sandy,
> 
> Please see below. The developers suggest either using <u> elements or replacing Unicode with ASCII equivalents to handle non-ASCII characters in reference elements that can't contain them (like <refcontent>). I lean toward using ASCII equivalents because that's what we've been doing internally, but let me know what you think.
> 
> Thanks!
> Jean
> 
> 
> -------- Forwarded Message --------
> Subject:	Re: [ietf-tools/bibxml-service] bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)
> Date:	Wed, 10 Aug 2022 05:54:26 -0700
> From:	Ronald Tse <notifications@github.com>
> Reply-To:	ietf-tools/bibxml-service <reply+ACENAES2O64SVD7FLBQVIV6BADOQFEVBNHHE66NMB4@reply.github.com>
> To:	ietf-tools/bibxml-service <bibxml-service@noreply.github.com>
> CC:	Jean Mahoney <jmahoney@amsl.com>, Mention <mention@noreply.github.com>
> 
> 
> Thank you @ajeanmahoney for raising this.
> 
> DOI CrossRef data allows Unicode, so in order to achieve the strict subset of XML elements, the BibXML Service will be required to "normalize" or "strip" non-compliant characters.
> 
> According to your link, these elements can contain Unicode directly:
> 
> 	• <author>, <contact>, <organization>, <street>, <city>, <region>, <country> and <email>
> However, <refcontent> does not support Unicode.
> 
> There are two ways to handle this.
> 
> @ajeanmahoney any thoughts?
> 
> Wrapping with <u> element
> 
> According to the link:
> 
> Other than in the restricted elements, non-ASCII characters must be wrapped by the <u> element with the format attribute specifying how it is represented.
> 
> Sanitization of symbols
> 
> We could implement a sanitization mapping for converting Unicode characters to their ASCII equivalents, e.g.:
> 
> "—": "-"
> "’": "'"
> 
> ...
> 
>
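
The sanitization mapping proposed in the forwarded message could be sketched roughly as follows. This is a minimal illustration, not the BibXML Service's actual code: only the em dash and right single quote entries come from the thread, and the other table entries, as well as the names ASCII_EQUIVALENTS, sanitize_refcontent, and remaining_non_ascii, are assumptions made for the example.

```python
# Hypothetical mapping table. Only the first two entries appear in the
# thread; the remaining entries are assumed for illustration.
ASCII_EQUIVALENTS = {
    "\u2014": "-",   # em dash (from the thread)
    "\u2019": "'",   # right single quotation mark (from the thread)
    "\u2018": "'",   # left single quotation mark (assumed)
    "\u201c": '"',   # left double quotation mark (assumed)
    "\u201d": '"',   # right double quotation mark (assumed)
    "\u2013": "-",   # en dash (assumed)
}

# Build the translation table once; str.translate applies it per character.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def sanitize_refcontent(text: str) -> str:
    """Replace mapped punctuation with ASCII; leave other characters as-is."""
    return text.translate(_TABLE)


def remaining_non_ascii(text: str) -> list[str]:
    """List the distinct non-ASCII characters still present, for review."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    raw = "Proc. IEEE \u2014 \u201cNetworks\u2019 future\u201d"
    clean = sanitize_refcontent(raw)
    print(clean)
    print(remaining_non_ascii(clean))
```

Characters outside the table are deliberately left untouched rather than stripped, so anything the mapping does not cover (for example, accented letters) can be flagged for a human decision instead of being silently lost.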