Re: [Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

Robert Sparks <rjsparks@nostrum.com> Wed, 10 August 2022 19:06 UTC

Message-ID: <227a71bb-29af-38fa-8f0d-a6b3ed8ac366@nostrum.com>
Date: Wed, 10 Aug 2022 14:06:25 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.11.0
Content-Language: en-US
To: Sandy Ginoza <sginoza@amsl.com>, "xml-sg-cmt@ietf.org" <xml-sg-cmt@ietf.org>
Cc: Jean Mahoney <jmahoney@amsl.com>
References: <ietf-tools/bibxml-service/issues/261/1210634090@github.com> <57596c04-18d8-1628-0967-96db9f3e32bb@amsl.com> <46931C92-61FD-4F1A-A899-F30F3342EE75@amsl.com>
From: Robert Sparks <rjsparks@nostrum.com>
In-Reply-To: <46931C92-61FD-4F1A-A899-F30F3342EE75@amsl.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml-sg-cmt/KgoWeAz0Hv6x4LUTaZDoQvhhVT0>
Subject: Re: [Xml-sg-cmt] XML citation library - bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)

On 8/10/22 1:43 PM, Sandy Ginoza wrote:
> Hi CMT,
>
> We previously talked about possibly a) allowing non-ASCII everywhere and b) normalizing the XML.  Based on our discussion yesterday, I believe these items may eventually be handled by the RSWG.  In the meantime, we intend to request that bib.ietf.org use ASCII equivalents in <refcontent>.  We believe this mostly impacts punctuation (e.g., em dash, smart quotes) at this time, so as Jean notes below, using ASCII equivalents would be consistent with current practice.
This is how I plan to steer bib.ietf.org.
>
> Thanks,
> Sandy
>
>> On Aug 10, 2022, at 6:38 AM, Jean Mahoney <jmahoney@amsl.com> wrote:
>>
>> Hi Sandy,
>>
>> Please see below. The developers are suggesting either using <u> elements or replacing Unicode with ASCII equivalents to handle non-ASCII in reference elements that can't contain it (like <refcontent>). I lean toward using ASCII equivalents because that's what we've been doing internally, but let me know what you think.
>>
>> Thanks!
>> Jean
>>
>>
>> -------- Forwarded Message --------
>> Subject:	Re: [ietf-tools/bibxml-service] bibxml7: non-ASCII found in XML elements that should not contain it (Issue #261)
>> Date:	Wed, 10 Aug 2022 05:54:26 -0700
>> From:	Ronald Tse <notifications@github.com>
>> Reply-To:	ietf-tools/bibxml-service <reply+ACENAES2O64SVD7FLBQVIV6BADOQFEVBNHHE66NMB4@reply.github.com>
>> To:	ietf-tools/bibxml-service <bibxml-service@noreply.github.com>
>> CC:	Jean Mahoney <jmahoney@amsl.com>, Mention <mention@noreply.github.com>
>>
>>
>> Thank you @ajeanmahoney for raising this.
>>
>> DOI CrossRef data allows Unicode, so to conform to the restrictions on which XML elements may contain non-ASCII characters, the BibXML Service will need to "normalize" or "strip" non-compliant characters.
>>
>> According to your link, these elements can contain Unicode directly:
>>
>> 	• <author>, <contact>, <organization>, <street>, <city>, <region>, <country> and <email>
>> However, <refcontent> does not support Unicode.
>>
>> There are two ways to handle this.
>>
>> @ajeanmahoney any thoughts?
>>
>> Wrapping with <u> element
>>
>> According to the link:
>>
>> Other than in the restricted elements, non-ASCII characters must be wrapped by the <u> element with the format attribute specifying how it is represented.
>>
>> Sanitization of symbols
>>
>> We could implement a sanitization mapping for converting Unicode characters to their equivalents in ASCII, e.g.:
>>
>> "—": "-"
>> "’": "'"
>>
>> ...
>>
>>
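
For illustration only, the "sanitization of symbols" option quoted above could look roughly like the Python sketch below. This is not the BibXML Service's code. The names ASCII_EQUIVALENTS, sanitize_refcontent and remaining_non_ascii are hypothetical, and only the em dash and right single quote entries come from the thread; the other entries are assumptions added to show the shape of the table.

```python
# Hypothetical sanitization table. Only the first two entries appear in the
# thread; the remaining entries are illustrative assumptions.
ASCII_EQUIVALENTS = {
    "\u2014": "-",   # em dash (from the thread)
    "\u2019": "'",   # right single quotation mark (from the thread)
    "\u2018": "'",   # left single quotation mark (assumed)
    "\u201c": '"',   # left double quotation mark (assumed)
    "\u201d": '"',   # right double quotation mark (assumed)
    "\u2013": "-",   # en dash (assumed)
}

# str.maketrans accepts a dict that maps single characters to replacement strings.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def sanitize_refcontent(text):
    """Replace every mapped character with its ASCII equivalent."""
    return text.translate(_TABLE)


def remaining_non_ascii(text):
    """Return the sorted, de-duplicated non-ASCII characters still in text."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    raw = "Proc. IEEE \u2014 \u201cNetworks\u2019 future\u201d"
    clean = sanitize_refcontent(raw)
    print(clean)
    # Characters without a mapping are left untouched, so report them
    # rather than dropping them silently.
    print(remaining_non_ascii(sanitize_refcontent("caf\u00e9 \u2014 x")))
```

A table like this only covers the characters it lists. Anything else (accented letters, for example) passes through unchanged, so a check such as remaining_non_ascii is one way to flag text that still needs a decision.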