Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)

John C Klensin <john-ietf@jck.com> Tue, 31 January 2023 01:43 UTC

Date: Mon, 30 Jan 2023 20:43:06 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Raghu Saxena <poiasdpoiasd@live.com>, ietf@ietf.org
Subject: Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)
Message-ID: <BB06B0A008D9EEDF65040D77@PSB>
In-Reply-To: <0ad6ad81-a9af-9be0-e2bf-b04e33bd0525@it.aoyama.ac.jp>
References: <MEYP282MB3564A385B6CECB0E9E92A630A3CE9@MEYP282MB3564.AUSP282.PROD.OUTLOOK.COM> <634ad97a-9081-1831-9c07-999a3c8e1bbf@gmx.de> <MEYP282MB3564CAEFF922DFEEEE32813DA3CE9@MEYP282MB3564.AUSP282.PROD.OUTLOOK.COM> <0830aa47-0dbb-5911-dfbe-26ca86f0b04a@it.aoyama.ac.jp> <FF6F03C450BBED62D3A23839@PSB> <0ad6ad81-a9af-9be0-e2bf-b04e33bd0525@it.aoyama.ac.jp>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/9w0Lxd-Po-9ywOOgQXijgw9NQkU>

Martin,

Yes, that was the point.  Thanks for further clarifying.

   john


--On Tuesday, January 31, 2023 10:03 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> On 2023-01-30 17:04, John C Klensin wrote:
>> Let me add one small comment to Martin's:
>> 
>> A generic HTTP library that allows character encodings other
>> than UTF-8 (i.e., does not enforce UTF-8) needs to be very
>> careful that it does not make guesses about what non-UTF-8
>> encodings might mean, e.g., attempt to translate them to UTF-8
>> or other forms.  His comment about Windows-1251 is almost
>> certainly correct in this case.
> 
> Assuming you know the Cyrillic alphabet, or have an
> explanatory document at hand, you can easily confirm that it
> IS correct if you go to the site. But you're right that it's
> only for this specific case.
> 
>> However, assuming that, for a
>> URL whose domain-part is a subdomain of RU, the coding, if not
>> UTF-8, is necessarily Windows-1251 would be unreasonable and
>> dangerous: nothing in either 3986 or externally imposed rules
>> about what the RU TLD can register requires that non-ASCII
>> characters in URI tails be in Cyrillic and not Latin or, for
>> that matter, Mongolian, Arabic, or any other script.
> 
> Yes, of course it could be something other than Cyrillic. Even
> for Cyrillic, it could be KOI-8 or one of its variants, or it
> could be Windows-1251, or it could be ISO 8859-5.
> 
> Regards,   Martin.
> 
> 
>>     john
>> 
>>   
>> --On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
>> <duerst@it.aoyama.ac.jp> wrote:
>> 
>>> For the record, here's what I posted to the relevant github
>>> issue, for those who aren't subscribed to it:
>>> 
>>>   >>>> 
>>> For a generic HTTP library, not enforcing http/https URLs to
>>> be UTF-8 is the right decision. But such a library should
>>> make it easy to use UTF-8 for URIs. And wherever possible,
>>> servers should use UTF-8 for their URIs if they contain
>>> non-ASCII characters, and should use a suitable baseXX
>>> encoding for binary data such as digital signatures and the
>>> like.
>>> 
>>> Btw, contrary to what @brandon93 says at the start of this
>>> thread,
>>> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
>>> is not in Windows-1252 (Western Europe), but in Windows-1251
>>> (Russia). This of course makes sense because the site has a
>>> Russian domain name. The city is Таллин, in Latin
>>> letters this is Tallin. You can easily check this by using
>>> the URL in a browser. Using Windows-1252 makes no sense
>>> because there is no language that contains words like
>>> "Òàëëèí" (accented vowels only).
>>> 
>>> This shows the advantage of using UTF-8. It avoids the mess
>>> of regional encodings, and because of its internal structure
>>> cannot easily be mistaken for some other encoding.
>>>   >>>> 
>>> 
>>> Regards,   Martin.
>>> 
>>> On 2023-01-25 19:54, Raghu Saxena wrote:
>>>> 
>>>> On 1/25/23 17:47, Julian Reschke wrote:
>>>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>>>> To whomever it may concern,
>>>>>> 
>>>>>> I am writing to seek clarification regarding the URI spec
>>>>>> (RFC3986) followed by HTTP, specifically about
>>>>>> percent-encoding arbitrary octets (which do not comprise a
>>>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>>>> Section 2.5
>>>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>>>> it says,  quote:
>>>>>> 
>>>>>>   > When a new URI scheme defines a component that
>>>>>>   > represents textual data consisting of characters from
>>>>>>   > the Universal Character Set [UCS], the data should first
>>>>>>   > be encoded as octets according to the UTF-8 character
>>>>>>   > encoding [STD63]; then only those octets that do not
>>>>>>   > correspond to characters in the unreserved set should be
>>>>>>   > percent-encoded.
>>>>>> 
>>>>>> This implies that URI schemes defined after RFC3986 must
>>>>>> follow UTF-8 encoding in their URIs. However, the original
>>>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>>>> have had to "abide" by the UTF-8 rule.
>>>>>> 
>>>>>> In fact, many web servers allow and process GET requests
>>>>>> with percent-encoded octets, which they decode as raw
>>>>>> bytes and have the application level logic handle how to
>>>>>> process them.
>>>>>> 
>>>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>>>> it? I would think not, since this would be a breaking
>>>>>> change. But some comments on github indicate that this is
>>>>>> as per the spec ()
>>>>> 
>>>>> Pointer?
>>>>> 
>>>> My apologies, the comment is here:
>>>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>>>> 
>>>> 
>>>>>> tl;dr - Is it compliant with the HTTP specification to
>>>>>> send arbitrary bytes, which do not represent a valid UTF-8
>>>>>> sequence, via percent-encoding in the URL query parameter?
>>>>> 
>>>>> Yes.
>>>>> 
>>>>> The http scheme was not re-defined by RFCs after RFC 2616
>>>>> (in fact, it was defined even before that).
>>>>> 
>>>>> Best regards, Julian
>>>>> 
>>>> Thanks for the clarification regarding schemes not being
>>>> re-defined. I will ask the library author to reconsider.
>>>> 
>>>> Regards,
>>>> 
>>>> Raghu Saxena
>>>> 
>>>> (P.S. Sorry for the personal reply prior to this - my first
>>>> time using mailing lists)
>>
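
The encoding point discussed in this thread can be checked with a few lines of Python. The following is only a minimal sketch, not anything the participants posted: it takes the percent-encoded path segment from the kinopoisk.ru URL quoted above, recovers the raw octets, and decodes them under several candidate encodings. The names PATH_SEGMENT, CANDIDATES, and is_valid_utf8 are made up for the illustration, and the list of candidate encodings is an assumption based on the ones named in the thread.

```python
from urllib.parse import quote, unquote_to_bytes

# Percent-encoded path segment taken from the URL quoted in the thread.
PATH_SEGMENT = "%D2%E0%EB%EB%E8%ED"

# Percent-decoding yields octets only; it says nothing about their charset.
raw = unquote_to_bytes(PATH_SEGMENT)


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Candidate single-byte encodings mentioned in the thread (illustrative list).
CANDIDATES = ["cp1251", "cp1252", "koi8_r", "iso8859_5"]

# Every candidate decodes without error, which is why guessing is risky:
# only one of the results is the text the site intended.
decoded = {name: raw.decode(name) for name in CANDIDATES}

# What the same city name looks like when RFC 3986 Section 2.5 is followed:
# encode as UTF-8 first, then percent-encode the non-unreserved octets.
utf8_form = quote("Таллин", safe="")

if __name__ == "__main__":
    print("valid UTF-8:", is_valid_utf8(raw))
    for name, text in decoded.items():
        print(f"{name:10s} {text}")
    print("UTF-8 form:", utf8_form)
```

Under these assumptions the octets are rejected as UTF-8, decode to "Таллин" as Windows-1251, and decode to "Òàëëèí" as Windows-1252. The other candidates also decode without error, which illustrates why a generic library should pass such octets through rather than guess at their meaning.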