Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)
John C Klensin <john-ietf@jck.com> Tue, 31 January 2023 01:43 UTC
Date: Mon, 30 Jan 2023 20:43:06 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Raghu Saxena <poiasdpoiasd@live.com>, ietf@ietf.org
Subject: Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)
Message-ID: <BB06B0A008D9EEDF65040D77@PSB>
In-Reply-To: <0ad6ad81-a9af-9be0-e2bf-b04e33bd0525@it.aoyama.ac.jp>
References: <MEYP282MB3564A385B6CECB0E9E92A630A3CE9@MEYP282MB3564.AUSP282.PROD.OUTLOOK.COM> <634ad97a-9081-1831-9c07-999a3c8e1bbf@gmx.de> <MEYP282MB3564CAEFF922DFEEEE32813DA3CE9@MEYP282MB3564.AUSP282.PROD.OUTLOOK.COM> <0830aa47-0dbb-5911-dfbe-26ca86f0b04a@it.aoyama.ac.jp> <FF6F03C450BBED62D3A23839@PSB> <0ad6ad81-a9af-9be0-e2bf-b04e33bd0525@it.aoyama.ac.jp>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/9w0Lxd-Po-9ywOOgQXijgw9NQkU>

Martin,

Yes, that was the point. Thanks for further clarifying.

   john


--On Tuesday, January 31, 2023 10:03 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> On 2023-01-30 17:04, John C Klensin wrote:
>> Let me add one small comment to Martin's:
>>
>> A generic HTTP library that allows character encodings other
>> than UTF-8 (i.e., does not enforce UTF-8) needs to be very
>> careful that it does not make guesses about what non-UTF-8
>> encodings might mean, e.g., attempt to translate them to UTF-8
>> or other forms. His comment about Windows-1251 is almost
>> certainly correct in this case.
>
> Assuming you know the Cyrillic alphabet, or have an
> explanatory document at hand, you can easily confirm that it
> IS correct if you go to the site. But you're right that it's
> only for this specific case.
>
>> However, assuming that, for a
>> URL whose domain-part is a subdomain of RU, the coding, if not
>> UTF-8, is necessarily Windows-1251 would be unreasonable and
>> dangerous: nothing in either 3986 or externally imposed rules
>> about what the RU TLD can register requires that non-ASCII
>> characters in URI tails be in Cyrillic and not Latin or, for
>> that matter, Mongolian, Arabic, or any other script.
>
> Yes of course it could be something else than Cyrillic. Even
> for Cyrillic, it could be KOI-8 or one of its variants, or it
> could be Windows-1251, or it could be ISO 8859-5.
>
> Regards,   Martin.
>
>>     john
>>
>>
>> --On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
>> <duerst@it.aoyama.ac.jp> wrote:
>>
>>> For the record, here's what I posted to the relevant github
>>> issue, for those who aren't subscribed to it:
>>>
>>> >>>>
>>> For a generic HTTP library, not enforcing http/https URLs to
>>> be UTF-8 is the right decision. But such a library should
>>> make it easy to use UTF-8 for URIs. And wherever possible,
>>> servers should use UTF-8 for their URIs if they contain
>>> non-ASCII characters, and should use a suitable baseXX
>>> encoding for binary data such as digital signatures and the
>>> like.
>>>
>>> Btw, contrary to what @brandon93 says at the start of this
>>> thread,
>>> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
>>> is not in Windows-1252 (Western Europe), but in Windows-1251
>>> (Russia). This of course makes sense because the site has a
>>> Russian domain name. The city is Таллин; in Latin
>>> letters this is Tallin. You can easily check this by using
>>> the URL in a browser. Using Windows-1252 makes no sense
>>> because there is no language that contains words like
>>> "Òàëëèí" (accented vowels only).
>>>
>>> This shows the advantage of using UTF-8. It avoids the mess
>>> of regional encodings, and because of its internal structure
>>> cannot easily be mistaken for some other encoding.
>>> >>>>
>>>
>>> Regards,   Martin.
>>>
>>> On 2023-01-25 19:54, Raghu Saxena wrote:
>>>>
>>>> On 1/25/23 17:47, Julian Reschke wrote:
>>>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>>>> To whomever it may concern,
>>>>>>
>>>>>> I am writing to seek clarification regarding the URI spec
>>>>>> (RFC3986) followed by HTTP, specifically about
>>>>>> percent-encoding arbitrary octets (which do not comprise a
>>>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>>>> Section 2.5
>>>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>>>> it says, quote:
>>>>>>
>>>>>>  > When a new URI scheme defines a component that
>>>>>>  > represents textual data consisting of characters
>>>>>>  > from the Universal Character Set [UCS],
>>>>>>  > the data should first be encoded as octets
>>>>>>  > according to the UTF-8 character encoding
>>>>>>  > [STD63]; then only those octets that do not
>>>>>>  > correspond to characters in the unreserved set should be
>>>>>>  > percent-encoded.
>>>>>>
>>>>>> This implies that URI schemes defined after RFC3986 must
>>>>>> follow UTF-8 encoding in their URIs. However, the original
>>>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>>>> have had to "abide" by the UTF-8 rule.
>>>>>>
>>>>>> In fact, many web servers allow and process GET requests
>>>>>> with percent-encoded octets, which they decode as raw
>>>>>> bytes and have the application level logic handle how to
>>>>>> process them.
>>>>>>
>>>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>>>> it? I would think not, since this would be a breaking
>>>>>> change. But some comments on github indicate that this is
>>>>>> as per the spec ()
>>>>>
>>>>> Pointer?
>>>>>
>>>> My apologies, the comment is here:
>>>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>>>>
>>>>
>>>>>> tl;dr - Is it compliant with the HTTP specification to
>>>>>> send arbitrary bytes, which do not represent a valid UTF-8
>>>>>> sequence, via percent-encoding in the URL query parameter?
>>>>>
>>>>> Yes.
>>>>>
>>>>> The http scheme was not re-defined by RFCs after RFC 2616
>>>>> (in fact, it was defined even before that).
>>>>>
>>>>> Best regards, Julian
>>>>>
>>>> Thanks for the clarification regarding schemes not being
>>>> re-defined. I will ask the library author to reconsider
>>>>
>>>> Regards,
>>>>
>>>> Raghu Saxena
>>>>
>>>> (P.S. Sorry for the personal reply prior to this - my first
>>>> time using mailing lists)
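
The encoding point discussed in the thread can be checked with a
short Python sketch. It assumes only the standard urllib.parse module
and Python's built-in codecs; the helper try_decode and the variable
names are hypothetical and do not come from any message in the thread.
The sketch percent-decodes the path segment of the kinopoisk.ru URL to
octets, then shows that the same octets are invalid as UTF-8 but
decode "successfully" under several single-byte encodings, which is
why a generic library should not guess.

```python
from typing import Optional
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Path segment copied from the URL quoted in the thread.
segment = "%D2%E0%EB%EB%E8%ED"

# Percent-decoding yields octets, not text.
raw = unquote_to_bytes(segment)


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Hypothetical helper: decode strictly, return None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Candidate encodings named in the thread (Python codec names).
candidates = ["utf-8", "windows-1251", "windows-1252", "koi8-r", "iso8859-5"]
results = {enc: try_decode(raw, enc) for enc in candidates}

for enc, text in results.items():
    print(f"{enc:13} -> {text!r}")

# Arbitrary octets survive a percent-encoding round trip unchanged,
# whether or not they form a valid UTF-8 sequence.
blob = b"\xff\xfe"
encoded = quote_from_bytes(blob)
assert unquote_to_bytes(encoded) == blob
```

Only the UTF-8 attempt fails outright. Windows-1251 yields "Таллин"
and Windows-1252 yields "Òàëëèí"; the other single-byte codecs also
return strings, so a successful decode by itself says nothing about
which encoding the server intended.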