Re: [apps-discuss] Fun with URLs and regex

Bjoern Hoehrmann <derhoermi@gmx.net> Wed, 07 January 2015 23:23 UTC

From: Bjoern Hoehrmann <derhoermi@gmx.net>
To: Mark Nottingham <mnot@mnot.net>
Date: Thu, 08 Jan 2015 00:23:45 +0100
Message-ID: <vperaa1clvfrj9hajpjhl7h3senipqdam6@hive.bjoern.hoehrmann.de>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net>
In-Reply-To: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net>
Archived-At: http://mailarchive.ietf.org/arch/msg/apps-discuss/pKr_ITkdbEJp09t5xYVlgK6mRNY
Cc: IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

* Mark Nottingham wrote:
>I’ve updated my Python script that serves as a translation of ABNF for 
>URIs into regex.

I note that there are a number of automated tools for this. I have not
tried any of them recently, but perhaps others can suggest one.

>https://gist.github.com/mnot/138549
>
>It now validates the following URI schemes according to their respective 
>specifications:
>  - http
>  - https
>  - file
>  - data
>  - gopher
>  - ws
>  - wss
>  - mailto
>
>I didn’t finish mailto or data, because they allow quoted-string inside 
>of URLs, and that makes my head hurt.

The bigger problem might be that URI scheme grammars typically do not
account for %xx-encoding. It should be fine to write `dAtA:;B%41se64,`,
so you cannot match literals like `;base64` directly unless you first
apply some syntax-preserving pre-processing. RFC 6068 also modifies, in
prose, the RFC 5322 productions it uses, and considering how complex the
RFC 5322 grammar is, it is probably not wise to attempt this manually.
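
To illustrate the kind of pre-processing meant here, the following is a
rough sketch only, not code from the gist. It lowercases the scheme and
decodes %xx escapes of unreserved characters (RFC 3986, section 2.3),
which should not change how a URI parses. The names `normalize` and
`is_base64_data_uri` are made up for this example, the regex is a
simplified stand-in rather than the RFC 2397 grammar, and matching
`;base64` case-insensitively is an assumption based on the example above.

```python
import re

# Unreserved characters per RFC 3986, section 2.3. Decoding an escape of
# one of these does not change the syntactic structure of the URI.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _decode_unreserved(match):
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    # Keep escapes of all other characters, uppercasing the hex digits.
    return "%" + match.group(1).upper()


def normalize(uri):
    """Hypothetical helper; assumes `uri` is absolute (has a scheme)."""
    scheme, sep, rest = uri.partition(":")
    return scheme.lower() + sep + PCT.sub(_decode_unreserved, rest)


# Simplified stand-in for a data: URI pattern, NOT the full grammar.
DATA_BASE64 = re.compile(r"data:[^,]*;base64,", re.IGNORECASE)


def is_base64_data_uri(uri):
    return DATA_BASE64.match(normalize(uri)) is not None
```

With this, `normalize("dAtA:;B%41se64,")` gives `data:;BAse64,`, which
the case-insensitive pattern then accepts. Escapes of reserved
characters such as `%2F` are deliberately left alone, since decoding
them could change the meaning of the URI.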
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
D-10243 Berlin · PGP Pub. KeyID: 0xA4357E78 · http://www.bjoernsworld.de
 Available for hire in Berlin (early 2015)  · http://www.websitedev.de/