Re: [apps-discuss] Fun with URLs and regex

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Thu, 08 January 2015 03:01 UTC

Message-ID: <54ADF2C8.8010604@it.aoyama.ac.jp>
Date: Thu, 08 Jan 2015 12:00:24 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Bjoern Hoehrmann <derhoermi@gmx.net>, Mark Nottingham <mnot@mnot.net>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <vperaa1clvfrj9hajpjhl7h3senipqdam6@hive.bjoern.hoehrmann.de>
In-Reply-To: <vperaa1clvfrj9hajpjhl7h3senipqdam6@hive.bjoern.hoehrmann.de>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Archived-At: http://mailarchive.ietf.org/arch/msg/apps-discuss/dnSwP91fV1QzS2kIRNu5vP326-o
Cc: IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

On 2015/01/08 08:23, Bjoern Hoehrmann wrote:
> * Mark Nottingham wrote:
>> I’ve updated my Python script that serves as a translation of ABNF for
>> URIs into regex.
>
> I note that there are a number of automated tools for this; I have not
> tried any of them recently though, but perhaps others can suggest one.
>
>> https://gist.github.com/mnot/138549
>>
>> It now validates the following URI schemes according to their respective
>> specifications:
>>   - http
>>   - https
>>   - file
>>   - data
>>   - gopher
>>   - ws
>>   - wss
>>   - mailto
>>
>> I didn’t finish mailto or data, because they allow quoted-string inside
>> of URLs, and that makes my head hurt.

It'd probably make my head hurt, too, but I'll try to do it 
(automatically or manually) to help verify or update 
http://tools.ietf.org/html/rfc6068.
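
For readers who have not seen the approach, a minimal sketch of translating
ABNF into regex by composing named fragments might look like the following.
It covers only a few RFC 3986 rules (unreserved, pct-encoded, sub-delims,
pchar, segment, path-abempty). The variable names are illustrative
assumptions, and this is not taken from the gist above.

```python
import re

# Each RFC 3986 ABNF rule becomes a regex fragment; rules that reference
# other rules are built by string interpolation.
UNRESERVED = r"[A-Za-z0-9\-._~]"          # ALPHA / DIGIT / "-" / "." / "_" / "~"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"          # "%" HEXDIG HEXDIG
SUB_DELIMS = r"[!$&'()*+,;=]"             # sub-delims
PCHAR = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|[:@])"
SEGMENT = rf"{PCHAR}*"                    # segment = *pchar
PATH_ABEMPTY = rf"(?:/{SEGMENT})*"        # path-abempty = *( "/" segment )

# Anchor the final pattern so the whole input has to match the rule.
path_abempty = re.compile(rf"\A{PATH_ABEMPTY}\Z")


def is_path_abempty(text: str) -> bool:
    """Return True if text matches the path-abempty rule sketched above."""
    return path_abempty.match(text) is not None
```

Productions such as quoted-string add nested escaping on top of this, which
is where composing fragments by hand gets harder to keep correct.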

> The bigger problem might be that URI scheme grammars typically do not
> account for %xx-encoding. It should be fine to write `dAtA:;B%41se64,`,

By RFC 3986 at least, it is.

> so you cannot use literals like `;base64` directly, unless you apply
> some syntax-preserving pre-processing. RFC 6068 also re-writes the RFC
> 5322 productions it uses in prose, and considering how complex the RFC
> 5322 grammar is, it is probably not wise to attempt to do this manually.

Did you mean 'automatically'? I would agree in that case.
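
A rough sketch of the kind of syntax-preserving pre-processing mentioned
above: lowercase the scheme and decode %xx only where it encodes an
unreserved character (the normalizations described in RFC 3986, section
6.2.2), then match the literal. The function names are hypothetical, the
input is assumed to be an absolute URI containing a ":", and matching
`;base64` case-insensitively is an assumption that follows the example in
this thread.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def decode_unreserved(text: str) -> str:
    """Decode %xx only when it encodes an unreserved character.

    Other escapes are kept, with their hex digits uppercased.
    """
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize(uri: str) -> str:
    """Lowercase the scheme and decode unreserved escapes in the rest."""
    scheme, sep, rest = uri.partition(":")
    return scheme.lower() + sep + decode_unreserved(rest)


def is_base64_data_uri(uri: str) -> bool:
    """Check for the ';base64' marker after normalization.

    Case-insensitive matching of the literal is an assumption here.
    """
    pattern = r"data:[^,]*;base64,"
    return re.match(pattern, normalize(uri), re.IGNORECASE) is not None
```

With this, `normalize("dAtA:;B%41se64,")` yields `data:;BAse64,`, so a
scheme-specific pattern only has to deal with one spelling of each escape.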

Regards,   Martin.