Re: [apps-discuss] Fun with URLs and regex

Sam Ruby <rubys@intertwingly.net> Wed, 07 January 2015 22:45 UTC

On 01/07/2015 04:35 PM, Mark Nottingham wrote:
> I’ve updated my Python script that serves as a translation of ABNF for URIs into regex.
>
> https://gist.github.com/mnot/138549
>
> It now validates the following URI schemes according to their respective specifications:
>    - http
>    - https
>    - file
>    - data
>    - gopher
>    - ws
>    - wss
>    - mailto

My test data:

http://intertwingly.net/stories/2014/10/05/urltestdata.json

A program to test each input:

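import json
import re

import uri_validate  # Mark's script from the gist above

# Load the test cases.
with open('urltestdata.json') as f:
    tests = json.load(f)

valid = {}
for test in tests:
    instr = test['input']
    valid[instr] = False
    # First pass: the generic URI-reference grammar.
    if re.match("^%s$" % uri_validate.URI_reference, instr, re.VERBOSE):
        try:
            # Second pass: a scheme-specific pattern (e.g. http_URI), if defined.
            scheme_validator = "%s_URI" % instr.split(":", 1)[0].lower()
            validator = getattr(uri_validate, scheme_validator)
            # Anchor with "$"; under re.VERBOSE a trailing "#" starts a comment.
            if re.match("^%s$" % validator, instr, re.VERBOSE):
                valid[instr] = True
        except AttributeError:
            # No scheme-specific pattern: the generic match is enough.
            valid[instr] = True

print(json.dumps(valid, indent=2))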
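
For anyone who wants to try the two-pass idea without fetching the gist, here is a minimal self-contained sketch of the same dispatch. The patterns in TOY_PATTERNS and the is_valid helper are hypothetical stand-ins that I made up for illustration; they are far looser than the ABNF-derived ones in uri_validate and should not be read as what that script actually checks.

```python
import re

# Hypothetical toy patterns, NOT the ones from uri_validate.
TOY_PATTERNS = {
    "URI_reference": r"\S*",
    "http_URI": r"http://[^\s/]+(?:/\S*)?",
    "mailto_URI": r"mailto:[^\s@]+@[^\s@]+",
}


def is_valid(instr, patterns=TOY_PATTERNS):
    """Generic check first, then a scheme-specific check when one exists."""
    if not re.match("^(?:%s)$" % patterns["URI_reference"], instr, re.VERBOSE):
        return False
    # Everything before the first ":" is treated as the scheme name.
    scheme = instr.split(":", 1)[0].lower()
    specific = patterns.get("%s_URI" % scheme)
    if specific is None:
        # No scheme-specific pattern: the generic match is enough.
        return True
    return re.match("^(?:%s)$" % specific, instr, re.VERBOSE) is not None


if __name__ == "__main__":
    for candidate in ["http://example.com/a", "http:///nohost", "foo:bar"]:
        print(candidate, is_valid(candidate))
```

The dictionary lookup plays the role of getattr() on the module: a missing key means "no scheme-specific rule", so the input is accepted on the generic match alone.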