Re: [rfc-i] looking for a volunteer to write a simple script

Joe Touch <touch@strayalpha.com> Sat, 13 July 2019 00:19 UTC

From: Joe Touch <touch@strayalpha.com>
In-Reply-To: <176910E2-56E1-4B3B-8510-797C3560E90A@strayalpha.com>
Date: Fri, 12 Jul 2019 17:19:27 -0700
Message-Id: <116BAD01-3331-4D9F-A3DA-D3B2C58B7ED0@strayalpha.com>
References: <62c8413d-c735-4ec3-8b22-eb0fa5356636@Spark> <38d0704f-348c-4ec0-9d94-340747960201@Spark> <e86b8894-4d7a-4c9d-3476-0221a94c9eb0@gmx.de> <13A89BE6-8654-49C4-9FBA-2F709EE0BA1B@rfc-editor.org> <0504f606252c476f66804e338fa460b4@strayalpha.com> <c23139a7261e58cbfc93ac18a3815bad@strayalpha.com> <01ADB89D-90AF-4672-A8B9-54F5B09E82D4@amsl.com> <176910E2-56E1-4B3B-8510-797C3560E90A@strayalpha.com>
To: Sandy Ginoza <sginoza@amsl.com>
Subject: Re: [rfc-i] looking for a volunteer to write a simple script
Cc: Julian Reschke <julian.reschke@gmx.de>, RFC Interest <rfc-interest@rfc-editor.org>, Heather Flanagan <rse@rfc-editor.org>

I looked at this a bit and getting it closer to 100% is tricky. There are a LOT of corner cases, including the possibility that the <bcp14> tags are outside other tags, e.g., bold, etc.

I don’t know if it’ll be worth the effort, given the need for post-checking anyway.

But I’ll take a look over the weekend and let you know if there’s a solution that gets closer without increasing false matches.

Joe

> On Jul 12, 2019, at 11:33 AM, Joe Touch <touch@strayalpha.com> wrote:
> 
> I can fix that in a few minutes...
> 
> On Jul 12, 2019, at 11:29 AM, Sandy Ginoza <sginoza@amsl.com> wrote:
> 
>> Thanks Joe!
>> 
>> I just tested this updated script and it seems to work well.  I am happy to see it catches things like “MAY” (within quotes), but does not tag items Carsten noted (e.g., “MARSHALL”).   
>> 
>> I note that it does double tag keywords if they’ve already been tagged, but that should be something we can check for before running the script (i.e., check whether the keywords have already been tagged).  
>> 
>> Thanks!
>> Sandy
>> 
>>> On Jul 12, 2019, at 11:02 AM, Joe Touch <touch@strayalpha.com> wrote:
>>> 
>>> Quick update:
>>> 
>>> perl -0777 -pe "s/(\b(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)\b)/<bcp14>\$1<\/bcp14>/g" INFILE.xml > OUTFILE.xml
>>> 
>>> 
>>> 
>>> Joe
>>> 
>>>  
>>> On 2019-07-12 10:43, Joe Touch wrote:
>>> 
>>>> This will do the trick:
>>>> 
>>>> 
>>>> 
>>>> perl -0777 -pe "s/(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)/<bcp14>\$1<\/bcp14>/g" INFILE.xml > OUTFILE.xml
>>>> 
>>>> (replace INFILE.xml and OUTFILE.xml with your filenames)
>>>> 
>>>> If you want it to edit in place (risky, but simpler if you work on a copy anyway), add -i and drop the output redirect, since the result is written back to INFILE.xml:
>>>> 
>>>> perl -0777 -i -pe "s/(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)/<bcp14>\$1<\/bcp14>/g" INFILE.xml
>>>> 
>>>>  
>>>> 
>>>> 
>>>> On 2019-07-12 10:26, Heather Flanagan wrote:
>>>> 
>>>>> On Jul 12, 2019, at 10:23 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
>>>>> 
>>>>>> On 12.07.2019 18:55, Heather Flanagan wrote:
>>>>>>> Hola a todos!
>>>>>>> 
>>>>>>> The RFC Editor has the need for a comparatively simple script that would
>>>>>>> automatically add <bcp14></bcp14> tags to requirement language in v3 RFCs.
>>>>>>> 
>>>>>>> Specifically, this would take a v3 XML input file, and create a v3 XML
>>>>>>> output file with <bcp14></bcp14> added around each instance of a 2119
>>>>>>> keyword in the file. (MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT,
>>>>>>> SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL)
>>>>>>> 
>>>>>>> Anyone up for helping us out with that?
>>>>>>> 
>>>>>>> Thanks! Heather
>>>>>>> ...
>>>>>> 
>>>>>> The tricky part is to find the right instances. For instance, what if it
>>>>>> appears in a quote, or in artwork? Or if "SHALL NOT" is across a line
>>>>>> break...
>>>>>> 
>>>>>> So the output will require sanity checking.
>>>>> 
>>>>> Well, yes, of course. We're aiming for a rough pass to catch maybe 80% of the situations. Everything will still need to be reviewed.
>>>>> 
>>>>>> I assume that the tool is supposed to preserve whitespace, line breaks
>>>>>> etc? This essentially rules out running the input through an XML parser...
>>>>> 
>>>>> Seriously, we're not aiming for that robust right now. It doesn't have to be perfect, it just has to help.
>>>>> 
>>>>> -Heather
>>>> 
>>>> _______________________________________________
>>>> rfc-interest mailing list
>>>> rfc-interest@rfc-editor.org
>>>> https://www.rfc-editor.org/mailman/listinfo/rfc-interest
