Re: [rfc-i] looking for a volunteer to write a simple script

Heather Flanagan <rse@rfc-editor.org> Sun, 14 July 2019 16:49 UTC

From: Heather Flanagan <rse@rfc-editor.org>
Date: Sun, 14 Jul 2019 09:49:49 -0700
To: Joe Touch <touch@strayalpha.com>
Cc: RFC Interest <rfc-interest@rfc-editor.org>


> On Jul 12, 2019, at 5:19 PM, Joe Touch <touch@strayalpha.com> wrote:
> 
> I looked at this a bit and getting it closer to 100% is tricky. There are a LOT of corner cases, including the possibility that the <bcp14> tags are outside other tags, e.g., bold, etc.

The IETF absolutely *lives* for corner cases. :-) We are expecting those, and it’s ok. There is no expectation of 100% here.

> 
> I don’t know if it’ll be worth the effort, given the need for post-checking anyway.
> 
> But I’ll take a look over the weekend and let you know if there’s a solution that gets closer without increasing falses.

You’re a gem! Thanks!

Heather

> 
> Joe
> 
>> On Jul 12, 2019, at 11:33 AM, Joe Touch <touch@strayalpha.com> wrote:
>> 
>> I can fix that in a few minutes...
>> 
>> On Jul 12, 2019, at 11:29 AM, Sandy Ginoza <sginoza@amsl.com> wrote:
>> 
>>> Thanks Joe!
>>> 
>>> I just tested this updated script and it seems to work well.  I am happy to see it catches things like “MAY” (within quotes), but does not tag items Carsten noted (e.g., “MARSHALL”).   
>>> 
>>> I note that it does double tag keywords if they’ve already been tagged, but that should be something we can check for before running the script (i.e., check whether the keywords have already been tagged).  
>>> 
>>> Thanks!
>>> Sandy
>>> 
>>>> On Jul 12, 2019, at 11:02 AM, Joe Touch <touch@strayalpha.com> wrote:
>>>> 
>>>> Quick update:
>>>> 
>>>> perl -0777 -pe "s/(\b(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)\b)/<bcp14>\$1<\/bcp14>/g" INFILE.xml > OUTFILE.xml
>>>> 
>>>> 
>>>> 
>>>> Joe
>>>> 
>>>>  
>>>> On 2019-07-12 10:43, Joe Touch wrote:
>>>> 
>>>>> This will do the trick:
>>>>> 
>>>>> 
>>>>> 
>>>>> perl -0777 -pe "s/(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)/<bcp14>\$1<\/bcp14>/g" INFILE.xml > OUTFILE.xml
>>>>> 
>>>>> (replace INFILE.xml and OUTFILE.xml with your filenames)
>>>>> 
>>>>> If you want it to edit in-place (risky, but simpler if you work on a copy anyway), drop the output redirect, since -i writes the result back to INFILE.xml:
>>>>> 
>>>>> perl -0777 -i -pe "s/(((MUST|SHOULD|SHALL)(\s+NOT)?)|((NOT\s+)?RECOMMENDED)|MAY|OPTIONAL|REQUIRED)/<bcp14>\$1<\/bcp14>/g" INFILE.xml
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> On 2019-07-12 10:26, Heather Flanagan wrote:
>>>>> 
>>>>>> On Jul 12, 2019, at 10:23 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
>>>>>> 
>>>>>>> On 12.07.2019 18:55, Heather Flanagan wrote:
>>>>>>> Hello everyone!
>>>>>>> 
>>>>>>> The RFC Editor has the need for a comparatively simple script that would
>>>>>>> automatically add <bcp14></bcp14> tags to requirement language in v3 RFCs.
>>>>>>> 
>>>>>>> Specifically, this would take a v3 XML input file, and create a v3 XML
>>>>>>> output file with <bcp14></bcp14> added around each instance of a 2119
>>>>>>> keyword in the file. (MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT,
>>>>>>> SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL)
>>>>>>> 
>>>>>>> Anyone up for helping us out with that?
>>>>>>> 
>>>>>>> Thanks! Heather
>>>>>> ...
>>>>>> 
>>>>>> The tricky part is to find the right instances. For instance, what if it
>>>>>> appears in a quote, or in artwork? Or if "SHALL NOT" is across a line
>>>>>> break...
>>>>>> 
>>>>>> So the output will require sanity checking.
>>>>> 
>>>>> Well, yes, of course. We're aiming for a rough pass to catch maybe 80% of the situations. Everything will still need to be reviewed.
>>>>> 
>>>>>> I assume that the tool is supposed to preserve whitespace, line breaks
>>>>>> etc? This essentially rules out running the input through an XML parser...
>>>>> 
>>>>> Seriously, we're not aiming for that robust right now. It doesn't have to be perfect, it just has to help.
>>>>> 
>>>>> -Heather
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> rfc-interest mailing list
>>>>> rfc-interest@rfc-editor.org <mailto:rfc-interest@rfc-editor.org>
>>>>> https://www.rfc-editor.org/mailman/listinfo/rfc-interest <https://www.rfc-editor.org/mailman/listinfo/rfc-interest>
