Re: [TOOLS-DEVELOPMENT] Preview release of Text Submission Converter, id2xml

Megan Ferguson <mferguson@amsl.com> Fri, 14 July 2017 16:48 UTC

Content-Type: text/plain; charset="windows-1252"
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
From: Megan Ferguson <mferguson@amsl.com>
In-Reply-To: <b0cf8c9d-0694-fa0d-0c3f-646ac91de1fc@levkowetz.com>
Date: Fri, 14 Jul 2017 09:48:34 -0700
Cc: tools-development@ietf.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <239FFB30-9E76-479E-88E0-D1099A4CE5A9@amsl.com>
References: <8158A447-3AE2-413F-8BF0-6EDA08B5B121@amsl.com> <A8EC1B4D-A999-4848-B7E6-ABFE199921D7@amsl.com> <b0cf8c9d-0694-fa0d-0c3f-646ac91de1fc@levkowetz.com>
To: Henrik Levkowetz <henrik@levkowetz.com>
X-Mailer: Apple Mail (2.1878.6)
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-development/B9dmD7HI0QUn6AsL1YKUbMPVUp8>
Subject: Re: [TOOLS-DEVELOPMENT] Preview release of Text Submission Converter, id2xml

Hi Henrik.

Inline below with MF.

Thanks so much for your time!

Megan

On Jul 13, 2017, at 8:47 AM, Henrik Levkowetz <henrik@levkowetz.com> wrote:

> Hi Megan,
> 
> On 2017-07-13 04:28, Megan Ferguson wrote:
>> Hi Henrik,
>> 
>> Files: non-xml2rfc-generated files generally — status check
>> Version: 1.0.3
>> 
>> This mail is comprised of a few queries about this type of file 
>> generally as well as a summary of the manual updates we have been
>> making in text files to get id2xml to parse and represent as much of
>> the text accurately as possible. We appreciate whatever feedback you
>> may have on the following.
>> 
>> 1) Would it be possible to use the citation tags [RFC…] and [I-D….] 
>> in the references section as a trigger to automatically pull those
>> from the citation library? (So that if the entry itself is poor, it
>> doesn’t really matter…). I believe you previously said this
>> information was pulled from the seriesInfo, which makes sense as
>> taking a look at (for example):
>> 
>> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3.xml, 
>> 
>> we see a full reference entry for [ARPND] aka 
>> draft-ietf-trill-arp-optimization in the references section but we
>> also see:
>> 
>> <!ENTITY I-D.ietf-trill-arp-optimization SYSTEM "https://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.draft-ietf-trill-arp-optimization.xml">
>> 
>> at the top of the xml file.
> 
> Yes.  I'd prefer to make this happen if you specify a switch, maybe
> something like --use-citation-tags, since this requires the user to
> be aware of the effects, and to have inspected and possibly fixed the
> tags (something which may not be the case for general use in the
> community).  But I can see the advantage to you, and will add this
> in the next release.

MF - That sounds like it would be really helpful if it’s possible.
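
A minimal sketch of the tag-recognition idea, assuming tags of the form [RFCnnnn] and [I-D.name]. The function name and patterns are hypothetical and are not taken from id2xml; the point is only that a regular tag can be mapped to a library lookup, while a free-form tag such as [ARPND] cannot.

```python
import re

# Hypothetical patterns; the real tool may recognise tags differently.
RFC_TAG = re.compile(r"\[RFC(\d+)\]")
ID_TAG = re.compile(r"\[I-D\.([A-Za-z0-9-]+)\]")


def classify_citation_tag(tag):
    """Return (kind, identifier) for a recognised tag, or None."""
    m = RFC_TAG.fullmatch(tag)
    if m:
        return ("rfc", m.group(1))
    m = ID_TAG.fullmatch(tag)
    if m:
        return ("i-d", m.group(1))
    # Free-form tags carry no lookup key and need a manual update first.
    return None
```

With this sketch, [ARPND] yields None, which is the case where we would need to rename the tag by hand before the library lookup could help.
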
> 
>> The same question for other references that live in the citation 
>> library (e.g., other SDOs like W3C in bibxml6).
> 
> To a limited extent; if the tags are sufficiently regular, as the
> tool won't be looking up individual tags while working, but will be
> relying on recognising certain patterns in the tag.
> 
MF - Right.  Assuming this might involve us updating to the known tags.

>> [Generally, updating the citation tag only would be much less time 
>> intensive than updating the info. And having the correct input would
>> be desirable (vs. re-adding the reference to the xml) so that 
>> references aren’t errantly removed.]
> 
> Ack.
> 
>> 2) The SoTM text trigger seems to be quite sensitive. The text 
>> appearing in the following files is not far off that generated by
>> xml2rfc, but even the slight variations cause this text to be 
>> unrecognized. Even copying in the text from 
>> https://www.ietf.org/ietf-ftp/1id-guidelines.html#anchor7 gives an
>> error as it is single spaced (and disclaimer that it does contain
>> some typos…).
> 
> Normalising space before doing the processing is something I've
> thought of doing.  I'll add that in the next release.
> 
> I can accommodate a number of acceptable variations on the boilerplate
> text.  I think one way forward here is also to tighten the text that
> idnits will accept, as part of the upcoming idnits rewrite.  Once the
> new idnits is in production you should see less of this problem.

MF - That sounds like a viable workaround.  The versions I see in the files below 
are all very similar, and we speculate they come from an outdated
template.  I’d have to do more research to get you a list of the most 
common forms if that is desirable instead of the idnits change.
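
For illustration, the space normalisation mentioned above could look roughly like this. It is a sketch only; the function names are hypothetical, and how id2xml will actually do it is not known from this thread.

```python
import re


def normalize_space(text):
    """Collapse every run of spaces, tabs, and newlines to one space."""
    return re.sub(r"\s+", " ", text).strip()


def same_boilerplate(candidate, reference):
    """Compare two boilerplate paragraphs, ignoring spacing and line breaks."""
    return normalize_space(candidate) == normalize_space(reference)
```

This only removes spacing differences (single vs. double spaces, line wrapping). Wording variations or typos in the boilerplate would still fail the comparison.
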
> 
>> draft-ietf-mpls-app-aware-tldp-09
>> draft-ietf-pals-status-reduction-05
>> draft-ietf-ippm-6man-pdm-option-13
>> 
>> 3) Here is a list of the (current) manual changes we are making in 
>> order to make id2xml parse with the current version. Please let me
>> know if I have mischaracterized any functionality or if any of these 
>> items can be resolved using the tool in some manner I am unaware of.
> 
> Will do.  I also have at least one suggestion below of how the tool
> might be modified to ease the work.  It would be valuable if you could
> indicate which of these are most work-intensive, and most worthy of
> attention to lessen the manual work.
> 
>> 
>> Header updates:
>> 
>> -Remove any blank lines between top left "Key word: text" entries
>> -Update to use first initial instead of full first name (or you get 
>> "Warning: This author is listed in the Authors’ Addresses section,
>> but was not found on the first page:" and the authors section will be
>> absent from the XML generated)
> 
> Right.
MF - The header updates can be tedious depending on how many authors there are, what 
information is missing from the lefthand side, etc.
> 
>> Boilerplate updates:
>> 
>> -Replace Copyright and SoTM text to ensure it exactly matches output from xml2rfc
>> -Ensure use of “Copyright License” as a title exactly
> 
> Yes. If you have commonly occurring variations on the Copyright License
> title I can include those in the next release.
MF - I did a quick scan of our original files received from present back to RFCs in 
the 7000s.  Looks like these are the top hits:

Copyright
Copyright Notice
Copyright Notice and License
Copyright and License Notice
Copyright, Disclaimer, and Additional IPR Provisions
Copyright and IPR Provisions
Copyright Statement
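
In case it helps, here is a rough sketch of how the titles above could be matched as a set, ignoring case and extra spaces. The names are hypothetical and this is not how id2xml is known to work.

```python
# Title variants taken from the list above; matching ignores case and spacing.
COPYRIGHT_TITLES = [
    "Copyright",
    "Copyright Notice",
    "Copyright Notice and License",
    "Copyright and License Notice",
    "Copyright, Disclaimer, and Additional IPR Provisions",
    "Copyright and IPR Provisions",
    "Copyright Statement",
]

_KNOWN = {" ".join(t.split()).casefold() for t in COPYRIGHT_TITLES}


def is_copyright_title(line):
    """True if the line, after collapsing whitespace, is a known title."""
    return " ".join(line.split()).casefold() in _KNOWN
```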


> 
>> List format updates:
>> 
>> -Add blank lines between list items to fix numbering
>> -change (1) to 1 
> 
> Even better, "1."
MF - ack.
> 
>> -fix indentation with - (dash) to all be inline indentation-wise
> 
> This applies to all bullet forms -- indentation after the bullet line
> needs to match the start of text column on the bullet line.
> 
>> -fix indentation generally — if things are not aligned, they don’t work
>> -update + or -iv or anything not xml2rfc-compliant as a list marker
> 
> Ack.
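
The bullet alignment rule above (continuation lines must start at the text column of the bullet line) can be sketched as a pre-check we might run before conversion. This is an assumption-laden illustration: the set of bullet markers and the function names are hypothetical, and only space indentation is counted.

```python
import re

# Hypothetical set of markers: "-", "*", "o", or a number followed by a dot.
BULLET = re.compile(r"^(\s*)([-*o]|\d+\.)(\s+)\S")


def text_column(bullet_line):
    """Column where the item text starts, or None if this is not a bullet."""
    m = BULLET.match(bullet_line)
    if not m:
        return None
    return len(m.group(1)) + len(m.group(2)) + len(m.group(3))


def misaligned_lines(item_lines):
    """Indexes of continuation lines whose indent differs from the text column."""
    col = text_column(item_lines[0])
    if col is None:
        return []
    bad = []
    for i, line in enumerate(item_lines[1:], start=1):
        indent = len(line) - len(line.lstrip(" "))
        if line.strip() and indent != col:
            bad.append(i)
    return bad
```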
> 
>> Section header updates:
>> -  Change Appendix A: to be Appendix A. or just Appendix A
> 
> Yes.  If Appendix A: is common, I can add recognition of that.
MF - Yes.  That would be good as this is quite common.  
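
A small sketch of a heading pattern that would accept all three forms (Appendix A:, Appendix A., and Appendix A). The pattern and function name are hypothetical, not id2xml's own.

```python
import re

# Accepts "Appendix A", "Appendix A." and "Appendix A:", with an optional title.
APPENDIX = re.compile(r"^Appendix\s+([A-Z])[.:]?(?:\s+(.*\S))?\s*$")


def parse_appendix_heading(line):
    """Return (letter, title) for an appendix heading, or None."""
    m = APPENDIX.match(line)
    if not m:
        return None
    return (m.group(1), m.group(2) or "")
```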
> 
>> Reference entry updates: 
>> 
>> -Change A. Nonymous to Nonymous, A. 
>> -Review comma use as missing commas will cause a reference to be missed
>> -Add double quotes around titles
>> -Date updates:
>> 	-Change February 10 2016 to 10 February 2016 -- what about commas etc.?
>> 	-Change Summer 1996 to a specific month
> 
> Right.  And no commas in the date string.

MF - Added to my list.  Updating the reference entries in text is time-intensive as 
they appear in such variant ways in non-xml2rfc-generated files.  The citation tag 
suggestion above would help mitigate some of this.
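
For the two most mechanical fixes on that list (date order and author initials), something like the following could be scripted on our side before running id2xml. This is only a sketch under the assumptions stated in the comments; the helper names are hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Matches "February 10 2016" and "February 10, 2016".
US_DATE = re.compile(r"\b(%s)\s+(\d{1,2}),?\s+(\d{4})\b" % MONTHS)

# Matches names that start with one or more initials, e.g. "A. Nonymous".
INITIALS_FIRST = re.compile(r"^((?:[A-Z]\.\s*)+)(\S.*)$")


def fix_date(text):
    """Rewrite month-day-year dates as day-month-year with no commas."""
    return US_DATE.sub(r"\2 \1 \3", text)


def fix_author(name):
    """Rewrite "A. Nonymous" as "Nonymous, A."; leave other forms alone."""
    m = INITIALS_FIRST.match(name.strip())
    if not m:
        return name
    return "%s, %s" % (m.group(2), m.group(1).strip())
```

Seasonal dates such as Summer 1996 are left untouched and would still need a manual decision on the month.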
> 
>> Authors’ Addresses updates:
>> 
>> -Add a blank line before email addresses (temporary?)
> 
> Yes, that should not be needed with the next release, unless the
> email address is preceded by a street address.  A blank line is
> needed to terminate a street/postal address.
> 
>> -Remove any number from this section heading
> 
> I don't think that's needed any more.
MF - Removing from my list.

> 
>> -Add URI: before any entry (same for email?)
> 
> True for URIs, yes.  And Email: or EMail: before email.
MF - and e-mail/E-mail too?
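
If those variants are not recognised, a pre-processing step on our side could rewrite them to the accepted "Email:" label. A minimal sketch, with a hypothetical function name:

```python
import re

# Matches Email:, EMail:, e-mail:, E-mail: and similar at the start of a line.
EMAIL_LABEL = re.compile(r"^(\s*)e-?mail\s*:\s*", re.IGNORECASE)


def normalize_email_label(line):
    """Rewrite any recognised email label to "Email: ", keeping indentation."""
    return EMAIL_LABEL.sub(r"\1Email: ", line, count=1)
```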
> 
>> -Spacing
>> 	-Need to make sure there is whitespace where needed 
>>        (e.g., title with no blank line before the figure)
>> 	-Need to review the text output from the xml because spacing is sometimes errant
>> -Review for mismatch between authors in header and Addresses section
> 
> The tool should warn if there's any author mismatch.
MF- Right.
> 
>> Misc. updates:
>> 
>> -Review for defined in “section” x and similar
>> 
>> 4) Here is a pointer to a diff file between an original and one 
>> including the manual edits we made in order to get the document to
>> parse (a file we have previously discussed, just an example):
>> 
>> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3preedits-rfcdiff.html 
> 
> This matches pretty much what I'd expect.  
> I don't think you need to
> add the "8.  References" line if you give the two references sections
> appropriate section numbers (8. and 9.)

MF - Ok, good to know.
> 
>> Here is a pointer to a diff between the original and the text created from the id2xml output of our 
>> manually updated original (another example):
>> 
>> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3-rfcdiff.html 
> 
> This also looks pretty much like I'd expect.  There is one place I'm
> surprised at the lack of proper list handling, where the output after
> the tool gives two sequential list items which both are numbered "1.",
> in section 2.5.  Did you add blank lines between paragraphs and list,
> and between list items there?

MF - The spacing is off there, and I didn’t take time to fix it (assume we would 
probably handle something like this in xml anyway).
> 
> The sublist which is numbered 2.a, 2.b, etc. won't be properly
> recognized as a numbered list unless you change to 2.1, 2.2, etc.
> Right now I suspect it gets treated like one or more hanging lists.
MF - Yes.
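
The 2.a to 2.1 change could be scripted along these lines. This is a hypothetical helper, assuming the marker sits at the start of the line and is followed by whitespace.

```python
import re

# Matches markers such as "2.a " or "2.b. " at the start of a line.
SUBITEM = re.compile(r"^(\s*)(\d+)\.([a-z])(\.?)(\s)")


def renumber_subitem(line):
    """Rewrite a lettered sub-item marker (2.a) as a numeric one (2.1)."""
    def repl(m):
        n = ord(m.group(3)) - ord("a") + 1
        return "%s%s.%d%s%s" % (m.group(1), m.group(2), n,
                                m.group(4), m.group(5))
    return SUBITEM.sub(repl, line)
```

Note that letters from j onward become two-digit numbers, which shifts the text column by one, so continuation lines of those items would need re-indenting.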
> 
>> 5) The following is a list of checks we are making once the file parses using xml2rfc.  The amount of time this takes 
>> is very dependent on the number of lists, figures, and references per document in addition to how clean the indentation
>> was in the original.
>> 
>> Lists:
>> -Review numbering of lists
>> -Review correct implementation of list elements
>> -Ensure no figures have been turned into lists errantly
> 
> Makes sense.
MF - This is time heavy when the document is full of lists (for example, an extensive terms list 
that is full of indentation problems would be a major cleanup for us).
> 
>> Figures:
>> -Review figures for alignment and missing text

MF - Again, if the document had several figures that were mistaken for lists, this 
can get time-intensive for us to recopy in as artwork.  Thinking specifically of documents
that have several appendices full of figures describing things in text.  So far, these have
been translating pretty well, but some limitations still exist.  Here is my most recent example 
(from draft-ietf-sidr-bgpsec-algs-18) where the double spacing in the middle throws it off:

Original:
A.3.  BGPsec IPv4

   BGPSec IPv4 Update from AS(65536) to AS(65537):
   ===============================================
   Binary Form of BGPSec Update (TCP-DUMP):

   FF FF FF FF FF FF FF FF  FF FF FF FF FF FF FF FF 
   01 03 02 00 00 00 EC 40  01 01 02 80 04 04 00 00 
   00 00 80 0E 0D 00 01 01  04 C6 33 64 64 00 18 C0 
   00 02 90 1E 00 CD 00 0E  01 00 00 01 00 00 01 00 
   00 00 FB F0 00 BF 01 47  F2 3B F1 AB 2F 8A 9D 26 
   86 4E BB D8 DF 27 11 C7  44 06 EC 00 48 30 46 02 
   21 00 EF D4 8B 2A AC B6  A8 FD 11 40 DD 9C D4 5E 
   81 D6 9D 2C 87 7B 56 AA  F9 91 C3 4D 0E A8 4E AF 
   37 16 02 21 00 90 F2 C1  29 AB B2 F3 9B 6A 07 96 
   3B D5 55 A8 7A B2 B7 33  3B 7B 91 F1 66 8F D8 61 
   8C 83 FA C3 F1 AB 4D 91  0F 55 CA E7 1A 21 5E F3 
   CA FE 3A CC 45 B5 EE C1  54 00 48 30 46 02 21 00 
   EF D4 8B 2A AC B6 A8 FD  11 40 DD 9C D4 5E 81 D6 
   9D 2C 87 7B 56 AA F9 91  C3 4D 0E A8 4E AF 37 16 
   02 21 00 8E 21 F6 0E 44  C6 06 6C 8B 8A 95 A3 C0 
   9D 3A D4 37 95 85 A2 D7  28 EE AD 07 A1 7E D7 AA 
   05 5E CA 

Text output from id2xml:

A.3.  BGPsec IPv4

   BGPSec IPv4 Update from AS(65536) to AS(65537):
   ===============================================
   Binary Form of BGPSec Update (TCP-DUMP):

   FF FF FF FF FF FF FF FF  FF FF FF FF FF FF FF FF
                            01 03 02 00 00 00 EC 40 01 01 02 80 04 04 00
                            00 00 00 80 0E 0D 00 01 01 04 C6 33 64 64 00
                            18 C0 00 02 90 1E 00 CD 00 0E 01 00 00 01 00
                            00 01 00 00 00 FB F0 00 BF 01 47 F2 3B F1 AB
                            2F 8A 9D 26 86 4E BB D8 DF 27 11 C7 44 06 EC
                            00 48 30 46 02 21 00 EF D4 8B 2A AC B6 A8 FD
                            11 40 DD 9C D4 5E 81 D6 9D 2C 87 7B 56 AA F9
                            91 C3 4D 0E A8 4E AF 37 16 02 21 00 90 F2 C1
                            29 AB B2 F3 9B 6A 07 96 3B D5 55 A8 7A B2 B7
                            33 3B 7B 91 F1 66 8F D8 61 8C 83 FA C3 F1 AB
                            4D 91 0F 55 CA E7 1A 21 5E F3 CA FE 3A CC 45
                            B5 EE C1 54 00 48 30 46 02 21 00 EF D4 8B 2A
                            AC B6 A8 FD 11 40 DD 9C D4 5E 81 D6 9D 2C 87
                            7B 56 AA F9 91 C3 4D 0E A8 4E AF 37 16 02 21
                            00 8E 21 F6 0E 44 C6 06 6C 8B 8A 95 A3 C0 9D
                            3A D4 37 95 85 A2 D7 28 EE AD 07 A1 7E D7 AA
                            05 5E CA
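
One possible heuristic for spotting blocks like this, so they can be kept as artwork: treat a line as hex-dump content when every token on it is a two-digit hex byte. This is purely a suggestion with a hypothetical function name, not a description of what id2xml does.

```python
import re

# A line made up only of two-digit hex bytes separated by one or more spaces.
HEX_LINE = re.compile(r"^\s*(?:[0-9A-Fa-f]{2}\s+)*[0-9A-Fa-f]{2}\s*$")


def looks_like_hex_dump(line):
    """True if the line consists solely of two-digit hex bytes."""
    return bool(HEX_LINE.match(line))
```

Short words that happen to be valid hex (for example "be" or "ad") would also match, so a real check would probably want several consecutive matching lines before deciding.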


>> 
>> Spacing:
>> -Review indentation of text around a colon followed by two spaces
>> -Review whitespace generally
> 
> Seems reasonable
> 
>> References:
>> -Include references that could not be generated through text fixes previously
> 
> Ack.  This should be eased by the --use-citation-tags switch.
MF - Exactly! 
> 
>> Authors:
>> -Review all information from original is included in the file (no missing email addresses)
> 
> Ack.
> 
> 
> Best regards,
> 
> 	Henrik
>