Re: [TOOLS-DEVELOPMENT] Preview release of Text Submission Converter, id2xml

Henrik Levkowetz <henrik@levkowetz.com> Thu, 13 July 2017 15:47 UTC

To: Megan Ferguson <mferguson@amsl.com>
References: <8158A447-3AE2-413F-8BF0-6EDA08B5B121@amsl.com> <A8EC1B4D-A999-4848-B7E6-ABFE199921D7@amsl.com>
Cc: tools-development@ietf.org
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <b0cf8c9d-0694-fa0d-0c3f-646ac91de1fc@levkowetz.com>
Date: Thu, 13 Jul 2017 17:47:27 +0200
In-Reply-To: <A8EC1B4D-A999-4848-B7E6-ABFE199921D7@amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-development/hiw0iKolVl61runRMrlA0UjGuDY>

Hi Megan,

On 2017-07-13 04:28, Megan Ferguson wrote:
> Hi Henrik,
> 
> Files: non-xml2rfc-generated files generally — status check
> Version: 1.0.3
> 
> This mail is comprised of a few queries about this type of file 
> generally as well as a summary of the manual updates we have been
> making in text files to get id2xml to parse and represent as much of
> the text accurately as possible. We appreciate whatever feedback you
> may have on the following.
> 
> 1) Would it be possible to use the citation tags [RFC…] and [I-D….] 
> in the references section as a trigger to automatically pull those
> from the citation library? (So that if the entry itself is poor, it
> doesn’t really matter…). I believe you previously said this
> information was pulled from the seriesInfo, which makes sense as
> taking a look at (for example):
> 
> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3.xml, 
> 
> we see a full reference entry for [ARPND] aka 
> draft-ietf-trill-arp-optimization in the references section but we
> also see:
> 
> <!ENTITY I-D.ietf-trill-arp-optimization SYSTEM
>   "https://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.draft-ietf-trill-arp-optimization.xml">
> 
> at the top of the xml file.

Yes.  I'd prefer to make this happen only if you specify a switch,
maybe something like --use-citation-tags, since this requires the user
to be aware of the effects and to have inspected, and possibly fixed,
the tags (something which may not be the case for general use in the
community).  But I can see the advantage to you, and will add this
in the next release.

> The same question for other references that live in the citation 
> library (e.g., other SDOs like W3C in bibxml6).

To a limited extent, and only if the tags are sufficiently regular:
the tool won't be looking up individual tags while working, but will
rely on recognising certain patterns in the tag.

> [Generally, updating the citation tag only would be much less time 
> intensive than updating the info. And having the correct input would
> be desirable (vs. re-adding the reference to the xml) so that 
> references aren’t errantly removed.]

Ack.
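
To illustrate what "recognising certain patterns in the tag" could mean,
here is a rough sketch; it is not the id2xml code.  The function name and
the regular expressions are made up for illustration, and the URL layout
is an assumption based on the ENTITY line quoted above and on the usual
bibxml naming (zero-padded RFC numbers).  W3C and other SDO tags are not
covered.

```python
import re

# Assumed base URL, taken from the ENTITY line quoted earlier in this mail.
BIBXML_BASE = "https://xml2rfc.ietf.org/public/rfc"

# Hypothetical patterns: only tags that are regular enough are resolved.
RFC_TAG = re.compile(r"^\[RFC\s?(\d{1,5})\]$")
ID_TAG = re.compile(r"^\[I-D\.([A-Za-z0-9-]+)\]$")


def citation_tag_to_url(tag):
    """Return a citation-library URL for a regular tag, or None."""
    m = RFC_TAG.match(tag)
    if m:
        # Assumption: RFC entries use zero-padded four-digit numbers.
        return "%s/bibxml/reference.RFC.%04d.xml" % (BIBXML_BASE, int(m.group(1)))
    m = ID_TAG.match(tag)
    if m:
        # Assumption: draft entries are named after the draft without "draft-".
        return "%s/bibxml3/reference.I-D.%s.xml" % (BIBXML_BASE, m.group(1))
    # Free-form tags such as [ARPND] carry no usable pattern.
    return None
```

A tag like [ARPND] returns None here, which is why the tags would need
to be inspected and fixed before using such a switch.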

> 2) The SoTM text trigger seems to be quite sensitive. The text 
> appearing in the following files is not far off that generated by
> xml2rfc, but even the slight variations cause this text to be 
> unrecognized. Even copying in the text from 
> https://www.ietf.org/ietf-ftp/1id-guidelines.html#anchor7 gives an
> error as it is single spaced (and disclaimer that it does contain
> some typos…).

Normalising space before doing the processing is something I've
thought of doing.  I'll add that in the next release.

I can accommodate a number of acceptable variations on the boilerplate
text.  I think one way forward here is also to tighten the text that
idnits will accept, as part of the upcoming idnits rewrite.  Once the
new idnits is in production you should see less of this problem.
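
As a minimal sketch of the space normalisation idea (an illustration
only, with hypothetical function names, not what the next release will
necessarily do): collapse every run of whitespace to a single space
before comparing a paragraph with the expected boilerplate.

```python
def normalize_space(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())


def matches_boilerplate(candidate, reference):
    """Compare two paragraphs while ignoring spacing and line breaks."""
    return normalize_space(candidate) == normalize_space(reference)
```

With this, single-spaced and double-spaced versions of the same
paragraph compare equal, but wording differences and typos still fail.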

> draft-ietf-mpls-app-aware-tldp-09
> draft-ietf-pals-status-reduction-05
> draft-ietf-ippm-6man-pdm-option-13
> 
> 3) Here is a list of the (current) manual changes we are making in 
> order to make id2xml parse with the current version. Please let me
> know if I have mischaracterized any functionality or if any of these 
> items can be resolved using the tool in some manner I am unaware of.

Will do.  I also have at least one suggestion below of how the tool
might be modified to ease the work.  It would be valuable if you could
indicate which of these are most work-intensive, and most worthy of
attention to lessen the manual work.

> 
> Header updates:
> 
> -Remove any blank lines between top left 'Key word: text’ entries
> -Update to use first initial instead of full first name (or you get 
>  "Warning: This author is listed in the Authors’ Addresses section,
>  but was not found on the first page:" and the authors section will be
>  absent from the XML generated)

Right.

> Boilerplate updates:
> 
> -Replace Copyright and SoTM text to ensure exactly matches output from xml2rfc
> -Ensure use of “Copyright License” as a title exactly

Yes. If you have commonly occurring variations on the Copyright License
title I can include those in the next release.

> List format updates:
> 
> -Add blank lines between list items to fix numbering
> -change (1) to 1 

Even better, "1."

> -fix indentation with - (dash) to all be inline indentation-wise

This applies to all bullet forms -- indentation after the bullet line
needs to match the start of text column on the bullet line.
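
The alignment rule can be checked mechanically before running the
tool.  The following is a sketch under assumptions: the set of bullet
characters and the function names are mine, and it only looks at space
indentation, not tabs.

```python
import re

# Assumed bullet characters; adjust to whatever the document uses.
BULLET = re.compile(r"^( *)([-*o])( +)\S")


def text_column(bullet_line):
    """Return the column where the text starts on a bullet line, or None."""
    m = BULLET.match(bullet_line)
    if m is None:
        return None
    return len(m.group(1)) + len(m.group(2)) + len(m.group(3))


def misaligned_lines(item_lines):
    """Return indexes of continuation lines not aligned with the bullet text."""
    col = text_column(item_lines[0])
    if col is None:
        raise ValueError("first line is not a bullet line")
    bad = []
    for i, line in enumerate(item_lines[1:], start=1):
        if not line.strip():
            continue  # blank lines carry no indentation information
        indent = len(line) - len(line.lstrip(" "))
        if indent != col:
            bad.append(i)
    return bad
```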

> -fix indentation generally — if things are not aligned, they don’t work
> -update + or -iv or anything not xml2rfc-compliant as a list marker

Ack.

> Section header updates:
> -  Change Appendix A: to be Appendix A. or just Appendix A

Yes.  If Appendix A: is common, I can add recognition of that.
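
Until then, the colon form can be rewritten in the text file with
something like the sketch below (the pattern and function name are
hypothetical, and it only handles single-letter appendices):

```python
import re

# Matches "Appendix A:" at the start of a line, for a single capital letter.
APPENDIX_COLON = re.compile(r"^(Appendix [A-Z]):(?=\s|$)")


def normalize_appendix_heading(line):
    """Rewrite 'Appendix A: Title' as 'Appendix A. Title'."""
    return APPENDIX_COLON.sub(r"\1.", line)
```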

> Reference entry updates: 
> 
> -Change A. Nonymous to Nonymous, A. 
> -Review comma use as missing commas will cause a reference to be missed
> -Add double quotes around titles
> -Date updates:
> 	-Change February 10 2016 to 10 February 2016 -- what about commas etc.?
> 	-Change Summer 1996 to a specific month

Right.  And no commas in the date string.
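
A small pre-processing sketch for the date strings, assuming English
month names and a hypothetical function name.  It reorders month-first
dates and strips commas; something like "Summer 1996" is passed through
unchanged and still needs a manual edit.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# "February 10 2016" or "February 10, 2016"
MONTH_FIRST = re.compile(r"^(%s)\s+(\d{1,2}),?\s+(\d{4})$" % "|".join(MONTHS))


def normalize_date(text):
    """Return the date as 'D Month YYYY' without commas where possible."""
    text = text.strip()
    m = MONTH_FIRST.match(text)
    if m:
        month, day, year = m.groups()
        return "%d %s %s" % (int(day), month, year)
    # Otherwise only drop commas and leave the rest alone.
    return text.replace(",", "")
```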

> Authors’ Addresses updates:
> 
> -Add a blank line before email addresses (temporary?)

Yes, that should not be needed with the next release, unless the
email address is preceded by a street address.  A blank line is
needed to terminate a street/postal address.

> -Remove any number from this section heading

I don't think that's needed any more.

> -Add URI: before any entry (same for email?)

True for URIs, yes.  And Email: or EMail: before email.

> -Spacing
> 	-Need to make sure there is whitespace where needed 
>         (e.g., title with no blank line before the figure)
> 	- Need to review in text output from the xml because sometimes errant
> -Review for mismatch between authors in header and Addresses section

The tool should warn if there's any author mismatch.

> Misc. updates:
> 
> -Review for defined in “section” x and similar
> 
> 4) Here is a pointer to a diff file between an original and one 
> including the manual edits we made in order to get the document to
> parse (a file we have previously discussed, just an example):
> 
> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3preedits-rfcdiff.html 

This matches pretty much what I'd expect.  I don't think you need to
add the "8.  References" line if you give the two references sections
appropriate section numbers (8. and 9.).

> Here is a pointer to a diff between the original and the text created from the id2xml output of our 
> manually updated original (another example):
> 
> https://www.rfc-editor.org/rfc/v3test/draft-ietf-trill-directory-assist-mechanisms-12v3-rfcdiff.html 

This also looks pretty much like I'd expect.  There is one place where
I'm surprised at the lack of proper list handling: in section 2.5, the
output from the tool has two sequential list items which are both
numbered "1.".  Did you add blank lines between paragraphs and list,
and between list items, there?

The sublist which is numbered 2.a, 2.b, etc. won't be properly
recognized as a numbered list unless you change to 2.1, 2.2, etc.
Right now I suspect it gets treated like one or more hanging lists.
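
If such sublists are common, the relabelling could be scripted.  This
is a sketch with a hypothetical function name; it deliberately handles
only the letters a to i, so the label keeps its width and the text
column of the list item does not move.

```python
import re

# A label such as "2.a" at the start of a line; letters a-i only, so the
# replacement ("2.1" .. "2.9") has the same width as the original label.
LETTER_ITEM = re.compile(r"^(\s*)(\d+)\.([a-i])(?![A-Za-z0-9])")


def renumber_sublist_line(line):
    """Rewrite a leading '2.a' style label as '2.1'; leave other lines alone."""
    def repl(m):
        index = ord(m.group(3)) - ord("a") + 1
        return "%s%s.%d" % (m.group(1), m.group(2), index)
    return LETTER_ITEM.sub(repl, line, count=1)
```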

> 5) The following is a list of checks we are making once the file parses using xml2rfc.  The amount of time this takes 
> is very dependent on the number of lists, figures, and references per document in addition to how clean the indentation
> was in the original.
> 
> Lists:
> -Review numbering of lists
> -Review correct implementation of list elements
> -Ensure no figures have been turned into lists errantly

Makes sense.

> Figures:
> -Review figures for alignment and missing text
> 
> Spacing:
> -Review indentation of text around a colon followed by two spaces
> -Review whitespace generally

Seems reasonable.

> References:
> -Include references that could not be generated through text fixes previously

Ack.  This should be eased by the --use-citation-tags switch.

> Authors:
> -Review all information from original is included in the file (no missing email addresses)

Ack.


Best regards,

	Henrik