Re: [xml2rfc-dev] xml2rfc Input Document Issues

Tony Hansen <tony@att.com> Sat, 30 July 2011 13:38 UTC

Thank you, Mike, for the updates and especially the updated code!

Going forward, this discussion should really be happening on the 
xml2rfc-dev mailing list. It does consist only of *friendly* people who 
are interested in the internal development, and who will have input on 
some of the questions below. We're past the point of "does it work at 
all" and starting to ask even harder questions.

I ran the latest code base (version 2.0.0, aka branch 465) against 2581 
XML files under tools.ietf.org/id. The results and a summary of each 
document's status (sorted by status) can be found at

     http://xml.resource.org/xml2rfc2.success-rate-expanded-2.0.0
     http://xml.resource.org/xml2rfc2.success-rate-expanded-2.0.0-summary

Of these 2581 files,

941 passed (36%)
     This is a vast improvement over the 0.4.x and 0.5.x code. Excellent 
job!

1632 gave an error message
     These are the ones that need to be analyzed

8 failed with an exception
    These need fixes to the code.

draft-atwood-mcast-user-auth-01.xml
draft-ietf-dime-diameter-qos-15.xml
draft-ietf-karp-framework-00.xml
draft-ietf-pim-sm-linklocal-10.xml
draft-krecicki-imap-move-01.xml
draft-kunze-bagit-04.xml
draft-lhotka-netconf-relaxng-00.xml
draft-lhotka-yang-dsdl-map-00.xml

0 timed out
     Yay!

0 produced a warning message without an accompanying error message.
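
For reference, the breakdown above can be re-tallied with a few lines of Python. This is only a sketch: the counts are the ones quoted in this message, and the dictionary keys are labels chosen here for illustration.

```python
# Counts copied from the summary above; the labels are illustrative.
results = {
    "passed": 941,
    "error message": 1632,
    "exception": 8,
    "timed out": 0,
    "warning only": 0,
}


def percentages(counts):
    """Return each count as a percentage of the total, rounded to 0.1."""
    whole = sum(counts.values())
    return {name: round(100.0 * n / whole, 1) for name, n in counts.items()}


total = sum(results.values())
shares = percentages(results)
for name, n in results.items():
    print(f"{name:14s} {n:5d}  {shares[name]:5.1f}%")
print(f"{'total':14s} {total:5d}")
```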

Henrik, do you think we should have a new issue tracker for xml2rfc2? Or 
should we just use the current xml2rfc issue tracker?

More comments below.

     Tony

On 7/30/2011 2:13 AM, Mike Biglan wrote:
> Hi Henrik and Tony,
>
> Here is a little more detail on what we worked on this past week in response to the input documents. We went ahead and tested a set of 450 documents in an attempt to categorize the errors, then fix any that needed fixing. Below are the categories that had errors and the counts in parentheses; this includes issues we have fixed, errors in the document or outside our codebase, and open questions.
>
>
> FIXED
>
> 1) Include instructions are being handled properly now.
>
> 2) No DTD file was declared in the document (12)
> We had already planned to handle this but it wasn't possible until a recent change.  I will implement a function in the application to default to rfc2629.dtd if no dtd is declared.

I'll note also that the --dtd parameter didn't seem to work either.
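
A minimal sketch of the defaulting behaviour being discussed, assuming the DTD is named in a `<!DOCTYPE ... SYSTEM "...">` declaration. The function name `choose_dtd` and its `override` argument (standing in for a --dtd style option) are hypothetical and are not taken from the xml2rfc2 code.

```python
import re

DEFAULT_DTD = "rfc2629.dtd"  # the default named in this thread

# Simplified match for: <!DOCTYPE rfc SYSTEM "some.dtd" ...
# Assumption: PUBLIC identifiers and DOCTYPE text inside comments are ignored.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\S+\s+SYSTEM\s+(["\'])(.*?)\1')


def choose_dtd(xml_text, override=None):
    """Pick the DTD to validate against.

    Order of preference: an explicit override, then the DTD declared in
    the document, then the default.
    """
    if override:
        return override
    match = _DOCTYPE_RE.search(xml_text)
    if match is None:
        return DEFAULT_DTD
    return match.group(2)
```

Letting the explicit override win over the document's own declaration is one possible design choice; the thread does not say which should take precedence.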

> 3) Failed to load a local reference file using ENTITY (9)
> Until today I hadn't seen a document using <!ENTITY> to load a local file, only network files.  I will let this use the same function as <?rfc include> to consult the XML_LIBRARY environment variable for the file, defaulting to the input file directory if it's not found.
>
>
> ERROR OUTSIDE OUR CONTROL
>
> 4) Invalid characters for an "ID" value (30)
> It errors if an attribute of type "ID" has a value that starts with a number, or contains spaces, at symbols, or a few other characters. Upon looking into this, this is an error with standard XML and something outside our codebase.

So the question then shifts to how best to handle such errors -- what 
kinds of error messages should be presented.
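
One way to give a friendlier message is to check the value before the parser rejects it. The sketch below uses an ASCII-only approximation of the XML Name rule (the real rule also admits many non-ASCII characters), and `explain_bad_id` is a hypothetical helper, not part of xml2rfc2.

```python
import re

# ASCII-only approximation of the XML "Name" production that ID values
# must match: a letter, "_" or ":" first, then letters, digits, "_", ":",
# "." or "-".
_NAME_START = r"[A-Za-z_:]"
_NAME_CHAR = r"[A-Za-z0-9_:.\-]"
_NAME_RE = re.compile(_NAME_START + _NAME_CHAR + r"*\Z")


def explain_bad_id(value):
    """Return None if value looks like a legal ID, else a readable reason."""
    if not value:
        return "ID is empty"
    if _NAME_RE.match(value):
        return None
    if not re.match(_NAME_START, value):
        return (f'ID "{value}" starts with {value[0]!r}; '
                'an ID must start with a letter, "_" or ":"')
    # First character is fine, so some later character is the problem.
    bad = sorted({ch for ch in value if not re.match(_NAME_CHAR, ch)})
    return f'ID "{value}" contains characters not allowed in an ID: {bad}'
```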

> 5) Documents that violate the DTD (55)
There are some meta questions that will need to be answered about the 
DTD. I know that these are out of scope for the *development* of 
xml2rfc2, but they are definitely in scope before the tool can be rolled 
out for mass use. In particular, the RFC Editor staff has various 
concerns about how best to handle RFC documents that were "legit" before 
and suddenly no longer are.

> 6) Documents that didn't properly escape & and < characters in XML (28)

Again the question then shifts to how best to handle such errors -- 
what kinds of error messages should be presented.
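
As one illustration of what a more helpful message could look like, the sketch below uses Python's standard-library parser (not whatever parser xml2rfc2 uses) and echoes the offending source line next to the parser's own message. The name `describe_parse_error` and the wording of the hint are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def describe_parse_error(xml_text):
    """Return None if xml_text parses, else a multi-line error report."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        lines = xml_text.splitlines()
        source = lines[line - 1] if 0 < line <= len(lines) else ""
        # Look at the few characters around the reported column; if one of
        # them is "&" or "<", an unescaped character is a likely culprit.
        window = source[max(0, column - 1):column + 2]
        hint = ""
        if "&" in window or "<" in window:
            hint = ('\n  hint: a literal "&" or "<" in text must be '
                    'written as "&amp;" or "&lt;"')
        return f"{err}\n  {source}{hint}"
    return None
```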

> 7) Documents that had other XML issues, mostly the wrong format of declaring entities  (11)
>
>
> OPEN QUESTIONS
>
> 8) An include instruction requested a path with directories, instead of just the filename (12)
> Instead of asking for 'reference.RFC.2119.xml' it asked for 'bibxml2/reference.RFC.2119.xml'.  Most of the documents don't do this.  There are ways we could handle it -- if the full path fails, it could try just the basename in the toplevel directory. Conversely, if the instruction asks for just the basename like usual, we can either only look in the top level directory (current behavior) or do a recursive search.  Thoughts on the best way to handle this?

I'm wondering what the current xml2rfc does in these cases.
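
To make the options concrete, here is a sketch of one of the lookup orders Mike describes: try the name as written, then fall back to just the basename, searching the XML_LIBRARY directories before the input file's directory. Treating XML_LIBRARY as a list of directories separated by the platform path separator is an assumption, and `resolve_include` is a hypothetical name, not the function in the xml2rfc2 code.

```python
import os


def resolve_include(name, input_dir, library=None):
    """Return the first existing file for an include name, or None."""
    if library is None:
        library = os.environ.get("XML_LIBRARY", "")
    # Assumption: XML_LIBRARY holds directories separated by os.pathsep.
    search_dirs = [d for d in library.split(os.pathsep) if d]
    search_dirs.append(input_dir)  # fall back to the input file directory

    # Try the name exactly as requested, then just its basename, so that
    # 'bibxml2/reference.RFC.2119.xml' can still find a top-level file.
    candidates = [name]
    base = os.path.basename(name)
    if base != name:
        candidates.append(base)

    for candidate in candidates:
        for directory in search_dirs:
            path = os.path.join(directory, candidate)
            if os.path.isfile(path):
                return path
    return None
```

A recursive search for a bare basename is deliberately left out here, since a name could then match more than one file in different subdirectories.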

> 9) Incorrect DTD filename given (2)
> This could be a typo or a placeholder meant to be completed later: some files used 'rfcXXXX.dtd' for the DTD.  If we need to, we can treat this in the same way as if no DTD were given, but it might be more appropriate to display an error.

Can you be more specific about which documents displayed the above 
errors so we can see exactly what you're referring to?

> Let us know if you have any questions on this or notes about the open questions.
>
> Also, I'll send an email shortly about the setuptools and the cache locations. We are almost done redoing the HTML to pull most of it out of the Python code and into a separate set of template files.
>
> Thanks --
>
> Mike

I took a quick peek at a couple of the other errors from my test run.

ERROR: Unable to parse the XML document: draft-aayadi-6lowpan-tcphc-01.xml
   Comment not terminated , line 1192, column 8


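xml2rfc2 stopped looking for the comment terminator "-->" when it ran into 
"--" in the comment text. My understanding was that this is legit XML; 
strictly, XML 1.0 does not allow "--" inside a comment, so this probably 
belongs with the other malformed-document cases and should at least get a 
clearer error message.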

ERROR: Unable to parse the XML document: draft-livingood-dns-malwareprotect-02.xml
   EntityRef: expecting ';', line 171, column 84


&quot was missing the trailing ';'. This is a case where the error 
message could be improved to indicate *why* it was expecting a ';'.
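
A rough sketch of a pre-check that could feed a better message: scan the text for entity or character references that are not closed with ';'. The regular expression is a simplification (ASCII names only, and it does not skip CDATA sections or comments), and `find_unterminated_entities` is a hypothetical helper.

```python
import re

# "&" followed by a character reference or an entity name that is NOT
# closed by ";". The lookahead also rejects partial matches of a longer name.
_UNTERMINATED = re.compile(
    r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.\-]*)(?![\w.\-;])"
)


def find_unterminated_entities(xml_text):
    """Return (line, column, name) for each reference missing its ';'.

    Line and column are 1-based and point at the '&'.
    """
    hits = []
    for lineno, line in enumerate(xml_text.splitlines(), start=1):
        for match in _UNTERMINATED.finditer(line):
            hits.append((lineno, match.start() + 1, match.group(1)))
    return hits


for lineno, col, name in find_unterminated_entities("He said &quot hi&quot;"):
    print(f'line {lineno}, column {col}: "&{name}" looks like an entity '
          f'reference but has no closing ";" (did you mean "&{name};"?)')
```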

ERROR: Unable to parse the XML document: draft-livingood-woundy-p4p-experiences-10.xml
   internal error, line 6, column 70


I hadn't spotted these before. "Internal error" is just as bad as an 
exception.

Tons of errors like

ERROR: Unable to validate the XML document: draft-maino-lisp-sec-00.xml
   Line 407: IDREF attribute target references an unknown ID "RFC5226"


that need to be understood.
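
As a starting point for analyzing these, a small script can list which xref targets have no matching anchor once a document has been parsed. This sketch uses Python's standard library rather than the DTD validator, checks only the `target` attribute of `<xref>` against `anchor` attributes on any element, and needs the document to parse as given (undefined entities will stop it). The name `dangling_targets` is hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_targets(xml_text):
    """Return xref targets with no matching anchor, in document order."""
    root = ET.fromstring(xml_text)
    # Every anchor="..." value in the document is a possible target.
    anchors = {el.get("anchor") for el in root.iter() if el.get("anchor")}
    missing = []
    for xref in root.iter("xref"):
        target = xref.get("target")
        if target and target not in anchors and target not in missing:
            missing.append(target)
    return missing


sample = (
    '<rfc><middle><section anchor="intro"><t>'
    '<xref target="RFC5226"/> and <xref target="intro"/>'
    '</t></section></middle>'
    '<back><references><reference anchor="RFC2119"/></references></back>'
    '</rfc>'
)
print(dangling_targets(sample))
```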

     Tony