Re: [yaco-idsubmit-tool] Title extraction failure

Henrik Levkowetz <henrik@levkowetz.com> Mon, 14 March 2011 11:53 UTC

(I changed the subject line to match the new issue)

On 2011-03-13 20:04 Tony Hansen said:
> Here's a difficult one for the tool:
> 
> draft-ietf-yam-pre-evaluation-template-04.txt
> draft-ietf-yam-pre-evaluation-template-04.xml
> 
> (I adjusted the dates to get past the date checks.)
> 
> The title is being picked up as
> 
>      Group
> 
> instead of as
> 
>   Preliminary Evaluation of RFC XXX "[PLACEHOLDER: INSERT TITLE HERE]",
> for advancement from Draft Standard to Full Standard by the YAM Working
>                                   Group
> 
> Obviously, the long lines that are part of the title are messing things 
> up and the submission tool is only picking up the last line.

Yes.

Emilio, I'd suggest changing the regex to require one or more blank
lines before and after the title, and making the first group explicitly
include all lines of the {1,3} repeat (a repeated group only keeps its
last iteration, so group(1) ends up holding just the final title line,
which is what messes up this case).  Also, we should strip out newlines
and extra space from the title.

Suggested patch:

12:44 ~/src/db/yaco/idsubmit/ietf/utils
henrik@merlot $ svn diff draft.py 
Index: draft.py
===================================================================
--- draft.py	(revision 2886)
+++ draft.py	(working copy)
@@ -575,11 +575,12 @@
     def get_title(self):
         if self._title:
             return self._title
-        title_re = re.compile('(.+\n){1,3}(\s+<?draft-\S+\s*\n)')
+        title_re = re.compile('(?:\n\s*\n\s*)((.+\n){1,3})(\s+<?draft-\S+\s*\n)\s*\n')
         match = title_re.search(self.pages[0])
         if match:
             title = match.group(1)
             title = title.strip()
+            title = re.sub('\s*\n\s*', ' ', title)
             self._title = title
             return self._title
         # unusual title extract

12:45 ~/src/db/yaco/idsubmit/ietf/utils
henrik@merlot $ 
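
For anyone who wants to see the difference without running the tool, here
is a minimal standalone sketch comparing the two patterns.  The PAGE text
and the extract_title() helper are made up for illustration (they are not
part of draft.py); only the two regular expressions are taken from the
diff above, written here as raw strings.

```python
import re

# Hypothetical first-page excerpt, modelled on the draft quoted above.
PAGE = (
    "Network Working Group                                         T. Hansen\n"
    "Internet-Draft                                        AT&T Laboratories\n"
    "\n"
    "\n"
    '  Preliminary Evaluation of RFC XXX "[PLACEHOLDER: INSERT TITLE HERE]",\n'
    " for advancement from Draft Standard to Full Standard by the YAM Working\n"
    "                                  Group\n"
    "             draft-ietf-yam-pre-evaluation-template-04\n"
    "\n"
    "Abstract\n"
)

# Old pattern: group 1 is the repeated group itself, so it keeps only
# the last line matched by the {1,3} repeat.
OLD_RE = re.compile(r'(.+\n){1,3}(\s+<?draft-\S+\s*\n)')

# New pattern: an outer group wraps the whole repeat, and blank lines
# are required before the title and after the draft-name line.
NEW_RE = re.compile(r'(?:\n\s*\n\s*)((.+\n){1,3})(\s+<?draft-\S+\s*\n)\s*\n')


def extract_title(title_re, page):
    """Illustrative helper: return the cleaned-up title, or None."""
    match = title_re.search(page)
    if not match:
        return None
    title = match.group(1).strip()
    # Collapse line breaks and the indentation around them into one space.
    return re.sub(r'\s*\n\s*', ' ', title)


print(repr(extract_title(OLD_RE, PAGE)))
print(repr(extract_title(NEW_RE, PAGE)))
```

With this sample page the old pattern yields only the last title line,
while the new one yields the full title joined onto a single line.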

> Given that the XML is also being uploaded, why aren't the title, 
> abstract, and author info being extracted from there? I thought that was 
> going to be one of the enhancements being put into the new I-D 
> submission tool? That would avoid all issues with the issue-prone patterns.

Yes, but it's not part of this work order -- what we're trying to do with
this round is to replace the old tool as soon as possible, both to relieve
the pain of the old one, and to make it possible to migrate to the new
database schema.  There will be more work done on the submission tool, to
cover _all_ of the requirements of RFC 4228, but that will come after we
have this round deployed.  Metadata extraction from XML files, if available,
is among the remaining 4228 requirements to be handled in the next round.

At least this has been the plan so far ...
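
To give a rough idea of what XML-based extraction could look like (this is
only a sketch, not code from the tool or a design for the next round), the
snippet below reads title, author names and abstract from the <front>
element of an xml2rfc source file using the standard library.  SAMPLE_XML
and extract_metadata() are made-up names, and the sample avoids the DOCTYPE
entity declarations that real xml2rfc sources often carry, which a plain
ElementTree parse may not handle.

```python
import xml.etree.ElementTree as ET

# Made-up xml2rfc-style input, kept free of DOCTYPE/entity references.
SAMPLE_XML = """<rfc category="info" docName="draft-example-sample-00">
  <front>
    <title abbrev="Sample">A Sample
      Title Spanning Two Lines</title>
    <author fullname="Jane Doe" initials="J." surname="Doe">
      <organization>Example Org</organization>
    </author>
    <abstract>
      <t>This is a made-up abstract.</t>
    </abstract>
  </front>
</rfc>
"""


def extract_metadata(xml_text):
    """Illustrative helper: pull title, authors and abstract from <front>."""
    root = ET.fromstring(xml_text)
    front = root.find("front")
    if front is None:
        return None
    # Normalise whitespace so a wrapped title comes out on one line.
    title = " ".join((front.findtext("title") or "").split())
    authors = [a.get("fullname", "") for a in front.findall("author")]
    # Join all <t> paragraphs of the abstract, whitespace-normalised.
    abstract = " ".join(
        " ".join("".join(t.itertext()).split())
        for t in front.findall("abstract/t")
    )
    return {"title": title, "authors": authors, "abstract": abstract}


print(extract_metadata(SAMPLE_XML))
```

Since the title comes from a single element, the line-wrapping problem that
trips up the text patterns does not arise here.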


Best,

	Henrik