Re: [yaco-idsubmit-tool] Title extraction failure
Henrik Levkowetz <henrik@levkowetz.com> Mon, 14 March 2011 11:53 UTC
Message-ID: <4D7E01FB.2010105@levkowetz.com>
Date: Mon, 14 Mar 2011 12:54:35 +0100
From: Henrik Levkowetz <henrik@levkowetz.com>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15) Gecko/20110303 Thunderbird/3.1.9
MIME-Version: 1.0
To: Tony Hansen <tony@att.com>
References: <4D7A574F.8010406@levkowetz.com> <4D7BAE92.2070908@yaco.es> <4D7CF91D.6060704@levkowetz.com> <4D7D154F.3060204@att.com>
In-Reply-To: <4D7D154F.3060204@att.com>
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Cc: yaco-idsubmit-tool@ietf.org
Subject: Re: [yaco-idsubmit-tool] Title extraction failure
X-BeenThere: yaco-idsubmit-tool@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: Discussion of the Yaco / I-D Submission Tool Project <yaco-idsubmit-tool.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/yaco-idsubmit-tool>, <mailto:yaco-idsubmit-tool-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/yaco-idsubmit-tool>
List-Post: <mailto:yaco-idsubmit-tool@ietf.org>
List-Help: <mailto:yaco-idsubmit-tool-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/yaco-idsubmit-tool>, <mailto:yaco-idsubmit-tool-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 14 Mar 2011 11:53:22 -0000
(I changed the subject line to match the new issue)

On 2011-03-13 20:04 Tony Hansen said:
> Here's a difficult one for the tool:
>
>     draft-ietf-yam-pre-evaluation-template-04.txt
>     draft-ietf-yam-pre-evaluation-template-04.xml
>
> (I adjusted the dates to get past the date checks.)
>
> The title is being picked up as
>
>     Group
>
> instead of as
>
>     Preliminary Evaluation of RFC XXX "[PLACEHOLDER: INSERT TITLE HERE]",
>     for advancement from Draft Standard to Full Standard by the YAM Working
>     Group
>
> Obviously, the long lines that are part of the title are messing things
> up and the submission tool is only picking up the last line.

Yes. Emilio, I'd suggest changing the regex to require one or more blank
lines before and after the title, and making the first group explicitly
include all lines of the {1, 3} repeat (this is what messes up this case).
Also, we should strip out newlines and extra space from the title.

Suggested patch:

12:44 ~/src/db/yaco/idsubmit/ietf/utils henrik@merlot $ svn diff draft.py
Index: draft.py
===================================================================
--- draft.py    (revision 2886)
+++ draft.py    (working copy)
@@ -575,11 +575,12 @@
     def get_title(self):
         if self._title:
             return self._title
-        title_re = re.compile('(.+\n){1,3}(\s+<?draft-\S+\s*\n)')
+        title_re = re.compile('(?:\n\s*\n\s*)((.+\n){1,3})(\s+<?draft-\S+\s*\n)\s*\n')
         match = title_re.search(self.pages[0])
         if match:
             title = match.group(1)
             title = title.strip()
+            title = re.sub('\s*\n\s*', ' ', title)
             self._title = title
             return self._title
         # unusual title extract
12:45 ~/src/db/yaco/idsubmit/ietf/utils henrik@merlot $

> Given that the XML is also being uploaded, why aren't the title,
> abstract, and author info being extracted from there? I thought that was
> going to be one of the enhancements being put into the new I-D
> submission tool? That would avoid all issues with the issue-prone patterns.

Yes, but it's not part of this work order -- what we're trying to do with
this round is to replace the old tool as soon as possible, both to relieve
the pain of the old one, and to make it possible to migrate to the new
database schema.

There will be more work done on the submission tool, to cover _all_ of the
requirements of RFC 4228, but that will come after we have this round
deployed. Metadata extraction from XML files, if available, is among the
remaining 4228 requirements to be handled in the next round.

At least this has been the plan so far ...

Best,

Henrik
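
For anyone who wants to try the regex change from the patch outside the tool, here is a minimal standalone sketch. It assumes Python 3 (so the patterns are written as raw strings), and the names extract_title and SAMPLE_PAGE, as well as the sample first page itself, are made up for illustration; they are not part of draft.py. It only shows why a repeated group returns the last title line, and how wrapping the repeat in an outer group captures all of them.

```python
import re

# The two patterns from the patch, written as raw strings for Python 3.
OLD_TITLE_RE = re.compile(r'(.+\n){1,3}(\s+<?draft-\S+\s*\n)')
NEW_TITLE_RE = re.compile(
    r'(?:\n\s*\n\s*)((.+\n){1,3})(\s+<?draft-\S+\s*\n)\s*\n'
)


def extract_title(first_page, title_re=NEW_TITLE_RE):
    """Hypothetical helper mirroring the patched get_title() logic."""
    match = title_re.search(first_page)
    if not match:
        return None
    # With the old pattern, group(1) is only the LAST repeat of (.+\n);
    # with the new pattern, group(1) wraps the whole {1,3} repeat.
    title = match.group(1).strip()
    # Collapse line breaks and surrounding whitespace into single spaces.
    return re.sub(r'\s*\n\s*', ' ', title)


# Invented first page with a title wrapped over three lines.
SAMPLE_PAGE = (
    "Network Working Group                                   A. Author\n"
    "Internet-Draft                                            Example\n"
    "Expires: September 2011                            March 14, 2011\n"
    "\n"
    "\n"
    "     A Fairly Long Example Title That Has Been Wrapped\n"
    "             Across Three Separate Lines by the\n"
    "                        Formatter\n"
    "                  draft-example-title-00\n"
    "\n"
    "Abstract\n"
)

if __name__ == "__main__":
    print("old:", extract_title(SAMPLE_PAGE, OLD_TITLE_RE))
    print("new:", extract_title(SAMPLE_PAGE, NEW_TITLE_RE))
```

On this sample the old pattern yields only the last title line, matching the "Group" symptom reported above, while the new pattern yields the full title joined onto one line. Note that the new pattern requires a blank line after the draft-name line, so a first page laid out differently may not match at all.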