[xml2rfc] Missing rfc2629.ent file

fenner at research.att.com (Bill Fenner) Sun, 10 April 2005 21:06 UTC

From: "fenner at research.att.com"
Date: Sun, 10 Apr 2005 21:06:44 +0000
Subject: [xml2rfc] Missing rfc2629.ent file
References: <2005331113210.552176@BBPRIME> <424C6E6A.1000401@gmx.de> <ed6d469d0504011305b753ca@mail.gmail.com> <ed6d469d05040414068a29095@mail.gmail.com> <ed6d469d05040614267a5a2e18@mail.gmail.com> <4259DCE1.2C9D@xyzzy.claranet.de>
Message-ID: <200504110406.j3B46T6I015359@bright.research.att.com>
X-Date: Sun Apr 10 21:06:44 2005

>We need a documented list of all symbolic character entities
>working in the plain text output

A first cut, taken from the list in the source code, is at
http://electricrain.com/fenner/tmp/entities.html .

>and that list should be used
>in the form of a PUBLIC rfc2629.ent file in the rfc2629.dtd.

I'd find that useful for my xxe plugin; otherwise it won't
recognize named entities.

>Okay, now I've tested to define OElig directly, no output in
>the plain text version.

It works if you don't define the entity yourself.  xml2rfc internally
defines the entities listed on the page above.  Your definition may
have overridden the internal one.

  Bill
>From fenner at research.att.com  Sun Apr 10 22:12:12 2005
From: fenner at research.att.com (Bill Fenner)
Date: Sun Apr 10 21:12:21 2005
Subject: ToC formatting (was Re: schedule for v1.30) [xml2rfc]
References: <2005410183515.764638@bbfujip7>
	<p06210203be7f8b12ee9c@[10.20.30.249]>
Message-ID: <200504110412.j3B4CC3P015492@bright.research.att.com>


At 6:35 PM -0700 4/10/05, Dave Crocker wrote:
>I would greatly prefer:
>
>    1.  Title-1
>        1.1  Sub-title-2
>        1.2  Sub-title-3
>
>    2.  Title-2
>        2.1  Sub-title-4
>        2.2  Sub-title-5

This is what I originally proposed for tocindent="yes" (and
tocompact="no" if you want the vertical spacing).  It seems
to work well for up to 3 levels of subsections but at the
4th one you're already using about half of the line for
indentation.

>Or at least:
>
>   1.   TITLE-1
>   1.1  Sub-title-2
>   1.2  Sub-title-3
>
>   2.   TITLE-2
>   2.1  Sub-title-4
>   2.2  Sub-title-5

This is the current output for tocindent="no" and tocompact="no".

At 19:03:48 -0700 on Sun, 10 Apr 2005, Paul Hoffman wrote:
>I prefer the latter for the reason Bill gave: long titles will get 
>wrapped too soon.

When I looked at the toc in RFC 3261, I immediately thought it
needed a different format.  I do like the indentation option
(but again, it's an existing option).

  Bill
>From fenner at research.att.com  Mon Apr 11 00:21:18 2005
From: fenner at research.att.com (Bill Fenner)
Date: Sun Apr 10 23:21:33 2005
Subject: [xml2rfc]  Missing rfc2629.ent file (was: Specifying non-RFC-type
	references?)
References: <2005331113210.552176@BBPRIME> <424C6E6A.1000401@gmx.de>
	<ed6d469d0504011305b753ca@mail.gmail.com>
	<ed6d469d05040414068a29095@mail.gmail.com>
	<ed6d469d05040614267a5a2e18@mail.gmail.com> <4259DCE1.2C9D@xyzzy.claranet.de>
Message-ID: <200504110621.j3B6LI7i018173@bright.research.att.com>


I've created an rfc2629.ent, which I think contains all of the entities
that xml2rfc 1.29 recognizes, at
http://rtg.ietf.org/~fenner/ietf/xml2rfc-valid/rfc2629.ent .
The CGI now rewrites the DTD reference to be local, and runs
xmllint --loaddtd, so entities should work in the validator now.
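
A rough sketch of the DTD-reference rewrite described above, in Python.
The function name, the regular expression, and the local path are
assumptions made for illustration; the validator CGI's actual code is
not shown in this thread.

```python
import re

# Matches the SYSTEM identifier of an <!DOCTYPE rfc ...> declaration.
# Assumption: documents use the SYSTEM form, as in the examples on this list.
DOCTYPE_SYSTEM = re.compile(r"""(<!DOCTYPE\s+rfc\s+SYSTEM\s+)(['"])(.*?)\2""")


def localize_dtd(xml_text, local_dtd):
    """Point the DOCTYPE's SYSTEM identifier at a local copy of the DTD."""
    def repl(match):
        prefix, quote = match.group(1), match.group(2)
        return f"{prefix}{quote}{local_dtd}{quote}"
    # Only the first DOCTYPE is rewritten; other text is left untouched.
    return DOCTYPE_SYSTEM.sub(repl, xml_text, count=1)


source = "<?xml version='1.0'?>\n<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>\n<rfc/>\n"
# "/tmp/xml2rfc-valid/rfc2629.dtd" is a hypothetical location.
rewritten = localize_dtd(source, "/tmp/xml2rfc-valid/rfc2629.dtd")
print(rewritten)
```

The rewritten file could then be handed to xmllint --loaddtd, the
option named in the message, so that entity declarations reachable
from the local DTD are loaded.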

  Bill
>From julian.reschke at gmx.de  Mon Apr 11 10:38:06 2005
From: julian.reschke at gmx.de (Julian Reschke)
Date: Mon Apr 11 00:38:32 2005
Subject: [xml2rfc]  Missing rfc2629.ent file
In-Reply-To: <200504110621.j3B6LI7i018173@bright.research.att.com>
References: <2005331113210.552176@BBPRIME> <424C6E6A.1000401@gmx.de>
	<ed6d469d0504011305b753ca@mail.gmail.com>
	<ed6d469d05040414068a29095@mail.gmail.com>
	<ed6d469d05040614267a5a2e18@mail.gmail.com> <4259DCE1.2C9D@xyzzy.claranet.de>
	<200504110621.j3B6LI7i018173@bright.research.att.com>
Message-ID: <425A295E.8040604@gmx.de>

Bill Fenner wrote:
> I've created an rfc2629.ent, which I think contains all of the entities
> that xml2rfc 1.29 recognizes, at
> http://rtg.ietf.org/~fenner/ietf/xml2rfc-valid/rfc2629.ent .
> The CGI now rewrites the DTD reference to be local, and runs
> xmllint --loaddtd, so entities should work in the validator now.

I know I sound like a broken record, but...: any work that makes xml2rfc
(or related tools) accept non-well-formed XML documents is IMHO not
only a waste of time, but also a bad thing.

If an input file needs entities which aren't defined in XML *or* the DTD
*or* the internal subset then it is *broken* and should be rejected.
Adding more and more workarounds to accept non-XML stuff is actively
harmful because it increases the number of documents that claim to be
XML but aren't.
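
To make the point concrete, here is a small Python check using only
the standard library's ElementTree parser (not xml2rfc itself). The
helper name and the sample element are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# &hellip; is not one of XML's five predefined entities, and nothing
# declares it here, so the parser rejects the document.
undeclared = "<t>and so on&hellip;</t>"

# Declaring the entity in the internal subset makes the same reference legal.
declared = '<!DOCTYPE t [<!ENTITY hellip "...">]>' + undeclared

print(parses(undeclared))
print(parses(declared))
print(ET.fromstring(declared).text)
```

A generic XML parser only accepts the reference once a declaration is
visible to it, whether that comes from the internal subset or from a
DTD the parser actually loads.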

Just my 2 cents,

Julian
>From clive at demon.net  Mon Apr 11 13:32:16 2005
From: clive at demon.net (Clive D.W. Feather)
Date: Mon Apr 11 04:32:21 2005
Subject: Issues with v1.29  [xml2rfc]
In-Reply-To: <20050409001139.GA23593@localhost.localdomain>
References: <20050408101550.GF94541@finch-staff-1.thus.net>
	<20050409001139.GA23593@localhost.localdomain>
Message-ID: <20050411113216.GF46784@finch-staff-1.thus.net>

Charles Levert said:
> First, I want to say that I appreciate you
> taking the time to write this feedback.  Thanks.
> Please interpret my response as a mere attempt
> to figure out what is best for xml2rfc.

Sure.

>> My document has grown from 117 pages to 129 using
>> exactly the same source. The single biggest cause seems to be lists, which
>> now have blank lines between the list items.
> Are you sure that you are comparing xml2rfc
> versions with all other things being equal?

Yes. I invoke xml2rfc via a script and I keep all my old copies around.
I ran the script, moved the output files to a holding area, edited the
script to change "28" to "29", and ran the script again.

> I ask because that (the list issue) sounds more
> like the result of changing an rfc-PI directive
> in the input xml.

I've just re-run the test and it definitely behaves differently for the same
source.

These are the first 12 lines of the source:

========
<?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc strict='yes'?>
<?rfc compact='no'?>
<?rfc editing='no'?> <!-- editing marks -->
<?rfc symrefs='yes'?>
<?rfc sortrefs='yes'?>
<?rfc emoticonic='yes'?>
<?rfc toc='yes'?>
<?rfc tocdepth='9'?>
<rfc ipr="full3667" docName="draft-ietf-nntpext-base-26">
<front>
=========

The first instance is in section 3.1, though there are many others:

========
<section title="Basic Concepts">
  <section title="Commands and Responses" anchor='basics'>
========
There are then 7 <t>...</t> clauses containing only text and <xref>s. Then:
========
<t>
All multi-line responses MUST adhere to the following format:
<list style="numbers">
<t>
The response consists of a sequence of one or more "lines",
each being a stream of octets ending with a CRLF pair.
Apart from those line endings, the stream MUST NOT
include the octets NUL, LF, or CR.
</t>
<t>
The first such line contains the response code as with a
single line response.
</t>
========

A blank line appears before each numbered item in 1.29, whereas 1.28 put no
blank space at all in the list, including between "format:" and "1. The".

As another example, further down the same section is the following:

========
<t>
The first digit of the response broadly indicates the
success, failure, or progress of the previous command:
<list style="empty">
<t>
          1xx - Informative message.
<vspace />2xx - Command completed OK.
<vspace />3xx - Command OK so far; send the rest of it.
<vspace />4xx - Command was syntactically correct but failed for some reason.
<vspace />5xx - Command unknown, unsupported, unavailable, or syntax error.
</t>
</list>
========

where a blank line has been added between "command" and "1xx".

> One thing that I have tweaked is the decision to
> start a new page on a <figure> or a <texttable>,
> so that would lengthen the document.

No, I did a diff and that's not the issue.

>> While these are normally better than what I had, there are times when they
>> just look bad. This is a list-by-list issue, and I think there may be a
>> need for a parameter to <list>, not a PI, to indicate whether items should
>> be spaced out or not.
> 
> I'll let mailing list participants discuss this
> and settle to a clear recommendation before
> doing anything in this direction, especially
> since a DTD change has consequences beyond the
> one xml2rfc tool.

What I've found is that a <list> is the obvious way to organise a list of
items. If these items are small (e.g. a list of names) then you don't want
gratuitous space between them. An alternative approach is to use <vspace />
after each item instead of making it a separate <t>, but that feels as if
presentation is being allowed to override the organisation of the material.

Another way to look at this is that <t> is being overloaded with two
meanings:
- paragraph
- list item
Perhaps we need a separate <item> which indicates a separate item (so can
have hangText and so on) but doesn't carry the semantics of a paragraph
(and so spacing). In this view, <t xxx=yyy> in a list would be shorthand
for <item xxx=yyy><t>.

Does anyone else have comments?

>> * Handling of hyphenated words has changed. Unfortunately, not always for
>> the better.
> Yes.  That's an unavoidable consequence of using
> logical rules for things that intrinsically
> cannot be fully expressed by them.

Understood.

>> Splitting "Internet-Draft" at the hyphen probably looks okay,
>> but splitting US-ASCII most definitely doesn't.
> Hard to disagree with.  But the program needs
> a rule for this, as it doesn't and can't have
> human intelligence.
> 
> So I used similar logic to that of TeX, albeit not
> parameterizable (the values TeX uses as defaults
> are hardcoded in xml2rfc).  From the source code:
> 
>                 # Bad:       a-zzzzzzz
>                 # Ok:       aa-zzzzzzz
>                 # Bad:  aaaaaa-zz
>                 # Ok:   aaaaaa-zzz
> 
> So I have to ask:  do you, as a
> carbon-not-silicon being, think splitting
> US-ASCII is mainly bad because US is only two
> characters, or because it's an identifier (a
> "charset" value in this case) and those generally
> should never be split?

Mostly the latter. However, I'd also usually apply the former as well (I
tend to be conservative about hyphenation; I'd want at least three, or
perhaps even four, characters in each part).
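
The rule in the quoted source comment can be sketched in Python as
follows. This is an illustration of the thresholds only, not xml2rfc's
Tcl code; the function and parameter names are invented, and it
assumes the counts are simply the characters on each side of the
hyphen within the token.

```python
def may_break_after_hyphen(token, index, min_before=2, min_after=3):
    """Return True if `token` may be split after the hyphen at `index`.

    The defaults mirror the quoted comment: at least two characters
    before the hyphen and at least three after it.
    """
    if token[index] != "-":
        raise ValueError("index must point at a hyphen")
    before = index                    # characters left of the hyphen
    after = len(token) - index - 1    # characters right of the hyphen
    return before >= min_before and after >= min_after


for word in ("a-zzzzzzz", "aa-zzzzzzz", "aaaaaa-zz", "aaaaaa-zzz", "US-ASCII"):
    print(word, may_break_after_hyphen(word, word.index("-")))

# With a more conservative threshold of three on each side,
# US-ASCII is no longer split but Internet-Draft still may be.
print(may_break_after_hyphen("US-ASCII", 2, min_before=3, min_after=3))
print(may_break_after_hyphen("Internet-Draft", 8, min_before=3, min_after=3))
```

Under the default thresholds US-ASCII qualifies for a break, which is
why a purely length-based rule cannot protect identifiers.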

> If the former, parameters can be changed, but
> if the latter, human intelligence cannot fully
> be duplicated and knowledge in all fields cannot
> practically be encoded in a little script.

Understood, which is why I suggest:

>> If this additional splitting is going to remain, we badly need a "non-break
>> hyphen" facility. There's a Unicode character U+2011 with this name and
>> semantic; I don't know if it's official, but &nbhy; seems to be used in
>> at least one place for this character, and it's as good a name as any.
> 
> On principle, I like the idea of providing the
> author with an explicit override, just in case,
> to complement a program's unavoidable shortcomings
> in logic.
> 
> As I mentioned in the last paragraph of
> 
>   <http://drakken.dbc.mtview.ca.us/pipermail/xml2rfc/2005-March/001784.html>,
> 
> there are technical difficulties due to the whole
> internal structure of xml2rfc with regard to
> this, although using U+2011 (just like U+00A0)
> could be easier to implement than a full-blown
> <nobreak>.

Surely the same approach as is used for &nbsp; could be used?

> It would only be a solution for
> hyphens, though, and not for the other characters
> that are also tried in line breaking.

What other characters are there that are tried and are also often used
within "tokens"?

> I too couldn't find a standardized entity name
> for U+2011, so I would rather advocate the plain
> use of &#x2011; or &#8209; for it.

Who standardises entity names? Perhaps we could just ask? [I really would
prefer not to have codes like this scattered through my sources.]
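
A minimal Python sketch of the idea, assuming U+2011 is handled the
way a no-break space would be: kept intact while lines are filled,
then mapped to its ASCII look-alike on output. The function below is
hypothetical and far simpler than xml2rfc's real line breaker.

```python
import re

NBHY = "\u2011"  # NON-BREAKING HYPHEN
NBSP = "\u00a0"  # NO-BREAK SPACE


def fill(text, width):
    """Greedy line filling that may break at spaces and after ASCII hyphens."""
    lines, current = [], ""
    # Split on the ASCII space only; str.split() with no argument would
    # also split on U+00A0.
    for word in text.split(" "):
        # A break is allowed after each ASCII hyphen inside a word;
        # U+2011 is not matched, so such words stay in one piece.
        pieces = [p for p in re.split(r"(?<=-)", word) if p]
        for i, piece in enumerate(pieces):
            sep = " " if i == 0 and current else ""
            if current and len(current) + len(sep) + len(piece) > width:
                lines.append(current)
                current = piece
            else:
                current += sep + piece
    if current:
        lines.append(current)
    # Only now are the protected characters turned into plain ASCII.
    return [line.replace(NBHY, "-").replace(NBSP, " ") for line in lines]


print(fill("encoded in US-ASCII only", 14))
print(fill("encoded in US" + NBHY + "ASCII only", 14))
```

With a plain hyphen the first call splits after "US-"; with U+2011 the
token moves to the next line as a whole and still prints with an
ordinary hyphen.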

>> Incidentally, the Nroff output has, at one point:
>> 
>>                                                 ... \%Internet-
>>          Drafts.
>> 
>> which has to be a perverse use of \%, surely?
> 
> Isn't that a consequence of your own
> 
>     if {!($pre)} {
>         regsub -all "\[^ \t\n\]*-" $line "\\%\&" line
>     }
> 
> code in "proc write_line_nr"?

Yes, but it's perverse. The reason I added that code was to stop hyphenated
words being split across lines. If you're going to split them, why then
tell nroff not to?!
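
For readers who do not read Tcl, the quoted regsub can be approximated
in Python as below. This is one reading of the snippet, not code taken
from xml2rfc, and the function name is invented. In nroff, \% at the
start of a word asks the formatter not to hyphenate that word.

```python
import re


def protect_hyphenated(line):
    r"""Prefix each run of non-whitespace ending in '-' with nroff's \%."""
    return re.sub(r"[^ \t\n]*-", lambda m: "\\%" + m.group(0), line)


print(protect_hyphenated("see the Internet-Drafts index"))
```

Because the prefix is added to every hyphenated token, it also lands
on a token that the line breaker has already split after its hyphen,
which gives the "\%Internet-" followed by a line break seen above.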

>> * The single- v double-space situation seems to have changed. On the one
>> hand, some "i.e. something" have been reduced to single space, which is
>> good.
> 
> Note that, as a matter of style, "i.e. "
> should never be used, but that "i.e., " should
> be used instead.

That's a matter of opinion.

>> On the other hand:
>> 
>>     Likewise the line ("." CRLF or %x2E.0D.0A) MUST NOT be
>> 
>> has gained a second space before the CRLF. I suppose I could use &nbsp;
>> here, but it might be symptomatic of another issue.
> 
> It's hard to have a general rule for this that
> would do the right thing in all cases.
> 
> To explain the logic whose use is triggered
> here, a quoted sentence should be followed by
> two spaces, just like any unquoted sentence.
> xml2rfc just doesn't know that this is to be
> treated as verbatim text/code.

Okay. I'll use &nbsp; to fix it; a break there would be infelicitous
anyway.

>> * Finally, here's an oddity for you:
>> 
>> <vspace />200 number   Success
>> <vspace />400          Failure
>> 
>> generates text that is aligned, with "Failure" under "Success". But
>> 
>> <?rfc linefile='21'?><vspace />200 number<?rfc linefile='24'?>   Success
>> <?rfc linefile='22'?><vspace />400       <?rfc linefile='25'?>   Failure
>> 
>> doesn't - the space between ">" and "Failure" is omitted.
> 
> That there is a difference looks like a bug
> (unintended of course).  I will investigate.
> 
> Can you provide me with the rest of the context
> around this in the xml input file?  I'm in fact
> wondering what causes the multiple spaces to be
> kept here in the first place since this can't
> be in an <artwork>, right?  We can discuss it
> some more before anything is changed, once I
> know about the context and understand what's
> happening better.

Here's the exact source up to an instance of the problem, with irrelevant
material omitted as marked; the lines in question are those with "211" and
"411" in them. I have many such instances.

I've checked again, and deleting all the linefile PIs fixes the problem.

========
<?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc linefile='4:unknown'?><?rfc strict='yes'?>
<?rfc compact='no'?>
<?rfc editing='no'?> <!-- editing marks -->
<?rfc symrefs='yes'?>
<?rfc sortrefs='yes'?>
<?rfc emoticonic='yes'?>
<?rfc toc='yes'?>
<?rfc tocdepth='9'?>
<?rfc linefile='13'?><rfc ipr="full3978" docName="draft-ietf-nntpext-base-26">
<front>
[...content omitted...]
</front>
<?rfc linefile='97'?><!-- The actual RFC -->
<?rfc linefile='99'?><middle>
[...5 whole sections omitted...]
<?rfc linefile='1916'?><section title="Article posting and retrieval" anchor="article.handling">
<?rfc linefile='1918'?><t>
[...plain text omitted...]
</t>
<t>
[...plain text omitted...]
</t>
<t>
[...plain text omitted...]
</t>
<?rfc linefile='1952'?>  <section title="Group and article selection">
<?rfc linefile='1954'?><t>
[...plain text omitted...]
</t>
<?rfc linefile='1962'?><?rfc linefile='1962'?><!-- N.command name="GROUP" indic="READER" -->
<?rfc linefile='1962'?><section anchor="group" title="GROUP" >
<?rfc linefile='1962'?><section toc="exclude" anchor="group.usage" title="Usage" >
<?rfc linefile='1962'?><t><list style="hanging">
<?rfc linefile='1962'?><t hangText="Indicating capability: READER" />
<?rfc linefile='1962'?>  <t hangText="Syntax">
<?rfc linefile='1964'?><?rfc linefile='1964'?><!-- N.variant args="group" -->
<?rfc linefile='1964'?><vspace />GROUP group
<?rfc linefile='1965'?><?rfc linefile='1965'?><!-- N.response code="211" args="number low high group" text="Group successfully selected" / -->
<?rfc linefile='1965'?><?rfc linefile='1965'?><?rfc linefile='1967'?><?rfc linefile='1967'?><!-- N.response code="411" text="No such newsgroup" / -->
<?rfc linefile='1967'?><?rfc linefile='1967'?><?rfc linefile='1968'?><?rfc linefile='1970'?><?rfc linefile='1970'?><!-- N.parameter name="group" text="name of newsgroup" / -->
<?rfc linefile='1970'?><?rfc linefile='1970'?><?rfc linefile='1971'?><?rfc linefile='1971'?><!-- N.parameter name="number" text="estimated number of articles in the group" / -->
<?rfc linefile='1971'?><?rfc linefile='1971'?><?rfc linefile='1972'?><?rfc linefile='1972'?><!-- N.parameter name="low" text="reported low water mark" / -->
<?rfc linefile='1972'?><?rfc linefile='1972'?><?rfc linefile='1973'?><?rfc linefile='1973'?><!-- N.parameter name="high" text="reported high water mark" / -->
<?rfc linefile='1973'?><?rfc linefile='1973'?><?rfc linefile='1975'?><?rfc linefile='1975'?><!-- N.description -->
<?rfc linefile='1975'?><?rfc linefile='1975'?></t>
<?rfc linefile='1975'?><t hangText="Responses">
<?rfc linefile='1975'?><?rfc linefile='1975'?><?rfc linefile='1975'?><?rfc linefile='1975'?><?rfc linefile='1975'?><vspace />211 number low high group<?rfc linefile='1975'?>   Group successfully selected
<?rfc linefile='1975'?><?rfc linefile='1975'?><vspace />411                      <?rfc linefile='1975'?>   No such newsgroup
<?rfc linefile='1975'?><?rfc linefile='1975'?><?rfc linefile='1975'?><?rfc linefile='1975'?></t>
<?rfc linefile='1975'?><t hangText="Parameters">
<?rfc linefile='1975'?><vspace />group  = name of newsgroup
<?rfc linefile='1975'?><vspace />number = estimated number of articles in the group
<?rfc linefile='1975'?><vspace />low    = reported low water mark
<?rfc linefile='1975'?><vspace />high   = reported high water mark
<?rfc linefile='1975'?></t>
========

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            |
>From clive at demon.net  Mon Apr 11 13:38:37 2005
From: clive at demon.net (Clive D.W. Feather)
Date: Mon Apr 11 04:38:41 2005
Subject: schedule for v1.30  [xml2rfc]
In-Reply-To: <20050408072152.GA11933@localhost.localdomain>
References: <cc93fb38a2540e6df855a32b0296a584@dbc.mtview.ca.us>
	<f7456d4519124559586d557ddfee8613@dbc.mtview.ca.us>
	<4256080D.B38@xyzzy.claranet.de>
	<20050408072152.GA11933@localhost.localdomain>
Message-ID: <20050411113837.GG46784@finch-staff-1.thus.net>

Charles Levert said:
>> No problem:  unpaginated output.  It should be unnecessary, but
>> my attempt produced bogus off by one empty line "differences".
> Ok.  Let's leave nothing unspecified about its
> design before deciding to go ahead with it.
> 
> -- What should be output everywhere a page number
>    normally would be?
>    The one in the right footer is no longer
>    there.
>    But what about the ones in the ToC?
>    Any others I forget?

A suggestion for people to consider: since the commonest use of
unpaginated output is to allow comparisons without page breaks getting in
the way, instead of producing completely unpaginated output, produce output
that corresponds to infinitely long pages. That is, retain page breaks
where they're imposed by the format - at each major section - but omit them
where they'd only be required because the 56 (or whatever it is) lines had
been reached. In other words, just replace the line length of 56 by a
very-big-number and suppress the blank lines at the end of a page; apart
from that, generate the output completely normally.

I suggest that this will be *better* for comparisons, because it will show
a significant change in page breaks.
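
A sketch of the "infinitely long pages" idea in Python. The data
layout (blocks of lines, each flagged if the format forces a page
break before it) and all names are assumptions for illustration; this
is not how xml2rfc stores its output.

```python
import sys


def paginate(blocks, page_length):
    """Split blocks of lines into pages.

    blocks is a list of (lines, force_break_before) pairs. A forced break
    always starts a new page; otherwise a new page starts only when the
    current one already holds page_length lines.
    """
    pages = [[]]
    for lines, force_break_before in blocks:
        if force_break_before and pages[-1]:
            pages.append([])
        for line in lines:
            if len(pages[-1]) >= page_length:
                pages.append([])
            pages[-1].append(line)
    return pages


blocks = [(["a"] * 5, False), (["b"] * 5, True)]

print(len(paginate(blocks, 3)))            # normal, short pages
print(len(paginate(blocks, sys.maxsize)))  # "infinitely long" pages
```

With the very large page length only the forced break survives, so a
diff between two runs shows moved forced breaks but is not cluttered
by breaks caused by page length alone.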

> -- From a calling convention point-of-view,
>    should this be implemented as a new rendering
>    engine (a "mode")?

Yes.

>    -- If so, what would its name be?
>       (To be used in "xml2foo" and "file.foo".)

upt ("unpaginated text") or np.txt ("no pages, text").

>       -- Should it then be specified as a new
>          rfc-PI directive?

Definitely not. Doing this means you have to change the source to generate
the final output, something that is the bane of regression testing.

>From fenner at research.att.com  Mon Apr 11 07:55:21 2005
From: fenner at research.att.com (Bill Fenner)
Date: Mon Apr 11 06:55:29 2005
Subject: [xml2rfc]  Missing rfc2629.ent file
References: <2005331113210.552176@BBPRIME> <424C6E6A.1000401@gmx.de>
	<ed6d469d0504011305b753ca@mail.gmail.com>
	<ed6d469d05040414068a29095@mail.gmail.com>
	<ed6d469d05040614267a5a2e18@mail.gmail.com> <4259DCE1.2C9D@xyzzy.claranet.de>
	<200504110621.j3B6LI7i018173@bright.research.att.com>
	<425A295E.8040604@gmx.de>
Message-ID: <200504111355.j3BDtLKn028395@bright.research.att.com>


>If an input file needs entities which aren't defined in [...] the DTD

I thought Frank's proposal was to add the entities which xml2rfc accepts
to the rfc2629 DTD.  (Not just my private copy.)

  Bill
>From nobody at xyzzy.claranet.de  Mon Apr 11 19:01:30 2005
From: nobody at xyzzy.claranet.de (Frank Ellermann)
Date: Mon Apr 11 09:09:53 2005
Subject: [xml2rfc] Re: Missing rfc2629.ent file
References: <2005331113210.552176@BBPRIME> <424C6E6A.1000401@gmx.de>
	<ed6d469d0504011305b753ca@mail.gmail.com>
	<ed6d469d05040414068a29095@mail.gmail.com>
	<4259DCE1.2C9D@xyzzy.claranet.de>
	<200504110621.j3B6LI7i018173@bright.research.att.com>
	<200504111355.j3BDtLKn028395@bright.research.att.com>
Message-ID: <425A9F59.318A@xyzzy.claranet.de>

Bill Fenner wrote:

> I thought Frank's proposal was to add the entities which
> xml2rfc accepts to the rfc2629 DTD.  (Not just my private
> copy.)

Yes, of course.  There are reasons why xml2rfc 1.29 accepts
&hellip; etc., therefore it's necessary to document these
symbolic character references in the DTD (or in an additional
ENT included by the DTD).

Otherwise validator.w3.org or your xml2rfc_valid won't know
what &hellip; etc. are.  I'm not sure about &amp; &lt; &gt;
&apos; &quot; - if XHTML 1.0 defines them explicitly they are
probably undefined otherwise, undefined is invalid, and
invalid is bad.

Because xml2rfc is about plain text ASCII and not the ugly
proportional HTML output, an official rfc2629.ent (included
by the DTD) could list the ASCII strings for these entities.
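
As an illustration, such a file could be generated from a table of
names and ASCII strings. The Python sketch below uses a made-up table;
the three entries and their replacement strings are assumptions, not
the list that xml2rfc 1.29 actually recognises.

```python
# Hypothetical name-to-ASCII table; the real list would come from xml2rfc.
ASCII_FALLBACKS = {
    "hellip": "...",
    "OElig": "OE",
    "mdash": "--",
}


def entity_declarations(fallbacks):
    """Render one <!ENTITY ...> declaration per name, sorted by name."""
    lines = []
    for name in sorted(fallbacks):
        value = fallbacks[name]
        # Keep the sketch simple: refuse non-ASCII text and characters
        # that would need escaping or cause trouble in an entity value.
        if not value.isascii() or any(ch in value for ch in '&%<"'):
            raise ValueError(f"unsupported replacement text for {name!r}")
        lines.append(f'<!ENTITY {name} "{value}">')
    return "\n".join(lines) + "\n"


print(entity_declarations(ASCII_FALLBACKS), end="")
```

The output is a plain list of declarations that a DTD could pull in
through a parameter entity, so that validators see the same names as
the formatter.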

As few as possible, <http://www.w3.org/TR/charmod/#C047> :

&hellip; is only useful because a proportional "..." in the
HTML output is ugly.  A hardwired but undocumented support
for U+2026 would be a bad idea.  If the input encoding is
not ASCII the author has already dropped the ball, xml2rfc
would be forced to try to be smart for non-ASCII input, and
tools trying to be smart are always a disaster.

Enough KISS-evangelism for today, unless Julian disagrees ;-)

                       Bye, Frank