Re: [Xml-sg-cmt] WeasyPrint Update

Sandy Ginoza <sginoza@amsl.com> Tue, 05 July 2022 20:40 UTC

From: Sandy Ginoza <sginoza@amsl.com>
Date: Tue, 05 Jul 2022 13:40:40 -0700
Cc: Robert Sparks <rjsparks@nostrum.com>, "xml-sg-cmt@ietf.org" <xml-sg-cmt@ietf.org>
To: Alice Russo <arusso@amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml-sg-cmt/4D9nqXiXyx3PcEI2Nc5ypzZC_jg>
Subject: Re: [Xml-sg-cmt] WeasyPrint Update

Hi all,

I looked at a handful of random PDFs.  Note that I did not compare these side by side with the published RFCs, but I did spot-check against the published version when I wanted to see how something was handled by the earlier version of WeasyPrint.  I was under the impression we were going to accept that formatting changes will occur as WeasyPrint is updated, so I mostly just scanned the newer files for oddities.

1) Page break between the ToC header and the ToC contents, seemingly because the contents fit exactly on one or two pages (differs from the published version); looks odd.
https://devbox.amsl.com/weasyprint55/rfc8650.pdf
https://devbox.amsl.com/weasyprint55/rfc8670.pdf
https://devbox.amsl.com/weasyprint55/rfc8681.pdf


2) Made me wonder about table breaks: should a table be kept together when it can be?  AFAIK, no change from current handling.  Example: Table 4 is ugly, but I imagine it could be just as ugly to try to force a table onto one page, so maybe no change.
https://devbox.amsl.com/weasyprint55/rfc8651.pdf 


3) Table on p. 10: the header seemingly breaks from the content (not broken in the published version).
https://devbox.amsl.com/weasyprint55/rfc8670.pdf


4) Blank p. 15 (the published version has a blank p. 14).
https://devbox.amsl.com/weasyprint55/rfc8681.pdf 


5) OK; nothing stood out as better or worse
https://devbox.amsl.com/weasyprint55/rfc8706.pdf
https://devbox.amsl.com/weasyprint55/rfc8740.pdf
https://devbox.amsl.com/weasyprint55/rfc8790.pdf
https://devbox.amsl.com/weasyprint55/rfc9000.pdf


Thanks,
Sandy 



> On Jun 30, 2022, at 3:19 PM, Alice Russo <arusso@amsl.com> wrote:
> 
> Hi Robert,
> 
> Re:
>> I'm not sure we can do a real comparison without running these through the pdfaPilot step, which essentially rewrites the pdf.
>> 
>> Alice - is it easy to script running all the things at [3] through pdfaPilot? If not, could you run 8779 through (as that's the semi-random one I chose to look at first).
> 
> Posted here: https://www.rfc-editor.org/v3test/test8779.pdf
> 
> The command used was:
> pdfaPilot --collection --embedinto=A3u,rfc8779wp55.pdf --embedfile=No,Source,rfc8779.xml --outputfile=test8779.pdf
> (where rfc8779wp55.pdf is https://devbox.amsl.com/weasyprint55/rfc8779.pdf)
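On the question of scripting this for the whole set, the following is a rough sketch only. It builds (but does not run) one pdfaPilot command line per RFC, using exactly the flags from the command above. The function names, the "wp55" file-name suffix applied to every RFC, and the assumption that each PDF and XML file sits in the working directory are illustrative, not something confirmed in this thread.

```python
import shlex
from typing import Iterable, List


def build_pdfapilot_cmd(rfc_number: int, suffix: str = "wp55") -> List[str]:
    """Return the argument list for one RFC, mirroring the command quoted above.

    Hypothetical helper: the file-naming pattern is assumed from the single
    rfc8779 example and may not hold for other documents.
    """
    source_pdf = f"rfc{rfc_number}{suffix}.pdf"  # WeasyPrint 55 output (assumed name)
    source_xml = f"rfc{rfc_number}.xml"          # XML source to embed
    output_pdf = f"test{rfc_number}.pdf"         # result file
    return [
        "pdfaPilot",
        "--collection",
        f"--embedinto=A3u,{source_pdf}",
        f"--embedfile=No,Source,{source_xml}",
        f"--outputfile={output_pdf}",
    ]


def format_batch(rfc_numbers: Iterable[int]) -> List[str]:
    """Render one shell-ready command string per RFC number."""
    return [shlex.join(build_pdfapilot_cmd(n)) for n in rfc_numbers]


if __name__ == "__main__":
    # Print the commands for review; running them is left to the operator.
    for line in format_batch([8779]):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to try; each list could instead be handed to subprocess.run once the file names are confirmed. Not every number in a range corresponds to a published RFC, so a real script would need to skip missing files.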
> 
> They look the same to me, as expected (more below on that topic). The PDF diff tool draftable shows only one change repeatedly (apparently a draftable bug; it's tripping on the characters 'fi'): https://draftable.com/compare/IAdOLoZDctSw
> 
> Perhaps more relevant is comparing rfc8779.pdf (as published) vs. rfc8779wp55.pdf:
> https://draftable.com/compare/PRHdkYkTIoHP
> (My take is nothing egregious there. Differences to page breaking expected; it's unfortunate that a one-line <t> preceding <artwork> no longer stays on the page with the artwork -- pages 11/12 and 13/14. FWIW, the source XML does not contain keepWithNext; the published rendering was good without 'forcing' it.)
> 
> Re:
>> On Jun 29, 2022, at 2:48 PM, Kesara Rathnayake <kesara@staff.ietf.org> wrote:
>> 
>>>> I'm not sure we can do a real comparison without running these through the pdfaPilot step, which essentially rewrites the pdf.
>>> It's not likely to change the appearance unless the PDF is depending on fonts other than the standard ones or ones included in the PDF which would be a bug.  PDF/A mostly fills in defaults and divides some internal data structures into smaller chunks.
> 
> I agree re: "not likely to change the appearance". At the time of the format change, we were told (and have found during limited visual checks) that there are no changes to appearance after running pdfaPilot.  For each RFC, the PDF from before running pdfaPilot is archived internally; can post if needed.
> 
> 
> That said, to Kesara's point about looking at PDFs of more recent RFCs (produced by WeasyPrint 52.5), I'll do some comparing of a few recent ones vs. files in [3] and report back.
> 
> Thanks,
> Alice
> 
>> 
>> I'm already seeing the differences in figure/table layout that can affect where pagebreaks lie.
>> 
>> Most of the other differences I see are in indentation, spacing between paragraphs, etc - makes me wonder if the css is being honored as intended. These add up over pages to change the overall length of the document in pages (though the pagebreak algorithm change makes that unavoidable). Again, I'm curious to see if these go away when run through pdfaPilot.
>> 
>> 
>> On 6/28/22 8:28 PM, Kesara Rathnayake wrote:
>>> Hi all,
>>> 
>>> I have draft PR [1] for the WeasyPrint update.
>>> This updates WeasyPrint from 52.5 to 55.0.
>>> Since WeasyPrint 53.0, they have moved the PDF generation from cairo to pydyf [2].
>>> I have generated PDFs from RFC 8650 to RFC 9260 [3].
>>> 
>>> There are some differences from my random checks.
>>> 
>>> Let me know your thoughts.
>>> 
>>> Note that these PDFs haven't gone through the pdfaPilot step to convert to PDF/A-3 with the XML source file embedded.
>>> 
>>> [1] https://github.com/ietf-tools/xml2rfc/pull/802
>>> [2] https://github.com/CourtBouillon/pydyf
>>> [3] https://devbox.amsl.com/weasyprint55/
>>> 
>>> Cheers,
>>> Kesara
>> 
>> -- 
>> Xml-sg-cmt mailing list
>> Xml-sg-cmt@ietf.org
>> https://www.ietf.org/mailman/listinfo/xml-sg-cmt
>> 
> 