Re: [Xml-sg-cmt] WeasyPrint Update

Alice Russo <arusso@amsl.com> Thu, 30 June 2022 22:19 UTC


Hi Robert,

Re:
> I'm not sure we can do a real comparison without running these through the pdfaPilot step, which essentially rewrites the pdf.
> 
> Alice - is it easy to script running all the things at [3] through pdfaPilot? If not, could you run 8779 through (as that's the semi-random one I chose to look at first).

Posted here: https://www.rfc-editor.org/v3test/test8779.pdf

The command used was:
pdfaPilot --collection --embedinto=A3u,rfc8779wp55.pdf --embedfile=No,Source,rfc8779.xml --outputfile=test8779.pdf
(where rfc8779wp55.pdf is https://devbox.amsl.com/weasyprint55/rfc8779.pdf)
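
On scripting this for everything at [3]: a minimal sketch, not something that has been run against the real files. It reuses only the pdfaPilot options from the command above. The directory layout, the rfcNNNN.pdf / rfcNNNN.xml / testNNNN.pdf naming, and the helper names (build_command, plan_batch, run_batch) are assumptions made for illustration, as is pdfaPilot being on the PATH.

```python
import re
import subprocess
from pathlib import Path

# Assumed naming: WeasyPrint output is rfcNNNN.pdf, source is rfcNNNN.xml.
RFC_PDF = re.compile(r"^rfc(\d+)\.pdf$")


def build_command(pdf_path, xml_path, out_path):
    """Mirror the single-file pdfaPilot command, with paths substituted."""
    return [
        "pdfaPilot",
        "--collection",
        f"--embedinto=A3u,{pdf_path}",
        f"--embedfile=No,Source,{xml_path}",
        f"--outputfile={out_path}",
    ]


def plan_batch(pdf_dir, xml_dir, out_dir):
    """Return one command per PDF that has a matching XML source."""
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("rfc*.pdf")):
        match = RFC_PDF.match(pdf_path.name)
        if not match:
            continue  # skip files that do not follow the assumed naming
        number = match.group(1)
        xml_path = Path(xml_dir) / f"rfc{number}.xml"
        if not xml_path.exists():
            continue  # no source to embed; leave this one for manual review
        out_path = Path(out_dir) / f"test{number}.pdf"
        commands.append(build_command(pdf_path, xml_path, out_path))
    return commands


def run_batch(commands, dry_run=True):
    """Print the commands by default; execute them only when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(str(part) for part in cmd))
        else:
            subprocess.run(cmd, check=True)
```

With dry_run left at its default, run_batch(plan_batch("weasyprint55", "xml", "out")) only prints the commands, so the list can be checked before anything is rewritten. The three directory names here are placeholders.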

They look the same to me, as expected (more below on that topic). The PDF diff tool draftable shows only one change repeatedly (apparently a draftable bug; it's tripping on the characters 'fi'): https://draftable.com/compare/IAdOLoZDctSw

Perhaps more relevant is comparing rfc8779.pdf (as published) vs. rfc8779wp55.pdf:
https://draftable.com/compare/PRHdkYkTIoHP
(My take is that nothing there is egregious. Differences in page breaking are expected; it's unfortunate that a one-line <t> preceding <artwork> no longer stays on the page with the artwork -- see pages 11/12 and 13/14. FWIW, the source XML does not contain keepWithNext; the published rendering was good without 'forcing' it.)

Re:
> On Jun 29, 2022, at 2:48 PM, Kesara Rathnayake <kesara@staff.ietf.org> wrote:
> 
>>> I'm not sure we can do a real comparison without running these through the pdfaPilot step, which essentially rewrites the pdf.
>> It's not likely to change the appearance unless the PDF is depending on fonts other than the standard ones or ones included in the PDF which would be a bug.  PDF/A mostly fills in defaults and divides some internal data structures into smaller chunks.

I agree re: "not likely to change the appearance". At the time of the format change, we were told (and have found during limited visual checks) that there are no changes to appearance after running pdfaPilot. For each RFC, the PDF from before running pdfaPilot is archived internally; I can post those if needed.


That said, to Kesara's point about looking at PDFs of more recent RFCs (produced by WeasyPrint 52.5), I'll do some comparing of a few recent ones vs. files in [3] and report back.

Thanks,
Alice

> 
> I'm already seeing the differences in figure/table layout that can affect where pagebreaks lie.
> 
> Most of the other differences I see are in indentation, spacing between paragraphs, etc - makes me wonder if the css is being honored as intended. These add up over pages to change the overall length of the document in pages (though the pagebreak algorithm change makes that unavoidable). Again, I'm curious to see if these go away when run through pdfaPilot.
> 
> 
> On 6/28/22 8:28 PM, Kesara Rathnayake wrote:
>> Hi all,
>> 
>> I have draft PR [1] for the WeasyPrint update.
>> This updates WeasyPrint from 52.5 to 55.0.
>> Since WeasyPrint 53.0, they have moved the PDF generation from cairo to pydyf [2].
>> I have generated PDFs from RFC 8650 to RFC 9260 [3].
>> 
>> There are some differences from my random checks.
>> 
>> Let me know your thoughts.
>> 
>> Note that these PDFs haven't gone through the pdfaPilot step to convert to PDF/A-3 with the XML source file embedded.
>> 
>> [1] https://github.com/ietf-tools/xml2rfc/pull/802
>> [2] https://github.com/CourtBouillon/pydyf
>> [3] https://devbox.amsl.com/weasyprint55/
>> 
>> Cheers,
>> Kesara