Re: [Tools-discuss] Why do we even have text formats any more?

Robert Sparks <rjsparks@nostrum.com> Wed, 28 July 2021 02:22 UTC

To: tools-discuss@ietf.org
References: <4d70a1ac-a275-420a-83f6-99dfd5b5385c@www.fastmail.com>
From: Robert Sparks <rjsparks@nostrum.com>
Message-ID: <14bd112c-fd34-44ce-dcbc-9f3b989cdd7d@nostrum.com>
Date: Tue, 27 Jul 2021 21:22:30 -0500
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/XV1RacdP3X0AEi3uTScowi4kaSM>
Subject: Re: [Tools-discuss] Why do we even have text formats any more?

This is worth exploring, but a few things:

First thought - we have some thousands of RFCs that only exist as txt, 
so I'm reading your argument as "for new things", but keep in mind we 
often have to process both old and new things (think diff).

Second thought - diff. You and I have discussed some potential
candidates for html-diff that would rival the text diff for visual
inspection, and they're still not quite there. For some things,
diffing text is still going to be the best generalizable tool.
XML-diffing continues to be more of a stretch than it intuitively
appears. (I think there's an argument here for keeping as much of the
work that would involve diffing as we can in a language like markdown,
but...)
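
To illustrate why text diffing generalizes so easily, here is a minimal
sketch using Python's standard difflib module. The helper name
text_diff, the file names, and the two sample strings are made up for
illustration; this is not how any existing IETF diff tool is written.

```python
import difflib


def text_diff(old: str, new: str,
              old_name: str = "old.txt", new_name: str = "new.txt") -> str:
    """Return a unified diff of two plain-text renderings, line by line."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # inputs were split without line endings
    )
    return "\n".join(lines)


# Hypothetical sample content standing in for two revisions of a draft.
old_draft = "1. Introduction\n\nThis document defines FOO.\n"
new_draft = "1. Introduction\n\nThis document defines FOO version 2.\n"

print(text_diff(old_draft, new_draft))
```

The same function works on any two text renderings, old or new, which
is the property that is hard to match when the inputs are HTML or XML
trees.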

Third thought - an alternative to htmlization that has already been
brought up (Carsten was first, I think), for the things we have v3 xml
for, is to create a writer that builds the htmlized view from the xml
source rather than trying to pull things by heuristics from the text.
Maybe where you're pointing would obviate that, but there may be
different decisions to make in that writer that would be advantageous.
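
A rough sketch of the difference between the two approaches, in Python.
Everything here is an assumption made for illustration: the XML
fragment is a simplified stand-in for v3 source, the "#section-N"
anchor convention and the example.org URL are invented, and neither
function reflects how the htmlizer or xml2rfc is actually implemented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical text rendering of one sentence.
TEXT = "See Section 3 of [FOO] and Section 2 for details."

# Simplified, hypothetical XML source for the same sentence.
XML = """
<rfc>
  <t>See <xref target="FOO" section="3"/> and <xref target="sec-2"/> for details.</t>
  <reference anchor="FOO" target="https://example.org/foo"/>
</rfc>
"""


def heuristic_links(text: str) -> list[str]:
    """Regex approach: every 'Section N' is assumed to be in this document."""
    pattern = r"Section (\d+(?:\.\d+)*)"
    return [f"#section-{m.group(1)}" for m in re.finditer(pattern, text)]


def xml_links(source: str) -> list[str]:
    """Source-driven approach: resolve each xref through its declared target."""
    root = ET.fromstring(source)
    # Map reference anchors to the external documents they point at.
    refs = {r.get("anchor"): r.get("target") for r in root.iter("reference")}
    links = []
    for xref in root.iter("xref"):
        target = xref.get("target")
        section = xref.get("section")
        if target in refs:
            base = refs[target]
            links.append(f"{base}#section-{section}" if section else base)
        else:
            # Not an external reference, so treat it as an in-document anchor.
            links.append(f"#{target}")
    return links


print(heuristic_links(TEXT))  # first link wrongly points into this document
print(xml_links(XML))         # first link points at the referenced document
```

The regex version has no way to know that "Section 3 of [FOO]" belongs
to another document, while the source-driven version gets that from the
markup instead of guessing.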

And finally, to your footnote, raising "why aren't people submitting 
XML?" - I've seen recently that there is fear from some seasoned 
submitters that the processor at the datatracker will get the references 
wrong. This is tied up with working in v2 and the issues we are working 
to correct with bibxml generation. Mitigating that fear will have an 
impact on the xml submission rate, I think.

RjS

On 7/27/21 8:53 PM, Martin Thomson wrote:
> I realize that this might be a little inflammatory as far as subjects go, but bear with me.
>
> There are probably a few narrow cases where rendering plain text is better than HTML.  But what we've been doing for years (thanks to Henrik's great tool) is take text and turn it into HTML using the power of regular expressions.  That's been good, but it's not always reliable (how many errata mention that "Section X of [FOO]" links to Section X of this document?).  It's also been lagging as the text format changed (case in point: lack of a table of contents).
>
> Here's an alternative: style the HTML so that it looks like the text.  I tried this and it worked shockingly well.
>
> Repo: https://github.com/martinthomson/rfc-txt-html
> Demo: https://martinthomson.github.io/rfc-txt-html/diff.html
>
> This isn't perfect, but it seems pretty good to me.  Keep in mind that this took only a little bit of time to sketch out. No doubt it can be improved.  The readme has a bunch of things I found, all minor.
>
> I don't think that this is the end of text, but a possible way to limit our use of the htmlizer[1].  People who need to automate access to content might still use text, though I will argue that XML is superior in that regard.  The other thing that comes to mind is diffs: HTML-native diff tools are somewhat less than ideal.  Either way, serving HTML is just better.
>
> Enjoy,
> Martin
>
>
> [1] Though I still see a shocking number of people authoring in XML (or XML-capable input formats) and submitting in text.  But I think we have plans to limit that.