Re: ETag specification: load balance friendly and merge with Digest header from

Roberto Polli <robipolli@gmail.com> Mon, 13 July 2020 14:52 UTC

From: Roberto Polli <robipolli@gmail.com>
Date: Mon, 13 Jul 2020 16:49:22 +0200
Message-ID: <CAP9qbHX4ety3zZBYFZb5dvWinpP-z=L0sp0hzDofjLFe_=qmdw@mail.gmail.com>
To: Sergey Ponomarev <stokito@gmail.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Archived-At: <https://www.w3.org/mid/CAP9qbHX4ety3zZBYFZb5dvWinpP-z=L0sp0hzDofjLFe_=qmdw@mail.gmail.com>

Hi Sergey,

The Digest header was introduced long ago in RFC 3230. We are just updating
it...

Its goal is different from that of ETag, though you can use digest algorithms
to compute strong ETags. Note, however, that the digest changes when a
resource is localised (e.g. via Content-Language), while weak ETags probably
won't.

If someone thinks we should describe the relationship between Digest and
ETags in the new spec, we can do it.

Have a nice day,
R

On Mon, 13 Jul 2020, 02:28 Sergey Ponomarev <stokito@gmail.com> wrote:

> Hi,
>
> I just implemented ETag caching for BusyBox httpd, which is an HTTP server
> for embedded devices like WiFi routers.
> While implementing it I had to choose what exactly should be generated as
> the ETag.
> ETag is specified in https://tools.ietf.org/html/rfc2616#section-14.19 as
> an opaque value, and a server is free to generate it as it needs.
> https://httpwg.org/specs/rfc7232.html#rfc.section.2.3 (Conditional
> Requests) explains strategies for generating and comparing ETags better.
> But even the upcoming HTTP Caching draft-ietf-httpbis-cache-09 gives no
> practical details about ETag generation.
>
> I did a little research and found that every web server does it in its own
> way, and this causes several problems:
> 1. The ETag may be badly or even wrongly generated.
> 2. When two different servers, e.g. Apache and Nginx, are behind a load
> balancer, their ETags will always be discarded because they are
> generated differently. That's why some sysadmins disable ETag on one of the
> servers.
> These problems could easily be fixed if the HTTP specification provided a
> recommended way to generate ETags while keeping freedom of choice.
>
> A typical ETag is based on the file's last modification time and size,
> which can be easily retrieved from the file system, but it can also be a
> stricter hash or checksum, and sometimes a semantic version.
>
> Just a quick overview of typical algorithms used in web servers.
> Consider a file with
> * Size 1047, i.e. 417 in hex.
> * MTime, i.e. last modification, on Mon, 06 Jan 2020 12:54:56 GMT, which
> is 1578315296 seconds in Unix time or 1578315296666771000 nanoseconds.
> * Inode, which is a physical file number, 66, i.e. 42 in hex.
>
> Different web servers return ETags like:
> Nginx: "5e132e20-417", i.e. "hex(MTime)-hex(Size)". Not configurable.
> Apache/2.2: "42-417-59b782a99f493", i.e.
> "hex(INode)-hex(Size)-hex(MTime in microseconds)". Can be configured, but
> MTime will be in microseconds anyway:
> http://httpd.apache.org/docs/2.4/mod/core.html#fileetag
> Apache/2.4: "417-59b782a99f493", i.e. "hex(Size)-hex(MTime in
> microseconds)", i.e. without INode, which is friendly for load balancing
> when an identical file has a different INode on different servers.
> OpenWrt uhttpd: "42-417-5e132e20", i.e.
> "hex(INode)-hex(Size)-hex(MTime)". Not configurable.
> Tomcat 9: W/"1047-1578315296666", i.e. Weak "Size-MTime in milliseconds".
> This is an incorrect ETag because it should be strong for a static file,
> i.e. guarantee octet-for-octet equality.
> lighttpd: the weirdest: "hashcode(42-1047-1578315296666771000)", i.e.
> INode-Size-MTime, but then reduced to a simple integer by hashcode. Can be
> configured, but you can only disable one part (etag.use-inode = "disabled").
>
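To make the mismatch concrete, here is a minimal C sketch that formats three of the layouts listed above from the same sample file. The struct and function names are made up for illustration and are not taken from any server's source. The layouts are inferred only from the sample values, where 59b782a99f493 decodes to 1578315296666771, i.e. the modification time in microseconds.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Attributes of the sample file (hypothetical struct for this sketch). */
struct file_info {
    uint64_t inode;      /* 66, i.e. 0x42 */
    uint64_t size;       /* 1047, i.e. 0x417 */
    uint64_t mtime_usec; /* modification time, microseconds since the epoch */
};

/* "hex(mtime in seconds)-hex(size)": the Nginx-style layout. */
static int etag_mtime_size(char *buf, size_t n, const struct file_info *f)
{
    return snprintf(buf, n, "\"%" PRIx64 "-%" PRIx64 "\"",
                    f->mtime_usec / 1000000, f->size);
}

/* "hex(size)-hex(mtime in microseconds)": the Apache 2.4-style layout. */
static int etag_size_mtime_usec(char *buf, size_t n, const struct file_info *f)
{
    return snprintf(buf, n, "\"%" PRIx64 "-%" PRIx64 "\"",
                    f->size, f->mtime_usec);
}

/* "hex(inode)-hex(size)-hex(mtime in seconds)": the uhttpd-style layout. */
static int etag_inode_size_mtime(char *buf, size_t n, const struct file_info *f)
{
    return snprintf(buf, n, "\"%" PRIx64 "-%" PRIx64 "-%" PRIx64 "\"",
                    f->inode, f->size, f->mtime_usec / 1000000);
}
```

For inode 66, size 1047 and an mtime of 1578315296666771 microseconds, the three functions produce "5e132e20-417", "417-59b782a99f493" and "42-417-5e132e20", so a validator issued by one server cannot match on another.
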
> Hex numbers are used here so often because it's cheap to convert a decimal
> number to a shorter hex string.
> Inode, while adding more guarantees, makes load balancing impossible and is
> very fragile if you simply copy the file during application redeploy.
> Sub-second MTime precision is not available on all platforms and we don't
> need such granularity. Apache has reported bugs on this, like
> https://bz.apache.org/bugzilla/show_bug.cgi?id=55573
> The order MTime-Size or Size-MTime also matters, because MTime is more
> likely to change, so comparing ETag strings may be faster by a dozen
> CPU cycles.
> Even though this is not a full checksum hash, it is definitely not a weak
> ETag. This is enough to show that we expect octet-for-octet equality for
> Range requests.
> Apache and Nginx share almost all traffic on the Internet, but most static
> files are served via Nginx, and it is not configurable.
>
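The weak/strong distinction mentioned above comes from the two comparison functions in RFC 7232, Section 2.3.2. A minimal C sketch of both follows; the function names are hypothetical, and the inputs are assumed to be well-formed entity-tags with their quotes included.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the entity-tag carries the weak marker W/. */
static bool etag_is_weak(const char *etag)
{
    return strncmp(etag, "W/", 2) == 0;
}

/* Skips the W/ marker, if present, leaving the quoted opaque-tag. */
static const char *etag_opaque(const char *etag)
{
    return etag_is_weak(etag) ? etag + 2 : etag;
}

/* Strong comparison: neither tag is weak and the opaque-tags are identical. */
static bool etag_strong_equal(const char *a, const char *b)
{
    return !etag_is_weak(a) && !etag_is_weak(b) && strcmp(a, b) == 0;
}

/* Weak comparison: the opaque-tags are identical; W/ markers are ignored. */
static bool etag_weak_equal(const char *a, const char *b)
{
    return strcmp(etag_opaque(a), etag_opaque(b)) == 0;
}
```

Range requests with If-Range need the strong comparison, which is why a W/ tag on a static file makes it unusable there, while If-None-Match only needs the weak one.
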
> If I am not missing anything, it looks like Nginx uses the most
> reasonable scheme, and I used it for BusyBox httpd.
> The whole ETag is generated by printf("\"%" PRIx64 "-%" PRIx64 "\"",
> last_mod, file_size)
>
> My proposal is to take the Nginx scheme and make it a recommended
> ETag algorithm, or at least to mention it in rfc7232 as an example.
> Other servers should at least make it possible to configure such an ETag
> form.
> I'll try to engage other web server teams in the discussion and I'll try
> to create patches for them.
>
> While the simple MTime-Size ETag algorithm solves a bunch of problems,
> some systems want more guarantees and need hash-based ETags.
> Any hash, even MD5 or CRC32, is fine to use as an ETag.
>
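As a rough illustration of a hash-based ETag, the C sketch below computes a bitwise CRC-32 over the file content and formats it as a quoted entity-tag. The function names and the "crc32-" prefix are assumptions made for this example only; no server is claimed to emit this form.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (zlib/PNG variant). */
static uint32_t crc32_bytes(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* XOR in the polynomial only when the low bit is set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Formats "crc32-<8 hex digits>" with quotes; the prefix is hypothetical. */
static int format_crc32_etag(char *buf, size_t buflen,
                             const unsigned char *data, size_t len)
{
    return snprintf(buf, buflen, "\"crc32-%08" PRIx32 "\"",
                    crc32_bytes(data, len));
}
```

Unlike the MTime-Size form, this value depends only on the bytes served, so identical files on different servers get the same ETag, at the cost of reading the whole file.
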
> There is a draft of Digest Headers
> https://github.com/httpwg/http-extensions/blob/master/draft-ietf-httpbis-digest-headers.md .
> Its idea is similar to Subresource Integrity (SRI).
> In fact, instead of introducing the new Digest header we could just reuse
> the ETag header with a prefix.
>
> That is, instead of:
>
>     Digest: sha-256=4REjxQ4yrqUVicfSKYNO/cF9zNj5ANbzgDZt3/h3Qxo=
>
> We can use
>
>     ETag: "sha-256=4REjxQ4yrqUVicfSKYNO/cF9zNj5ANbzgDZt3/h3Qxo="
>
> A client can easily parse the ETag header and determine from the prefix
> how to validate it.
> We'd have "structured ETags", and they are already supported by proxies.
>
> For the same file, a server can send two comma-separated ETags: one
> MTime-Size based and an additional digest-based one. Old clients just
> resend them via If-None-Match. If a server like BusyBox can only validate
> the MTime-Size ETag, it will validate that and ignore the sha-256 based ETag.
>
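A server that only knows its own MTime-Size ETag could handle such a list roughly as in the C sketch below: scan the If-None-Match value, skip tags it does not recognise, and report a match if its own tag is present. The function name is hypothetical, and the parsing is deliberately simplified (no obs-text handling, comparison is the weak one that If-None-Match uses).

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the If-None-Match value lists an entity-tag whose quoted
 * opaque-tag equals `etag` (passed with its quotes, without W/).
 * A W/ marker in the header is skipped, i.e. weak comparison is used. */
static bool if_none_match_has(const char *header, const char *etag)
{
    size_t etag_len = strlen(etag);
    const char *p = header;

    while (*p != '\0') {
        /* Skip list separators and optional whitespace. */
        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;
        if (*p == '*')
            return true;   /* "*" matches any current representation */
        if (p[0] == 'W' && p[1] == '/')
            p += 2;
        if (*p != '"')
            return false;  /* malformed list or end of input */
        const char *end = strchr(p + 1, '"');
        if (end == NULL)
            return false;  /* unterminated entity-tag */
        size_t len = (size_t)(end - p) + 1; /* length including both quotes */
        if (len == etag_len && memcmp(p, etag, len) == 0)
            return true;
        p = end + 1;       /* unknown tag, e.g. a digest-based one: skip it */
    }
    return false;
}
```

The scan walks quote to quote rather than splitting on commas, because a comma is a legal character inside an entity-tag.
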
> BTW the file hashes can be stored in ext4 extended attributes to avoid
> recalculating them.
>
> Please share your thoughts, opinions and best practices for ETags.
>
> See also:
> Apache code to generate ETag
> https://searchcode.com/codesearch/view/28934406/
> lighttpd
> https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/master/src/etag.c
>
> --
> Sergey Ponomarev <https://linkedin.com/in/stokito>, skype:stokito
>
>
>