Re: ETag specification: load balance friendly and merge with Digest header from

Sergey Ponomarev <stokito@gmail.com> Fri, 17 July 2020 07:46 UTC

From: Sergey Ponomarev <stokito@gmail.com>
Date: Thu, 16 Jul 2020 22:48:49 +0300
Message-ID: <CADR0UcX6mJiYMUm7OU46_ihtqyS+94Y+v4L-sTyJRDvX4HbRBg@mail.gmail.com>
To: Roberto Polli <robipolli@gmail.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>

Hi Roberto,

Thank you for the explanation, now it is clear. The ETag header allows
multiple tags, so it seems we can pass both a weak ETag and a digest in the
same header. That would simplify things and avoid duplication when the
Digest is the same as a hash-based ETag.


On Mon, Jul 13, 2020 at 17:49 Roberto Polli <robipolli@gmail.com> wrote:

> Hi Sergey,
>
> The Digest header was introduced long ago in RFC 3230. We are just
> updating it...
>
> Its goal is different from ETag's, though you can use digest algorithms
> to compute strong ETags. Consider, though, that the digest changes when
> localising resources (e.g. via Content-Language), while weak ETags
> probably won't.
>
> If someone thinks we should describe the relationship between digest and
> etags in the new spec we can do it.
>
> Have a nice day,
> R
>
> On Mon, Jul 13, 2020, 02:28 Sergey Ponomarev <stokito@gmail.com> wrote:
>
>> Hi,
>>
>> I just implemented ETag caching for BusyBox httpd, which is an HTTP server
>> for embedded devices like WiFi routers.
>> While implementing it I had to choose what exactly should be generated as
>> the ETag.
>> ETag is specified in https://tools.ietf.org/html/rfc2616#section-14.19
>> as an opaque value, and a server is free to generate it as it needs.
>> RFC 7232 (Conditional Requests),
>> https://httpwg.org/specs/rfc7232.html#rfc.section.2.3, explains strategies
>> for generating and comparing ETags in more detail.
>> But even the upcoming HTTP Caching draft-ietf-httpbis-cache-09 gives no
>> practical details about ETag generation.
>>
>> I did a little research and found that every web server does it in its
>> own way, which causes several problems:
>> 1. The ETag may be badly or even wrongly generated.
>> 2. When two different servers, e.g. Apache and Nginx, are behind a load
>> balancer, their ETags never match because they are generated differently,
>> so cached responses are always discarded. That's why some sysadmins
>> disable ETag on one of the servers.
>> These problems can easily be fixed if the HTTP specification provides a
>> recommended way to generate ETags while keeping freedom of choice.
>>
>> A typical ETag is based on the file's last modification time and size,
>> which can be easily retrieved from the file system, but it can also be a
>> stricter hash or checksum, and sometimes a semantic version.
>>
>> Just a quick overview of typical algorithms used in web servers.
>> Consider a file with:
>> * Size 1047, i.e. 417 in hex.
>> * MTime, i.e. last modification on Mon, 06 Jan 2020 12:54:56 GMT, which
>> is 1578315296 seconds in Unix time or 1578315296666771000 nanoseconds.
>> * Inode, which is a physical file number: 66, i.e. 42 in hex.
>>
>> Different web servers return ETags like:
>> Nginx: "5e132e20-417", i.e. "hex(MTime)-hex(Size)". Not configurable.
>> Apache/2.2: "42-417-59b782a99f493", i.e. "hex(INode)-hex(Size)-hex(MTime
>> in microseconds)". Can be configured, but MTime will be in microseconds
>> anyway: http://httpd.apache.org/docs/2.4/mod/core.html#fileetag
>> Apache/2.4: "417-59b782a99f493", i.e. "hex(Size)-hex(MTime in
>> microseconds)", i.e. without INode, which is friendly for load balancing
>> when an identical file has different INodes on different servers.
>> OpenWrt uhttpd: "42-417-5e132e20", i.e.
>> "hex(INode)-hex(Size)-hex(MTime)". Not configurable.
>> Tomcat 9: W/"1047-1578315296666", i.e. Weak "Size-MTime in milliseconds".
>> This is an incorrect ETag because it should be strong for a static file,
>> i.e. octet compatible.
>> lighttpd: the weirdest: "hashcode(42-1047-1578315296666771000)", i.e.
>> INode-Size-MTime, but then reduced to a simple integer by a hash code. Can
>> be configured, but you can only disable one part (etag.use-inode =
>> "disabled").
>>
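
The formats listed above can be reproduced from the example numbers with a few snprintf calls. This is only a sketch of the string layouts as described in the message, not the servers' actual code; the type and function names are made up for illustration. Note that 59b782a99f493 is the hex form of 1578315296666771, i.e. the timestamp in microseconds, and 1578315296666 is the same timestamp in milliseconds.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* File attributes from the example in the message (hypothetical type). */
typedef struct {
    uint64_t inode;     /* 66, i.e. 0x42 */
    uint64_t size;      /* 1047, i.e. 0x417 */
    uint64_t mtime_ns;  /* 1578315296666771000 ns since the Unix epoch */
} file_attrs;

/* "hex(MTime in seconds)-hex(Size)", the layout listed for Nginx. */
int etag_mtime_size(char *out, size_t n, const file_attrs *f)
{
    assert(out != NULL && f != NULL);
    return snprintf(out, n, "\"%" PRIx64 "-%" PRIx64 "\"",
                    f->mtime_ns / UINT64_C(1000000000), f->size);
}

/* "hex(Size)-hex(MTime in microseconds)", the layout listed for Apache/2.4. */
int etag_size_mtime_usec(char *out, size_t n, const file_attrs *f)
{
    assert(out != NULL && f != NULL);
    return snprintf(out, n, "\"%" PRIx64 "-%" PRIx64 "\"",
                    f->size, f->mtime_ns / UINT64_C(1000));
}

/* "hex(INode)-hex(Size)-hex(MTime in seconds)", the layout listed for uhttpd. */
int etag_inode_size_mtime(char *out, size_t n, const file_attrs *f)
{
    assert(out != NULL && f != NULL);
    return snprintf(out, n, "\"%" PRIx64 "-%" PRIx64 "-%" PRIx64 "\"",
                    f->inode, f->size, f->mtime_ns / UINT64_C(1000000000));
}

/* W/"Size-MTime in milliseconds" in decimal, the layout listed for Tomcat 9. */
int etag_weak_size_mtime_msec(char *out, size_t n, const file_attrs *f)
{
    assert(out != NULL && f != NULL);
    return snprintf(out, n, "W/\"%" PRIu64 "-%" PRIu64 "\"",
                    f->size, f->mtime_ns / UINT64_C(1000000));
}

/* Byte-wise comparison of two tags, as a cache would compare them. */
int etags_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For the same file, etag_mtime_size() and etag_size_mtime_usec() give different strings, so etags_equal() fails across two such servers. That is the load balancer problem described in point 2.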
>> Hex numbers are used here so often because it's cheap to convert a
>> decimal number to a shorter hex string.
>> The inode, while adding more guarantees, makes load balancing impossible
>> and is very fragile if you simply copy the file during an application
>> redeploy.
>> Sub-second MTime is not available on all platforms, and we don't need
>> such granularity. Apache has reported bugs about this, like
>> https://bz.apache.org/bugzilla/show_bug.cgi?id=55573
>> The order, MTime-Size or Size-MTime, also matters: MTime is more likely
>> to change, so comparing ETag strings may be faster by a dozen CPU cycles.
>> Even if this is not a full checksum hash, it is definitely not a weak
>> ETag. This is enough to show that we expect octet compatibility for Range
>> requests.
>> Apache and Nginx share almost all the traffic on the Internet, but most
>> static files are served via Nginx, and its ETag is not configurable.
>>
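
The strong versus weak distinction matters because RFC 7232 section 2.3.2 defines two comparison functions: strong comparison requires that neither tag is weak and that the tags are identical, while weak comparison ignores the W/ prefix. A minimal sketch of both, with hypothetical function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True when the tag carries the W/ prefix. */
static bool etag_is_weak(const char *etag)
{
    return etag[0] == 'W' && etag[1] == '/';
}

/* The quoted opaque part of the tag, without any W/ prefix. */
static const char *etag_opaque(const char *etag)
{
    return etag_is_weak(etag) ? etag + 2 : etag;
}

/* Strong comparison: both tags must be strong and identical. */
bool etag_strong_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return !etag_is_weak(a) && !etag_is_weak(b) && strcmp(a, b) == 0;
}

/* Weak comparison: only the opaque parts have to be identical. */
bool etag_weak_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(etag_opaque(a), etag_opaque(b)) == 0;
}
```

If-Range requires strong comparison, so a tag sent as W/"..." cannot be used to resume a byte range, while If-None-Match uses weak comparison.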
>> If I am not missing anything, it looks like Nginx uses the most
>> reasonable scheme, and I used it for BusyBox httpd.
>> The whole ETag is generated by printf("\"%" PRIx64 "-%" PRIx64 "\"",
>> last_mod, file_size)
>>
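
A slightly fuller sketch of that printf, assuming the two values come from stat(2). This is not the actual BusyBox patch; the function names are hypothetical and error handling is kept minimal.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Writes "hex(mtime)-hex(size)" with quotes into out.
 * Returns the tag length, or -1 if the buffer is too small. */
int etag_from_stat(const struct stat *st, char *out, size_t n)
{
    assert(st != NULL && out != NULL);
    int len = snprintf(out, n, "\"%" PRIx64 "-%" PRIx64 "\"",
                       (uint64_t)st->st_mtime, (uint64_t)st->st_size);
    return (len > 0 && (size_t)len < n) ? len : -1;
}

/* Convenience wrapper: stat the file, then format its tag. */
int etag_for_path(const char *path, char *out, size_t n)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return etag_from_stat(&st, out, n);
}

/* True when the tag sent by the client still describes the file. */
int etag_is_current(const struct stat *st, const char *client_etag)
{
    char current[40];  /* 2 quotes + 16 + '-' + 16 hex digits + NUL fits */
    if (etag_from_stat(st, current, sizeof current) < 0)
        return 0;
    return strcmp(current, client_etag) == 0;
}
```

With the example file (mtime 1578315296, size 1047), etag_from_stat() produces "5e132e20-417", the same string as listed for Nginx.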
>> My proposal is to take the Nginx scheme and make it the recommended
>> ETag algorithm, or at least to mention it in RFC 7232 as an example.
>> Other servers should at least make it possible to configure such an ETag
>> form.
>> I'll try to engage other web server teams in the discussion and will
>> try to create patches for them.
>>
>> While the simple MTime-Size ETag algorithm solves a bunch of problems,
>> some systems want more guarantees and need hash-based ETags.
>> Any hash, even MD5 or CRC32, is fine to use as an ETag.
>>
>> There is a draft of Digest Headers:
>> https://github.com/httpwg/http-extensions/blob/master/draft-ietf-httpbis-digest-headers.md .
>> Its idea is similar to Subresource Integrity (SRI).
>> In fact, instead of introducing the new Digest header we can just
>> reuse the ETag header with a prefix.
>>
>> That is, instead of:
>>
>>     Digest: sha-256=4REjxQ4yrqUVicfSKYNO/cF9zNj5ANbzgDZt3/h3Qxo=
>>
>> We can use
>>
>>     ETag: "sha-256=4REjxQ4yrqUVicfSKYNO/cF9zNj5ANbzgDZt3/h3Qxo="
>>
>> A client can easily parse the ETag header and determine from the prefix
>> how to validate.
>> We'll have "structured ETags", and they are already supported by proxies.
>>
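
One possible way for a client to split such a structured tag into algorithm and digest value is sketched below. The "alg=value" layout is the one proposed in the message, not anything standardized, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Splits a strong tag of the form "alg=value" (quotes included) into its
 * two parts. Returns false for weak tags, tags without '=', or when an
 * output buffer is too small. */
bool parse_digest_etag(const char *etag,
                       char *alg, size_t alg_n,
                       char *val, size_t val_n)
{
    assert(etag != NULL && alg != NULL && val != NULL);
    size_t len = strlen(etag);
    if (len < 5 || etag[0] != '"' || etag[len - 1] != '"')
        return false;

    const char *body = etag + 1;          /* text between the quotes */
    size_t body_len = len - 2;
    const char *eq = memchr(body, '=', body_len);
    if (eq == NULL || eq == body)
        return false;

    size_t alg_len = (size_t)(eq - body);
    size_t val_len = body_len - alg_len - 1;
    if (val_len == 0 || alg_len >= alg_n || val_len >= val_n)
        return false;

    memcpy(alg, body, alg_len);
    alg[alg_len] = '\0';
    memcpy(val, eq + 1, val_len);         /* base64 padding '=' is kept */
    val[val_len] = '\0';
    return true;
}
```

Because an ordinary opaque tag may contain '=' by coincidence, a caller would still need to check the returned algorithm name against the ones it knows before treating the value as a digest.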
>> For the same file, a server can send two comma-separated ETags: one
>> MTime-Size based and an additional digest-based one. Old clients just
>> resend them via If-None-Match. If a server like BusyBox can only validate
>> the MTime-Size ETag, it will validate that and ignore the sha-256 based
>> ETag.
>>
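
On the server side, validating only the tag it understands amounts to scanning the If-None-Match list for its own current tag and skipping everything else. A minimal sketch, assuming weak comparison as RFC 7232 prescribes for If-None-Match; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when the If-None-Match value contains the given tag
 * (or is "*"). Tags are located by their quotes rather than by commas,
 * since a comma may legally appear inside a quoted tag. */
bool if_none_match_contains(const char *header, const char *etag)
{
    assert(header != NULL && etag != NULL);
    if (etag[0] == 'W' && etag[1] == '/')
        etag += 2;                        /* weak comparison ignores W/ */
    size_t etag_len = strlen(etag);

    const char *p = header;
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;
        if (*p == '*')
            return true;
        if (p[0] == 'W' && p[1] == '/')
            p += 2;
        if (*p != '"')
            return false;                 /* end of header or malformed */
        const char *end = strchr(p + 1, '"');
        if (end == NULL)
            return false;
        size_t len = (size_t)(end - p) + 1;
        if (len == etag_len && memcmp(p, etag, len) == 0)
            return true;
        p = end + 1;                      /* unknown tag: skip it */
    }
}
```

A server that only knows the MTime-Size tag would call this with its freshly generated tag and answer 304 Not Modified on a match; the digest-based tag in the list is simply never matched.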
>> BTW, the file hashes can be stored in ext4 extended attributes to avoid
>> recalculating them.
>>
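
Caching a digest in an extended attribute could look roughly like this on Linux, using setxattr(2) and getxattr(2). The attribute name user.etag.sha256 and the function names are assumptions for illustration, and the file system must have user extended attributes enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name; any name in the "user." namespace works. */
#define DIGEST_XATTR "user.etag.sha256"

/* Stores an already computed digest string on the file. Returns 0 or -1. */
int digest_cache_put(const char *path, const char *digest, size_t len)
{
    assert(path != NULL && digest != NULL);
    return setxattr(path, DIGEST_XATTR, digest, len, 0) == 0 ? 0 : -1;
}

/* Reads the cached digest into out as a NUL-terminated string.
 * Returns its length, or -1 if the attribute is missing, the file
 * cannot be accessed, or the buffer is too small. */
ssize_t digest_cache_get(const char *path, char *out, size_t n)
{
    assert(path != NULL && out != NULL && n > 0);
    ssize_t len = getxattr(path, DIGEST_XATTR, out, n - 1);
    if (len < 0)
        return -1;
    out[len] = '\0';
    return len;
}
```

The cached value goes stale when the file content changes, so a server would have to recompute it whenever the file's MTime or size no longer matches what it saw when the digest was stored.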
>> Please share your thoughts and opinions, and best practices for ETags.
>>
>> See also:
>> Apache code to generate ETag
>> https://searchcode.com/codesearch/view/28934406/
>> lighttpd
>> https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/master/src/etag.c
>>
>> --
>> Sergey Ponomarev <https://linkedin.com/in/stokito>, skype:stokito
>>

--
Sergey Ponomarev <https://linkedin.com/in/stokito>, skype:stokito