Re: The TCP and UDP checksum algorithm may soon need updating

Warren Kumari <warren@kumari.net> Mon, 08 June 2020 23:50 UTC

References: <CAHQj4Cf_vgXYEL=x4DCEnpwNxZpJQSD-h6MWmhMWpYwPF9XFow@mail.gmail.com> <0D18B54B-2865-4A3C-813B-595EA17F6D8B@gmail.com> <32750.1591376396@localhost> <CAHQj4CdopwpEfyuOVO3ZywTKveQMpnt_WPh_JDRydgNKHVVmhw@mail.gmail.com> <135be980-9c3b-e7f4-28c6-184d0b48c5cc@necom830.hpcl.titech.ac.jp>
In-Reply-To: <135be980-9c3b-e7f4-28c6-184d0b48c5cc@necom830.hpcl.titech.ac.jp>
From: Warren Kumari <warren@kumari.net>
Date: Mon, 08 Jun 2020 19:50:09 -0400
Message-ID: <CAHw9_iJPCnbE8a2Ws3ivOq+m8YynG_eG9uUNSABOahYx0S_tPA@mail.gmail.com>
Subject: Re: The TCP and UDP checksum algorithm may soon need updating
To: Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp>
Cc: IETF Discuss <ietf@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/6aZ19IAvlDDgCKOBrhC0_ExzPMQ>

On Sat, Jun 6, 2020 at 11:47 PM Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> Craig Partridge wrote:
>
> > OK, on to what people are seeing today.  This shows that 1 in every 121
> > file transfers FTP delivers a file that, when you do the md5 sum, turns out
> > not to match the original (note there are multiple possible reasons, but
> > TCP checksum is a strong candidate).
>
> That's unreasonable because most errors are detected by datalink
> layer checksum and almost all remaining errors are detected by
> transport layer checksum, which should have been the reason
> why transport checksum need not be so strong.

I was trying to avoid this thread, but...

I'm somewhat surprised by some of the numbers (like the 1 in every 121
file transfers).
As a quick check, I looked on a personal webserver (connections from
random people on the Internets), and have received 201.1TB, and
207,183,907,570 (207 billion) packets.
Netstat shows a total of 1029 (detected[0]) CRC errors, or around one
every ~200M packets.

I *think* (but may be completely wrong!) that the chance of a 16-bit
checksum giving a false negative is just 1 in 2^16[1], so 200M*2^16 is
one in around every 13 trillion packets?
My average packet size is ~970 bytes, so that is ~one bad packet in ~12PB.
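The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the counter values are the ones quoted in this message, while the variable names, the reading of 1 TB as 10^12 bytes, and the assumption that a corrupted packet passes a 16-bit checksum with probability 1 in 2^16 are illustrative.

```python
# Counter values quoted in the message.
packets_received = 207_183_907_570
detected_errors = 1029
bytes_received = 201.1e12  # 201.1 TB, assuming 1 TB = 10**12 bytes

# Assumption: a corrupted packet matches a 16-bit checksum
# by chance with probability 1 / 2**16.
CHECKSUM_BITS = 16

packets_per_detected_error = packets_received / detected_errors
packets_per_undetected_error = packets_per_detected_error * 2**CHECKSUM_BITS
avg_packet_bytes = bytes_received / packets_received
bytes_per_undetected_error = packets_per_undetected_error * avg_packet_bytes

print(f"packets per detected error:   {packets_per_detected_error:.3g}")
print(f"packets per undetected error: {packets_per_undetected_error:.3g}")
print(f"average packet size (bytes):  {avg_packet_bytes:.0f}")
print(f"petabytes per undetected error: {bytes_per_undetected_error / 1e15:.1f}")
```

Under these assumptions the result is on the order of 10^13 packets, or roughly a dozen petabytes, per undetected error.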

I believe that L2 quality and checksums have improved enough to
make up for the increase in bandwidth and data volumes -- I'm
sure we all used to watch, and expect errors on PSTN modems, 100Meg
Ethernet, 56k leased lines and the like. I still graph FCS/CRC errors
on router interfaces, but they are basically just empty graphs these
days...

Are others seeing much much worse numbers from looking at their counters?


W
[0]: Yup, it's possible that there were some number of undetected ones
(the whole debate in this thread), but assuming that there are not
systemic issues in the checksum algorithm I believe that the chance of
an error occurring and the checksum calculation happening to match is
1 in 2^(length of checksum), and so there is a ~1.6% chance of it having
happened (1029/2^16), but I bumped the numbers up to 1030 anyway.
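A quick sketch of where the ~1.6% figure comes from, assuming each corrupted packet independently has a 1 in 2^16 chance of passing the checksum (the variable names are illustrative):

```python
detected_errors = 1029
p_miss = 1 / 2**16  # assumed chance that one corrupted packet passes a 16-bit checksum

# Expected number of undetected errors alongside the 1029 detected ones.
expected_undetected = detected_errors * p_miss

# Probability that at least one undetected error occurred,
# treating each corrupted packet as an independent trial.
p_at_least_one = 1 - (1 - p_miss) ** detected_errors

print(f"expected undetected errors: {expected_undetected:.4f}")
print(f"P(at least one undetected): {p_at_least_one:.4f}")
```

Both numbers come out close to 0.016, which is why the two ways of phrasing it are nearly interchangeable here.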

[1]: Actually, I think I overestimate the chance of this happening --
there would need to be a corruption of a packet, *and* the checksum in
the same packet would need to be corrupted, and happen to be the
correct one for that corrupted packet (a 1 in 2^16 chance) - but, that
requires (at least) 2 corruptions in the same packet, or a corruption
which happens to calculate to the same value,  and I don't know how to
easily account for that, so I'll just use the worst-case estimate.
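As an illustration of "a corruption which happens to calculate to the same value": the Internet checksum is a one's-complement sum of 16-bit words, so reordering whole 16-bit words does not change it. The sketch below follows the style of RFC 1071 but runs over a bare payload only (no TCP pseudo-header or real segment layout), and the sample byte strings are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"ABCDEFGH"
words_swapped = b"EFCDABGH"  # 16-bit words "AB" and "EF" exchanged
bytes_swapped = b"BACDEFGH"  # two bytes exchanged inside one word

print(internet_checksum(original) == internet_checksum(words_swapped))
print(internet_checksum(original) == internet_checksum(bytes_swapped))
```

The first comparison is true (the word swap goes undetected) and the second is false (the in-word byte swap changes the sum), so not every corruption has the same 1 in 2^16 chance of slipping through.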


>
> > Anecdotally, folks are reporting some middlebox vendors are not updating
> > the TCP checksum but rather letting the outbound interface simply recompute
> > the entire checksum -- which means that if the TCP segment gets damaged
> > during middlebox handling, the middlebox will slap a valid checksum on bad
> > data.
>
> That should be the real problem to make transport checksum not
> to work end to end.
>
> Thus, your proposal to have stronger checksum can not prevent
> file corruptions.
>
> So, we should make middlebox vendors to update checksum incrementally
> or, check the original checksum just before sending a packet
> with the original header (not applicable if payload is also modified).
>
>                                                 Masataka Ohta
>
> PS
>
> This is a old problem documented in the original paper on
> the E2E principle.
>
> https://dl.acm.org/doi/pdf/10.1145/357401.357402
> 2.2 A Too-Real Example
> One gateway computer developed a transient
> error: while copying data from an input to
> an output buffer a byte pair was
> interchanged, with a frequency of about one
> such interchange in every million bytes passed.
>
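To make the middlebox point quoted above concrete, here is a small sketch contrasting an incremental checksum update (the HC' = ~(~HC + ~m + m') form from RFC 1624) with a full recompute. The packet is just a made-up list of 16-bit words, and the "corruption inside the box" is simulated by hand; none of this models a real middlebox.

```python
def oc_add(a: int, b: int) -> int:
    """One's-complement 16-bit addition with end-around carry."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


def internet_checksum(words: list[int]) -> int:
    """Full checksum over a list of 16-bit words."""
    total = 0
    for w in words:
        total = oc_add(total, w)
    return ~total & 0xFFFF


def incremental_update(old_checksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m')."""
    total = oc_add(~old_checksum & 0xFFFF, ~old_word & 0xFFFF)
    total = oc_add(total, new_word)
    return ~total & 0xFFFF


# Hypothetical packet as it arrives at the middlebox.
arrived = [0x4500, 0x0054, 0xC0A8, 0x0001]
arrived_checksum = internet_checksum(arrived)

# The box intentionally rewrites word 2 (e.g. an address field).
rewritten = list(arrived)
rewritten[2] = 0x0A00
incremental = incremental_update(arrived_checksum, arrived[2], rewritten[2])
full = internet_checksum(rewritten)
print(incremental == full)  # both methods agree when nothing else changed

# Now simulate a bit flip in word 1 while the packet sits in the box.
damaged = list(rewritten)
damaged[1] ^= 0x0001

# Incremental update only accounts for the intended change,
# so the receiver's check fails and the damage is caught.
print(internet_checksum(damaged) == incremental)

# A full recompute on the way out covers the damaged data with a valid checksum.
recomputed = internet_checksum(damaged)
print(internet_checksum(damaged) == recomputed)
```

With the incremental update the end-to-end check still fails on the damaged packet, while the full recompute makes the damaged packet look valid to the receiver.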


-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf