Re: The TCP and UDP checksum algorithm may soon need updating

John R Levine <johnl@taugh.com> Wed, 10 June 2020 15:17 UTC

Date: Wed, 10 Jun 2020 11:17:40 -0400
Message-ID: <alpine.OSX.2.22.407.2006101110480.62014@ary.qy>
From: John R Levine <johnl@taugh.com>
To: Warren Kumari <warren@kumari.net>
Cc: IETF Discuss <ietf@ietf.org>
Subject: Re: The TCP and UDP checksum algorithm may soon need updating
In-Reply-To: <CAHw9_iK226J9FV8+Jdhr3=NBLRr4U-gr21Nn-JXdNK7Gzu66jw@mail.gmail.com>
References: <D55AFBFD-0D59-4176-B6BD-D6A1801FEC2C@akamai.com> <6c9f5bd9-6e26-5d25-e66e-bec206473ff3@mtcc.com> <20200610001225.GD3100@localhost> <3ac60a21-4aee-d742-bedc-5be3a4e65471@mtcc.com> <rbpbpp$2fgp$1@gal.iecc.com> <CAHw9_iK226J9FV8+Jdhr3=NBLRr4U-gr21Nn-JXdNK7Gzu66jw@mail.gmail.com>
User-Agent: Alpine 2.22 (OSX 407 2020-02-09)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"; format="flowed"
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/o3NoV-89HAmW41rV3SLtkwWHwuI>

On Wed, 10 Jun 2020, Warren Kumari wrote:
>> Having read the papers that Craig referenced, that's my interpretation.
>>
>> One of them is about a big physics application which sends multiple
>> terabytes of data over the net using what looks like a version of
>> FTP that transfers several files at once.  They send the data as a lot
>> of 4 gig files. When they started verifying file checksums, they
>> found about 20% of the received files were corrupted in transit.
>
> I'm assuming you are talking about "Cross-Geography Scientific Data
> Transferring Trends and Behavior", which contains (Section 4.1
> Checksum, encryption, and reliability, p.12):

No, it's "Transferring a Petabyte in a Day".

https://www.researchgate.net/publication/325405478_Transferring_a_Petabyte_in_a_Day

"As mentioned, we split each 1.2 TiB snapshot into 256 files of 
approximately equal size. We determined that transferring 64 or 128 files 
concurrently, with a total of 128 or 256 TCP streams, yielded the maximum 
transfer rate. We achieved an average disk-to-disk transfer rate of 92.4 
Gb/s (or 1 PiB in 24 hours and 3 minutes): 99.8% of our goal of 1 PiB in 
24 hours, when the end-to-end verification of data integrity in Globus is 
disabled. In contrast, when the end-to-end verification of data integrity 
in Globus is enabled, we achieved an average transfer rate of only 72 Gb/s 
(or 1 PiB in 30 hours and 52 minutes).

The Globus approach to checksum verification is motivated by the 
observations that the 16-bit TCP checksum is inadequate for detecting data 
corruption during communication [16, 17] and that corruption can occur 
during file system operations [18]. Globus pipelines the transfer and 
checksum computation; that is, the checksum computation of the ith file 
happens in parallel with the transfer of the (i + 1)th file. Data are read 
twice at the source storage system (once for transfer and once for 
checksum) and written once (for transfer) and read once (for checksum) at 
the destination storage system. Therefore, in order to achieve the desired 
rate of 93 Gb/s for checksum-enabled transfers, in the absence of checksum 
failures, 186 Gb/s of read bandwidth from the source storage system and 93 
Gb/s write bandwidth and 93 Gb/s of read bandwidth concurrently from the 
destination storage system are required. If checksum verification failures 
occur (i.e., one or more files are corrupted during the transfer), even 
more storage I/O bandwidth, CPU resources, and network bandwidth are 
required in order to achieve the desired rate."
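
For anyone who wants to check the arithmetic in that passage, here is a rough sketch. The function names are mine, not from the paper or from Globus. One assumption to flag: the quoted times only work out if the "PiB" is taken as 10^15 bytes; with 2^50 bytes the same 92.4 Gb/s would take about 27 hours.

```python
# Sanity check of the figures in the quoted passage.
# Assumption: the paper's "PiB" is effectively 10**15 bytes; that is the
# reading under which the quoted rates and durations agree.
PETABYTE_BITS = 8 * 10**15
PEBIBYTE_BITS = 8 * 2**50


def transfer_time(bits, rate_gbps):
    """Return (hours, minutes) needed to move `bits` at `rate_gbps` Gb/s."""
    seconds = bits / (rate_gbps * 1e9)
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


def required_storage_bandwidth(rate_gbps):
    """Storage I/O needed for a checksum-verified transfer with no retries.

    The source is read twice (transfer + checksum); the destination is
    written once (transfer) and read once (checksum).
    """
    return {
        "source_read": 2 * rate_gbps,
        "dest_write": rate_gbps,
        "dest_read": rate_gbps,
    }


if __name__ == "__main__":
    print(transfer_time(PETABYTE_BITS, 92.4))   # verification disabled
    print(transfer_time(PETABYTE_BITS, 72))     # verification enabled
    print(transfer_time(PEBIBYTE_BITS, 92.4))   # if PiB meant 2**50 bytes
    print(required_storage_bandwidth(93))
```

So turning on verification costs roughly 22% of the throughput (72 vs. 92.4 Gb/s) even before any file has to be resent.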

Globus is a file transfer service from the University of Chicago:

https://www.globus.org/data-transfer
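
As a reminder of why a 16-bit ones' complement sum is a weak check, here is a minimal sketch of the RFC 1071 Internet checksum with two kinds of corruption it cannot see. This is only an illustration; I am not claiming these are the failure modes the papers observed, and the sample byte strings are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = bytes.fromhex("123456789abc")

# Two 16-bit words swapped: addition is commutative, so the sum is unchanged.
reordered = bytes.fromhex("567812349abc")

# One word incremented and another decremented: the errors cancel out.
compensating = bytes.fromhex("123556779abc")

if __name__ == "__main__":
    for label, payload in [("original", original),
                           ("reordered", reordered),
                           ("compensating", compensating)]:
        print(label, hex(internet_checksum(payload)))
```

All three payloads produce the same checksum, which is the sort of thing an end-to-end file hash catches and the transport checksum does not.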

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly