[sbm] Kernel API invariants and surprising behavior in TCP_NOTSENT_LOWAT
Jonathan Lennox <jonathan.lennox42@gmail.com> Wed, 12 March 2025 17:36 UTC
From: Jonathan Lennox <jonathan.lennox42@gmail.com>
Date: Wed, 12 Mar 2025 13:36:05 -0400
Message-ID: <CAKx+b+YWjWFd61zSbCQjX2zzTfPVo3=8rXS24wQ_4WVhHo9FrA@mail.gmail.com>
To: sbm@ietf.org
Subject: [sbm] Kernel API invariants and surprising behavior in TCP_NOTSENT_LOWAT
List-Id: Source Buffer Management <sbm.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/sbm/72XEdVBNmFoad-4wn-xw501T49o>
List-Archive: <https://mailarchive.ietf.org/arch/browse/sbm>
One surprising thing about the original macOS semantics for TCP_NOTSENT_LOWAT (and thus the intended semantics for TCP_REPLENISH_TIME) is that they break an implied kernel invariant. I hypothesize that this is why the Linux implementation behaves differently from the macOS one.

The invariant is that the status returned by select() / kevent() / epoll() is *true*: if those system calls say that a socket isn't writable, then it really isn't, and attempting to write to it will block, or return EWOULDBLOCK / EAGAIN.

The macOS semantics of TCP_NOTSENT_LOWAT break this: select() or kevent() will claim that a socket is not writable when it has more than TCP_NOTSENT_LOWAT unsent bytes in its buffer, even though a write will in fact succeed up to SO_SNDBUF bytes. Thus, once TCP_NOTSENT_LOWAT is set, there is no way to tell on macOS whether a write to the socket will actually succeed or block.

I believe this invariant is pretty baked in to the Linux kernel code, and that's why the Linux implementation made it impossible to write more than TCP_NOTSENT_LOWAT bytes to the socket. (I'm not an expert in the Linux socket code, but it looks like it uses the same code to trigger epoll and to decide when a TCP sendmsg() call truncates its write.)

I'm thinking that it may be cleaner from an API design perspective (though, sadly, less portable) to instead add new kevent and epoll event types specifically for when the socket hits its low-water point, whether by size or by time, and leave the existing writable events with their original semantics. (This wouldn't allow triggering low-water from select(), but I don't think that's a huge hardship for modern code.)

Are there Linux kernel developers on this list? I suspect this will need input from that community.
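
For anyone who wants to check the invariant on their own platform, here is a rough probe, offered as a sketch rather than a definitive test. It sets TCP_NOTSENT_LOWAT on a loopback connection whose peer never reads, writes until select() reports the socket as not writable, and then attempts one more non-blocking send anyway. The function name, the result labels, and the 16 KiB threshold are my own placeholders, and I am making no claim here about what any particular kernel version will report.

```python
import select
import socket


def probe_notsent_lowat(lowat=16 * 1024, chunk=4096, max_bytes=64 * 1024 * 1024):
    """Check whether a "not writable" answer from select() is really true.

    Returns None if TCP_NOTSENT_LOWAT is unavailable, otherwise a tuple
    (bytes_queued, outcome) where outcome is one of:
      "blocked"         - the extra send failed with EAGAIN/EWOULDBLOCK
      "accepted"        - the kernel took more bytes despite select()'s answer
      "always-writable" - select() never reported "not writable" within max_bytes
    """
    opt = getattr(socket, "TCP_NOTSENT_LOWAT", None)
    if opt is None:
        return None  # option not exposed by this Python build or platform

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    receiver = None
    try:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        sender.connect(listener.getsockname())
        receiver, _ = listener.accept()  # never read from, so data backs up
        try:
            sender.setsockopt(socket.IPPROTO_TCP, opt, lowat)
        except OSError:
            return None  # constant exists but the kernel rejected it
        sender.setblocking(False)

        payload = b"\0" * chunk
        queued = 0
        for _ in range(max_bytes // chunk):
            # Wait briefly so that "not writable" reflects a stalled
            # connection rather than data still in flight on loopback.
            _, writable, _ = select.select([], [sender], [], 0.2)
            if not writable:
                break
            try:
                queued += sender.send(payload)
            except BlockingIOError:
                continue  # raced with the kernel; ask select() again
        else:
            return queued, "always-writable"

        # select() now claims the socket is not writable. Test that claim.
        try:
            sender.send(payload)
        except BlockingIOError:
            return queued, "blocked"
        return queued, "accepted"
    finally:
        for s in (sender, receiver, listener):
            if s is not None:
                s.close()


if __name__ == "__main__":
    print(probe_notsent_lowat())
```

An "accepted" result would mean the writability status was not true in the sense described above; a "blocked" result would mean the invariant held for that run.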