Secdir last call review of draft-ietf-6man-ipv6-alt-mark-06

Christopher Wood via Datatracker <noreply@ietf.org> Tue, 15 June 2021 23:16 UTC

From: Christopher Wood via Datatracker <noreply@ietf.org>
To: secdir@ietf.org
Cc: draft-ietf-6man-ipv6-alt-mark.all@ietf.org, ipv6@ietf.org, last-call@ietf.org
Subject: Secdir last call review of draft-ietf-6man-ipv6-alt-mark-06
Message-ID: <162379897899.20803.2196921209927070076@ietfa.amsl.com>
Reply-To: Christopher Wood <caw@heapingbits.net>
Date: Tue, 15 Jun 2021 16:16:19 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipv6/6E2HOxXtz9-Rxt0VtlPtbAuiGA8>

Reviewer: Christopher Wood
Review result: Has Issues

General comments:

I don't quite understand the need for this mechanism -- why would one use these
markings instead of transport-layer signals a la ECN? -- so I've constrained my
comments to the mechanical details. My only high-level comment pertains to the
threat model and value of these metrics. In particular, it's not clear to me
how an operator would distinguish actual operational problems causing loss or
delay from an attacker who is modifying marking flags to give the appearance
of loss or delay. In untrusted domains, how are these markings expected to be
used reliably? (I guess I just don't understand the threat model well enough,
and I couldn't glean it from the security considerations.)

Specific comments:

Section 2.

        o  In case of Hop-by-Hop Option Header carrying Alternate Marking
        bits, it is not inserted or deleted, but can be read by any node
        along the path.  The intermediate nodes may be configured to
        support this Option or not and the measurement can be done only
        for the nodes configured to read the Option.  Anyway this should
        not affect the traffic throughput on nodes that do not recognize
        the Option, as further discussed in Section 4.

A couple questions come to mind when reading this. In no particular order:

- What stops a hop along the path from inserting or deleting these markings?
What is affected if that happens?

- Does it affect throughput on nodes that _do_ recognize the option?

While the threat model (monitoring within a controlled domain) seems to rule
out these issues, the implications of alterations, even if accidental, seem
worth elaborating upon.

        Flow Label and
        FlowMonID within the same packet have different scope, identify
        different flows, and are intended for different use cases.

Is the set of packets defined by a FlowMonID a subset of those defined by a
Flow Label, do they have some overlap, or are they completely disjoint?
(Writing out the relationship in more detail might help clarify why a new label
is indeed needed for non-experts.) It seems like a shame to redefine yet
another flow field.

As a nit, given the relation to and possible confusion with Flow Label, perhaps
we could rename FlowMonID to something like TraceID?

        So, for the purposes of
        this document, both IP addresses and Flow Label should not change in
        flight and, in some cases, they could be considered together with the
        FlowMonID for disambiguation.

The restrictions of a controlled domain, wherein there is assumed to be no
attacker that can modify these fields, are probably worth noting here. They are
covered in Section 2.1 and in the "harm to measurements" part of the security
considerations, but that is somewhat buried relative to this point in the
document and is perhaps worth promoting to somewhere earlier.

Section 2.1.

This should probably point to the security considerations for more information
about controlled domains.

Section 3.1.

   o  Opt Data Len: The length of the Option Data Fields of this Option
      in bytes.

Are there requirements for how long the reserved field in the option data is
supposed to be? It seems that this field must consist of all zeroes, but that
it can be up to 255 bytes long. Given that the data consists of a FlowMonID (20
bits) and two flags (2 bits), would it be useful to recommend (or require) a
size for this?
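
To illustrate why a fixed size seems natural, here is a rough sketch of how the
option data could be packed if it were fixed at 4 bytes. The layout is my own
assumption, not something I am quoting from the draft: a 20-bit FlowMonID,
then the L and D flags, then 10 reserved bits that must be zero. The function
names are hypothetical.

```python
import struct

FLOWMONID_BITS = 20          # from the draft: FlowMonID is 20 bits
RESERVED_MASK = 0x3FF        # assumption: the low 10 bits are reserved


def pack_altmark(flow_mon_id: int, l_flag: bool, d_flag: bool) -> bytes:
    """Pack FlowMonID, L, and D into an assumed 4-byte option data field."""
    if not 0 <= flow_mon_id < (1 << FLOWMONID_BITS):
        raise ValueError("FlowMonID must fit in 20 bits")
    word = (flow_mon_id << 12) | (int(l_flag) << 11) | (int(d_flag) << 10)
    return struct.pack("!I", word)  # network byte order, reserved bits zero


def unpack_altmark(data: bytes) -> tuple[int, bool, bool]:
    """Inverse of pack_altmark; rejects non-zero reserved bits."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes of option data")
    (word,) = struct.unpack("!I", data)
    if word & RESERVED_MASK:
        raise ValueError("reserved bits must be zero")
    return word >> 12, bool((word >> 11) & 1), bool((word >> 10) & 1)
```

With a fixed length like this, a receiver can reject anything that is not
exactly 4 bytes instead of having to scan an arbitrarily long run of zeroes.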

Section 5.

   It is important to highlight that the definition of the Hop-by-Hop
   Options in this document SHOULD NOT affect the throughput on nodes
   that do not recognize the Option.

This is an interesting requirement. Surely a node that processes the option
does more work before forwarding a packet, which seems like it would affect
throughput, even if that impact is negligible. Perhaps "SHOULD NOT affect the
throughput" could be rephrased as "is designed to minimize throughput impact on
nodes that do not support the option"?

Section 5.1.

   The measurement of the packet loss is really straightforward.  The
   packets of the flow are grouped into batches, and all the packets
   within a batch are marked by setting the L bit (Loss flag) to a same
   value.

Does this require nodes to batch packets in memory before forwarding? (As
written, that seems to be the case, which seems odd.)

        The source node can switch the value of the L bit between 0
        and 1 after a fixed number of packets or according to a fixed timer,
        and this depends on the implementation.

Using a timer for this seems like a very error-prone or noisy implementation
approach. Beyond requiring tightly synchronized clocks, which is already a
challenging requirement, is the idea that using a counter is somehow more
complex than a timer? (If there's no benefit to using a timer, and it only
introduces operational challenges, I'd recommend just removing the suggestion
altogether, but I may be missing something.)
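
For comparison, here is a minimal sketch of the counter-based reading I would
expect, in which the source marks each packet as it is sent and keeps only a
counter, so no packets are buffered and no clock is involved. The class name,
method name, and batch size are hypothetical and not taken from the draft.

```python
class CounterMarker:
    """Toggle the L bit after every `batch_size` packets (no buffering)."""

    def __init__(self, batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.l_bit = 0            # L value for the current batch
        self.sent_in_batch = 0    # packets already marked in this batch

    def next_l_bit(self) -> int:
        """Return the L bit for the next outgoing packet."""
        bit = self.l_bit
        self.sent_in_batch += 1
        if self.sent_in_batch == self.batch_size:
            self.l_bit ^= 1       # start a new batch with the other value
            self.sent_in_batch = 0
        return bit


marker = CounterMarker(batch_size=3)
marks = [marker.next_l_bit() for _ in range(7)]
```

If this is the intended behavior, saying so explicitly in the text would
answer the batching question above.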

        In a few words this
        implies that the length of the batches MUST be chosen large enough so
        that the method is not affected by those factors.

There does not seem to be enough guidance here to enforce this MUST, especially
given the different factors that affect batch size. What happens if this MUST
is violated? (Perhaps downgrading to a SHOULD would be better.)

Section 5.2.

How do nodes know if they should measure delay using the single- or
double-marking methodology? Is that determined by some per-domain policy?

        The most efficient and robust mode
        is to select a single double-marked packet for each batch, in
        this way there is no time gap to consider between the double-
        marked packets to avoid their reorder.

I'm having a hard time understanding this guidance. How exactly does one select
a single packet? Is it done at random, or is there another way? (The figures
seem to suggest that the packet is picked from the "middle" of a batch.)

Section 5.3.

   The FlowMon identifier field is to uniquely identify a monitored flow
   within the measurement domain.  The field is set at the source node.
   The FlowMonID can be uniformly assigned by the central controller or
   algorithmically generated by the source node.  The latter approach
   cannot guarantee the uniqueness of FlowMonID but it may be preferred
   for local or private network, where the conflict probability is small
   due to the large FlowMonID space.

What happens when all values in the FlowMonID space are consumed? Are old flows
discarded or overwritten? I would imagine there's some way IDs are recycled
given the finite 2^20 space, but that's not discussed.
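
As a rough illustration of how quickly a 20-bit space fills up, the sketch
below computes the standard birthday-problem probability that at least two of
n flows pick the same ID. It assumes IDs are chosen independently and uniformly
at random, which the draft does not state; the function name is hypothetical.

```python
FLOWMONID_SPACE = 1 << 20  # 2^20 possible FlowMonID values


def collision_probability(n_flows: int, space: int = FLOWMONID_SPACE) -> float:
    """Probability that n uniformly random IDs contain at least one repeat."""
    if n_flows > space:
        return 1.0  # pigeonhole: a repeat is unavoidable
    p_all_distinct = 1.0
    for i in range(n_flows):
        # The (i+1)-th flow must avoid the i IDs already in use.
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


p_1000 = collision_probability(1000)
```

Under these assumptions, about a thousand concurrently monitored flows already
give a collision probability of roughly 0.38, so "small" conflict probability
only holds for fairly small deployments.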

Section 5.3.1.

This seems like text that should be moved to the security considerations. In
doing so, it can also be trimmed. (I would claim that the 32-bit FlowMonID
example is irrelevant given that these labels are 20 bits long, for example.)

Section 6.

        Moreover, Alternate Marking should usually be applied in
        a controlled domain and this also helps to limit the problem.

Does this mean to suggest that Alternate Marking can be used in networks where
attackers exist? If so, comments above regarding the integrity of these fields
should be addressed, I think.

   The privacy concerns of network measurement are limited because the
   method only relies on information contained in the Option Header
   without any release of user data.  Although information in the Option
   Header is metadata that can be used to compromise the privacy of
   users, the limited marking technique seems unlikely to substantially
   increase the existing privacy risks from header or encapsulation
   metadata.

The QUIC working group spent a _long_ time trying to understand the privacy
implications of a single latency bit. I'd encourage the authors here to review
the history of that discussion, and then revisit this paragraph. While privacy
implications may not seem obvious, I think it's a mistake to say that it is
unlikely to introduce any new sort of attack vector.

   The Alternate Marking application described in this document relies
   on an time synchronization protocol.  Thus, by attacking the time
   protocol, an attacker can potentially compromise the integrity of the
   measurement.

This seems somewhat buried, and probably worth promoting to the introduction.

Editorial comments:

- Some language is a bit informal, e.g., "Anyway, ...". I recommend removing
such phrasings throughout.

- "Alternate Marking" and "alternate marking" are inconsistently capitalized.
Is that intentional?

- OAM is undefined in Section 4 -- perhaps we can spell it out? (I assume it's
Operations, Administration, and Maintenance.)