Re: [apps-discuss] Working Group Last Call draft-ietf-appsawg-sieve-duplicate

Ned Freed <ned.freed@mrochek.com> Sat, 11 January 2014 18:37 UTC

MIME-version: 1.0
Content-type: TEXT/PLAIN; CHARSET="iso-8859-1"
Message-id: <01P30W782SVK0000AS@mauve.mrochek.com>
Date: Sat, 11 Jan 2014 08:21:12 -0800
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Sat, 11 Jan 2014 12:45:07 +0000" <005501cf0eca$faf86840$4001a8c0@gateway.2wire.net>
References: <CAL0qLwZqJPTssNVLLaSjAP5wqteZ==fuawNF+WUZYvi+YWV1UQ@mail.gmail.com> <00a301cf07e8$01352160$4001a8c0@gateway.2wire.net> <52CF384D.3080502@rename-it.nl> <005501cf0eca$faf86840$4001a8c0@gateway.2wire.net>
To: "t.petch" <ietfc@btconnect.com>
Cc: Stephan Bosch <stephan@rename-it.nl>, IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Working Group Last Call draft-ietf-appsawg-sieve-duplicate
Precedence: list

> Mostly inline, but my big comment which gets lost in the detail is about
> scope of caches.  You say
> "there is only one 'cache' (in the document it is called the duplicate
> tracking list)"

> Really, worldwide, like the DNS root?-)

The phrase "only one cache" appears nowhere in the document that I can see. In
fact the word "cache" doesn't appear anywhere, which seems correct given that
nothing in this document involves using a cache, at least not in the sense that
I (or Wikipedia) understand the term.

> Hitherto, as I understand it, sieve has had no state; a filter is
> applied to a stream of messages and that is it, forget it and start
> again.

That's completely incorrect. The obvious counterexample is the vacation
extension, which has essentially been part of sieve from the beginning. Another
example is the notify extension. And in both of these cases the maintained
state is used to suppress duplicate messages. So they have a lot in common with
this extension.

More generally, sieve implementations often impose limits on the number of
times some operation can be performed. Sometimes this is per-script, but
sometimes it isn't. And that also requires state.

As a result sieve implementors are quite familiar with the ins and outs of
state storage.

> This adds state.  This then introduces the question of scope.

Nope. It's also been there pretty much from day one in Sieve.

> If I get the same message from two different ISP (which I  do), will
> that be detected?  I expect not.

It depends on where your sieve is stored and evaluated. If it's a single sieve
being evaluated by your MUA that sees both messages, then yes, it will be
detected. If you have separate sieves attached to your ISP accounts, where each
one only sees one copy of the message, then no, it won't.

To put it another way, the "scope" of the state associated with a script
is always the script itself. Nothing more and nothing less. And the current
document does in fact say this, in the first paragraph of section 1:

   It adds a test to determine whether a certain message
   was seen before by the delivery agent in an earlier execution of ***the
   Sieve script.***

And it doesn't matter if your two sieves are both "yours" for some value of
"yours", if they are identical, or if the two script instances are in fact under
shared administrative control so their storage could be coordinated. Why?
Because there's no way to divine the intent of having two separate instances of
a script. Maybe you want them to share some context. Or maybe for some reason
you don't. There's no way to communicate your intent in any case (and even if
there was there's no chance users would understand such a facility), so the
only sensible thing to do is to handle them separately.

> When those ISP outsource their MX to
> the same third party, will that be detected?

MX outsourcing has nothing to do with Sieve. Sieves are by definition evaluated
around the time of final delivery. A secondary MX by definition is not
performing final delivery.

Now, of course you can operate sieve outside of the context where it is
specified. But that's not something sieve specifications have to deal with.

> I expect not as long as
> they have different mail domains.

Mail domains have nothing to do with it either. Examples abound where users
have multiple mail domains associated with a single server or the handling of a
single mail domain is spread across multiple administrative domains.

But even if this made sense, it wouldn't matter. The scope of storage associated
with a script is always just that script instance.

> At what point is there a single cache?

Again, this isn't a cache, at least not in the sense I understand the term. But
in any case, storage is per-script, that is, every sieve script gets its own
separate space for storing duplicate information.
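
A minimal Python sketch of what per-script storage means in practice. It is
illustrative only: the class, the method names, the script identifiers and the
one hour default are all hypothetical and do not come from the draft, and
recording an entry immediately is a simplification.

```python
import time


class DuplicateStore:
    """Hypothetical per-script duplicate tracking; not taken from the draft."""

    def __init__(self):
        # script instance id -> {unique id value -> expiry timestamp}
        self._lists = {}

    def seen_before(self, script_id, unique_id, seconds=3600, now=None):
        """Return True if unique_id is already tracked for this script instance."""
        now = time.time() if now is None else now
        tracked = self._lists.setdefault(script_id, {})
        # Drop entries whose expiration time has passed.
        for key in [k for k, expiry in tracked.items() if expiry <= now]:
            del tracked[key]
        duplicate = unique_id in tracked
        if not duplicate:
            # First sighting for this script instance: start tracking it.
            tracked[unique_id] = now + seconds
        return duplicate
```

Two script instances that each see one copy of a message both report it as
new, because neither one consults the other's list. A single instance that
sees both copies reports the second as a duplicate.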

> When I used multiple mailboxes for the same mail domain?

Those would each have separate scripts. You could choose to set them to be the
same script, but that doesn't make them the same instance.

> Only when I use a specific  mailbox? or when?

> In a sense it does not matter because it only affects the user
> experience and not the protocol on the wire, but I expect that users
> will care if the implementation on ISP A gives totally different results
> to ISP B.  And of course, when there is more than one cache (worldwide
> :-), then operators of this facility will need more resources to
> maintain them.

You're assuming vast complexity in the sieve space which AFAIK does not exist.
(There is in fact vast complexity in the space, but it's along fairly different
axes than the one you talk about here. And since almost all of it lies far
outside the usage context defined by the specifications, it is not relevant
here.)

And perhaps more to the point, we have considerable experience with duplicate
elimination associated with the sieve vacation extension. In 10+ years of use,
I can't recall a single complaint about vacation duplicates that wasn't
associated with either a bug or some species of outright storage failure.
(Indeed, far and away the most common complaint we see in regards to vacation
is that the suppression mechanisms work too well and fail to send messages when
they are needed.)

> ----- Original Message -----
> From: "Stephan Bosch" <stephan@rename-it.nl>
> To: "t.petch" <ietfc@btconnect.com>; "Murray S. Kucherawy"
> <superuser@gmail.com>; "IETF Apps Discuss" <apps-discuss@ietf.org>
> Sent: Friday, January 10, 2014 12:01 AM
> > On 1/2/2014 7:24 PM, t.petch wrote:
> > > This I-D seems to need more thought.
> > >
> > > s.1 "For example, if a member of the list decides to
> > >    reply to both the user and the mailing list itself, the user will
> > >    receive one copy of the message directly and another through the mailing
> > >    list.
> > >
> > > Well, they MAY, but they don't on a good list system, such as the
> one in
> > > use here.

I get duplicate messages from IETF lists all the time.

> > >
> > > "Also, if someone cross-posts over several mailing lists to
> > >    which the user is subscribed, the user will receive a copy from
> each
> > >    of those lists."
> > >
> > > Ditto, not here.
> >
> > Ok, but these situations are quite common. I've implemented an earlier
> > version of this extension based on user requests. So, what do you want
> > me to do? Word this differently so that it is clear that this
> shouldn't happen for sanely configured mailing lists?

It does happen for sanely configured lists.

> I found the 'will' too forceful; my instant reaction was 'no it won't'
> because that is my experience.  Just moderate it slightly, 'will often'
> 'may' 'commonly'-  just not a somewhat forceful 'will'

I see no point to making such a change but don't object to it.

> > > "   Duplicate messages are normally detected using the Message-ID
> header
> > >    field, which is required to be unique for each message.  "
> > >
> > > REQUIRED maybe, but I seem to recall the malformed-mail I-D raising
> the
> > > possibility that it was not.  In which case, ...?
> >
> > I'm not sure how common that is, but you are right:  as the
> > specification is now, that would cause a false positive. We can make
> the
> > default a bit more complex by combining the Message-ID  with some
> other
> > header (Date perhaps?), thereby further reducing the likelihood of a
> > false positive. I guess we need to think about that a little more.

No, you really don't. The default has the virtue of simplicity and working fine
in almost all cases. The minute you start throwing other fields into the mix,
especially ones that tend to be messed with on submission, the less effective
the process will be.

> Yes, I think you should allow for the possibility in the I-D - as to
> how, I am less fussed.  Could be 'outside the scope of' up to 'should
> detect duplicates and discard as malformed' - just show that it has been
> considered.

I have no problem with pointing this out.

> > > s.3
> > > "an earlier Sieve execution."
> > > reading on it is apparent that this is any number of executions
> limited
> > > by the size of the FIFO cache and the maximum lifetime of entries in
> the
> > > cache.
> >
> > Yes. So, what exactly is your comment here? I don't think it is useful
> > to mention such detail early in the description.
> >
> > > "   Usage:  [":header" <header-name: string> /
> > >                           ":uniqueid" <value: string>]
> > >
> > > Why have two way of doing the same thing?  As I read it, this test
> is on
> > > a header field, so why not have just ":header" with a default of
> message
> > > I-D?
> >
> > I am not sure what you mean here. The :uniqueid argument does
> explicitly
> > not operate (directly) on a message header, but rather on some string
> > value composed by the user (using the variables extension). This can
> > consist of header field contents, but also on message body or even
> some
> > source other than the message being delivered.

> If I understand it aright, the test for duplication can be on
>  - message id
>  - header field
>  - something else
> which are invoked by <nothing>, :header, :unique-id respectively.
> Since message id is just another header field, why not merge the first
> two with the message id as the default header field if none other is
> specified?

Because it's a bad idea. The full syntax of the three cases is:

    duplicate

    duplicate :header "foo"

    duplicate :uniqueid "bar"

The only way to "merge" the first two syntactically would be something
like:

   duplicate :header

But that would be a case of a tagged argument that takes an optional parameter.
AFAICT this is not prohibited by RFC 5228, but this would be the first
extension to use such a construct. And given that it's unnecessary verbiage
added to an elegantly simple default case, I'm strongly opposed to making
such a change.

> > > And what happens if I use header field X in one execution and then
> > > header field Y in another? I presume separate caches for X and Y, in
> > > which case, duplicates may not be detected.
> >
> > No, there is only one 'cache' (in the document it is called the
> > duplicate tracking list). See the following text:
> >
> >  The "duplicate" test MUST track an unique ID value independent of its
> >  source.  This means that it does not matter whether values are
> >  obtained from the message ID header, from an arbitrary header
> >  specified using the ":header" argument or explicitly from the
> >  ":uniqueid" argument.

> see above

And see my response. I guess I have no objection to adding a statement
along the lines of "stored information about duplicates is always associated
with a single script" or something similar, but I don't see it as necessary.

> > Some examples follow this text. Do you mean that this needs to be
> > clarified more?
> >
> > > The use of multiple fields
> > > opens up all sorts of complications that need more explanation
> depending
> > > on the concept of the scope of the operation, which I do not see
> clearly
> > > explained.
> >
> > I am not sure what you mean here. I am assuming this comment is not
> > relevant given the above. Please clarify otherwise.
> >
> > > "The user can explicitly control the
> > >    length of this expiration time by means of the ":seconds"
> argument,
> > >    which is always specified in seconds.  "
> > >
> > > seconds seems short to me.  On the IETF lists, I typically see a gap
> of
> > > several hours between a message on one list and a message on another
> > > list, with four hours being the norm.  I would regard 5 minutes as
> the
> > > minimum and 36 hours, or perhaps less, as the maximum.
> >
> > Given the vacation-seconds extension, the use of a seconds granularity
> > is not strange in the Sieve realm.
> >
> > I like the flexibility of using seconds (mailinglists are not the only
> > application area of this extension), but I am not against changing it
> to
> > :minutes per se. Do any other people have thoughts?

> Ok but I just got a duplicate, once via the ietf announce list, once via
> the ietf main list, and they were 10 hours apart.  For me, this is
> typical, hours not seconds.

The history of vacation seems precisely on point here. The original vacation
extension only had a :days parameter, because that's what people thought
the finest granularity needed would be.

But that turned out to be wrong, and we had to create a second RFC, with
all the associated pother, not to mention grotting up the require namespace,
to fix it by adding :seconds.

The truth is we don't know what the finest granularity needs to be here. But
we do know three things:

(1) The chances it's less than a second are extremely small.
(2) You can always express a coarser granularity using a finer one, but not
    vice versa.
(3) Fixing these things after the fact is a big PITA.

I therefore think seconds is the right choice.
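
Point (2) is plain arithmetic. A small Python illustration follows; the helper
names are hypothetical and exist only for this example. The 10 hour gap
mentioned above is exactly expressible in seconds, while a 90 second period
cannot be expressed in whole hours at all.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a coarser period into whole seconds; this is always exact."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def to_whole_hours(seconds):
    """Going the other way truncates, so sub-hour detail is lost."""
    return seconds // 3600
```

With these helpers, to_seconds(hours=10) gives 36000 and
to_seconds(hours=36) gives 129600, whereas to_whole_hours(90) gives 0.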

> > > "leading and trailing whitespace MUST first be trimmed from the
> value"
> > >
> > > This is a can of worms.  Normalisation often appears on these lists
> > > without, usually, a satisfactory answer, let alone the issues of
> i18n.
> > > More needs to be considered here.
> >
> > This mainly serves as a means to prevent stray white space from
> messing
> > with the string match. The core Sieve language also does this for
> > instance for the header test. And how would i18n be relevant here?

> Read RFC6532; that you have not referenced it makes me think that you
> have not considered i18n which I would regard as remiss nowadays.

I coauthored RFC 6532 and completely fail to see any relevance. In particular,
if two messages are in fact duplicates, don't you think they're going to be
normalized the same way? 

Put another way, while it's a well established fact that various intermediaries
muck around with header whitespace, I've yet to hear of intermediaries playing
normalization games. Absent evidence of even the existence of such things, let
alone sufficient information for how you'd want to deal with them, I'm
completely comfortable with this document not saying anything on this topic.
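
A minimal Python sketch of the comparison being discussed, assuming that
trimming leading and trailing whitespace is the only normalization applied
before the string match. The function names are hypothetical and not part of
the draft.

```python
def normalize_id(value):
    """Trim leading and trailing whitespace only; leave everything else alone."""
    return value.strip()


def same_id(a, b):
    """Compare two ID values after whitespace trimming, octet for octet otherwise."""
    return normalize_id(a) == normalize_id(b)
```

Under that assumption, stray whitespace added by an intermediary does not
defeat the match, while values that differ in case or in Unicode composition
stay distinct. That is harmless as long as true duplicates carry the same
characters.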

				Ned
