[Extra] IMAP4rev2 body search

Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> Fri, 17 January 2020 14:56 UTC

From: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
To: extra@ietf.org
Date: Fri, 17 Jan 2020 15:56:06 +0100
Mime-Version: 1.0
Message-Id: <9d918580-70be-4e9b-90ef-372523afe359@gulbrandsen.priv.no>
User-Agent: Trojita/0.7; Qt/5.7.1; xcb; Linux; Devuan GNU/Linux 2.1 (ascii)
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/extra/_D3clWBT8Q33zamPdIRNUxuGRUM>
Subject: [Extra] IMAP4rev2 body search

Hi,

a discussion the other day^Wweek jars in my mind.

I think we should underspecify the BODY search key more clearly. All it now 
says is "matches if contains", which is IMO correct (it matches the running 
code) but underspecified and vague. I think we ought to eliminate the 
vagueness, and propose the following explicit underspecification:

---

Messages that contain the specified string in the body of the message.

The server SHOULD decode the content-transfer-encoding, so that a message 
matches independent of its content-transfer-encoding. Apart from that rule, 
this specification explicitly allows much server behaviour that has been 
common in IMAP4rev1 servers, including:

Most servers interpret "contains" on a character level, i.e. "BODY range" 
matches a message that contains the word "orange", but some servers 
interpret it on a token or word level, i.e. "BODY range" does not match a 
message that contains "orange", because "orange" is one token. It may 
however match a message that contains "ranges", if the server uses stemming 
(and perhaps language detection).

Some servers search only in text/plain. Others also search in other types, 
for example Microsoft Word and PDF attachments.

Some servers search HTML on a source level ("BODY range" does not match 
ra&shy;nge), others search HTML as normally displayed ("BODY range" matches 
ra&shy;nge).

Most servers interpret the messages and search terms independent of 
encoding, such that a message that uses charset=iso-8859-8 may match a 
search term that uses UTF-8. However, not even this is required.

Rationale: IMAP4rev1 servers vary in the kind and quality of their search 
implementation. This document chooses to avoid adding requirements on the 
business logic in such servers.

---
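
To make the permitted variation concrete, here is a minimal Python sketch 
(not part of the proposed text, and not how any particular server works). 
The sample message and all function names are hypothetical. It uses only 
the standard library, handles single-word search keys only, and leaves out 
stemming and language detection. The "as displayed" HTML handling is a 
crude stand-in that only resolves entities and drops soft hyphens.

```python
import email
import html
import re
from email import policy

# Hypothetical sample: iso-8859-8 text, quoted-printable encoded.
# "or=\r\nange" is a soft line break inside the word "orange";
# =F9=EC=E5=ED is the Hebrew word U+05E9 U+05DC U+05D5 U+05DD.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"One or=\r\nange, =F9=EC=E5=ED\r\n"
)


def decoded_body(raw: bytes) -> str:
    """Undo the content-transfer-encoding and the charset."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    return msg.get_content()


def substring_match(body: str, key: str) -> bool:
    """Character-level 'contains', ignoring case."""
    return key.casefold() in body.casefold()


def token_match(body: str, key: str) -> bool:
    """Word-level 'contains': the key must equal a whole token."""
    return key.casefold() in re.findall(r"\w+", body.casefold())


def html_as_displayed(source: str) -> str:
    """Crude approximation: resolve entities, drop soft hyphens."""
    return html.unescape(source).replace("\u00ad", "")


body = decoded_body(RAW)
print(b"orange" in RAW)                       # search on the raw bytes
print(substring_match(body, "range"))         # character level
print(token_match(body, "range"))             # token level
print(substring_match(body, "\u05e9\u05dc\u05d5\u05dd"))  # Unicode key
print(substring_match("ra&shy;nge", "range"))                     # HTML source
print(substring_match(html_as_displayed("ra&shy;nge"), "range"))  # displayed
```

In this sketch the raw-bytes search misses "orange" because of the soft 
line break, which is the case the SHOULD on content-transfer-encoding is 
meant to cover. The remaining calls show two answers each for the same 
message and search key, all of which the proposed text would allow.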

(I'm open to changing any of this. IMO any variant is good, and good is 
better than best. I'd prefer that quarrelsome people can't beat each other 
over the head with the RFC, that's all.)

Arnt