[VCARDDAV] Proposal around escape character handling (2nd round)

Daisuke Miyakawa <d.miyakawa@gmail.com> Tue, 13 July 2010 13:35 UTC

Date: Tue, 13 Jul 2010 22:35:32 +0900
Message-ID: <AANLkTilx6XgI2iosuKf5zmHnLggkmYe4EeeN-PijvI5K@mail.gmail.com>
From: Daisuke Miyakawa <d.miyakawa@gmail.com>
To: vcarddav@ietf.org
Content-Type: multipart/alternative; boundary=0016e68db6e6901613048b44f115
Subject: [VCARDDAV] Proposal around escape character handling (2nd round)

Hi group!

As you may know, I previously made some proposals around escape characters.

Fortunately or unfortunately, the thread became a bit too long for other
people to look over.
Thus, I'd like to reorganize (and slightly modify) my proposals and ask for
your opinions again.
Please let me know if I have missed something important from the previous
thread (e.g. overlooked your opinion), though I checked it carefully.

****** What the current draft (rev12) defines:
Backslash, semicolon, comma, and newline MUST be encoded in accordance with
the following rules:
- backslash <-> \\
- semicolon <-> \;
- comma <-> \,
- new line <-> \n or \N
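
A minimal sketch of the four rules above, in Python. The names escape_text and unescape_text are hypothetical (they do not come from the draft), and leaving unknown escapes untouched is an assumption made only for this illustration:

```python
# Hypothetical helpers illustrating the one-to-one rules listed above.
_ESCAPES = {"\\": "\\\\", ";": "\\;", ",": "\\,", "\n": "\\n"}


def escape_text(value: str) -> str:
    """Escape backslash, semicolon, comma, and newline in a text value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_text(value: str) -> str:
    """Reverse escape_text; both \\n and \\N decode to a newline."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt in "\\;,":
                out.append(nxt)
                i += 2
                continue
            if nxt in "nN":
                out.append("\n")
                i += 2
                continue
        # Assumption: anything else (including an unknown escape) is
        # copied through unchanged.
        out.append(ch)
        i += 1
    return "".join(out)
```

With these helpers, escape_text turns "Smith; John" into "Smith\; John", and unescape_text restores the original.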

Currently, how a semicolon should be handled depends on which property
contains it. According to Simon's description:
> The rule for comma, backslash, and newline are global and apply
> everywhere (see section 3.3). However, the rule for semicolon only
> applies to some properties which use the semicolon as separator. It is
> up to the people defining a particular X- property to decide whether
> they want to use the semicolon as separator.

****** Proposal 1 (new):
The one-to-one rules above MUST be applied to "all" properties, including
X- properties, for uniformity across properties.
In other words, semicolons MUST be escaped even when the property does not
allow multiple values (like (0, 1)).

In this proposal, how readers must/should act when ';' is given without
escape in those properties is undefined. I don't think "undefined" is a good
but I cannot think up better idea for mentioning it as a formal

****** Proposal 2 (new):
Add one additional one-to-one mapping.
- \t <-> TAB

- This convention has long been used to encode ordinary text, not in vCard
but in other text handlers (starting from the C language), so I thought it
might be better to add it.
-- I felt vCard looks "exceptional" without this rule.
- I think white space should be treated carefully, and \t is a typical and
important case that deserves special care.

I'd say that there are few opportunities where I've seen TAB in actual
This proposal is just for keeping vCard 4.0 consistent with the
escaping rules used in other systems (like programming languages).
# For example, see http://www.python.org/dev/peps/pep-3138/

****** Proposal 3: (currently up to the group's decision)
\uNNNN <-> (a Unicode character with charcode 0xNNNN)
\U00NNNNNN <-> (a Unicode character with charcode 0xNNNNNN, where 0xNNNNNN
SHOULD be more than 0x10000)

The proposal above is mainly based on
I modified this proposal a bit (from "\x" to "\u", "\U").

Python, JavaScript and some other programming languages use the \u format
(and, in Python's case, also \U) for encoding Unicode, not \x (which is
usually for 8 bits).
This does not apply to Perl, where \x{NNNN} and 0xNNNN seem to be used.
I'm not sure about Ruby.
What I can tell is that programming languages actually support this kind of
escaping rule.

Theoretically we don't need to consider surrogate pairs as Python does, but
I think convention and uniformity are important from a practical viewpoint.
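
As a rough illustration of the \uNNNN and \U00NNNNNN mappings in Proposal 3, here is a minimal Python sketch. The function name decode_unicode_escapes is hypothetical. The sketch deliberately ignores the interaction with "\\" (an escaped backslash followed by "u") and does not combine surrogate pairs. Leaving out-of-range values untouched is an assumption, not something the proposal states:

```python
import re

# \uNNNN (4 hex digits) or \U00NNNNNN (00 followed by 6 hex digits).
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U00([0-9A-Fa-f]{6})")


def decode_unicode_escapes(value: str) -> str:
    """Replace \\uNNNN and \\U00NNNNNN with the character they denote."""

    def replace(match):
        code = int(match.group(1) or match.group(2), 16)
        if code > 0x10FFFF:
            # Assumption: values beyond the Unicode range are left as is.
            return match.group(0)
        return chr(code)

    return _UNICODE_ESCAPE.sub(replace, value)
```

For example, decode_unicode_escapes applied to the eight-character text \u0041BC (backslash, u, 0041, B, C) yields "ABC".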

Yea: 1 (only me)
- Other actual implementations using Unicode (Java, Python, icu4c)
actually support this format.
- I prefer actual usability to theoretical simplicity, while I agree that
"theoretically" this spec is not needed.
-- My experience handling Unicode with surrogate pairs tells me that this
mapping is practically useful.
- I don't think \xNN is needed in vCard, as we don't need to take care of
the readability of 8-bit data. We just need b encoding for it.

Nay: 4 (Simon, Cyrus, Julian, and Barry)
Simon's reply:
> I disagree again. If you are using a given character in a sentence,
> whether it is visible or not, it is because you intend the recipient to
> read it. Otherwise, the character would not be useful and would not be
> present. For example, in this paragraph I used many spaces which are
> invisible and I don't think we would gain anything by replacing them
> with \x20 in a vCard. We are encoding user-readable text in vCard, not
> random bits.

Barry's reply:
> a single wire format is sufficient; display of non-ASCII characters is an
> orthogonal issue

****** Proposal 4 (new):
When a vCard entry happens to contain escape sequences not defined in the
vCard 4.0 spec, readers SHOULD just remove the backslash and keep the wrongly
escaped character as is.
(This is "SHOULD" rather than "MUST", because astute real-world vCard readers
may have to cope with wrong input produced by other composers.)

e.g. This is \a pen. -> This is a pen.
(Readers SHOULD NOT interpret \a as an alert (as the C language does) but
as just the character 'a'.)
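
The lenient behavior in Proposal 4 could be sketched in Python as follows. The name lenient_unescape is hypothetical, and dropping a trailing lone backslash is an assumption made for this sketch, since the proposal does not cover that case:

```python
# Escapes defined by the current draft; everything else is "unknown".
_KNOWN = {"\\": "\\", ";": ";", ",": ",", "n": "\n", "N": "\n"}


def lenient_unescape(value: str) -> str:
    """Decode known escapes; for unknown ones, drop only the backslash."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            # Assumption: a lone backslash at the end is simply dropped.
            break
        # Unknown escape such as \a: keep 'a', discard the backslash.
        out.append(_KNOWN.get(nxt, nxt))
    return "".join(out)
```

With this sketch, the input "This is \a pen." is read as "This is a pen.", while \; and \n keep their defined meanings.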

Julian's reply related to this proposal:
> on \ escaping: either allow \ in front of any character, or be *very*
> clear that using it when it's not needed makes the vCard invalid (test
> cases? validator service?)

I suppose just requesting readers to remove invalid backslash would suffice.

I agree that my proposals depend a bit too much on practical viewpoints.
Feel free to correct me if they have problems (from the viewpoint of the IETF
way or something else I'm not familiar with).


Daisuke Miyakawa (宮川大輔)