Re: [http-state] Ticket 3: Public Suffixes

Adam Barth <ietf@adambarth.com> Sat, 16 January 2010 22:40 UTC

On Sat, Jan 16, 2010 at 11:47 AM, corvid <corvid@lavabit.com> wrote:
> Adam wrote:
>> Another alternative is to recommend a heuristic that works in many
>> cases and then further recommend that user agents use the full list.
>> The problem with this approach is that I don't know of any simple
>> heuristics that provide reasonable behavior.  In the past, some user
>> agents have used heuristics based on the length of the top-level
>> domain (i.e., two characters => ccTLD => foo.cc is a public suffix).
>> Unfortunately, this heuristic has undesirable consequences for some
>> small countries that let folks register domains directly in the ccTLD.
>
> This seems good to me to both
> - let implementors know that they can use the publicsuffix list
> - try to provide the best heuristic we know of for user agents who might
>  not have the luxury of using publicsuffix for whatever reason (or can't
>  depend on it)

Here's the best heuristic I know.  The algorithm can probably be
simplified and explained more clearly.

[[
Roughly, getDomain(strFQDN) amounts to:

1> If the final label is empty, drop it for the purposes of this
algorithm.
// Otherwise "www.example.com." would have four labels: "www",
// "example", "com", "".  Instead, we drop the final label.

2> Name the labels Ln,...,L3,L2,L1; decreasing from start
(Leftmost=Ln) to finish (Rightmost=L1).
// If at any point in this algorithm the result demands >n labels,
// getDomain returns "".

3> Check n > 1.  If not, there's no domain, just a plain hostname.
Return ""; exit.
// Dotless FQDNs consist of a host only; there is no domain.

4> Check L1 == "tv".  If so, getDomain returns L2.L1; exit.
// "tv" is a special-case "completely flat" ccTLD for historical reasons.

5> Check Len(L1) > 2.  If so, getDomain returns L2.L1; exit.
// Len(L1)>2 suggests L1 is a gTLD rather than a ccTLD.
// If Len(L1)<=2 we assume L1 is a part of a ccTLD.

6> Check if L2 in gTLD list "com,edu,net,org,gov,mil,int".  If so,
getDomain returns L3.L2.L1; exit.
// gTLDs, when they appear immediately left of a ccTLD (modulo the
// exception in step 4), are considered a part of the TLD.

7> If L1 is in the list "GR,PL" AND L2 is NOT in the gTLD list,
getDomain returns L2.L1; exit.
// GR and PL are considered "flat" ccTLDs EXCEPT when a gTLD appears in L2.
// getDomain("a.pl") returns "a.pl"
// getDomain("a.uk") returns ""

8> If Len(L2) < 3 getDomain returns L3.L2.L1; exit.
// getDomain("aa.bb.cc") returns "aa.bb.cc"

9> Otherwise, getDomain returns L2.L1
// getDomain("aa.bbb.cc") returns "bbb.cc"
]]
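
For anyone who wants to experiment with the steps above, here is a
rough Python sketch.  It is only my reading of the pseudocode, not a
reference implementation.  The names get_domain, GTLD_LABELS,
FLAT_CCTLDS and rightmost are made up for illustration, and the
lowercasing of the input is an assumption (the steps above do not say
how case is handled).

```python
# Sketch of the getDomain heuristic; names and case folding are assumptions.
GTLD_LABELS = {"com", "edu", "net", "org", "gov", "mil", "int"}
FLAT_CCTLDS = {"gr", "pl"}


def get_domain(fqdn: str) -> str:
    # Assumption: compare labels case-insensitively.
    labels = fqdn.lower().split(".")

    # Step 1: drop an empty final label ("www.example.com." case).
    if labels and labels[-1] == "":
        labels.pop()
    n = len(labels)

    def rightmost(count: int) -> str:
        # Note under step 2: if the result demands more than n labels,
        # the answer is the empty string.
        if count > n:
            return ""
        return ".".join(labels[-count:])

    # Step 3: a single label is a plain hostname, not a domain.
    if n <= 1:
        return ""

    l1, l2 = labels[-1], labels[-2]

    # Step 4: "tv" is treated as completely flat.
    if l1 == "tv":
        return rightmost(2)

    # Step 5: more than two characters suggests a gTLD.
    if len(l1) > 2:
        return rightmost(2)

    # Step 6: a gTLD-like label left of a ccTLD counts as part of the TLD.
    if l2 in GTLD_LABELS:
        return rightmost(3)

    # Step 7: flat ccTLDs.  L2 cannot be in the gTLD list here, because
    # step 6 has already returned in that case.
    if l1 in FLAT_CCTLDS:
        return rightmost(2)

    # Step 8: a short L2 is assumed to be part of the suffix.
    if len(l2) < 3:
        return rightmost(3)

    # Step 9: everything else.
    return rightmost(2)
```

With this sketch, get_domain("a.pl") gives "a.pl" and
get_domain("a.uk") gives "", matching the comments under step 7.  The
second result comes from step 8 asking for three labels when only two
are available.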

The heuristic is sufficiently ugly and wrong that I'd prefer to
recommend that user agents that care about security use the public
suffix list.  For example, it breaks the cookie protocol for domains
in the "to" ccTLD.  If a user agent doesn't care about security, then
it can skip the public suffix check and the protocol will still
function fine.

Adam