Re: #428 Accept-Language ordering for identical qvalues

Nicholas Shanks <nickshanks@gmail.com> Mon, 21 January 2013 14:24 UTC

I have a 111 MB access log, beginning in 2004, that records the User-Agent, Accept-Language and Accept-Charset headers, which I don't mind posting publicly. I have not analysed it, so I do not know what fraction of it is bots, and it comes only from an obscure 'test your accept language/charset headers' page on my personal site, so it could be full of weird junk.

"wc -l" says there are 1.8 million lines in there, and the head of the file looks fine, but when I run tail I see "n" where my script should have written a newline, so I probably edited the script at some point and missed a backslash. That will need cleaning up.


― Nicholas.

On 21 Jan 2013, at 02:06, Mark Nottingham <mnot@mnot.net> wrote:

> That's interesting, thanks. 
> 
> One thing to add; even if the client includes a q=0, the server can still ignore it. 
> 
> Cheers,
> 
> P.S. If you are able (considering privacy issues, etc.) and want to dump such data in a useable format, feel free to ask for a repository on the github account.
> 
> 
> 
> On 21/01/2013, at 12:56 PM, Amos Jeffries <squid3@treenet.co.nz> wrote:
> 
>> My collection of 2 years' worth of language headers says no.
>> 
>> Of 2018 unique Accept-Language header field-values:
>> 1532 are using q-values in a strictly sorted list
>> 491 are not using q-values
>> 14 are using "q=0.0".
>> 5 are using q-values and non-qvalues without ordering the sent list (1 looks otherwise normal, the others are using puny-codes)
>> 
>> The 14 are also unique in being very long and having multiple entries with equal q-values. They are still, without exception, strictly ordered, with the entries having no q-value listed first (as if q=1.0 was used for sorting but omitted when sending). They also contain a number of oddities, such as multiple entries for the same language code with differing q-values.
>> 
>> NP: Of those 14 odd A-L headers noted above I have UA details on 8 of them. All claim to be Firefox but the Gecko dates do not line up with other info on those versions (the 11.0 was released some years before 3.5.9 on the same OS) so the whole input is a bit suspect.
>> 
>> 
>> The 5 unordered-list cases have puny-code values with no q-value listed after an otherwise normal series of languages, like so:
>> "en-us,en;q=0.5,x-ns1qHkbtrt8Nhv,x-ns2E1e0Nnym7b6"
>> 
>> I have a few cases of a q-value-ordered list followed by a wildcard "*" with no q-value, the sender obviously assuming the list is ordered.
>> 
>> 
>> 
>> Broken down by UA, which I started ~6 months ago at Julien's suggestion, I have 54289 distinct UAs visiting, of which:
>> 21756 are not sending A-L header at all
>> 19621 unique UA are using a single language code with no q-value
>> 12495 unique UA are using q-values as above.
>> 8 are sending only wildcard "*" or "*/*"
>> 
>> The remaining ~400 roughly match up with the 491 A-L field-values not using q-values. They are older agents (Windows 98, NT and 2k stand out), agents sending the same language multiple times (VoilaBot variants and Safari there), or agents sending sub-language variants with the generic form last, e.g. "en-GB,en", "en-US,en", "en-US,en,*" (tablets and Mobile Safari mostly). They are obviously assuming sorted lists, even back to the Windows 98 ones.
>> 
>> There are also a few bots sending exactly 2 puny-code entries.
>> 
>> 
>> Amos
> 
> --
> Mark Nottingham   http://www.mnot.net/
> 
> 
> 
>
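
For anyone wanting to run a similar breakdown over a log of Accept-Language values, here is a minimal Python sketch of the kind of classification Amos describes above. It is not the script either of us used. The function names and category labels ("sorted", "no-qvalues", "has-q0", "unordered") are made up for illustration. It assumes that an entry without a q-value counts as q=1.0, and that "sorted" means non-increasing, so ties are allowed.

```python
from typing import List, Optional, Tuple


def parse_accept_language(value: str) -> List[Tuple[str, Optional[float]]]:
    """Split a field-value into (language-range, q) pairs; q is None when omitted."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        q = None
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = None  # malformed q-value: treated as omitted (assumption)
        entries.append((parts[0], q))
    return entries


def classify(value: str) -> str:
    """Assign one illustrative category label to an Accept-Language field-value."""
    entries = parse_accept_language(value)
    if not entries:
        return "empty"
    qs = [q for _, q in entries]
    if all(q is None for q in qs):
        return "no-qvalues"
    if any(q == 0.0 for q in qs if q is not None):
        return "has-q0"
    # Assumption: a missing q-value is equivalent to q=1.0.
    effective = [1.0 if q is None else q for q in qs]
    if all(a >= b for a, b in zip(effective, effective[1:])):
        return "sorted"
    return "unordered"
```

With these assumptions, the quoted example "en-us,en;q=0.5,x-ns1qHkbtrt8Nhv,x-ns2E1e0Nnym7b6" comes out as "unordered", and so does a q-value list followed by a bare "*". A different tie or default rule would move headers between buckets, so the counts it produces are only comparable if the same rules are used throughout.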