[media-types] Default charset parameter for text media types

Peter Occil <poccil14@gmail.com> Sun, 29 July 2018 04:16 UTC

To: "media-types@ietf.org" <media-types@ietf.org>
From: Peter Occil <poccil14@gmail.com>
Date: Sun, 29 Jul 2018 00:15:59 -0400
Archived-At: <https://mailarchive.ietf.org/arch/msg/media-types/DMDWjQc3EmNZ4uQy8OSZoIpYkdo>
Subject: [media-types] Default charset parameter for text media types

What is the default charset when the charset parameter is absent from a text media type?  Before RFC 6657, the answer was simple: US-ASCII.  However, things have become more complicated since that RFC was published.  RFC 6657 says:

   [A]ll new "text/*" registrations [that is, registrations after July 2012] MUST clearly specify how the charset is determined[.] However, existing "text/*" registrations [before July 2012] that fail to specify how the charset is determined still default to US-ASCII.

This would be easy if all existing and new text media types "specif[ied] how the charset is determined".  Unfortunately this is not the case.
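
As a rough illustration, the rule quoted above can be written as a small decision function.  This is only a minimal Python sketch; the function name and its two boolean parameters are hypothetical and are not taken from RFC 6657 or from any library.

```python
from typing import Optional


def rfc6657_fallback(registered_before_rfc6657: bool,
                     specifies_charset_rule: bool) -> Optional[str]:
    """Return the fallback charset implied by the RFC 6657 text quoted above.

    Returns "US-ASCII" only for a registration that predates RFC 6657
    (July 2012) and fails to specify how the charset is determined.
    Returns None in every other case, meaning that this rule supplies
    no fallback and the registration itself has to be consulted.
    Both parameters are hypothetical inputs for this sketch.
    """
    if specifies_charset_rule:
        # The registration says how the charset is determined; follow it.
        return None
    if registered_before_rfc6657:
        # Grandfathered registration that is silent about the charset.
        return "US-ASCII"
    # Registered after RFC 6657 without a rule: no default can be assumed.
    return None
```

The difficulty described in this message lies in working out the two inputs for each registered media type, not in the rule itself.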

I have managed to sort all text media types into three categories: "no default", "US-ASCII default", and "UTF-8 default".  However, deciding which category a media type falls into is not always trivial.  It depends, among other things, on when the media type was registered; whether it treats the charset parameter as required, optional, or not used; and whether the text is always ASCII or always UTF-8.

What follows are notes that explain how each text media type is assigned to one of those three categories (the source is MediaType.cs in my MailLib library).  They may be helpful in the process of revising some of those media types so that they "clearly specify how the charset is determined", as RFC 6657 requires for media type registrations after July 2012.

      // NOTE: RFC6657 changed the rules for the default charset in text
      // media types, so that there is no default charset for as yet
      // undefined media types. However, media types defined before this
      // RFC (July 2012) are grandfathered from the rule: those media
      // types "that fail to specify how the charset is determined" still
      // have US-ASCII as default. The text media types defined as of
      // Jul. 11, 2018, are listed below:
      //
      // -- No default charset assumed: --
      //
      // RTP payload types; these are usually unsuitable for MIME,
      // and don't permit a charset parameter, so a default charset is
      // irrelevant:
      // -- 1d-interleaved-parityfec, fwdred, red, parityfec, encaprtp,
      // raptorfec, rtp-enc-aescm128, t140, ulpfec, rtx, rtploopback
      //
      // Charset determined out-of-band:
      // -- vnd.motorola.reflex*(5)*(10)
      //
      // Special procedure defined for charset detection:
      // -- ecmascript*(8), javascript*(8), html,
      // rtf*(5)
      //
      // XML formats (no default assumed if charset is absent, according
      // to RFC7303, the revision of the XML media type specification):
      // -- xml, xml-external-parsed-entity,
      // vnd.in3d.3dml*, vnd.iptc.newsml, vnd.iptc.nitf,
      // vnd.ms-mediapackage*(5),
      // vnd.net2phone.commcenter.command, vnd.radisys.msml-basic-layout,
      // vnd.wap.si, vnd.wap.sl, vnd.wap.wml
      //
      // Behavior deliberately undefined (so whether US-ASCII or another
      // charset is treated as default is irrelevant):
      // -- example
      //
      // These media types don't define a charset parameter (after
      // RFC6657):
      // -- grammar-ref-list*(9), vnd.hgl*(6)*(9), vnd.gml*(9)
      //
      // Uses charset parameter, but no default charset specified (after
      // RFC6657):
      // -- markdown*
      //
      // -- US-ASCII assumed: --
      //
      // These media types don't define a charset parameter (before
      // RFC6657):
      // -- dns, mizar, vnd.latex-z,
      // prs.lines.tag, vnd.dmclientscript,
      // vnd.dvb.subtitle, rfc822-headers,
      // vnd.si.uricatalogue*(7), vnd.si.fly*(7)
      //
      // No charset parameter defined, but does specify ASCII only (after
      // RFC6657):
      // -- vnd.ascii-art****, prs.prop.logic*
      //
      // These media types don't define a default charset:
      // -- css, richtext, enriched, tab-separated-values*,
      // vnd.in3d.spot*, vnd.abc, vnd.wap.wmlscript, vnd.curl,
      // vnd.fmi.flexstor, uri-list, directory*
      //
      // US-ASCII default:
      // -- plain, sgml, troff
      //
      // -- UTF-8 assumed: --
      //
      // UTF-8 only:
      // -- vcard, jcr-cnd, cache-manifest
      //
      // Charset parameter defined but is "always ... UTF-8":
      // -- n3, turtle, vnd.debian.copyright, provenance-notation
      //
      // UTF-8 default:
      // -- csv, calendar**, vnd.a***, parameters, prs.fallenstein.rst,
      // vnd.esmertec.theme.descriptor, vnd.trolltech.linguist,
      // vnd.graphviz, vnd.sun.j2me.app-descriptor, strings*(5),
      // csv-schema*(5)
      //
      // * Required parameter.
      // ** No explicit default, but says that "[t]he charset supported
      // by this revision of iCalendar is UTF-8."
      // *(5) No charset parameter defined.
      // *(6) 8-bit encoding.
      // *(7) Says "US-ASCII" is always used, or otherwise says
      // the media type contains ASCII text.
      // *(8) RFC4329: If the charset is unrecognized, use the encoding
      // indicated by a UTF-8/16/32 BOM if one exists; otherwise use UTF-8.
      // If UTF-8, ignore the UTF-8 BOM.
      // *(9) After RFC 6657.
      // *(10) Charset determined "in an a priori manner", rather than
      // being stated in the payload.
      // *** Default is UTF-8 "if 8-bit bytes are encountered" (if
      // none are found, though, the 7-bit ASCII text is still valid UTF-8).
      // **** Content containing non-ASCII bytes "should be rejected".
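
To show how the three categories in the notes above could be applied when the charset parameter is absent, here is a minimal Python sketch.  It is not the MediaType.cs code; the set names, the function name, and the choice to list only a handful of subtypes from each category are assumptions made for illustration.

```python
from typing import Optional

# Illustrative subsets of the categories in the notes above (not complete).
ASCII_DEFAULT = {"plain", "sgml", "troff", "css", "richtext", "enriched",
                 "uri-list", "dns", "rfc822-headers"}
UTF8_DEFAULT = {"csv", "calendar", "vcard", "n3", "turtle",
                "cache-manifest", "parameters"}
NO_DEFAULT = {"html", "xml", "javascript", "ecmascript", "rtf",
              "markdown", "example"}


def default_charset(media_type: str) -> Optional[str]:
    """Return the charset to assume when no charset parameter is given.

    Returns "us-ascii" or "utf-8" for subtypes in the matching category,
    and None when no default can be assumed: subtypes in the "no default"
    category, non-text types, and text subtypes missing from these tables.
    """
    top, _, rest = media_type.strip().lower().partition("/")
    # Drop any parameters such as "; method=REQUEST".
    sub = rest.split(";", 1)[0].strip()
    if top != "text":
        return None
    if sub in UTF8_DEFAULT:
        return "utf-8"
    if sub in ASCII_DEFAULT:
        return "us-ascii"
    # Either explicitly "no default" or an unlisted subtype.
    return None
```

For example, default_charset("text/plain") gives "us-ascii" and default_charset("text/csv") gives "utf-8", while default_charset("text/html") gives None because a special detection procedure applies there instead.  The NO_DEFAULT set is listed only for documentation; subtypes missing from all three sets are treated the same way, in line with the note that as yet undefined media types have no default.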

--Peter