May Draft, #2 -- TEXT

Nathaniel Borenstein <nsb@thumper.bellcore.com> Wed, 22 May 1991 02:15 UTC
Message-Id: <ocCRGju0M2Y1EVFV44@thumper.bellcore.com>
Date: Tue, 21 May 1991 22:17:19 -0400
From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
To: Greg Vaudreuil <gvaudre@NRI.Reston.VA.US>, John C Klensin <KLENSIN@infoods.mit.edu>, NED@hmcvax.claremont.edu, keld@dkuug.dk, erik@sra.co.jp, MRC@cac.washington.edu
Subject: May Draft, #2 -- TEXT
In-Reply-To: <9105211000.aa05707@NRI.NRI.Reston.VA.US>
References: <9105211000.aa05707@NRI.NRI.Reston.VA.US>
                      CHANGES From FIRST MAY DRAFT

Lots of prose changes, mostly minor, a few verbatim from Ned & Greg.

The MAILASCII character set is now called US-ASCII, because that's what
it really is.  However, it is no longer the default -- the default
character set is now undefined, because that's the only thing that
really corresponds to existing practice!  The discussion about how to
mail things safely was separated from the US-ASCII discussion and given
its own appendix.

Quoted-Printable has been tightened up still further.

The Content-Size definition now differentiates the two possible uses via
a "when" part.

Changed "charset" syntax to be "subtype".  Further reduced number of
content-types to NINE by consolidating things into subtypes. 
Liberalized the rules for defining subtypes, even as the rules for types
were tightened up.

Multipart messages no longer have character set specifications, and so
the prefix & postfix have gone away again, sigh...

An interesting new paragraph on subtypes has been added right near the
end of the multipart description.  I think this may be the key to
getting multipart structuring done right in the future.

Along with the "text-plus" type, I've defined a new "richmail" default
subtype that I think could be very important.

How to Read the May Draft of RFC-XXXX

This is the fifth major draft, at least, of RFC-XXXX.  Those of you who
have been following along are, no doubt, heartily sick of the process by
now, as am I.  I'm trying to make it easier for us all in the following
ways:

1.  I've compiled a list of major changes from the April draft.  I'm not
trying to pull any fast ones on anybody, but it is possible that the
list is incomplete.  It is, however, my best attempt to provide a simple
list of what has changed.

2.  With previous drafts, I think that comments mostly came in three flavors:

    NITS:  Minor points of clarification, typographical or
        technical correction, etc.  These were uncontroversial and I
        tried to adopt them all.

    SHOW-STOPPERS:  These were major disagreements, where people
        indicated unhappiness so great that they might be unable to
        live with the draft as written.  Obviously I've tried VERY
        hard to deal with these, but sometimes people have
        SHOW-STOPPER comments that are pretty nearly in direct
        conflict with each other.

    ARGUMENTS:  These are sincere disagreements where the person
        disagreeing could still live with the draft if he lost the
        argument.

I would like to STRONGLY URGE the readers of this draft to self-classify
their comments into the above three categories, and to treat them in the
following ways:

    NITS:  Send them directly to me; no need to bother the whole list.

    SHOW-STOPPERS:  Sigh...  I'm hoping there aren't any left,
        but if you have them, please send them to the whole list.

    ARGUMENTS:  If you can live with losing the argument, and if
        the argument has already been well-argued in the past on the
        list, ask yourself: is it worth re-arguing?  I'm not trying
        to prevent debate, merely encouraging you to reflect before
        reopening old arguments.

I still need help on a number of things, particularly fleshing out some
of the references and sanity-checking some of the areas in which I'm not
an expert, notably character sets, audio, and privacy-enhanced messages
(PEM).  If you know something about one of these, please read that part
of the draft extra carefully.

That's all.  Enjoy.  I look forward to your comments.  Well, sort of....
 :-)   -- Nathaniel

Major Changes From April Draft

There is a lot of new prose, and the document has been reorganized
substantially, to clarify intent and to discuss rejected alternatives. 

Content-type syntax:  There is now a distinguished place for subtypes. 
Character set types have been replaced with a subtype syntax for the
text and message types.  The rest of the syntax has been generalized to
a set of semicolon-separated parameters.

Content-types:  Several content-types have been consolidated into nine
high-level types such as "image" and "audio".  The Scribe and SGML
content-types have been eliminated.  DES-MESSAGE has been replaced by
MESSAGE/PEM.  Notable new content-types worth looking at include
text-plus/richmail, binary, message, message/partial,
application/external-reference.  The scheme for officially defining new
content-types has been changed to require an RFC for content-types, but
to be more liberal for subtypes.  The text type's default character set
is now left undefined, to match prior reality(!) but an explicit
specification (e.g. US-ASCII) is encouraged for future composers.

Multipart messages:  The definition has changed so that body-parts are
no longer messages, though the syntax is the same.  A new distinguished
closing delimiter is now required.  A new "digest" subtype is also
defined, as is a new concept of subtypes for multipart messages.

The "Encoded-Variable" stuff has been eliminated, in favor of
Content-type: Message/charset.

Content-Encoding has been changed to Content-TransferEncoding.  The
hexadecimal encoding has been eliminated, and some prose about the need
for a compressed encoding has been added.

The base64 encoding has added "," as a way to specify portable end-of-lines.

The quoted-printable encoding has changed "&" and "\" to "=" and ":" for
portability, and has added some rules (and clarified others) regarding
CRLF and white space, and is generally much tighter.

Two new optional header fields, Content-ID and Content-Description, have
been defined.  Content-Size has been extended & clarified.

Added a new notion of "RFC-XXXX-compliant" implementations, defining a
minimal subset to be implemented to earn such a label.

The X.400-related types have been dropped, leaving these questions for
the experts.

Network Working Group -- Request for Comments: XXXX


                Mechanisms for Specifying and Describing
                  the Format of Internet Message Bodies


                     Nathaniel Borenstein, Bellcore
                           Ned Freed, Innosoft


                                May 1991

Status of This Memo

This draft document will be submitted to the RFC editor as a protocol
specification.  Distribution of this memo is unlimited.  Please send
comments to Nathaniel Borenstein <nsb@thumper.bellcore.com>

Abstract

This document suggests extensions to the RFC 822 message representation
protocol to allow multi-part textual and non-textual messages to be
represented and exchanged without loss of information.   This is based
on earlier work documented in RFC 934 and RFC 1049, but extends and
revises that work.

Table of Contents

1:  Introduction
2:  The Content-Type Header Field
3:  The Content-TransferEncoding Header Field
  3.1:  Quoted-Printable Content-TransferEncoding
  3.2:  Base64 Content-TransferEncoding
4:  Additional Optional Content- Header Fields
  4.1:  Optional Content-ID Header Field
  4.2:  Optional Content-Description Header Field
  4.3:  Optional Content-Size Header Field
5:  The Nine Predefined Content-type Values
  5.1:  The TEXT Content-type and the US-ASCII Character Set
  5.2:  The "Multipart" Content-Type
  5.3:  The "Text-Plus" Content-Type and "RichMail" subtype
  5.4:  The Message Content-Type
  5.5:  The Binary Content-Type
  5.6:  The Application Content-Type Value
  5.7:  The Audio, Image, and Video, and X- Content-Types
6:  RFC-XXXX Compliance
Appendix I -- Guidelines For Sending Data Via Email
Appendix II -- Examples
  Example 1:  Simple Non-ASCII Text Example
  Example 2:  A Complex Multipart Example
Summary
Contacts
Acknowledgements
References

1	Introduction

Since its publication in 1982, RFC 822 [RFC-822] has defined the
standard format of textual mail messages on the Internet.  Its success
has been such that the RFC 822 format has been adopted, wholly or
partially, well beyond the confines of the Internet and of SMTP
transport, as defined by RFC 821 [RFC-821].  As the format has seen
wider use, a number of limitations have become increasingly problematic
for the user community.

RFC 822 was intended to specify a format for text messages.  As such,
non-text messages, such as multimedia messages that might include audio
or images, are simply not mentioned.  Even in the case of text, however,
RFC 822 is inadequate for the needs of email users whose languages
require the use of character sets richer than US ASCII [REF-ANSI].  For
mail containing audio, video, Japanese text, or even text in most
European languages, RFC 822 does not specify enough to permit
interoperability.

One of the notable limitations of RFC 821/822 based mail systems is the
fact that they limit the contents of electronic mail messages to
relatively short lines of seven-bit ASCII.  This forces a user to
convert any non-textual data that she may wish to send into seven-bit
bytes representable as printable ASCII characters before invoking her
local mail UA (User Agent program).  Examples of such encodings
currently used in the Internet include pure hexadecimal, uuencode, the
3-in-4 base 64 scheme specified in RFC 1113, the Andrew Toolkit
Representation, and many others.

These limitations become even more apparent as gateways are designed to
allow for the exchange of mail messages between RFC 822 hosts and X.400
hosts.  X.400 [REF-X400] specifies mechanisms for the inclusion of
non-textual body parts within electronic mail messages.  The current
standards for the mapping of X.400 messages to RFC 822 messages specify
that either X.400 non-textual body parts should be converted to (not
encoded in) an ASCII format, or that they should be discarded, notifying
the RFC 822 user that discarding has occurred.  This is clearly
undesirable, as information that a user may wish to receive is lost. 
Even though a user's UA may not have the capability of dealing with the
non-textual body part, the user might have some mechanism external to
the UA that can extract useful information from the body part. 
Moreover, it does not allow for the fact that the message may eventually
be gatewayed back into an X.400 MHS, where the non-textual information
would definitely become useful again.

This memo describes several mechanisms that combine to solve these
problems.  In particular, it describes:

1.  A Content-type header field, generalized from RFC 1049 [RFC-1049],
which can be used to describe the type and subtype of data in the body
of a message and to fully specify the representation (encoding) of such
data.

2.  A Content-TransferEncoding header field, which can be used to
describe an auxiliary encoding that was applied to the data in order to
allow it to pass through the mail transport layer.

3.  A "text" content-type value, which can be used to represent text
information in a number of character sets in a standardized manner.

4.  A "multipart" content-type value, which can be used to combine
several separate body-parts, which may be made of different types of
data, into a single message.

5.  A "binary" content-type value, which can be used to transmit
uninterpreted or partially-interpreted binary data, and hence to
implement an email file transfer service.

6.  A "message" content-type value, for encapsulating a mail message.

7.  Several additional content-type values and subtypes, which can be
used by consenting User Agents to interoperate with additional message
types such as audio, images, and more.

8.  Several optional header fields that can be used to further describe
the data in a message body or body-part, in particular the Content-Size,
Content-ID, and Content-Description header fields.

Finally, to specify and promote a minimal level of interoperability,
this memo describes a subset of the above mechanisms that defines
"compliance" with this memo.  That is, it specifies the minimal subset
required for an implementation to be called "RFC-XXXX-compliant."

2	The Content-Type Header Field

The Content-Type header field was first defined in RFC 1049.  This
section extends and supersedes that definition.

The Content-Type header field is used to specify the type of data in a
message, by giving a type name, and to provide auxiliary information
that may be required for certain types.  In addition, a distinguished
syntax is defined for specifying subtype information, including
character set information in the case of text.  After the type name and
the optional subtype, the remainder of the header field is simply a set
of parameter specifications, as defined for each named type, and an
optional comment.

(COMPATIBILITY NOTE:  Readers familiar with RFC 1049 Content-types will
notice that the syntax has been generalized substantially.  However,
RFC 1049 content-types are all compliant with the new syntax.  In
particular, RFC 1049 content-types omitted the subtype specification,
and always had at most two of the parts now called "parameters", which
were distinguished by their position as indicating a version number and
a resource reference.)

In the Extended BNF notation of RFC-822, we define a Content-type header
field value as follows:

Content-Type:= type ["/" subtype] *[";" parameter]
		[comment]

parameter :=      local-part

subtype := local-part

type := local-part

The type and subtype values are not case sensitive.  TEXT, Text, and
TeXt are all equivalent.  
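
As an illustration only, the syntax above might be parsed along the
following lines in Python.  This is a sketch under simplifying
assumptions that are not part of this memo: at most one trailing
parenthesized comment, no quoted strings, and no nested comments.  The
function name parse_content_type and the sample parameter "1.0" are
hypothetical.

```python
import re

def parse_content_type(value):
    """Split a Content-Type value into (type, subtype, parameters).

    Assumes at most one trailing parenthesized comment and no quoted
    strings; these are simplifications, not rules from the memo.
    """
    # Drop a single trailing RFC 822 style comment such as "(a fax)".
    value = re.sub(r"\s*\([^()]*\)\s*$", "", value)
    head, *params = [piece.strip() for piece in value.split(";")]
    type_name, _, subtype = head.partition("/")
    # Type and subtype are not case sensitive, so fold them to lower case.
    type_name = type_name.strip().lower()
    subtype = subtype.strip().lower() or None
    return type_name, subtype, [p for p in params if p]

print(parse_content_type("TeXt/US-ASCII"))
print(parse_content_type("image/G3Fax; 1.0 (a fax)"))
```

Folding the type and subtype to lower case at parse time means later
comparisons can be plain string equality.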

An initial set of nine content-types is defined by this memo.  This set
of type names is not intended to be completely exhaustive.  More may be
defined later, by a future RFC.  However, it is expected that most
extensions to the set of objects that are sent through the mail can be
accomplished by the creation of new subtypes of these initial types.

The only constraint on the definition of subtype names is the desire
that their uses not conflict.  That is, it would be undesirable to have
two different communities using "Content-type: binary/foobar" to mean
two different things.  The process of defining new content-subtypes,
then, is not intended to be a mechanism for imposing restrictions, but
simply a mechanism for publicizing the usages.  There are, therefore,
three acceptable mechanisms for defining new content-type subtypes:

    1.  Private values (starting with "X-") may be defined
        bilaterally between two cooperating agents without outside
        approval or standardization.

    2.  "Standard" values may be defined by the publication of
        an Internet RFC.  The RFC need not be very long, but must
        define the content-type and subtype, its associated
        parameter syntax, and the format of the body of a message so
        marked.

    3.  The value may be inferred as an obvious and unambiguous
        extension of the subtypes defined in a previous RFC.  For
        example, this memo defines an "image" type with subtypes
        that denote image formats such as G3Fax.  An additional
        image type for which there is one clear and obvious name is
        an obvious extension of the subtypes of "image."

The nine initial predefined content-types are detailed in Section 5
of this memo.  They are:

    text --  textual information, with character set given by the subtype
    text-plus  -- mostly textual information, with embedded
        formatting commands.  A simple default type is defined, with
        possible subtypes including troff, TeX, and so on.
    message -- an encapsulated message, with initial subtypes for
        partial messages and privacy-enhanced messages
    multipart -- a message consisting of multiple parts of
        independent type values, with initial subtype digest.
    audio -- a message containing audio data, with initial subtypes
        a-law and u-law.
    image -- a message containing image data, with initial subtypes
        G3fax, gif, pbm, ppm, and pgm.
    video -- a message containing video data.
    binary -- a message containing some other form of binary data.
    application -- a message containing data to be processed by a
        mail-based application.

If no Content-type header field is present, "text" is assumed, with the
default (undefined) character set as specified later in this memo.  
This is consistent with the default message body type as defined by RFC
822. 

It should be noted that the list of Content-type values given here may
be augmented in time, via the mechanisms described above, and that the
set of subtypes is expected to grow substantially.  We have simply
attempted, in this RFC, to give as many standard Content-type
definitions as was possible given the current state of our knowledge.

3	The Content-TransferEncoding Header Field

Many content-types are represented, in their "natural" format, as 8-bit
or binary data.  Such data cannot be transmitted over existing Internet
mail mechanisms because both RFC 821 and RFC 822 restrict mail messages
to 7 bit data with reasonably short lines.  It is necessary, therefore,
to define a standard mechanism for encoding such data in an acceptable
manner.

This RFC specifies that such encodings will be indicated by a new
"Content-TransferEncoding" header field.  The Content-TransferEncoding
field is used to indicate the type of transformation that has been used
to represent the message body in an acceptable manner.  

It may seem that the Content-TransferEncoding could be inferred from the
characteristics of the Content-Type that is to be encoded, or, at the
very least, certain Content-TransferEncodings could be mandated for use
with specific Content-Types. There are several reasons why this is not
the case. First, given the varying types of transports used for mail,
some encodings may be appropriate for some Content-Type/transport
combinations and not for others. Second, certain Content-Types may
require different types of transfer encoding under different
circumstances. For example, many PostScript messages may consist
entirely of short lines of 7-bit data and hence require little or no
encoding. Other PostScript messages (especially those using Level 2
PostScript's binary encoding mechanism) may only be reasonably
represented using a binary transport encoding. Finally, since
Content-Type is intended to be an open-ended specification mechanism,
strict specification of an association between Content-Types and
encodings effectively couples the specification of an application
protocol with a specific lower-level transport. This is not desirable
since the developers of a Content-Type may not be, and should not have to
be, aware of all the transports in use and what their limitations are.

It should be noted, also, that there is considerable interest and effort
being expended on extending mail transport to permit 8-bit or binary
data.  If such extensions ever become commonplace, the
Content-TransferEncoding mechanism will quickly become irrelevant, and
it is therefore desirable not to "overload" Content-TransferEncoding
with additional mechanisms that might still be useful in such a future. 
For this reason, Content-TransferEncoding is restricted in its scope to
refer to nothing but the 7-bit encoding question.  Matters such as the
basic format in which information is "encoded" are to be handled by
other mechanisms.  

Unlike Content-types, which are expected to proliferate, it is expected
that there will never be more than a few different
Content-TransferEncoding values, both because there is less need for
variation and because the effect of variation in
Content-TransferEncoding would be more problematic.  However,
establishing only a single Content-TransferEncoding mechanism does not
seem possible.  In particular, there is a tradeoff between the desire
for a compact and efficient encoding of binary data and the desire for a
readable encoding of data that is mostly, but not entirely, 7-bit data. 
For this reason, at least two encoding mechanisms are necessary, a
"readable" encoding and a "dense" encoding.   

A third encoding, for compressed ("super-dense") data, is also strongly
desirable.  This RFC does not specify a "compressed" encoding, due to
the uncertain legal state of the UNIX "compress" command and a lack of
certainty, during the drafting of this RFC, regarding the right way to
define a standard compression algorithm.  It is hoped that a compressed
Content-TransferEncoding will be defined in a future RFC.  Any
compression algorithm for such a use should be unambiguously defined and
without legal encumbrances.

The Content-TransferEncoding field is designed to specify a two-way
mapping between the "native" representation of a type of data and a
representation that can be readily exchanged using 7 bit mail transport
protocols as defined by RFC 821 (SMTP). This field has not been defined
by any previous RFC. The field's value is a single atom specifying the
type of encoding, as enumerated below.  Formally:

Content-TransferEncoding:=	"BASE64"/
			"QUOTED-PRINTABLE"/
			"8BIT"/"BINARY"/
			"7BIT"/"X-"atom

These values are not case sensitive.  That is, Base64 and BASE64 and
bAsE64 are all equivalent.  An encoding type of 7BIT implies that the
message is already in a seven-bit mail-ready representation. This value
is assumed if the Content-TransferEncoding header field is not present. 
If the message is stored or transported via a mechanism that permits
8-bit data, a Content-TransferEncoding of "8bit" may be used.  If the
message is stored or transported via a mechanism that permits arbitrary
binary data, a Content-TransferEncoding of "binary" may be
used.  In particular, "8bit" or "binary" should be used in the case
where there is fear that the message may "leak" into a more restricted
(7-bit) transport environment.  (DISCUSSION:  The distinction between
the Content-TransferEncoding values of "binary," "8bit," and "7bit" may
seem unimportant in an 8-bit binary environment, but clear labeling will
be of enormous value to gateways between 8-bit and 7-bit systems.  The
difference between "8bit" and "binary" is that "8bit" implies adherence
to SMTP limits on line length and CR/LF semantics, whereas "binary" does
not.)

Implementors may, if necessary, define new Content-TransferEncoding
values, but should prefix them with "x-" to indicate their non-standard
status, e.g. "Content-TransferEncoding:  x-my-new-encoding".   However,
unlike Content-types and subtypes, the creation of new
Content-TransferEncoding values is explicitly discouraged, as it seems
likely to hinder interoperability with little potential benefit.
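
A receiving agent might normalize the field value roughly as follows.
This is a minimal Python sketch, not a definitive implementation: the
name normalize_transfer_encoding is hypothetical, and treating an empty
value the same as an absent header field is an assumption of the sketch.

```python
STANDARD_ENCODINGS = {"base64", "quoted-printable", "8bit", "binary", "7bit"}

def normalize_transfer_encoding(value=None):
    """Return the lower-cased encoding name, defaulting to "7bit"."""
    if value is None or not value.strip():
        # 7BIT is assumed when the header field is not present.
        return "7bit"
    name = value.strip().lower()      # values are not case sensitive
    if name in STANDARD_ENCODINGS:
        return name
    if name.startswith("x-") and len(name) > 2:
        return name                   # private, non-standard encoding
    raise ValueError("unrecognized Content-TransferEncoding: %r" % value)

print(normalize_transfer_encoding("bAsE64"))
print(normalize_transfer_encoding())
```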

If a Content-TransferEncoding header field appears as part of a message
header, it applies to the entire message body, whether or not that body
is of type "multipart."  If it is of type multipart, the encoding
applies recursively to all of the encapsulated parts, including their
encapsulated headers.  If a Content-TransferEncoding header field
appears as part of an encapsulation's headers, it applies only to the
body of the encapsulated part.  If the encapsulated part is itself of
type "multipart", the encoding applies recursively to all of the
encapsulated parts within that encapsulated part.

It should be noted that, because email is character-oriented, the
mechanisms described here are mechanisms for encoding arbitrary byte
streams, not bit streams.  If a bit stream is to be encoded via one of
these mechanisms, it should first be converted to a byte stream using
the network standard bit order ("big-endian"), in which the earlier bits
in a stream become the higher-order bits in a byte.  A bit stream not
ending at an 8-bit boundary should be padded with zeroes.  If the
precise bit count is needed, it can be given in the Content-Size header
field, described later in this document.
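
The conversion described above can be sketched in Python as follows.
The function name bits_to_bytes is hypothetical, and returning the bit
count alongside the bytes is only one possible way of keeping the value
a sender might place in Content-Size.

```python
def bits_to_bytes(bits):
    """Pack an iterable of 0/1 values into bytes, earliest bit first.

    Returns (data, bit_count).  The final byte is padded on the right
    with zero bits when the stream does not end on an 8-bit boundary.
    """
    bits = list(bits)
    out = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        chunk += [0] * (8 - len(chunk))      # zero padding for a short tail
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | (bit & 1)   # earlier bits end up higher
        out.append(byte)
    return bytes(out), len(bits)

print(bits_to_bytes([1, 0, 1]))
```

For the three-bit stream 1, 0, 1 the single output byte is 10100000
in binary, and the returned count of 3 records how many of those bits
are meaningful.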

The following sections will define the two standard encoding mechanisms.

3.1	Quoted-Printable Content-TransferEncoding

The Quoted-Printable encoding is intended to represent data that largely
contains octets less than 127.  It encodes the data in such a way that
the resulting octets are both unlikely to be modified by mail transport,
and, when read as ASCII text, are largely recognisable by humans.  A
message which is entirely ASCII may also be encoded in Quoted-Printable
to ensure its survival when it is expected to traverse a
character-translating gateway, such as those onto Bitnet.

In this encoding, ASCII characters 33 through 57, inclusive, 59, 60, and
62 through 126, inclusive, are unchanged.  All other characters,
including characters 32 (SPACE), 58 (:), 61 (=), 127 (DEL), and all
control characters, are to be represented as determined by the following
rules:

    Rule #1:  Any 8 bit value may be represented by a ":" followed by a
    two digit hexadecimal representation of the character's 8-bit value.
     Thus, for example, character 12 (control-L, or formfeed) can be
    represented by ":0C", the equal-sign character (61) can be
    represented by ":3D", and the colon character (58) itself can be
    represented by ":3A".  Rule #1 is the REQUIRED representation for
    characters 127 through 160 and for character 255.

    Rule #2:  An 8 bit value from 161 through 254 may, alternately, be
    represented by an equal-sign character followed by the single
    character obtained by the removal of the high order bit, i.e. by
    subtracting 128 from the value.  Thus  the 8 bit value 193 may be
    represented as "=A".  Rule #2 is completely optional, given rule #1,
    but is provided for improved readability of some 8-bit character
    sets in which turning on the 8th bit produces a character similar to
    the corresponding 7 bit character, e.g. the 8th bit simply adds an
    umlaut.  

    Rule #3:  The literal equal-sign and colon characters must
    themselves be quoted by colons.  Thus, the colon may be represented
    as "::" and the equal-sign as ":=".  Note that this is not ambiguous
    with regard to rule #1, because neither ":" nor "=" is
    part of the hexadecimal alphabet.

    Rule #4:  A colon at the end of a line may be used to indicate a
    non-significant line break.  That is, if one needs to include a long
    line without line breaks, a message encoded with the
    quoted-printable encoding should include "soft" line breaks in which
    the line break is preceded by a colon.  Thus if the "raw" form of
    the line is a single line that says:

    Now's the time for all men to come to the aid of their country. 
    Now's the time for all men to come to the aid of their country. 
    Now's the time for all men to come to the aid of their country.

    This could be represented, in the quoted-printable encoding, as

    Now's the time for all men to come to the aid of their country.  :
    Now's the time for all men to come to the aid of their country.  :
    Now's the time for all men to come to the aid of their country.  

    This provides a mechanism with which long lines are encoded in such
    a way as to be restored by the user agent.    The quoted-printable
    encoding REQUIRES that lines be broken so that they are no more than
    78 characters long, using soft line breaks when necessary.

    Rule #5:  Although the SPACE (32) and TAB (9) characters may
    generally be represented as themselves, they should NOT be so
    represented at the end of a line, because some MTAs are known to
    remove "white space" from the end of a line.  In such cases, the
    characters MUST be represented as in rule #1 (as ":20" and ":09"
    respectively) or as themselves, followed by a soft line break
    followed by a real line break.  Of course, these characters can be
    so represented within a line as well, if this is desired; in the
    case of the TAB character, representing it as ":09" may be somewhat
    more robust even in the middle of a line.  Note that in decoding a
    quoted-printable message, any trailing white space on a line should
    be deleted, as it will necessarily have been added by intermediate
    transport agents.

    Rule #6: A CR LF pair normally constitutes a line break and should
    be represented by a line break in the quoted-printable encoding if
    that is its meaning. Isolated CRs, LFs, and LF CR sequences must be
    represented using the :0D, :0A, and :0A:0D notations respectively.
    CR LF sequences that are not intended to represent a line break
    should be encoded as :0D:0A to reflect this usage.  In other words,
    the concept "end of line" is represented, in the quoted-printable
    encoding, by CR LF, although this may be modified in local storage
    formats.  Literal occurrences of CR or LF that do not occur as CRLF
    or are not intended to represent end-of-line markers must be
    represented in hexadecimal.
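
The six rules above can be sketched in Python as follows.  This is an
illustration under stated assumptions, not part of the specification.
The names qp_encode and qp_decode are hypothetical.  The encoder always
uses the rule #1 hexadecimal form (never the optional forms of rules #2
and #3), writes TAB as ":09", and counts the soft-break colon toward
the 78-character limit.  Malformed input is not validated.  Python's
standard quopri module implements a different, "="-based syntax and is
therefore not used here.

```python
def _is_literal(b):
    # Characters left unchanged, plus SPACE, which rule #5 permits
    # except at the end of a line (handled in qp_encode).
    return b == 32 or 33 <= b <= 57 or b in (59, 60) or 62 <= b <= 126

def qp_encode(data, width=78):
    """Encode bytes using the colon-based rules of this section."""
    lines = []
    for raw in data.split(b"\r\n"):            # rule #6: CRLF is a line break
        tokens = [chr(b) if _is_literal(b) else ":%02X" % b for b in raw]
        if tokens and tokens[-1] == " ":
            tokens[-1] = ":20"                 # rule #5: no bare trailing space
        line = ""
        for tok in tokens:
            if len(line) + len(tok) > width - 1:
                lines.append(line + ":")       # rule #4: soft line break
                line = ""
            line += tok
        lines.append(line)
    return "\r\n".join(lines)

def qp_decode(text):
    """Decode text produced under the colon-based rules of this section."""
    out = bytearray()
    lines = text.split("\r\n")
    for n, line in enumerate(lines):
        line = line.rstrip(" \t")              # rule #5: drop added white space
        i, soft = 0, False
        while i < len(line):
            ch = line[i]
            if ch == ":" and i + 1 == len(line):
                soft = True                    # rule #4: soft line break
                i += 1
            elif ch == ":" and line[i + 1] in ":=":
                out.append(ord(line[i + 1]))   # rule #3: quoted ":" or "="
                i += 2
            elif ch == ":":
                out.append(int(line[i + 1:i + 3], 16))   # rule #1
                i += 3
            elif ch == "=":
                out.append(ord(line[i + 1]) | 0x80)      # rule #2
                i += 2
            else:
                out.append(ord(ch))
                i += 1
        if not soft and n < len(lines) - 1:
            out += b"\r\n"                     # a real line break
    return bytes(out)

sample = b"caf\xe9 = 1:2 \r\nnext\tline\r\n" + b"x" * 200
assert qp_decode(qp_encode(sample)) == sample
print(qp_encode(b"a=b:c "))
```

The decoder accepts the optional forms of rules #2 and #3 even though
the encoder never emits them, since a conforming sender may use them.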

Since the hyphen character ("-") is represented as itself in the
Quoted-Printable encoding, the usual care must be taken, when
encapsulating a quoted-printable encoded message  or body part in a
multipart message, to ensure that the encapsulation boundary does not
appear anywhere in the message.  See the definition of multipart
messages, later in this memo.

3.2	Base64 Content-TransferEncoding

The Base64 Content-TransferEncoding is designed to represent arbitrary 8
bit data in a form that is not humanly readable.  The encoding and
decoding algorithms are simple, but the encoded data is only about 33
percent larger than the unencoded data.  This encoding is based on the
one used in Privacy Enhanced Mail applications, as defined in RFC 1113. 
The base64 encoding is adapted from RFC 1113, with two changes:  base64
eliminates the "*" mechanism for embedded clear text and defines a new
syntax for portable end-of-line markers, using the comma character.

A 66-character subset of International Alphabet IA5 is used, enabling 6
bits to be represented per printable character. (The extra 65th and 66th
characters "=" and "," are used to signify special processing
functions.) This subset has the important property that it is
represented identically in IA5 and ASCII, and all characters in the
subset are part of the so-called invariant subset of EBCDIC. Other
popular encodings such as the encoding used by the UUENCODE utility and
the base85 encoding specified as part of Level 2 PostScript do not share
these properties, and thus do not fulfill the portability requirements
placed on a binary transport encoding for mail.

The encoding process represents 24-bit groups of input bits as output
strings of 4 encoded characters. Proceeding from left to right, a
24-bit input group is formed by concatenating 3 8-bit input groups; this
group is then treated as 4 concatenated 6-bit groups.  When encoding a bit
stream via the base64 encoding, the bit stream should be presumed to be
ordered with the most-significant-bit first.  That is, the first bit in
the stream will be the high-order bit in the first byte, and the eighth
bit will be the low-order bit in the first byte, and so on.

Each 6-bit group is used as an index into an array of 64 printable
characters. The character referenced by the index is placed in the
output string. These characters, identified in Table 1 below, are
selected so as to be universally representable, and the set excludes
characters with particular significance to SMTP (e.g., ".", "CR", "LF").

                                 Table 1

   Value Encoding  Value Encoding  Value Encoding  Value Encoding
       0 A            17 R            34 i            51 z
       1 B            18 S            35 j            52 0
       2 C            19 T            36 k            53 1
       3 D            20 U            37 l            54 2
       4 E            21 V            38 m            55 3
       5 F            22 W            39 n            56 4
       6 G            23 X            40 o            57 5
       7 H            24 Y            41 p            58 6
       8 I            25 Z            42 q            59 7
       9 J            26 a            43 r            60 8
      10 K            27 b            44 s            61 9
      11 L            28 c            45 t            62 +
      12 M            29 d            46 u            63 /
      13 N            30 e            47 v
      14 O            31 f            48 w         (pad) =
      15 P            32 g            49 x         (eol) ,
      16 Q            33 h            50 y

Special processing is performed if fewer than 24 bits are available at
the end of a message or encapsulated part of a message.  A full encoding
quantum is always completed at the end of a message. When fewer than 24
input bits are available in an input group, zero bits are added (on the
right) to form an integral number of 6-bit groups.  Output character
positions which are not required to represent actual input data are set
to the character "=".  Since all canonically encoded output is an
integral number of octets, only the following cases can arise: (1) the
final quantum of encoding input is an integral multiple of 24 bits;
here, the final unit of encoded output will be an integral multiple of 4
characters with no "=" padding, (2) the final quantum of encoding input
is exactly 8 bits; here, the final unit of encoded output will be two
characters followed by two "=" padding characters, or (3) the final
quantum of encoding input is exactly 16 bits; here, the final unit of
encoded output will be three characters followed by one "=" padding
character.
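
The encoding steps and padding cases described above can be sketched
in Python as follows.  This is only an illustration, not part of the
specification: the names ALPHABET and encode_base64 are hypothetical,
and the comma end-of-line marker is not handled here.

```python
# Table 1 as a string: the index is the 6-bit value, the character its encoding.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_base64(data: bytes) -> str:
    """Encode bytes using Table 1, padding the final quantum with "="."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Form a 24-bit group, most significant bit first; missing
        # octets contribute zero bits on the right.
        group = int.from_bytes(chunk.ljust(3, b"\0"), "big")
        # Split the group into four 6-bit indices into the alphabet.
        chars = [ALPHABET[(group >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 input octet keeps 2 characters, 2 octets keep 3, 3 octets keep 4.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)
```

With this sketch, the three octets "abc" encode to "YWJj", and a
single octet "a" encodes to "YQ==".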

One addition is made to the RFC 1113 specification of this encoding: 
The comma character (",", ASCII 44) may be used to represent an
"end-of-line" or "end-of-record" marker.  If line-oriented data are
encoded using base64, it is desirable to restore end-of-line markers
according to the local convention.  The RFC 1113 specification, as given
above, offers no way to differentiate between a binary file including a
CRLF sequence and a portable end-of-line marker.  This memo augments
that mechanism to permit such differentiation, as follows.  To represent
an end-of-line marker:

    1.  Treat the byte stream preceding the end-of-line as
    terminating at the end of the line -- that is, pad with "="
    characters as appropriate to complete the representation of the
    line.

    2.  Insert a comma character.

    3.  Resume the encoding starting a new 24-bit input group with
    the first character on the next line.

Thus, while encoding the binary sequence "a-b-c-CR-LF-a-b-c"  yields the
octets which are represented in ASCII as "YWJjDQphYmM=", encoding
"a-b-c" followed by an end-of-line followed by "a-b-c" yields
"YWJj,YWJj".  They will be translated back into the same thing if the
local end-of-line convention is CRLF, but they will be translated back
differently if the end-of-line convention is anything other than CRLF.
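
The end-of-line convention above might be implemented roughly as in
the following Python sketch.  It relies on Python's standard base64
module for the 64-character alphabet and "=" padding; the function
names encode_lines and decode_lines, and the local_eol parameter, are
hypothetical and introduced only for illustration.

```python
import base64


def encode_lines(lines):
    """Encode each line as its own padded unit and join them with commas."""
    return ",".join(base64.b64encode(line).decode("ascii") for line in lines)


def decode_lines(encoded, local_eol=b"\n"):
    """Decode comma-separated units, restoring the local end-of-line marker."""
    return local_eol.join(base64.b64decode(seg) for seg in encoded.split(","))
```

Here encode_lines([b"abc", b"abc"]) gives "YWJj,YWJj", and decoding
that string yields "abc", the local end-of-line marker, and "abc".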

Note: There is no need to worry about quoting apparent encapsulation
boundaries within base64-encoded parts of multipart messages, because no
hyphen characters are used in the base64 encoding.
4	Additional Optional Content- Header Fields

4.1	Optional Content-ID Header Field

In constructing a high-level user agent, it may be desirable to allow
one message body-part to make reference to another.  This may be done
using the "Content-ID" header field, which is syntactically identical to
the "Message-ID" header field:

Content-ID := "<" msg-id ">"

4.2	Optional Content-Description Header Field

It may be desirable to associate some descriptive information with a
given body-part.  For example, it may be useful to mark an "image"
body-part as "a picture of the Space Shuttle Endeavour."  Such text may
be placed in the Content-Description header field.  

Content-Description := *text

4.3	Optional Content-Size Header Field

In the discussions of earlier drafts of this memo, some people indicated
a strong preference for using a size-counting scheme to delimit the
boundaries between encapsulated parts of multipart messages.  This was
rejected because such schemes are not, in general, sufficiently robust
across the SMTP transport layer.  For example, line counts can be
altered by line-wrapping MTAs, and byte counts can be altered in any
number of ways, and may be confused by crossing boundaries in which the
size of an end-of-line marker changes.  However, there are restricted
environments in which either or both of these counts can be relied upon,
and in such environments it may be desirable to implement a count-based
approach to delimiters.  Therefore this memo specifies a conventional
way to do this, in order to promote interoperability among systems that
are able to take this approach.

In such cases, boundary delimiters, as defined above, are still
required.  However, the header area of an encapsulated part may include
an optional Content-Size header which indicates where the encapsulated
part ends, if its size has not been altered.  The size may be measured
in either bytes or lines.  Those who use the Content-Size header field
should still preserve the encapsulation boundaries, and should recognize
that other agents are free to ignore it in favor of complete reliance on
encapsulation boundaries.

It should also be noted that those who wish to use the Content-Size
mechanism have two rather different possible motivations.  One is to
find the end of the data as represented for mail transport, an
enterprise which, as noted above, can be counted on to provide no better
than an estimate.  The other is to declare the initial size of the
object before mail transport, to be used as a check on the integrity of
the data.  Accordingly, the Content-Size header field allows the sender
to distinguish whether he is measuring the size of the original object
or its encoded form.

The Content-Size header field is defined as follows:

Content-Size := 1*DIGIT unit when

unit := "lines" / "bytes" / "bits"

when := "original" / "encoded"
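
A receiver might parse this field along the following lines.  This is
a sketch only: it assumes that the three tokens are separated by white
space and matched without regard to case, neither of which is stated
by the grammar above, and the name parse_content_size is hypothetical.

```python
import re

# 1*DIGIT unit when; white-space separation is an assumption of this sketch.
_CONTENT_SIZE = re.compile(
    r"^\s*(\d+)\s+(lines|bytes|bits)\s+(original|encoded)\s*$",
    re.IGNORECASE,
)


def parse_content_size(value):
    """Return (count, unit, when), or None if the value does not match."""
    m = _CONTENT_SIZE.match(value)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).lower(), m.group(3).lower()
```

A value that does not match is reported as None, so that an agent can
fall back on the encapsulation boundaries alone.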

Note that each encapsulated part should still end with an end-of-line
followed by an encapsulation boundary.  However, a message store that
wishes, for example, to use a storage format that is largely RFC
822-compliant, but includes binary storage of binary objects, can use
the Content-Size header field to indicate whether or not the final
end-of-line is to be interpreted as part of the binary object.  If the
end-of-line follows the number of bytes specified for the encapsulation,
then it is not part of the encapsulation.

The size given by the Content-Size header field is the size of the
encapsulation's body only, not counting the blank line that separates
the header from the body.  In other words, the four bytes CRLF CRLF,
which separate header from body, are NOT counted as part of the
content-size.
5	The Nine Predefined Content-type Values

This memo defines nine initial content-type values and an extension
mechanism for private or experimental types.  Further types must be
defined and published by a new RFC.  It is expected that most innovation
in new types of mail will take place as subtypes of the nine types
defined here.

5.1	The TEXT Content-type and the US-ASCII Character Set

The text content-type is intended for sending textual email.  It is the
default content-type.  Subtype names are used, for text, to indicate
character sets.  In keeping with historical practice and expectations,
the default content-type for internet mail is "text", and the default
subtype (character set) is unspecified.  This content-type can be
explicitly specified as "text", and the character set that many people
seem to think of as the default can be specified as "US-ASCII". 
However, it must be noted that because of the lack of character set
specification in RFC 822, nothing can be assumed about mail with
content-type "text" but no character set specification.

Alternately, a different character set subtype may be specified, in
which case the body text is in the specified character set.  A
recommended list of predefined subtype names can be found at the end of
this section.  Note that if the specified character set includes 8-bit
data, the Content-TransferEncoding header field is required in order to
transmit the message via SMTP.

The US-ASCII character set, which many assume to be the default, has
been the subject of some confusion and ambiguity in the past.  Not only
were there some ambiguities in the definition, there have been wide
variations in practice.  In order to eliminate such ambiguity and
variations in the future, it is strongly recommended that new user
agents explicitly specify a character set via the content-type header
field.

The US-ASCII character set is based on a series of standards and on the
historical standard practice in the Internet mail community.  However,
the precise meaning of this character set has been the subject of some
debate.  In this section, therefore, we define the US-ASCII character
set.  

The message body is coded in the character set of American Standard Code
for Information Interchange, sometimes known as "7-bit ASCII". This is
not an arbitrary seven-bit character code, but indicates that the
message body uses character coding that uses the exact correspondence of
codes to characters specified in ASCII.  National use variations of
ISO646 [REF-ISO646] are not ASCII, and neither an explicit "ASCII"
character set, nor "US-ASCII", nor the default (omission of a character
set) should be used when characters are coded using them.  (Discussion:
RFC 821 very explicitly specifies "ASCII", and references an earlier
version of the American Standard cited in [REF-ANSI].  Whether that
specification, rather than a reference to an International Standard, was
done deliberately or out of convenience or ignorance, is no longer
interesting: insofar as one of the purposes of specifying a content-type
and character set is to permit the receiver to unambiguously determine
how the sender intended the coded message to be interpreted, assuming
anything other than "strict ASCII" as the default would risk
unintentional and incompatible changes to the semantics of messages now
being transmitted.  This also implies that messages containing
characters coded according to national variations on ISO646, or using
code-switching procedures (e.g., those of ISO2022), as well as 8-bit or
multiple-octet character encodings, MUST use an appropriate character
set specification to be consistent with this specification.)

The complete US-ASCII character set is listed below: 

 0 nul  16 dle  32 sp   48  0   64  @   80  P    96  `   112  p 
 1 soh  17 dc1  33  !   49  1   65  A   81  Q    97  a   113  q 
 2 stx  18 dc2  34  "   50  2   66  B   82  R    98  b   114  r 
 3 etx  19 dc3  35  #   51  3   67  C   83  S    99  c   115  s 
 4 eot  20 dc4  36  $   52  4   68  D   84  T   100  d   116  t 
 5 enq  21 nak  37  %   53  5   69  E   85  U   101  e   117  u 
 6 ack  22 syn  38  &   54  6   70  F   86  V   102  f   118  v 
 7 bel  23 etb  39  '   55  7   71  G   87  W   103  g   119  w 
 8 bs   24 can  40  (   56  8   72  H   88  X   104  h   120  x 
 9 ht   25 em   41  )   57  9   73  I   89  Y   105  i   121  y 
10 lf   26 sub  42  *   58  :   74  J   90  Z   106  j   122  z 
11 vt   27 esc  43  +   59  ;   75  K   91  [   107  k   123  { 
12 np   28 fs   44  ,   60  <   76  L   92  \   108  l   124  |
13 cr   29 gs   45  -   61  =   77  M   93  ]   109  m   125  } 
14 so   30 rs   46  .   62  >   78  N   94  ^   110  n   126  ~ 
15 si   31 us   47  /   63  ?   79  O   95  _   111  o   127 del

Beyond US-ASCII, one can imagine an enormous proliferation of character
sets.  It is the opinion of the authors of this memo that a large number
of character sets is NOT a good thing.  We would prefer to specify a
single character set that can be used universally for representing all
of the world's languages in electronic mail.  Unfortunately, there is no
clear choice for such a universal representation, and existing practice
in several communities seems to point to the continuing use of multiple
character sets in the near future.  For this reason, we define names for
a small number of character sets for which a strong constituent base
exists.  We recommend the use of ISO-10646 wherever possible.

The defined subtypes of text, which name alternate character sets, are:

US-ASCII -- as defined above.

ISO-10646 -- as defined in [REF-ISO-10646] 

ISO-8859-X -- where "X" is to be replaced, as necessary, for the
national use variants of ISO-8859 [REF-ISO-8859]

ISO-2022 -- as defined in [REF-ISO-2022]

In the opinion of the authors, this is already far more character sets
than are really desirable, and implementors are discouraged from
defining new ones unless absolutely necessary.

***** I AM SURE THAT I NEED SOME FLESHING OUT OF THE ABOVE DEFINITIONS &
REFERENCES
5.2	The "Multipart" Content-Type

In the case of multiple part messages, a "multipart" Content-type field
should appear in the RFC 822 message header. The message body is then
assumed to contain multiple parts separated by encapsulation boundaries.
Each of the parts is defined, syntactically, as a complete RFC 822
message in miniature.  That is, what is found between the encapsulation
boundaries is a header area, a blank line, and a body area, in
accordance with the RFC 822 syntax for a message.  However, body parts
are NOT to be interpreted as actually being RFC 822 messages.  To begin
with, NO header fields are actually required in body parts.  A body part
that starts with a blank line, therefore, is a body part for which all
default values are to be assumed.  In such a case, of course, the
absence of a Content-type header field implies that the encapsulation is
US-ASCII text.  The only header fields that have defined meaning for
body-parts are those the names of which begin with "Content-".  All
other header fields are to be ignored in body-parts, and may be
discarded by gateways.  They are permitted to appear in body parts only
for ease of conversion between messages and body parts.

It must be understood that body parts are NOT messages.  For example, a
gateway between Internet and X.400 mail must be able to tell the
difference between a body part that consists of an image and a body part
that consists of an encapsulated message, the body of which is an image.
In order to represent the latter, the body part should have
"Content-type: message", and its body (after the blank line) should be
the encapsulated message, with its own "Content-type: image" header
field.  Body parts use the same syntax as messages because there are
many legitimate cases in which a body part might be converted into a
message, or vice versa.  The identical syntax makes such conversions
easy, but must be understood by implementors.  (For the special case in
which all parts are actually messages, a "digest" subtype is also
defined.)

As stated previously, each pair of consecutive body parts is separated
by an encapsulation boundary.  The encapsulation boundary MUST NOT
appear inside any of the encapsulated parts.  Thus, it is crucial that
the composing agent be able to choose and specify the boundary that will
separate the parts.  

The Content-type field for multipart messages requires two
supplementary fields.  The first is used to specify a version number and
should be either "1-S" or "1-P".  The two versions have identical
syntax, but the "-P" is intended as a hint, to receivers, that the parts
are intended to be viewed in parallel rather than sequentially.
Implementations that cannot show the parts in parallel, or that choose
not to do so, are free to treat all multipart messages of version "1-P"
as if they were version "1-S".  However, all implementations should
check the version number, to ensure graceful behavior in the event that
an incompatible future version of multipart messages is defined later.

The second supplementary field, which is always required for multipart
messages, is used to specify the format of the encapsulation boundary. 
The encapsulation boundary is defined as a line consisting entirely of
two hyphen characters ("-", decimal code 45) followed by the second
parameter of the Content-type header field with any leading or trailing
white space removed.  (DISCUSSION:  The specification that white space
be removed is intended to eliminate the possible introduction of
ambiguity caused by the addition or deletion of white space by message
transport agents.  The hyphens are for rough compatibility with the
earlier RFC 934 method of message encapsulation, and for ease of
searching for the boundaries in some implementations.  However, it
should be noted that multipart messages are NOT completely compatible
with RFC 934 encapsulations; in particular, they do not obey RFC 934
quoting conventions for embedded lines that begin with hyphens.)

Thus, a typical multipart content-type header field might look like this:

Content-type: multipart; 1-S; gc0p4Jq0M2Yt08jU534c0p

This indicates that the message consists of several parts, each itself
structured as an RFC 822 message, which are intended to be viewed
one-at-a-time, and that the parts are separated by the line

--gc0p4Jq0M2Yt08jU534c0p

The encapsulation boundaries must not appear within the encapsulations,
and should be no longer than 70 characters, not counting the two leading
hyphens.

The encapsulation boundary following the last body-part should be a
distinguished delimiter that indicates that no further body-parts will
follow.  Such a delimiter is identical to the previous delimiters, with
the addition of two more hyphens at the end of the line:

--gc0p4Jq0M2Yt08jU534c0p--

It should be noted that there appears to be room for additional
information prior to the first encapsulation boundary and following the
final such boundary.  For several reasons, however, it is specified that
these areas should be left blank, and that implementations should ignore
anything that appears before the first boundary or after the last one.

The use of "Content-Type: Multipart" as a message part within another
"Content-Type: Multipart" is explicitly allowed.  In such cases, for
obvious reasons, care must be taken to ensure that each nested multipart
message uses a different boundary delimiter.  See Appendix II for
an example of nested multipart messages.

The use of content-type "Multipart" with only a single included part may
be useful in certain contexts, and is explicitly permitted.

Overall, the body of a multipart message may be specified as follows:

body := prefix 1*encapsulation close-delimiter postfix

encapsulation := delimiter CRLF message

delimiter := "--" <delimiter from Content-type resource> 

close-delimiter := delimiter "--"

prefix := *text

postfix := *text

message := <as defined in RFC 822, with all header fields
	  optional, containing no lines matching "delimiter">
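
As an illustration of the syntax above, a receiving agent might locate
the parts roughly as in the following Python sketch.  It assumes that
lines are separated by CRLF and that a boundary line matches exactly,
with no trailing white space; the names parse_multipart_type and
split_multipart are hypothetical.

```python
def parse_multipart_type(value):
    """Split 'multipart; version; boundary' into (version, boundary)."""
    fields = [f.strip() for f in value.split(";")]
    if len(fields) != 3 or fields[0].lower().split("/")[0] != "multipart":
        raise ValueError("not a multipart content-type with two parameters")
    return fields[1], fields[2]


def split_multipart(body, boundary):
    """Return the encapsulated parts; text outside the boundaries is ignored."""
    delimiter = "--" + boundary
    close = delimiter + "--"
    parts, current = [], None
    for line in body.split("\r\n"):
        if line == close:
            # The close-delimiter ends the last part; the postfix is ignored.
            if current is not None:
                parts.append("\r\n".join(current))
            return parts
        if line == delimiter:
            # A delimiter ends the previous part (if any) and starts a new one.
            if current is not None:
                parts.append("\r\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
        # Lines seen before the first delimiter (the prefix) are dropped.
    raise ValueError("missing close-delimiter")
```

Each returned part still contains its own header area, blank line, and
body area, and can be parsed further in the same way as a message.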

The above description defines the default subtype of the multipart type,
"mixed", which may be explicitly specified with a content-type of
"multipart/mixed".   Other subtypes are possible, but should be defined
to be syntactically compatible with the "mixed" subtype.  Unrecognized
subtypes should be treated as being of subtype "mixed."  (DISCUSSION: 
Conspicuously missing from the multipart type is a notion of structure. 
In general, it seems premature to try to standardize structure yet.  It
is recommended that those wishing to provide a more structured or
integrated multipart messaging facility should define a subtype of
multipart that is syntactically identical, but that always expects the
inclusion of a distinguished part (e.g. with a content-type of
"Application/x-my-structure-subtype") that can be used to specify the
structure and integration of the other parts, probably referring to them
by their Content-ID field.  If this approach is used, other
implementations will not recognize the subtype, but will treat it as the
default subtype (multipart/mixed) and will thus be able to show the user
the parts that are recognized.)

This memo defines one particular subtype of multipart, the "digest"
subtype.  This type is syntactically identical to multipart, but the
semantics are different.  In particular, in a digest, all of the parts
are assumed to be of type "Message".  That is, each part is implicitly
prefixed by a line that says "Content-type: message" followed by a blank
line.  This is provided in order to allow a more readable digest format
that is largely compatible (except for the quoting convention) with RFC
934.
5.3	The "Text-Plus" Content-Type and "RichMail" subtype

There are many formats for representing what might be known as "extended
text" -- text with embedded formatting and presentation information.  An
interesting characteristic of most such representations is that they are
to some extent readable even without the software that interprets them. 
It is useful, then, to distinguish them, at the highest level, from such
non-readable data as images or audio messages.  In the absence of
appropriate interpreting software, it is reasonable to show extended
text to the user, while it is not reasonable to do so with binary data.

To represent such data, this memo defines a "text-plus" content-type. 
Plausible subtypes of text-plus are typically given by the common name
of the representation format, e.g. "text-plus/Troff" or "text-plus/TeX".
Character sets are not specified as subtypes; in general it is assumed
that rich text formats will have their own mechanisms for representing
alternate or multiple character sets.  Initial subtypes include troff,
tex, PostScript, DVI, and ODA.

**** Should the latter three really be binary????

In order to promote the wider interoperability of simple formatted text,
this memo defines a default subtype for "text-plus", the "richmail"
subtype.  This subtype was designed to meet the following criteria:

    1.  All special formatting characters are extremely portable
    (only "%", "(", and ")" are used).

    2.  The syntax is extremely simple to parse, so that even
    teletype-oriented mail systems can easily strip away the
    formatting information and leave only the readable text.

    3.  The syntax is easily extended to allow for new formatting
    commands that are deemed essential.

    4.  The capabilities are extremely limited, to ensure that it
    can represent no more than is likely to be representable by the
    user's primary word processor.  While this limits what can be
    sent, it increases the likelihood that it can be properly
    displayed.

    5.  The syntax does not correspond exactly to any known existing
    system, thus giving no special preference to anyone's current
    syntax.

The syntax of "richmail" is very simple.  It is assumed, at the
top-level, to be in the US-ASCII character set.  All characters
represent themselves, with the following exceptions:

    1.  The "%" character may be used to quote characters that need
    quoting, particularly itself and the left and right parenthesis
    characters.

    2.  The "%" character is used to begin a formatting token, which
    is no more than 28 characters long, consists only of
    case-insensitive alphanumeric characters or the hyphen character
    "-", and ends with a "(" character.

    3.  After a formatting token, subsequent text is affected by
    that formatting command until the next unquoted ")" character.

Thus, for example, the following "text-plus/richmail"  body fragment:

%bold(Now) is the time for %italic(all) good men %smaller(%(and women%))
to come to the aid of their country.

represents the following formatted text (which will, no doubt, look
cryptic in the text-only version of this memo):

Now is the time for all good men (and women) to come to the aid of their
country.

Initially defined formatting tokens are:

    Bold -- causes the enclosed text to be in a bold font, if possible.
    Italic -- causes the enclosed text to be in an italic font, if
        possible.
    Fixed -- causes the enclosed text to be in a fixed width font,
        if possible.
    Smaller -- causes the enclosed text to be in a smaller font, if
        possible.
    Bigger -- causes the enclosed text to be in a bigger font, if possible.
    Underline -- causes the enclosed text to be underlined, if possible.
    Center -- causes the enclosed text to be centered, if possible.
    FlushLeft -- causes the enclosed text to be left justified, if
        possible.
    FlushRight -- causes the enclosed text to be right justified, if
        possible.
    Indent -- causes the enclosed text to be indented at both
        margins, if possible.
    ISO-10646 -- causes the enclosed text to be interpreted as text
        in the ISO-10646 character set, if possible.
    ISO-8859-X  (for any registered value of X) -- causes the
        enclosed text to be interpreted as text in the appropriate
        character set, if possible.
    ISO-2022 -- causes the enclosed text to be interpreted as text
        in the ISO-2022 multiple character set representation, if
        possible.
    US-ASCII -- causes the enclosed text to be interpreted as text
        in the US-ASCII character set, if possible.  Although this is
        the default character set, it might be usefully nested inside
        another character set.
    Invisible -- causes the enclosed text to be regarded as
        invisible, and not shown to the user.  This can be used by
        systems that wish to translate a "richer" format into "richmail"
        for mail transport, but want to be able to restore the
        formatting more fully if it is read with a completely compatible
        system.
    No-op -- has no effect on the enclosed text.

Implementations should regard any unrecognized formatting token as
equivalent to "No-op", thus facilitating future extensions to "richmail".
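
To suggest how little is needed to strip richmail down to readable
text, here is a Python sketch of a reader that honors only the
"Invisible" token and treats every other token as "No-op".  It is not
a reference implementation: the name richmail_to_text is hypothetical,
and the handling of an unquoted "(" outside a token, or of a ")" with
no open token, as ordinary text is an assumption of this sketch.

```python
import re

# "%" + up to 28 alphanumerics or hyphens + "(" begins a formatting token.
_TOKEN = re.compile(r"%([A-Za-z0-9-]{1,28})\(")


def richmail_to_text(src):
    """Strip richmail formatting, keeping only the visible text."""
    out = []
    stack = []  # names of the currently open tokens, lowercased
    i = 0
    while i < len(src):
        ch = src[i]
        if ch == "%" and i + 1 < len(src) and src[i + 1] in "%()":
            # Quoted character: emit it literally unless hidden.
            if "invisible" not in stack:
                out.append(src[i + 1])
            i += 2
            continue
        m = _TOKEN.match(src, i) if ch == "%" else None
        if m:
            stack.append(m.group(1).lower())
            i = m.end()
            continue
        if ch == ")" and stack:
            # An unquoted ")" closes the most recently opened token.
            stack.pop()
            i += 1
            continue
        if "invisible" not in stack:
            out.append(ch)
        i += 1
    return "".join(out)
```

For instance, richmail_to_text("100%% %invisible(hidden)sure") gives
"100% sure".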

Richmail also differentiates between "hard" and "soft" line breaks.  A
"soft" line break is represented by a single CRLF (end-of-line marker),
and may be ignored for purposes of presentation.   A sequence of one or
more hard line breaks may be represented by one plus that number of CRLF
markers.  Thus, a sequence of three consecutive CRLFs represents two
hard line breaks.  This allows portable wrapped and justified text,
independent of window-size or line-length restrictions.
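
The line break rule can be sketched in Python as follows.  Rendering a
soft line break as a single space, and a hard line break as a newline
character, are assumptions of this sketch, as is the name
render_line_breaks.

```python
import re


def render_line_breaks(text):
    """Map each run of n CRLFs to a space (n = 1) or n - 1 hard breaks."""
    def replace(match):
        n = len(match.group(0)) // 2  # number of CRLF markers in the run
        return " " if n == 1 else "\n" * (n - 1)
    return re.sub(r"(?:\r\n)+", replace, text)
```

A displaying agent would then wrap the resulting text to whatever
width suits its window.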

A minimal richmail implementation is one that implements the "Invisible"
and "No-op" formatting tokens, regards all others as synonyms for
"no-op", and understands the richmail newline conventions.
5.4	The Message Content-Type

It is frequently desirable, in sending mail, to encapsulate another mail
message.  For this common operation, a special content-type, "message",
is hereby defined.

A content-type of "message" with the default subtype of "822" indicates
that the body or body part is an encapsulated message, with the syntax
of an RFC 822 message.   This default subtype may be explicitly
specified as "Content-type: message/822"

The special subtype "pem" may be used to indicate that the body or body
part is a message conforming to the Privacy Enhanced Mail protocol 
[RFC-1113].   

The special subtype "partial" may be used to indicate that the body or
body part is a fragment of a larger message.  Three subfields must be
specified in the content-type field:  The first is a unique identifier,
to be used to match the parts together.  The second, an integer, is the
part number.  The third, another integer, is the total number of parts. 
Thus, part 2 of a 3-part message might have the following header field:

    Content-type: Message/Partial; oc=jpbe0M2Yt4s; 2; 3

When the parts of a message broken up in this manner are put together,
the result is a complete RFC-822 format message, which may have its own
Content-type header field, and thus may contain any other data type. 
(EXPLANATION:  The purpose of the MESSAGE/PARTIAL type is to allow large
objects to be delivered as several separate pieces of mail and
automatically reassembled by the receiving user agent.  This may be
desirable when intermediate transport agents limit the size of messages
that can be sent.)
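
Reassembly by a receiving user agent might look roughly like the
following Python sketch.  It assumes that parts are numbered starting
from 1 and that the fragment bodies are simply concatenated in order,
neither of which is spelled out above; the names parse_partial and
reassemble are hypothetical.

```python
def parse_partial(value):
    """Split 'Message/Partial; id; number; total' into (id, number, total)."""
    fields = [f.strip() for f in value.split(";")]
    if len(fields) != 4 or fields[0].lower() != "message/partial":
        raise ValueError("not a message/partial content-type")
    return fields[1], int(fields[2]), int(fields[3])


def reassemble(fragments):
    """Join (content_type_value, body) pairs belonging to one message."""
    pieces, ident, total = {}, None, None
    for value, body in fragments:
        frag_id, number, frag_total = parse_partial(value)
        if ident is None:
            ident, total = frag_id, frag_total
        elif (frag_id, frag_total) != (ident, total):
            raise ValueError("fragment belongs to a different message")
        pieces[number] = body
    # Every part from 1 to total must be present exactly once.
    if total is None or sorted(pieces) != list(range(1, total + 1)):
        raise ValueError("missing or unexpected parts")
    return "".join(pieces[n] for n in range(1, total + 1))
```

The fragments may arrive in any order; the sketch refuses to produce a
result until all of them are present.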

Additionally, all the character set subtypes of text are defined as
subtypes of "message."  If a character set subtype is given, it applies
to the uninterpreted textual fields in the RFC 822 message header area. 
Thus it can be used to represent address and subject  information in
non-ASCII character sets.  The character set subtype does NOT apply to
the body of the encapsulated message.  Thus, to encapsulate a message
with non-ASCII characters in both the header fields and in the body, you
would need something like the following:

    From: <ASCII form>
    Subject:  <ASCII form>
    Content-type:  message/iso-10646

    From: <iso-10646-form>
    Subject: <iso-10646-form>
    Content-type: text/iso-10646

    Message body in iso-10646 character set.
5.5	The Binary Content-Type

A content-type of "binary" may be used to indicate that the body or body
part is binary data.  A subtype may be specified, but none are defined
here.  The parameters for type binary are a set of attribute/value
pairs, of the form "NAME=VALUE", separated by the usual semicolons.  The
set of possible attributes to be defined includes, but is not limited to:

    NAME -- a suggested name for the binary data as a file.

    TYPE -- the type of binary data

    CONVERSIONS -- the set of operations that have been performed on
    the data before putting it in the mail (and before any
    Content-TransferEncoding that might have been applied).  If
    multiple conversions have occurred, they should be specified in
    the order they were applied, and separated by commas.  

The values for these attributes are left undefined at present, but may
require specification in the future.  An example of a common (though
discouraged) usage might be:

    Content-type:  binary; name=foo.tar; type=tar; 
            conversions=compress,uuencode

However, the use of such mechanisms as uuencode and compress is
explicitly discouraged, in favor of the more standardized
Content-TransferEncoding mechanism.  In particular, uuencode is not
well-suited for mail transport because it is ill-defined and comes in
several incompatible versions, many of which do not work in a pipe.  It
also uses characters that do not translate well into certain
representations (e.g. EBCDIC) and are not transmitted reliably over
certain connections (e.g. those that remove trailing white space from a
line).
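
The attribute/value parameters of the binary type might be read as in
the following Python sketch.  The name parse_binary_type is
hypothetical, and treating attribute names as case-insensitive is an
assumption of this sketch.

```python
def parse_binary_type(value):
    """Return (content_type, params) from a binary content-type value."""
    fields = [f.strip() for f in value.split(";") if f.strip()]
    ctype = fields[0].lower()
    params = {}
    for field in fields[1:]:
        name, _, val = field.partition("=")
        params[name.strip().upper()] = val.strip()
    if "CONVERSIONS" in params:
        # Conversions are listed in the order they were applied.
        params["CONVERSIONS"] = [
            c.strip() for c in params["CONVERSIONS"].split(",")
        ]
    return ctype, params
```

Because conversions are listed in the order they were applied, a
receiver that chooses to undo them would walk the list in reverse.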

The recommended action for an implementation that receives binary mail
of an unrecognized type is to simply offer to put the data in a file,
with any Content-TransferEncoding undone, or perhaps to use it as input
to a user-specified process.  Implementations are warned NOT to
implement a path-search mechanism whereby an arbitrary program named in
the Content-type header (e.g. the "type=" subfield) is found and
executed using the binary data as input.  Such an implementation could
open up a significant security problem, the elucidation of which is left
as an exercise for the reader.
5.6	The Application Content-Type Value

The "application" content-type is to be used for mail-based
applications.  A mail-based application is one that defines a
standard format for representing intermediate data that is to be
manipulated by cooperating user agents.  For example, a meeting
scheduler might define a standard representation for information about
proposed meeting dates.  An intelligent user agent would use this
information to conduct a dialog with the user, and might then send
further mail based on that dialog.

Such applications may be defined as subtypes of the "application"
content-type.  There is no default subtype for application, and this
memo defines only one subtype, the "external-reference" subtype.

The External-Reference subtype indicates that the body or body part is
primarily a placeholder for the data that are intended to be conveyed,
presumably because too much data is involved for the underlying mail
transport mechanism to handle.  The subfields are, as in the case of the
"binary" content-type, attribute-value pairs.  In this case, the
subfields describe a mechanism for accessing the binary data.   The set
of possible attributes includes, but is not limited to:

    FILENAME -- The name of a file that contains the external data.

    SITE -- one or more domain names, comma separated, of machines
    that are known to have access to the data file.

    REAL-TYPE -- The real content-type of the data, once retrieved.

    EXPIRATION -- The date (in the format "month day, year") after
    which the existence of the external data is not guaranteed.

With the emerging possibility of very wide-area file systems, it becomes
very hard to know in advance the set of machines where a file will and
will not be accessible directly from the file system.  Therefore it
makes sense to provide both a file name, to be tried directly, and the
name of one or more sites from which the file is known to be accessible.
 An implementation can try to retrieve remote files using FTP or any
other protocol, using anonymous file retrieval or prompting the user for
the necessary name and password.  However, the external-reference
mechanism is not intended to be limited to file retrieval.  One can
imagine, for example, using a LISTSERV mechanism, or using unique
identifiers and a video server for external references to video clips. 
However, this memo explicitly defines only the FILENAME and SITE
attributes for retrieval purposes, as this is the only retrieval method
that is currently widely applicable.  Other attributes may be defined as
needed.

The "REAL-TYPE" attribute may be used to specify a new content-type
header field to be applied to the data once retrieved, as the data are
assumed to be only the body of a message, not including any header
information.  Note that semicolons may be quoted within subfields.  Thus
an external reference to an image in G3FAX format might have the
following content-type header field:

    Content-Type: application/external-reference;
        filename=/usr/local/images/contact.g3;
        site=thumper.bellcore.com;
        real-type="image/g3fax";
        expiration="September 23, 1997"

If a message is of content-type "application/external-reference", then
the actual body of the message is ignored.
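
As an illustration only, a receiving agent might read such a header
field along the following lines.  The function names are hypothetical,
the sketch assumes that every subfield is separated by a semicolon and
that a semicolon inside double quotes belongs to the value, and it
assumes English month names in the EXPIRATION date.

```python
from datetime import date, datetime


def split_params(value):
    """Split on semicolons that are not inside double quotes."""
    pieces, current, in_quotes = [], [], False
    for ch in value:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ";" and not in_quotes:
            pieces.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current).strip())
    return [p for p in pieces if p]


def parse_external_reference(value):
    """Return (content-type, attributes); SITE becomes a list of names."""
    pieces = split_params(value)
    attrs = {}
    for piece in pieces[1:]:
        key, _, val = piece.partition("=")
        attrs[key.strip().lower()] = val.strip().strip('"')
    if "site" in attrs:
        attrs["site"] = [s.strip() for s in attrs["site"].split(",")]
    return pieces[0].lower(), attrs


def is_expired(attrs, today):
    """True if an EXPIRATION attribute is present and the date has passed."""
    if "expiration" not in attrs:
        return False
    expires = datetime.strptime(attrs["expiration"], "%B %d, %Y").date()
    return today > expires
```

An agent built this way would try the file name directly first, then
each listed site in turn, and would apply the REAL-TYPE value to
whatever data it retrieves.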

5.7	The Audio, Image, Video, and X- Content-Type Values

This memo defines several more content-type values that are defined only
incompletely here, and await further practical experience before their
values can be more completely specified.

AUDIO --   Indicates that the body or body part contains audio data. 
The subtype specifies the audio representation format; predefined
case-insensitive values are "U-law" and "A-law".  (U-law and A-law are
the American and European audio telephony standards.)  The parameters
are attribute/value pairs, as in the binary content-type, and may be
used to name a header format (e.g. "header=Sun"), to specify the size of
the header that precedes the actual audio data (e.g. "headersize=234568
bytes"), or for other purposes.

IMAGE -- Indicates that the body or body part contains an image.  The
subtype names the specific image format; predefined case-insensitive
values include "G3Fax" for Group Three Fax and "pbm", "pgm", and "ppm"
for the "portable bitmap" formats for black and white, grey scale, or
color images.

VIDEO -- Indicates that the body or body part contains a video sequence.
 The subtype and possible parameter values are left undefined by this
memo.

"X-" anything -- Any type value beginning with the characters "X-" and
not defined here or in another RFC is a private value, to be used by
consenting mail systems by mutual agreement.  Any format without a
rigorous and public definition should be named with an "X-" prefix.  The
widely-used Andrew system uses the "X-BE2" name, so new systems should
probably choose a different name.

6	RFC-XXXX Compliance

The mechanisms described in this memo are open-ended.  It is definitely
not expected that all implementations will implement all of the
content-types described, nor that they will all share the same
extensions.  In order to promote interoperability, however, it is useful
to define the concept of "RFC-XXXX-Compliance" to define a certain level
of implementation that allows the useful interworking of messages with
content that differs from US ASCII text.  In this section, we specify
the requirements for such compliance.

An RFC-XXXX-Compliant mail user agent must:

    1.  Recognize the Content-TransferEncoding header field, and
    decode data encoded with either the quoted-printable or base64
    encodings.  (If a compressed encoding is ever agreed to, it
    should also become part of all compliant user agents.)

    2.  Recognize and interpret the Content-type header field, and
    avoid showing an unsuspecting user raw data that has a
    content-type field other than text.

    3.  Explicitly handle the following content-type values, as
    defined in the appendices:

        -- text, with at least the US-ASCII character set.

        -- text-plus, with the default "richmail" subtype, at
        least the minimal implementation specified above.

        -- message, at least the default (simple) encapsulation.

        -- multipart, although parallel parts may be
        serialized, with all unrecognized subtypes treated as
        the default subtype, multipart/mixed.

        -- binary, although no particular subtype recognition is
        required.

    4.  Upon encountering an unrecognized content-type, an
    implementation should treat it as if it had a content-type of
    "binary" with no parameter sub-arguments.  How such data is
    handled is up to the implementation, but likely options include
    offering to write it into a file (decoded from its mail transport
    format) or offering to pass the decoded data as input to a
    program named by the user.  Unrecognized predefined types, which
    might include audio, image, video, or application, should also
    be treated in this way.

A user agent that meets the above conditions is said to be RFC-XXXX
compliant.  The meaning of this phrase is that it is assumed to be
"safe" to send virtually any kind of properly-marked data to users of
such mail systems, because they will at least be able to treat the data
as undifferentiated binary, and will not simply splash it onto the
screen of unsuspecting users.   Of course, there is another sense in
which it is always "safe" to send RFC-XXXX format data, which is that
such data will not break or be broken by any known systems that are
compliant with RFC 821 and RFC 822.  User agents that are RFC-XXXX
compliant have the additional guarantee that the user will not be shown
data that were never intended to be viewed as text.
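
The fallback rules above might be sketched in Python as follows.  This
is an illustrative outline only: the function names are hypothetical,
the "type/subtype" form of the header value is assumed, and a real user
agent would need far more than this to be compliant.

```python
import base64
import quopri

# The content-types a compliant agent must handle explicitly.
REQUIRED_TYPES = {"text", "text-plus", "message", "multipart", "binary"}


def decode_body(body, transfer_encoding):
    """Undo the transfer encoding; other bodies are returned unchanged."""
    encoding = (transfer_encoding or "").strip().lower()
    if encoding == "base64":
        return base64.b64decode(body)
    if encoding == "quoted-printable":
        return quopri.decodestring(body)
    return body


def effective_type(content_type, known_multipart_subtypes=("mixed",)):
    """Map a Content-type value to the (type, subtype) to act on."""
    main = content_type.split(";", 1)[0].strip().lower()
    ctype, _, subtype = main.partition("/")
    if ctype not in REQUIRED_TYPES:
        # Rule 4: anything unrecognized is treated as plain binary.
        return ("binary", "")
    if ctype == "multipart" and subtype not in known_multipart_subtypes:
        # Rule 3: unrecognized multipart subtypes fall back to mixed.
        return ("multipart", "mixed")
    return (ctype, subtype)


def may_show_raw(content_type):
    """Rule 2: only text may be shown to the user as raw data."""
    return effective_type(content_type)[0] == "text"
```

For example, a minimal agent following this sketch would treat
"audio/U-law" as binary and offer to save the decoded data, rather than
displaying it.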

Appendix I -- Guidelines For Sending Data Via Email

Because of the restrictions imposed on message bodies by RFC 822 and, in
practice, by Message Transport Agents that are more-or-less compliant
with RFC 821, implementors should take care in several ways when sending
any data through the mail.  The primary limitations
imposed by RFC 821 are that only seven-bit data may be transmitted, and
that the data must be broken up into lines of no more than 1000
characters.  However, in practice, widely-used MTA's are known to impose
some additional restrictions.  The following guidelines may be useful to
anyone devising a data format (content-type) that will survive such
MTA's unscathed.  (Note that anything encoded in the base64 or
quoted-printable encodings will satisfy these rules, but that some
well-known mechanisms, notably the UNIX uuencode facility, will not.)

    (1) Delimiters other than CR-LF pairs may be used in the local
    representation of a message on some systems.  The persistence of
    CR-LF pairs should not be relied on.

    (2) Isolated CR and LF characters are not well tolerated in
    general; they may be lost or converted to delimiters on some
    systems, and hence should not be relied on.

    (3) TAB characters may be misinterpreted or may be automatically
    converted to variable numbers of spaces.  This is unavoidable in
    some environments, notably those not based on the ASCII
    character set. Such conversion is STRONGLY DISCOURAGED, but it
    may occur, and users of US-ASCII format should not rely on the
    persistence of TAB characters.

    (4) Lines longer than 78 characters may be wrapped or truncated
    in some environments. Line wrapping and line truncation are
    STRONGLY DISCOURAGED, but unavoidable in some cases.
    Applications which depend on lines not being wrapped should use
    mechanisms other than unencoded US-ASCII bodyparts to transmit
    messages. 

    (5)  Trailing "white space" characters (SPACE, TAB, etc.) on a
    line may be discarded by some transport agents, and hence should
    not be relied on.

Please note that the above list is NOT a list of recommended practices
for MTA's -- we do not recommend that MTA's alter the character of white
space, or wrap long lines.  These are known BAD practices on established
networks, and implementors must guard against the bad effects they can
cause.  Thus the above might be seen as a list of recommended defensive
actions to be taken by User Agents to defend themselves against the
known ways in which MTA's sometimes modify messages.
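
As a rough illustration of these defensive checks, the following Python
sketch reports which hazards a given body is exposed to.  The function
name and the wording of the hazard labels are invented for this
example.  Guideline (1) cannot be tested by inspecting the data, so the
sketch covers only the seven-bit limitation and guidelines (2) through
(5).

```python
import base64


def transport_hazards(body, max_line=78):
    """List the transport hazards that a bytes body is exposed to."""
    hazards = []
    if any(b > 127 for b in body):
        hazards.append("eight-bit data")
    # Split on well-formed CR-LF pairs; any CR or LF left is isolated.
    lines = body.split(b"\r\n")
    if any(b"\r" in line or b"\n" in line for line in lines):
        hazards.append("isolated CR or LF")
    if b"\t" in body:
        hazards.append("TAB characters")
    if any(len(line) > max_line for line in lines):
        hazards.append("long lines")
    if any(line.endswith((b" ", b"\t")) for line in lines):
        hazards.append("trailing white space")
    return hazards


# Arbitrary binary data passes every check once it is base64-encoded
# and its line breaks are written as CR-LF pairs.
encoded = base64.encodebytes(bytes(range(256)) * 4).replace(b"\n", b"\r\n")
print(transport_hazards(encoded))
```

An empty list means that none of the checked hazards applies; it does
not prove that a message will survive every MTA.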

Appendix II -- Examples

Example 1	Simple Non-ASCII Text Example

***** FILL IN HERE WITH AN EXAMPLE OF NON-ASCII TEXT.  Can someone
provide me with a cute example from a non-ASCII character set?

Example 2	A Complex Multipart Example

What follows is the outline of a complex multipart message.  This
message has three parts to be displayed serially:  an introductory plain
text part, an embedded multipart message, and a closing encapsulated
text message in a non-ASCII character set.  The embedded multipart
message has two parts to be displayed in parallel, a picture and an
audio fragment.

    From: ...
    Subject: ...
    Content-type: multipart; 1-s; tweedledum

    This is a multipart message.  
    Since I've not specified another character set, 
    this "prefix" area is in US ASCII.
    --tweedledum

    ...Some more text appears here...
    [Note that the preceding blank line means 
    no header fields were given and this is text,
    with charset US ASCII.]
    --tweedledum
    Content-type: multipart; 1-p; tweedledee

    This is a multipart message.  
    If you are reading this text, you might want to 
    consider changing to a user agent that understands 
    how to properly display multipart messages.
    --tweedledee
    Content-type: u-law; 8000 HZ; X-NEXT
    Content-TransferEncoding: base64

    ... base64-encoded NeXT-format audio data goes here....
    --tweedledee
    Content-type: image; G3FAX
    Content-TransferEncoding: Base64

    ... base64-encoded FAX data goes here....
    --tweedledee--
    --tweedledum
    Content-type: message/ISO-8859-1

    From: Keld J|rn Simonsen (name can be non-ASCII)
    Subject: whatever
    Content-type: Text/ISO-8859-1
    Content-TransferEncoding: Quoted-printable

    ... Closing text goes here ...
    --tweedledum--

Summary

Using the Content-Type and Content-TransferEncoding header fields, it is
possible to include, in a standardized way, arbitrary types of data
objects with RFC 822 compliant mail messages. No restrictions imposed by
either RFC 821 or RFC 822 are broken, and care has been taken to avoid
problems caused by additional restrictions imposed by the
characteristics of some Internet mail transport mechanisms (see Appendix
I). The "multipart" and "message" content-types allow mixing and
hierarchical structuring of objects of different types in a single
message. Further content-types allow a standardized mechanism for tagging
messages or message parts as audio, image, or several other kinds of
data.  Additional optional header fields provide conventional mechanisms
for certain extensions deemed desirable by many implementors.  Finally,
a number of useful content-types are defined for general use by
consenting user agents.

Contacts

For more information, the authors of this document may be contacted via
Internet mail:

             Nathaniel Borenstein <nsb@thumper.bellcore.com>
                      Ned Freed <ned@innosoft.com>

Acknowledgements

This RFC is the result of the collective effort of a large number of
people, at several IETF meetings and on the IETF-SMTP and IETF-822
mailing lists.  Although any enumeration seems doomed to suffer from
egregious omissions, the following are among the many contributors to
this effort:  Harald Alvestrand, Randall Atkinson, Kevin Carosso, Mark
Crispin, Dave Crocker, Walt Daniels, Frank Dawson, Hitoshi Doi, Kevin
Donnelly, Johnny Eriksson, Craig Everhart, Roger Fajman, Alain Fontaine,
Phil Gross, David Herron, Bruce Howard, Bill Janssen, Risto Kankkunen,
Phil Karn, Tim Kehres, Neil Katin, Steve Kille, Anders Klemets, John
Klensin, Vincent Lau, Timo Lehtinen, John MacMillan, Rick McGowan, Leo
Mclaughlin, Goli Montaser-Kohsari, Keith Moore, Mark Needleman, John
Noerenberg, David J. Pepper, Jonathan Rosenberg, Jan Rynning, Mark
Sherman, Keld Simonsen, Bob Smart, Einar Stefferud, Michael Stein, Taro
Suzuki, Steve Uhler, Stuart Vance,  Erik van der Poel, Peter Vanderbilt,
Greg Vaudreuil, Brian Wideen, Glenn Wright, and David Zimmerman.  The
authors apologize for any omissions from this list, which were certainly
unintentional.

References

[REF-ISO646] International Standard--Information Processing--ISO 7-bit
coded  character set for information interchange, ISO 646:1983.

[REF-ISO-2022] International Standard--Information Processing--ISO 7-bit
and  8-bit coded character sets--Code extension techniques, ISO
2022:1986.

[REF-ANSI] Coded Character Set--7-Bit American Standard Code for 
Information Interchange, ANSI X3.4-1986.

[REF-X400]  Schicker, Pietro, "Message Handling Systems, X.400", Message
Handling Systems and Distributed Applications, E. Stefferud, O-j.
Jacobsen, and P. Schicker, eds., North-Holland, 1989, pp. 3-41.

[RFC-821] Postel, J.B.  Simple Mail Transfer Protocol.  August, 1982,
Network Information Center, RFC-821. 

[RFC-822]   Crocker, D.  Standard for the format of ARPA Internet text
messages.   August, 1982, Network Information Center, RFC-822.

[RFC-934]   Rose, M.T.; Stefferud, E.A.  Proposed standard for message 
encapsulation.  January, 1985, Network Information Center, RFC-934.

[RFC-1049]  Sirbu, M.A.  Content-type header field for Internet
messages.  March, 1988, Network Information Center, RFC-1049. 

[RFC-1113]  Linn, J.  Privacy enhancement for Internet electronic mail:
Part I -  message encipherment and authentication procedures [Draft]. 
August, 1989, Network Information Center, RFC-1113.

[RFC-1154]  Robinson, D.; Ullmann, R.  Encoding header field for
internet messages. April, 1990, Network Information Center, RFC-1154.

[REF-ISO-10646] ************

[REF-ISO-8859] **********