May Draft, #2 -- TEXT
Nathaniel Borenstein <nsb@thumper.bellcore.com> Wed, 22 May 1991 02:15 UTC
Received: from thumper.bellcore.com by NRI.NRI.Reston.VA.US id aa28929; 21 May 91 22:15 EDT
Received: by thumper.bellcore.com (4.1/4.7) id <AA08597> for gvaudre@NRI.Reston.VA.US; Tue, 21 May 91 22:17:21 EDT
Received: from Messages.7.14.N.CUILIB.3.45.SNAP.NOT.LINKED.thumper.mouseclub.sun4.40 via MS.5.6.thumper.mouseclub.sun4_40; Tue, 21 May 1991 22:17:19 -0400 (EDT)
Message-Id: <ocCRGju0M2Y1EVFV44@thumper.bellcore.com>
Date: Tue, 21 May 1991 22:17:19 -0400
From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
To: Greg Vaudreuil <gvaudre@NRI.Reston.VA.US>, John C Klensin <KLENSIN@infoods.mit.edu>, NED@hmcvax.claremont.edu, keld@dkuug.dk, erik@sra.co.jp, MRC@cac.washington.edu
Subject: May Draft, #2 -- TEXT
In-Reply-To: <9105211000.aa05707@NRI.NRI.Reston.VA.US>
References: <9105211000.aa05707@NRI.NRI.Reston.VA.US>
CHANGES From FIRST MAY DRAFT Lots of prose changes, mostly minor, a few verbatim from Ned & Greg. The MAILASCII character set is now called US-ASCII, because that's what it really is. However, it is no longer the default -- the default character set is now undefined, because that's the only thing that really corresponds to existing practice! The discussion about how to mail things safely was separated from the US-ASCII discussion and given its own appendix. Quoted-Printable has been tightened up still further. The Content-Size definition now differentiates the two possible uses via a "when" part. Changed "charset" syntax to be "subtype". Further reduced number of content-types to NINE by consolidating things into subtypes. Liberalized the rules for defining subtypes, even as the rules for types were tightened up. Multipart messages no longer have character sets specifications, and so the prefix & postfix have gone away again, sigh.. An interesting new paragraph on subtypes has been added right near the end of the multipart description. I think this may be the key to getting multipart structuring done right in the future. Along with the "text-plus" type, I've defined a new "richmail" default subtype that I think could be very important. How to Read the May Draft of RFC-XXXX This is the fifth major draft, at least, of RFC-XXXX. Those of you who have been following along are, no doubt, heartily sick of the process by now, as am I. I'm trying to make it easier for us all in the following ways: 1. I've compiled a list of major changes from the April draft. I'm not trying to pull any fast ones on anybody, but it is possible that the list is incomplete. It is, however, my best attempt to provide a simple list of what has changed. 2. With previous drafts, I think that comments mostly came in three flavors: NITS: Minor points of clarification, typographical or technical correction, etc. These were uncontroversial and I tried to adopt them all. 
SHOW-STOPPERS: These were major disagreements, where people indicated unhappiness so great that they might be unable to live with the draft as written. Obviously I've tried VERY hard to deal with these, but sometimes people have SHOW-STOPPER comments that are pretty nearly in direct conflict with each other.

ARGUMENTS: These are sincere disagreements where the person disagreeing could still live with the draft if he lost the argument.

I would like to STRONGLY URGE the readers of this draft to self-classify their comments into the above three categories, and to treat them in the following ways:

NITS: Send them directly to me; no need to bother the whole list.

SHOW-STOPPERS: Sigh... I'm hoping there aren't any left, but if you have them, please send them to the whole list.

ARGUMENTS: If you can live with losing the argument, and if the argument has already been well-argued in the past on the list, ask yourself: is it worth re-arguing? I'm not trying to prevent debate, merely encouraging you to reflect before reopening old arguments.

I still need help on a number of things, particularly fleshing out some of the references and sanity-checking some of the areas in which I'm not an expert, notably character sets, audio, and privacy-enhanced messages (PEM). If you know something about one of these, please read that part of the draft extra carefully.

That's all. Enjoy. I look forward to your comments. Well, sort of.... :-) -- Nathaniel

Major Changes From April Draft

There is a lot of new prose, and the document has been reorganized substantially, to clarify intent and to discuss rejected alternatives.

Content-type syntax: There is now a distinguished place for subtypes. Character set types have been replaced with a subtype syntax for the text and message types. The rest of the syntax has been generalized to a set of semicolon-separated parameters.

Content-types: Several content-types have been consolidated into nine high-level types such as "image" and "audio".
The Scribe and SGML content-types have been eliminated. DES-MESSAGE has been replaced by MESSAGE/PEM. Notable new content-types worth looking at include text-plus/richmail, binary, message, message/partial, application/external-reference. The scheme for officially defining new content-types has been changed to require an RFC for content-types, but to be more liberal for subtypes. The text type's default character set is now left undefined, to match prior reality(!) but an explicit specification (e.g. US-ASCII) is encouraged for future composers.

Multipart messages: The definition has changed so that body-parts are no longer messages, though the syntax is the same. A new distinguished closing delimiter is now required. A new "digest" subtype is also defined, as is a new concept of subtypes for multipart messages.

The "Encoded-Variable" stuff has been eliminated, in favor of Content-type: Message/charset.

Content-Encoding has been changed to Content-TransferEncoding. The hexadecimal encoding has been eliminated, and some prose about the need for a compressed encoding has been added. The base64 encoding has added "," as a way to specify portable end-of-lines. The quoted-printable encoding has changed "&" and "\" to "=" and ":" for portability, and has added some rules (and clarified others) regarding CRLF and white space, and is generally much tighter.

Two new optional header fields, Content-ID and Content-Description, have been defined. Content-Size has been extended & clarified.

Added a new notion of "RFC-XXXX-compliant" implementations, defining a minimal subset to be implemented to earn such a label.

The X.400-related types have been dropped, leaving these questions for the experts.
Network Working Group -- Request for Comments: XXXX Mechanisms for Specifying and Describing the Format of Internet Message Bodies Nathaniel Borenstein, Bellcore Ned Freed, Innosoft May 1991 Status of This Memo This draft document will be submitted to the RFC editor as a protocol specification. Distribution of this memo is unlimited. Please send comments to Nathaniel Borenstein <nsb@thumper.bellcore.com> Abstract This document suggests extensions to the RFC 822 message representation protocol to allow multi-part textual and non-textual messages to be represented and exchanged without loss of information. This is based on earlier work documented in RFC 934 and RFC 1049, but extends and revises that work. Table of Contents 1: Introduction 2: The Content-Type Header Field 3: The Content-TransferEncoding Header Field 3.1: Quoted-Printable Content-TransferEncoding 3.2: Base64 Content-TransferEncoding 4: Additional Optional Content- Header Fields 4.1: Optional Content-ID Header Field 4.2: Optional Content-Description Header Field 4.3: Optional Content-Size Header Field 5: The Nine Predefined Content-type Values 5.1: The TEXT Content-type and the US-ASCII Character Set 5.2: The "Multipart" Content-Type 5.3: The "Text-Plus" Content-Type and "RichMail" subtype 5.4: The Message Content-Type 5.5: The Binary Content-Type 5.6: The Application Content-Type Value 5.7: The Audio, Image, and Video, and X- Content-Types 6: RFC-XXXX Compliance Appendix I -- Guidelines For Sending Data Via Email Appendix II -- Examples Example 1: Simple Non-ASCII Text Example Example 2: A Complex Multipart Example Summary Contacts Acknowledgements References 1 Introduction Since its publication in 1982, RFC 822 [RFC-822] has defined the standard format of textual mail messages on the Internet. Its success has been such that the RFC 822 format has been adopted, wholly or partially, well beyond the confines of the Internet and of SMTP transport, as defined by RFC 821 [RFC-821]. 
As the format has seen wider use, a number of limitations have become increasingly problematic for the user community. RFC 822 was intended to specify a format for text messages. As such, non-text messages, such as multimedia messages that might include audio or images, are simply not mentioned. Even in the case of text, however, RFC 822 is inadequate for the needs of email users whose languages require the use of character sets richer than US ASCII [REF-ANSI]. For mail containing audio, video, Japanese text, or even text in most European languages, RFC 822 does not specify enough to permit interoperability. One of the notable limitations of RFC 821/822 based mail systems is the fact that they limit the contents of electronic mail messages to relatively short lines of seven-bit ASCII. This forces a user to convert any non-textual data that she may wish to send into seven-bit bytes representable as printable ASCII characters before invoking her local mail UA (User Agent program). Examples of such encodings currently used in the Internet include pure hexadecimal, uuencode, the 3-in-4 base 64 scheme specified in RFC 1113, the Andrew Toolkit Representation, and many others. These limitations become even more apparent as gateways are designed to allow for the exchange of mail messages between RFC 822 hosts and X.400 hosts. X.400 [REF-X400] specifies mechanisms for the inclusion of non-textual body parts within electronic mail messages. The current standards for the mapping of X.400 messages to RFC 822 messages specify that either X.400 non-textual body parts should be converted to (not encoded in) an ASCII format, or that they should be discarded, notifying the RFC 822 user that discarding has occurred. This is clearly undesirable, as information that a user may wish to receive is lost. 
Even though a user's UA may not have the capability of dealing with the non-textual body part, the user might have some mechanism external to the UA that can extract useful information from the body part. Moreover, it does not allow for the fact that the message may eventually be gatewayed back into an X.400 MHS, where the non-textual information would definitely become useful again.

This memo describes several mechanisms that combine to solve these problems. In particular, it describes:

1. A Content-type header field, generalized from RFC 1049 [RFC-1049], which can be used to describe the type and subtype of data in the body of a message and to fully specify the representation (encoding) of such data.
2. A Content-TransferEncoding header field, which can be used to describe an auxiliary encoding that was applied to the data in order to allow it to pass through the mail transport layer.
3. A "text" content-type value, which can be used to represent text information in a number of character sets in a standardized manner.
4. A "multipart" content-type value, which can be used to combine several separate body-parts, which may be made of different types of data, into a single message.
5. A "binary" content-type value, which can be used to transmit uninterpreted or partially-interpreted binary data, and hence to implement an email file transfer service.
6. A "message" content-type value, for encapsulating a mail message.
7. Several additional content-type values and subtypes, which can be used by consenting User Agents to interoperate with additional message types such as audio, images, and more.
8. Several optional header fields that can be used to further describe the data in a message body or body-part, in particular the Content-Size, Content-ID, and Content-Description header fields.

Finally, to specify and promote a minimal level of interoperability, this memo describes a subset of the above mechanisms that defines "compliance" with this memo.
That is, it specifies the minimal subset required for an implementation to be called "RFC-XXXX-compliant."

2 The Content-Type Header Field

The Content-Type header field was first defined in RFC 1049. This section extends and supersedes that definition. RFC 1049 content-types are all compliant with the new, more general syntax. (In particular, RFC 1049 content-types omitted the subtype/character-set specification, and always had at most two of the parts now called "parameters", which were distinguished by their position as indicating a version number and a resource reference.)

The Content-Type header field is used to specify the type of data in a message, by giving a type name, and to provide auxiliary information that may be required for certain types. In addition, a distinguished syntax is defined for specifying subtype information, including character set information in the case of text. After the type name and the optional subtype, the remainder of the header field is simply a set of parameter specifications, as defined for each named type, and an optional comment.

(COMPATIBILITY NOTE: Readers familiar with RFC 1049 Content-types will notice that the syntax has been generalized substantially. However, RFC 1049 content-types are all compliant with the new syntax. In particular, RFC 1049 content-types omitted the subtype specification, and always had at most two of the parts now called "parameters", which were distinguished by their position as indicating a version number and a resource reference.)

In the Extended BNF notation of RFC-822, we define a Content-type header field value as follows:

   Content-Type := type ["/" subtype] *[";" parameter] [comment]
   parameter := local-part
   subtype := local-part
   type := local-part

The type and subtype values are not case sensitive. TEXT, Text, and TeXt are all equivalent.

An initial set of nine content-types is defined by this memo. This set of type names is not intended to be completely exhaustive.
More may be defined later, by a future RFC. However, it is expected that most extensions to the set of objects that are sent through the mail can be accomplished by the creation of new subtypes of these initial types. The only constraint on the definition of subtype names is the desire that their uses not conflict. That is, it would be undesirable to have two different communities using "Content-type: binary/foobar" to mean two different things. The process of defining new content-subtypes, then, is not intended to be a mechanism for imposing restrictions, but simply a mechanism for publicizing the usages. There are, therefore, three acceptable mechanisms for defining new content-type subtypes:

1. Private values (starting with "X-") may be defined bilaterally between two cooperating agents without outside approval or standardization.
2. "Standard" values may be defined by the publication of an Internet RFC. The RFC need not be very long, but must define the content-type and subtype, its associated parameter syntax, and the format of the body of a message so marked.
3. The value may be inferred as an obvious and unambiguous extension of the subtypes defined in a previous RFC. For example, this memo defines an "image" type with subtypes that denote image formats such as G3Fax. An additional image type for which there is one clear and obvious name is an obvious extension of the subtypes of "image."

The nine initial predefined content-types are detailed in the appendices of this memo. They are:

text -- textual information, with character set given by the subtype
text-plus -- mostly textual information, with embedded formatting commands. A simple default type is defined, with possible subtypes including troff, TeX, and so on.
message -- an encapsulated message, with initial subtypes for partial messages and privacy-enhanced messages
multipart -- a message consisting of multiple parts of independent type values, with initial subtype digest.
audio -- a message containing audio data, with initial subtypes a-law and u-law. image -- a message containing image data, with initial subtypes G3fax, gif, pbm, ppm, and pgm. video -- a message containing video data. binary -- a message containing some other form of binary data. application -- a message containing data to be processed by a mail-based application. If no Content-type header field is present, "text" is assumed, with the default (undefined) character set as specified later in this memo. This is consistent with the default message body type as defined by RFC 822. It should be noted that the list of Content-type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially. We have simply attempted, in this RFC, to give as many standard Content-type definitions as was possible given the current state of our knowledge. 3 The Content-TransferEncoding Header Field Many content-types are represented, in their "natural" format, as 8-bit or binary data. Such data can not be transmitted over existing Internet mail mechanisms because both RFC 821 and RFC 822 restrict mail messages to 7 bit data with reasonably short lines. It is necessary, therefore, to define a standard mechanism for encoding such data in an acceptable manner. This RFC specifies that such encodings will be indicated by a new "Content-TransferEncoding" header field. The Content-TransferEncoding field is used to indicate the type of transformation that has been used to represent the message body in an acceptable manner. It may seem that the Content-TransferEncoding could be inferred from the characteristics of the Content-Type that is to be encoded, or, at the very least, certain Content-TransferEncodings could be mandated for use with specific Content-Types. There are several reasons why this is not the case. 
First, given the varying types of transports used for mail, some encodings may be appropriate for some Content-Type/transport combinations and not for others. Second, certain Content-Types may require different types of transfer encoding under different circumstances. For example, many PostScript messages may consist entirely of short lines of 7-bit data and hence require little or no encoding. Other PostScript messages (especially those using Level 2 PostScript's binary encoding mechanism) may only be reasonably represented using a binary transport encoding. Finally, since Content-Type is intended to be an open-ended specification mechanism, strict specification of an association between Content-Types and encodings effectively couples the specification of an application protocol with a specific lower-level transport. This is not desirable, since the developers of a Content-Type may not be, and should not have to be, aware of all the transports in use and what their limitations are.

It should be noted, also, that there is considerable interest and effort being expended on extending mail transport to permit 8-bit or binary data. If such extensions ever become commonplace, the Content-TransferEncoding mechanism will quickly become irrelevant, and it is therefore desirable not to "overload" Content-TransferEncoding with additional mechanisms that might still be useful in such a future. For this reason, Content-TransferEncoding is restricted in its scope to refer to nothing but the 7-bit encoding question. Matters such as the basic format in which information is "encoded" are to be handled by other mechanisms.

Unlike Content-types, which are expected to proliferate, it is expected that there will never be more than a few different Content-TransferEncoding values, both because there is less need for variation and because the effect of variation in Content-TransferEncoding would be more problematic.
However, establishing only a single Content-TransferEncoding mechanism does not seem possible. In particular, there is a tradeoff between the desire for a compact and efficient encoding of binary data and the desire for a readable encoding of data that is mostly, but not entirely, 7-bit data. For this reason, at least two encoding mechanisms are necessary, a "readable" encoding and a "dense" encoding. A third encoding, for compressed ("super-dense") data, is also strongly desirable. This RFC does not specify a "compressed" encoding, due to the uncertain legal state of the UNIX "compress" command and a lack of certainty, during the drafting of this RFC, regarding the right way to define a standard compression algorithm. It is hoped that a compressed Content-TransferEncoding will be defined in a future RFC. Any compression algorithm for such a use should be unambiguously defined and without legal encumbrances.

The Content-TransferEncoding field is designed to specify a two-way mapping between the "native" representation of a type of data and a representation that can be readily exchanged using 7 bit mail transport protocols as defined by RFC 821 (SMTP). This field has not been defined by any previous RFC. The field's value is a single atom specifying the type of encoding, as enumerated below. Formally:

   Content-TransferEncoding := "BASE64" / "QUOTED-PRINTABLE" / "8BIT" / "BINARY" / "7BIT" / "X-" atom

These values are not case sensitive. That is, Base64 and BASE64 and bAsE64 are all equivalent. An encoding type of 7BIT implies that the message is already in a seven-bit mail-ready representation. This value is assumed if the Content-TransferEncoding header field is not present. If the message is stored or transported via a mechanism that permits 8-bit data, a Content-TransferEncoding of "8bit" may be used. If the message is stored or transported via a mechanism that permits arbitrary binary data, a Content-TransferEncoding of "binary" may nonetheless be used.
In particular, "8bit" or "binary" should be used in the case where there is fear that the message may "leak" into a more restricted (7-bit) transport environment.

(DISCUSSION: The distinction between the Content-TransferEncoding values of "binary," "8bit," and "7bit" may seem unimportant in an 8-bit binary environment, but clear labeling will be of enormous value to gateways between 8-bit and 7-bit systems. The difference between "8bit" and "binary" is that "8bit" implies adherence to SMTP limits on line length and CR/LF semantics, whereas "binary" does not.)

Implementors may, if necessary, define new Content-TransferEncoding values, but should prefix them with "x-" to indicate their non-standard status, e.g. "Content-TransferEncoding: x-my-new-encoding". However, unlike Content-types and subtypes, the creation of new Content-TransferEncoding values is explicitly discouraged, as it seems likely to hinder interoperability with little potential benefit.

If a Content-TransferEncoding header field appears as part of a message header, it applies to the entire message body, whether or not that body is of type "multipart." If it is of type multipart, the encoding applies recursively to all of the encapsulated parts, including their encapsulated headers. If a Content-TransferEncoding header field appears as part of an encapsulation's headers, it applies only to the body of the encapsulated part. If the encapsulated part is itself of type "multipart", the encoding applies recursively to all of the encapsulated parts within that encapsulated part.

It should be noted that, because email is character-oriented, the mechanisms described here are mechanisms for encoding arbitrary byte streams, not bit streams. If a bit stream is to be encoded via one of these mechanisms, it should first be converted to a byte stream using the network standard bit order ("big-endian"), in which the earlier bits in a stream become the higher-order bits in a byte.
A bit stream not ending at an 8-bit boundary should be padded with zeroes. If the precise bit count is needed, it can be given in the Content-Size header field, described later in this document.

The following sections will define the two standard encoding mechanisms.

3.1 Quoted-Printable Content-TransferEncoding

The Quoted-Printable encoding is intended to represent data that largely contains octets less than 127. It encodes the data in such a way that the resulting octets are both unlikely to be modified by mail transport, and, when read as ASCII text, are largely recognisable by humans. A message that is entirely ASCII may also be encoded in Quoted-Printable to ensure its survival when it is expected to traverse a character-translating gateway, such as those onto Bitnet.

In this encoding, ASCII characters 33 through 57, inclusive, 59, 60, and 62 through 126, inclusive, are unchanged. All other characters, including characters 32 (SPACE), 58 (:), 61 (=), 127 (DEL), and all control characters, are to be represented as determined by the following rules:

Rule #1: Any 8 bit value may be represented by a ":" followed by a two digit hexadecimal representation of the character's 8-bit value. Thus, for example, character 12 (control-L, or formfeed) can be represented by ":0C", the equal-sign character (61) can be represented by ":3D", and the colon character (58) itself can be represented by ":3A". Rule #1 is the REQUIRED representation for characters 127 through 160 and for character 255.

Rule #2: An 8 bit value from 161 through 254 may, alternately, be represented by an equal-sign character followed by the single character obtained by the removal of the high order bit, i.e. by subtracting 128 from the value. Thus the 8 bit value 193 may be represented as "=A".
Rule #2 is completely optional, given rule #1, but is provided for improved readability of some 8-bit character sets in which turning on the 8th bit produces a character similar to the corresponding 7 bit character, e.g. the 8th bit simply adds an umlaut.

Rule #3: The literal equal-sign and colon characters must themselves be quoted by colons. Thus, the colon may be represented as "::" and the equal-sign as ":=". Note that this is not ambiguous with regard to the first clause, because neither ":" nor "=" are part of the hexadecimal alphabet.

Rule #4: A colon at the end of a line may be used to indicate a non-significant line break. That is, if one needs to include a long line without line breaks, a message encoded with the quoted-printable encoding should include "soft" line breaks in which the line break is preceded by a colon. Thus if the "raw" form of the line is a single line that says:

   Now's the time for all men to come to the aid of their country. Now's the time for all men to come to the aid of their country. Now's the time for all men to come to the aid of their country.

This could be represented, in the quoted-printable encoding, as

   Now's the time for all men to come to the aid of their country. :
   Now's the time for all men to come to the aid of their country. :
   Now's the time for all men to come to the aid of their country.

This provides a mechanism with which long lines are encoded in such a way as to be restored by the user agent. The quoted-printable encoding REQUIRES that lines be broken so that they are no more than 78 characters long, using soft line breaks when necessary.

Rule #5: Although the SPACE (32) and TAB (9) characters may generally be represented as themselves, they should NOT be so represented at the end of a line, because some MTA's are known to remove "white space" from the end of a line.
In such cases, the characters MUST be represented as in rule #1 (as ":20" and ":09" respectively) or as themselves, followed by a soft line break followed by a real line break. Of course, these characters can be so represented within a line as well, if this is desired; in the case of the TAB character, representing it as ":09" may be somewhat more robust even in the middle of a line. Note that in decoding a quoted-printable message, any trailing white space on a line should be deleted, as it will necessarily have been added by intermediate transport agents. Rule #6: A CR LF pair normally constitutes a line break and should be represented by a line break in the quoted-printable encoding if that is its meaning. Isolated CRs, LFs, and LF CR sequences must be represented using the :0D, :0A, and :0D:0A notations respectively. CR LF sequences that are not intended to represent a line break should be encoded as :0D:0A to reflect this usage. In other words, the concept "end of line" is represented, in the quoted-printable encoding, by CR LF, although this may be modified in local storage formats. Literal occurrences of CR or LF that do not occur as CRLF or are not intended to represent end-of-line markers must be represented in hexadecimal. Since the hyphen character ("-") is represented as itself in the Quoted-Printable encoding, the usual care must be taken, when encapsulating a quoted-printable encoded message or body part in a multipart message, to ensure that the encapsulation boundary does not appear anywhere in the message. See the definition of multipart messages, later in this memo. 3.2 Base64 Content-TransferEncoding The Base64 Content-TransferEncoding is designed to represent arbitrary 8 bit data in a form that is not humanly readable. The encoding and decoding algorithms are simple, but the encoded data is only about 33 percent larger than the unencoded data. This encoding is based on the one used in Privacy Enhanced Mail applications, as defined in RFC 1113. 
The base64 encoding is adapted from RFC 1113, with two changes: base64 eliminates the "*" mechanism for embedded clear text and defines a new syntax for portable end-of-line markers, using the comma character.

A 66-character subset of International Alphabet IA5 is used, enabling 6 bits to be represented per printable character. (The extra 65th and 66th characters "=" and "," are used to signify special processing functions.) This subset has the important property that it is represented identically in IA5 and ASCII, and all characters in the subset are part of the so-called invariant subset of EBCDIC. Other popular encodings such as the encoding used by the UUENCODE utility and the base85 encoding specified as part of Level 2 PostScript do not share these properties, and thus do not fulfill the portability requirements placed on a binary transport encoding for mail.

The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating three 8-bit input groups; it is then treated as four concatenated 6-bit groups. When encoding a bit stream via the base64 encoding, the bit stream should be presumed to be ordered with the most-significant-bit first. That is, the first bit in the stream will be the high-order bit in the first byte, and the eighth bit will be the low-order bit in the first byte, and so on. Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string. These characters, identified in Table 1 below, are selected so as to be universally representable, and the set excludes characters with particular significance to SMTP (e.g., ".", "CR", "LF").
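As an illustration of the grouping just described, here is a minimal Python sketch. It is not part of the draft: the function name `encode_quanta` is hypothetical, the alphabet string is transcribed from the value-to-character mapping of Table 1, and the sketch deliberately handles only input whose length is a whole number of 24-bit groups, leaving padding and the end-of-line marker aside.

```python
import string

# The 64 encoding characters in index order (values 0 through 63).
ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def encode_quanta(data: bytes) -> str:
    """Encode complete 24-bit groups only (hypothetical helper)."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch only handles whole 24-bit groups")
    out = []
    for i in range(0, len(data), 3):
        # Concatenate three 8-bit groups, most significant bit first.
        group = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        # Read the 24 bits as four 6-bit indices, left to right.
        for shift in (18, 12, 6, 0):
            out.append(ALPHABET[(group >> shift) & 0x3F])
    return "".join(out)


print(encode_quanta(b"abc"))  # YWJj
```

Every three input octets become four output characters, which is where the roughly 33 percent size increase comes from.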
Table 1

   Value Encoding  Value Encoding  Value Encoding  Value Encoding
       0 A            17 R            34 i            51 z
       1 B            18 S            35 j            52 0
       2 C            19 T            36 k            53 1
       3 D            20 U            37 l            54 2
       4 E            21 V            38 m            55 3
       5 F            22 W            39 n            56 4
       6 G            23 X            40 o            57 5
       7 H            24 Y            41 p            58 6
       8 I            25 Z            42 q            59 7
       9 J            26 a            43 r            60 8
      10 K            27 b            44 s            61 9
      11 L            28 c            45 t            62 +
      12 M            29 d            46 u            63 /
      13 N            30 e            47 v
      14 O            31 f            48 w         (pad) =
      15 P            32 g            49 x         (eol) ,
      16 Q            33 h            50 y

Special processing is performed if fewer than 24 bits are available at the end of a message or encapsulated part of a message. A full encoding quantum is always completed at the end of a message. When fewer than 24 input bits are available in an input group, zero bits are added (on the right) to form an integral number of 6-bit groups. Output character positions which are not required to represent actual input data are set to the character "=". Since all canonically encoded output is an integral number of octets, only the following cases can arise: (1) the final quantum of encoding input is an integral multiple of 24 bits; here, the final unit of encoded output will be an integral multiple of 4 characters with no "=" padding, (2) the final quantum of encoding input is exactly 8 bits; here, the final unit of encoded output will be two characters followed by two "=" padding characters, or (3) the final quantum of encoding input is exactly 16 bits; here, the final unit of encoded output will be three characters followed by one "=" padding character.

One addition is made to the RFC 1113 specification of this encoding: The comma character (",", ASCII 44) may be used to represent an "end-of-line" or "end-of-record" marker. If line-oriented data are encoded using base64, it is desirable to restore end-of-line markers according to the local convention. The RFC 1113 specification, as given above, offers no way to differentiate between a binary file including a CRLF sequence and a portable end-of-line marker. This memo augments that mechanism to permit such differentiation, as follows.
To represent an end-of-line marker:

    1. Treat the byte stream preceding the end-of-line as terminating at the end of the line -- that is, pad with "=" characters as appropriate to complete the representation of the line.

    2. Insert a comma character.

    3. Resume the encoding, starting a new 24-bit input group with the first character on the next line.

Thus, while encoding the binary sequence "a-b-c-CR-LF-a-b-c" yields the octets which are represented in ASCII as "YWJjDQphYmM=", encoding "a-b-c" followed by an end-of-line followed by "a-b-c" yields "YWJj,YWJj". The two encodings will be translated back into the same thing if the local end-of-line convention is CRLF, but they will be translated back differently if the end-of-line convention is anything other than CRLF.

Note: There is no need to worry about quoting apparent encapsulation boundaries within base64-encoded parts of multipart messages, because no hyphen characters are used in the base64 encoding.

4 Additional Optional Content- Header Fields

4.1 Optional Content-ID Header Field

In constructing a high-level user agent, it may be desirable to allow one message body-part to make reference to another. This may be done using the "Content-ID" header field, which is syntactically identical to the "Message-ID" header field:

    Content-ID := "<" msg-id ">"

4.2 Optional Content-Description Header Field

It may be desirable to associate some descriptive information with a given body-part. For example, it may be useful to mark an "image" body-part as "a picture of the Space Shuttle Endeavor." Such text may be placed in the Content-Description header field.

    Content-Description := *text

4.3 Optional Content-Size Header Field

In the discussions of earlier drafts of this memo, some people indicated a strong preference for using a size-counting scheme to delimit the boundaries between encapsulated parts of multipart messages. This was rejected because such schemes are not, in general, sufficiently robust across the SMTP transport layer.
For example, line counts can be altered by line-wrapping MTA's, and byte counts can be altered in any number of ways, and may be confused by crossing boundaries in which the size of an end-of-line marker changes.

However, there are restricted environments in which either or both of these counts can be relied upon, and in such environments it may be desirable to implement a count-based approach to delimiters. Therefore this memo specifies a conventional way to do this, in order to promote interoperability among systems that are able to take this approach.

In such cases, boundary delimiters, as defined above, are still required. However, the header area of an encapsulated part may include an optional Content-Size header which indicates where the encapsulated part ends, if its size has not been altered. The size may be measured in either bytes or lines. Those who use the Content-Size header field should still preserve the encapsulation boundaries, and should recognize that other agents are free to ignore it in favor of complete reliance on encapsulation boundaries.

It should also be noted that those who wish to use the Content-Size mechanism have two rather different possible motivations. One is to find the end of the data as represented for mail transport, an enterprise which, as noted above, can be counted on to provide no better than an estimate. The other is to declare the initial size of the object before mail transport, to be used as a check on the integrity of the data. Accordingly, the Content-Size header field allows the sender to distinguish whether he is measuring the size of the original object or its encoded form.

The Content-Size header field is defined as follows:

    Content-Size := 1*DIGIT unit when
    unit := "lines" / "bytes" / "bits"
    when := "original" / "encoded"

Note that each encapsulated part should still end with an end-of-line followed by an encapsulation boundary.
However, a message store that wishes, for example, to use a storage format that is largely RFC 822-compliant, but includes binary storage of binary objects, can use the Content-Size header field to indicate whether or not the final end-of-line is to be interpreted as part of the binary object. If the end-of-line follows the number of bytes specified for the encapsulation, then it is not part of the encapsulation.

The size given by the Content-Size header field is the size of the encapsulation's body only, not counting the blank line that separates the header from the body. In other words, the four bytes CRLF CRLF, which separate header from body, are NOT counted as part of the content-size.

5 The Nine Predefined Content-type Values

This memo defines nine initial content-type values and an extension mechanism for private or experimental types. Further types must be defined and published by a new RFC. It is expected that most innovation in new types of mail will take place as subtypes of the nine types defined here.

5.1 The TEXT Content-type and the US-ASCII Character Set

The text content-type is intended for sending textual email. It is the default content-type. Subtype names are used, for text, to indicate character sets.

In keeping with historical practice and expectations, the default content-type for internet mail is "text", and the default subtype (character set) is unspecified. This content-type can be explicitly specified as "text", and the character set that many people seem to think of as the default can be specified as "US-ASCII". However, it must be noted that because of the lack of character set specification in RFC 822, nothing can be assumed about mail with content-type "text" but no character set specification.

Alternately, a different character set subtype may be specified, in which case the body text is in the specified character set. A recommended list of predefined subtype names can be found at the end of this appendix.
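The defaulting rule for "text" can be sketched as follows. This is a minimal Python illustration, not a definitive parser: the function name `text_charset` is hypothetical, and the case-insensitive match on the type name and the discarding of anything after a semicolon are assumptions made here rather than rules stated by the memo.

```python
def text_charset(content_type):
    """Return the character set named by a text content-type value.

    `content_type` is the header field value, or None when the field
    is absent.  The result is None when no character set is given, in
    which case nothing may be assumed about the body's character set.
    """
    if content_type is None:
        # No header field: the type defaults to "text" and the
        # character set is unspecified.
        return None
    # Assumption: anything after ";" is a parameter, ignored here.
    value = content_type.split(";", 1)[0].strip()
    main_type, _, subtype = value.partition("/")
    # Assumption: the type name is matched case-insensitively.
    if main_type.strip().lower() != "text":
        raise ValueError("not a text content-type: %r" % content_type)
    subtype = subtype.strip()
    return subtype if subtype else None
```

Under these assumptions, `text_charset("text/US-ASCII")` returns `"US-ASCII"`, while both `text_charset("text")` and `text_charset(None)` return `None`.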
Note that if the specified character set includes 8-bit data, the Content-TransferEncoding header field is required in order to transmit the message via SMTP.

The default character set, US-ASCII, has been the subject of some confusion and ambiguity in the past. Not only were there some ambiguities in the definition, there have been wide variations in practice. In order to eliminate such ambiguity and variations in the future, it is strongly recommended that new user agents explicitly specify a character set via the content-type header field.

The US-ASCII character set is based on a series of standards and on the historical standard practice in the Internet mail community. However, the precise meaning of this character set has been the subject of some debate. In this appendix, therefore, we define the US-ASCII character set.

The message body is coded in the character set of American Standard Code for Information Interchange, sometimes known as "7-bit ASCII". This is not an arbitrary seven-bit character code, but indicates that the message body uses character coding that uses the exact correspondence of codes to characters specified in ASCII. National use variations of ISO646 [REF-ISO646] are not ASCII, and neither an explicit "ASCII" character set, nor "US-ASCII", nor the default (omission of a character set) should be used when characters are coded using them.

(Discussion: RFC821 very explicitly specifies "ASCII", and references an earlier version of the American Standard cited in [REF-ANSI].
Whether that specification, rather than a reference to an International Standard, was done deliberately or out of convenience or ignorance, is no longer interesting: insofar as one of the purposes of specifying a content-type and character set is to permit the receiver to unambiguously determine how the sender intended the coded message to be interpreted, assuming anything other than "strict ASCII" as the default would risk unintentional and incompatible changes to the semantics of messages now being transmitted. This also implies that messages containing characters coded according to national variations on ISO646, or using code-switching procedures (e.g., those of ISO2022), as well as 8-bit or multiple octet character encodings MUST use an appropriate character set specification to be consistent with this specification.)

The complete US-ASCII character set is listed below:

     0 nul   16 dle   32 sp   48 0   64 @   80 P    96 `   112 p
     1 soh   17 dc1   33 !    49 1   65 A   81 Q    97 a   113 q
     2 stx   18 dc2   34 "    50 2   66 B   82 R    98 b   114 r
     3 etx   19 dc3   35 #    51 3   67 C   83 S    99 c   115 s
     4 eot   20 dc4   36 $    52 4   68 D   84 T   100 d   116 t
     5 enq   21 nak   37 %    53 5   69 E   85 U   101 e   117 u
     6 ack   22 syn   38 &    54 6   70 F   86 V   102 f   118 v
     7 bel   23 etb   39 '    55 7   71 G   87 W   103 g   119 w
     8 bs    24 can   40 (    56 8   72 H   88 X   104 h   120 x
     9 ht    25 em    41 )    57 9   73 I   89 Y   105 i   121 y
    10 lf    26 sub   42 *    58 :   74 J   90 Z   106 j   122 z
    11 vt    27 esc   43 +    59 ;   75 K   91 [   107 k   123 {
    12 np    28 fs    44 ,    60 <   76 L   92 \   108 l   124 |
    13 cr    29 gs    45 -    61 =   77 M   93 ]   109 m   125 }
    14 so    30 rs    46 .    62 >   78 N   94 ^   110 n   126 ~
    15 si    31 us    47 /    63 ?   79 O   95 _   111 o   127 del

Beyond US-ASCII, one can imagine an enormous proliferation of character sets. It is the opinion of the authors of this memo that a large number of character sets is NOT a good thing. We would prefer to specify a single character set that can be used universally for representing all of the world's languages in electronic mail.
Unfortunately, there is no clear choice for such a universal representation, and existing practice in several communities seems to point to the continuing use of multiple character sets in the near future. For this reason, we define names for a small number of character sets for which a strong constituent base exists. We recommend the use of ISO-10646 wherever possible.

The defined subtypes of text, which name alternate character sets, are:

    US-ASCII -- as defined above.

    ISO-10646 -- as defined in [REF-ISO-10646]

    ISO-8859-X -- where "X" is to be replaced, as necessary, for the national use variants of ISO-8859 [REF-ISO-8859]

    ISO-2022 -- as defined in [REF-ISO-2022]

In the opinion of the authors, this is already far more character sets than are really desirable, and implementors are discouraged from defining new ones unless absolutely necessary.

***** I AM SURE THAT I NEED SOME FLESHING OUT OF THE ABOVE DEFINITIONS & REFERENCES

5.2 The "Multipart" Content-Type

In the case of multiple part messages, a "multipart" Content-type field should appear in the RFC 822 message header. The message body is then assumed to contain multiple parts separated by encapsulation boundaries. Each of the parts is defined, syntactically, as a complete RFC 822 message in miniature. That is, what is found between the encapsulation boundaries is a header area, a blank line, and a body area, in accordance with the RFC 822 syntax for a message.

However, body parts are NOT to be interpreted as actually being RFC 822 messages. To begin with, NO header fields are actually required in body parts. A body part that starts with a blank line, therefore, is a body part for which all default values are to be assumed. In such a case, of course, the absence of a Content-type header field implies that the encapsulation is US-ASCII text. The only header fields that have defined meaning for body-parts are those the names of which begin with "Content-".
All other header fields are to be ignored in body-parts, and may be discarded by gateways. They are permitted to appear in body parts only for ease of conversion between messages and body parts.

It must be understood that body parts are NOT messages. For example, a gateway between Internet and X.400 mail must be able to tell the difference between a body part that consists of an image and a body part that consists of an encapsulated message, the body of which is an image. In order to represent the latter, the body part should have "Content-type: message", and its body (after the blank line) should be the encapsulated message, with its own "Content-type: image" header field. Body parts use the same syntax as messages because there are many legitimate cases in which a body part might be converted into a message, or vice versa. The identical syntax makes such conversions easy, but the distinction must be understood by implementors. (For the special case in which all parts are actually messages, a "digest" subtype is also defined.)

As stated previously, each pair of consecutive body parts is separated by an encapsulation boundary. The encapsulation boundary MUST NOT appear inside any of the encapsulated parts. Thus, it is crucial that the composing agent be able to choose and specify the boundary that will separate the parts.

The Content-type field for multipart messages requires two supplementary fields. The first is used to specify a version number and should be either "1-S" or "1-P". The two versions have identical syntax, but the "-P" is intended as a hint, to receivers, that the parts are intended to be viewed in parallel rather than sequentially. Implementations that can not show the parts in parallel, or that choose not to do so, are free to treat all multipart messages of version "1-P" as if they were version "1-S".
However, all implementations should check the version number, to ensure graceful behavior in the event that an incompatible future version of multipart messages is defined later.

The second supplementary field, which is always required for multipart messages, is used to specify the format of the encapsulation boundary. The encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the second parameter of the Content-type header field with any leading or trailing white space removed.

(DISCUSSION: The specification that white space be removed is intended to eliminate the possible introduction of ambiguity caused by the addition or deletion of white space by message transport agents. The hyphens are for rough compatibility with the earlier RFC 934 method of message encapsulation, and for ease of searching for the boundaries in some implementations. However, it should be noted that multipart messages are NOT completely compatible with RFC 934 encapsulations; in particular, they do not obey RFC 934 quoting conventions for embedded lines that begin with hyphens.)

Thus, a typical multipart content-type header field might look like this:

    Content-type: multipart; 1-S; gc0p4Jq0M2Yt08jU534c0p

This indicates that the message consists of several parts, each itself structured as an RFC 822 message, which are intended to be viewed one-at-a-time, and that the parts are separated by the line

    --gc0p4Jq0M2Yt08jU534c0p

The encapsulation boundaries must not appear within the encapsulations, and should be no longer than 70 characters, not counting the two leading hyphens.

The encapsulation boundary following the last body-part should be a distinguished delimiter that indicates that no further body-parts will follow.
Such a delimiter is identical to the previous delimiters, with the addition of two more hyphens at the end of the line:

    --gc0p4Jq0M2Yt08jU534c0p--

It should be noted that there appears to be room for additional information prior to the first encapsulation boundary and following the final such boundary. For several reasons, however, it is specified that these areas should be left blank, and that implementations should ignore anything that appears before the first boundary or after the last one.

The use of "Content-Type: Multipart" as a message part within another "Content-Type: Multipart" is explicitly allowed. In such cases, for obvious reasons, care must be taken to ensure that each nested multipart message uses a different boundary delimiter. See Appendix II for an example of nested multipart messages.

The use of content-type "Multipart" with only a single included part may be useful in certain contexts, and is explicitly permitted.

Overall, the body of a multipart message may be specified as follows:

    body := prefix 1*encapsulation close-delimiter postfix
    encapsulation := delimiter CRLF message
    delimiter := "--" <delimiter from Content-type resource>
    close-delimiter := delimiter "--"
    prefix := *text
    postfix := *text
    message := <as defined in RFC 822, with all header fields optional, containing no lines matching "delimiter">

The above description defines the default subtype of the multipart type, "mixed", which may be explicitly specified with a content-type of "multipart/mixed". Other subtypes are possible, but should be defined to be syntactically compatible with the "mixed" subtype. Unrecognized subtypes should be treated as being of subtype "mixed."

(DISCUSSION: Conspicuously missing from the multipart type is a notion of structure. In general, it seems premature to try to standardize structure yet.
It is recommended that those wishing to provide a more structured or integrated multipart messaging facility should define a subtype of multipart that is syntactically identical, but that always expects the inclusion of a distinguished part (e.g. with a content-type of "Application/x-my-structure-subtype") that can be used to specify the structure and integration of the other parts, probably referring to them by their Content-ID field. If this approach is used, other implementations will not recognize the subtype, but will treat it as the default subtype (multipart/mixed) and will thus be able to show the user the parts that are recognized.)

This memo defines one particular subtype of multipart, the "digest" subtype. This type is syntactically identical to multipart, but the semantics are different. In particular, in a digest, all of the parts are assumed to be of type "Message". That is, each part is implicitly prefixed by a line that says "Content-type: message" followed by a blank line. This is provided in order to allow a more readable digest format that is largely compatible (except for the quoting convention) with RFC 934.

5.3 The "Text-Plus" Content-Type and "RichMail" subtype

There are many formats for representing what might be known as "extended text" -- text with embedded formatting and presentation information. An interesting characteristic of most such representations is that they are to some extent readable even without the software that interprets them. It is useful, then, to distinguish them, at the highest level, from such non-readable data as images or audio messages. In the absence of appropriate interpreting software, it is reasonable to show extended text to the user, while it is not reasonable to do so with binary data.

To represent such data, this memo defines a "text-plus" content-type. Plausible subtypes of text-plus are typically given by the common name of the representation format, e.g. "text-plus/Troff" or "text-plus/TeX".
Character sets are not specified as subtypes; in general it is assumed that rich text formats will have their own mechanisms for representing alternate or multiple character sets. Initial subtypes include troff, tex, PostScript, DVI, and ODA.

**** Should the latter three really be binary????

In order to promote the wider interoperability of simple formatted text, this memo defines a default subtype for "text-plus", the "richmail" subtype. This subtype was designed to meet the following criteria:

    1. All special formatting characters are extremely portable (only "%", "(", and ")" are used).

    2. The syntax is extremely simple to parse, so that even teletype-oriented mail systems can easily strip away the formatting information and leave only the readable text.

    3. The syntax is easily extended to allow for new formatting commands that are deemed essential.

    4. The capabilities are extremely limited, to ensure that it can represent no more than is likely to be representable by the user's primary word processor. While this limits what can be sent, it increases the likelihood that it can be properly displayed.

    5. The syntax does not correspond exactly to any known existing system, thus giving no special preference to anyone's current syntax.

The syntax of "richmail" is very simple. It is assumed, at the top-level, to be in the US-ASCII character set. All characters represent themselves, with the following exceptions:

    1. The "%" character may be used to quote characters that need quoting, particularly itself and the left and right parenthesis characters.

    2. The "%" character is used to begin a formatting token, which is no more than 28 characters long, consists only of case-insensitive alphanumeric characters or the hyphen character "-", and ends with a "(" character.

    3. After a formatting token, subsequent text is affected by that formatting command until the next unquoted ")" character.
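The three syntax rules above are simple enough that a formatting stripper of the kind a teletype-oriented system might use fits in a short function. The Python sketch below is illustrative only and rests on assumptions the memo does not settle: the name `strip_richmail` is hypothetical, the 28-character limit is applied to the token name alone, a "%" that starts a well-formed token is read as a token rather than as a quote, and an unquoted ")" with no open command is kept as literal text.

```python
import re

# A formatting token: "%", then 1 to 28 alphanumerics or hyphens, then "(".
TOKEN = re.compile(r"%([A-Za-z0-9-]{1,28})\(")


def strip_richmail(text):
    """Return the readable text with all formatting tokens removed."""
    out = []
    depth = 0  # number of formatting commands currently open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            match = TOKEN.match(text, i)
            if match:
                # Assumption: a well-formed token wins over quoting.
                depth += 1
                i = match.end()
                continue
            if i + 1 < len(text):
                # Otherwise "%" quotes the character that follows it.
                out.append(text[i + 1])
                i += 2
                continue
            # Assumption: a lone trailing "%" is kept as-is.
            out.append(ch)
            i += 1
            continue
        if ch == ")" and depth > 0:
            # An unquoted ")" closes the innermost open command.
            depth -= 1
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

Because the function ignores token names entirely, it treats every command as a no-op; an implementation that honors individual commands would keep a stack of names instead of a bare depth counter.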
Thus, for example, the following "text-plus/richmail" body fragment:

    %bold(Now) is the time for %italic(all) good men %smaller((and women%)) to come to the aid of their country.

represents the following formatted text (which will, no doubt, look cryptic in the text-only version of this memo):

    Now is the time for all good men (and women) to come to the aid of their country.

Initially defined formatting tokens are:

    Bold -- causes the enclosed text to be in a bold font, if possible.

    Italic -- causes the enclosed text to be in an italic font, if possible.

    Fixed -- causes the enclosed text to be in a fixed width font, if possible.

    Smaller -- causes the enclosed text to be in a smaller font, if possible.

    Bigger -- causes the enclosed text to be in a bigger font, if possible.

    Underline -- causes the enclosed text to be underlined, if possible.

    Center -- causes the enclosed text to be centered, if possible.

    FlushLeft -- causes the enclosed text to be left justified, if possible.

    FlushRight -- causes the enclosed text to be right justified, if possible.

    Indent -- causes the enclosed text to be indented at both margins, if possible.

    ISO-10646 -- causes the enclosed text to be interpreted as text in the ISO-10646 character set, if possible.

    ISO-8859-X (for any registered value of X) -- causes the enclosed text to be interpreted as text in the appropriate character set, if possible.

    ISO-2022 -- causes the enclosed text to be interpreted as text in the ISO-2022 multiple character set representation, if possible.

    US-ASCII -- causes the enclosed text to be interpreted as text in the US-ASCII character set, if possible. Although this is the default character set, it might be usefully nested inside another character set.

    Invisible -- causes the enclosed text to be regarded as invisible, and not shown to the user.
This can be used by systems that wish to translate a "richer" format into "richmail" for mail transport, but want to be able to restore the formatting more fully if it is read with a completely compatible system.

    No-op -- has no effect on the enclosed text.

Implementations should regard any unrecognized formatting token as equivalent to "No-op", thus facilitating future extensions to "richmail".

Richmail also differentiates between "hard" and "soft" line breaks. A "soft" line break is represented by a single CRLF (end-of-line marker), and may be ignored for purposes of presentation. A sequence of one or more hard line breaks may be represented by one plus that number of CRLF markers. Thus, a sequence of three consecutive CRLFs represents two hard line breaks. This allows portable wrapped and justified text, independent of window-size or line-length restrictions.

A minimal richmail implementation is one that implements the "Invisible" and "No-op" formatting tokens, regards all others as synonyms for "no-op", and understands the richmail newline conventions.

5.4 The Message Content-Type

It is frequently desirable, in sending mail, to encapsulate another mail message. For this common operation, a special content-type, "message", is hereby defined.

A content-type of "message" with the default subtype of "822" indicates that the body or body part is an encapsulated message, with the syntax of an RFC 822 message. This default subtype may be explicitly specified as "Content-type: message/822".

The special subtype "pem" may be used to indicate that the body or body part is a message conforming to the Privacy Enhanced Mail protocol [RFC-1113].

The special subtype "partial" may be used to indicate that the body or body part is a fragment of a larger message. Three subfields must be specified in the content-type field: The first is a unique identifier, to be used to match the parts together. The second, an integer, is the part number.
The third, another integer, is the total number of parts. Thus, part 2 of a 3-part message might have the following header field:

    Content-type: Message/Partial; oc=jpbe0M2Yt4s; 2; 3

When the parts of a message broken up in this manner are put together, the result is a complete RFC-822 format message, which may have its own Content-type header field, and thus may contain any other data type.

(EXPLANATION: The purpose of the MESSAGE/PARTIAL type is to allow large objects to be delivered as several separate pieces of mail and automatically reassembled by the receiving user agent. This may be desirable when intermediate transport agents limit the size of messages that can be sent.)

Additionally, all the character set subtypes of text are defined as subtypes of "message." If a character set subtype is given, it applies to the uninterpreted textual fields in the RFC 822 message header area. Thus it can be used to represent address and subject information in non-ASCII character sets. The character set subtype does NOT apply to the body of the encapsulated message. Thus, to encapsulate a message with non-ASCII characters in both the header fields and in the body, you would need something like the following:

    From: <ASCII form>
    Subject: <ASCII form>
    Content-type: message/iso-10646

    From: <iso-10646-form>
    Subject: <iso-10646-form>
    Content-type: text/iso-10646

    Message body in iso-10646 character set.

5.5 The Binary Content-Type

A content-type of "binary" may be used to indicate that the body or body part is binary data. A subtype may be specified, but none are defined here. The parameters for type binary are a set of attribute/value pairs, of the form "NAME=VALUE", separated by the usual semicolons. The set of possible attributes to be defined includes, but is not limited to:

    NAME -- a suggested name for the binary data as a file.
    TYPE -- the type of binary data.

    CONVERSIONS -- the set of operations that have been performed on the data before putting it in the mail (and before any Content-TransferEncoding that might have been applied). If multiple conversions have occurred, they should be specified in the order they were applied, and separated by commas.

The values for these attributes are left undefined at present, but may require specification in the future. An example of a common (though discouraged) usage might be:

    Content-type: binary; name=foo.tar; type=tar; conversions=compress,uuencode

However, the use of such mechanisms as uuencode and compress is explicitly discouraged, in favor of the more standardized Content-TransferEncoding mechanism. In particular, uuencode is not well-suited for mail transport because it is ill-defined and comes in several incompatible versions, many of which do not work in a pipe, and which use characters that do not translate well into certain representations (e.g. EBCDIC) and are not transmitted reliably over certain connections (e.g. those that remove trailing white space from a line).

The recommended action for an implementation that receives binary mail of an unrecognized type is to simply offer to put the data in a file, with any Content-TransferEncoding undone, or perhaps to use it as input to a user-specified process. Implementations are warned NOT to implement a path-search mechanism whereby an arbitrary program named in the Content-type header (e.g. the "type=" subfield) is found and executed using the binary data as input. Such an implementation could open up a significant security problem, the elucidation of which is left as an exercise for the reader.

5.6 The Application Content-Type Value

The "application" content-type is to be used for mail-based applications. A mail-based application is an application that defines a standard format for representing intermediate data that is to be manipulated by cooperating user agents.
For example, a meeting scheduler might define a standard representation for information about proposed meeting dates. An intelligent user agent would use this information to conduct a dialog with the user, and might then send further mail based on that dialog.

Such applications may be defined as subtypes of the "application" content-type. There is no default subtype for application, and this memo defines only one subtype, the "external-reference" subtype.

The External-Reference subtype indicates that the body or body part is primarily a placeholder for the data that are intended to be conveyed, presumably because too much data is involved for the underlying mail transport mechanism to handle. The subfields are, as in the case of the "binary" content-type, attribute-value pairs. In this case, the subfields describe a mechanism for accessing the binary data. The set of possible attributes includes, but is not limited to:

    FILENAME -- The name of a file that contains the external data.

    SITE -- one or more domain names, comma separated, of machines that are known to have access to the data file.

    REAL-TYPE -- The real content-type of the data, once retrieved.

    EXPIRATION -- The date (in the format "month day, year") after which the existence of the external data is not guaranteed.

With the emerging possibility of very wide-area file systems, it becomes very hard to know in advance the set of machines where a file will and will not be accessible directly from the file system. Therefore it makes sense to provide both a file name, to be tried directly, and the name of one or more sites from which the file is known to be accessible. An implementation can try to retrieve remote files using FTP or any other protocol, using anonymous file retrieval or prompting the user for the necessary name and password.

However, the external-reference mechanism is not intended to be limited to file retrieval.
One can imagine, for example, using a LISTSERV mechanism, or using unique identifiers and a video server for external references to video clips. However, this memo explicitly defines only the FILENAME and SITE attributes for retrieval purposes, as this is the only retrieval method that is currently widely applicable. Other attributes may be defined as needed.

The "REAL-TYPE" attribute may be used to specify a new content-type header field to be applied to the data once retrieved, as the data are assumed to be only the body of a message, not including any header information.

Note that semicolons may be quoted within subfields. Thus an external reference to an image in G3FAX format might have the following content-type header field:

    Content-Type: application/external-reference;
        name=/usr/local/images/contact.g3;
        site=thumper.bellcore.com;
        real-type="image/g3fax";
        expiration = "September 23, 1997"

If a message is of content-type "application/external-reference", then the actual body of the message is ignored.

5.7 The Audio, Image, Video, and X- Content-Type Values

This memo defines several more content-type values that are defined only incompletely here, and await further practical experience before their values can be more completely specified.

    AUDIO -- Indicates that the body or body part contains audio data. The subtype specifies the audio representation format; predefined case-insensitive values are "U-law" and "A-law". (U-law and A-law are the American and European audio telephony standards.) The parameters are attribute/value pairs, as in the binary content-type, and may be used to name a header format (e.g. "header=Sun"), to specify the size of the header that precedes the actual audio data (e.g. "headersize=234568 bytes"), or for other purposes.

    IMAGE -- Indicates that the body or body part contains an image.
The subtype names the specific image format; predefined case-insensitive values include "G3Fax" for Group Three Fax and "pbm", "pgm", and "ppm" for the "portable bitmap" formats for black and white, grey scale, or color images.

VIDEO -- Indicates that the body or body part contains a video sequence. The subtype and possible parameter values are left undefined by this memo.

"X-" anything -- Any type value beginning with the characters "X-" and not defined here or in another RFC is a private value, to be used by consenting mail systems by mutual agreement. Any format without a rigorous and public definition should be named with an "X-" prefix. The widely-used Andrew system uses the "X-BE2" name, so new systems should probably choose a different name.

6 RFC-XXXX Compliance

The mechanisms described in this memo are open-ended. It is definitely not expected that all implementations will implement all of the content-types described, nor that they will all share the same extensions. In order to promote interoperability, however, it is useful to define the concept of "RFC-XXXX-Compliance" to denote a certain level of implementation that allows the useful interworking of messages with content that differs from US-ASCII text. In this section, we specify the requirements for such compliance.

An RFC-XXXX-Compliant mail user agent must:

1. Recognize the Content-TransferEncoding header field, and decode data encoded with either the quoted-printable or base64 encoding. (If a compressed encoding is ever agreed to, it should also become part of all compliant user agents.)

2. Recognize and interpret the Content-type header field, and avoid showing an unsuspecting user raw data that has a content-type field other than text.

3. Explicitly handle the following content-type values, as defined in the appendices:

-- text, with at least the US-ASCII character set.
-- text-plus, with the default "richmail" subtype, at least the minimal implementation specified above.
-- message, at least the default (simple) encapsulation.
-- multipart, although parallel parts may be serialized, with all unrecognized subtypes treated as the default subtype, multipart/mixed.
-- binary, although no particular subtype recognition is required.

4. Upon encountering an unrecognized content-type, an implementation should treat it as if it had a content-type of "binary" with no parameter sub-arguments. How such data is handled is up to an implementation, but likely options for handling such unrecognized data include offering to let the user write it into a file (decoded from its mail transport format) or offering to let the user name a program to which the decoded data should be passed as input. Unrecognized predefined types, which might include audio, image, video, or application, should also be treated in this way.

A user agent that meets the above conditions is said to be RFC-XXXX compliant. The meaning of this phrase is that it is assumed to be "safe" to send virtually any kind of properly-marked data to users of such mail systems, because they will at least be able to treat the data as undifferentiated binary, and will not simply splash it onto the screen of unsuspecting users. Of course, there is another sense in which it is always "safe" to send RFC-XXXX format data, which is that such data will not break or be broken by any known systems that are compliant with RFC 821 and RFC 822. User agents that are RFC-XXXX compliant have the additional guarantee that the user will not be shown data that were never intended to be viewed as text.

Appendix I -- Guidelines For Sending Data Via Email

Because of the restrictions imposed on message bodies by RFC 822 and, in practice, by Message Transport Agents that are more-or-less compliant with RFC 821, implementors should be careful in several ways regarding the transport of any data through the mail.
The primary limitations imposed by RFC 821 are that only seven-bit data may be transmitted, and that the data must be broken up into lines of no more than 1000 characters. However, in practice, widely-used MTA's are known to impose some additional restrictions. The following guidelines may be useful to anyone devising a data format (content-type) that will survive such MTA's unscathed. (Note that anything encoded in the base64 or quoted-printable encodings will satisfy these rules, but that some well-known mechanisms, notably the UNIX uuencode facility, will not.)

(1) Delimiters other than CR-LF pairs may be used in the local representation of a message on some systems. The persistence of CR-LF pairs should not be relied on.

(2) Isolated CR and LF characters are not well tolerated in general; they may be lost or converted to delimiters on some systems, and hence should not be relied on.

(3) TAB characters may be misinterpreted or may be automatically converted to variable numbers of spaces. This is unavoidable in some environments, notably those not based on the ASCII character set. Such conversion is STRONGLY DISCOURAGED, but it may occur, and users of US-ASCII format should not rely on the persistence of TAB characters.

(4) Lines longer than 78 characters may be wrapped or truncated in some environments. Line wrapping and line truncation are STRONGLY DISCOURAGED, but unavoidable in some cases. Applications which depend on lines not being wrapped should use mechanisms other than unencoded US-ASCII body parts to transmit messages.

(5) Trailing "white space" characters (SPACE, TAB, etc.) on a line may be discarded by some transport agents, and hence should not be relied on.

Please note that the above list is NOT a list of recommended practices for MTA's -- we do not recommend that MTA's alter the character of white space, or wrap long lines. These are known BAD practices on established networks, and implementors must guard against the bad effects they can cause.
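
As a rough illustration only, the limits and guidelines in this appendix could be turned into a checking routine along the following lines. This is a sketch in Python and not part of this memo's specification: the function name transport_hazards, the wording of the warnings, and the choice to count line length without the CR-LF delimiter are all assumptions made for the example.

```python
import re
from typing import List

MAX_SAFE_LINE = 78    # guideline (4): longer lines may be wrapped or truncated
MAX_SMTP_LINE = 1000  # RFC 821 line limit as stated in this appendix


def transport_hazards(body: bytes) -> List[str]:
    """Return warnings for a message body that uses CR-LF line delimiters."""
    warnings = []

    # RFC 821 limitation: only seven-bit data may be transmitted.
    if any(byte > 0x7F for byte in body):
        warnings.append("contains non-seven-bit data")

    # Guideline (2): a CR not followed by LF, or an LF not preceded by CR.
    if re.search(rb"\r(?!\n)|(?<!\r)\n", body):
        warnings.append("contains isolated CR or LF")

    for number, line in enumerate(body.split(b"\r\n"), start=1):
        # Assumption: length is counted without the CR-LF delimiter.
        if len(line) > MAX_SMTP_LINE:
            warnings.append(f"line {number}: longer than {MAX_SMTP_LINE} characters")
        elif len(line) > MAX_SAFE_LINE:
            warnings.append(f"line {number}: longer than {MAX_SAFE_LINE} characters")
        # Guideline (3): TABs may be converted to spaces.
        if b"\t" in line:
            warnings.append(f"line {number}: contains TAB")
        # Guideline (5): trailing white space may be discarded.
        if line != line.rstrip(b" \t"):
            warnings.append(f"line {number}: trailing white space")

    return warnings
```

A sending user agent might run a check of this kind to decide whether a body can safely go out unencoded, or whether it should first be put through the base64 or quoted-printable encoding.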
Thus the list above might be seen as a list of recommended defensive actions to be taken by User Agents to defend themselves against the known ways in which MTA's sometimes modify messages.

Appendix II -- Examples

Example 1 Simple Non-ASCII Text Example

***** FILL IN HERE WITH AN EXAMPLE OF NON-ASCII TEXT. Can someone provide me with a cute example from a non-ASCII character set?

Example 2 A Complex Multipart Example

What follows is the outline of a complex multipart message. This message has three parts to be displayed serially: an introductory plain text part, an embedded multipart message, and a closing encapsulated text message in a non-ASCII character set. The embedded multipart message has two parts to be displayed in parallel, a picture and an audio fragment.

From: ...
Subject: ...
Content-type: multipart; 1-s; tweedledum

This is a multipart message. Since I've not specified another character set, this "prefix" area is in US ASCII.
--tweedledum

...Some more text appears here...
[Note that the preceding blank line means no header fields were given and this is text, with charset US ASCII.]
--tweedledum
Content-type: multipart; 1-p; tweedledee

This is a multipart message. If you are reading this text, you might want to consider changing to a user agent that understands how to properly display multipart messages.
--tweedledee
Content-type: u-law; 8000 HZ; X-NEXT
Content-TransferEncoding: base64

... base64-encoded NeXT-format audio data goes here....
--tweedledee
Content-type: image; G3FAX
Content-TransferEncoding: Base64

... base64-encoded FAX data goes here....
--tweedledee--
--tweedledum
Content-type: message/ISO-8859-1

From: Keld J|rn Simonsen (name can be non-ASCII)
Subject: whatever
Content-type: Text/ISO-8859-1
Content-TransferEncoding: Quoted-printable

... Closing text goes here ...
--tweedledum--

Summary

Using the Content-Type and Content-TransferEncoding header fields, it is possible to include, in a standardized way, arbitrary types of data objects with RFC 822 compliant mail messages. No restrictions imposed by either RFC 821 or RFC 822 are violated, and care has been taken to avoid problems caused by additional restrictions imposed by the characteristics of some Internet mail transport mechanisms (see Appendix I). The "multipart" and "message" content-types allow mixing and hierarchical structuring of objects of different types in a single message. Further content-types allow a standardized mechanism for tagging messages or message parts as audio, image, or several other kinds of data. Additional optional header fields provide conventional mechanisms for certain extensions deemed desirable by many implementors. Finally, a number of useful content-types are defined for general use by consenting user agents.

Contacts

For more information, the authors of this document may be contacted via Internet mail:

Nathaniel Borenstein <nsb@thumper.bellcore.com>
Ned Freed <ned@innosoft.com>

Acknowledgements

This RFC is the result of the collective effort of a large number of people, at several IETF meetings and on the IETF-SMTP and IETF-822 mailing lists. Although any enumeration seems doomed to suffer from egregious omissions, the following are among the many contributors to this effort: Harald Alvestrand, Randall Atkinson, Kevin Carosso, Mark Crispin, Dave Crocker, Walt Daniels, Frank Dawson, Hitoshi Doi, Kevin Donnelly, Johnny Eriksson, Craig Everhart, Roger Fajman, Alain Fontaine, Phil Gross, David Herron, Bruce Howard, Bill Janssen, Risto Kankkunen, Phil Karn, Tim Kehres, Neil Katin, Steve Kille, Anders Klemets, John Klensin, Vincent Lau, Timo Lehtinen, John MacMillan, Rick McGowan, Leo Mclaughlin, Goli Montaser-Kohsari, Keith Moore, Mark Needleman, John Noerenberg, David J.
Pepper, Jonathan Rosenberg, Jan Rynning, Mark Sherman, Keld Simonsen, Bob Smart, Einar Stefferud, Michael Stein, Taro Suzuki, Steve Uhler, Stuart Vance, Erik van der Poel, Peter Vanderbilt, Greg Vaudreuil, Brian Wideen, Glenn Wright, and David Zimmerman. The authors apologize for any omissions from this list, which were certainly unintentional.

References

[REF-ISO646] International Standard--Information Processing--ISO 7-bit coded character set for information interchange, ISO 646:1983.

[REF-ISO-2022] International Standard--Information Processing--ISO 7-bit and 8-bit coded character sets--Code extension techniques, ISO 2022:1986.

[REF-ANSI] Coded Character Set--7-Bit American Standard Code for Information Interchange, ANSI X3.4-1986.

[REF-X400] Schicker, Pietro, "Message Handling Systems, X.400", Message Handling Systems and Distributed Applications, E. Stefferud, O-j. Jacobsen, and P. Schicker, eds., North-Holland, 1989, pp. 3-41.

[RFC-821] Postel, J.B. Simple Mail Transfer Protocol. August, 1982, Network Information Center, RFC-821.

[RFC-822] Crocker, D. Standard for the format of ARPA Internet text messages. August, 1982, Network Information Center, RFC-822.

[RFC-934] Rose, M.T.; Stefferud, E.A. Proposed standard for message encapsulation. January, 1985, Network Information Center, RFC-934.

[RFC-1049] Sirbu, M.A. Content-type header field for Internet messages. March, 1988, Network Information Center, RFC-1049.

[RFC-1113] Linn, J. Privacy enhancement for Internet electronic mail: Part I - message encipherment and authentication procedures [Draft]. August, 1989, Network Information Center, RFC-1113.

[RFC-1154] Robinson, D.; Ullmann, R. Encoding header field for internet messages. April, 1990, Network Information Center, RFC-1154.

[REF-ISO-10646] ************

[REF-ISO-8859] **********