Byte-order Marks

A byte-order mark (BOM) is the character U+FEFF encoded at the very beginning of a text file, where it occupies two, three, or four bytes depending on the encoding. It indicates the byte order (endianness) and encoding of the text. Without a BOM, a program might assume the wrong byte order and display garbled text.

While optional, BOMs are useful for ensuring that programs correctly interpret file encoding, especially in UTF-16 and UTF-32.

BOMs are not recommended for UTF-8 files. UTF-8 is read one byte at a time, so its byte order is fixed and a BOM is unnecessary. In fact, since a BOM is usually unexpected in UTF-8 files, its presence may confuse systems that read them, resulting in extraneous characters appearing where they shouldn't.
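
The "extraneous characters" problem can be reproduced in a few lines. The following is a minimal Python sketch, not a definitive recipe: the sample text "hi" and the choice of Windows-1252 as the wrongly assumed encoding are illustrative assumptions.

```python
import codecs

# "hi" saved as UTF-8 with a BOM in front (sample data, chosen for illustration).
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# A plain UTF-8 decoder keeps the BOM as an invisible U+FEFF character.
plain = data.decode("utf-8")

# Python's "utf-8-sig" codec strips a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

# A reader that wrongly assumes Windows-1252 turns the three BOM bytes
# into three visible stray characters: "ï»¿hi".
misread = data.decode("cp1252")

print(repr(plain))     # '\ufeffhi'
print(repr(stripped))  # 'hi'
print(len(misread))    # 5
```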

Examples

Byte-order marks:

UTF-8:: [EF BB BF] (not recommended)
UTF-16 Big-Endian:: [FE FF]
UTF-16 Little-Endian:: [FF FE]
UTF-32 Big-Endian:: [00 00 FE FF]
UTF-32 Little-Endian:: [FF FE 00 00]
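
A reader can use the table above to sniff the encoding from the first bytes of a file. The sketch below is one possible approach in Python; the function name `detect_bom` and its return shape are hypothetical, chosen for this example. The BOM constants come from the standard `codecs` module.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so testing UTF-16 first would misreport it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration. Note that FF FE 00 00 could also be
    a UTF-16LE BOM followed by U+0000; this sketch assumes UTF-32LE.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


encoding, skip = detect_bom(b"\xff\xfeh\x00i\x00")
print(encoding, skip)  # utf-16-le 2
```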

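These BOMs tell whatever system is reading the file how the bytes comprising each 16-bit (UTF-16) or 32-bit (UTF-32) unit are ordered. So, for example, the character 𐍈 (U+10348) lies above U+FFFF, which means UTF-16 encodes it as two 16-bit units (a surrogate pair). Its bytes in UTF-16 will be: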

UTF-16BE:: [D8 00] [DF 48]
UTF-16LE:: [00 D8] [48 DF]
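
The byte sequences above can be checked with a short Python sketch. The helper `surrogate_pair` is a hypothetical name introduced here to show the arithmetic; the built-in `utf-16-be` and `utf-16-le` codecs produce the same units without a BOM.

```python
def surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into its two UTF-16 units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x10348)
print(f"{high:04X} {low:04X}")  # D800 DF48

ch = "\U00010348"
be = ch.encode("utf-16-be")  # each 16-bit unit: high byte first
le = ch.encode("utf-16-le")  # each 16-bit unit: low byte first
print(be.hex(" "))  # d8 00 df 48
print(le.hex(" "))  # 00 d8 48 df
```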

Network-Transmitted Data

BOMs are less common and less reliable in network-transmitted data. Network protocols generally prefer explicit encoding declarations in headers (e.g., HTTP headers).

Examples: Declaring Encoding in HTTP Headers

The Content-Type header specifies the MIME type and character encoding of the content that follows.

Syntax:

Content-Type: <mime-type>; charset=<encoding>
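
On the receiving side, the charset has to be pulled back out of the header value. The following Python sketch shows one way to do that with the standard library's `email.message.Message` class; the wrapper function `charset_from_content_type` is a hypothetical name used only for this example.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/plain; charset=UTF-8"))     # utf-8
print(charset_from_content_type("text/plain; charset=utf-16le"))  # utf-16le
print(charset_from_content_type("text/plain"))                    # None
```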

UTF-8 (no BOM):

Content-Type: text/plain; charset=utf-8

UTF-16 with BOM (not common):

Content-Type: text/plain; charset=utf-16

UTF-16 without BOM (a Bad Thing™*):

Content-Type: text/plain; charset=utf-16

UTF-16LE (explicit endianness, no BOM):

Content-Type: text/plain; charset=utf-16le

UTF-16BE (explicit endianness, no BOM):

Content-Type: text/plain; charset=utf-16be

* This is a Bad Thing™ because it leaves the byte order ambiguous. It's then up to the receiving system to figure out the order of bytes in each 16-bit unit. Generally, the rule is that systems should be precise about the data they send and fault-tolerant about the data they receive, but this assumes that the receiving system is well-implemented enough to do that. And given how much code** is written at 3 a.m. by sleep-deprived programmers on life-threatening doses of caffeine, that's not a good assumption.

** A lot.
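
A receiver that wants to be fault-tolerant about a bare `charset=utf-16` has to pick a policy for the no-BOM case. The Python sketch below is one possible policy, not a standard implementation: the function `decode_body` and its `utf16_fallback` parameter are hypothetical, and the big-endian default is an assumption (RFC 2781 suggests big-endian when no BOM is present, while some implementations, including web browsers, assume little-endian).

```python
import codecs


def decode_body(body: bytes, charset: str, utf16_fallback: str = "utf-16-be") -> str:
    """Decode a message body using the charset declared in its header."""
    name = charset.lower()
    if name == "utf-16":
        if body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            # Python's "utf-16" codec reads the BOM and strips it.
            return body.decode("utf-16")
        # No BOM: the byte order is ambiguous, so fall back to the chosen
        # policy (an assumption of this sketch) instead of the platform default.
        return body.decode(utf16_fallback)
    # utf-16le, utf-16be, utf-8, etc. carry their byte order in the name.
    return body.decode(name)


print(decode_body(b"\xff\xfeh\x00i\x00", "utf-16"))  # hi (BOM says little-endian)
print(decode_body(b"\x00h\x00i", "utf-16"))          # hi (no BOM, big-endian fallback)
print(decode_body(b"h\x00i\x00", "utf-16le"))        # hi (explicit endianness)
```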
