Byte-order Marks
A Byte Order Mark (BOM) is a special two-, three-, or four-byte sequence at the beginning of a text file. It indicates the byte order (endianness) and encoding of the text. Without a BOM, programs might misinterpret byte order, causing rendering errors.
While optional, BOMs are useful for ensuring that programs correctly interpret file encoding, especially in UTF-16 and UTF-32.
BOMs are not recommended for UTF-8 files. UTF-8 is defined as a sequence of single bytes, so its byte order is fixed and a BOM is unnecessary. In fact, since the BOM is usually unexpected in UTF-8 files, its presence may confuse systems that try to read it, resulting in extraneous characters appearing where they shouldn't.
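To make the "extraneous characters" concrete, here is a minimal Python sketch. It assumes a reader that either decodes the bytes as plain UTF-8 or mistakes them for Windows-1252; the sample text "hello" is only an illustration.

```python
import codecs

# A UTF-8 file that starts with the (discouraged) BOM: EF BB BF.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# A plain UTF-8 decoder keeps the BOM as an invisible U+FEFF character.
naive = data.decode("utf-8")

# Python's "utf-8-sig" codec strips a leading BOM if one is present.
bom_aware = data.decode("utf-8-sig")

# A reader that wrongly assumes Windows-1252 shows the BOM as three stray characters.
mojibake = data.decode("cp1252")

print(repr(naive))      # '\ufeffhello'
print(repr(bom_aware))  # 'hello'
print(mojibake)         # ï»¿hello
```

Reading with `utf-8-sig` is one tolerant option for input that may or may not carry a BOM, since it decodes BOM-less UTF-8 unchanged.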
Examples
Byte-order marks:
- UTF-8: EF BB BF (not recommended)
- UTF-16 Big-Endian: [FE FF]
- UTF-16 Little-Endian: [FF FE]
- UTF-32 Big-Endian: [00 00 FE FF]
- UTF-32 Little-Endian: [FF FE 00 00]
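The signatures above can be checked against the first few bytes of a file. The following is a small sketch rather than a complete encoding detector; the function name `detect_bom` and its return shape are made up for this example. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Check longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = detect_bom(sample)
print(encoding, sample[skip:].decode(encoding))  # utf-16-le hi
```

One caveat of this ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be reported as UTF-32 LE. That is rare in practice, but it shows why a BOM check is a hint rather than proof.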
These BOMs tell whatever system is reading the file how the bytes comprising each 16-bit or 32-bit code unit are ordered. So, for example, for the character 𐍈 (U+10348), which UTF-16 encodes as the surrogate pair D800 DF48, the bytes in UTF-16 will be:
- UTF-16BE: [D8 00] [DF 48]
- UTF-16LE: [00 D8] [48 DF]
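These byte sequences can be reproduced in Python. The sketch below computes the surrogate pair by hand and then compares it with the output of the built-in codecs; the helper name `surrogate_pair` is invented for this example.

```python
def surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x10348)
print(f"{high:04X} {low:04X}")  # D800 DF48

ch = "\U00010348"
be = ch.encode("utf-16-be")  # each 16-bit unit stored high byte first
le = ch.encode("utf-16-le")  # each 16-bit unit stored low byte first
print(be.hex(" ").upper())   # D8 00 DF 48
print(le.hex(" ").upper())   # 00 D8 48 DF
```

Note that only the bytes inside each 16-bit unit are swapped; the order of the two units (high surrogate, then low surrogate) is the same in both encodings.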
Network-Transmitted Data
BOMs are less reliable and less common in network-transmitted data. Network protocols prefer explicit encoding declarations in headers (e.g., HTTP headers).
Examples: Declaring Encoding in HTTP Headers
The Content-Type header specifies the MIME type and character encoding of the content that follows.
Syntax:
Content-Type: <mime-type>; charset=<encoding>
UTF-8 (no BOM):
Content-Type: text/plain; charset=utf-8
UTF-16 with BOM (not common):
Content-Type: text/plain; charset=utf-16
UTF-16 without BOM (a Bad Thing™*):
Content-Type: text/plain; charset=utf-16
UTF-16LE (explicit endianness, no BOM):
Content-Type: text/plain; charset=utf-16le
UTF-16BE (explicit endianness, no BOM):
Content-Type: text/plain; charset=utf-16be
* This is a Bad Thing™ because it leaves byte order ambiguous. It's then up to the receiving system to figure out the order of bytes in each 16-bit unit. Generally, the rule is that systems should be precise about what data they send and fault-tolerant about data they receive, but this assumes that the receiving system is well-implemented enough to do that. And given how much code** is written at 3 a.m. by sleep-deprived programmers on life-threatening doses of caffeine, that's not a good assumption.
** A lot.
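As a rough sketch of what a receiving system might do with the headers above, the Python below reads the charset from a Content-Type value and decodes the body with it. The function names `charset_from_content_type` and `decode_body` are hypothetical, and so are two choices made here: falling back to UTF-8 when no charset is declared, and rejecting `charset=utf-16` bodies that carry no BOM instead of guessing. The header parsing uses the standard library's `email.message.Message`.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body: bytes, header_value: str) -> str:
    """Decode body with the declared charset, refusing BOM-less 'utf-16'."""
    charset = charset_from_content_type(header_value)
    has_utf16_bom = body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
    if charset == "utf-16" and not has_utf16_bom:
        raise ValueError("charset=utf-16 without a BOM: byte order is ambiguous")
    # For 'utf-16' with a BOM, Python's codec reads the BOM and strips it.
    return body.decode(charset)


print(decode_body(b"\x00\xd8\x48\xdf", "text/plain; charset=utf-16le"))      # 𐍈
print(decode_body(b"\xfe\xff\xd8\x00\xdf\x48", "text/plain; charset=utf-16"))  # 𐍈
```

The ambiguity is real: the bytes D8 00 DF 48 decode to 𐍈 as UTF-16BE, but the same bytes read as UTF-16LE decode without error to two unrelated characters (U+00D8 and U+48DF). A receiver that guesses wrong gets no warning.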