Endianness
In the next chapter, we're going to learn about byte-order marks for UTF-16 and UTF-32, but for that to make any sense, you first need to understand "endianness". Endianness determines the order of the bytes within each 16-bit or 32-bit unit, and if it's wrong, you'll just get scrambled text.
Different systems arrange the bytes of multi-byte values (integers, floating-point numbers, and 16-bit or 32-bit character code units) differently according to their “endianness.” In “big-endian” systems, the most significant byte (MSB) comes first. Conversely, in a “little-endian” system, the least significant byte (LSB) comes first.
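As a minimal sketch of the two byte orders, the Python snippet below serializes one 32-bit integer both ways. The sample value 0x12345678 is chosen only for illustration; it does not come from any particular system.

```python
# An arbitrary 32-bit sample value; 0x12 is its most significant byte.
value = 0x12345678

# Serialize the same integer in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # 12 34 56 78  (MSB first)
print(little.hex(" "))  # 78 56 34 12  (LSB first)

# Reading the bytes back with the matching byte order recovers the value.
assert int.from_bytes(big, byteorder="big") == value
assert int.from_bytes(little, byteorder="little") == value
```

The bytes themselves are identical in both cases; only their order in memory or in a file differs.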
Endianness becomes relevant in connection with byte-order marks (see the next section) in UTF-16 and UTF-32 encodings, because it determines the arrangement of the bytes that represent each character. Getting endianness wrong produces streams of garbled text.
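To see what such garbled text looks like, here is a small Python sketch that encodes a string as big-endian UTF-16 and then decodes it with the wrong byte order. The sample string "Hi" is an arbitrary choice for illustration.

```python
text = "Hi"  # arbitrary sample string

# Encode as big-endian UTF-16: each character becomes two bytes, MSB first.
data = text.encode("utf-16-be")
print(data.hex(" "))  # 00 48 00 69

# Decode with the wrong byte order: every byte pair is read backwards.
garbled = data.decode("utf-16-le")
print([f"U+{ord(ch):04X}" for ch in garbled])  # ['U+4800', 'U+6900']

# Decoding with the matching byte order restores the original text.
print(data.decode("utf-16-be"))  # Hi
```

Note that the wrong decoding raises no error here. The swapped bytes happen to form valid code points (two CJK ideographs), so the mistake only shows up as nonsense on screen.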
The endianness of a system, that is, whether it stores the most significant byte first (big-endian) or the least significant byte first (little-endian), is determined by the architecture of the CPU and other hardware.
Big-endian: Often associated with older systems, mainframes, and network protocols.
Little-endian: Dominant in modern personal computers and consumer devices due to x86 and ARM architectures.
Bi-endian: Found in flexible architectures like ARM, PowerPC, and MIPS, which can be configured for either byte order for compatibility with various use cases.
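If you want to check which of these categories the machine in front of you falls into, one possible approach in Python is to ask the interpreter for the native byte order. The sketch below uses only the standard library (sys.byteorder and the struct module).

```python
import struct
import sys

# sys.byteorder reports the native byte order of the running interpreter:
# "little" on x86 and typical ARM systems, "big" on big-endian hardware.
native = sys.byteorder
print(native)

# Cross-check: "=" tells struct to use the native byte order, so packing
# the 16-bit value 1 must match to_bytes() with the same byte order.
packed = struct.pack("=H", 1)
assert packed == (1).to_bytes(2, byteorder=native)
```

The printed value depends on the hardware, so no fixed output is shown for the first print call.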
Big-Endian Systems
In big-endian systems, the most significant byte (MSB) comes first. Big-endian storage order feels more “natural” when humans read hexadecimal representations, because the bytes appear in the same order as the digits of the numbers we write in everyday use, with the most significant part first.
Big-endian is the standard for most Internet protocols (e.g., IP, TCP, UDP), which is why it is also called network byte order. It is also found in IBM z/Architecture mainframes, in IBM Power Systems running AIX or IBM i, and in older architectures like the Motorola 68k processors (used in older Macintosh systems) and SPARC (historically big-endian, though some models support both byte orders).
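Because Internet protocols expect big-endian values on the wire, programs on little-endian machines have to convert explicitly. The following Python sketch shows one way to do this with the standard struct module, where the "!" prefix selects network (big-endian) byte order. The port number 443 is just a sample value.

```python
import struct

port = 443  # sample value; 443 is 0x01BB in hexadecimal

# "!" means network byte order (big-endian); "H" is an unsigned 16-bit integer.
wire = struct.pack("!H", port)
print(wire.hex(" "))  # 01 bb

# Unpacking with the same format recovers the value on any host.
(decoded,) = struct.unpack("!H", wire)
print(decoded)  # 443

# Misreading the same two bytes as little-endian gives a different number.
(wrong,) = struct.unpack("<H", wire)
print(wrong)  # 47873 (0xBB01)
```

Using an explicit format prefix keeps the result independent of the host CPU, which is the point of agreeing on a single byte order for network traffic.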
Some embedded systems and devices still use big-endian storage.
Importantly, the way in which numeric HTML entities, CSS escapes, and Unicode code points are written corresponds to big-endian byte order. For example, consider the Chinese character 鳥 ("bird"):
- Unicode code point: U+9CE5
- HTML entity: &#x9CE5;
- CSS escape: \9CE5
- UTF-16 big-endian bytes: 9C E5
- UTF-16 little-endian bytes: E5 9C
- UTF-32 big-endian bytes: 00 00 9C E5
- UTF-32 little-endian bytes: E5 9C 00 00
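The byte sequences in the list above can be checked with a short Python sketch. It relies on Python's built-in codecs "utf-16-be", "utf-16-le", "utf-32-be", and "utf-32-le", which fix the byte order explicitly and do not write a byte-order mark.

```python
bird = "\u9ce5"  # the character 鳥, written as an escape to keep the source ASCII

# Encode the same character with each explicit byte order.
codecs_to_try = ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
encoded = {codec: bird.encode(codec) for codec in codecs_to_try}

for codec, data in encoded.items():
    print(f"{codec}: {data.hex(' ').upper()}")
# utf-16-be: 9C E5
# utf-16-le: E5 9C
# utf-32-be: 00 00 9C E5
# utf-32-le: E5 9C 00 00

# The numeric HTML entity is simply the code point written in hexadecimal.
html_entity = f"&#x{ord(bird):X};"
print(html_entity)  # &#x9CE5;
```

In UTF-32 little-endian, all four bytes are reversed, so the two zero bytes move to the end rather than staying in front.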
Little-Endian Systems
As illustrated above, in little-endian systems, the least significant byte (LSB) comes first. Little-endian architecture is prevalent in modern personal computers and many consumer electronics. It can simplify some hardware operations: because the LSB is stored at the lowest memory address, multi-byte arithmetic can start with the low-order byte and carry upward, and a small value can be read at a narrower width from the same address.
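One practical consequence of storing the LSB at the lowest address can be sketched in Python by treating a bytes object as a stand-in for memory. This is an illustration of the layout only, not a model of how any specific CPU performs loads; the value 42 and the read widths are arbitrary choices.

```python
value = 42  # small sample value that fits in a single byte

little = value.to_bytes(4, "little")  # 2a 00 00 00
big = value.to_bytes(4, "big")        # 00 00 00 2a

# Read 1, 2, and 4 bytes starting at offset 0 (the "lowest address").
widths = (1, 2, 4)
little_reads = [int.from_bytes(little[:w], "little") for w in widths]
big_reads = [int.from_bytes(big[:w], "big") for w in widths]

print(little_reads)  # [42, 42, 42]
print(big_reads)     # [0, 0, 42]
```

In the little-endian layout, a narrower read from the same starting address still yields the value as long as it fits. In the big-endian layout, the reader would have to adjust the address to find the low-order bytes.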
Examples include x86 and x86-64 architectures, like Intel processors (Pentium, Core series) and AMD processors, so PCs running Windows, macOS, and most Linux distributions are mostly little-endian.
ARM architecture is primarily little-endian, though it supports bi-endian configurations (big-endian is optional).
Most embedded systems are little-endian, due to the prevalence of ARM and x86 processors.
Bi-Endian Systems
Some systems are bi-endian, meaning they can operate in either big-endian or little-endian mode, depending on configuration. These systems offer flexibility for compatibility with software or communication protocols requiring a specific endianness.
For example, ARM architecture is little-endian by default, but some ARM processors support big-endian mode. SPARC has historically been big-endian, but supports both modes in some implementations.
PowerPC has historically been big-endian but supports bi-endian configurations. MIPS is bi-endian, with endianness determined by system configuration.