Endianness

2019 April 9

I always get confused by endianness. Endianness concerns the machine layout of the bytes in a number, but in my mental image I picture bytes written as sequences of bits and get distracted by the order of the bits, or I think about byte arrays, which are laid out the same way regardless of endianness. I hope that writing out a deep exploration of bit and byte order will help me visualize things better when I think about endianness.

Bits

In English (and many other languages), we write numbers in positional notation with the least significant digit on the right. For example, the current year is written "2019" instead of "9102". This extends to other bases, including binary.
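
For instance, here is the year written in both bases, as a small C sketch (binary literals like 0b... are a GCC/Clang extension, standardized in C23); in both, the least significant digit is on the right:

    #include <assert.h>

    int main(void) {
        /* 2019 in decimal and in binary positional notation. */
        int decimal = 2019;
        int binary = 0b11111100011; /* 1024+512+256+128+64+32+2+1 */
        assert(decimal == binary);
        return 0;
    }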

Our programming languages reflect this convention with their bitwise shift operators: shifting bits to the right moves them down in significance, and shifting left moves them up, so we can assume that (2 >> 1) == 1. In a hypothetical system where the least significant bit were written on the left, a right shift would have the opposite effect, e.g. (2 >> 1) == 4.
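
A minimal sketch of both directions in C:

    #include <assert.h>

    int main(void) {
        /* Shifting right moves bits down in significance;
           shifting left moves them up. */
        assert((2 >> 1) == 1); /* 0b10  -> 0b1   */
        assert((2 << 1) == 4); /* 0b10  -> 0b100 */
        return 0;
    }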

Thus, when I think about a number as a sequence of bits, however many, I picture it as written, with the most significant bit on the left and the least significant bit on the right.

Separate from the written order of bits is bit numbering. We can choose to number digits from most to least significant (left to right in written English) or vice versa.

Our computers do not address individual bits, so it hardly matters which scheme we choose; we can pick whichever is most comfortable. In every conversation that I can remember, I've numbered bits from least to most significant (right to left). This has the advantage that a bit of a given significance keeps the same numbered position across representation widths (e.g. 32-bit vs 64-bit). This scheme is called "LSB 0". The opposite scheme is called "MSB 0".
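
As a sketch of why I like LSB 0, here is a hypothetical BIT helper in C: because "bit n" is simply the bit worth 2^n, the same expression works at any width.

    #include <assert.h>
    #include <stdint.h>

    /* Under LSB 0 numbering, bit n of x is the bit worth 2^n. */
    #define BIT(x, n) (((x) >> (n)) & 1)

    int main(void) {
        uint32_t narrow = 2019; /* 0b11111100011 */
        uint64_t wide = 2019;
        /* The same bit positions hold regardless of width. */
        assert(BIT(narrow, 0) == 1 && BIT(wide, 0) == 1);
        assert(BIT(narrow, 2) == 0 && BIT(wide, 2) == 0);
        assert(BIT(narrow, 10) == 1 && BIT(wide, 10) == 1);
        return 0;
    }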

Bytes

Now, let's think of the binary representation of numbers when broken into 8-bit bytes. How do we number these bytes? Just like with bits, we have two choices: most to least significant (left to right in written English), or vice versa. The scheme where we number them starting from the most significant byte, i.e. from the "big" end, is called big-endian. The opposite scheme is little-endian.
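
To make the two numbering schemes concrete, here is a small C sketch with hypothetical helpers that pull out "byte n" of a 32-bit value under each scheme:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helpers: byte n of a 32-bit x under each scheme. */
    static uint8_t byte_big_endian(uint32_t x, int n)    { return (x >> (8 * (3 - n))) & 0xff; }
    static uint8_t byte_little_endian(uint32_t x, int n) { return (x >> (8 * n)) & 0xff; }

    int main(void) {
        uint32_t x = 0x0a0b0c0d;
        /* Big-endian numbering starts from the "big" end:       0a 0b 0c 0d
           little-endian numbering starts from the "little" end: 0d 0c 0b 0a */
        for (int n = 0; n < 4; n++)
            printf("byte %d: big-endian %02x, little-endian %02x\n",
                   n, byte_big_endian(x, n), byte_little_endian(x, n));
        return 0;
    }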

Why does this matter? The numbering of bits had no effect on our software, letting us choose whatever we liked, but the numbering of bytes determines their order in memory, i.e. their addresses. A sequence of bytes is the universal language; it is how we read and write data. Endianness determines how we interpret a sequence of bytes as one number.
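
Here is a sketch in C of that interpretation step: copy a 32-bit number into a byte array and look at which value lands at which address. The commented output assumes a little-endian CPU such as x86-64.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t x = 0x0a0b0c0d;
        uint8_t bytes[4];
        /* Copy the number's in-memory representation into an array. */
        memcpy(bytes, &x, sizeof x);
        /* Little-endian CPU (e.g. x86-64): prints 0d 0c 0b 0a
           big-endian CPU:                  prints 0a 0b 0c 0d */
        printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
        return 0;
    }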

The only time we need to worry about endianness is when converting between different serialization formats. There are two prime examples:

  1. Instruction set architectures. We can think of each instruction as a serialization of the code for that instruction and its arguments, which may include numbers. Most desktops and laptops use processors with x86 or x86-64 architectures, both of which are little-endian. A CPU writes numbers to memory according to its endianness.

  2. Wire protocols. The most common network protocols (e.g. TCP and UDP) are big-endian. When preparing a message for these protocols, we must take care to convert between the endianness of the CPU and the endianness of the protocol, but only for the numbers interpreted by the protocol, e.g. the port number in a TCP header; a sketch of this conversion follows the list.
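
As a sketch of that conversion (assuming a POSIX system, where htons is declared in <arpa/inet.h>), here is a port number being put into network byte order before it would be copied into a header; the printed bytes come out the same on any host:

    #include <arpa/inet.h> /* htons: host to network byte order, 16-bit */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint16_t port = 8080;        /* 0x1f90 in the host's byte order */
        uint16_t wire = htons(port); /* big-endian, as TCP expects */
        uint8_t bytes[2];
        memcpy(bytes, &wire, sizeof wire);
        printf("%02x %02x\n", bytes[0], bytes[1]); /* 1f 90 on any host */
        return 0;
    }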