published
26 December 2024
by
Ray Morgan
updated
3 January 2025

Numeric Entities in HTML and CSS

- Understand how numeric entities in HTML and CSS map directly to Unicode code points and their significance in web content development.
- Learn the syntax of numeric HTML entities, including the required semicolons and the use of decimal and hexadecimal formats.
- Explore the use cases for numeric entities, such as handling non-ASCII characters, invisible or ambiguous characters, and special-purpose symbols.
- Practice representing characters with numeric entities in HTML, including examples like accented characters, non-Latin scripts, and emoji.
- Identify and use numeric entities for invisible or ambiguous characters, such as non-breaking spaces, zero-width spaces, soft hyphens, and directional marks.
- Differentiate between characters that may appear visually similar but serve different typographical or functional purposes (e.g., various dashes and spaces).
- Learn about the use of numeric entities in multilingual, mathematical, and typographically precise contexts to ensure compatibility and intended behavior.
- Understand the parallels between numeric HTML entities and UTF-32 encoding in their direct mapping to Unicode code points.
- Explore the use of Unicode escape sequences in CSS for styling internationalized content and working with custom fonts or typography.
- Practice writing and applying CSS rules with Unicode escape sequences, including their use in the `content` property and selectors.
- Recognize the syntax rules for CSS Unicode escape sequences, including the use of spaces to avoid ambiguity with alphanumeric characters.
- Compare CSS Unicode escape sequences with UTF-32, understanding their conceptual similarities and differences.

Like UTF-32, numeric entities in HTML and CSS map directly to Unicode code points. They are encoded as escape sequences, independent of the HTML document’s encoding. This gives developers a powerful way to include special characters and symbols in web content. These entities are especially useful for ensuring compatibility across browsers, managing special characters in different encodings, or including characters that are otherwise difficult to type or display.

Numeric Entities in HTML

Like other HTML entities, numeric HTML entities begin with & and end with a semicolon (;). (Note that closing semicolons are required; omitting them is a common source of incorrectly-rendered text.)

Numeric entities are distinguished by a number sign (#) preceding the value. For example, the regular ASCII uppercase letter Z (decimal Unicode code point 90) can be written as: &#90;.

The numeric values for entities can be written in either decimal or hexadecimal. Both forms are equally valid, but since Unicode code points are usually referred to by their hexadecimal values (and hex is more compact), doing the same in HTML is probably preferable.

Entity values in hex are preceded by x, so that same letter Z could be written as &#x5A;. (Note that neither the ‘x’ nor the hex value is case-sensitive; &#X5a; works just like &#x5A;. For legibility, I’ll use a lowercase x and uppercase hex values.)
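As a quick check of the decimal and hexadecimal forms described above, here is a minimal Python sketch. It assumes Python's standard `html.unescape` function as a stand-in for a browser's entity decoding; the variable names are purely illustrative.

```python
import html

# Three spellings of the same code point, U+005A (decimal 90).
decimal_form = "&#90;"
hex_form = "&#x5A;"
mixed_case_form = "&#X5a;"  # neither the x nor the hex digits are case-sensitive

for entity in (decimal_form, hex_form, mixed_case_form):
    print(entity, "->", html.unescape(entity))

# The decimal and hex values are the same number written in different bases.
print(int("5A", 16) == 90)
```

All three spellings decode to the same letter Z, which is why choosing between decimal and hex is a matter of readability rather than behavior.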

Practical Examples

The examples above are valid and work as expected, but they're not what numeric entities are typically used for. Numeric entities can represent any Unicode code point (or at least any that a browser and its fonts can display), but they are most often used for non-ASCII characters.

Numeric entities can represent characters with diacritics, ensuring they render correctly in all environments. For example, the word “Café” uses a Latin “e” with an acute accent (U+00E9), which can be written as Caf&#233; or Caf&#xE9;. (This is the exact equivalent of the HTML named entity &eacute;.)

They can also represent characters from non-Latin scripts. For example, the Russian word дом ("house") may be written with its numeric HTML entities:

&#x434;&#x43E;&#x43C;
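Writing these entities by hand gets tedious, so the conversion can be sketched in a few lines of Python. The helper name `to_numeric_entities` is hypothetical, and the sketch only rewrites non-ASCII characters; it does not escape `&`, `<`, or `>`.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal numeric entity."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )

# "Café", with the é given explicitly as U+00E9 (precomposed form)
print(to_numeric_entities("Caf\u00e9"))

# The Russian word for "house": U+0434, U+043E, U+043C
print(to_numeric_entities("\u0434\u043e\u043c"))
```

The output uses a lowercase x and uppercase hex digits, matching the convention used in this article.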

In addition to non-ASCII and non-Latin characters, numeric entities are especially useful for:

  • invisible or ambiguous characters
  • characters that are hard to differentiate in source code
  • special purpose characters

Using numeric entities for these characters ensures that their intended behavior or meaning is preserved across different environments. This is especially critical in multilingual, mathematical, or typographically precise contexts where the character's presence or function might not be evident from its visual representation.

Using numeric entities can also prevent errors or misinterpretations when copying and pasting special characters into a code editor or environment that doesn’t handle certain characters properly.

Invisible or Ambiguous Characters

Non-Breaking Space (U+00A0) — A non-breaking space (&nbsp; or &#160;) is visually identical to a regular space but prevents line breaks at its position.

<!-- Prevent line break between "Hello" and "World" -->
Hello&nbsp;World

Zero Width Space (U+200B) — A zero-width space (&#8203;) is entirely invisible but can be used to allow line breaks at specific points in text.

<!-- Suggest a line break after "longword" -->
longword&#8203;continuation

Soft Hyphen (U+00AD) — A soft hyphen (&shy; or &#173;) indicates a preferred line break location. It only appears as a hyphen if the line breaks at that point.

hyphen&shy;ation

Zero Width Non-Joiner (U+200C) — Prevents ligatures or connections between characters in scripts like Arabic. Represented as &#8204;.

<!-- Prevents the lam and meem from connecting (alef never joins the letter that follows it) -->
&#1575;&#1604;&#8204;&#1605;

Zero Width Joiner (U+200D) — Forces characters to join, often used in emoji sequences. Represented as &#8205;.

<!-- Creates a family emoji by joining multiple characters -->
&#x1F468;&#8205;&#x1F469;&#8205;&#x1F467;
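Because these characters are invisible in an editor, it can help to make them explicit before pasting text into markup. The following is a rough Python sketch, assuming the Unicode general categories "Cf" (format) and "Zs" (space separator) are a reasonable proxy for "invisible or ambiguous"; the function name `reveal_invisibles` is made up for illustration.

```python
import unicodedata

def reveal_invisibles(text: str) -> str:
    """Rewrite format characters and non-ASCII spaces as decimal numeric entities."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        # "Cf" covers format characters such as U+200B, U+200C, U+200D and U+00AD;
        # "Zs" covers space separators such as U+00A0. A plain ASCII space is kept.
        if ch != " " and category in ("Cf", "Zs"):
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)

print(reveal_invisibles("Hello\u00a0World"))
print(reveal_invisibles("longword\u200bcontinuation"))
```

Visible characters, including the emoji on either side of a zero width joiner, pass through unchanged; only the hidden characters are spelled out.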

Characters That Are Hard to Differentiate

En Space (U+2002) and Em Space (U+2003) — These spaces are wider than a regular space and used for precise text alignment.

<!-- En space -->
Word&#8194;Word
<!-- Em space -->
Word&#8195;Word

Thin Space (U+2009) — A thinner version of a regular space, often used in typesetting.

1&#8201;000&#8201;000

Interrobang (U+203D) — A combination of a question mark and exclamation point, often not included in default fonts.

What&#x203D;

Curly Quotes (U+201C and U+201D) — Curly quotes (“, ”) can be confused with straight quotes ("). Using entities like &#8220; and &#8221; ensures the correct character is rendered.

&#8220;Curly quotation marks&#8221;
"Regular quotation marks"

Hyphen (U+2010) vs. En Dash (U+2013) vs. Em Dash (U+2014) — These dashes have different lengths but may appear similar or identical in source code. Use &#8208;, &#8211;, and &#8212; respectively.

ASCII dash: "-" (&#45;)

Hyphen: "‐" (&#8208;)

En Dash: "–" (&#8211;)

Em Dash: "—" (&#8212;)
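When it is unclear which of these lookalike characters is sitting in a piece of text, the Unicode name is the quickest way to tell. Here is a small sketch using Python's standard `unicodedata` module; the helper `describe` is a hypothetical name.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point, Unicode name, and decimal entity for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} &#{ord(ch)};"

# ASCII hyphen-minus, hyphen, en dash, em dash
for dash in ("-", "\u2010", "\u2013", "\u2014"):
    print(describe(dash))
```

Each line of output pairs the official character name with the decimal entity listed above, so the four dashes can be told apart even when the font renders them almost identically.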

Special Purpose Characters

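Left-to-Right Mark (U+200E) and Right-to-Left Mark (U+200F) — These marks are invisible but behave as strongly directional characters, so they influence how neighboring direction-neutral characters (such as punctuation and spaces) are ordered in mixed-direction text.

<!-- Left-to-right mark placed at the boundary between left-to-right and right-to-left text -->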
English&#8206;العربية

Invisible Mathematical Operators

Invisible times (&#8290;) — Used to indicate multiplication where the operator is implied.

Invisible separator (&#8291;) — Acts as an invisible comma, for example between the two indices of a subscript such as ij.

3&#8290;x

Emojis

The emoji character 😀 (U+1F600) can be written as &#x1F600; or its decimal equivalent, &#128512;.

UTF-32 Parallels

Numeric HTML entities directly map to Unicode code points, much like UTF-32 encoding. For example, the Chinese character 月 (yuè, meaning “moon”) is at code point U+6708, so its numeric entity is &#x6708;.

This direct mapping ensures that numeric entities are a reliable way to represent any Unicode character in HTML.
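The parallel with UTF-32 can be seen by putting the two side by side. This Python sketch (the function name `entity_and_utf32` is an illustrative assumption) builds the hex entity for a character and encodes the same character as big-endian UTF-32.

```python
def entity_and_utf32(ch: str) -> tuple[str, str]:
    """Return the hex numeric entity and the UTF-32 (big-endian) bytes as hex."""
    entity = f"&#x{ord(ch):X};"
    utf32_hex = ch.encode("utf-32-be").hex().upper()
    return entity, utf32_hex

print(entity_and_utf32("\u6708"))      # the character for "moon", U+6708
print(entity_and_utf32("\U0001F600"))  # the grinning face emoji, U+1F600
```

In both cases the UTF-32 bytes are simply the code point padded with leading zeros to eight hex digits, the same number that appears in the entity.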

Numeric Entities in CSS

Just as numeric HTML entities provide a way to reference Unicode characters directly by their code points, CSS offers a similar mechanism through escape sequences. These escape sequences allow you to include Unicode characters in styles, making them an essential tool for working with internationalized content and custom typography.

Unicode Escape Sequences in CSS

In CSS, a Unicode character is written as a backslash (\) followed by one to six hexadecimal (not decimal!) digits that give its code point. A single whitespace character may follow the sequence; it ends the escape and is not rendered, which matters when the next character could otherwise be read as another hex digit.

For example, the Unicode code point U+1F600 (😀) is written in CSS as \1F600.

Unicode escape sequences are often used in the content property to add special characters.

/* Adds a smiling face emoji before each paragraph */
p::before {
    content: "\1F600 ";
}

This rule inserts the 😀 emoji before each paragraph.
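If escapes like this are generated by a build script rather than typed by hand, the conversion is a one-liner. Below is a minimal Python sketch; `css_escape` is a hypothetical helper, and it always appends the terminating space so the result is safe in front of any following character.

```python
def css_escape(ch: str) -> str:
    """Return a CSS escape for one character, with a terminating space."""
    # The trailing space ends the escape so a following hex digit is not absorbed.
    return f"\\{ord(ch):X} "

rule = 'p::before { content: "' + css_escape("\U0001F600") + '"; }'
print(rule)
```

The printed rule matches the hand-written one above, including the space after the hex digits.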

Custom Font Glyphs

When using a custom font that maps specific glyphs to Unicode code points, escape sequences can specify those glyphs.

@font-face {
    font-family: 'CustomFont';
    src: url('customfont.woff2') format('woff2');
}
    
/* The content property only takes effect on pseudo-elements such as ::before */
span.special::before {
    font-family: 'CustomFont';
    content: "\E001"; /* Reference to a private-use Unicode character */
}

Selectors

Unicode escape sequences are also valid in CSS selectors. This is useful when selecting elements with Unicode-based IDs or classes.

/* Selects an element with an ID of "🙂" (U+1F642) */
/* Please, NEVER do this. You'll get fired and deserve it. */
#\1F642 {
    color: red;
}

Notes on Syntax

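If the escape sequence is immediately followed by a character that is itself a valid hexadecimal digit (0-9 or A-F), add a space after the escape to separate the two. The space ends the escape and is not rendered. Without it, \41A would be read as the single code point U+041A rather than as "A" followed by "A".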

content: "\41 A"; /* Renders "AA" (U+0041 is 'A') */
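To see why the space matters, here is a simplified Python sketch of how a CSS string's hex escapes could be decoded. It is not a full CSS tokenizer: the name `decode_css_escapes` is hypothetical, and it handles only hex escapes (one to six hex digits plus an optional single whitespace character), not other backslash sequences.

```python
import re

# Backslash, one to six hex digits, then at most one whitespace character.
_CSS_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\n\r\f])?")

def decode_css_escapes(value: str) -> str:
    """Decode hex escapes in a CSS string value (simplified sketch)."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # CSS substitutes U+FFFD for zero, surrogates, and out-of-range values.
        if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return "\ufffd"
        return chr(code_point)
    return _CSS_ESCAPE.sub(replace, value)

print(decode_css_escapes(r"\41 A"))  # escape ends at the space
print(decode_css_escapes(r"\41A"))   # all three hex digits are consumed
```

With the space, the value decodes to two Latin letters; without it, the three hex digits form one code point, U+041A, which is a Cyrillic letter.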

Comparison to UTF-32

Like numeric HTML entities, CSS Unicode escape sequences map directly to Unicode code points. This correspondence makes them conceptually similar to UTF-32, which stores every code point as a single 32-bit value. Unlike UTF-32, however, CSS escape sequences are a source-level syntax for stylesheets and do not affect the underlying encoding of the document.