What is a surrogate pair in UTF-16?

UTF-16 uses surrogate pairs to encode Unicode codepoints above U+FFFF (U+10000–U+10FFFF). A surrogate pair consists of a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF); together the two 16-bit code units represent a single supplementary character. JavaScript strings use UTF-16 internally, so most emoji and some rarer CJK ideographs are stored as surrogate pairs.
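
The arithmetic behind a surrogate pair can be sketched in a few lines. The example below uses Python purely for illustration; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helpers, not part of this tool or of any library.

```python
def to_surrogate_pair(codepoint: int) -> tuple:
    """Split a supplementary codepoint (U+10000..U+10FFFF) into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE
print(f"U+{high:04X} U+{low:04X}")      # U+D83D U+DE00
```

The result matches what Python's own `"\U0001F600".encode("utf-16-be")` produces, which is a convenient cross-check.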

What does the bidirectional class mean?

The Unicode Bidirectional (Bidi) Algorithm determines the display order of mixed left-to-right and right-to-left text. Every character has a bidi class: L (Left-to-Right), R (Right-to-Left), AL (Arabic Letter), AN (Arabic Number), EN (European Number), WS (Whitespace), and others. The algorithm uses these classes to render mixed Arabic or Hebrew and Latin text correctly on the same line.

Why does my emoji show two codepoints?

Most emoji have codepoints above U+FFFF (in the supplementary planes), so JavaScript's UTF-16 strings store each of them as a surrogate pair: two 16-bit code units for one codepoint. When you paste such an emoji into the inspector, it shows both surrogates. Some emoji are also sequences: a base emoji followed by variation selectors, skin-tone modifiers, or zero-width joiners that combine into a single visible glyph.
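
To see both effects at once, the sketch below breaks a skin-toned emoji into its codepoints and counts its UTF-16 code units. It is an illustration in Python, not the inspector's own code; `describe` and `utf16_units` are hypothetical helper names.

```python
import unicodedata


def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' string per codepoint in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's string.length reports."""
    return len(text.encode("utf-16-le")) // 2


thumbs_up = "\U0001F44D\U0001F3FD"  # thumbs up + medium skin-tone modifier
for line in describe(thumbs_up):
    print(line)
print(len(thumbs_up))          # 2 codepoints
print(utf16_units(thumbs_up))  # 4 code units, one visible glyph
```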

What is the difference between a codepoint and a grapheme cluster?

A codepoint is a single Unicode number (e.g., U+0041 = A). A grapheme cluster is one or more codepoints that together form a single visible character as perceived by a user. For example, a letter with a combining accent mark, or a complex emoji with skin-tone modifier, may be a single grapheme cluster made of multiple codepoints. The Segmentation Inspector in this tool identifies grapheme clusters.

Unicode Character Property Inspector

Paste any text to see per-character Unicode properties: codepoint, UTF-8, UTF-16, category, block, and bidirectional class.

Glyph	Codepoint	Decimal	Name	UTF-8	UTF-16	Category	Block	Bidi

How to Use the Unicode Property Inspector

Inspect mode — paste any text into the input area. The table below shows every character with its codepoint, UTF-8 bytes, UTF-16 representation, Unicode category, block, and bidirectional class. Click any glyph cell to copy that character.
Compare mode — paste two strings that look identical but may differ. The tool highlights characters with different codepoints, revealing homoglyphs or invisible characters.
Encode mode — type a single character to see all its encoding representations: decimal, hex, HTML entity, JavaScript escape, Python escape, CSS escape, and raw bytes.
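
The idea behind Compare mode can be sketched as a position-by-position codepoint diff. This is a minimal Python illustration under the assumption that the two strings line up index for index; `diff_codepoints` is a hypothetical name and is not how the tool itself is necessarily implemented.

```python
from itertools import zip_longest


def fmt(ch) -> str:
    """Format a character as U+XXXX, or '(none)' past the end of a string."""
    return f"U+{ord(ch):04X}" if ch is not None else "(none)"


def diff_codepoints(a: str, b: str) -> list:
    """Return (index, codepoint_in_a, codepoint_in_b) wherever a and b differ."""
    return [
        (i, fmt(ca), fmt(cb))
        for i, (ca, cb) in enumerate(zip_longest(a, b))
        if ca != cb
    ]


latin = "paypal"
spoofed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A
print(diff_codepoints(latin, spoofed))  # [(1, 'U+0061', 'U+0430')]
```

A plain index-based diff reports every following position as different once one string contains an extra invisible character, so a real comparison would likely need an alignment step.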

Unicode Character Properties Explained

Every Unicode character carries a rich set of properties beyond just its visual shape. The most useful for developers are the general category, block, script, and bidirectional class. Understanding these properties is essential when building text-processing systems, input validation, search engines, or any application that needs to handle multilingual text correctly.

General Category

The general category property classifies every codepoint into one of 30 categories. The broad groups are: Letter (L), Mark (M), Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). Letters are further divided into Uppercase (Lu), Lowercase (Ll), Titlecase (Lt), Modifier (Lm), and Other Letter (Lo). Numbers include Decimal Digit (Nd), Letter Number (Nl), and Other Number (No). Understanding categories is critical for regular expressions — the pattern \p{L} matches any letter in Unicode-aware regex engines, regardless of script.
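
As a small illustration, Python's standard `unicodedata` module exposes the two-letter general category directly. Python's built-in `re` module does not support `\p{L}`, so the sketch below checks the category prefix instead; `is_letter` is a hypothetical helper name.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the general category is one of the Letter subcategories (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


samples = [
    "A",        # Latin capital letter
    "\u00E9",   # é, Latin small letter with acute
    "\u0436",   # ж, Cyrillic small letter
    "\u4E2D",   # 中, CJK ideograph
    "5",        # ASCII digit
    "\u0663",   # Arabic-Indic digit three
    "!",        # punctuation
    " ",        # space separator
    "+",        # math symbol
]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter(ch))
```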

Unicode Blocks

Blocks are contiguous named ranges of codepoints. Unlike scripts, blocks do not always align perfectly with a single writing system, but they are a useful navigation aid. The Basic Latin block (U+0000–U+007F) corresponds to ASCII. The Miscellaneous Symbols block (U+2600–U+26FF) contains weather symbols, card suits, and other decorative characters. Knowing the block of an unknown character helps you understand its intended use and find similar characters nearby.
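
Because blocks are non-overlapping ranges, looking one up is a sorted-range search. The Python sketch below uses a deliberately tiny, hand-written excerpt of the block table (the complete list is published in the Unicode Character Database file Blocks.txt); `BLOCKS` and `block_of` are hypothetical names for illustration only.

```python
from bisect import bisect_right

# Tiny excerpt for illustration; a real tool would load every range from Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch: str) -> str:
    """Return the block name for ch, or a placeholder if the excerpt does not cover it."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        start, end, name = BLOCKS[i]
        if start <= cp <= end:
            return name
    return "(not in this excerpt)"


print(block_of("A"))           # Basic Latin
print(block_of("\u2602"))      # Miscellaneous Symbols (U+2602 UMBRELLA)
print(block_of("\U0001F600"))  # Emoticons
```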

UTF-8 and UTF-16 Encoding

The inspector shows the raw byte sequences for both UTF-8 and UTF-16. UTF-8 uses 1 byte for ASCII characters (U+0000–U+007F), 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for supplementary characters (U+10000–U+10FFFF). UTF-16 always uses one 2-byte code unit for characters in the Basic Multilingual Plane and a 4-byte surrogate pair for supplementary characters. JavaScript's charCodeAt() returns a single UTF-16 code unit, while codePointAt() returns the full codepoint when the index points at the start of a surrogate pair. This distinction matters when processing emoji or historic script characters.
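
The byte counts above can be checked with one character from each UTF-8 length class. This is a short Python sketch; `hex_bytes` and `encodings` are hypothetical helper names, and big-endian UTF-16 without a byte order mark is assumed for display.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 big-endian bytes) for one character, as hex strings."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))


for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:  # A, é, €, grinning face
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<12} UTF-16: {utf16}")
```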

Bidirectional Class

The Unicode Bidirectional Algorithm (UBA, defined in Unicode Standard Annex #9) governs how characters are rendered when left-to-right and right-to-left text appear in the same paragraph. Every character has a bidi class: L (Left-to-Right, for Latin, Greek, etc.), R (Right-to-Left, for Hebrew), AL (Arabic Letter), AN (Arabic Number), EN (European Number), and several neutral classes like WS (Whitespace) and ON (Other Neutral). The algorithm resolves the display order based on the sequence of these classes. Incorrect bidi handling is a common source of rendering bugs in multilingual applications.
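
For a quick look at these classes, Python's standard `unicodedata.bidirectional` returns the bidi class of a character. The sketch below only reads the property; it does not run the reordering algorithm itself.

```python
import unicodedata

samples = {
    "A": "Latin letter",
    "\u05D0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "\u0661": "Arabic-Indic digit one",
    "1": "European digit",
    " ": "space",
    "!": "exclamation mark",
}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label:<24} {unicodedata.bidirectional(ch)}")
```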

Grapheme Clusters vs. Codepoints

What users perceive as a single character may be composed of multiple Unicode codepoints. A grapheme cluster can include a base character followed by combining marks (accent or diacritic), emoji with skin-tone modifiers (U+1F3FB–U+1F3FF), or emoji joined by Zero-Width Joiners (U+200D) such as the family emoji sequence. This means that string.length in JavaScript does not necessarily equal the number of visible characters. Use the Intl.Segmenter API (available in modern browsers) or a library like grapheme-splitter for accurate grapheme counting in applications.
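
The gap between codepoints and perceived characters can be shown with a rough grouping pass. The Python sketch below is only an approximation that handles the three cases named above (combining marks, skin-tone modifiers, and ZWJ sequences). It is not the full segmentation algorithm from Unicode Standard Annex #29 and will mis-segment other text such as flag sequences. `extends_cluster` and `rough_clusters` are hypothetical names.

```python
import unicodedata

ZWJ = "\u200D"


def extends_cluster(ch: str) -> bool:
    """Approximate check: does ch attach to the preceding character?"""
    return (
        unicodedata.category(ch).startswith("M")  # combining marks, variation selectors
        or ch == ZWJ                               # zero-width joiner
        or 0x1F3FB <= ord(ch) <= 0x1F3FF           # emoji skin-tone modifiers
    )


def rough_clusters(text: str) -> list:
    """Group codepoints into approximate grapheme clusters (NOT full UAX #29)."""
    clusters = []
    for ch in text:
        if clusters and (extends_cluster(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch   # glue onto the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl
print(len(family))                  # 5 codepoints
print(len(rough_clusters(family)))  # 1 approximate cluster
```

For production use, a complete implementation such as Intl.Segmenter remains the safer choice.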

Practical Applications

The Unicode inspector is useful for debugging text that looks correct but causes problems in code. Common scenarios include: detecting invisible characters (U+200B zero-width space, U+FEFF byte order mark) that make string comparisons fail unexpectedly; identifying text decoded with the wrong encoding (the garbled result, known as "mojibake", often appears as sequences of Latin-1 Supplement characters); checking whether a string contains right-to-left override characters (U+202E) that might be used maliciously; and listing the codepoints in a string so you can check that a font covers all of them before rendering.
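
The first and third scenarios can be approximated with a category scan. The Python sketch below flags format (Cf) and control (Cc) characters other than common whitespace; that filter is an assumption chosen for illustration, and `find_suspicious` is a hypothetical helper name.

```python
import unicodedata


def find_suspicious(text: str) -> list:
    """Return (index, 'U+XXXX', name) for format and control characters in text."""
    hits = []
    for i, ch in enumerate(text):
        # Cf covers U+200B, U+FEFF and U+202E; tab and newlines are deliberately allowed.
        if unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\n\r\t":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


sample = "\ufeffuser\u200bname\u202e"  # BOM, zero-width space, right-to-left override
for hit in find_suspicious(sample):
    print(hit)
```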

Frequently Asked Questions

What is the Unicode general category?

The Unicode general category classifies every codepoint into one of 30 types. Major categories include Letter (L), Mark (M), Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). Subcategories are more specific: for example, Lu = Uppercase Letter, Sm = Math Symbol, Po = Other Punctuation. Categories are used by regex engines (e.g., \p{L} matches any letter) and text-processing algorithms.

What is a surrogate pair in UTF-16?

UTF-16 uses surrogate pairs to encode codepoints above U+FFFF. A high surrogate (U+D800–U+DBFF) is followed by a low surrogate (U+DC00–U+DFFF) to form a pair representing one supplementary character. JavaScript strings use UTF-16 internally, so emoji and some other characters appear as two code units. Use codePointAt() instead of charCodeAt() to get the actual codepoint.

What does the bidirectional class mean?

The bidi class tells the Unicode Bidirectional Algorithm how to display a character when left-to-right and right-to-left text are mixed. Common classes are L (Left-to-Right), R (Right-to-Left), AL (Arabic Letter), AN (Arabic Number), and WS (Whitespace). The algorithm uses these to determine the correct visual display order in multilingual text.

Why does my emoji show two codepoints?

Most emoji have codepoints above U+FFFF, which requires surrogate pairs in UTF-16. JavaScript string methods see them as two code units. Some emoji are actually sequences (a base emoji combined with skin-tone modifiers, gender signs, or Zero-Width Joiners, U+200D), which appear as multiple separate codepoints that combine into one visible glyph.

What is the difference between a codepoint and a grapheme cluster?

A codepoint is a single Unicode number. A grapheme cluster is one or more codepoints that together form a single perceived character. For example, the letter é can be encoded as U+00E9 (one codepoint) or as U+0065 U+0301 (e + combining acute accent, two codepoints). Both look identical but have different byte sequences. Use Intl.Segmenter for accurate grapheme counting.
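
The two encodings of é can be compared directly. The Python sketch below prints their UTF-8 bytes and then, as an addition that goes beyond the text above, shows that Unicode NFC normalization maps the two-codepoint form onto the single-codepoint form.

```python
import unicodedata

composed = "\u00E9"     # é as one codepoint
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)       # False: different codepoints
print(composed.encode("utf-8"))     # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'

# Normalizing to NFC composes e + U+0301 into U+00E9, so the strings then match.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```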