Unicode Character Property Inspector
Paste any text to see per-character Unicode properties: codepoint, UTF-8, UTF-16, category, block, and bidirectional class.
| Glyph | Codepoint | Decimal | Name | UTF-8 | UTF-16 | Category | Block | Bidi |
|---|---|---|---|---|---|---|---|---|
How to Use the Unicode Property Inspector
- Inspect mode — paste any text into the input area. The table below shows every character with its codepoint, UTF-8 bytes, UTF-16 representation, Unicode category, block, and bidirectional class. Click any glyph cell to copy that character.
- Compare mode — paste two strings that look identical but may differ. The tool highlights characters with different codepoints, revealing homoglyphs or invisible characters.
- Encode mode — type a single character to see all its encoding representations: decimal, hex, HTML entity, JavaScript escape, Python escape, CSS escape, and raw bytes.
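The Compare and Encode modes can be approximated in a few lines of Python. This is only a sketch of the idea, not the tool's actual implementation: the function names `compare` and `encodings`, the dictionary keys, and the choice of the `\u{...}` form for the JavaScript escape are all assumptions made for illustration.

```python
from itertools import zip_longest


def _label(ch):
    """Format one character as U+XXXX, or '(none)' past the end of a string."""
    return f"U+{ord(ch):04X}" if ch is not None else "(none)"


def compare(a, b):
    """Hypothetical Compare mode: list (index, codepoint_a, codepoint_b)
    for every position where the two strings differ."""
    return [
        (i, _label(x), _label(y))
        for i, (x, y) in enumerate(zip_longest(a, b))
        if x != y
    ]


def encodings(ch):
    """Hypothetical Encode mode: common representations of one codepoint.
    Lone surrogates are not handled (they cannot be encoded as UTF-8)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one codepoint")
    cp = ord(ch)
    return {
        "decimal": str(cp),
        "hex": f"U+{cp:04X}",
        "html": f"&#x{cp:X};",
        "javascript": f"\\u{{{cp:X}}}",  # ES2015 code point escape
        "python": f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}",
        "css": f"\\{cp:X} ",  # trailing space terminates the CSS escape
        "utf8": ch.encode("utf-8").hex(" ").upper(),
    }
```

For example, `compare("paypal", "p\u0430ypal")` reports a single difference at index 1, where Latin U+0061 was replaced by the look-alike Cyrillic U+0430.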
Unicode Character Properties Explained
Every Unicode character carries a rich set of properties beyond just its visual shape. The most useful for developers are the general category, block, script, and bidirectional class. Understanding these properties is essential when building text-processing systems, input validation, search engines, or any application that needs to handle multilingual text correctly.
General Category
The general category property assigns every codepoint one of 30 two-letter values. The broad groups are: Letter (L), Mark (M), Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). Letters are further divided into Uppercase (Lu), Lowercase (Ll), Titlecase (Lt), Modifier (Lm), and Other Letter (Lo). Numbers include Decimal Digit (Nd), Letter Number (Nl), and Other Number (No). Categories matter most in regular expressions: the pattern \p{L} matches any letter, regardless of script, in Unicode-aware regex engines. In JavaScript, property escapes such as \p{L} only work when the regular expression carries the u (or v) flag.
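As an illustration of category lookups, here is a minimal Python sketch using the standard `unicodedata` module. Python's built-in `re` module does not support `\p{L}`, so the sketch tests the category prefix directly; the helper names `is_letter` and `letters_only` are invented for this example.

```python
import unicodedata

# Major class letter -> broad group name, as listed in the text above.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def describe_category(ch):
    """Return (two-letter category, broad group name) for one character."""
    cat = unicodedata.category(ch)
    return cat, MAJOR_CLASSES[cat[0]]


def is_letter(ch):
    """Rough equivalent of matching \\p{L} against a single character."""
    return unicodedata.category(ch).startswith("L")


def letters_only(text):
    """Keep only characters whose general category is L*."""
    return "".join(ch for ch in text if is_letter(ch))


for ch in ["A", "a", "\u01c5", "\u4e2d", "5", "\u2163", "\u00bd", " "]:
    print(f"U+{ord(ch):04X}", *describe_category(ch))
```

The results depend on the Unicode database version bundled with the Python interpreter, which `unicodedata.unidata_version` reports.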
Unicode Blocks
Blocks are contiguous named ranges of codepoints. Unlike scripts, blocks do not always align perfectly with a single writing system, but they are a useful navigation aid. The Basic Latin block (U+0000–U+007F) corresponds to ASCII. The Miscellaneous Symbols block (U+2600–U+26FF) contains weather symbols, card suits, and other decorative characters. Knowing the block of an unknown character helps you understand its intended use and find similar characters nearby.
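Because blocks are contiguous, non-overlapping ranges, finding the block of a codepoint reduces to a binary search over the range starts. The sketch below assumes a small hand-written excerpt of the block list (a real tool would load the full list from the Unicode Character Database's Blocks.txt); `BLOCKS` and `block_of` are hypothetical names.

```python
from bisect import bisect_right

# Excerpt only: (first codepoint, last codepoint, block name), sorted by start.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch):
    """Return the block name for one character, or a placeholder when the
    codepoint falls outside the excerpt above."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        _, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return "(not in this excerpt)"


print(block_of("A"))       # Basic Latin
print(block_of("\u2660"))  # Miscellaneous Symbols (black spade suit)
```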
UTF-8 and UTF-16 Encoding
The inspector shows the raw byte sequences for both UTF-8 and UTF-16. UTF-8 uses 1 byte for ASCII characters (U+0000–U+007F), 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for supplementary characters (U+10000–U+10FFFF). UTF-16 uses one 16-bit code unit (2 bytes) for each character in the Basic Multilingual Plane and a surrogate pair (two code units, 4 bytes) for each supplementary character. JavaScript's charCodeAt() returns a single UTF-16 code unit, while codePointAt() returns the full Unicode codepoint when the index points at the start of a surrogate pair. This distinction matters when processing emoji or historic script characters, many of which lie outside the Basic Multilingual Plane.
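The byte counts and the surrogate-pair arithmetic can be checked with a short Python sketch. The function names are illustrative only; the surrogate formula is the standard UTF-16 one, and the sketch compares it against Python's own encoder.

```python
def utf8_bytes(ch):
    """UTF-8 byte sequence for one character."""
    return ch.encode("utf-8")


def utf16_units(ch):
    """UTF-16 code units (as integers) for one character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def surrogate_pair(cp):
    """Compute the (high, low) surrogates for a supplementary codepoint."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary codepoints need surrogates")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_bytes(ch).hex(' ').upper()}  UTF-16: {units}")

# The hand-computed pair matches the encoder's output for U+1F600.
assert list(surrogate_pair(0x1F600)) == utf16_units("\U0001F600")
```

In this sketch, `len(utf16_units(ch))` plays the role of JavaScript's `length` for a one-character string, while `ord(ch)` corresponds to `codePointAt(0)`.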
Bidirectional Class
The Unicode Bidirectional Algorithm (UBA, defined in Unicode Standard Annex #9) governs how characters are ordered for display when left-to-right and right-to-left text appear in the same paragraph. Every character has a bidi class. The most common ones are L (Left-to-Right, for Latin, Greek, etc.), R (Right-to-Left, for Hebrew), AL (Arabic Letter), AN (Arabic Number), EN (European Number), and neutral classes such as WS (Whitespace) and ON (Other Neutral). The algorithm resolves the display order from the sequence of these classes. Incorrect bidi handling is a common source of rendering bugs in multilingual applications.
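To see the classes listed above for real characters, the following Python sketch reads them from the standard `unicodedata` module. It only looks up each character's class; it does not implement the reordering algorithm itself. The helper names `bidi_classes` and `contains_rtl` are assumptions for this example.

```python
import unicodedata


def bidi_classes(text):
    """Return (codepoint label, bidi class) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


def contains_rtl(text):
    """True if any character is strongly right-to-left (class R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


# Latin A, Hebrew alef, Arabic alef, Arabic-Indic digit one, ASCII 1, space, "!"
sample = "A\u05d0\u0627\u06611 !"
for label, cls in bidi_classes(sample):
    print(label, cls)
```

A check like `contains_rtl` is one simple way to decide whether a string needs bidi-aware layout at all before handing it to a full UBA implementation.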
Grapheme Clusters vs. Codepoints
What users perceive as a single character may be composed of multiple Unicode codepoints. A grapheme cluster can consist of a base character followed by combining marks (accents or other diacritics), an emoji followed by a skin-tone modifier (U+1F3FB–U+1F3FF), or several emoji joined by Zero-Width Joiners (U+200D), as in the family emoji sequences. As a result, string.length in JavaScript, which counts UTF-16 code units, often differs from the number of characters a user sees. Use the Intl.Segmenter API (available in modern browsers) or a library like grapheme-splitter for accurate grapheme counting in applications.
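The gap between codepoint count and perceived characters is easy to demonstrate. Python's standard library has no grapheme segmenter, so the sketch below uses a deliberately rough approximation that covers only the three cases named above (combining marks, skin-tone modifiers, ZWJ sequences). It is not an implementation of the full segmentation rules in Unicode Standard Annex #29 and, for instance, miscounts flag emoji and Hangul jamo sequences. The name `approx_grapheme_count` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def approx_grapheme_count(text):
    """Rough count of user-perceived characters (approximation only)."""
    count = 0
    prev = ""
    for ch in text:
        extends_cluster = (
            unicodedata.category(ch).startswith("M")  # combining marks, U+FE0F
            or ch == ZWJ                              # joiner itself
            or "\U0001F3FB" <= ch <= "\U0001F3FF"     # skin-tone modifiers
            or prev == ZWJ                            # emoji glued by a joiner
        )
        if not extends_cluster or count == 0:
            count += 1
        prev = ch
    return count


accented = "e\u0301"                                    # e + combining acute
thumbs_up = "\U0001F44D\U0001F3FD"                      # thumbs up + skin tone
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"   # man ZWJ woman ZWJ girl

for s in (accented, thumbs_up, family):
    print(len(s), "codepoints ->", approx_grapheme_count(s), "perceived")
```

Note that Python's `len()` counts codepoints, not UTF-16 code units, so the family sequence has length 5 here but length 8 in JavaScript.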
Practical Applications
The Unicode inspector is most useful for debugging text that looks correct on screen but misbehaves in code. Common scenarios include: detecting invisible characters (U+200B zero-width space, U+FEFF byte order mark) that make string comparisons fail unexpectedly; identifying text that was decoded with the wrong encoding (the garbled result, known as "mojibake", often appears as sequences of Latin-1 Supplement characters, such as "Ã©" where "é" was intended); checking whether a string contains a right-to-left override character (U+202E) that might be used maliciously to disguise its content; and listing the codepoints in a string so you can verify that a font covers all of them before rendering.
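Two of these checks can be sketched in Python. The first flags format characters (general category Cf, which includes U+200B, U+FEFF, and U+202E) and control characters other than tab and line breaks. The second attempts the common repair for UTF-8 text that was mistakenly decoded as Latin-1. Both function names are invented for this example, and the repair heuristic is an assumption that only fits that one kind of mojibake.

```python
import unicodedata


def find_hidden(text):
    """Return (index, codepoint, name) for invisible format/control characters."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat == "Cf" or (cat == "Cc" and ch not in "\t\n\r"):
            name = unicodedata.name(ch, "<unnamed control>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


def try_fix_mojibake(text):
    """Undo 'UTF-8 bytes read as Latin-1'; return text unchanged if that
    interpretation does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(find_hidden("abc\u200bdef"))       # zero-width space at index 3
print(find_hidden("user\u202etxt.exe"))  # right-to-left override at index 4
print(try_fix_mojibake("caf\u00c3\u00a9"))
```

Flagging every Cf character is intentionally broad: it also reports legitimate uses such as the Zero-Width Joiner inside emoji sequences, so a real checker would likely need an allow-list.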