Homoglyph Detector
Detect Unicode look-alike characters in text, URLs, and emails. Generate confusable strings for security testing. Compare visually identical strings.
What Is a Homoglyph Attack?
A homoglyph attack (also called a homograph attack or Unicode spoofing) exploits the visual similarity between characters from different scripts. The most common attack vector is domain name spoofing: an attacker registers a domain that looks identical to a legitimate domain but uses Cyrillic, Greek, or other Unicode characters instead of ASCII letters. For example, the Cyrillic letter "а" (U+0430) is indistinguishable from the Latin letter "a" (U+0061) in most fonts. A domain like pаypal.com using the Cyrillic а would look exactly like paypal.com to most users.
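The difference is invisible on screen but obvious at the code point level. A minimal Python sketch, using only the standard library, that lists what each character in a string actually is (the helper name `describe` is an illustrative choice, not part of this tool):

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(genuine == spoofed)  # False
for ch, code_point, name in describe(spoofed):
    print(ch, code_point, name)
```

The second row of the output names the Cyrillic letter, even though the two strings render the same in most fonts.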
Common Confusable Character Pairs
The most frequently exploited homoglyphs are cross-script lookalikes between Latin and other alphabets. Latin "a" (U+0061) is confused with Cyrillic "а" (U+0430) and Greek "α" (U+03B1). Latin "c" (U+0063) looks identical to Cyrillic "с" (U+0441). Latin "e" (U+0065) is confused with Cyrillic "е" (U+0435). Latin "o" (U+006F) resembles Cyrillic "о" (U+043E), Greek "ο" (U+03BF), and Armenian "օ" (U+0585). Latin "p" (U+0070) looks like Cyrillic "р" (U+0440). Latin "x" (U+0078) resembles Cyrillic "х" (U+0445). These pairings are documented in the official Unicode confusables.txt data file maintained by the Unicode Consortium.
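The pairs listed above can be captured in a small lookup table. The sketch below is a hand-written subset for illustration only, not the full confusables.txt data, and its `skeleton` function is a simplified stand-in for the skeleton algorithm defined in Unicode Technical Standard #39 (it skips the normalization steps):

```python
# Hand-written subset of confusable pairs; the real confusables.txt is far larger.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u03b1": "a",  # Greek α
    "\u0441": "c",  # Cyrillic с
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u03bf": "o",  # Greek ο
    "\u0585": "o",  # Armenian օ
    "\u0440": "p",  # Cyrillic р
    "\u0445": "x",  # Cyrillic х
}

def skeleton(text: str) -> str:
    """Replace each known lookalike with its Latin counterpart."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

def find_confusables(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Latin lookalike) for each suspicious character."""
    return [
        (i, ch, CONFUSABLES[ch])
        for i, ch in enumerate(text)
        if ch in CONFUSABLES
    ]

spoofed = "p\u0430yp\u0430l"
print(skeleton(spoofed) == "paypal")  # True
print(find_confusables(spoofed))
```

Two strings whose skeletons match but whose code points differ are candidates for a visual collision.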
How to Detect Homoglyph Attacks
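The primary detection method is checking for mixed-script identifiers: strings that contain characters from more than one Unicode script. A domain like "pаypal.com" that mixes Latin and Cyrillic is a strong indicator of a spoofing attempt. Modern browsers typically display such mixed-script domains in Punycode (xn-- notation) to warn users. A mixed-script check does not catch whole-script spoofs, where every letter of a label comes from a single non-Latin script, so it should be paired with a confusables check. At the application level, Unicode NFKC normalization folds compatibility variants such as fullwidth letters and ligatures, but it does not map Cyrillic or Greek letters to their Latin lookalikes, so normalizing and comparing code points is not sufficient on its own. To catch cross-script lookalikes, check every character against the known mappings in the Unicode Consortium's confusables data. This tool implements a subset of the most dangerous confusable pairs for quick detection.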
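A mixed-script check can be sketched in Python as follows. The standard `unicodedata` module does not expose the Unicode Script property, so this sketch approximates the script by taking the first word of each letter's character name (for example LATIN, CYRILLIC, GREEK). That heuristic is an assumption made for illustration and misclassifies some characters, such as fullwidth letters, whose names begin with FULLWIDTH; a production check would use real Script property data.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate the set of scripts in text from character names.

    Only alphabetic characters are considered, so digits, dots,
    and hyphens do not count as a separate script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1

print(is_mixed_script("paypal.com"))        # False
print(is_mixed_script("p\u0430ypal.com"))   # True

# NFKC folds compatibility forms such as a fullwidth "p" (U+FF50) ...
print(unicodedata.normalize("NFKC", "\uff50aypal") == "paypal")   # True
# ... but leaves the Cyrillic "а" (U+0430) untouched.
print(unicodedata.normalize("NFKC", "p\u0430ypal") == "paypal")   # False
```

The last two lines show why NFKC normalization alone is not enough for cross-script lookalikes.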
Protecting Against Homoglyph Attacks
For web applications, always display the Punycode representation of domain names when they contain non-ASCII characters. For user input validation, reject or warn when identifiers contain mixed scripts. Implement email authentication (DMARC, DKIM, SPF) to prevent spoofing of your own domain; note that these mechanisms do not stop mail sent from a separately registered lookalike domain, which can publish valid records of its own. For brand protection, register defensive IDN variants of your domain. ICANN and major registrars have policies restricting IDN registrations that could be confused with existing well-known domains, but coverage is incomplete. Certificate Transparency logs, searchable through services such as crt.sh, can be monitored for suspicious certificates issued for lookalike domains.
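As one way to show the Punycode form, the sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules (RFC 3490). The function name `to_display_form` is hypothetical, and applications that need current IDNA 2008 behavior would likely reach for a dedicated library instead.

```python
def to_display_form(domain: str) -> str:
    """Return the ASCII (Punycode) form of a domain for display.

    Pure-ASCII domains are returned unchanged; anything else is
    converted label by label into xn-- notation.
    """
    if domain.isascii():
        return domain
    return domain.encode("idna").decode("ascii")

print(to_display_form("paypal.com"))        # paypal.com
print(to_display_form("p\u0430ypal.com"))   # an xn-- label followed by .com
```

Showing the xn-- form next to, or instead of, the rendered name gives users a visible cue that the domain is not plain ASCII.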
Security Research Applications
The Generate mode in this tool allows security researchers to create homoglyph variants of strings for testing purposes. This is useful for testing email filters, URL scanners, domain name validation logic, and phishing detection systems. When building these defenses, it is important to test with a comprehensive set of confusables rather than just the most obvious Latin-Cyrillic pairs. The Unicode Consortium's confusables.txt lists thousands of character pairs across all scripts.
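For test fixtures, variants can be produced by substituting lookalikes position by position. The sketch below is not this tool's Generate mode; it is an illustrative generator built on a small hand-written table, with a `limit` parameter (an assumed safeguard) because the number of combinations grows multiplicatively with string length.

```python
from itertools import product

# Hand-written subset of lookalikes per Latin letter, for illustration only.
VARIANTS = {
    "a": ["\u0430", "\u03b1"],            # Cyrillic а, Greek α
    "c": ["\u0441"],                      # Cyrillic с
    "e": ["\u0435"],                      # Cyrillic е
    "o": ["\u043e", "\u03bf", "\u0585"],  # Cyrillic о, Greek ο, Armenian օ
    "p": ["\u0440"],                      # Cyrillic р
    "x": ["\u0445"],                      # Cyrillic х
}

def generate_variants(text: str, limit: int = 1000) -> list[str]:
    """Return up to `limit` homoglyph variants of text, excluding text itself."""
    # Each position may keep its original character or take any lookalike.
    choices = [[ch] + VARIANTS.get(ch, []) for ch in text]
    variants = []
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate != text:
            variants.append(candidate)
            if len(variants) >= limit:
                break
    return variants

for variant in generate_variants("pay"):
    print(variant, [f"U+{ord(ch):04X}" for ch in variant])
```

Feeding each variant to a filter or validator under test shows which substitutions it misses.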
Zero-Width and Invisible Characters
In addition to look-alike characters, attackers also use invisible Unicode characters like Zero Width Space (U+200B), Zero Width Non-Joiner (U+200C), and the Soft Hyphen (U+00AD) to create strings that look identical visually but differ at the byte level. These are especially dangerous in URLs, where they can bypass regex-based blacklists, and in email addresses, where they can defeat spam filters. Our Zero-Width Encoder tool demonstrates this technique.
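A check for the three invisible characters named above can be sketched as follows. The table covers only those three code points and is not a complete list of invisible or format characters. Stripping is shown for comparison purposes; Zero Width Non-Joiner has legitimate uses in some scripts, so flagging may be preferable to silent removal.

```python
# Only the characters named in the text; many other invisible code points exist.
INVISIBLE = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u00ad": "SOFT HYPHEN",
}

def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for each invisible character found."""
    return [
        (i, f"U+{ord(ch):04X}", INVISIBLE[ch])
        for i, ch in enumerate(text)
        if ch in INVISIBLE
    ]

def strip_invisible(text: str) -> str:
    """Remove the listed invisible characters, e.g. before matching a blocklist."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

url = "pay\u200bpal.com"
print(url == "paypal.com")                   # False
print(find_invisible(url))                   # [(3, 'U+200B', 'ZERO WIDTH SPACE')]
print(strip_invisible(url) == "paypal.com")  # True
```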