UtilityKit

500+ fast, free tools. Most run in your browser only; Image & PDF tools upload files to the backend when you run them.

Unicode Normalizer

Normalize Unicode text across NFC, NFD, NFKC, and NFKD with diff summary.

About Unicode Normalizer

The same visible character can be encoded multiple ways in Unicode — an 'é' can be stored as a single precomposed code point (U+00E9) or as a base 'e' followed by a combining acute accent (U+0301). This matters more than it appears: two strings that look identical on screen may fail a strict equality check, break text search, or produce duplicate entries in a database because they use different normalization forms. Unicode Normalizer converts text between the four Unicode normalization forms defined by the Unicode Standard — NFC, NFD, NFKC, and NFKD — and shows a summary of how many characters changed. NFC is the preferred form for most storage and interchange; NFKC additionally maps compatibility characters like full-width letters and ligatures to their canonical equivalents.
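The precomposed/decomposed mismatch is easy to reproduce. A minimal sketch using Python's standard `unicodedata` module (the browser equivalent is `String.prototype.normalize`):

```python
import unicodedata

precomposed = "caf\u00e9"    # é stored as the single code point U+00E9
decomposed = "cafe\u0301"    # e followed by combining acute accent U+0301

# Identical on screen, but not equal and not the same length
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 4 5

# Normalizing both strings to NFC makes them compare equal
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                      # True
```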

Why use Unicode Normalizer

All Four Unicode Normal Forms

Convert to NFC, NFD, NFKC, or NFKD — the complete set of Unicode normalization forms defined by the standard.

Change Summary Display

See exactly how many characters were added, removed, or replaced during normalization so nothing changes unexpectedly.

Database & Search Consistency

Normalize before storing text to prevent the same visible string from producing duplicate records or search misses.

Compatibility Mapping (NFKC/NFKD)

NFKC maps full-width characters, Roman numerals, ligatures, and compatibility forms to their standard equivalents.

NFC for Storage & Interchange

NFC is the recommended form for most web content, JSON, and database storage — precomposed characters use fewer code units.

Invisible Issue Detection

Normalization reveals combining characters and compatibility variants that are otherwise invisible in rendered text.

How to use Unicode Normalizer

  1. Paste your text into the input area.
  2. Select the target normalization form: NFC, NFD, NFKC, or NFKD.
  3. The normalized output appears instantly with a change summary.
  4. Compare the original and normalized character counts to understand the transformation.
  5. Use the diff summary to see which characters were affected.
  6. Click Copy to copy the normalized text to your clipboard.
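The steps above can be sketched in a few lines of Python. The tool's exact diff algorithm is not documented, so this sketch approximates the change summary with the stdlib `difflib.SequenceMatcher`:

```python
import unicodedata
from difflib import SequenceMatcher

def normalize_with_summary(text: str, form: str = "NFC"):
    """Normalize text and count characters added, removed, or replaced."""
    normalized = unicodedata.normalize(form, text)
    summary = {"added": 0, "removed": 0, "replaced": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(None, text, normalized).get_opcodes():
        if op == "insert":
            summary["added"] += j2 - j1
        elif op == "delete":
            summary["removed"] += i2 - i1
        elif op == "replace":
            summary["replaced"] += max(i2 - i1, j2 - j1)
    return normalized, summary

out, changes = normalize_with_summary("cafe\u0301", "NFC")  # NFD input
print(out, changes)
```

Here the decomposed `e` + U+0301 pair is replaced by the single precomposed é, so the summary reports two characters replaced.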

When to use Unicode Normalizer

  • When preparing user-input text for database storage to ensure consistent string comparison.
  • When building a search index and needing all text variants to match the same normalized form.
  • When comparing two strings that look identical but fail equality checks due to different normalization.
  • When processing text from multiple sources (APIs, OCR, clipboard) that may use different normalization conventions.
  • When cleaning up text containing full-width Unicode characters from East Asian keyboard input that should be standard ASCII.
  • When preparing multilingual text for NLP pipelines that require a consistent Unicode representation.

Examples

NFC precomposition

Input: café (stored as e + combining acute accent U+0301, i.e. NFD form)

Output: NFC: café (4 characters; é as the single code point U+00E9). NFD: café (5 characters; e plus combining accent)

NFKC compatibility mapping

Input: Ａｂｃ１２３ (full-width Unicode)

Output: NFKC: Abc123 (standard ASCII equivalents)

Ligature normalization

Input: ﬁle and ﬂow (ﬁ and ﬂ ligatures)

Output: NFKC: file and flow
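Both compatibility examples can be verified with Python's `unicodedata`. Note that NFC leaves compatibility characters alone; only the K forms map them:

```python
import unicodedata

# Full-width characters map to their ASCII equivalents under NFKC
fullwidth = "\uff21\uff42\uff43\uff11\uff12\uff13"   # Ａｂｃ１２３
print(unicodedata.normalize("NFKC", fullwidth))       # Abc123

# Ligatures expand under NFKC but survive NFC untouched
print(unicodedata.normalize("NFKC", "\ufb01le"))      # file
print(unicodedata.normalize("NFC", "\ufb01le"))       # ﬁle (unchanged)
```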

Tips

  • Always apply NFC normalization to user-submitted text before storing it in a database to prevent identical-looking duplicate records.
  • Use NFKC when sanitizing text from East Asian input methods where full-width ASCII characters are common.
  • The change summary helps you audit whether normalization accidentally modified text you did not intend to change.
  • For text search, normalize both the query string and the indexed content to the same form (NFC or NFKC) so searches never miss matches due to form differences.
  • Pair Unicode normalization with Hidden Character Detector to catch zero-width spaces and other invisible characters that normalization does not remove.
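The search tip above is worth seeing in action. A small sketch with hypothetical document data, assuming the indexed text arrived in NFD (common for macOS filenames) while the query is NFC:

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

documents = ["re\u0301sume\u0301 tips", "plain text"]   # NFD "résumé"
query = "r\u00e9sum\u00e9"                              # NFC "résumé"

# Raw substring search misses: the code point sequences differ
raw_hit = any(query in doc for doc in documents)

# Normalizing both sides to the same form restores the match
norm_hit = any(nfc(query) in nfc(doc) for doc in documents)
print(raw_hit, norm_hit)
```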

Frequently Asked Questions

What is the difference between NFC and NFD?
NFC (Canonical Decomposition, followed by Canonical Composition) represents characters in their precomposed form — 'é' as a single code point (U+00E9). NFD (Canonical Decomposition) decomposes characters into their base letter plus combining marks — 'é' becomes 'e' + combining acute (U+0301).
What does NFKC add over NFC?
NFKC additionally applies compatibility mappings: full-width ASCII characters (Ａ → A), circled numbers (① → 1), ligatures (ﬁ → fi), superscripts (² → 2), and other compatibility forms are replaced with their standard equivalents.
Which normalization form should I use for most cases?
NFC is the recommended default for web content, JSON data, and database storage. It uses precomposed characters that are compact and widely compatible. NFKC is better when you need to strip compatibility variants from user input.
Can normalization change the visible appearance of text?
NFC and NFD do not change how text renders; only the internal encoding differs. NFKC and NFKD can change rendering by replacing ligatures, full-width characters, and other compatibility forms with their plain equivalents.
Why do two visually identical strings sometimes fail equality checks?
Because one may be NFC-normalized (precomposed) and the other NFD-normalized (decomposed). String equality in most systems is byte-level comparison, so 'é' as U+00E9 and 'é' as U+0065+U+0301 are not equal.
Does normalization work on emoji?
Most emoji code points have no decompositions, so normalization leaves them unchanged. Sequences built with zero-width joiners or skin tone modifiers also pass through all four forms intact; normalization does not merge, split, or reorder them.
Does Unicode normalization fix encoding issues like mojibake?
No. Mojibake (garbled text from encoding mismatch, e.g., UTF-8 read as Latin-1) is an encoding problem that requires re-encoding the bytes, not Unicode normalization. Use proper encoding detection tools for that.
Is this tool useful for programming language identifiers?
Yes. Python 3, JavaScript, and other modern languages allow Unicode in identifiers, but normalization ensures that visually identical variable names using different codepoint sequences are treated consistently.
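Python in particular normalizes identifiers to NFKC at parse time (PEP 3131), so a full-width variable name and its ASCII form refer to the same binding. A small demonstration, using `exec` only to keep the example self-contained:

```python
# The source below defines 'ａ' (full-width a, U+FF41); the parser
# normalizes the identifier to NFKC, so plain 'a' reads the same variable.
src = "\uff41 = 1\nresult = a + 1\n"
ns = {}
exec(src, ns)
print(ns["result"])  # 2
```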

Glossary

NFC (Canonical Decomposition + Composition)
Unicode normalization form that represents characters in precomposed form where possible. The recommended form for most text storage and interchange.
NFD (Canonical Decomposition)
Unicode normalization form that decomposes precomposed characters into their base characters and combining marks.
NFKC (Compatibility Decomposition + Composition)
Like NFC but also maps compatibility variants (full-width characters, ligatures, circled numbers) to their standard equivalents.
NFKD (Compatibility Decomposition)
Like NFD but also applies compatibility mappings. The most decomposed form, mapping all compatibility and precomposed characters to base forms.
Combining character
A Unicode character that modifies the preceding base character. The acute accent U+0301 is a combining character that turns 'e' into 'é' in NFD form.
Canonical equivalence
Two Unicode strings are canonically equivalent if they represent the same abstract character sequence, even if the underlying code point sequences differ due to normalization.