Scope: std::string is a byte container — a sequence of bytes is not the same thing as a sequence of human-readable text. The gap between them is where subtle (and sometimes serious) bugs hide. This is Part 1 of 2, covering normalization and case-folding: why the C++ standard doesn't solve these problems, and how ICU provides production-ready solutions with working code and benchmarks.
TL;DR — Normalize to NFC before comparing or hashing. Use foldCase() instead of tolower. If a human typed it, use ICU.
Why the standard library doesn't solve this
The C++ standard's fundamental constraint is that Unicode compliance requires runtime data that evolves independently of the language standard.
The Unicode Character Database (UCD) — which defines normalization mappings, case-folding tables, grapheme break properties, collation rules, and more — is updated with every Unicode release. Unicode 17.0 shipped in September 2025, and ICU 78 (released October 2025) brought support for it. A standard library that bundled these tables would be forced to track an external release cycle. Different vendors could ship different data versions, leading to non-portable behavior across conforming implementations.
The standard library does provide typed string variants: std::u8string (char8_t, C++20), std::u16string (char16_t, C++11), and std::u32string (char32_t, C++11). These solve an encoding ambiguity problem — a std::u8string at least guarantees UTF-8 storage, unlike the unspecified encoding of std::string. But they provide no operations beyond what std::string already has. You can store a UTF-8 string in a std::u8string and still have no way to normalize it, case-fold it, or iterate its grapheme clusters. std::wstring (wchar_t) is worse: wchar_t is 2 bytes on Windows (UTF-16) and 4 bytes on Linux/macOS (UTF-32), making any code that assumes a particular width non-portable.
C11's <uchar.h> (available in C++ via <cuchar>) goes one level deeper: mbrtoc32(), mbrtoc16(), and the C23 addition mbrtoc8() convert between narrow multibyte strings and fixed-width Unicode encodings, with mbstate_t tracking conversion state across calls for variable-length sequences. These are useful low-level primitives for decoding — mbrtoc32() will hand you a code point. What to do with that code point once you have it — whether it needs normalization, how to compare it, whether it forms part of a grapheme cluster — is entirely outside their scope.
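A decoding sketch built on these primitives (the helper name and error handling are mine, and it assumes the process has switched to a UTF-8 locale, since mbrtoc32() interprets bytes according to the current C locale):

```cpp
#include <clocale>
#include <cstddef>
#include <cuchar>
#include <string>
#include <vector>

// Decode a narrow multibyte string (UTF-8 under a UTF-8 locale) into
// code points with mbrtoc32(). mbstate_t carries partial-sequence state
// across calls. Returns an empty vector on malformed or truncated input.
std::vector<char32_t> DecodeToCodePoints(const std::string &s)
{
    std::vector<char32_t> out;
    std::mbstate_t state{};
    const char *p = s.data();
    const char *end = p + s.size();
    while (p < end) {
        char32_t cp = 0;
        std::size_t rc =
            std::mbrtoc32(&cp, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-1) ||  // encoding error
            rc == static_cast<std::size_t>(-2))    // incomplete trailing sequence
            return {};
        if (rc == 0)                               // decoded an embedded null byte
            rc = 1;
        out.push_back(cp);
        p += rc;
    }
    return out;
}
```

Calling std::setlocale(LC_ALL, "C.UTF-8") (or "en_US.UTF-8" where that name is unavailable) first is required; in the default "C" locale most implementations reject non-ASCII bytes outright.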
The typed string types and <uchar.h> solve encoding and storage problems. They do not solve text processing problems.
There were attempts at processing. C++11 introduced <codecvt> for encoding conversion, along with std::wstring_convert as a higher-level wrapper. Both were deprecated in C++17 and removed in C++26 via separate papers — P2871R3 for <codecvt> and P2872R3 for wstring_convert. The stated reason in P2871R3, echoed verbatim by cppreference, is unambiguous: "this feature no longer implements the current Unicode Standard, supporting only the obsolete UCS-2 encoding." UCS-2 is a fixed-width 16-bit encoding that predates supplementary characters — it cannot represent anything above U+FFFF, which rules out emoji, many CJK extensions, and four of the scripts added in Unicode 17.0 alone. Their removal was not controversial.
The C++ committee's Study Group 16 (SG16) was formed in 2018 to work on text and Unicode. char8_t landed in C++20. C++26 adds std::text_encoding (<text_encoding>, P1885), which lets you identify and query encodings by name — useful for knowing what a string is, but it provides no processing operations. Normalization, segmentation, and locale-aware comparison have no standardized C++ solution as of C++26.
The normalization problem
Unicode permits multiple equivalent representations of the same text. The letter é can appear as:
- U+00E9 — precomposed (NFC)
- U+0065 U+0301 — base letter + combining acute accent (NFD)
Both forms render identically but compare differently as bytes.
std::string nfc = "\xC3\xA9"; // 2 bytes
std::string nfd = "e\xCC\x81"; // 3 bytes
nfc == nfd; // false
nfc.size() == nfd.size(); // false
This leads to real bugs: search indexes miss matches, std::unordered_set stores duplicates, and equality checks fail unexpectedly.
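The deduplication failure is directly reproducible; CountDistinctBytewise is an illustrative helper:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// A byte-level hash set treats the NFC and NFD spellings of "café" as
// two different keys, so what the user sees as one string is stored twice.
std::size_t CountDistinctBytewise()
{
    std::unordered_set<std::string> seen;
    seen.insert("caf\xC3\xA9");  // NFC: é = U+00E9
    seen.insert("cafe\xCC\x81"); // NFD: e + U+0301
    return seen.size();          // 2, not 1
}
```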
Unicode Normalization defines canonical forms. NFC (Canonical Decomposition followed by Composition) is the most compact and the recommended form for interchange and comparison.
Correct Unicode-aware string comparison
Here is a realistic implementation of caseless string comparison — demonstrated with anagrams, but the principle applies anywhere you need to compare strings while ignoring case and structural separators: checking usernames for duplicates, searching indices, validating user input against stored credentials. The code normalizes to NFC, applies case folding, and filters structural separators using full Unicode General Category data:
#include <unicode/errorcode.h>
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

bool IsAnagramUnicode(const std::string &s1, const std::string &s2)
{
auto NormalizeAndCasefold = [](const std::string &utf8) -> icu::UnicodeString {
icu::ErrorCode ec;
const icu::Normalizer2 *nf = icu::Normalizer2::getNFCInstance(ec);
if (ec.isFailure())
throw std::runtime_error("ICU getNFCInstance failed");
// Strictly convert UTF-8 -> UChar (UTF-16) and detect malformed input.
UErrorCode uerr = U_ZERO_ERROR;
int32_t needed = 0;
// Query needed length; will set an error on malformed input.
u_strFromUTF8(nullptr, 0, &needed, utf8.data(), static_cast<int32_t>(utf8.size()), &uerr);
if (uerr != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(uerr))
throw std::runtime_error("invalid UTF-8");
std::vector<UChar> buf(needed + 1);
uerr = U_ZERO_ERROR;
u_strFromUTF8(buf.data(),
static_cast<int32_t>(buf.size()),
nullptr,
utf8.data(),
static_cast<int32_t>(utf8.size()),
&uerr);
if (U_FAILURE(uerr))
throw std::runtime_error("invalid UTF-8");
icu::UnicodeString src(buf.data(), needed);
icu::UnicodeString normalized;
nf->normalize(src, normalized, ec);
if (ec.isFailure())
throw std::runtime_error("ICU normalize failed");
normalized.foldCase();
return normalized;
};
// Filter separators and punctuation using Unicode General Category.
// Not just ASCII; Unicode has thousands of separator/punctuation characters
// across all scripts. u_charType() handles them all correctly.
auto ShouldSkip = [](UChar32 cp) -> bool {
int type = u_charType(cp);
return type == U_SPACE_SEPARATOR || type == U_LINE_SEPARATOR ||
type == U_PARAGRAPH_SEPARATOR || type == U_CONNECTOR_PUNCTUATION ||
type == U_DASH_PUNCTUATION || type == U_INITIAL_PUNCTUATION ||
type == U_FINAL_PUNCTUATION || type == U_OTHER_PUNCTUATION ||
type == U_START_PUNCTUATION || type == U_END_PUNCTUATION;
};
icu::UnicodeString us1 = NormalizeAndCasefold(s1);
icu::UnicodeString us2 = NormalizeAndCasefold(s2);
std::unordered_map<char32_t, int> cnt;
cnt.reserve(us1.length());
for (int32_t i = 0; i < us1.length();) {
UChar32 cp = us1.char32At(i);
if (!ShouldSkip(cp))
++cnt[static_cast<char32_t>(cp)];
i += U16_LENGTH(cp);
}
for (int32_t i = 0; i < us2.length();) {
UChar32 cp = us2.char32At(i);
if (!ShouldSkip(cp)) {
auto it = cnt.find(static_cast<char32_t>(cp));
if (it == cnt.end()) return false;
if (--it->second == 0) cnt.erase(it);
}
i += U16_LENGTH(cp);
}
return cnt.empty();
}
For input that is known to be ASCII-only, the simple byte-frequency version remains appropriate (cast to unsigned char both when calling the <cctype> classifiers and when indexing; passing a negative signed char is undefined behavior).
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

bool IsAnagram(const std::string &s1, const std::string &s2)
{
std::array<int, 256> cnt {};
for (const auto &c : s1) {
const unsigned char uc = static_cast<unsigned char>(c);
if (!std::isspace(uc) && !std::ispunct(uc))
++cnt[uc];
}
for (const auto &c : s2) {
const unsigned char uc = static_cast<unsigned char>(c);
if (!std::isspace(uc) && !std::ispunct(uc))
--cnt[uc];
}
return std::all_of(cnt.begin(), cnt.end(), [](auto i) { return i == 0; });
}
Case-folding: why tolower is wrong for global text
Turkish has four distinct I characters: I (uppercase dotless), İ (uppercase with dot, U+0130), i (lowercase with dot), ı (lowercase dotless, U+0131). When Turkish text is lowercased, uppercase I maps to ı — not i.
There are two tolower functions in C++, and neither works correctly for Unicode text.
std::tolower(int ch) from <cctype> converts using the currently installed C locale — in the default "C" locale, only ASCII A–Z are affected; everything else is returned unchanged. It also has a documented undefined behavior trap: the int argument must be representable as unsigned char or equal to EOF. Passing a plain signed char with a negative value (any extended character on platforms where char is signed) is undefined behavior — cppreference explicitly requires the cast pattern std::tolower(static_cast<unsigned char>(ch)). Nothing above U+007F is reachable.
std::tolower(CharT, const locale&) from <locale> accepts a locale explicitly, but has two problems of its own documented on cppreference. First, it can only perform 1:1 character mapping: "Only 1:1 character mapping can be performed by this function, e.g. the Greek uppercase letter 'Σ' has two lowercase forms depending on the position in a word: 'σ' and 'ς'." Unicode case mapping is not always 1:1. Second, the type problem: char cannot represent U+0131 (two bytes in UTF-8). Using char32_t sidesteps the width issue but std::ctype<char32_t> is not a required standard specialization — only char and wchar_t are guaranteed — so std::use_facet<std::ctype<char32_t>>(loc) throws std::bad_cast if the locale doesn't provide it. And even if it did, the function signature is CharT tolower(CharT ch, const locale& loc) — it returns a single CharT. It cannot produce multiple code points, so context-dependent mappings like German ß → ss or Greek Σ → σ/ς are structurally impossible regardless of type.
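The <cctype> limitation is easy to demonstrate; the helper below is illustrative and shows the required cast pattern:

```cpp
#include <cctype>
#include <string>

// Byte-wise lowercasing with the mandatory unsigned char cast. In the
// default "C" locale only ASCII A-Z change; the two bytes encoding a
// UTF-8 'É' (0xC3 0x89) pass through untouched, because std::tolower
// sees bytes, never code points.
std::string ToLowerBytes(std::string s)
{
    for (char &c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}
```

ToLowerBytes("CAF\xC3\x89") ("CAFÉ") comes back as "caf\xC3\x89": the ASCII letters fold, the É does not.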
Unicode Case Folding is a locale-independent mapping defined in the Unicode standard (Section 5.18) for caseless matching. ICU ships the complete folding table.
icu::UnicodeString u("Iİıi");
u.foldCase();
std::string result;
u.toUTF8String(result); // "ii̇ıi" — İ folds to i + U+0307 (combining dot above); ı is preserved
The Unicode Consortium also enforces a Case Folding Stability Policy: any string correctly case-folded under Unicode 5.0 rules is guaranteed to remain correctly case-folded in all subsequent Unicode versions. This means foldCase() results are stable across ICU and Unicode upgrades — unlike locale-dependent tolower, which has no such guarantee.
For locale-sensitive sorting — Turkish text that needs to sort in Turkish collation order — use icu::Collator with an explicit icu::Locale("tr"). Case folding and collation are separate operations.
Benchmark: normalization and case-folding
Benchmarks use Google Benchmark on Apple M2 Pro (arm64, Apple clang 17.0.0, C++26, ICU 78.2, -O3).
static void BM_IsAnagram_ASCII(benchmark::State &state)
{
// Baseline: plain ASCII anagrams, no Unicode processing needed
std::string s1 = "listen";
std::string s2 = "silent";
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagram(s1, s2));
}
static void BM_IsAnagram_Unicode_NFC(benchmark::State &state)
{
// Same anagrams as ASCII, but with accents (NFC form)
// After normalization/case-folding, "lísten" and "síLent" are equivalent to "listen"/"silent"
std::string s1 = "l\xC3\xAD" "sten"; // NFC: l + í (U+00ED) + sten
std::string s2 = "s\xC3\xAD" "Lent"; // NFC: s + í (U+00ED) + Lent (uppercase)
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagramUnicode(s1, s2));
}
static void BM_IsAnagram_Unicode_NFD(benchmark::State &state)
{
// Same anagrams as NFC, but with decomposed accents (NFD form)
// Forces normalizer to compose combining marks before comparison
std::string s1 = "l" "i\xCC\x81" "sten"; // NFD: l + i + combining ◌́ (U+0301) + sten
std::string s2 = "s" "i\xCC\x81" "Lent"; // NFD: s + i + combining ◌́ (U+0301) + Lent
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagramUnicode(s1, s2));
}
| Benchmark | Time | Relative |
|---|---|---|
| IsAnagram ASCII — byte frequency | 110 ns | 1× |
| IsAnagram Unicode, NFC input | 420 ns | 3.8× |
| IsAnagram Unicode, NFD input | 482 ns | 4.4× |
NFD vs NFC costs 62 ns — anagrams are the same regardless of normalization, so the difference is pure ICU composition overhead. Both inputs go through the same normalizer code path; NFD simply has more combining marks to resolve. Normalization itself is not the bottleneck — ICU's per-call overhead dominates at these string lengths.
Longer strings (to show how the overhead scales) use "café " repeated many times:
static void BM_IsAnagram_ASCII_Long(benchmark::State &state)
{
// ~300 bytes: "cafe " repeated 60 times
std::string long_text;
for (int i = 0; i < 60; ++i)
long_text += "cafe ";
std::string s1 = long_text;
std::string s2 = long_text;
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagram(s1, s2));
}
static void BM_IsAnagram_Unicode_NFC_Long(benchmark::State &state)
{
// ~360 bytes: "café " (NFC form) repeated 60 times
std::string long_text;
for (int i = 0; i < 60; ++i)
long_text += "caf\xC3\xA9 ";
std::string s1 = long_text;
std::string s2 = long_text;
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagramUnicode(s1, s2));
}
static void BM_IsAnagram_Unicode_NFD_Long(benchmark::State &state)
{
// ~420 bytes: "café " (NFD form, decomposed) repeated 60 times
std::string long_text;
for (int i = 0; i < 60; ++i)
long_text += "cafe\xCC\x81 ";
std::string s1 = long_text;
std::string s2 = long_text;
for (auto _ : state)
benchmark::DoNotOptimize(ch01::IsAnagramUnicode(s1, s2));
}
At larger string sizes the picture changes — but not in the direction you might expect. Testing with ~300–420-byte strings (60 repetitions of "café " in each form, mirrored by an ASCII "cafe " baseline):
| Benchmark | Time | Relative to ASCII |
|---|---|---|
| IsAnagram ASCII — byte frequency (300 bytes) | 762 ns | 1× |
| IsAnagram Unicode, NFC (360 bytes) | 6,250 ns | 8.2× |
| IsAnagram Unicode, NFD (420 bytes) | 10,214 ns | 13.4× |
The overhead grows, it does not amortize. At short strings the ratio is 3.8× for NFC and 4.4× for NFD; at ~400 bytes it widens to 8.2× and 13.4×. Why? The two implementations have fundamentally different costs:
- ASCII: a single-pass scan into a fixed 256-element std::array<int, 256>. Stack-allocated, cache-friendly, O(n) with a tiny constant. Dense data.
- Unicode: allocates a UnicodeString, converts UTF-8 → UTF-16, runs the normalizer, case-folds, then walks the string inserting into an unordered_map<char32_t, int>. Sparse data.
Why the difference? The byte alphabet has only 256 values — dense enough for a fixed array. The Unicode code point space spans ~1.1M values (U+0000 to U+10FFFF) — too sparse for a fixed array, even though only ~160K characters are currently assigned. The unordered_map is not a pessimization — it's the only sane choice for a sparse ~1.1M-element space. You cannot allocate a ~1.1M-element array on the stack for every comparison; that is the fundamental trade-off between dense (ASCII) and sparse (Unicode) data structures.
The slowdown is the cumulative cost of string conversion, normalization, allocation, and the map's hashing overhead. These benchmarks have only four unique code points in the map (c, a, f, é) — spaces are filtered by ShouldSkip — so the slowdown is not primarily map scaling. Real-world Unicode text with thousands of unique characters would see even worse scaling due to hashing and cache-miss costs. But the core problem — that Unicode processing is expensive compared to fixed-array byte counting — is already visible here.
NFD widens the gap further because decomposition literally makes the string longer (e + combining accent = 3 bytes vs precomposed é = 2 bytes in UTF-8), so the normalizer and map insertion process more data.
The practical consequence: do not normalize on every comparison. Normalize once at ingestion and store the NFC form; compare the pre-normalized values. Profile with your actual input sizes before drawing conclusions.
Where this matters: Part 1
- Search & indexing — "café" must match cafe\u0301; normalize at index time and at query time.
- Deduplication — std::unordered_set<std::string> will store NFC and NFD variants as distinct entries; uniqueness at the byte level is wrong for user-controlled text.
- Security — normalization attacks are documented in CVE databases; two strings that render identically can bypass byte-level checks if one is NFC and the other NFD. NFC helps but is not sufficient alone; combine with confusable detection (UTS #39) where needed.
When std::string is sufficient
Internal tokens, protocol keywords, configuration keys, and any string whose character set is controlled and ASCII-only do not need ICU. The overhead is real; apply it where the correctness gap is real too.
Linking ICU
# icu-config is deprecated in recent ICU releases; prefer pkg-config:
pkg-config --cflags icu-uc icu-i18n   # compile flags
pkg-config --libs icu-uc icu-i18n     # link flags: -licui18n -licuuc -licudata
- icuuc — core (UnicodeString, UTF converters, UChar32 utilities)
- icui18n — internationalization (Normalizer2, BreakIterator, Collator)
- icudata — the Unicode data tables; omitting it links but crashes at runtime
Summary: Part 1
- std::string is a byte container, not a text container; the C++ standard has no Unicode text processing and won't until the Unicode data problem is solved at the standard level.
- Normalization — the same character can be encoded multiple ways; normalize to NFC before any comparison or hashing.
- Use Unicode Case Folding: foldCase() instead of tolower.
- ICU provides both, backed by the same Unicode tables used by the JVM, V8, and Qt. The Correctness Tax is 3×–4× for small strings, up to 13× for longer ones — justified for user-generated text and security-sensitive comparisons.
Part 2 preview: Normalization and case-folding fix comparison bugs, but what about everything else? Try truncating a string containing emoji, or counting how long a message really is. A single emoji like 👨👩👧👦 breaks most string operations because it looks like one character but is made of pieces. Part 2 shows how real text processing works, why emoji and international text break naive implementations, and where Unicode handling actually makes a difference in performance vs correctness.
Read Part 2: [link coming soon]