Does C++23 now provide support for Unicode characters in its basic char
type, and to what degree?
On cppreference's page for character literals, a c-char is defined as either:
- a basic-c-char
- an escape sequence, as defined in escape sequences
- a universal character name, as defined in escape sequences
and then basic-c-char is defined as:
A character from the basic source character set (until C++23) / translation character set (since C++23), except the single-quote ', backslash, or new-line character
On the cppreference’s page for character sets, it then defines the "translation character set" as consisting of the following:
- each abstract character assigned a code point in the Unicode codespace, and (since C++23)
- a distinct character for each Unicode scalar value not assigned to an abstract character.
and states:
The translation character set is a superset of the basic character set and the basic literal character set (see below).
It seems to me that the "basic character set" (given on that same page) is essentially a subset of ASCII. I had also always thought of char as holding ASCII (with support for ISO-8859 character sets, as per Microsoft's page on the character types). But now, with the change to the translation character set for basic-c-char, it seems char literals support Unicode to some fuller extent.
I'm aware that the actual encoding is implementation-defined (apart from the null character, and the requirement that the decimal digit characters have consecutive values). But my main question is: what characters are really supported by this "translation character set"? Is it all of Unicode? I feel as though I'm reading more into this than is actually the case.
2 Answers
Effectively not much changed (with two important differences):
Before C++23, the first translation phase specified that any character in the source file that isn't an element of the basic source character set (which is a subset of the ASCII character set) was mapped to a universal-character-name, i.e. it was replaced by a sequence of the form \UXXXXXXXX, where XXXXXXXX is the number of the ISO/IEC 10646 (equivalently, Unicode) code point for the character.
So when writing a character literal 'X', where X stands for a character not in the basic source character set, you would get '\UXXXXXXXX' after the first translation phase, and then the c-char -> universal-character-name grammar production applied.
So you could always write non-ASCII characters in a character literal, assuming the source encoding permitted writing such a character. The source file encoding and the supported source characters outside the basic source character set were implementation-defined (the source character set/encoding). Regardless of the source character set, you could already write any Unicode scalar value directly into a character literal with a universal character name.
How this character literal then behaves is a different question, because the encoding used to determine the value of the char from the universal-character-name (or any character of the basic source character set) is implementation-defined as well (the execution character set encoding in C++20, or the ordinary literal encoding in C++23). Obviously, if char is 8 bits wide, it can't represent all Unicode scalar values. If the character was not representable in char, then the behavior was implementation-defined.
C++23 changes this in two ways. First, support for UTF-8 source encoding became mandatory, implying support for all Unicode scalar values in the source file (although other encodings may of course also be supported). Second, the first translation phase was changed: instead of rewriting everything to the basic source character set via universal character names, source characters are now mapped to a sequence of translation character set elements, which is essentially a sequence of Unicode scalar values. Unicode code points that are not Unicode scalar values, i.e. the surrogate code points, are not elements of the translation character set (and cannot be produced by decoding any source file).
Therefore, in C++23 when getting to the translation phase where the character literal’s value is determined, a single Unicode scalar value in the source file matches the basic-c-char grammar as you showed in your question.
The value of the character literal is still determined as before by implementation-defined encoding. However, in contrast to C++20, the literal is now ill-formed if the character is not representable in char
via this encoding.
So the two differences are that UTF-8 source file encoding must be supported and that a single source character (meaning a single Unicode scalar value) in the character literal that is not representable in the implementation-defined ordinary literal encoding will now cause the literal to be ill-formed instead of having an implementation-defined value.
Analogously to the above, string literals (rather than character literals) haven't really changed either. The encoding is still implementation-defined, using the same ordinary literal encoding; primarily only the internal representation in the translation phases changed. And in the same way as for character literals, with C++23 the literal becomes ill-formed if a character (i.e. a translation character set element, or Unicode scalar value) is not representable in the ordinary literal encoding. However, that encoding may be e.g. UTF-8, so a single Unicode scalar value in the source file may map to multiple char in the encoded string, as has always been the case.
what characters are really supported by this "translation character set"?
As you already quoted (I'll quote from the latest C++ standard draft):
[lex.charset]
The translation character set consists of the following elements:
- each abstract character assigned a code point in the Unicode codespace, and
- a distinct character for each Unicode scalar value not assigned to an abstract character.
Let’s look up definitions for the terms used in the rule (quote from Unicode 14):
For the first point:
Characters and Encoding
Abstract character: A unit of information used for the organization, control, or representation of textual data.
- When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
- An abstract character has no concrete form and should not be confused with a glyph.
- An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
- The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
- Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.
For the second point:
Unicode Encoding Forms
Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
- As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
The C++ standard also has a clarifying note:
[Note 1: Unicode code points are integers in the range [0, 10FFFF]
(hexadecimal). A surrogate code point is a value in the range [D800,
DFFF] (hexadecimal). A Unicode scalar value is any code point that is
not a surrogate code point. — end note]
Is it all of Unicode?
TL;DR: No. For example, surrogate code points are not in the translation character set, and abstract characters represented only by combining character sequences have no single element in it either.
Furthermore, this is an important rule from the C++ standard:
A character-literal with a c-char-sequence consisting of a single basic-c-char, simple-escape-sequence, or universal-character-name is the code unit value of the specified character as encoded in the literal’s associated character encoding.
If the specified character lacks representation in the literal’s associated character encoding or if it cannot be encoded as a single code unit, then the program is ill-formed.
If your system has an 8-bit char, then it will not be able to represent all 10FFFF code points of the Unicode codespace.
P.S. Unicode in char literals has never been disallowed by the C++ standard; this change just makes Unicode support mandatory.
- Isn't lex.charset the set of characters you can write C++ code in, not the set of characters the C++ libraries handle? – Yakk - Adam Nevraumont, 13 hours ago
- @Yakk-AdamNevraumont Yes. This question is about character literals as far as I can tell. Libraries can handle any character set they want to. – eerorika, 13 hours ago
- @user3840170 Ah, fair enough. I was only thinking of Unicode as the execution charset. – eerorika, 13 hours ago
- "If your system has an 8 bit char, then it will not be able to represent all 10FFFF code points of the Unicode codespace" A single 8-bit char literal obviously can't do it today, and C++23 is not going to magically give it this ability. A string literal, however, is potentially able to represent all of Unicode, and C++23 is not going to take this ability away (the question doesn't mention string literals, but a good answer IMHO should). – n. m. could be an AI, 13 hours ago
- A bunch of the weird phrasing in the standard is basically saying "We want C++ implementations to support Unicode, but we don't want to declare any existing code as nonstandard just because it or its platform is non-Unicode-aware." – 9 hours ago
- What do you mean by "supports Unicode"? If you think about it more precisely, you can probably answer the question yourself. In short: just keep the data as a black-box string (e.g. as UTF-8). On input and output, do the conversion to and from your black box's predefined format (nobody can guess the expected encoding on input and output, so neither can the C++ standard). And for processing, you need good Unicode libraries (do not think of a single code point as a good unit for handling Unicode strings). – 1 hour ago