5 Lexical conventions [lex]

5.3 Characters [lex.char]

5.3.1 Character sets [lex.charset]

The translation character set consists of the following elements:

(1.1)
each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and
(1.2)
a distinct character for each Unicode scalar value not assigned to an abstract character.

[Note 1:

Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).

A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).

A Unicode scalar value is any code point that is not a surrogate code point.

— end note]
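
As a non-normative illustration, the three ranges described in Note 1 could be checked as sketched below. The helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and do not appear in this document.

```cpp
#include <cassert>

// Hypothetical helpers; the names are illustrative only.

// A Unicode code point is an integer in the range [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// A surrogate code point lies in the range [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "U+D800 is a surrogate code point");
static_assert(is_scalar_value(0x10FFFF), "upper bound of the codespace");
static_assert(!is_code_point(0x110000), "outside the Unicode codespace");
```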

The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

In this document, glyphs are often used to identify elements of the basic character set.

[Note 2:

Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.

— end note]

Table 1 — Basic character set [tab:lex.charset.basic]

code point	character name	glyph
U+0009	character tabulation
U+000B	line tabulation
U+000C	form feed
U+0020	space
U+000A	line feed	new-line
U+0021	exclamation mark	!
U+0022	quotation mark	"
U+0023	number sign	#
U+0024	dollar sign	$
U+0025	percent sign	%
U+0026	ampersand	&
U+0027	apostrophe	'
U+0028	left parenthesis	(
U+0029	right parenthesis	)
U+002A	asterisk	*
U+002B	plus sign	+
U+002C	comma	,
U+002D	hyphen-minus	-
U+002E	full stop	.
U+002F	solidus	/
U+0030 .. U+0039	digit zero .. nine	0 1 2 3 4 5 6 7 8 9
U+003A	colon	:
U+003B	semicolon	;
U+003C	less-than sign	<
U+003D	equals sign	=
U+003E	greater-than sign	>
U+003F	question mark	?
U+0040	commercial at	@
U+0041 .. U+005A	latin capital letter a .. z	A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005B	left square bracket	[
U+005C	reverse solidus	\
U+005D	right square bracket	]
U+005E	circumflex accent	^
U+005F	low line	_
U+0060	grave accent	`
U+0061 .. U+007A	latin small letter a .. z	a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007B	left curly bracket	{
U+007C	vertical line	|
U+007D	right curly bracket	}
U+007E	tilde	~
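
As a non-normative cross-check of the count of 99, membership in Table 1 could be sketched as a predicate on code points. The names `in_basic_character_set` and `count_basic` are hypothetical. The sketch relies on the observation that Table 1 lists four control characters plus every code point from U+0020 through U+007E.

```cpp
#include <cassert>

// Hypothetical helper: true if c is one of the elements listed in Table 1.
constexpr bool in_basic_character_set(char32_t c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C  // tab, LF, VT, FF
        || (c >= 0x20 && c <= 0x7E);                         // space through tilde
}

// Hypothetical helper: counts the members found in U+0000 .. U+007F
// by recursing over each code point in that range.
constexpr int count_basic(char32_t c = 0) {
    return c > 0x7F ? 0
                    : (in_basic_character_set(c) ? 1 : 0) + count_basic(c + 1);
}

static_assert(count_basic() == 99, "Table 1 lists 99 characters");
static_assert(!in_basic_character_set(0x0D), "carriage return is not in Table 1");
```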

The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]

code point	character name
U+0000	null
U+0007	alert
U+0008	backspace
U+000D	carriage return
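
The following non-normative sketch relates the entries of Table 2 to the escape sequences commonly used to write them. The escape sequences themselves are specified in [lex.ccon], not in this subclause. UTF-32 literals are used so that each code unit value equals the code point. The helper `is_additional_control` is hypothetical.

```cpp
#include <cassert>

// In a UTF-32 character literal, the code unit value is the code point itself.
static_assert(U'\0' == 0x00, "null");
static_assert(U'\a' == 0x07, "alert");
static_assert(U'\b' == 0x08, "backspace");
static_assert(U'\r' == 0x0D, "carriage return");

// Hypothetical helper: true if c is one of the four characters in Table 2.
constexpr bool is_additional_control(char32_t c) {
    return c == 0x00 || c == 0x07 || c == 0x08 || c == 0x0D;
}
```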

A code unit is an integer value of character type ([basic.fundamental]).

Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); this is termed the respective literal encoding.

The ordinary literal encoding is the encoding applied to an ordinary character or string literal.

The wide literal encoding is the encoding applied to a wide character or string literal.
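
As a non-normative sketch, the effect of the encoding-prefix on the code unit type can be observed with `decltype`. The helper `code_unit_count` is hypothetical. The string used contains only elements of the basic literal character set, each of which occupies a single code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// The encoding-prefix (or its absence) selects the code unit type.
static_assert(std::is_same<decltype('a'), char>::value, "ordinary character literal");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "wide character literal");
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "UTF-16 character literal");
static_assert(std::is_same<decltype(U'a'), char32_t>::value, "UTF-32 character literal");

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <typename T, std::size_t N>
constexpr std::size_t code_unit_count(const T (&)[N]) {
    return N - 1;
}

static_assert(code_unit_count("abc") == 3, "ordinary literal encoding");
static_assert(code_unit_count(L"abc") == 3, "wide literal encoding");
```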

A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[Note 3:

A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.

— end note]

The U+0000 null character is encoded as the value 0.

No other element of the translation character set is encoded with a code unit of value 0.

The code unit value of each decimal digit character after the digit 0 (U+0030) is one greater than the value of the previous.
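
The guarantees above on the null character and on the decimal digits are what make the familiar `c - '0'` conversion portable. A minimal, non-normative sketch follows; the helper `digit_value` is hypothetical.

```cpp
#include <cassert>

// The null character is encoded as 0. Other elements of the basic literal
// character set have non-negative code units, so they are greater than 0.
static_assert('\0' == 0, "null is encoded as the value 0");
static_assert('a' > 0 && '~' > 0, "other basic characters are nonzero");

// Decimal digits have consecutive code unit values in both encodings.
static_assert('1' == '0' + 1 && '9' == '0' + 9, "ordinary literal encoding");
static_assert(L'9' - L'0' == 9, "wide literal encoding");

// Hypothetical helper: the numeric value of a decimal digit, or -1 otherwise.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

No comparable guarantee is stated here for letters, so a similar conversion such as `c - 'a'` depends on the implementation-defined encoding.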

The ordinary and wide literal encodings are otherwise implementation-defined.

For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
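
As a non-normative sketch, the code units produced for UTF-8, UTF-16, and UTF-32 literals can be inspected at compile time. The characters U+00E9 and U+1F600 are arbitrary examples, written as universal-character-names so that the sketch does not depend on the source file encoding. The helpers `utf8_length` and `utf16_length` are hypothetical and assume their argument is a Unicode scalar value.

```cpp
#include <cassert>

// UTF-8: U+00E9 is encoded as the two code units C3 A9.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus the terminating null");
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "continuation byte");

// UTF-16: U+1F600 is encoded as the surrogate pair D83D DE00.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "pair plus null");
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00, "surrogate pair");

// UTF-32: every scalar value is a single code unit.
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2, "one code unit plus null");
static_assert(U"\U0001F600"[0] == 0x1F600, "code unit equals the scalar value");

// Hypothetical helpers: code units needed for a scalar value c.
constexpr int utf8_length(char32_t c) {
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}
constexpr int utf16_length(char32_t c) { return c >= 0x10000 ? 2 : 1; }
```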