5 Lexical conventions [lex]

5.3 Character sets [lex.charset]

The translation character set consists of the following elements:
  • each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
  • a distinct character for each UCS scalar value where no named character is assigned.
[Note 1:
ISO/IEC 10646 code points are integers in the range [0, 10FFFF] (hexadecimal).
A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).
A UCS scalar value is any code point that is not a surrogate code point.
— end note]
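As an illustration only, the three definitions in Note 1 can be written as predicates. The function names below are hypothetical and do not appear in the standard; this is a minimal sketch, assuming a code point is held in a `char32_t`.

```cpp
// Hypothetical helper names; the standard defines the terms, not these functions.
constexpr bool is_code_point(char32_t v) { return v <= 0x10FFFF; }

constexpr bool is_surrogate_code_point(char32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// A UCS scalar value is any code point that is not a surrogate code point.
constexpr bool is_ucs_scalar_value(char32_t v) {
    return is_code_point(v) && !is_surrogate_code_point(v);
}

static_assert(is_ucs_scalar_value(0x0041), "U+0041 is a UCS scalar value");
static_assert(!is_ucs_scalar_value(0xD800), "surrogate code points are excluded");
static_assert(!is_ucs_scalar_value(0x110000), "values past 10FFFF are not code points");
```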
The basic character set is a subset of the translation character set, consisting of 96 characters as specified in Table 1.
[Note 2:
Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.
— end note]
Table 1: Basic character set [tab:lex.charset.basic]
character                                      glyph
U+0009            character tabulation
U+000b            line tabulation
U+000c            form feed
U+0020            space
U+000a            line feed                    new-line
U+0021            exclamation mark             !
U+0022            quotation mark               "
U+0023            number sign                  #
U+0025            percent sign                 %
U+0026            ampersand                    &
U+0027            apostrophe                   '
U+0028            left parenthesis             (
U+0029            right parenthesis            )
U+002a            asterisk                     *
U+002b            plus sign                    +
U+002c            comma                        ,
U+002d            hyphen-minus                 -
U+002e            full stop                    .
U+002f            solidus                      /
U+0030 .. U+0039  digit zero .. nine           0 1 2 3 4 5 6 7 8 9
U+003a            colon                        :
U+003b            semicolon                    ;
U+003c            less-than sign               <
U+003d            equals sign                  =
U+003e            greater-than sign            >
U+003f            question mark                ?
U+0041 .. U+005a  latin capital letter a .. z  A B C D E F G H I J K L M
                                               N O P Q R S T U V W X Y Z
U+005b            left square bracket          [
U+005c            reverse solidus              \
U+005d            right square bracket         ]
U+005e            circumflex accent            ^
U+005f            low line                     _
U+0061 .. U+007a  latin small letter a .. z    a b c d e f g h i j k l m
                                               n o p q r s t u v w x y z
U+007b            left curly bracket           {
U+007c            vertical line                |
U+007d            right curly bracket          }
U+007e            tilde                        ~
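The membership of Table 1 can be checked mechanically. The following is a sketch under the assumption that characters are identified by their UCS scalar value in a `char32_t`; the names `is_basic_character` and `count_basic_characters` are hypothetical. It also confirms the count of 96 stated above.

```cpp
// Hypothetical helper: true if c is one of the characters listed in Table 1.
constexpr bool is_basic_character(char32_t c) {
    return (c >= 0x09 && c <= 0x0C)   // tabulations, line feed, form feed
        || (c >= 0x20 && c <= 0x23)   // space ! " #
        || (c >= 0x25 && c <= 0x3F)   // % through ?, including the digits
        || (c >= 0x41 && c <= 0x5F)   // A through Z, then [ \ ] ^ _
        || (c >= 0x61 && c <= 0x7E);  // a through z, then { | } ~
}

// Counts the members among the first 128 code points (C++14 constexpr loop).
constexpr int count_basic_characters() {
    int n = 0;
    for (char32_t c = 0; c < 0x80; ++c) {
        if (is_basic_character(c)) ++n;
    }
    return n;
}

static_assert(count_basic_characters() == 96, "Table 1 lists 96 characters");
```

Under this table, U+0024, U+0040, and U+0060 are the printable ASCII characters that are not members.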
The universal-character-name construct provides a way to name other characters.
n-char: one of
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    U+002d hyphen-minus
    U+0020 space
A universal-character-name of the form \u hex-quad, \U hex-quad hex-quad, or \u{simple-hexadecimal-digit-sequence} designates the character in the translation character set whose UCS scalar value is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name.
The program is ill-formed if that number is not a UCS scalar value.
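A short illustration of the two fixed-width forms, using UTF-32 character literals so that the value of each literal is the UCS scalar value itself. The constant names are hypothetical. The delimited form with braces is left out of the code because it needs a recent compiler.

```cpp
// \u takes exactly four hexadecimal digits; \U takes exactly eight.
constexpr char32_t e_acute = U'\u00e9';      // U+00E9
constexpr char32_t bell    = U'\U0001F514';  // U+1F514 needs the eight-digit form

static_assert(e_acute == 0xE9, "the literal designates U+00E9");
static_assert(bell == 0x1F514, "the literal designates U+1F514");

// Writing a surrogate value such as D800 in either form would make the
// program ill-formed, because that number is not a UCS scalar value.
```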
A universal-character-name that is a named-universal-character designates the character named by its n-char-sequence.
A character is so named if the n-char-sequence is equal to
  • the associated character name or associated character name alias specified in ISO/IEC 10646 subclause “Code charts and lists of character names” or
  • the control code alias given in Table 2.
    [Note 3:
    The aliases in Table 2 are provided for control characters which otherwise have no associated character name or character name alias.
    These names are derived from the Unicode Character Database's NameAliases.txt.
    For historical reasons, control characters are formally unnamed.
    — end note]
[Note 4:
None of the associated character names, associated character name aliases, or control code aliases have leading or trailing spaces.
— end note]
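Before looking a name up, a tool could reject spellings that cannot match any name. The sketch below assumes the n-char grammar given above (uppercase Latin letters, digits, hyphen-minus, and space) and uses Note 4 to reject leading or trailing spaces. Both function names are hypothetical, and passing this check is only a necessary condition; it does not show that the name exists in ISO/IEC 10646 or Table 2.

```cpp
// Hypothetical helper: membership in the n-char set shown above.
constexpr bool is_n_char(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'0' && c <= U'9')
        || c == U'-' || c == U' ';
}

// Necessary (not sufficient) condition for s to spell a character name:
// non-empty, only n-chars, and no leading or trailing space (Note 4).
constexpr bool could_be_character_name(const char32_t* s) {
    if (*s == U'\0' || *s == U' ') return false;
    char32_t last = *s;
    for (; *s != U'\0'; ++s) {
        if (!is_n_char(*s)) return false;
        last = *s;
    }
    return last != U' ';
}

static_assert(could_be_character_name(U"LATIN SMALL LETTER E WITH ACUTE"),
              "a well-formed candidate");
static_assert(!could_be_character_name(U" BELL"), "leading space");
```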
Table 2: Control code aliases [tab:lex.charset.ucn]
U+0000 null
U+0001 start of heading
U+0002 start of text
U+0003 end of text
U+0004 end of transmission
U+0005 enquiry
U+0006 acknowledge
U+0007 alert
U+0008 backspace
U+0009 character tabulation
U+0009 horizontal tabulation
U+000a line feed
U+000a new line
U+000a end of line
U+000b line tabulation
U+000b vertical tabulation
U+000c form feed
U+000d carriage return
U+000e shift out
U+000e locking-shift one
U+000f shift in
U+000f locking-shift zero
U+0010 data link escape
U+0011 device control one
U+0012 device control two
U+0013 device control three
U+0014 device control four
U+0015 negative acknowledge
U+0016 synchronous idle
U+0017 end of transmission block
U+0018 cancel
U+0019 end of medium
U+001a substitute
U+001b escape
U+001c information separator four
U+001c file separator
U+001d information separator three
U+001d group separator
U+001e information separator two
U+001e record separator
U+001f information separator one
U+001f unit separator
U+007f delete
U+0082 break permitted here
U+0083 no break here
U+0084 index
U+0085 next line
U+0086 start of selected area
U+0087 end of selected area
U+0088 character tabulation set
U+0088 horizontal tabulation set
U+0089 character tabulation with justification
U+0089 horizontal tabulation with justification
U+008a line tabulation set
U+008a vertical tabulation set
U+008b partial line forward
U+008b partial line down
U+008c partial line backward
U+008c partial line up
U+008d reverse line feed
U+008d reverse index
U+008e single shift two
U+008e single-shift-2
U+008f single shift three
U+008f single-shift-3
U+0090 device control string
U+0091 private use one
U+0091 private use-1
U+0092 private use two
U+0092 private use-2
U+0093 set transmit state
U+0094 cancel character
U+0095 message waiting
U+0096 start of guarded area
U+0096 start of protected area
U+0097 end of guarded area
U+0097 end of protected area
U+0098 start of string
U+009a single character introducer
U+009b control sequence introducer
U+009c string terminator
U+009d operating system command
U+009e privacy message
U+009f application program command
If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds to a control character or to a character in the basic character set, the program is ill-formed.
[Note 5:
A sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name.
— end note]
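The difference described in Note 5 can be observed by counting characters. This is a minimal sketch using UTF-32 string literals, where each character occupies one code unit; the helper `length_of` and the constant names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t length_of(const CharT (&)[N]) { return N - 1; }

// In an ordinary s-char-sequence the escape is a universal-character-name.
constexpr std::size_t escaped_length = length_of(U"\u00e9");

// In an r-char-sequence the same six characters stay as written.
constexpr std::size_t raw_length = length_of(UR"(\u00e9)");

static_assert(escaped_length == 1, "one character, U+00E9");
static_assert(raw_length == 6, "backslash, u, 0, 0, e, 9");

// Ill-formed if uncommented: outside a literal, U+0041 names a character
// in the basic character set.
// int \u0041 = 0;
```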
The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 3.
[Note 6:
The alias bell for U+0007 shown in ISO 10646 is ambiguous with U+1f514 bell.
— end note]
Table 3: Additional control characters in the basic literal character set [tab:lex.charset.literal]
character
U+0000 null
U+0007 alert
U+0008 backspace
U+000d carriage return
A code unit is an integer value of character type ([basic.fundamental]).
Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); this is termed the respective literal encoding.
The ordinary literal encoding is the encoding applied to an ordinary character or string literal.
The wide literal encoding is the encoding applied to a wide character or string literal.
A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.
[Note 7:
A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.
— end note]
The U+0000 null character is encoded as the value 0.
No other element of the translation character set is encoded with a code unit of value 0.
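The guarantees so far can be restated as compile-time checks on the ordinary and wide literal encodings. The sketch below is illustrative only. The function `terminated_length` is a hypothetical name, included to suggest one practical reading of the rule: a scan for a code unit of value 0 finds the null character and nothing else.

```cpp
#include <cstddef>

static_assert('\0' == 0 && L'\0' == 0, "U+0000 null is encoded as the value 0");

// Elements of the basic literal character set are single non-negative code
// units, and nonzero because only null is encoded as 0.
static_assert('A' > 0 && '~' > 0 && '\a' > 0 && '\r' > 0, "ordinary literal encoding");
static_assert(L'A' > 0 && L'\a' > 0, "wide literal encoding");

// Distinct elements have distinct code units.
static_assert('A' != 'a' && '(' != '[', "no two elements share a code unit");

// Hypothetical helper: counts code units up to the first one of value 0.
constexpr std::size_t terminated_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

static_assert(terminated_length("a{b}") == 4, "four code units before the null");
```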
The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous.
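Because the ten digit code units are consecutive, the numeric value of a digit can be obtained by subtracting the code unit of `0`. A minimal sketch follows; `digit_value` is a hypothetical helper, not a library function.

```cpp
// Hypothetical helper: value of a decimal digit character, or -1 if c is
// not a decimal digit. The range test is exact because the code units of
// '0' through '9' are consecutive.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

static_assert('1' == '0' + 1 && '9' == '0' + 9, "digits are consecutive");
static_assert(digit_value('7') == 7, "subtracting '0' yields the digit value");
```

No similar guarantee is stated above for letters, so the same subtraction is not portable for `'a'` through `'z'`.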
The ordinary and wide literal encodings are otherwise implementation-defined.
For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
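For the UTF literals the number of code units per character is fixed by the encoding form, so it can be checked at compile time. The sketch below counts code units for two characters; `code_units` and the constant names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: code units in a string literal, excluding the
// terminating null. Works for any encoding-prefix.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 lies in the Basic Multilingual Plane.
constexpr std::size_t e_acute_utf8  = code_units(u8"\u00e9");  // 2 code units
constexpr std::size_t e_acute_utf16 = code_units(u"\u00e9");   // 1 code unit
constexpr std::size_t e_acute_utf32 = code_units(U"\u00e9");   // 1 code unit

// U+1F514 lies outside it, so UTF-16 needs a surrogate pair.
constexpr std::size_t bell_utf8  = code_units(u8"\U0001F514");  // 4 code units
constexpr std::size_t bell_utf16 = code_units(u"\U0001F514");   // 2 code units
constexpr std::size_t bell_utf32 = code_units(U"\U0001F514");   // 1 code unit

static_assert(e_acute_utf8 == 2 && bell_utf8 == 4, "UTF-8 lengths");
static_assert(e_acute_utf16 == 1 && bell_utf16 == 2, "UTF-16 lengths");
static_assert(e_acute_utf32 == 1 && bell_utf32 == 1, "UTF-32 lengths");
```

The surrogate values in the UTF-16 case are code units of the encoding, not characters; the character being encoded is still the single UCS scalar value U+1F514.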