5 Lexical conventions [lex]

r-char:
any member of the translation character set, except a U+0029 right parenthesis followed by
the initial d-char-sequence (which may be empty) followed by a U+0022 quotation mark

d-char-sequence:
d-char d-char-sequence$_{opt}$

d-char:
any member of the basic character set except:
U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,
U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line

The kind of a string-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12 where n is the number of encoded code units that would result from an evaluation of the string-literal (see below).

Table 12 — String literals [tab:lex.string.literal]

Encoding prefix	Kind	Type	Associated character encoding	Examples
none	ordinary string literal	array of n const char	ordinary literal encoding	"ordinary string" R"(ordinary raw string)"
L	wide string literal	array of n const wchar_t	wide literal encoding	L"wide string" LR"w(wide raw string)w"
u8	UTF-8 string literal	array of n const char8_t	UTF-8	u8"UTF-8 string" u8R"x(UTF-8 raw string)x"
u	UTF-16 string literal	array of n const char16_t	UTF-16	u"UTF-16 string" uR"y(UTF-16 raw string)y"
U	UTF-32 string literal	array of n const char32_t	UTF-32	U"UTF-32 string" UR"z(UTF-32 raw string)z"
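The type and length columns of Table 12 can be checked at compile time. The following is a minimal sketch, assuming a C++17 or later compiler; the helper `code_units` is a hypothetical name introduced here for illustration and is not part of the standard. The `char8_t` check is guarded because that type is only available from C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: yields n, the number of code units in the array,
// including the terminating null.
template <class T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// decltype of a string-literal is an lvalue reference to its array type.
static_assert(std::is_same_v<decltype("abc"),  const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"), const wchar_t (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"), const char32_t (&)[4]>);
#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype(u8"abc"), const char8_t (&)[4]>);
#endif

// n counts code units, not characters.
static_assert(code_units(u8"\u00E9") == 3);     // two UTF-8 code units + null
static_assert(code_units(u"\U0001F600") == 3);  // surrogate pair + null
static_assert(code_units(U"\U0001F600") == 2);  // one UTF-32 code unit + null
```

The Unicode prefixes are used for the length checks because their encodings are fixed; the lengths for ordinary and wide literals depend on the implementation's literal encodings.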

A string-literal that has an R in the prefix is a raw string literal.

The d-char-sequence serves as a delimiter.

The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.

A d-char-sequence shall consist of at most 16 characters.

[Note 1:

The characters '(' and ')' can appear in a raw-string.

Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

— end note]
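The role of the delimiter can be illustrated with a short sketch, assuming C++17 for `constexpr std::string_view` comparison. The variable names are hypothetical. The second literal shows why a non-empty delimiter is useful: its content contains the sequence `)"`, which would otherwise terminate the raw string.

```cpp
#include <cassert>
#include <string_view>

// The equivalence stated in Note 1.
constexpr std::string_view delimited = R"delimiter((a|b))delimiter";
constexpr std::string_view plain     = "(a|b)";
static_assert(delimited == plain);

// The content f(")") contains )" so the empty delimiter cannot be used;
// the delimiter x makes only )x" terminate the literal.
constexpr std::string_view with_quote = R"x(f(")"))x";
static_assert(with_quote == "f(\")\")");
```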

[Note 2:

A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.

Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

— end note]

[Example 1:

The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n".

The raw string R"(x = "\"y\"")" is equivalent to "x = \"\\\"y\\\"\"".

— end example]
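Both equivalences in Example 1, and the one in Note 2, can be verified by the compiler. This is a sketch assuming C++17; the variable names are hypothetical, and the raw-string lines must start in the first column so that no leading whitespace becomes part of the content.

```cpp
#include <cassert>
#include <string_view>

// First raw string of Example 1: source new-lines become new-line characters
// and the backslash is not an escape.
constexpr std::string_view first = R"a(
)\
a"
)a";
static_assert(first == "\n)\\\na\"\n");

// Second raw string of Example 1: quotes and backslashes are taken literally.
constexpr std::string_view second = R"(x = "\"y\"")";
static_assert(second == "x = \"\\\"y\\\"\"");

// The literal from Note 2.
constexpr std::string_view note2 = R"(a\
b
c)";
static_assert(note2 == "a\\\nb\nc");
```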

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.

The string-literals in any sequence of adjacent string-literals shall have at most one unique encoding-prefix among them.

The common encoding-prefix of the sequence is that encoding-prefix, if any.

[Note 3:

A string-literal's rawness has no effect on the determination of the common encoding-prefix.

— end note]

In translation phase 5 ([lex.phases]), adjacent string-literals are concatenated.

The lexical structure and grouping of the contents of the individual string-literals is retained.

[Example 2:

"\xA" "B" represents the code unit '\xA' and the character 'B' after concatenation (and not the single code unit '\xAB').

Similarly, R"(\u00)" "41" represents six characters, starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a universal-character-name).

Table 13 has some examples of valid concatenations.

— end example]

Table 13 — String literal concatenations [tab:lex.string.concat]

Source		Means	Source		Means	Source		Means
u"a"	u"b"	u"ab"	U"a"	U"b"	U"ab"	L"a"	L"b"	L"ab"
u"a"	"b"	u"ab"	U"a"	"b"	U"ab"	L"a"	"b"	L"ab"
"a"	u"b"	u"ab"	"a"	U"b"	U"ab"	"a"	L"b"	L"ab"
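The concatenation rules of Example 2 and several rows of Table 13 can be checked as follows. This is a minimal sketch assuming C++17; the array names are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// "\xA" "B": the escape sequence ends at the literal boundary, so the result
// holds the code unit 0x0A, then 'B', then the terminating null.
constexpr char hex_then_b[] = "\xA" "B";
static_assert(sizeof(hex_then_b) == 3);
static_assert(hex_then_b[0] == '\xA' && hex_then_b[1] == 'B');

// R"(\u00)" "41": six characters plus the null, not a universal-character-name.
constexpr char not_ucn[] = R"(\u00)" "41";
static_assert(sizeof(not_ucn) == 7);
static_assert(not_ucn[0] == '\\' && not_ucn[5] == '1');

// An unprefixed literal adopts the common encoding-prefix of the sequence.
static_assert(std::is_same_v<decltype(u"a" "b"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" U"b"), const char32_t (&)[3]>);
static_assert(std::is_same_v<decltype(L"a" "b"), const wchar_t (&)[3]>);
```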

Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]).

[Note 4:

String literal objects are potentially non-unique ([intro.object]).

Whether successive evaluations of a string-literal yield the same or a different object is unspecified.

— end note]

[Note 5:

The effect of attempting to modify a string literal object is undefined.

— end note]
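A practical consequence of Notes 4 and 5 is sketched below. The function names are hypothetical. Returning a pointer to a string literal object is safe because the object has static storage duration, but code should compare contents rather than addresses, since object identity is unspecified.

```cpp
#include <cassert>
#include <cstring>

// Safe: the string literal object outlives the function call.
inline const char* greeting() { return "hello"; }

// Compares contents. Comparing the pointers themselves would give an
// unspecified result, because two evaluations of "hello" may or may not
// yield the same object.
inline bool same_text(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

Keeping the pointer type as `const char*` lets the compiler reject attempts to modify the string literal object, which would otherwise have undefined behavior.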

String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 null character, in order as follows:

(10.1)
The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding.

If a character lacks representation in the associated character encoding, then the program is ill-formed.

[Note 6:
No character lacks representation in any Unicode encoding form.
— end note]

When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.

[Note 7:
The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.
— end note]
(10.2)
Each numeric-escape-sequence ([lex.ccon]) contributes a single code unit with a value as follows:
- (10.2.1)
  Let v be the integer value represented by the octal number comprising the sequence of octal-digits in an octal-escape-sequence or by the hexadecimal number comprising the sequence of hexadecimal-digits in a hexadecimal-escape-sequence.
- (10.2.2)
  If v does not exceed the range of representable values of the string-literal's array element type, then the value is v.
- (10.2.3)
  Otherwise, if the string-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the string-literal's array element type, then the value is the unique value of the string-literal's array element type T that is congruent to v modulo $2^{N}$ , where N is the width of T.
- (10.2.4)
  Otherwise, the program is ill-formed.
When encoding a stateful character encoding, these sequences should have no effect on encoding state.
(10.3)
Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence.

When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.
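The numeric-escape-sequence rules in (10.2) can be illustrated with a few compile-time checks. This is a sketch assuming C++17 and an 8-bit `char`; the array names are hypothetical.

```cpp
#include <cassert>

// (10.2.1)/(10.2.2): hexadecimal 41 and octal 101 both denote v = 65,
// and each escape contributes exactly one code unit.
constexpr char same_value[] = "\x41\101";
static_assert(sizeof(same_value) == 3);
static_assert(same_value[0] == 65 && same_value[1] == 65);

// (10.2.3): with no encoding-prefix, v = 255 fits in unsigned char, so the
// code unit is the char value congruent to 255 modulo 2^8. Whether that is
// 255 or -1 depends on the signedness of char; converting to unsigned char
// gives 0xFF in both cases.
constexpr char high_byte[] = "\xFF";
static_assert(static_cast<unsigned char>(high_byte[0]) == 0xFF);

// Numeric escapes are not encoded as characters: in a UTF-8 string literal
// each still contributes a single code unit.
static_assert(sizeof(u8"\xC3\xA9") == 3);

// (10.2.2) for char16_t: 0xFFFF is representable. By (10.2.4), u"\x10000"
// would make the program ill-formed, so it is not written here.
constexpr char16_t max_unit[] = u"\xFFFF";
static_assert(max_unit[0] == 0xFFFF);
```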