5 Lexical conventions [lex]

5.13 Literals [lex.literal]

5.13.5 String literals [lex.string]

basic-s-char:
any member of the translation character set except the U+0022 quotation mark,
   U+005c reverse solidus, or new-line character
r-char:
any member of the translation character set, except a U+0029 right parenthesis followed by
   the initial d-char-sequence (which may be empty) followed by a U+0022 quotation mark
d-char-sequence:
d-chard-char-sequence
d-char:
any member of the basic character set except:
   U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,
   U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line
The kind of a string-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12 where n is the number of encoded code units as described below.
Table 12 — String literals [tab:lex.string.literal]
Enco-
Kind
Type
Associated
Examples
ding
character
prefix
encoding
none
array of n
const char
ordinary literal encoding
"ordinary string"
R"(ordinary raw string)"
L
array of n
const wchar_t
wide literal
encoding
L"wide string"
LR"w(wide raw string)w"
u8
array of n
const char8_t
UTF-8
u8"UTF-8 string"
u8R"x(UTF-8 raw string)x"
u
array of n
const char16_t
UTF-16
u"UTF-16 string"
uR"y(UTF-16 raw string)y"
U
array of n
const char32_t
UTF-32
U"UTF-32 string"
UR"z(UTF-32 raw string)z"
A string-literal that has an R in the prefix is a raw string literal.
The d-char-sequence serves as a delimiter.
The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.
[Note 1: 
The characters '(' and ')' can appear in a raw-string.
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
— end note]
[Note 2: 
A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.
Assuming no whitespace at the beginning of lines in the following example, the assert will succeed: const char* p = R"(a\ b c)"; assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note]
[Example 1: 
The raw string R"a( )\ a" )a" is equivalent to "\n)\\\na\"\n".
The raw string R"(x = "\"y\"")" is equivalent to "x = \"\\\"y\\\"\"".
— end example]
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
The string-literals in any sequence of adjacent string-literals shall have at most one unique encoding-prefix among them.
The common encoding-prefix of the sequence is that encoding-prefix, if any.
[Note 3: 
A string-literal's rawness has no effect on the determination of the common encoding-prefix.
— end note]
In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.
The lexical structure and grouping of the contents of the individual string-literals is retained.
[Example 2: 
"\xA" "B" represents the code unit '\xA' and the character 'B' after concatenation (and not the single code unit '\xAB').
Similarly, R"(\u00)" "41" represents six characters, starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a universal-character-name).
Table 13 has some examples of valid concatenations.
— end example]
Table 13 — String literal concatenations [tab:lex.string.concat]
Source
Means
Source
Means
Source
Means
u"a"
u"b"
u"ab"
U"a"
U"b"
U"ab"
L"a"
L"b"
L"ab"
u"a"
"b"
u"ab"
U"a"
"b"
U"ab"
L"a"
"b"
L"ab"
"a"
u"b"
u"ab"
"a"
U"b"
U"ab"
"a"
L"b"
L"ab"
Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]).
[Note 4: 
String literal objects are potentially non-unique ([intro.object]).
Whether successive evaluations of a string-literal yield the same or a different object is unspecified.
— end note]
[Note 5: 
The effect of attempting to modify a string literal object is undefined.
— end note]
String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 null character, in order as follows:
  • The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding.
    If a character lacks representation in the associated character encoding, then the program is ill-formed.
    [Note 6: 
    No character lacks representation in any Unicode encoding form.
    — end note]
    When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.
    [Note 7: 
    The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.
    — end note]
  • Each numeric-escape-sequence ([lex.ccon]) contributes a single code unit with a value as follows:
    When encoding a stateful character encoding, these sequences should have no effect on encoding state.
  • Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence.
    When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.