1. An implementation shall support input files that are a sequence of UTF-8 code units (UTF-8 files). It may also support an implementation-defined set of other kinds of input files, and, if so, the kind of an input file is determined in an implementation-defined manner that includes a means of designating input files as UTF-8 files, independent of their content.
If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set. In the resulting sequence, each pair of characters in the input sequence consisting of U+000d carriage return followed by U+000a line feed, as well as each U+000d carriage return not immediately followed by a U+000a line feed, is replaced by a single new-line character.
For any other kind of input file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements ([lex.charset]), representing end-of-line indicators as new-line characters.
2. Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined.
A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified.
As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized and replaced by the designated element of the translation character set.
The process of dividing a source file's characters into preprocessing tokens is context-dependent.
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
5. For a sequence of two or more adjacent string-literal tokens, a common encoding-prefix is determined as specified in [lex.string].
6. Adjacent string-literal tokens are concatenated ([lex.string]).
7. Whitespace characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
[Note 2: The process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). — end note]
It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available.
[Note 3: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation. — end note]
8. Translated translation units and instantiation units are combined as follows:
Each translated translation unit is examined to produce a list of required instantiations. [Note 5: This can include instantiations which have been explicitly requested ([temp.explicit]). — end note]
The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available.
All the required instantiations are performed to produce instantiation units. The program is ill-formed if any instantiation fails.
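9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.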
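
A few of the effects described in the phases above can be observed from ordinary code. The following is only a sketch: `normalize_newlines`, `SPLICED_SUM`, `STRINGIZE`, and `check_phase_examples` are hypothetical names introduced for illustration, and `normalize_newlines` merely mimics the phase 1 new-line rule on text that is assumed to be already decoded, rather than showing how any implementation actually does it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the phase 1 new-line rule: a CR LF pair and
// a CR not followed by LF are each replaced by one new-line (U+000A).
inline std::u32string normalize_newlines(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            if (i + 1 < in.size() && in[i + 1] == U'\n') {
                ++i;  // consume the LF that completes a CR LF pair
            }
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Phase 2: the backslash ending the #define line is spliced away, so the
// macro definition continues on the following physical line.
#define SPLICED_SUM(a, b) \
    ((a) + (b))

// Phase 3: a comment is replaced by one space character, so an empty comment
// between two identifiers still separates them with whitespace.
#define STRINGIZE(x) #x

inline void check_phase_examples() {
    assert(normalize_newlines(U"a\r\nb\rc") == U"a\nb\nc");
    assert(SPLICED_SUM(2, 3) == 5);
    assert(std::string(STRINGIZE(a/**/b)) == "a b");
    // Adjacent string-literal tokens end up as a single literal.
    assert(std::string("Hello, " "world") == "Hello, world");
}
```

Calling `check_phase_examples()` from a test or from `main` exercises all four checks; the stringizing check relies on the preprocessor turning whitespace between argument tokens into a single space.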
code point | character name | glyph
U+0009 | character tabulation |
U+000b | line tabulation |
U+000c | form feed |
U+0020 | space |
U+000a | line feed | new-line
U+0021 | exclamation mark | !
U+0022 | quotation mark | "
U+0023 | number sign | #
U+0025 | percent sign | %
U+0026 | ampersand | &
U+0027 | apostrophe | '
U+0028 | left parenthesis | (
U+0029 | right parenthesis | )
U+002a | asterisk | *
U+002b | plus sign | +
U+002c | comma | ,
U+002d | hyphen-minus | -
U+002e | full stop | .
U+002f | solidus | /
U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9
U+003a | colon | :
U+003b | semicolon | ;
U+003c | less-than sign | <
U+003d | equals sign | =
U+003e | greater-than sign | >
U+003f | question mark | ?
U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005b | left square bracket | [
U+005c | reverse solidus | \
U+005d | right square bracket | ]
U+005e | circumflex accent | ^
U+005f | low line | _
U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007b | left curly bracket | {
U+007c | vertical line | |
U+007d | right curly bracket | }
U+007e | tilde | ~

U+0000 null | U+007f delete | |
U+0001 start of heading | U+0082 break permitted here | |
U+0002 start of text | U+0083 no break here | |
U+0003 end of text | U+0084 index | |
U+0004 end of transmission | U+0085 next line | |
U+0005 enquiry | U+0086 start of selected area | |
U+0006 acknowledge | U+0087 end of selected area | |
U+0007 alert | U+0088 character tabulation set | |
U+0008 backspace | U+0088 horizontal tabulation set | |
U+0009 character tabulation | U+0089 character tabulation with justification | |
U+0009 horizontal tabulation | U+0089 horizontal tabulation with justification | |
U+000a line feed | U+008a line tabulation set | |
U+000a new line | U+008a vertical tabulation set | |
U+000a end of line | U+008b partial line forward | |
U+000b line tabulation | U+008b partial line down | |
U+000b vertical tabulation | U+008c partial line backward | |
U+000c form feed | U+008c partial line up | |
U+000d carriage return | U+008d reverse line feed | |
U+000e shift out | U+008d reverse index | |
U+000e locking-shift one | U+008e single shift two | |
U+000f shift in | U+008e single-shift-2 | |
U+000f locking-shift zero | U+008f single shift three | |
U+0010 data link escape | U+008f single-shift-3 | |
U+0011 device control one | U+0090 device control string | |
U+0012 device control two | U+0091 private use one | |
U+0013 device control three | U+0091 private use-1 | |
U+0014 device control four | U+0092 private use two | |
U+0015 negative acknowledge | U+0092 private use-2 | |
U+0016 synchronous idle | U+0093 set transmit state | |
U+0017 end of transmission block | U+0094 cancel character | |
U+0018 cancel | U+0095 message waiting | |
U+0019 end of medium | U+0096 start of guarded area | |
U+001a substitute | U+0096 start of protected area | |
U+001b escape | U+0097 end of guarded area | |
U+001c information separator four | U+0097 end of protected area | |
U+001c file separator | U+0098 start of string | |
U+001d information separator three | U+009a single character introducer | |
U+001d group separator | U+009b control sequence introducer | |
U+001e information separator two | U+009c string terminator | |
U+001e record separator | U+009d operating system command | |
U+001f information separator one | U+009e privacy message | |
U+001f unit separator | U+009f application program command |
alignas | constinit | false | public | true | |
alignof | const_cast | float | register | try | |
asm | continue | for | reinterpret_cast | typedef | |
auto | co_await | friend | requires | typeid | |
bool | co_return | goto | return | typename | |
break | co_yield | if | short | union | |
case | decltype | inline | signed | unsigned | |
catch | default | int | sizeof | using | |
char | delete | long | static | virtual | |
char8_t | do | mutable | static_assert | void | |
char16_t | double | namespace | static_cast | volatile | |
char32_t | dynamic_cast | new | struct | wchar_t | |
class | else | noexcept | switch | while | |
concept | enum | nullptr | template | ||
const | explicit | operator | this | ||
consteval | export | private | thread_local | ||
constexpr | extern | protected | throw |
and | and_eq | bitand | bitor | compl | not | |
not_eq | or | or_eq | xor | xor_eq |
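
The eleven alternative tokens listed above are spelled as words but behave like the operators they stand for. As a brief illustration (the function names are hypothetical, and the operator mapping in the comments comes from the language rules rather than from the table itself):

```cpp
// Each alternative token is interchangeable with its primary spelling.
constexpr bool in_range(int x, int lo, int hi) {
    return x >= lo and x <= hi;       // and    is &&
}

constexpr unsigned low_byte_cleared(unsigned v) {
    return v bitand compl 0xFFu;      // bitand is &, compl is ~
}

constexpr unsigned toggled(unsigned v, unsigned mask) {
    v xor_eq mask;                    // xor_eq is ^=
    return v;
}

static_assert(in_range(5, 1, 10) and not in_range(11, 1, 10));  // not is !
static_assert(low_byte_cleared(0x1234u) == 0x1200u);
static_assert(toggled(0b1010u, 0b0110u) == 0b1100u);
static_assert((1 bitor 2) == 3);      // bitor  is |
static_assert(3 not_eq 4);            // not_eq is !=
static_assert(false or true);         // or     is ||
```

Some compilers only accept these spellings in their conforming mode, so the sketch assumes such a mode is enabled.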
integer-suffix | decimal-literal | integer-literal other than decimal-literal
none | int, long int, long long int | int, unsigned int, long int, unsigned long int, long long int, unsigned long long int
u or U | unsigned int, unsigned long int, unsigned long long int | unsigned int, unsigned long int, unsigned long long int
l or L | long int, long long int | long int, unsigned long int, long long int, unsigned long long int
Both u or U and l or L | unsigned long int, unsigned long long int | unsigned long int, unsigned long long int
ll or LL | long long int | long long int, unsigned long long int
Both u or U and ll or LL | unsigned long long int | unsigned long long int
z or Z | the signed integer type corresponding to std::size_t ([support.types.layout]) | the signed integer type corresponding to std::size_t, std::size_t
Both u or U and z or Z | std::size_t | std::size_t
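
The table above can be spot-checked with `decltype`, assuming (as the language rules state) that a literal takes the first listed type able to represent its value. This sketch needs C++17 for `std::is_same_v`; the checks that depend on the width of `int` are guarded by a hypothetical constant `int_is_32_bit`, and the `z` suffix checks are compiled only when the implementation advertises support for it.

```cpp
#include <cstddef>
#include <limits>
#include <type_traits>

// Small values: the suffix alone selects the type.
static_assert(std::is_same_v<decltype(42), int>);
static_assert(std::is_same_v<decltype(42u), unsigned int>);
static_assert(std::is_same_v<decltype(42L), long int>);
static_assert(std::is_same_v<decltype(42uL), unsigned long int>);
static_assert(std::is_same_v<decltype(42LL), long long int>);
static_assert(std::is_same_v<decltype(42uLL), unsigned long long int>);
static_assert(std::is_same_v<decltype(0x2A), int>);

// Assumption for the next two checks: int has 31 value bits and
// unsigned int has 32, which is common but not guaranteed.
constexpr bool int_is_32_bit =
    std::numeric_limits<int>::digits == 31 &&
    std::numeric_limits<unsigned int>::digits == 32;

// A hexadecimal literal may fall through to an unsigned type...
static_assert(!int_is_32_bit ||
              std::is_same_v<decltype(0xFFFFFFFF), unsigned int>);
// ...whereas an unsuffixed decimal literal only ever gets a signed type.
static_assert(!int_is_32_bit || std::is_signed_v<decltype(4294967295)>);

#ifdef __cpp_size_t_suffix
static_assert(std::is_same_v<decltype(42uz), std::size_t>);
static_assert(std::is_same_v<decltype(42z), std::make_signed_t<std::size_t>>);
#endif
```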
Encoding prefix | Kind | Type | Associated character encoding | Example
none | ordinary character literal | char | ordinary literal encoding | 'v'
none | non-encodable ordinary character literal | int | ordinary literal encoding | '\U0001F525'
none | ordinary multicharacter literal | int | ordinary literal encoding | 'abcd'
L | wide character literal | wchar_t | wide literal encoding | L'w'
u8 | UTF-8 character literal | char8_t | UTF-8 | u8'x'
u | UTF-16 character literal | char16_t | UTF-16 | u'y'
U | UTF-32 character literal | char32_t | UTF-32 | U'z'
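
The Type column above can be confirmed with `decltype`. This is a sketch under two assumptions: C++17 or later for `std::is_same_v`, and a `u8` check that is compiled only where `char8_t` is available. The non-encodable and multicharacter forms are left out because they are conditionally supported and commonly draw warnings.

```cpp
#include <type_traits>

static_assert(std::is_same_v<decltype('v'), char>);
static_assert(std::is_same_v<decltype(L'w'), wchar_t>);
static_assert(std::is_same_v<decltype(u'y'), char16_t>);
static_assert(std::is_same_v<decltype(U'z'), char32_t>);

#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype(u8'x'), char8_t>);
#endif

// With a UTF-16 or UTF-32 prefix the value is the code point itself
// whenever it fits in a single code unit.
static_assert(u'y' == 0x79);
static_assert(U'z' == 0x7A);
static_assert(U'\U0001F525' == 0x1F525);
```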
code point | character name | simple-escape-sequence
U+000a | line feed | \n
U+0009 | character tabulation | \t
U+000b | line tabulation | \v
U+0008 | backspace | \b
U+000d | carriage return | \r
U+000c | form feed | \f
U+0007 | alert | \a
U+005c | reverse solidus | \\
U+003f | question mark | \?
U+0027 | apostrophe | \'
U+0022 | quotation mark | \"
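
Each row of the table above can be checked at compile time. The sketch uses `char32_t` literals so that the comparison against the code point does not depend on the ordinary literal encoding, which is implementation-defined.

```cpp
// UTF-32 character literals hold the code point directly, so every
// simple escape sequence can be compared with its table entry.
static_assert(U'\n' == 0x000A);  // line feed
static_assert(U'\t' == 0x0009);  // character tabulation
static_assert(U'\v' == 0x000B);  // line tabulation
static_assert(U'\b' == 0x0008);  // backspace
static_assert(U'\r' == 0x000D);  // carriage return
static_assert(U'\f' == 0x000C);  // form feed
static_assert(U'\a' == 0x0007);  // alert
static_assert(U'\\' == 0x005C);  // reverse solidus
static_assert(U'\?' == 0x003F);  // question mark
static_assert(U'\'' == 0x0027);  // apostrophe
static_assert(U'\"' == 0x0022);  // quotation mark
```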
floating-point-suffix | type
none | double
f or F | float
l or L | long double
f16 or F16 | std::float16_t
f32 or F32 | std::float32_t
f64 or F64 | std::float64_t
f128 or F128 | std::float128_t
bf16 or BF16 | std::bfloat16_t
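
The first three rows of the table above can be verified directly. The sketch leaves out the `f16` through `bf16` suffixes on purpose: the extended floating-point types are optional, so a portable check for them would need feature-test guards that are not shown here.

```cpp
#include <type_traits>

static_assert(std::is_same_v<decltype(1.5), double>);
static_assert(std::is_same_v<decltype(1.5f), float>);
static_assert(std::is_same_v<decltype(1.5F), float>);
static_assert(std::is_same_v<decltype(1.5l), long double>);
static_assert(std::is_same_v<decltype(1.5L), long double>);

// The suffix applies to exponent and hexadecimal forms as well.
static_assert(std::is_same_v<decltype(1e3f), float>);
static_assert(std::is_same_v<decltype(0x1p4), double>);
```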
Encoding prefix | Kind | Type | Associated character encoding | Examples
none | ordinary string literal | array of n const char | ordinary literal encoding | "ordinary string" R"(ordinary raw string)"
L | wide string literal | array of n const wchar_t | wide literal encoding | L"wide string" LR"w(wide raw string)w"
u8 | UTF-8 string literal | array of n const char8_t | UTF-8 | u8"UTF-8 string" u8R"x(UTF-8 raw string)x"
u | UTF-16 string literal | array of n const char16_t | UTF-16 | u"UTF-16 string" uR"y(UTF-16 raw string)y"
U | UTF-32 string literal | array of n const char32_t | UTF-32 | U"UTF-32 string" UR"z(UTF-32 raw string)z"
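
To make "array of n const T" concrete, the sketch below inspects a few literals with `decltype` and `sizeof`. It assumes C++17 or later, and the `u8` check is compiled only where `char8_t` exists. A string literal is an lvalue, so `decltype` yields a reference to the array type, and n counts the terminating null character.

```cpp
#include <type_traits>

static_assert(std::is_same_v<decltype("abc"), const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"ab"), const wchar_t (&)[3]>);
static_assert(std::is_same_v<decltype(u"ab"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype(U"ab"), const char32_t (&)[3]>);

#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype(u8"ab"), const char8_t (&)[3]>);
#endif

// In a raw string literal the backslash is an ordinary character, so the
// escape is not interpreted and the array is one element longer.
static_assert(sizeof("a\nb") == 4);
static_assert(sizeof(R"(a\nb)") == 5);

// A delimiter (here x) lets the raw string contain )" without ending it.
static_assert(sizeof(R"x(a)"b)x") == 5);
```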