| 1. | An implementation shall support input files
that are a sequence of UTF-8 code units (UTF-8 files). It may also support
an implementation-defined set of other kinds of input files, and,
if so, the kind of an input file is determined in
an implementation-defined manner
that includes a means of designating input files as UTF-8 files,
independent of their content.
If an input file is determined to be a UTF-8 file,
then it shall be a well-formed UTF-8 code unit sequence and
it is decoded to produce a sequence of Unicode
scalar values. A sequence of translation character set elements ([lex.charset]) is then formed
by mapping each Unicode scalar value
to the corresponding translation character set element. In the resulting sequence,
each pair of characters in the input sequence consisting of
U+000d carriage return followed by U+000a line feed,
as well as each
U+000d carriage return not immediately followed by a U+000a line feed,
is replaced by a single new-line character. For any other kind of input file supported by the implementation,
characters are mapped, in an
implementation-defined manner,
to a sequence of translation character set elements,
representing end-of-line indicators as new-line characters. |
| 2. | Each sequence comprising a backslash character (\)
immediately followed by
zero or more whitespace characters other than new-line followed by
a new-line character is deleted, splicing
physical source lines to form logical source lines. Only the last
backslash on any physical source line is eligible for being part
of such a splice. [Note 2: — end note]
A source file that is not empty and that (after splicing)
does not end in a new-line character
is processed as if an additional new-line character were appended
to the file. |
| 3. | The source file is decomposed into preprocessing
tokens ([lex.pptoken]) and sequences of whitespace characters
(including comments). New-line characters are
retained. Whether each nonempty sequence of whitespace characters other
than new-line is retained or replaced by one U+0020 space character is
unspecified. As characters from the source file are consumed
to form the next preprocessing token
(i.e., not being consumed as part of a comment or other forms of whitespace),
except when matching a
c-char-sequence,
s-char-sequence,
r-char-sequence,
h-char-sequence, or
q-char-sequence,
universal-character-names are recognized ([lex.universal.char]) and
replaced by the designated element of the translation character set ([lex.charset]). The process of dividing a source file's
characters into preprocessing tokens is context-dependent. [Example 1: — end example] |
| 4. | Preprocessing directives ([cpp]) are executed, macro invocations are
expanded ([cpp.replace]), and _Pragma unary operator expressions are executed ([cpp.pragma.op]). A #include preprocessing directive ([cpp.include]) causes the named header or
source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted. Whitespace characters separating preprocessing tokens are no longer significant. |
| 5. | For a sequence of two or more adjacent string-literal preprocessing tokens,
a common encoding-prefix is determined
as specified in [lex.string]. Each such string-literal preprocessing token is then considered to have
that common encoding-prefix. |
| 6. | Each preprocessing token is converted into a token ([lex.token]). |
| 7. | The tokens constitute a translation unit and
are syntactically and
semantically analyzed as a translation-unit ([basic.link]) and
translated. [Note 3: The process of analyzing and translating the tokens can occasionally
result in one token being replaced by a sequence of other
tokens ([temp.names]). — end note]
It is
implementation-defined
whether the sources for
module units and header units
on which the current translation unit has an interface
dependency ([module.unit], [module.import])
are required to be available. [Note 4: Source files, translation
units and translated translation units need not necessarily be stored as
files, nor need there be any one-to-one correspondence between these
entities and any external representation. The description is conceptual
only, and does not specify any particular implementation. — end note]
[Note 5: Previously translated translation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example)
calls to functions whose names have external or module linkage,
manipulation of variables whose names have external or module linkage, or
manipulation of data files. — end note]
While the tokens constituting translation units
are being analyzed and translated,
required instantiations are performed. [Note 6: This can include
instantiations which have been explicitly
requested ([temp.explicit]). — end note]
The contexts from which instantiations may be performed
are determined by their respective points of instantiation ([temp.point]). [Note 7: Other requirements in this document can further constrain
the context from which an instantiation can be performed. For example, a constexpr function template specialization
might have a point of instantiation at the end of a translation unit,
but its use in certain constant expressions could require
that it be instantiated at an earlier point ([temp.inst]). — end note]
Each instantiation results in new program constructs. The program is ill-formed if any instantiation fails. During the analysis and translation of tokens,
certain expressions are evaluated ([expr.const]). Constructs appearing at a program point P are analyzed
in a context where each side effect of evaluating an expression E
as a full-expression is complete if and only if
[Example 2:
class S {
  class Incomplete;
  class Inner {
    void fn() {
      /* */ Incomplete i; // OK
    }
  }; /* */
  consteval {
    define_aggregate(^^Incomplete, {});
  }
}; /* */
— end example] |
| 8. | Translated translation units are combined, and
all external entity references are resolved ([basic.link]). Library
components are linked to satisfy external references to
entities not defined in the current translation. All such translator
output is collected into a program image which contains information
needed for execution in its execution environment. |
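
The effect of some of the earlier phases can be observed from ordinary code. The following is a minimal sketch, assuming a C++17 (or later) compiler; the names `SPLICED_VALUE`, `spliced_sum`, `munch`, and `joined` are hypothetical and chosen only for illustration. It shows line splicing (phase 2), the longest-match formation of preprocessing tokens (phase 3), and a sequence of adjacent string literals sharing a common encoding-prefix (phase 5).

```cpp
#include <string_view>

// Phase 2: the backslash and the new-line that follows it are deleted, so the
// two physical lines of this macro definition form one logical line.
#define SPLICED_VALUE 40 + \
                      2

constexpr int spliced_sum() {
    return SPLICED_VALUE;  // expands to 40 + 2
}

// Phase 3: each preprocessing token is the longest sequence of characters that
// can form one, so `x+++y` is split as `x ++ + y`, not as `x + ++ y`.
constexpr int munch(int x, int y) {
    return x+++y;  // parsed as (x++) + y
}

// Phase 5: the two adjacent string literals get the common encoding-prefix L,
// and the result is treated as a single wide string literal ([lex.string]).
constexpr std::wstring_view joined() {
    return L"trans" "lation";
}
```

Calling `spliced_sum()` yields 42, `munch(1, 2)` yields 3, and `joined()` compares equal to `L"translation"`. Because all three functions are `constexpr`, the same checks could also be written as `static_assert` declarations.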