cpp.info: Initial processing


Initial processing

   The preprocessor performs a series of textual transformations on its
input.  These happen before all other processing.  Conceptually, they
happen in a rigid order, and the entire file is run through each
transformation before the next one begins.  GNU CPP actually does them
all at once, for performance reasons.  These transformations correspond
roughly to the first three "phases of translation" described in the C
standard.
  1. The input file is read into memory and broken into lines.
     GNU CPP expects its input to be a text file, that is, an
     unstructured stream of ASCII characters, with some characters
     indicating the end of a line of text.  Extended ASCII character
     sets, such as ISO Latin-1 or Unicode encoded in UTF-8, are also
     acceptable.  Character sets that are not strict supersets of
     seven-bit ASCII will not work.  We plan to add complete support
     for international character sets in a future release.
     Different systems use different conventions to indicate the end of
     a line.  GCC accepts the ASCII control sequences `LF', `CR LF',
     `CR', and `LF CR' as end-of-line markers.  The first three are the
     canonical sequences used by Unix, by DOS and VMS, and by the classic
     Mac OS (before OS X), respectively.  You may therefore safely copy
     source code written on any of those systems to a different one and
     use it without conversion.  (GCC may lose track of the current
     line number if a file doesn't consistently use one convention, as
     sometimes happens when it is edited on computers with different
     conventions that share a network file system.)  `LF CR' is
     included because it has been reported as an end-of-line marker
     under exotic conditions.
     If the last line of any input file lacks an end-of-line marker,
     the end of the file is considered to implicitly supply one.  The C
     standard says that this condition provokes undefined behavior, so
     GCC will emit a warning message.
  2. If trigraphs are enabled, they are replaced by their corresponding
     single characters.
     These are nine three-character sequences, all starting with `??',
     that are defined by ISO C to stand for single characters.  They
     permit obsolete systems that lack some of C's punctuation to use
     C.  For example, `??/' stands for `\', so '??/n' is a character
     constant for a newline.  By default, GCC ignores trigraphs, but if
     you request a strictly conforming mode with the `-std' option, then
     it converts them.
     Trigraphs are not popular and many compilers implement them
     incorrectly.  Portable code should not rely on trigraphs being
     either converted or ignored.  If you use the `-Wall' or
     `-Wtrigraphs' options, GCC will warn you when a trigraph would
     change the meaning of your program if it were converted.
     In a string constant, you can prevent a sequence of question marks
     from being confused with a trigraph by inserting a backslash
     between the question marks.  "(??\?)" is the string `(???)', not
     `(?]'.  Traditional C compilers do not recognize this idiom.
     The nine trigraphs and their replacements are

          Trigraph:    ??(  ??)  ??<  ??>  ??=  ??/  ??'  ??!  ??-
          Replacement:   [    ]    {    }    #    \    ^    |    ~

  3. Continued lines are merged into one long line.
     A continued line is a line which ends with a backslash, `\'.  The
     backslash is removed and the following line is joined with the
     current one.  No space is inserted, so you may split a line
     anywhere, even in the middle of a word.  (It is generally more
     readable to split lines only at white space.)
     The trailing backslash on a continued line is commonly referred to
     as a "backslash-newline".
     If there is white space between a backslash and the end of a line,
     that is still a continued line.  However, as this is usually the
     result of an editing mistake, and many compilers will not accept
     it as a continued line, GCC will warn you about it.
  4. All comments are replaced with single spaces.
     There are two kinds of comments.  "Block comments" begin with `/*'
     and continue until the next `*/'.  Block comments do not nest:
          /* this is /* one comment */ text outside comment
     "Line comments" begin with `//' and continue to the end of the
     current line.  Line comments do not nest either, but it does not
     matter, because they would end in the same place anyway.
          // this is // one comment
          text outside comment
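   As a rough illustration of steps 1 and 2 in the list above, the
following sketch rewrites the four accepted end-of-line sequences as a
single `LF' and then replaces the nine trigraphs.  The function names
`normalize_newlines' and `replace_trigraphs' are hypothetical, and the
code assumes a NUL-terminated buffer that is edited in place; it is not
how GNU CPP is actually implemented.

```c
#include <string.h>

/* Hypothetical helper: rewrite LF, CR LF, CR and LF CR as a single LF.
   Works in place on a NUL-terminated buffer. */
void normalize_newlines(char *s)
{
    char *out = s;
    while (*s != '\0') {
        char c = *s++;
        if (c == '\r' || c == '\n') {
            /* Swallow the second half of a two-character marker. */
            if ((*s == '\r' || *s == '\n') && *s != c)
                s++;
            c = '\n';
        }
        *out++ = c;
    }
    *out = '\0';
}

/* Hypothetical helper: replace each trigraph with its single character.
   Works in place on a NUL-terminated buffer. */
void replace_trigraphs(char *s)
{
    static const char from[] = "()<>=/'!-";   /* third trigraph character */
    static const char to[]   = "[]{}#\\^|~";  /* matching replacement */
    char *out = s;
    while (*s != '\0') {
        const char *hit;
        if (s[0] == '?' && s[1] == '?' && s[2] != '\0'
            && (hit = strchr(from, s[2])) != NULL) {
            *out++ = to[hit - from];
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

   Under these assumptions, a buffer holding `(???)' becomes `(?]',
because the scan moves one character at a time and the last two question
marks pair up with the closing parenthesis.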
   It is safe to put line comments inside block comments, or vice versa.
     /* block comment
        // contains line comment
        yet more comment
      */ outside comment
     // line comment /* contains block comment */
   But beware of commenting out one end of a block comment with a line
comment.
      // l.c.  /* block comment begins
         oops! this isn't a comment anymore */
   Comments are not recognized within string literals.  "/* blah */" is
the string constant `/* blah */', not an empty string.
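   The comment rules described so far can be sketched in a few lines of
C.  The function below is a simplified illustration, not GNU CPP's own
code: the name `strip_comments' is hypothetical, and it assumes that
line endings are already `LF', that continued lines are already merged,
and that the caller supplies an output buffer at least as large as the
input.

```c
/* Hypothetical helper: copy src to dst, replacing each comment with a
   single space.  dst must have room for strlen(src) + 1 bytes. */
void strip_comments(const char *src, char *dst)
{
    while (*src != '\0') {
        if (src[0] == '/' && src[1] == '*') {
            /* Block comment: ends at the first closing delimiter, so
               block comments cannot nest. */
            src += 2;
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src != '\0')
                src += 2;
            *dst++ = ' ';
        } else if (src[0] == '/' && src[1] == '/') {
            /* Line comment: runs up to, but not including, the newline. */
            while (*src != '\0' && *src != '\n')
                src++;
            *dst++ = ' ';
        } else if (*src == '"' || *src == '\'') {
            /* String and character constants are copied verbatim, so
               comment markers inside them are left alone. */
            char quote = *src;
            *dst++ = *src++;
            while (*src != '\0' && *src != quote && *src != '\n') {
                if (*src == '\\' && src[1] != '\0')
                    *dst++ = *src++;   /* keep the escape together */
                *dst++ = *src++;
            }
            if (*src == quote)
                *dst++ = *src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

   With this sketch, the nested-looking block comment shown earlier
collapses to one space followed by ` text outside comment', and the
string constant "/* blah */" passes through unchanged.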
   Line comments are not in the 1989 edition of the C standard, but they
are recognized by GCC as an extension.  In C++ and in the 1999 edition
of the C standard, they are an official part of the language.
   Since these transformations happen before all other processing, you
can split a line mechanically with backslash-newline anywhere.  You can
comment out the end of a line.  You can continue a line comment onto the
next line with backslash-newline.  You can even split `/*', `*/', and
`//' onto multiple lines with backslash-newline.  For example:

     /\
     *
     */ # /*
     */ defi\
     ne FO\
     O 10\
     20

is equivalent to `#define FOO 1020'.  All these tricks are extremely
confusing and should not be used in code intended to be readable.
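   To see why such splits are harmless to the preprocessor, it may help
to look at a minimal sketch of line merging.  The function name
`splice_lines' is hypothetical and the code is only an illustration: it
assumes a NUL-terminated buffer with `LF' line endings, and it follows
GCC's behavior of still treating a backslash followed by white space and
a newline as a continued line.

```c
/* Hypothetical helper: delete every backslash-newline in place, joining
   each continued line with the line that follows it. */
void splice_lines(char *s)
{
    char *out = s;
    while (*s != '\0') {
        if (*s == '\\') {
            /* Allow stray blanks between the backslash and the newline. */
            char *p = s + 1;
            while (*p == ' ' || *p == '\t')
                p++;
            if (*p == '\n') {
                s = p + 1;      /* drop backslash, blanks and newline */
                continue;
            }
        }
        *out++ = *s++;
    }
    *out = '\0';
}
```

   Applied to the seven-line example above, this leaves two block
comments around a `#', followed by ` define FOO 1020'.  Once the
comments are replaced with spaces, only the directive remains.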
   There is no way to prevent a backslash at the end of a line from
being interpreted as a backslash-newline.  For example,

     "foo\\
     bar"

is equivalent to `"foo\bar"', not to `"foo\\bar"'.  To avoid having to
worry about this, do not use the deprecated GNU extension which permits
multi-line strings.  Instead, use string literal concatenation:

"foo\\" "bar"

Your program will be more portable this way, too.
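   As a small check of the concatenation idiom, the following sketch
splits a string across two source lines and compares it with the same
string written on one line.  The identifiers `joined', `single' and
`literals_match' are made up for the illustration.

```c
#include <string.h>

/* Adjacent string literals are concatenated by the compiler, so the
   literal can be split across lines without any backslash-newline. */
static const char joined[] =
    "foo\\"
    "bar";

/* The same seven characters written as one literal. */
static const char single[] = "foo\\bar";

/* Hypothetical helper: returns 1 when the two spellings are identical. */
int literals_match(void)
{
    return strcmp(joined, single) == 0;
}
```

   Both arrays hold the characters `foo', one backslash, and `bar'.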