\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@ifinfo
@dircategory Programming
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@c @finalout
@title Cpplib Internals
@subtitle Last revised Jan 2001
@subtitle for GCC version 3.0
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@page

@node Top, Conventions,, (DIR)
@chapter Cpplib - the core of the GNU C Preprocessor

The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
now implemented as a library, cpplib, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective C front ends.  It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

This library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents some of the internals of cpplib, and a few
tricky issues encountered.  It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.

Identifiers, macro expansion, hash nodes, lexing.

@menu
* Conventions::	    Conventions used in the code.
* Lexer::	    The combined C, C++ and Objective C Lexer.
* Whitespace::      Input and output newlines and whitespace.
* Concept Index::   Index of concepts and terms.
* Index::           Index.
@end menu

@node Conventions, Lexer, Top, Top

cpplib has two interfaces - one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @samp{cpphash.h}.  Functions and types exposed to external
clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.

We are striving to reduce the information exposed in @samp{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behaviour.

@node Lexer, Whitespace, Conventions, Top

The lexer is contained in the file @samp{cpplex.c}.  We want to have a
lexer that is single-pass, for efficiency reasons.  We would also like
the lexer to only step forwards through the input files, and not step
back.  This will make future changes to support different character
sets, in particular state or shift-dependent ones, much easier.

This file also contains all information needed to spell a token, i.e. to
output it either in a diagnostic or to a preprocessed output file.  This
information is not exported, but made available to clients through such
functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminates a comment.  Moreover, you cannot be sure
there is just one - there might be an arbitrarily long sequence of them.

So the routine @samp{parse_identifier}, which lexes an identifier, cannot
assume that it can scan forwards until the first non-identifier
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.  Similarly
for the routine that handles numbers, @samp{parse_number}.  If these
routines stumble upon a @samp{?} or @samp{\}, they call
@samp{skip_escaped_newlines} to skip over any potential escaped newlines
before checking whether they can finish.
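
The scanning problem described above can be illustrated with a minimal
sketch.  This is not the cpplib code: the function names ending in
@samp{_sketch} are hypothetical, the input is assumed to be a
NUL-terminated string, and only @samp{\n} is treated as a newline.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a cpplib function.  Skip an arbitrarily
   long run of escaped newlines starting at P.  An escaped newline is
   a backslash, or the three-character backslash trigraph, followed
   directly by '\n'.  Returns P unchanged if there is none.  */
const char *
skip_escaped_newlines_sketch (const char *p)
{
  for (;;)
    {
      const char *q = p;

      if (q[0] == '\\')
        q += 1;
      else if (q[0] == '?' && q[1] == '?' && q[2] == '/')
        q += 3;
      else
        return p;

      if (*q != '\n')
        return p;               /* Not an escaped newline after all.  */
      p = q + 1;
    }
}

/* Copy the identifier starting at SRC into DST (capacity CAP bytes,
   including the terminating NUL), splicing out escaped newlines.
   Returns a pointer to the first character not consumed.  */
const char *
lex_identifier_sketch (const char *src, char *dst, size_t cap)
{
  size_t n = 0;

  assert (cap > 0);
  for (;;)
    {
      /* A '\' or '?' need not end the identifier: it may start an
         escaped newline, so look past it before deciding.  */
      if (*src == '\\' || *src == '?')
        src = skip_escaped_newlines_sketch (src);
      if (!(isalnum ((unsigned char) *src) || *src == '_'))
        break;
      if (n + 1 < cap)
        dst[n++] = *src;
      src++;
    }
  dst[n] = '\0';
  return src;
}
```
@end verbatim

Given the characters @samp{fo}, backslash, newline, @samp{o},
@samp{??/}, newline, @samp{bar+}, this sketch stores @samp{foobar} and
stops at the @samp{+}.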

Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  These cases use the function @samp{get_effective_char},
which returns the first character after any intervening escaped newlines.
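
As a rough illustration of the idea, and not the actual
@samp{get_effective_char}, the sketch below decides between @samp{+},
@samp{+=} and @samp{++}.  The enumeration and both function names are
assumptions made for this example, and only @samp{\n} is treated as a
newline.

@verbatim
```c
#include <assert.h>

/* Hypothetical stand-in for the lexer's token kinds.  */
enum tok_sketch { TOK_PLUS, TOK_PLUS_EQ, TOK_PLUS_PLUS };

/* Return the first character at or after P that is not part of an
   escaped newline, and store its position in *POS.  */
char
effective_char_sketch (const char *p, const char **pos)
{
  for (;;)
    {
      const char *q = p;

      if (q[0] == '\\')
        q += 1;
      else if (q[0] == '?' && q[1] == '?' && q[2] == '/')
        q += 3;                 /* The backslash trigraph.  */
      else
        break;
      if (*q != '\n')
        break;
      p = q + 1;
    }
  *pos = p;
  return *p;
}

/* P points at a '+'.  Classify the token and set *END to the first
   character after it.  */
enum tok_sketch
lex_plus_sketch (const char *p, const char **end)
{
  const char *at;
  char c;

  assert (*p == '+');
  c = effective_char_sketch (p + 1, &at);
  if (c == '=' || c == '+')
    {
      *end = at + 1;
      return c == '=' ? TOK_PLUS_EQ : TOK_PLUS_PLUS;
    }
  *end = p + 1;                 /* Leave what follows for the next token.  */
  return TOK_PLUS;
}
```
@end verbatim

With this sketch a @samp{+}, two escaped newlines and a @samp{=} are
reported as one @samp{+=} token, whereas @samp{+ 1} yields a plain
@samp{+}.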

The lexer needs to keep track of the correct column position,
including counting tabs as specified by the @samp{-ftabstop=} option.
This should be done even within comments; C-style comments can appear in
the middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
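
A minimal sketch of such column tracking follows.  It assumes 1-based
columns and the conventional rule that a tab advances to the next
multiple of the tab stop width; the function names are hypothetical and
the exact rule cpplib applies may differ.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based column of the character that follows C, given
   that C itself sits at 1-based column COL.  A tab moves to the next
   tab stop; every other character advances by one.  */
unsigned int
next_column_sketch (unsigned int col, char c, unsigned int tabstop)
{
  assert (col >= 1 && tabstop >= 1);
  if (c == '\t')
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}

/* Return the 1-based column of LINE[OFFSET].  Every character before
   it is counted, including any that lie inside a comment.  */
unsigned int
column_of_sketch (const char *line, size_t offset, unsigned int tabstop)
{
  unsigned int col = 1;
  size_t i;

  for (i = 0; i < offset && line[i] != '\0'; i++)
    col = next_column_sketch (col, line[i], tabstop);
  return col;
}
```
@end verbatim

Because the comment characters are counted like any others, text that
follows a comment on the same line still gets the column a user would
see in an editor configured with the same tab width.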

Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@samp{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon lexer state.  For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @samp{parse_identifier} makes use of flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behaviour is whilst
parsing header names.  Normally, a @samp{<} would be lexed as a single
token.  After a @samp{#include} directive, though, it should begin a
header name, lexed as a single token extending as far as the nearest
@samp{>} character.  Note that
we don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.
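
The state-dependent behaviour can be sketched as below.  The function
names and the @samp{after_include} flag are assumptions made for this
example, and falling back to a lone @samp{<} when no terminator is found
is a simplification, not necessarily what cpplib does.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* P points at the '<' or '"' that opens a header name.  Return the
   length of the header-name token including both delimiters, or 0 if
   the line ends before a terminator is seen.  No escapes are
   honoured: the first terminator ends the name.  */
size_t
header_name_length_sketch (const char *p)
{
  char term;
  size_t i;

  assert (*p == '<' || *p == '"');
  term = (*p == '<') ? '>' : '"';
  for (i = 1; p[i] != '\0' && p[i] != '\n'; i++)
    if (p[i] == term)
      return i + 1;
  return 0;
}

/* P points at a '<'.  Return the length of the token that starts
   there: a whole header name if AFTER_INCLUDE is set and a '>' is
   found, otherwise just the '<' itself.  */
size_t
lex_angle_sketch (const char *p, int after_include)
{
  size_t n;

  assert (*p == '<');
  n = after_include ? header_name_length_sketch (p) : 0;
  return n != 0 ? n : 1;
}
```
@end verbatim

The same text @samp{<stdio.h>} therefore yields a 9-character token when
the flag is set and a 1-character @samp{<} token when it is not.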

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective C, and on the revision of the standard in
force.  For example, @samp{@@foo} is a single identifier token in
Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or
C++.  Such cases are handled in the main function @samp{_cpp_lex_token},
based upon the flags set in the @samp{cpp_options} structure.

Note we have almost, but not quite, achieved the goal of not stepping
backwards in the input stream.  Currently @samp{skip_escaped_newlines}
does step back, though with care it should be possible to adjust it so
that this does not happen.  For example, one tricky case arises when we
meet a trigraph while the command line option @samp{-trigraphs} is not in
force but @samp{-Wtrigraphs} is: we need to warn about it, but then
buffer it and continue to treat it as 3 separate characters.

@node Whitespace, Concept Index, Lexer, Top

The lexer has been written to treat each of @samp{\r}, @samp{\n},
@samp{\r\n} and @samp{\n\r} as a single newline indicator.  This allows
it to transparently preprocess MS-DOS, Macintosh and Unix files without
their needing to pass through a special filter beforehand.
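
The four newline forms can be recognised with a one-character lookahead,
as in the following sketch.  The function names are hypothetical and
this is not the code of @samp{handle_newline}; it only illustrates the
rule stated above on a NUL-terminated string.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* If P points at a newline indicator, return its length in
   characters (1 or 2); otherwise return 0.  A '\r' and a '\n' that
   are adjacent, in either order, form a single indicator.  */
size_t
newline_length_sketch (const char *p)
{
  if (p[0] != '\r' && p[0] != '\n')
    return 0;
  if ((p[1] == '\r' || p[1] == '\n') && p[1] != p[0])
    return 2;
  return 1;
}

/* Count the newline indicators in TEXT.  */
unsigned int
count_newlines_sketch (const char *text)
{
  unsigned int lines = 0;

  while (*text != '\0')
    {
      size_t n = newline_length_sketch (text);

      if (n > 0)
        {
          lines++;
          text += n;
        }
      else
        text++;
    }
  return lines;
}
```
@end verbatim

Two identical characters in a row, such as @samp{\n\n}, count as two
newlines in this sketch, since only a mixed pair is merged.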

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline forms by whitespace
only (one or more space, tab, form-feed, vertical tab or NUL characters),
as an intended escaped newline.  The library issues a diagnostic in this
case.
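
A simplified sketch of this check is given below.  The function name
and the @samp{had_space} out-parameter are assumptions made for this
example.  It handles only the plain backslash with a @samp{\n} newline,
and it omits NUL from the whitespace set because the input is a
NUL-terminated string.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* If P points at a backslash that is separated from a '\n' by
   nothing but spaces, tabs, form feeds and vertical tabs, return a
   pointer just past the newline and set *HAD_SPACE to nonzero when
   whitespace intervened, which is the case a caller would diagnose.
   Otherwise return NULL and leave *HAD_SPACE untouched.  */
const char *
match_escaped_newline_sketch (const char *p, int *had_space)
{
  const char *q;

  if (*p != '\\')
    return NULL;
  q = p + 1;
  while (*q == ' ' || *q == '\t' || *q == '\f' || *q == '\v')
    q++;
  if (*q != '\n')
    return NULL;
  *had_space = (q != p + 1);
  return q + 1;
}
```
@end verbatim

A caller would typically warn when @samp{had_space} is set and then
continue lexing from the returned position as if the whitespace had not
been there.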

Handling newlines in this way is made simpler by doing it in one place
only.  The function @samp{handle_newline} takes care of all newline
characters, and @samp{skip_escaped_newlines} takes care of all escaping
of newlines, deferring to @samp{handle_newline} to handle the newlines
themselves.

@node Concept Index, Index, Whitespace, Top
@unnumbered Concept Index
@printindex cp

@node Index,, Concept Index, Top
@unnumbered Index of Directives, Macros and Options
@printindex fn

@contents
@bye