Why the gcc ASG is not source-level

At the bottom of this page is a list of source-level issues which can't be distinguished in the CPPX output.

An abstract semantic graph, such as CPPX reads and writes, consists of the syntactic structures of a program together with the semantic structures that give them meaning.

These are all represented by nodes and edges in a suitable kind of graph. For CPPX, they are ultimately GXL graphs.

Often the semantic structures don't need to be constructed explicitly: for the purposes of a compiler they can be identified with the syntax which defines them. For example, the meaning of a reference may be an edge in the graph from the use of a name to the definition of that same name.

A source-level ASG may be defined as one which wholly represents the original source code. A test for this property is: can the original source be reconstructed from the ASG? There are a few levels of answer to this question, depending on how much information has been left in the ASG by the compiler, and how much semantic information has been introduced to replace it:

  1. the source can be completely reconstructed byte-for-byte
  2. the source can be reconstructed except for spacing and layout (lexical normalization)
  3. the source can be reconstructed except for preprocessing (comment removal, macro replacement, source inclusions)
  4. the source can be reconstructed except for semantic normalization (constant folding, literal evaluation, explication of implicit code (such as constructor calls, implicit casts, etc))
  5. a computationally equivalent program can be constructed
The use of the gcc ASG puts CPPX at level 4 above, roughly speaking. This means that certain source-level distinctions are lost. Here is a list.

Source-Level Distinctions Erased by CPPX, With Examples

  1. Everything to do with layout and spacing is lost. For example,
    for (x = 1 ; 
           x < f. g (x);
           x = get_next (x, y) )
    will compile the same as
      for(x=1;x<f.g(x);x=get_next(x,y))
    Of course, spacing within quoted literals isn't lost.
  2. Initially everything to do with source inclusion (#include) is lost. For example,
    #include <stdio.h>
      int main () { printf ("hi"); }
    will compile the same as
      extern "C" {
      typedef unsigned int size_t;
      typedef void *__gnuc_va_list;
      typedef unsigned char __u_char;
      typedef unsigned short __u_short;
      typedef unsigned int __u_int;
      typedef unsigned long __u_long;
      ... etc etc
      int main () { printf ("hi"); }
    Later we will provide a representation of the source hierarchy, with inclusion relationships, and links from semantic objects to their source files. As long as gcc is the underlying analyser, though, it's unlikely that CPPX will be able to represent the exact source position of include directives.
  3. Everything to do with other preprocessing is lost. For example,
    #define ULT_ANS 42
      int zaphod = ULT_ANS;
    will compile the same as
      int zaphod = 42;
    (So macro-implemented API entry points will disappear in favour of their functional implementation. Embedded SQL is lost, too.)
  4. Comments are lost. (No need for an example.)
  5. Some enumerated type information is lost. Essentially, an enumerated type is identified with an unsigned subrange, and constants are introduced and initialized for the explicit enumerands. Hence
    enum Colour { WHITE = 3, GREEN, BLACK, BLANC = 3, VERT, NOIR };
      Colour x = WHITE;
    will compile the same as
      enum Colour { WHITE = 3, GREEN, BLACK = 5, BLANC = 3, VERT = 4, NOIR };
      Colour x = BLANC;
    or even as
     
      enum Colour { WHITE=3, GREEN=4, BLACK, BLANC=WHITE, VERT=GREEN, NOIR=BLACK };
      Colour x = (Colour) 3;
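    The erasure is operationally harmless, since the aliased enumerands already compare equal. A minimal check, using the Colour type from the example above (the main function and assertions are added here only for demonstration):

    ```cpp
    #include <cassert>

    // The enumerated type from the example above: BLANC, VERT and NOIR
    // merely alias the values of WHITE, GREEN and BLACK.
    enum Colour { WHITE = 3, GREEN, BLACK, BLANC = 3, VERT, NOIR };

    int main() {
        // Implicit successor values: GREEN = 4, BLACK = 5.
        assert(GREEN == 4 && BLACK == 5);
        // The aliases are indistinguishable by value from the originals.
        assert(WHITE == BLANC && GREEN == VERT && BLACK == NOIR);
        // Casting the underlying value back yields the same constant.
        Colour x = (Colour) 3;
        assert(x == WHITE && x == BLANC);
        return 0;
    }
    ```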
  6. Literals (syntax expressing constants of various types) are evaluated. This doesn't mean all kinds of constant folding; it means that when there are multiple ways to express the same constant value in context, the distinctions are lost. Escape sequences are lost, nonsignificant zeros (as in 12.300 or 0x0FF) are lost, precision specifications are lost, the representational base is lost, adjacent-string concatenation is lost. For example:
    int x = 0123;
      char *y = "this is "
                "a string";
      unsigned long z = (unsigned) -1;
      float a = 12.300;
      int b = 0x0FF;
      
      while (1) { ... }
      
    will compile the same as
      int x = 83;
      char *y = "this is a string";
      unsigned long z = 0xffffffff;
      float a = 12.3;
      int b = 0x0000FF;
      
      while (1) { ... }
      
    Note that precision is not lost, only the precision specification. The rules for interpreting integer constants in C++ are quite complicated; they can be summarized as "choose the most signed and the shortest meaning consistent with the value expressed". The literal 0xffffffffffffffff and the constant expression -1LL are indistinguishable at this level.
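    These erasures can be checked directly: differently spelled literals denote identical values, so a running program cannot tell them apart either. A small demonstration in standard C++ (the assertions are added only to make the equalities explicit):

    ```cpp
    #include <cassert>
    #include <cstring>

    int main() {
        // Octal spelling versus decimal: 0123 is 83.
        assert(0123 == 83);
        // Nonsignificant zeros carry no information.
        assert(0x0FF == 0x0000FF && 0x0FF == 255);
        // Adjacent string literals are concatenated during translation.
        const char *y = "this is " "a string";
        assert(std::strcmp(y, "this is a string") == 0);
        // 0xffffffffffffffff and -1LL compare equal: the usual arithmetic
        // conversions turn -1LL into the same unsigned value.
        assert(0xffffffffffffffff == -1LL);
        return 0;
    }
    ```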
  7. Some typecasts are lost. Alas. There are two kinds of type casts: elementary and constructive. Elementary typecasts applied to literals are lost as shown above. Constructive typecasts are replaced by constructor calls (which is what they mean), so that with:
    class Blat {
          float y;
      public:
          Blat (int x) { y = x; }
      };
    the following are indistinguishable:
      void foo (int y) { Blat x = (Blat) y; }
      void foo (int y) { Blat x = Blat (y); }
      void foo (int y) { Blat x = y; }
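    That the three spellings really coincide can be seen by giving Blat a way to observe the result. A minimal sketch; the public: access specifier and the value() accessor are additions for this demonstration, not part of the example above:

    ```cpp
    #include <cassert>

    class Blat {
        float y;
    public:                                // made public so the implicit conversion is legal
        Blat(int x) { y = x; }
        float value() const { return y; }  // accessor added only for checking
    };

    int main() {
        int n = 7;
        Blat a = (Blat) n;  // C-style cast
        Blat b = Blat(n);   // functional cast
        Blat c = n;         // implicit conversion
        // All three spellings invoke the same constructor Blat(int),
        // so the three objects are indistinguishable.
        assert(a.value() == 7.0f && b.value() == 7.0f && c.value() == 7.0f);
        return 0;
    }
    ```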
  8. Parenthesization is lost.  It is not represented in the Datrix model anyway.  So
    a = b + (c * d);
    will not be distinguished from
      a = b + c * d;
    Nor will
      a = (b + c);
    be distinguished from
      a = b + c;
  9. Some operators are lost when they have little or no operational effect. The following are indistinguishable:
      void blat ()
      {
              int a = 7;
              int b = 9;
              int c;

              c = (a,b);
              c = (+a);
              c = (1? a: b);
      }


    compiles just like

       
      void blat ()
      {
              int a = 7;
              int b = 9;
              int c;

              c = b;
              c = a;
              c = a;
      }
       

  10. The array dimension specification [0] means the same as [], and the syntactic distinction is lost.  Furthermore, initialized definitions with either [0] or [] are indistinguishable from initialized definitions with a positive upper bound.  This is because the compiler has computed a real upper bound from the dimension of the initializer. For example, the following variables compile the same way:
      int ar [] = {1, 2, 3, 4, 5, 6};
      int ar [0] = {1, 2, 3, 4, 5, 6};
      int ar [6] = {1, 2, 3, 4, 5, 6};
       
    For formal parameters, the same remarks apply; in addition, an array-typed parameter decays to a pointer, so that the following compile the same way:
       
      void f(int ar []);
      void f(int ar [0]);
      void f(int ar [6]);
      void f(int *ar);


    In the case of multidimensional arrays, only the leading (leftmost) dimension can be left empty; the rest must remain. Generalizing, the following are the same:

       
      int ar [] [3] = {1, 2, 3, 4, 5, 6};
      int ar [0] [3] = {1, 2, 3, 4, 5, 6};
      int ar [2] [3] = {1, 2, 3, 4, 5, 6};
       
    and also the following:
       
      void f(int ar [] [3]);
      void f(int ar [0] [3]);
      void f(int ar [2] [3]);
      void f(int (*ar) [3]);
       
    Note the parentheses in the last example; without them the meaning would be completely different, as an array of pointers instead of a pointer to an array.
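    Both effects can be observed with sizeof: the bound of an initialized [] definition is computed from its initializer, and a dimensioned formal parameter is just a pointer. A small check in standard C++ ([0] is omitted here because zero-length arrays are a gcc extension):

    ```cpp
    #include <cassert>
    #include <cstddef>

    // The compiler computes the bound 6 from the initializer.
    int ar[] = {1, 2, 3, 4, 5, 6};

    // A "dimensioned" formal parameter is really a pointer, so inside f
    // sizeof a is the size of a pointer, not of a six-element array.
    std::size_t f(int a[6]) { return sizeof a; }

    int main() {
        assert(sizeof ar / sizeof ar[0] == 6);
        assert(f(ar) == sizeof(int *));
        return 0;
    }
    ```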
  11. A body consisting of a null statement (a bare semicolon) is indistinguishable from an empty compound statement, so that
    for (x = y; x < n; x = f(x));
    compiles the same way as
      for (x = y; x < n; x = f(x)) {}
    and the same holds for if, while, do, and switch (but not for function bodies, of course).
  12. More to come, no doubt.
AJM 2001-04-4