Why the gcc ASG is not source-level

At the bottom of this page is a list of source-level issues which can't be distinguished in the CPPX output.

An abstract semantic graph, such as CPPX reads and writes, consists of the syntactic structures of a program together with the semantic structures that give them meaning.

These are all represented by nodes and edges in a suitable kind of graph. For CPPX, they are ultimately GXL graphs.

Often the semantic structures don't need to be constructed explicitly: for the purposes of a compiler they can be identified with the syntax which defines them. For example, the meaning of a reference may be an edge in the graph from the use of a name to the definition of that same name.

A source-level ASG may be defined as one which wholly represents the original source code. A test for this property is: can the original source be reconstructed from the ASG? There are a few levels of answer to this question, depending on how much information has been left in the ASG by the compiler, and how much semantic information has been introduced to replace it:

  1. the source can be completely reconstructed byte-for-byte
  2. the source can be reconstructed except for spacing and layout (lexical normalization)
  3. the source can be reconstructed except for preprocessing (comment removal, macro replacement, source inclusions)
  4. the source can be reconstructed except for semantic normalization (constant folding, literal evaluation, explication of implicit code (such as constructor calls, implicit casts, etc))
  5. a computationally equivalent program can be constructed
The use of the gcc ASG puts CPPX at level 4 above, roughly speaking. This means that certain source-level distinctions are lost. Here is a list.

Source-Level Distinctions Erased by CPPX, With Examples

  1. Everything to do with layout and spacing is lost. For example,
    for (x = 1 ; 
           x < f. g (x);
           x = get_next (x, y) )
    will compile the same as
      for(x=1;x<f.g(x);x=get_next(x,y))
    Of course, spacing within quoted literals isn't lost.
  2. Initially everything to do with source inclusion (#include) is lost. For example,
    #include <stdio.h>
      int main () { printf ("hi"); }
    will compile the same as
      extern "C" {
      typedef unsigned int size_t;
      typedef void *__gnuc_va_list;
      typedef unsigned char __u_char;
      typedef unsigned short __u_short;
      typedef unsigned int __u_int;
      typedef unsigned long __u_long;
      ... etc etc
      int main () { printf ("hi"); }
    Later we will provide a representation of the source hierarchy, with inclusion relationships, and links from semantic objects to their source files. As long as gcc is the underlying analyser, though, it's unlikely that CPPX will be able to represent the exact source position of include directives.
  3. Everything to do with other preprocessing is lost. For example,
    #define ULT_ANS 42
      int zaphod = ULT_ANS;
    will compile the same as
      int zaphod = 42;
    (So macro-implemented API entry points will disappear in favour of their functional implementation. Embedded SQL is lost, too.)
  4. Comments are lost. (No need for an example.)
  5. Some enumerated type information is lost. Essentially, an enumerated type is identified with an unsigned subrange, and constants are introduced and initialized for the explicit enumerands. Hence
    enum Colour { WHITE = 3, GREEN, BLACK, BLANC = 3, VERT, NOIR };
      Colour x = WHITE;
    will compile the same as
      enum Colour { WHITE = 3, GREEN, BLACK = 5, BLANC = 3, VERT = 4, NOIR };
      Colour x = BLANC;
    or even as
     
      enum Colour { WHITE=3, GREEN=4, BLACK, BLANC=WHITE, VERT=GREEN, NOIR=BLACK };
      Colour x = (Colour) 3;
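    The erasure is operationally harmless, since the aliased enumerands already compare equal. A minimal check, using the Colour type from the example above (the main function and assertions are added here only for demonstration):

    ```cpp
    #include <cassert>

    // The enumerated type from the example above: BLANC, VERT and NOIR
    // merely alias the values of WHITE, GREEN and BLACK.
    enum Colour { WHITE = 3, GREEN, BLACK, BLANC = 3, VERT, NOIR };

    int main() {
        // Implicit successor values: GREEN = 4, BLACK = 5.
        assert(GREEN == 4 && BLACK == 5);
        // The aliases are indistinguishable by value from the originals.
        assert(WHITE == BLANC && GREEN == VERT && BLACK == NOIR);
        // Casting the underlying value back yields the same constant.
        Colour x = (Colour) 3;
        assert(x == WHITE && x == BLANC);
        return 0;
    }
    ```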
  6. Literals (syntax expressing constants of various types) are evaluated. This doesn't mean all kinds of constant folding; it means that when there are multiple ways to express the same constant value in context, the distinctions are lost. Escape sequences are lost, nonsignificant zeros (as in 12.300 or 0x0FF) are lost, precision specifications are lost, the representational base is lost, adjacent-string concatenation is lost. For example:
    int x = 0123;
      char *y = "this is "
                "a string";
      unsigned long z = (unsigned) -1;
      float a = 12.300;
      int b = 0x0FF;
      
      while (1) { ... }
      
    will compile the same as
      int x = 83;
      char *y = "this is a string";
      unsigned long z = 0xffffffff;
      float a = 12.3;
      int b = 0x0000FF;
      
      while (1) { ... }
      
    Note that precision is not lost, only the precision specification. The rules for interpreting integer constants in C++ are quite complicated; they can be summarized as "choose the most signed and the shortest meaning consistent with the value expressed". The literal 0xffffffffffffffff and the constant expression -1LL are indistinguishable at this level.
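    These erasures can be checked directly: differently spelled literals denote identical values, so a running program cannot tell them apart either. A small demonstration in standard C++ (the assertions are added only to make the equalities explicit):

    ```cpp
    #include <cassert>
    #include <cstring>

    int main() {
        // Octal spelling versus decimal: 0123 is 83.
        assert(0123 == 83);
        // Nonsignificant zeros carry no information.
        assert(0x0FF == 0x0000FF && 0x0FF == 255);
        // Adjacent string literals are concatenated during translation.
        const char *y = "this is " "a string";
        assert(std::strcmp(y, "this is a string") == 0);
        // 0xffffffffffffffff and -1LL compare equal: the usual arithmetic
        // conversions turn -1LL into the same unsigned value.
        assert(0xffffffffffffffff == -1LL);
        return 0;
    }
    ```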
  7. Some typecasts are lost. Alas. There are two kinds of type casts: elementary and constructive. Elementary typecasts applied to literals are lost as shown above. Constructive typecasts are replaced by constructor calls (which is what they mean), so that with:
    class Blat {
          float y;
      public:
          Blat (int x) { y = x; }
      };
    the following are indistinguishable:
      void foo (int y) { Blat x = (Blat) y; }
      void foo (int y) { Blat x = Blat (y); }
      void foo (int y) { Blat x = y; }
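    That the three spellings really coincide can be seen by giving Blat a way to observe the result. A minimal sketch; the public: access specifier and the value() accessor are additions for this demonstration, not part of the example above:

    ```cpp
    #include <cassert>

    class Blat {
        float y;
    public:                                // made public so the implicit conversion is legal
        Blat(int x) { y = x; }
        float value() const { return y; }  // accessor added only for checking
    };

    int main() {
        int n = 7;
        Blat a = (Blat) n;  // C-style cast
        Blat b = Blat(n);   // functional cast
        Blat c = n;         // implicit conversion
        // All three spellings invoke the same constructor Blat(int),
        // so the three objects are indistinguishable.
        assert(a.value() == 7.0f && b.value() == 7.0f && c.value() == 7.0f);
        return 0;
    }
    ```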
  8. Parenthesization is lost.  It is not represented in the Datrix model anyway.  So
    a = b + (c * d);
    will not be distinguished from
      a = b + c * d;
    Nor will
      a = (b + c);
    be distinguished from
      a = b + c;
  9. Some operators are lost when they have little or no operational effect. The following are indistinguishable:
      void blat ()
      {
              int a = 7;
              int b = 9;
              int c;

              c = (a,b);
              c = (+a);
              c = (1? a: b);
      }


    compiles just like

       
      void blat ()
      {
              int a = 7;
              int b = 9;
              int c;

              c = b;
              c = a;
              c = a;
      }
       

  10. The array dimension specification [0] means the same as [], and the syntactic distinction is lost.  Furthermore, initialized definitions with either [0] or [] are indistinguishable from initialized definitions with a positive upper bound.  This is because the compiler has computed a real upper bound from the dimension of the initializer. For example, the following variables compile the same way:
      int ar [] = {1, 2, 3, 4, 5, 6};
      int ar [0] = {1, 2, 3, 4, 5, 6};
      int ar [6] = {1, 2, 3, 4, 5, 6};
       
    For formal parameters, the same remarks apply; in addition, an array-typed parameter decays to a pointer, so that the following compile the same way:
       
      void f(int ar []);
      void f(int ar [0]);
      void f(int ar [6]);
      void f(int *ar);


    In the case of multidimensional arrays, only the leading (leftmost) dimension can be left empty; the rest must remain. Generalizing, the following are the same:

       
      int ar [] [3] = {1, 2, 3, 4, 5, 6};
      int ar [0] [3] = {1, 2, 3, 4, 5, 6};
      int ar [2] [3] = {1, 2, 3, 4, 5, 6};
       
    and also the following:
       
      void f(int ar [] [3]);
      void f(int ar [0] [3]);
      void f(int ar [2] [3]);
      void f(int (*ar) [3]);
       
    Note the parentheses in the last example; without them the meaning would be completely different, as an array of pointers instead of a pointer to an array.
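    Both effects can be observed with sizeof: the bound of an initialized [] definition is computed from its initializer, and a dimensioned formal parameter is just a pointer. A small check in standard C++ ([0] is omitted here because zero-length arrays are a gcc extension):

    ```cpp
    #include <cassert>
    #include <cstddef>

    // The compiler computes the bound 6 from the initializer.
    int ar[] = {1, 2, 3, 4, 5, 6};

    // A "dimensioned" formal parameter is really a pointer, so inside f
    // sizeof a is the size of a pointer, not of a six-element array.
    std::size_t f(int a[6]) { return sizeof a; }

    int main() {
        assert(sizeof ar / sizeof ar[0] == 6);
        assert(f(ar) == sizeof(int *));
        return 0;
    }
    ```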
  11. A body consisting of a null statement (a bare semicolon) is indistinguishable from an empty compound statement, so that
    for (x = y; x < n; x = f(x));
    compiles the same way as
      for (x = y; x < n; x = f(x)) {}
    and the same holds for if, while, do, and switch (but not for function bodies, of course).
  12. More to come, no doubt.
AJM 2001-04-4