The GD Binary Format

3. The GD Binary Format

The GD format is the format dumped by our modified version of the GNU C++ compiler and forms the basis of the transformation from the GCC Schema to the Datrix Schema. There were several requirements that fed the definition of the format:

Fast I/O. Since there would be a significant number of small transforms, and the format would be read and written for each of the transforms, the format had to have a low I/O overhead.
Compact. We were developing on Linux. One property of Linux is that you can write a file, read it, and erase it with little or no disk activity if you have enough memory (or conversely if the file is small enough for the memory you have).. A compact representation helps here.
Iteration over the graph must be easy. Most of the transforms target a particular type of node and ignore all other nodes. When a node of the type is found, some limited graph traversal may be done. For example, when structures are converted, the node representing the structure is converted, and the individual fields and methods of the structure are visited. Requiring the entire graph to be traversed to find structure definitions is less than efficient. This rules out most pointer / dynamic memory representations of the graph. This is probably the most important requirement.
The graph must be representable in C. The compiler itself is written in C and must be able to generate the initial graph. This eliminates the use of C++ class constructs.

The result is an array based representation where all the nodes in the graph are held in a single array. Pointers are represented as array indexes. This satisfies the first and third requirements. The I/O is fast since the entire array can be read and written with a single binary read or write statement. The pointers need not be encoded or reconstrucuted. Iterration is easy since all of the nodes of the graph can be visited by scanning the array. Another advantage is not paying the cost of dynamic memory allocation for each of the nodes, which can be expensive for large programs.

There are some disadvantages to the representation. The first is that there is no support for variable length nodes, which are used in the GCC graph. Another disadvantage is that if a transform runs out of free nodes in the array, then a new array must be allocated and the contents copied (2 x memory). The last disadvantage we mention here is that 0 is a valid pointer value (the first node in the array) and thus the null pointer is -1.

The graph also contains an integral hash table to represent the string values in the program. Implemented as two arrays, the hash table is also trivial to read and write to disk. String values in the program are also represented as integer array indexes.

3.1 General Graph Representation

The GD graph representation is defined in the file gd_graph.h, one of the new files for the modified compiler. This file defines the general graph representation, the definition for each of the node types and prototypes for some utiltity routines (hash table management, file I/O). The structure is defined as:

typedef struct {
/* string table stuff */
unsigned long stringBuffLen;
unsigned long curStringPos;
char * stringBuffer;

/* hash table */
/* hash table only stores location in string
* table. Therefore, no need for a complex structure */
unsigned long hashTableLen;
long *hashTable;

/* node table */
unsigned long nodeBuffLen;
unsigned long curNodePos;
node_ptr firstNode;
gd_graph_node * nodeBuffer;
} gd_graph;

The graph is represented as three arrays, implented as three pointers and unsigned long length fields in the main structure. The first of these arrays is the string buffer (stringBuffer and stringBuffLen). This array is used to store the contents of any strings in the program graph. Each string is stored once in the table, terminated by the null character. The strings are referred to in graph by the index of the first character of the string. This reduces string comparisons to a single integer comparison of the index of the string in the array. An extra index in the main structure keeps track of the first unallocated character in the array. Currently no garbage collection is performed on this array. Once a string is stored in the array, it is never removed.

The second array is the hash table. The hash table is mostly used when constructing the graph, and only occasionally during the transforms. The graph nodes point directly to the string buffer. The hash table is used to elimnate duplicate strings and to find the index of a string from its contents. For example, to find a node with the name "void", a hash table lookup will return the index of the string "void" in the string buffer, and and instances of void in the graph will have that index. If the lookup returns -1, then the string void is not in the string buffer, and thus is not in the graph.

The last array is the array of nodes. As with the string buffer, there is an extra index that indicates the limit of used nodes in the array. Unlike the string buffer, garbage collection of nodes is performed (by gd_filter) and there may be free nodes within the allocated part of the array. There is also an extra index that indicates the root of the graph (points to the global namespace in the graph).

3.2 Node Definitions

The definition of gd_graph_node is a union structure of all of the different variants of the nodes. Unlike the GCC internal graph definition in the compiler, we have a separate definition for each type of node, even when there is no extra information in the node. Thus a given strucrture in the GCC internal graph may be mapped to one of many different strucutres in the GD graph.

There is some subtyping in both the GCC and the Datrix node types. For example, a type declaration is a declaration which is a GCC node. This is implemented in the GD structure using nesting and overlyaing of the structures. A subclass is implemented by declaring the first field of the structure as the type of the superclass. For example, part of the declaraion of gd_graph_node is:

typedef union {
    gd_graph_common    common;
    ...
    t_gcc_decl         decl;
    t_namespace_decl   namespace_decl;
    ...

} gd_graph_node;

All gd nodes are subclasses of gd_graph_common which defines fields common to all nodes. On of these fields is the field type which give the type of the node.. The structure t_gcc_decl defines the fields common to all gcc declaration nodes as represented in GD. GCC namespaces are represented using the structure t_namespace_decl. The last two structures are defined as:

typedef struct {
gd_graph_common common;
node_ptr name;
node_ptr mngl;
node_ptr type;
node_ptr chan;
node_ptr scpe;
str_ptr srcf;
int srcl;
src_lang lang;
int artificial;
} t_gcc_decl;

typedef struct {
t_gcc_decl decl;
node_ptr dcls; /* C++ only */
node_ptr alis; /* C++ only */
} t_namespace_decl;

All three types are defined in the main graph union (as shown above). Thus the type field of a namespace_decl node may be refered to as:
   thegraph.nodeBuffer[theNode].common.type
or
   thegraph.nodeBuffer[theNode].decl.type
or
   thegraph.nodeBuffer[theNode].namespace_decl.decl.type

Every node has gd_graph_common as the ultimate superclass, so the type of any node can always be detemined using first code version (thegraph.nodeBuffer[theNode].common.type).

One consequence of this representation is that there are no variable length nodes. Most places where GCC uses variable ength nodes are not exported as part of the GD representation. The one place where it matters is in the encoding of class inheritance. In this case the dumping routines change the tree vector representation into a linked list using tree_list nodes.

The gd_graph_common structurer is defined as follows:

typedef struct {
enum gd_node_type type:16;
/* bit fields for simple definitions */
unsigned unsigned_flag: 1; /* 17 */
unsigned constant_flag: 1; /* 18 */
unsigned static_flag:   1; /* 19 */
unsigned public_flag:   1; /* 20 */
unsigned private_flag: 1; /* 21 */
unsigned protected_flag:1; /* 22 */ /* 10 bits left */
/* for use of conversion algorithm */
unsigned mark_flag1:    1; /* 23 */ /* 9 bits left */
unsigned mark_flag2:    1; /* 24 */ /* 8 bits left */
node_ptr other_node;
} gd_graph_common;

The first part of the common structure is almost identical to the GCC common structure. The main difference is that the node type field has bee expanded to 16 bits. The next 6 flags are the same as the flags in the GCC common structure and when the GD graph is dumped by the modified compiler, they are copied directly from the GCC internal structure. The last three fields are for use in the conversion. The mark flags are used for mark and sweep algorithms in the transforms, and the other_node field is a generic node pointer whose purpose changes during the transform. For example, GCC represents blocks of declarations or statments with a single edge to the first of the block, and using one field in each declaration or statement to build a linked list of the statements or declarations in the block. The Datrix model uses multiple edges. In the intermediate stage, a datrix node needs a next pointer when it is embedded in a GCC list that has not been converted yet. The other_node is used for that purpose in this case.

GCC edges were represented as pointers in the original internal graph and are represented as node_ptr fields (contain array indexes). That is the edges in the GCC graph are implemented as a reference from one GD node to another. Datrix edges may carry attributes and in addition, a given node may use the same relation (edge) to refer to more than one node. For example, the ArcSon edge is used to link the node representing a structure to the nodes representing each of the members (field and methods). We use gd_graph_nodes to represent Datrix edges. The nodes represenging datrix edges have a source and dest pointers in the node. Any attributes are fields of these nodes. When traversing the graph, we need a quick way to find the edges associated with a given node. Therefore the src node has a field pointing to the first of the edges, and the nodes representing the edge have next pointers linking them together. Figure 1 shows an example.

On the left is the logical view of the graph. We have an aggregate with two fields (the objects). Two ArcSon edges connect the aggregate node to its members. The edges are ordered (the number in parentheses). The right shows how the graph is represented in GD. Two nodes are allocated to represent the edges. The nodes have src and dest fields that give the src and dest nodes of the edge. In addition, the src node has a reference to the first of the edge nodes which are chained in a linked list. This permits easy graph traversal from parent to child (we don't have to scan the entire node array to find the edge nodes).

3.3 Future Improvements

There are several improvements that can be made to the representation. Most of these should be deferred untiil the project is more complete and a general rewrite is planned. They are:

Statement superclass for GCC. All of the statement classes are thier own structure and the next pointer for the statement chains are in different locations in the node. Thus gd_next is needed to find the next statement. An intermediate class that combines all the common elments of statments would simplify the code.
Add a general next pointer for Datrix nodes. When datrix declaraions or statements are generated partway through, the general other_node pointer is used to link them into the chain. A general next pointer should be included in the datrix node for that puirpose
The types of the atrributes are not uniform, and should be implemented with typedefs. The current exporters for gxl, ta and vcg are hand written. If each of the types for these notations were typedef'd to C types, then the exporters could be auto generated from the definition (see seciton 7).
This secion needs to be improved with some diagrams to make the approach more clear.

Prev: CPPX Architecture

Table of Contents

Next: : The Transformations