TA: TUPLE-ATTRIBUTE LANGUAGE

AN INTRODUCTION TO TA:
THE TUPLE-ATTRIBUTE LANGUAGE
(Draft)

R. C. Holt
6 Mar 97, 9 Mar 97, 12 Mar 97, 20 Mar 97, 24 Mar 97
Department of Computer Science
University of Toronto
27 Feb 97
holt@cs.toronto.edu

TA, the Tuple-Attribute language, is intended to allow convenient recording of information about certain types of graphs. This information includes (1) nodes and edges in the graph, and (2) attributes of these nodes and edges. The information in a TA "program" can be interpreted to be a special kind of data base, a graph data base (GDB), where the relations (corresponding to the edges) are all binary. This is the class of graphs formalized as a calculus of relations by Tarski [1955]; see also [Holt 1997].

The impetus for designing TA comes from the need for a convenient format for recording, manipulating and diagramming information that represents the structure of large computer programs. However, the design of TA is not restricted to this application. While TA is convenient for recording how to draw graphs, by encoding positions and sizes as attributes, TA is not uniquely a "drawing language".

TA can serve as a "data interchange" language, used for exchanging information among a set of software tools. For example, parsers can extract "facts" from source language programs and can represent these in TA. These facts in turn can be manipulated to create new "views", also expressed in TA, which in turn can be displayed on a screen by a tool called a renderer. The Grok interpreter [Holt 97] is an example of a tool which manipulates these facts.

The TA language is divided into two sub-languages, namely, the Tuple and the Attribute sub-languages. TA is also a two level language, one level for recording "facts" (an actual graph with its attributes) and another level for recording the general scheme for the graph (the classes of edges and of attributes).

The next two sections introduce the Tuple and Attribute sub-languages, at the "fact" level.

THE TUPLE SUB-LANGUAGE

Here is a "program", written in the Tuple subset of the TA language, which represents nodes P, Q and V, with node P having a Call edge to Q, and node Q having a Ref edge to V:

    // Two edges in the Tuple language.
    Call P Q
    Ref Q V

In each triple, such as "Call P Q", we can think of P as the subject, Call as the verb and Q as the object; this could mean procedure P calls procedure Q. (This syntax for triples was invented in the Rigi Project [Muller 93], where it is called RSF, for Rigi Standard Form. We have extended this notation in various ways, as described in this note. The commenting convention, with comments starting with //, is borrowed from C++.)

The graph defined by this Tuple program can be drawn as

    P  ------->  Q  ------->  V
         Call         Ref

P, Q and V are called entities or nodes. The term "entity" is most natural when considering that the information in the Tuple program constitutes a data base. The term "node" is most natural when we alternately consider that the information in the Tuple program represents a graph.

The two tuples represent the edges (P, Q) and (Q, V) in the relations Call and Ref, respectively. The term "tuple" or alternately "fact" is natural to use when we are thinking about a database. The term "edge" is natural when we are thinking about a graph.

We could also have other nodes, such as main.c or "New York" and other relations, such as Declare or is-owner-of. The syntax of TA allows entities and relations to be named by (to have identifiers as) any string that contains no white space (or is quoted).

TA's concept of a graph is sometimes called a multigraph or a typed graph or a colored graph, which means that there can be more than one type (color) of edge. Between two nodes, there can be more than one edge, given that the edges are of different types. In this example, there are two types of edges, named Call and Ref. We will not use the term "type", but will instead say that there are two "relations", Call and Ref. In this example Call contains one edge (P, Q) and Ref contains one edge (Q, V).

ASIDE. Besides interpreting a Tuple program as a graph or a data base, we can also interpret it is a Prolog program; Ullman [88] gives the interpretation of data bases as Prolog (really, as DataLog) programs. For example, the tuple "Call P Q" would be written in Prolog as Call(P, Q), where it would mean that predicate Call is true for arguments P and Q. In the Prolog interpretation, the term "fact" might be used instead of "tuple" or "edge".

To summarize what has been said so far, the Tuple language has a syntax, namely, a sequence of triples, and it also has a semantics. The semantics is that each triple is interpreted as an edge in a typed graph. Nodes in this graph are distinct if they have distinct names. (Later we will explain that the nodes can be divided into "entity classes", analogous to the way in which edges are divided into separate relations.) Beware that we will use the terms "node" and "entity" interchangeably and the terms "tuple", "triple", and "edge" interchangeably.

Parsers developed at the University of Victoria [Muller 93] in the Rigi project read source code written in a language such as C and produce output triples that record the relations among program objects such as procedures and variables.

There is a "tuple calculator" called Grok [Holt 97] that manipulates the information in Tuple programs. For example, using Grok one can carry out operations such as finding the union of two relations (by combining their tuples), finding composition of two relations (this produces a new relation and is analogous to a "join"), finding the transitive closure of a relation, etc. The operations in Grok are analogous to the SQL operations in a relational data base.

THE ATTRIBUTE SUB-LANGUAGE

Each node and edge can have attributes. For example, here are color attributes for the nodes and edges in the previous example written in the Attribute sub-language:

    // Colors for nodes P, Q and V
    P { color = red }
    Q { color = yellow }
    V { color = green }
    
    // Colors for the two edges
    (Call P Q) { color = black }
    (Ref Q V)  { color = gray }

Here are further attributes for the three nodes that record positions and sizes for diagramming the graph:

    // (x,y) positions and (width,height) sizes for nodes P, Q and V
    P { x =  50  y = 100 width = 70 height = 30 }
    Q { x = 150  y = 100 width = 70 height = 30 }
    V { x = 250  y = 100 width = 70 height = 30 }

Here is another attribute giving the description of node P:

    // Description for node P
    P { description = "This is a procedure named P" }

As you can see from these examples, an attribute for a node or edge is given by writing the name of the node, such as P, or writing the edge in the form (Call P Q), then a left brace {, then a list of items of the form "name-of-attribute = attribute-value", all followed by a right brace }.

The syntax of TA is free form, allowing comments to appear anywhere, blank lines to appear anywhere, and "tokens", such as names and special characters such as { and = to be separated by white space, such as blanks and ends of lines. For example, we could have written the (x,y) and size attributes for P this way:

    P { x = 50  y = 100
        width = 70
        height = 30 }

The examples of attributes that we have given in this section are suitable for recording how to draw our example graph. For example, node P would be drawn in color red, with its top left corner at (x,y) location (50, 100), with a width and height of 70 and 30 pixels respectively. This interpretation is not inherent in the TA language, but rather is a convention used by certain renderers (automatic picture drawers) that use TA to give the format for their input.

The TA language itself does not place any particular interpretation upon the attributes. Rather, like a data base, TA simply serves as a means for recording these values. However, a set of tools, such as the Landscape tools [Mancoridis 95] for drawing software architecture diagrams, uses the convention we have illustrated here for recording color, x, y, width and height information. At the University of Toronto, a set of software tools has been produced which reads and manipulates TA programs that use this convention. For example, there is an layout program that reads in tuples and automatically produces the x and y information to describe a diagram in which there are minimal crossings of edges. (At present, these programs require that colors must be recorded as RGB values such as (.5 .2 .8) instead of names such as "red".)

It is possible to have more than one setting of an attribute, for example, we might have:

    P { color = pink }

appearing later in our TA program. The most recent (the final) setting of the attribute is the one that is the actual value. This convention provides a convenient means for making small changes (such as highlighting a node by changing its color) to an existing graph.

To keep things simple, we have only shown simple attributes whose values are strings. A later section explains that attributes can be nested and can have lists as values.

We use attributes to attach values to entities, such as red to P, when the value can be thought of as being "local" to the entity. We think of P's set of attributes as a record that "belongs" to P or is contained in P.

THE SCHEME LEVEL OF THE TA LANGUAGE

Up to this point we have introduced the "fact" level of the TA language, which is used to describe a graph and its attributes. We will now introduce the "meta" or "scheme" level of TA, which is used to describe the general "shape" of a graph. The scheme level of TA uses essentially the same syntax as the fact level, but now the notation is used to record higher level information; in particular, the scheme level describes what is commonly known as an E/R Diagram [Chen 77]. (Note: a "scheme" is sometimes called "schema".)

In the example we have given, there were two types (classes) of edges, namely, Call and Ref edges. There were also two types (classes) of nodes, namely, procedure nodes (P and Q) and variable nodes (V).

We can record these conventions as follows

    Call Proc Proc
    Ref Proc Var

What these two lines mean is that procedures Call procedures and procedures Ref (reference) variables. This implies that there are Call and Ref relations and that there are Proc and Var entity classes. (Note that an "edge class" is called a "relation".) In other words, an actual graph can have Call edges from Procs to Procs as well as Ref edges from Procs to Vars, but no other edges.

When combining fact level and scheme level notation in the same file, we separate the file into one of four possible sections, with these headers:

    SCHEME TUPLE :      // Gives E/R description of allowed edges 
    SCHEME ATTRIBUTE :  // Gives names of allowed attributes
    FACT TUPLE :        // Gives edges in actual graph
    FACT ATTRIBUTE :    // Gives attributes for actual graph

In general, a scheme, described by its scheme tuples and its scheme attributes, would be used for more than one actual (fact level) graph.

In our example, we would add the following to our "fact" level TA program (to a FACT TUPLE section):

    $INSTANCE P Proc
    $INSTANCE Q Proc
    $INSTANCE V Var

This means that P and Q are in the Proc class of entities and V is in the Var class of entities.

For very simple graphs, in which there is only one class of entity, such as "programming language items", and any relation can connect any two entities, it is not necessary to describe the scheme, because there is a default scheme, called the "Universal Scheme", for this case. The Universal Scheme has the property that any arbitrary set of tuples is allowed by it.

We have given an example of the Tuple sub-language at the scheme level (as in "Call Proc Proc"). We will now give an example of the Attribute sub-language at the scheme level. In our example, procedures (Proc's) can have six attributes: x, y, width, height, color and description. We can record this scheme information as follows:

    Proc { x y width height
           color = red
           description }

We have "declared" that a Proc entity can have these six attributes. We have also "declared" that there is a default value of the color attribute which is "red". This means that if any Proc entity does not explicitly have its color attribute set, it is taken to be red.

ASIDE. E/R Diagrams are richer than TA's schemes in that E/R Diagrams support n-ary relations while TA supports only binary relations such as Call and Ref.

INHERITANCE IN TA

Among entity classes (and relations) there may be many common characteristics. In our example, all entities have the same attributes, whether they were Proc's or Var's. Similarly, the two edge classes (relations) both have a color attribute.

We can use inheritance to avoid re-writing such common attributes, as is done here at the scheme level:

    $INHERIT Proc ProgItem
    $INHERIT Var ProgItem

This says that a Proc "is a" ProgItem, as is a Var. We declare the common attributes of program items (ProgItem's) this way:

    ProgItem { x y width height
           color
           description }
    Proc { color = red 
           numberOfCalls = 0 }
    Var  { Color = green }

These attribute declarations allow procedure and variable entities to have the six attributes of program items. As well procedures have a default color of red and a numberOfCalls attribute that is initially 0. Variables have a default color of green.

There is a predefined entity class called $ENTITY that is the inheritance root for entity classes; in other words, all entity classes implicitly inherit from $ENTITY. Instead of using ProgItem we could have used $ENTITY with the same effect in the following way:

    $ENTITY { x y width height
           color
           description }
    Proc { color = red 
           numberOfCalls = 0 }
    Var  { Color = green }

Since every entity class, such as Proc and Var, inherits from $ENTITY, we do not need to add lines such as "$INHERIT Proc $ENTITY".

There is a predefined relation called $RELATION that is the inheritance root for relations; in other words, all relations implicitly inherit from $RELATION. We can specify that all relations have color and weight attributes as follows:

    ($RELATION) { color weight }
    (Call) { color = black }

We have also specified that the default color for Call edges is black.

In a typed graph, there is at most one edge of a given type between two nodes. If we wish to represent more than one edge of a given type, e.g., more than one call from P to Q, we can use an attribute such as weight to record this information. For example, the following can record the fact that there are three calls from procedure P to procedure Q:

    (Call P Q) { weight = 3 }

When one entity class inherits from another, this has two affects. First, the inheriting class gets all the attributes of the inherited-from class. Second, any connectivity allowed for the inherited-from entity class is also allowed by the inheriting class. For example, this line

 
    partOf progItem progItem

specifies that any progItem (any Proc or Var) can be "part of" any other progItem.

As another example,

 
    $RELATION $ENTITY $ENTITY

allows any entity (every entity descends from $ENTITY) to be connected by any relation (every relations descends from $RELATION). This rule is implicitly used in the Universal Scheme in which any pair of entities can be connected by edges of any relations.

In TA, inheritance is a notational convenience, but it adds no new power beyond what was in TA without inheritance. It simply allows us to write less by factoring common scheme information into an inheritance ancestor. It is always possible to write out the connectivity and attributes explicitly rather than using inheritance and get the same effect.

A particular item, such as an entity class, can have its attributes given several times, as in:

    Proc { color description }
    Proc { color weight }

The result is the combination of the given attributes; in this case these two lines are equivalent to:

   Proc { color description weight }

The only possible incompatibility from combining attributes of two inherited attributes would be if two attributes had the same name and one named a nested set of attributes and the other named a scalar attribute; this problem and its solution is discussed in the section on attribute nesting and attribute lists.

Multiple inheritance is allowed. For example, a Proc could inherit from both a Scope and a Unit. Multiple inheritance combines the attributes from its various inherited-from classes. Two inheritances of the same attribute, say color, are taken to be a single attribute of that name. The only possible incompatibility when combining attributes is the same as the problem previously described, in which an item's attributes are given several times.

SUMMARY OF SUB-LANGUAGES OF TA

There are two languages that make up TA, the Tuple and the Attribute sub-languages. As well, there are two levels of each of these, the fact level and the scheme level. We illustrate this idea in this diagram.



                      Tuple          Attribute
                 +---------------+----------------+
 Levels:         |               |                |
         Scheme  |  Allowed      |  Allowed       |
                 |  edges        |  attributes    |
                 |               |                |
                 +---------------+----------------+
                 |               |                |
         Fact    |  Edges of     |  Attributes of |
                 |  actual graph |  actual graph  |
                 |               |                |
                 +---------------+----------------+

The syntax used for scheme tuples and for fact tuples is identical, and the syntax used for scheme attributes and for fact attributes is essentially identical. At the scheme level we "declare" each entity class and each relation as well as the attributes allowed for each of these. At the fact level, we "populate" a graph, by giving the graph's nodes, edges and attributes.

SCHEME CONFORMANCE

The purpose of a scheme is to capture certain invariant structural properties of a fact-level graph (or set of graphs). For example, it is handy to state, in a single place, the kinds of nodes and edges in a graph, its various attributes and the allowed connectivity among nodes. In turn, it is helpful to be able to test to see if a (fact level) graph actually conforms to its scheme. If not, presumably there is something wrong with the graph (or possibly with the scheme). In other words, the scheme provides a certain amount of redundancy that can help us find errors in particular graphs. Put another way, a scheme encodes a set of constraints that we expect our graph(s) to satisfy. Yet another way to put this is that a scheme provides a set of declarations for a graph, and the graph itself makes assignments (creates structure) to populate the declarations.

If the fact level graph R corresponds to a particular scheme S, we say that F "conforms" to S (or satisfies S, or is legal according to S). Sometimes there is a commonly used scheme S, in which case we say simply F is conformant, or legal, without mentioning S.

In a standard relational data base (an RDB), there is a scheme and the corresponding (instance) data base contains tuples that corresponding to the scheme. In the usual implementations of RDBs, it is not possible to add a tuple to the data base unless the tuple corresponds to the scheme. This in turn implies that the data base can never fail to conform to its scheme.

Things are not this simple in TA. In a TA program, the programmer can write anything he or she feels like. If he violates the TA's syntax, then technically, the program has no meaning at all. If the program satisfies TA's syntax, then the program may or may not be conformant. If it is, this situation is like that in an RDB, namely, the fact level graph actually satisfies its scheme. If not, we can still consider that the fact level graph exists, as does the scheme, even though the fact level graph "breaks the rules" of its scheme. As explained in the following section, there is always at least one scheme, the Universal Scheme, that every graph satisfies. We can repair (or write a program to repair) a program so it satisfies its scheme, by deleting violating constructs.

THE UNIVERSAL SCHEME

A scheme defines those relations, nodes and attributes that are allowed in a corresponding actual graph. A scheme can be given explicitly in terms of the TA language. A scheme is optional in that the Universal Scheme US can be used. The Universal Scheme assumes:

(a) Any node can be attached to any other node. This is equivalent to this tuple in an explicit scheme:

            $RELATION $ENTITY $ENTITY

(b) Any name of a node that appears in a fact tuple, such as A and B in "Rel A B", is implicitly considered to be an instance of $ENTITY, as if the following occurred explicitly as a tuple in a scheme:

            $INSTANCE A $ENTITY
            $INSTANCE B $ENTITY

(c) Any attribute is allowed for any node or edge or relation. This is as if the corresponding attribute were given explicitly in a schemes attributes.

Note that the graph for a scheme can itself be considered to be a fact level graph, by considering that its scheme is the Universal Scheme.

ASIDE. As we have defined the Universal Scheme, it is actually an infinite set of schemes; each fact level graph effectively generates its own private scheme. We could alternately have defined the Universal Scheme as one which contains all possible relations, nodes and attributes, but this would make it infinite, and we have chosen to avoid that complication.

ATTRIBUTE NESTING

An attribute may contain sub-attributes, i.e., attributes can be nested. As an example, suppose a special set of layout tools needs a private way of specifying how to diagram a graph. The following example gives a "layout" attribute for procedure P which contains sub-attributes x1, y1, x2 and y2:

    P { layout { x1 = 200 y1 = 150 x2 = 240 y2 = 180 } }

The corresponding scheme for these attributes would be:


    Proc { layout { x1 y1 x2 y2 } }

Nesting can be to any level.

Sometimes, it is convenient for a scheme to allow arbitrary names for attributes. For example, we might want a menu attribute containing any arbitrary sub-attributes. To this we could write this in the scheme:

    Proc { menu { $NOTE } }

The $NOTE pseudo attribute allows any subattributes to be given, as in:

    P { menu { Copy = c "Choose box" = b Quit = q } }

These attributes, with their arbitrary names, are sometimes called "notes" or "annotations".

If a scheme allows an attribute, but does not explicitly specify that the attribute contains nested attributes, then the attribute can only have values which are not nested attributes. For example, if this is written in a scheme:

    Proc { color }

then Proc's such as P can have color set this way:

    P { color = red }

but not this way

    P { color { hue = bright texture = blurry } }

When a scheme repeatedly specifies allowed attributes, as in

    Proc { color }
    Proc { layout { x1 y1 x2 y2 } }
    Proc { layout { hue } }

these are combined, in this example giving the equivalent of

    Proc { color layout { x1 y1 x2 y2 hue } }

The only combination that is illegal in TA is if one part of the combination specifies that an attribute contains nesting and the other specifies no nesting, as in:

    Proc { color }             // Implies non-nesting in color
    Proc { color { R G B } }   // Illegal combination of attributes

Similarly the only case of illegal multiple inheritance is this possible collision of requiring and prohibiting nesting of attributes.

ATTRIBUTE VALUES THAT ARE LISTS

The value of an non-nested attribute can either be a string, such as green in

    P { color = green }

or a value list as in

    P { color = ( 0.5 0.2 0.4 ) }

A value list is a sequence of zero or more values (strings or possibly sub-lists) surround by parentheses (). Note that lists of lists are allowed.

APPENDIX A: GRAMMAR FOR TA

The grammar for TA is as follows:


  tupleLanguage ::= { stringToken stringToken stringToken }

  attributeLanguage ::=
            itemId "{" {attributeSetting} "}"

  attributeSetting ::=
            attributeId
          | attributeId "=" attributeValue
          | attributeId "{" {attributeSetting} "}"  // Nested attributes
            
  attributeValue ::=
            stringToken
          | "(" {attributeValue} ")"       // Nested lists
            
  itemId ::=  stringToken        // Fact entity or, in scheme, entity class
          | "(" stringToken ")"  // Relation (edge class)
          | "(" stringToken stringToken stringToken ")"    // Actual edge
            
  TALanguage ::= {section}

  section ::=   
            SCHEME TUPLE : tupleLanguage
          | SCHEME ATTRIBUTE : attributeLanguage
          | FACT TUPLE : tupleLanguage
          | FACT ATTRIBUTE : attributeLanguage

A stringToken is any blank separated or double-quote terminated string. These stringTokens have predefined meanings: $ENTITY, $RELATION, $INSTANCE, $INHERIT and $NOTE.

APPENDIX B: ADDENDUM

String tokens may have their lengths limited by some tools; for example, relations names, entity class names, entity names and attributes names may be limited to 255 characters. Tools that use or accept attributes values are expected to handle attribue string values that are very long, e.g., thousands of characters long.

Within a quoted string, internal quotes must be preceded by a back slashes, as in "Bob says \"Hi\"".

Several fact assignments are allowed to a given attribute, with the final assignment being the effective one (previous ones being effectively ignored). Note that there is no way in TA to add to an already existing list (although of course a particluar tool could read a TA program and modify it such that new items are effectively added to a list).

There is a "include" directive, so you can build up a TA program out of a set of files. The format of this directive is:

   INCLUDE nameOfIncludedFileInQuotes :

APPENDIX C: DECLARATION BEFORE USE

It is possible to have several scheme and fact sections catenated to create a large TA program. Suppose a fact level item, such as procedure P, is stated to be an instance of a scheme class, such as Proc. A later appearance of Proc in the program, defining new attributes or connectivity for Proc's such as P, may have anomolous results (or anomalous semantics), because it is not entirely clear if P should or should not have the properties given by the later appearance of Proc. To avoid this possible, confusion, we will assume DBU (declaration before use), which means that we will consider it to be illegal if a class is modified after instances of it have been specified in the TA program.

REFERENCES

[Chen 77] The Entity-Relationship Approach to Database Design, by Peter Chen, QED Information Sciences Inc., 170 Linden St., Wellesley, Mass., 1977, 1991

[Holt 1997] "Binary Relational Algebra Applied to Software Architecture", by R. C. Holt. CSRI Tech Report 345, University of Toronto, March 1996.

[Mancoridis 95] "Extending Programming Environments to Support Architectural Design", by S. Mancoridis and R. C. Holt. In the Proceedings of the Seventh International Workshop on Computer-Aided Software Engineering, Toronto, Canada, July, 1995.

[Muller 93] A Reverse Engineering Approach to Subsystem Identification, by Hausi A. Muller, O.A. Mehmet, S.R. Tilley, and J.S. Uhl, Software Maintenance and Practice, Vol 5, 181-204, 1993.

[Tarski 1955] On the Calculus of Relations, by Tarski, A., J. Symb. Log. 6, 3, 1941, 73-89.

[Ullman 88] Principles of Database and Knowledge-Base Systems (Volume I), by J. D. Ullman, Computer Science Press, New York, New York, 1988.

AN INTRODUCTION TO TA: THE TUPLE-ATTRIBUTE LANGUAGE (Draft)