TA: The Tuple-Attribute Language

TA: The Tuple Attribute Language

Ric Holt, Department of Computer Science, University of Waterloo
27 Feb 97, Updated March 97, May 97, Nov 98, Jan 2002, May 2002, July 2002

The Tuple Sub-Language
The Attribute Sub-Language
The Scheme Level of the TA Language
Inheritance in TA
Summary of Sub-Languages of TA
Scheme Conformance
The Universal Scheme
Attribute Values That Are Lists
Grammar for TA
String Tokens in TA
Maximum String Lengths, etc.
Relations with special meanings
References

TA, the Tuple-Attribute language, is intended to allow convenient recording of information about certain types of graphs. This information includes

nodes and edges in the graph, and
attributes of these nodes and edges.

The information in a TA "program" can be interpreted to be a special kind of data base, a graph data base (GDB), where the relations (corresponding to the edges) are all binary. This is the class of graphs formalized as a calculus of relations by [Tarski 1955]; see also [Holt 1997].

The impetus for designing TA comes from the need for a convenient format for recording, manipulating and diagramming information that represents the structure of large computer programs. However, the design of TA is not restricted to this application. While TA is convenient for recording how to draw graphs, by encoding positions and sizes as attributes, TA is not uniquely a "drawing language".

TA can serve as a "data interchange" language, used for exchanging information among a set of software tools. For example, parsers can extract "facts" from source language programs and can represent these in TA. These facts in turn can be manipulated to create new "views", also expressed in TA, which in turn can be displayed on a screen by a tool called a renderer. The Grok interpreter [Holt 97] is an example of a tool which manipulates these facts.

The TA language is divided into two sub-languages, namely, the Tuple and the Attribute sub-languages. TA is a two level language, one level for recording "facts" (an actual graph with its attributes) and another level for recording the general scheme for the graph (the classes of edges and of attributes). Within the Tuple parts, facts are represented in RSF notation.

[top]

The Tuple Sub-Language

Here is a part of a "TA program", written in the Tuple subset of the TA language, which represents nodes P, Q and V, with node P having a Call edge to Q, and node Q having a Ref edge to V:

    // Two edges in the Tuple language.
    Call P Q
    Ref Q V

In each triple, such as "Call P Q", we can think of P as the subject, Call as the verb and Q as the object; this could mean procedure P calls procedure Q. (This syntax for triples was invented in the Rigi Project [Muller 93], where it is called RSF, for Rigi Standard Form. We have extended this notation in various ways, as described in this note. The commenting convention, with comments starting with //, is borrowed from C++.) The graph defined by this Tuple program can be drawn as

    P  ------->  Q  ------->  V
         Call         Ref

P, Q and V are called entities or nodes. The term "entity" is most natural when considering that the information in the Tuple program constitutes a database. The term "node" is most natural when we alternately consider that the information in the Tuple program represents a graph.

The two tuples represent the edges (P, Q) and (Q, V) in the relations Call and Ref, respectively. The term "tuple" or alternately "fact" is natural to use when we are thinking about a database. The term "edge" is natural when we are thinking about a graph.

We could also have other nodes, such as main.c or "New York" and other relations, such as Declare or is-owner-of. The syntax of TA allows entities and relations to be named by (to have identifiers as) any string that contains no white space (or is quoted).

TA's concept of a graph is sometimes called a multigraph or a typed graph or a colored graph, which means that there can be more than one type (color) of edge. Between two nodes, there can be more than one edge, given that the edges are of different types. In our example, there are two types of edges, named Call and Ref. We will not use the term "type", but will instead say that there are two "relations", Call and Ref. In this example Call contains one edge (P, Q) and Ref contains one edge (Q, V).

Aside. Besides interpreting a Tuple program as a graph or a database, we can also interpret it is a Prolog program; Ullman [88] gives the interpretation of data bases as Prolog (really, as DataLog) programs. For example, the tuple "Call P Q" would be written in Prolog as Call(P, Q), where it would mean that predicate Call is true for arguments P and Q. In the Prolog interpretation, the term "fact" is used instead of "tuple" or "edge".

To summarize what has been said so far, the Tuple language has a syntax, namely, it is written has a sequence of triples, and it also has a semantics. The semantics is that each triple is interpreted as an edge in a typed graph. Nodes in this graph are distinct if they have distinct names. (Later we will explain that the nodes can be divided into "entity classes", analogous to the way in which edges are divided into separate relations.) In the rest of this note, we will use the terms "node" and "entity" interchangeably and the terms "tuple", "triple", and "edge" interchangeably.

[top]

The Attribute Sub-Language

Each node and edge can have attributes. For example, here are color attributes for the nodes and edges in the previous example, written in the Attribute sub-language:

    // Colors for nodes P, Q and V
    P { color = red }
    Q { color = yellow }
    V { color = green }
    
    // Colors for the two edges
    (Call P Q) { color = black }
    (Ref Q V)  { color = gray }

Here are further attributes for the three nodes that record positions and sizes for diagramming the graph:

    // (x,y) positions and (width,height) sizes for nodes P, Q and V
    P { x =  50  y = 100 width = 70 height = 30 }
    Q { x = 150  y = 100 width = 70 height = 30 }
    V { x = 250  y = 100 width = 70 height = 30 }

Here is another attribute giving the description of node P:

    // Description for node P
    P { description = "This is a procedure named P" }

As you can see from these examples, an attribute for a node or edge is given by writing the name of the node, such as P, or writing the edge in the form (Call P Q), then a left brace {, then a list of items of the form "name-of-attribute = attribute-value", all followed by a right brace }.

The syntax of TA is free form, allowing comments to appear anywhere, blank lines to appear anywhere, and "tokens", such as names and special characters such as { and = to be separated by white space, such as blanks and ends of lines. For example, we could have written the (x,y) and size attributes for P this way:

    P { x = 50  y = 100
        width = 70
        height = 30 }

The examples of attributes that we have given in this section are suitable for recording how to draw our example graph. For example, node P would be drawn in color red, with its top left corner at (x,y) location (50, 100), with a width and height of 70 and 30 pixels respectively. This interpretation is not inherent in the TA language, but rather is a convention used by certain renderers (automatic picture drawers) that use TA to give the format for their input. The TA language itself does not place any particular interpretation upon the attributes. Rather, like a database, TA simply serves as a means for recording these values.

We use attributes to attach values to entities, such as red to P, when the value can be thought of as being "local" to the entity. We think of P's set of attributes as a record that "belongs" to P or is contained in P.

[top]

The Scheme Level of the TA Language

Up to this point we have introduced the "fact" level of the TA language, which is used to describe a graph and its attributes. We will now introduce the "meta" or "scheme" level of TA, which is used to describe the general "shape" of a graph. The scheme level of TA uses essentially the same syntax as the fact level, but now the notation is used to record higher level information; in particular, the scheme level describes what is commonly known as an E/R Diagram [Chen 77]. (Note: a "scheme" is commonly called a "schema". In this section we will use the term "scheme" because that term is part of the TA syntax.)

In the example we have given, there were two types (classes) of edges, namely, Call and Ref edges. There were also two types (classes) of nodes, namely, procedure nodes (P and Q) and variable nodes (V).

We can record these conventions as follows

    Call Proc Proc
    Ref Proc Var

What these two lines mean is that procedures Call procedures and procedures Ref (reference) variables. This implies that there are Call and Ref relations and that there are Proc and Var entity classes. (Note that an "edge class" is called a "relation".) In other words, an actual graph can have Call edges from Procs to Procs as well as Ref edges from Procs to Vars, but no other edges.

When combining fact level and scheme level notation in the same file, we separate the file into one of four possible sections, with these four headers:

    SCHEME TUPLE :      // Gives E/R description of allowed edges 
    SCHEME ATTRIBUTE :  // Gives names of allowed attributes
    FACT TUPLE :        // Gives edges in actual graph
    FACT ATTRIBUTE :    // Gives attributes in actual graph

In the usual case, a scheme, described by its scheme tuples and its scheme attributes, would be used for more than one actual (fact level) graph. In our example, we would add the following to our "fact" level TA program (to a FACT TUPLE section):

    $INSTANCE P Proc
    $INSTANCE Q Proc
    $INSTANCE V Var

This means that P and Q are in the Proc class of entities and V is in the Var class of entities. We can consider that these are "declarations" for P, Q and V.

We have given an example of the Tuple sub-language at the scheme level (as in "Call Proc Proc"). We will now give an example of the Attribute sub-language at the scheme level. In our example, procedures (Proc's) can have six attributes: x, y, width, height, color and description. We can record this scheme information as follows:

    Proc { x y width height
           color = red
           description }

We have "declared" that a Proc entity can have these six attributes. We have also "declared" that there is a default value of the color attribute, which is "red". This means that if any Proc entity is not explicitly given a color attribute set, it is taken to be red.

Rules for attributes values. First, consider attributes given in the FACT section. Such an attribute must have an explicit value, so "color = blue" is acceptable, but simply "color" is not acceptable. Such a fact attribute can be given only if there is a corresponding attribute in the SCHEME section. Second, consider attributes written in the SCHEME section. It is optional whether a value is given for such an attribute, so in the scheme, either "color" or "color = red" is acceptable. If the value is not given, the attribute signifies that a value can be given (but is not required to be given) for corresponding attributes in the FACT section. If the scheme gives a value, as in "color = red", this provides the default value for corresponding fact attributes. This default can be overridden by explicitly providing a value in the FACT section.

Aside. E/R Diagrams are richer than TA's schemes in that E/R Diagrams support n-ary relations while TA supports only binary relations such as Call and Ref.

[top]

Inheritance in TA

Among entity classes (and relations) there may be many common characteristics. In our example, all entities have the same attributes, whether they were Proc's or Var's. Similarly, the two edge classes (relations) both have a color attribute.

We can use inheritance to avoid re-writing such common attributes, as is done here at the scheme level:

    $INHERIT Proc ProgItem
    $INHERIT Var ProgItem

This says that a Proc "is a" ProgItem, as well, Var "is a" ProgItem. We declare the common attributes of program items (ProgItem's) this way:

    ProgItem { x y width height
           color
           description }
    Proc { color = red 
           numberOfCalls = 0 }
    Var  { Color = green }

These attribute declarations allow procedure and variable entities to have the six attributes of program items. As well procedures have a default color of red and a numberOfCalls attribute that is initially 0. Variables have a default color of green.

There is a predefined entity class called $ENTITY which is the inheritance root for entity classes; in other words, all entity classes implicitly inherit from $ENTITY. Instead of using ProgItem we could have used $ENTITY with the same effect, as in the following way:

    $ENTITY { x y width height
           color
           description }
    Proc { color = red 
           numberOfCalls = 0 }
    Var  { Color = green }

Since every entity class, such as Proc and Var, inherits from $ENTITY, we do not need to add lines such as "$INHERIT Proc $ENTITY". There is a predefined relation called $RELATION that is the inheritance root for relations; in other words, all relations implicitly inherit from $RELATION. We can specify that all relations have color and weight attributes as follows:

    ($RELATION) { color weight }
    (Call) { color = black }

We have also specified that the default color for Call edges is black.

In a typed graph, there is at most one edge of a given type between two nodes. If we wish to represent more than one edge of a given type, e.g., more than one call from P to Q, we can use an attribute such as "weight" to record this information. For example, the following can record the fact that there are three calls from procedure P to procedure Q:

    (Call P Q) { weight = 3 }

When one entity class inherits from another, this has two effects. First, the inheriting class gets all the attributes of the inherited-from class. Second, any connectivity allowed for the inherited-from entity class is also allowed by the inheriting class. For example, this line

    partOf progItem progItem

specifies that any progItem (any Proc or Var) can be "part of" any other progItem. As another example,

    $RELATION $ENTITY $ENTITY

allows any entity (every entity descends from $ENTITY) to be connected by any relation (every relation descends from $RELATION) to any other entity.

In TA, inheritance is a notational convenience, but it adds no new power beyond what was in TA without inheritance. It simply allows us to write less by factoring common scheme information into an inheritance ancestor. It is always possible to write out the connectivity and attributes explicitly rather than using inheritance and get the same effect.

The only possible incompatibility from combining attributes of two inherited attributes would be if two attributes had the same name and one named a nested set of attributes and the other named a scalar attribute; this problem and its solution are discussed in the section on attribute nesting and attribute lists.

Multiple inheritance is allowed. For example, a Proc could inherit from both a Scope and a Unit. Multiple inheritance combines the attributes from its various inherited-from classes. Two inheritances of the same attribute, say color, are taken to be a single attribute of that name. The only possible incompatibility when combining attributes is the same as the problem previously described, in which an item's attributes are given several times.

[top]

Summary of Sub-Languages of TA

There are two languages that make up TA, the Tuple and the Attribute sub-languages. As well, there are two levels of each of these, the fact level and the scheme level. We illustrate this idea in this diagram.


                      Tuple          Attribute
                 +---------------+----------------+
                 |               |                |
         Scheme  |  Allowed      |  Allowed       |
                 |  edges        |  attributes    |
                 |               |                |
                 +---------------+----------------+
                 |               |                |
         Fact    |  Actual       |  Actual        |
                 |  edges        |  attributes    |
                 |               |                |
                 +---------------+----------------+

The syntax used for scheme tuples and for fact tuples is identical, and the syntax used for scheme attributes and for fact attributes is essentially identical. At the scheme level we "declare" each entity class and each relation as well as the attributes allowed for each of these. At the fact level, we "populate" a graph, by giving the graph's nodes, edges and attributes.

[top]

Scheme Conformance

The purpose of a scheme is to capture certain invariant structural properties of a fact-level graph (or set of graphs). For example, it is handy to state, in a single place, the kinds of nodes and edges in a graph, its various attributes and the allowed connectivity among nodes. In turn, it is helpful to be able to test to see if a (fact level) graph actually conforms to its scheme. If not, presumably there is something wrong with the graph (or possibly with the scheme). The scheme provides a certain amount of redundancy that can help us find errors in particular graphs. Put another way, a scheme encodes a set of constraints that we expect our graph(s) to satisfy. Yet another way to put this is that a scheme provides a set of declarations for a graph, and the graph itself makes assignments (creates structure) to populate the declarations.

If the fact level graph F corresponds to a particular scheme S, we say that F "conforms" to S (or satisfies S, or is legal according to S). Sometimes there is a commonly used scheme S, in which case we say simply F is conformant, or legal, without mentioning S.

In a standard relational database (an RDB), there is a scheme and the corresponding (instance) database contains tuples that corresponding to the scheme. In the usual implementations of RDBs, it is not possible to add a tuple to the database unless the tuple corresponds to the scheme. This in turn implies that the database can never fail to conform to its scheme.

This is not the case in TA. In a TA program, the programmer can write anything he or she feels like. If he violates the TA's syntax, then technically, the program has no meaning at all. If the program satisfies TA's syntax, then the program may or may not be conformant. If it is, this situation is like that in an RDB, namely, the fact level graph actually satisfies its scheme. If not, we can still consider that the fact level graph exists, as does the scheme, even though the fact level graph "breaks the rules" of its scheme. We can repair (or write a program to repair) a program so it satisfies its scheme, by deleting violating constructs.

Note: There is a "schema checker", called chkschema, written in Grok, which checks to see if the fact part of a TA program conforms to its schema.

[top]

The Universal Scheme

A scheme defines those relations, nodes and attributes that are allowed in a corresponding fact graph. A scheme is usually given explicitly in terms of the TA language. However, in principle, a scheme is optional in that the Universal Scheme can be used. The Universal Scheme assumes:

Any node can be attached to any other node. This is equivalent to this tuple in an explicit scheme:
```
            $RELATION $ENTITY $ENTITY  
```
Any name of a node that appears in a fact tuple, such as A and B in "Rel A B", is implicitly considered to be an instance of $ENTITY, as if the following occurred explicitly as a tuple in a scheme:
```
            $INSTANCE A $ENTITY
            $INSTANCE B $ENTITY  
```
Any attribute is allowed for any node or edge or relation. This is as if the corresponding attribute were given explicitly in a scheme attribute.

[top]

Attribute Values That Are Lists

The value of an attribute can either be a string, such as green in

    P { color = green }

or a value list as in

    P { color = ( 0.5 0.2 0.4 ) }

A value list is a sequence of zero or more values (strings) surrounded by parentheses ( ... ). Note that lists of lists are allowed. The original version of TA specified that attributes could be nested, but this feature has been eliminated from TA.

[top]

Grammar for TA

The grammar for TA is given below. The notation {...} means zero or more repetitions of the enclosed items.

  TALanguage ::= {section}

  section ::=   
            SCHEME TUPLE : tupleLanguage
          | SCHEME ATTRIBUTE : attributeLanguage
          | FACT TUPLE : tupleLanguage
          | FACT ATTRIBUTE : attributeLanguage
          | INCLUDE "stringToken" :

  tupleLanguage ::= { stringToken stringToken stringToken }

  attributeLanguage ::=
            { itemId "{" {attributeSetting} "}" }

  attributeSetting ::=
            attributeId
          | attributeId "=" attributeValue
            
  attributeValue ::=
            stringToken
          | "(" {stringToken} ")"
            
  itemId ::=  stringToken        // Fact entity or, in scheme, entity class
          | "(" stringToken ")"  // Relation (edge class)
          | "(" stringToken stringToken stringToken ")"    // Actual edge

Note: Nested attributes have been deleted from TA. These stringTokens have predefined meanings: $ENTITY, $RELATION, $INSTANCE, and $INHERIT. ($NOTE no longer has any meaning.)

[top]

String Tokens in TA

In TA, string tokens (stringTokens) may be quoted, or not, depending on their contents. For example, in the following:

	P { color = red }

the string token with value red is written without quotes. This could as well be written as

	P { color = "red" }

with the same meaning. This can also be written as:

	P { color = 'red' }

As these examples illustrate, single or double quotation marks can be used to enclose the contents of string tokens. In string tokens, certain characters, such as newline characters, must be escaped, using a backslash character (\). Here is the list of these characters:

	   CHARACTER       ESCAPE CONVENTION
        newline (10)       \n
        tab (9)            \t
        formfeed (12)      \f
        return (13)        \r
        eject (27)         \e
        delete (127)       \d
        backslash (92)     \\
        single quote (39)  \'
                 (escaping not necessary in double quoted strings)
        double quote (34)  \"
                 (escaping not necessary in single quoted strings)

Single quotes ( ' ) do not need to be escaped within strings surrounded by double quotes ( " ) and vice versa. In TA, string tokens containing the following must be quoted:

Blank character
Any of the above escape characters
( Left parenthesis
) Right parenthesis
{ Left brace
} Right brace
= Equal sign
' Single quote outside of a double quoted string
" Double quote outside of a single quoted string

String tokens in TA must contain only printable characters (and the above escaped characters). As a result, TA token strings cannot contain all the possible 8-bit patterns of bytes.

[top]

Maximum String Lengths, etc.

String tokens may have their lengths limited by some tools; for example, relation names, entity class names, entity names and attributes names may be limited to 255 characters. (Grok has this restriction.) Tools that use or accept attributes values are expected to handle attribute string values that are very long, e.g., thousands of characters long.

Some tools, notably LSEDIT, use the following convention. Several fact assignments are allowed to a given attribute, with the final assignment being the effective one (previous ones being effectively ignored). Note that there is no way in TA to add to an already existing list (although of course a particular tool could read a TA program and modify it such that new items are effectively added to a list).

There is an "include" directive, so you can build up a TA program out of a set of files. The format of this directive is:

   INCLUDE nameOfIncludedFileInQuotes :

LSEDIT makes effective use of "includes" to separate out various parts of TA, such as the schema, layout information and descriptions, which are attributes.

[top]

Relations with special meanings

By Nikita Synytskyy

Some relations are treated specially by most software programs that work with TA data. The special treatment is not formalized in the TA specification, but is very wide-spread so it is included in this document as a general guide.

Attributes. In most software that works with TA data, attributes are internally stored as relations. To differentiate between ordinary relations and relations that are meant to represent attributes, attribute relation names are commonly prefixed with the "@" or "@_" symbol combinations. For example, data

P {color = green}

might be stored internally as

@color P green

It is therefore unadvisable to give ordinary relations names that begin with the @ character, because some programs will interpret these relations as containing attribute information.

$INSTANCE. As described above, the $INSTANCE relation TA's way to declare the type of an entity. It is unadvisable to use it to store any other information, as this could lead to unexpected errors with the software that operates on TA.

Containment. Since TA is often used to record and operate on data about software architecture, the need to store containment information often comes up (e.g. a variable is contained in a procedure, which is contained in a file, which is contained in a subsystem, etc.) It is a common, although poorly documented, agreement that the relation called contain is used for that purpose. As a result, most software packages that in any way deal with containment information expect to find it in a relation called contain. It is also often (although not always) assumed that the data in the contain relation is a tree, or at least a forest, i.e. no entity is contained by two entities simultaneously, and there are no loops.

If the contain relation is missing, or contains data that can not be interpreted as containment information, software that depends on it existing and being correct may yield unexpected results or not run at all.

[top]

References

[Chen 77] The Entity-Relationship Approach to Database Design, by Peter Chen, QED Information Sciences Inc., 170 Linden St., Wellesley, Mass., 1977, 1991
[Holt 1998] Structural Manipulations of Software Architecture using Tarski Relational Algebra, R. C. Holt, WCRE '98: Working Conference on Reverse Engineering, Honolulu, Oct 1998.
[Malton 01] Union Schemas as the Basis for a C++ Extractor, Thomas Dean, Andrew Malton, Richard Holt, WCRE 2001: Working Conference on Reverse Engineering, Stuttgart, Germany, Oct 2-5, 2001.
[Mancoridis 95] Extending Programming Environments to Support Architectural Design, S. Mancoridis and R. C. Holt, CASE '95: Workshop on Computer-Aided Software Engineering, Toronto, Canada, July, 1995.
[Muller 93] A Reverse Engineering Approach to Subsystem Identification, by Hausi A. Muller, O.A. Mehmet, S.R. Tilley, and J.S. Uhl, Software Maintenance and Practice, Vol 5, 181-204, 1993.
[Tarski 55] On the Calculus of Relations, by Tarski, A., J. Symb. Log. 6, 3, 1941, 73-89.
[Ullman 88] Principles of Database and Knowledge-Base Systems (Volume I), by J. D. Ullman, Computer Science Press, New York, New York, 1988.

Article edited by N. S. Feb. 27 2004