ASX C/C++/Assembler fact extractor

Download ASX

ASX is a fact extraction tool that extracts source information from C, C++, assembler, object files, libraries, dynamic libraries and executables, in a format that may then immediately be visualised using lsedit.

The supporting paper can be found here.

The latest builds of ASX are available below. While effort has been made to test ASX, the software may contain bugs. Note that currently ASX is provided as a binary executable that will only run on the specified platform.

For convenience and for verification purposes the intermediate disassembled output can be viewed by using the program asxo.

ASX 4.0.2 for linux [asx-4.0.2.zip] [asx-4.0.2.exe]

ASX 4.0.1 for linux [asx-4.0.1.zip] [asx-4.0.1.exe]

ASX 4.0.0 for linux [asx-4.0.0.zip] [asx-4.0.0.exe]

ASX 3.0.11 for linux [asx-3.0.11.zip] [asx.exe]

ASX 3.0.10 for linux [asx-3.0.10.zip] [asx.exe]

ASX 3.0.9 for linux [asx-3.0.9.zip] [asx.exe]

ASX 3.0.8 for linux [asx-3.0.8.zip] [asx.exe]

ASX 3.0.7 for linux [asx-3.0.7.zip] [asx.exe]

ASX 3.0.6 for linux [asx-3.0.6.zip] [asx.exe]

ASX 3.0.5 for linux [asx-3.0.5.zip] [asx.exe]

ASX 3.0.4 for linux [asx-3.0.4.zip] [asx.exe]

ASX 3.0.3 for linux [asx-3.0.3.zip] [asx.exe]

ASX 3.0.2 for linux [asx-3.0.2.zip] [asx.exe]

ASX 3.0.1 for linux [asx-3.0.1.zip] [asx.exe]

ASX 2.0.8 for linux [asx-2.0.8.zip] [asx.exe]

ASX 2.0.7 for linux [asx-2.0.7.zip] [asx.exe]

ASX 2.0.6 for linux [asx-2.0.6.zip] [asx.exe]

ASX 2.0.5 for linux [asx-2.0.5.zip] [asx.exe]

ASX 2.0.4 for linux [asx-2.0.4.zip] [asx.exe]

ASX 2.0.3 for linux [asx-2.0.3.zip] [asx.exe]

ASX 2.0.2 for linux [asx-2.0.2.zip] [asx.exe]

ASX 2.0.1 for linux [asx-2.0.1.zip] [asx.exe]

ASX 1.0.9 for linux [asx-1.0.9.zip] [asx.exe]

ASX 1.0.8 for linux [asx-1.0.8.zip] [asx.exe]

ASX 1.0.7 for linux [asx-1.0.7.zip] [asx.exe]

ASX 1.0.6 [asx-1.0.6.zip]

ASX 1.0.6 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

ASX 1.0.5 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

ASX 1.0.4 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

ASX 1.0.3 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

ASX 1.0.2 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

ASX 1.0.1 (ELF i686-i386 GNU/Linux 2.2.5) [asx.exe]

Supporting library [libstdc++.so.6]

Supporting GPL-licensed source (huge) [binutils-2.21.tar]

Official binutils download site [binutils-2.21.tar.gz]

Official zlib download site [zlib-1.2.5]

Trivial program for scripts [echoargs.c]

ASX is being constantly developed and improved; check back often for newer versions.

Description

This program takes a list of parameters specifying compilation options and C/C++/assembler/object files assumed to collectively form some sort of compilation unit, and emits to the standard output facts expressed in TA regarding the function declarations, function invocations, variable usage, and classes discovered in the compilation unit(s).

When C and C++ source files are presented to this fact extractor, it first attempts to generate assembler source files from the named source files and, if this succeeds, extracts from the generated assembler the facts emitted in the output TA. If the compilation of such a source file fails, this is reported in the output TA, and asx falls back on a clumsier method of extracting facts: a simple lexical scan of the output produced by cpp when presented with this source and its associated arguments. When assembler files (*.s files) are presented to ASX they are parsed directly.

When object files (*.o files), archives (*.a files), dynamic libraries (*.so) and/or executables are presented to this fact extractor, extraction is achieved by disassembling their contents.

The primary motivation justifying the development of this fact extractor was the desire to be able to simply extract facts from a wide variety of source languages and object file formats while preserving the source file and line number from which the facts were extracted as attributes within the resulting TA.

Synopsis

asx [-] < build.history > facts.ta

asx (<compile options>* <extraction files>+)+ > facts.ta

Options

When working with small projects the simplest way to perform fact extraction is to invoke asx with the command line options used to compile the project, as if asx were a substitute for gcc or g++.

Such options consist of zero or more compile time options followed by one or more source files, with this pattern optionally being repeated. Any compiler options (these are presumed to begin with a '-') may be specified. The options -c, -o, -E and -O* are inappropriate for generating instrumented assembler, and are therefore ignored if seen on the command line. The options -g and -S need not be explicitly specified, since these are automatically added to the compile time options if absent.
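The option handling described above can be approximated in shell. This is only a sketch and not the logic asx itself uses: the function name normalise_options is hypothetical, and dropping the file name that follows -o is an assumption, since the text does not say what asx does with that argument.

```shell
#!/bin/sh
# Sketch only: mimic the documented rule that -c, -o, -E and -O* are ignored
# and that -g and -S are added when absent. Prints one option per line.
# Assumption: the file name following -o is dropped along with -o itself.
normalise_options() {
    have_g=no
    have_S=no
    skip_next=no
    for opt in "$@"; do
        if [ "$skip_next" = yes ]; then
            skip_next=no
            continue
        fi
        case "$opt" in
            -c|-E|-O*) ;;                          # ignored
            -o) skip_next=yes ;;                   # ignored, with its argument
            -g) have_g=yes; printf '%s\n' "$opt" ;;
            -S) have_S=yes; printf '%s\n' "$opt" ;;
            *)  printf '%s\n' "$opt" ;;
        esac
    done
    # Add the options that the text says are supplied automatically.
    [ "$have_g" = yes ] || printf '%s\n' -g
    [ "$have_S" = yes ] || printf '%s\n' -S
}

# Prints -Wall, -DABC, -g and -S, one per line.
normalise_options -O2 -Wall -c -o prog.o -DABC
```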

Prior options apply to all subsequent source files, with all such prior options being discarded if later options are discovered after prior source files. Thus it is possible, though not necessarily easy, to submit very different options for each source file from which facts are to be extracted.

Thus the options -DABC source1.cpp source2.cpp -DEFG source3.cpp would define ABC in source1.cpp and source2.cpp, but EFG and not ABC in source3.cpp.
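The grouping rule can be made concrete with a short shell sketch that prints, for each source file, the options that would apply to it. The function name group_options is hypothetical, and the sketch assumes that anything beginning with '-' is an option and anything else is a file.

```shell
#!/bin/sh
# Sketch only: show which options apply to which source file under the rule
# that options found after source files discard all earlier options.
# Assumption: anything starting with '-' is an option, anything else a file.
group_options() {
    opts=""
    seen_file=no
    for arg in "$@"; do
        case "$arg" in
            -*)
                # An option after one or more files starts a fresh option set.
                if [ "$seen_file" = yes ]; then
                    opts=""
                    seen_file=no
                fi
                opts="${opts:+$opts }$arg"
                ;;
            *)
                seen_file=yes
                printf '%s: %s\n' "$arg" "$opts"
                ;;
        esac
    done
}

# Prints:
#   source1.cpp: -DABC
#   source2.cpp: -DABC
#   source3.cpp: -DEFG
group_options -DABC source1.cpp source2.cpp -DEFG source3.cpp
```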

No options are needed when extracting facts from assembler, binary object files, or archived libraries.

Alternative standard input options

In more complicated build processes, an alternative approach may be used to extract facts using asx. Firstly download, compile, and install the simple program echoargs.c. This program merely echoes the current working directory, followed by any arguments presented to it, enclosing such arguments in quotation marks the better to ensure that they can subsequently be reparsed correctly.

Then place script files named gcc, g++ and, if required, as and cc earlier in your $PATH search path than the genuine compilers, and have these scripts capture the arguments presented to the genuine compiler before actually invoking it. To check that your $PATH is set up correctly use echo $PATH, and confirm that your scripts are in a directory named earlier than the directory containing the real compilers. If not, use the commands PATH=<directory>:$PATH; export PATH to correct this, ideally by placing them in your login script so that they are performed automatically each time you log in. If you log in with csh rather than bash or sh (which can be determined using the command echo $SHELL), the above command becomes setenv PATH <directory>:$PATH
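Rather than inspecting the output of echo $PATH by eye, the ordering can be checked with a small function. This is a sketch only: the function name path_precedes and the directory $HOME/wrappers are hypothetical and stand for wherever the wrapper scripts were placed.

```shell
#!/bin/sh
# Sketch only: report whether directory $1 is searched before directory $2
# in the colon-separated search path $3 (defaults to $PATH).
# Returns 0 when $1 comes first, 1 otherwise.
path_precedes() {
    wrappers=$1
    real=$2
    search=${3:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search; do
        if [ "$dir" = "$wrappers" ]; then IFS=$old_ifs; return 0; fi
        if [ "$dir" = "$real" ]; then IFS=$old_ifs; return 1; fi
    done
    IFS=$old_ifs
    return 1
}

# $HOME/wrappers is a hypothetical location for the wrapper scripts.
if path_precedes "$HOME/wrappers" /usr/bin; then
    echo "wrapper scripts will be found first"
else
    echo "real compilers will be found first; adjust PATH"
fi
```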

One such gcc shell script might contain:


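#!/bin/sh
# Record the working directory and arguments in the shared history file.
echoargs /usr/bin/gcc "$@" >> "$HOME/gcc.history"
# Then run the genuine compiler with the same arguments.
/usr/bin/gcc "$@"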

while the corresponding g++ shell script would contain:


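#!/bin/sh
# Record the working directory and arguments in the shared history file.
echoargs /usr/bin/g++ "$@" >> "$HOME/gcc.history"
# Then run the genuine compiler with the same arguments.
/usr/bin/g++ "$@"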

and the as (sometimes named gas) shell script would contain:


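#!/bin/sh
# Record the working directory and arguments in the shared history file.
echoargs /usr/bin/as "$@" >> "$HOME/gcc.history"
# Then run the genuine assembler with the same arguments.
/usr/bin/as "$@"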

Note that the above scripts must write to the same history file, and that the history file must be given an absolute path name. You can test this setup by subsequently invoking gcc and/or g++ to verify that the commands as invoked are captured in the history, as well as executed.

Having done this, all build operations performed through make or any other tool that do not explicitly specify the location of gcc/g++ will have a history built of the arguments submitted to gcc/g++. If all else fails, for many makefiles the location of the compiler can be set through the CC environment variable.

Be careful to empty or remove this history file before beginning the build process, rather than at any point during it, and then build the executable from which facts are to be extracted. One way to do this within a Makefile is to have make clean remove the history file, and to specify clean as the first thing that your build is dependent on. Also be careful to rename the script wrappers used to capture this history once capture is complete, so that you do not needlessly build an ever larger history of prior compilations.

Such captured compiler instructions may then be presented to asx, by either invoking asx without options, or by passing to asx an initial run time option of "-". In either case asx subsequently operates by reading its instructions from standard input, rather than from the command line.

Each line read from standard input must contain parameters that are double quoted, to indicate where each begins and ends. The first such parameter specifies the directory in which all subsequent compilation relating to this line is to be performed, and the second must contain the named location of the tool to be used in performing all compilations relating to options on this same input line.
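For illustration, a line in this form can be produced by hand with a few lines of shell. This is a sketch rather than a replacement for echoargs: the function name history_line and the paths in the example are hypothetical, and the sketch assumes that no argument contains an embedded double quote, since how such characters are escaped is not stated here.

```shell
#!/bin/sh
# Sketch only: build one history line in the form described above:
#   "<directory>" "<tool>" "<argument>" ...
# Assumption: no field contains an embedded double quote.
history_line() {
    line=""
    for field in "$@"; do
        line="${line:+$line }\"$field\""
    done
    printf '%s\n' "$line"
}

# Prints: "/home/user/project" "/usr/bin/gcc" "-DABC" "-c" "main.c"
history_line /home/user/project /usr/bin/gcc -DABC -c main.c
```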

Care should be taken to review the input parameters generated using this technique prior to running asx, since depending on the context one might or might not wish to extract facts from both source files and object files. Facts might erroneously be recovered multiple times if, for example, source code was compiled into object code, and these object units were then themselves compiled a second time into larger object units. By default asx makes some effort to avoid processing input object files corresponding to earlier source file compilations, but input libraries will by default always have facts extracted from them.

The special keyword reset, on a line by itself within the history passed to asx, instructs asx to make all classes, functions and variables constructed prior to this reset command invisible to all later processing performed before the output TA is actually emitted. When a build involves building multiple distinct programs, placing such a reset command at the appropriate points within the history ensures that the subgraphs for each distinct program constructed during the build process remain disconnected from all subgraphs constructed by other unrelated parts of the build process. Without such manually entered reset instructions, asx will permit functions and variables defined in one program to be logically accessible to all of the programs being built, even if the build process itself does not actually permit such functions and variables to be shared across programs.
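If the history for each program has been captured in a separate file, the reset lines can be inserted mechanically when the files are joined. The following is a sketch under that assumption; the function name join_histories and any file names passed to it are hypothetical.

```shell
#!/bin/sh
# Sketch only: concatenate per-program history files into a single history,
# writing a line containing only "reset" between consecutive programs.
join_histories() {
    first=yes
    for history in "$@"; do
        [ "$first" = yes ] || echo reset
        first=no
        cat "$history"
    done
}
```

The combined output would then be given to asx on standard input in the manner shown in the synopsis, for example by redirecting it to build.history first.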

Lacking build instructions

In situations where the source lacks the build instructions that would permit capture of a genuine build history, the tool guessbuild can be used in a somewhat desperate attempt to reverse engineer a plausible build history from the collection of files that constitute the source of interest, without actually building the source code.

Special environment variables

These environment variables may be used to constrain the behaviour of asx.

ASX_COMPILE=[yes|no]
ASX_SKIP=[yes|no]
ASX_SILENT=[yes|no]
ASX_UNLINK=[yes|no]
ASX_IGNORE=<suffixes>
ASX_FORCE=<suffixes>
ASX_LIFT=<a>
ASX_DWARF

If ASX_COMPILE is set to no, then no attempt is made to compile source files to assembler. Instead extraction from such source files is achieved using a simple lexical examination of the source code, having preprocessed that source code using cpp. Otherwise, asx will attempt to compile source files to assembler before falling back on a lexical scan if the compilation fails. Fact extraction using such a lexical scan is not recommended since the results are not very accurate at this time.

If ASX_SKIP is set to yes, then no attempt is made to compile source files for which an assembler file already exists that is newer than the corresponding source file. Instead the existing assembler file is read, without first compiling the corresponding source.

If ASX_SILENT is set to yes, then compilation commands are not echoed, though output resulting from those compilations may still be. The default of no causes compilation commands to be echoed.

If ASX_UNLINK is set to no then any generated assembler files will not be silently removed. This is a useful option if one wishes to later examine such files. Otherwise, assembler files generated by asx will subsequently be silently deleted.

ASX_IGNORE contains a list of qualifiers separated by any of ';', ':' or ',' that are not to be presumed to contain information from which facts are to be extracted. The following qualifiers are recognised:

r Don't capture references to functions as addresses

f Don't show function calls

fv Don't show virtual function calls

t Don't show templated functions and function calls

v Don't show any variables

vs Show only global static variables

vf Show only file level static variables

vp Show only static variables

vl Show only static variables and parameter variables

vn Show all variables

a Don't extract facts from discovered library archives

o Don't extract facts from discovered object code

so Don't extract facts from discovered shared object code

s Don't extract facts from discovered *.s files

S Don't extract facts from discovered *.S files

c Don't extract facts from discovered *.c files

C Don't extract facts from discovered *.C files

cc Don't extract facts from discovered *.cc files

cxx Don't extract facts from discovered *.cxx files

cpp Don't extract facts from discovered *.cpp files

c++ Don't extract facts from discovered *.c++ files

For example, if one knew that one only wished to extract facts from source files, one might specify ASX_IGNORE="o;so;a" to avoid extracting facts from object files and libraries presented as inputs to the build process. Use 'v' to suppress all variables, 'vs' to suppress all but global variables, and 'vf' to suppress static variables defined within a function's scope. If 'f' is specified no function calls will be generated in the output. If 'fv' is specified no virtual function calls will be generated in the output. Currently by default all function calls are emitted and only global variables are emitted.
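Since any of ';', ':' or ',' may separate the qualifiers, a value can be checked before use by splitting it into its parts. This is a sketch only, with a hypothetical function name split_qualifiers; it says nothing about how asx itself parses the value.

```shell
#!/bin/sh
# Sketch only: split a qualifier list on ';', ':' or ',' and print one
# qualifier per line, skipping empty entries.
split_qualifiers() {
    printf '%s\n' "$1" | tr ';:,' '\n\n\n' | sed '/^$/d'
}

# Prints o, so and a, one per line.
ASX_IGNORE="o;so;a"
split_qualifiers "$ASX_IGNORE"
```

Once the list looks right, the variable would be exported (export ASX_IGNORE) before asx is run.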

ASX_FORCE contains a list of suffixes separated by any of ';', ':' or ',' that are always presumed to contain information from which facts are to be extracted. By default '.o' and '.so' files whose absolute prefix matches the absolute prefix associated with a source file already processed by asx are ignored, on the grounds that the facts contained within such object files have already been extracted from these corresponding source files. Thus the default behaviour is effectively to only extract facts from object files seen as inputs within the build for which no corresponding source file appears to exist. If "f" is specified no virtual function calls will be generated. If "fv" is specified all function calls will be emitted. If 'v' is specified all static variables will be emitted; if 'vg' is specified only global variables will be emitted, while if 'vs' is specified both global variables and static file scope variables will be emitted but not static variables having only function scope.

ASX_LIFT contains a list of types of entity whose edges are to be lifted to the parent of this entity. Currently the only entities which may have edges lifted are address entities whose parents are functions or variables. This lifting is requested by assigning ASX_LIFT the value 'a'.

ASX_DWARF causes the DWARF symbolic information to be dumped in its internal tree form on the standard error output. This information is primarily of use to developers of ASX.

The above environment variables will also be recognised if entered directly on the asx command line preceded by "--".

Source files

Source files are those files having one of the suffixes '.c', '.C', '.cc' (presumed to be C source files); one of the suffixes '.cxx', '.cpp', or '.c++' (presumed to be C++ source files); '.S' (presumed to be assembler requiring preprocessing); or '.s' (presumed to be assembler source files). C source files are converted to assembler using /usr/bin/gcc, while C++ source files are converted to assembler using /usr/bin/g++.
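The suffix rules above amount to a simple case analysis, which can be written down as a shell sketch. The function name classify_source and the wording of its output are hypothetical; only the suffix groupings are taken from the text.

```shell
#!/bin/sh
# Sketch only: classify a file name by suffix, following the groupings
# listed above. Matching is case sensitive, so .s and .S differ.
classify_source() {
    case "$1" in
        *.c|*.C|*.cc)       echo "C source" ;;
        *.cxx|*.cpp|*.c++)  echo "C++ source" ;;
        *.S)                echo "assembler requiring preprocessing" ;;
        *.s)                echo "assembler" ;;
        *)                  echo "not a source file" ;;
    esac
}

classify_source widget.cpp    # prints: C++ source
classify_source boot.S        # prints: assembler requiring preprocessing
```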

When the options to asx are submitted on the command line, each non-assembler source file is compiled in the directory within which it resides, even if this is not the directory from which asx is invoked. It is impossible in general to say where source code should be compiled, and this is a good guess as to how the submitted code is normally built. Note that this is significant when specifying include files as relative path addresses. Also note that such directory changes during the exercise of compiling source should be otherwise transparent to the user.

Object files

Object files are those files having the suffixes '.o', '.a' and '.so', and executable files (having no suffix). Extraction from these files uses bfd, which invokes backend software to manage the differences between machine architectures. Object files on swag are represented using the ELF standard (pdf), with some sections within this standard internally encoded using the DWARF standard (pdf) and the DWARF 4 standard. For the DWARF standard see also http://dwarfstd.org.

Internals

The behaviour of asx is specific to a given assembler dialect and if ported to other machines might require modification to handle different assembler directives.

Data structures are generated as each file from which facts are extracted is processed. These represent:

Directories
Archives
Source units
Named files
Classes
Functions
Variables
Function/Variable usage
Objects

ASX uses the DWARF encoded symbolic debug information (if present) to deduce the names and types of parameters and local variables. It also attempts to detect sequences of assembler code that get generated when C++ functions are invoked indirectly via a class VTable. This logic presumes a considerable amount about how such assembler code is generated, and is very compiler specific.

ASX version 3 uses the Itanium C++ ABI standard to extract the namespace and class that objects belong to from these objects' internally mangled names. It attempts to address departures within the GNU compilers from this standard. Earlier versions of ASX did not handle namespaces correctly.

Directories

The directory structure emitted in the output TA is derived directly from the named source files occurring on the command line. Different output will be emitted in the TA for source files specified as *.c* than for those specified logically equivalently as ./*.c*. In the former case the current directory will not be emitted, and so the source files will appear under the root of the landscape, while in the latter case the current directory will be emitted, with the specified source files contained within this directory.

Source units

These internal objects represent the logical files processed; be those source files, assembler files, object files, or virtual files contained within an archive. The key characteristic of such a file is that it is a basic source from which facts can be extracted.

Named files

Named files represent citations within the assembler code that specify where the assembler is to be deemed to have been produced from. While these named files will often agree with the names of the files submitted for compilation, this will not always be the case, since the preprocessor inserts #line directives into source code when scanning header files. Some preprocessors such as yacc and bison also employ such directives explicitly, so that the assembled code is related not to the source that produced it but to the directives that cause source code to be generated. A given source file may thus contain multiple distinct named files within it. Within the assembler these are distinguished by number.

Classes

These objects represent the various classes to which functions and variables are identified, by their mangled names, as belonging. To see all of the member variables and functions of such a class within lsedit, select the classes of interest and then use a forward query (F).

Functions

These objects represent instances of function signatures. Since the number of functions may be large in a big compilation, hashing is used to provide rapid lookup of known functions.

Those functions for which a function declaration has been seen will be known to occur in one or more specific source units, with each such declaration generally being associated with a starting line number (specified in the output TA by the function's lineno attribute) in a named file (specified in the output TA by the function's file attribute).

When a function signature is declared in multiple source units, this function will appear within each such source unit within the output TA.

Software may also recover the address of functions, presumably to later permit these functions to be invoked indirectly by their address. When such addresses are obtained this is shown schematically by creating an edge from the accessor to the function address, itself contained as a pseudo member of this function.

Variables

These objects represent instances of variables either declared or used within the assembler. Unlike functions and function invocations, the assembler does not preserve the line number or file in which such variables are logically defined.

Variables may be accessed in a manner which updates them, retrieves their address, or determines their value. Any operation which may potentially result in updating a variable, either directly, or through its address, will generate update edges, rather than read edges.

Function/Variable usage

Since the number of possible interactions between functions may be huge, hashing is used to provide rapid lookup of known function invocations and variable usage.

Each function invocation is represented in the TA by an edge from the calling function to the function invoked. This edge is assigned a file and lineno attribute indicating where this specific function invocation was first encountered. The number of times a given function calls a given function within the assembler is contained in the edge's freq attribute.
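The idea behind the freq attribute, one edge per caller and callee pair carrying a count of the calls seen, can be illustrated with a short sketch. The input format used here (one "caller callee" pair per line) and the function name count_calls are hypothetical and are not the TA format emitted by asx.

```shell
#!/bin/sh
# Sketch only: collapse repeated "caller callee" lines read from standard
# input into one line per edge, followed by the number of times it was seen.
count_calls() {
    awk '{ freq[$1 " " $2]++ }
         END { for (edge in freq) print edge, freq[edge] }' | LC_ALL=C sort
}

# Prints:
#   main init 2
#   main run 1
printf '%s\n' 'main init' 'main run' 'main init' | count_calls
```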

Functions which are invoked but never declared in the source code submitted to asx become, in the output TA, members of the class external and are placed in a special :library: directory. Edges to such functions are assigned a different type (and color) from edges between functions that are both declared.

Because function signatures may be replicated in different files, a certain amount of resolution is sometimes needed to correctly determine which function is actually being called by a caller. The preference is to assume that the appropriate function is the one declared in the same source unit as the caller. Where no instance of a function occurs in the same source unit as the calling function, the preference is then to resolve the function to the earliest declaration of the function seen when processing the input. Thus if the same function was declared both in an object file and an archive library, resolution would be to the object file if this appeared earlier on the command line than the archive library, and otherwise to the declaration of the function contained in the archive library.
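The resolution order described above can be sketched as follows. This is an illustration only, not the code used by asx: the function name resolve_call is hypothetical, as is the input format of one "unit function" declaration per line in the order the inputs were processed. Calls to functions that are never declared are reported against :library:, as described earlier.

```shell
#!/bin/sh
# Sketch only: resolve a call to function $2 made from source unit $1.
# Standard input lists declarations as "unit function", one per line, in
# the order in which the inputs were processed.
resolve_call() {
    awk -v unit="$1" -v fn="$2" '
        $2 == fn && $1 == unit   { same = 1 }       # declared beside caller
        $2 == fn && first == ""  { first = $1 }     # earliest declaration
        END {
            if (same)             print unit
            else if (first != "") print first
            else                  print ":library:"
        }'
}

# log is declared in util.o and in libx.a; a call from main.o resolves to
# util.o because that declaration was seen first. Prints: util.o
printf '%s\n' 'util.o log' 'libx.a log' 'main.o main' | resolve_call main.o log
```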

Edges are also constructed from functions to the variables used within these functions. As with functions, when the same variable name occurs in multiple input files, the actual variable referenced is resolved by always preferring to associate functions with variables that occur within the same compilation unit.

Objects

The objects shown in the output TA as variables are those denoted in the assembler as being statically defined named objects. Such objects may have either local or global scope.

Name demangling

Within the assembler code, function and variable names may be mangled if originally derived from C++ code. These mangled names are unmangled by using a subset of the GNU source code provided in the binutils-2.20.1 distributable. For details on how names are mangled, and thus later unmangled, see the Itanium C++ ABI and the earlier GNU V3 ABI specification. The original mangled name is preserved as an attribute within the output TA.

Binutils

ASX leverages binutils both for demangling internal C++ names, and for handling the interpretation of some of the contents of ELF object files. While the amount of binutils used is not large, it seemed undesirable to attempt to extract from binutils only those parts used, since this would make it more difficult to upgrade asx when more recent versions of binutils are released. As a precursor to compiling asx, extract the freely available binutils source from http://ftp.gnu.org/gnu/binutils, change into the extracted directory, and then follow the instructions in README to build binutils. This typically involves issuing the command "./configure" followed by the command "make".

Zlib

Modern versions of binutils themselves depend on the zlib libraries, which can be readily downloaded from http://www.zlib.net and compiled.

Caveats

This code was originally developed to work with C and C++ assembler containing embedded symbolic information encoded using the Dwarf-3 encoding standard. In the 4 years since it was developed, GCC and G++ have been enhanced to use the Dwarf-4 standard. Only ASX-4.0.1 and higher contain the necessary code to interpret this newer standard. Older versions of ASX will fail if the assembler source code generated by GCC/G++ conforms to the Dwarf-4 standard rather than the Dwarf-3 standard.

When presented with source code this fact extractor works best when the code is compilable. Compilation is therefore always recommended. Since the source code is compiled as part of fact extraction, warning and error messages relating to compilation of this source code may be emitted by this program.

When the code fails to compile, the fact extraction achieved by lexically scanning the source code is designed to be as forgiving of errors as possible. However, the output produced in this case may be problematic, particularly if scanning C++ files. Function name extensions that identify functions with specific classes will not be reported unless the code explicitly uses the full function name, and function signatures will not be encoded in the function name, potentially causing problems when attempting to resolve function names within this software. Thus invocations of a single function might visually appear to be invocations of multiple functions, while invocations of multiple distinct functions distinguished only by their signatures will appear to be invocations of the same function.

When presented with assembler and object code that has been previously compiled, the usefulness of fact extraction will depend critically on whether these source units were themselves generated in a manner (using the -g option) that preserved symbolic information and line number information in the output file.

Problems may arise if function names begin with the specific pair of characters "_Z", because it is this initial sequence that indicates that a name has been internally mangled. The risk is that the software will attempt to unmangle a function name that it recognises as mangled, which in fact is not a mangled function name.
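The risk can be seen with a prefix test of the kind described. This is a sketch only, with a hypothetical function name looks_mangled; the name _Zoom_in stands for an ordinary C function that happens to begin with "_Z", while _Z3foov is the Itanium C++ ABI mangling of foo().

```shell
#!/bin/sh
# Sketch only: treat any name beginning with "_Z" as mangled, which is the
# test described above. Returns 0 for "looks mangled", 1 otherwise.
looks_mangled() {
    case "$1" in
        _Z*) return 0 ;;
        *)   return 1 ;;
    esac
}

# _Zoom_in is a hypothetical plain C name; it is wrongly reported as mangled.
for name in _Z3foov _Zoom_in main; do
    if looks_mangled "$name"; then
        echo "$name: treated as mangled"
    else
        echo "$name: treated as plain"
    fi
done
```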

The TA produced by this fact extractor may contain what appear to be duplicate function declarations. This is initially somewhat confusing. The problem arises because functions with different internal mangled names may resolve to the same external function signature. In particular, C++ constructors and destructors are internally supported by two very similar functions, which appear to be largely duplicates of each other. One function constructs/destructs instances of a given class, while the other is employed when constructing/destructing classes derived from this base class. While this apparent duplication could be resolved by matching functions based on their external signatures, this would be a somewhat expensive operation to perform at run time. An obvious solution to this problem would be to make a minor modification to the name demangling software, so that such variations in the roles of functions encoded within their internal mangled names could be propagated to the external function signature emitted in the TA. This would at least help explain the seeming duplication of function declarations.

Fact Extraction for Freeciv

Fact extraction of Freeciv is described here.

Fact Extraction for Wesnoth

Fact extraction of Wesnoth is described here.

Fact Extraction for Quake

Fact extraction of Quake is described here.

Fact Extraction for MySQL

Fact extraction of MySQL is described here.

Fact Extraction for GIT

Fact extraction of GIT is described here.

Developed by

Ian Davis.

Location

swag:~ijdavis/src/asx -- source code
swag:~ijdavis/bin/asx -- executable

Supported platforms

Linux using the i386 instruction set

Future platforms

Any platform running gcc/g++/as which can be instructed to cross-compile to the i386 assembler.
Any compiler for which assembler can be produced, and logic written to extract the desired facts from that assembler.

Caveats

Subject to change; still under development.

Contact information

For more information on ASX please contact us at .
