Download ASX
ASX is a fact extraction tool that extracts source information from C, C++, assembler, object files, libraries, dynamic libraries and executables, in a format that may then immediately be visualised using lsedit. The supporting paper can be found here.
The latest builds of ASX are available below. While effort has been made to test ASX, the software may contain bugs. Note that currently ASX is provided as a binary executable that will only run on the specified platform.
For convenience and for verification purposes the
intermediate disassembled output can be viewed by using the program
asxo.
ASX is being constantly developed and improved; check back often for newer versions.
Description
This program takes a list of parameters specifying compilation options and C/C++/assembler/object files assumed to collectively form some sort of compilation unit, and emits to the standard output facts expressed in TA regarding the function declarations, function invocations, variable usage, and classes discovered in the compilation unit(s).
When C and C++ source files are presented to this fact extractor, it first attempts to generate assembler source files from the named source files, and if this succeeds it extracts from the generated assembler the facts emitted in the output TA. If the compilation of such a source file fails, this is reported in the output TA, and asx falls back on a clumsier method of extracting facts from the provided source code: a simple lexical scan of the output produced by cpp when presented with this source and its associated arguments.
When assembler files (*.s files) are presented to ASX they are parsed directly. When object files (*.o files), archives (*.a files), dynamic libraries (*.so files) and/or executables are presented to this fact extractor, extraction is achieved by disassembling the contents of the file.
The primary motivation for developing this fact extractor was the desire to extract facts simply from a wide variety of source languages and object file formats, while preserving the source file and line number from which the facts were extracted as attributes within the resulting TA.
Synopsis
asx [-] < build.history > facts.ta
asx (<compile options>* <extraction files>+)+ > facts.ta
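The second synopsis form, (<compile options>* <extraction files>+)+, can be read as a sequence of groups, each pairing a set of options with the files they apply to. The following Python sketch illustrates that reading only; it is not taken from the asx source. The function name group_arguments is hypothetical, and two behaviours are assumptions: that the file name following -o is dropped together with -o, and that ignored options never start a new group.

```python
def group_arguments(argv):
    """Split a command line into (options, files) groups.

    Follows the pattern (<compile options>* <extraction files>+)+ :
    options seen after one or more files start a fresh group and
    discard the options of the previous group.
    """
    groups = []
    options, files = [], []
    skip_next = False
    for arg in argv:
        if skip_next:              # assumption: "-o" consumes its file name
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True
            continue
        if arg in ("-c", "-E") or arg.startswith("-O"):
            continue               # documented as ignored
        if arg.startswith("-"):
            if files:              # options after files begin a new group
                groups.append((options, files))
                options, files = [], []
            options.append(arg)
        else:
            files.append(arg)
    if files:
        groups.append((options, files))
    # -g and -S are added to each group when absent
    return [(opts + [o for o in ("-g", "-S") if o not in opts], fs)
            for opts, fs in groups]


if __name__ == "__main__":
    for opts, fs in group_arguments(
            ["-DABC", "source1.cpp", "source2.cpp", "-DEFG", "source3.cpp"]):
        print(opts, fs)
```

With the documented example, ABC ends up defined only for source1.cpp and source2.cpp, and EFG only for source3.cpp.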
Options
When working with small projects the simplest way to perform fact extraction is to invoke asx with the command line options used to compile the project, as if asx were a substitute for gcc or g++. Such options consist of zero or more compile time options followed by one or more source files, with this pattern optionally being repeated. Any compiler options (these are presumed to begin with a '-') may be specified. The options -c, -o, -E and -O* are inappropriate for generating instrumented assembler, and will thus be ignored if seen on the command line. The options -g and -S need not be explicitly specified, since these are automatically added to the compile time options if absent.
Options apply to all source files that follow them, and all earlier options are discarded when further options appear after one or more source files. Thus it is possible, though not necessarily easy, to submit very different options for each source file from which facts are to be extracted. For example, the options -DABC source1.cpp source2.cpp -DEFG source3.cpp would define ABC in source1.cpp and source2.cpp, but EFG and not ABC in source3.cpp. No options are needed when extracting facts from assembler, binary object files, or archived libraries.
Alternative standard input options
In more complicated build processes, an alternative approach may be used to extract facts using asx. First download, compile, and install the simple program echoargs.c. This program merely echoes the current working directory, followed by any arguments presented to it, enclosing such arguments in quotation marks the better to ensure that they can subsequently be reparsed correctly. Then place script files named gcc, g++ and, if required, as and cc earlier in your $PATH search path than the genuine compilers, and have these scripts capture the arguments presented to the genuine compiler before actually invoking it.
To check that your $PATH is set up correctly use echo $PATH, and confirm that your scripts are in a directory named earlier than the directory containing the real compilers. If not, use the commands
PATH=<directory>:$PATH; export PATH
to correct this, ideally by placing this command in your login script, so that it is automatically performed each time you log in. If you log in with csh, rather than bash or sh (which can be determined by using the command echo $SHELL), the above command becomes
setenv PATH <directory>:$PATH
One such gcc shell script might contain:
#!/bin/sh
# Record the working directory and arguments, then run the real compiler.
echoargs /usr/bin/gcc "$@" >> $HOME/gcc.history
/usr/bin/gcc "$@"
while the corresponding g++ shell script would contain:
#!/bin/sh
echoargs /usr/bin/g++ "$@" >> $HOME/gcc.history
/usr/bin/g++ "$@"
and the as (sometimes named gas) shell script would contain:
#!/bin/sh
echoargs /usr/bin/as "$@" >> $HOME/gcc.history
/usr/bin/as "$@"
Note that the above scripts must write to the same history file, and that the history file must be given an absolute path name. You can test this setup by subsequently invoking gcc and/or g++ to verify that the commands as invoked are captured in the history, as well as executed.
Having done this, all build operations performed through make or another vehicle that do not explicitly specify the location of gcc/g++ will have a history built of the arguments submitted to gcc/g++. Further, if all else fails, for many makefiles the location of the compiler can be set through the CC environment variable.
Be careful to empty or remove this history file before beginning the build process, rather than at any point during it, and then build the executable from which facts are to be extracted. One way to do this within a Makefile is to have make clean remove the history file, and to specify clean as the first thing that your build is dependent on.
Also be careful to rename the script wrappers used to capture this history once capture is complete, so that you are not needlessly building an ever larger history of prior compilations.
Such captured compiler instructions may then be presented to asx, either by invoking asx without options, or by passing to asx an initial run time option of "-". In either case asx subsequently operates by reading its instructions from standard input, rather than from the command line. Each line read from standard input must contain parameters that are double quoted, to indicate where each begins and ends. The first such parameter specifies the directory in which all subsequent compilation relating to this line is to be performed, and the second must contain the named location of the tool to be used in performing all compilations relating to options on this same input line.
Care should be taken to review the input parameters generated using this technique prior to running asx, since depending on the context one might or might not wish to extract facts from both source files and object files. Facts might erroneously be recovered multiple times if, for example, source code was compiled into object code, and these object units were then themselves combined into larger object units. By default asx will make some effort to avoid processing input object files corresponding to earlier source file compilations, but input libraries will by default always have facts extracted from them.
A special keyword reset on a line by itself within the history passed to asx instructs asx to make all classes, functions and variables constructed by asx prior to this reset command invisible to all later processing performed prior to the actual emission of the output TA.
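The history format described above (double quoted parameters, working directory first, tool second, and an optional reset line) can be sketched in Python as follows. This is an illustration under assumptions, not the asx reader or echoargs.c itself: the names format_history_line and parse_history are hypothetical, and the sketch assumes arguments contain no embedded double quotes, since how echoargs escapes them is not stated here.

```python
import shlex


def format_history_line(cwd, tool, args):
    """Mimic the described echoargs output: the working directory,
    then every argument, each enclosed in double quotes.
    Assumption: no argument contains a double quote."""
    return " ".join('"%s"' % part for part in [cwd, tool, *args])


def parse_history(lines):
    """Yield ("reset",) for a reset line, otherwise
    ("compile", directory, tool, arguments)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "reset":
            yield ("reset",)
            continue
        fields = shlex.split(line)      # honours the double quotes
        if len(fields) < 2:
            raise ValueError("expected a directory and a tool: %r" % line)
        yield ("compile", fields[0], fields[1], fields[2:])


if __name__ == "__main__":
    line = format_history_line("/home/u/proj", "/usr/bin/gcc",
                               ["-DMSG=a b", "main.c"])
    print(line)
    print(list(parse_history([line, "reset"])))
```

Quoting each parameter is what lets an argument containing a space, such as -DMSG=a b, survive the round trip as a single parameter.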
When a build involves building multiple distinct programs, placing such a reset command at the appropriate points within the history ensures that the subgraphs for each distinct program constructed during the build process remain disconnected from all subgraphs constructed by other unrelated parts of the build process. Without such manually entered reset instructions, asx will permit functions and variables defined in one program to be logically accessible to all of the programs being built, even if the build process itself does not actually permit such functions and variables to be shared across programs.
Lacking build instructions
In situations where the source lacks the build instructions that would permit capture of a genuine build history, the tool guessbuild can be used in a somewhat desperate attempt to reverse engineer a plausible build history from the collection of files that constitute the source of interest, without actually building the source code.
Special environment variables
These environment variables may be used to constrain the behaviour of asx.
If ASX_SKIP is set to yes, then no attempt is made to compile source files for which assembler files already exist that are newer than the corresponding source file. Instead it is assumed that the assembler files are to be read, without first compiling the corresponding source.
If ASX_SILENT is set to yes, then compilation commands are not echoed, though output resulting from those compilations may still be. The default of no causes compilation commands to be echoed.
If ASX_UNLINK is set to no then any generated assembler files will not be silently removed. This is a useful option if one wishes to later examine such files. Otherwise, assembler files generated by asx will subsequently be silently deleted.
ASX_IGNORE contains a list of qualifiers separated by any of ';', ':' or ',' that are not to be presumed to contain information from which facts are to be extracted. The following qualifiers are recognised:
For example, if one knew that one only wished to extract facts from source files one might specify ASX_IGNORE="o;so;a" to avoid extracting facts from object files and libraries presented as inputs to the build process. Use 'v' to suppress all variables, 'vs' to suppress all but global variables, and 'vf' to suppress static variables defined within a function's scope. If 'f' is specified no function calls will be generated in the output. If 'fv' is specified no virtual function calls will be generated in the output. Currently by default all function calls are emitted and only global variables are emitted.
ASX_FORCE contains a list of suffixes separated by any of ';', ':' or ',' that are always presumed to contain information from which facts are to be extracted. By default '.o' and '.so' files whose absolute prefix matches the absolute prefix associated with a source file already processed by asx are ignored, on the grounds that the facts contained within such object files have already been extracted from the corresponding source files. Thus the default behaviour is effectively to extract facts only from those object files seen as inputs within the build for which no corresponding source file appears to exist. If 'f' is specified no virtual function calls will be generated. If 'fv' is specified all function calls will be emitted. If 'v' is specified all static variables will be emitted; if 'vg' is specified only global variables will be emitted, while if 'vs' is specified both global variables and static file scope variables will be emitted but not static variables having only function scope.
ASX_LIFT contains a list of types of entity whose edges are to be lifted to the parent of this entity. Currently the only entities which may have edges lifted are address entities whose parents are functions or variables. This lifting is requested by assigning ASX_LIFT the value 'a'.
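The separator rule for ASX_IGNORE and ASX_FORCE, and the default treatment of '.o' and '.so' files, can be sketched in Python as below. This is only one possible reading of the text above. The function names are hypothetical, "absolute prefix" is assumed to mean the absolute path with its suffix removed, and the precedence chosen here (ignore first, then force, then the default rule) is an assumption rather than documented behaviour.

```python
import os
import re


def parse_qualifiers(value):
    """Split an ASX_IGNORE / ASX_FORCE style value on ';', ':' or ','."""
    return {q.lstrip(".") for q in re.split(r"[;:,]", value) if q}


def should_extract(path, processed_sources, ignore, force):
    """Decide whether facts would be extracted from `path`.

    processed_sources: source files already handled.
    ignore, force: sets as returned by parse_qualifiers.
    """
    stem, ext = os.path.splitext(os.path.abspath(path))
    suffix = ext.lstrip(".")
    if suffix in ignore:          # assumed to take precedence
        return False
    if suffix in force:
        return True
    if suffix in ("o", "so"):
        # default: skip objects whose prefix matches a processed source
        stems = {os.path.splitext(os.path.abspath(s))[0]
                 for s in processed_sources}
        return stem not in stems
    return True


if __name__ == "__main__":
    ignore = parse_qualifiers("o;so;a")
    print(sorted(ignore))
    print(should_extract("/src/main.o", ["/src/main.c"], set(), set()))
```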
ASX_DWARF causes the DWARF symbolic information to be dumped in its internal tree form on the standard error output. This information is primarily of use to developers of ASX.
The above environment variables will also be recognised if entered directly on the asx command line preceded by "--".
Source files
Source files are those files having one of the suffixes '.c', '.C', '.cc' (presumed to be C source files); one of the suffixes '.cxx', '.cpp', or '.c++' (presumed to be C++ source files); '.S' (presumed to be assembler requiring preprocessing); or '.s' (presumed to be assembler source files). C source files are converted to assembler by using /usr/bin/gcc while C++ source files are converted to assembler using /usr/bin/g++. When the options to asx are submitted on the command line, each non-assembler source file is compiled in the directory within which it resides, even if this is not the directory from which asx is invoked. It is impossible in general to say where source code should be compiled, and this is a good guess as to how the submitted code is normally built. Note that this is significant when specifying include files as relative path names. Such directory changes while compiling source should otherwise be transparent to the user.
Object files
Object files are those files having the suffixes '.o', '.a' and '.so', and executable files (having no suffix). Extraction from these files uses bfd, which invokes backend software to manage the differences between machine architectures. Object files on swag are represented using the ELF standard (pdf), with some sections within this standard internally encoded using the DWARF standard (pdf) and the DWARF 4 standard. For the DWARF standard see also http://dwarfstd.org.
Internals
The behaviour of asx is specific to a given assembler dialect and, if ported to other machines, might require modification to handle different assembler directives.
Data structures are generated as each file from which facts are extracted is processed. These represent:
ASX uses the DWARF encoded symbolic debug information (if present) to deduce the names and types of parameters and local variables. It also attempts to detect sequences of assembler code that are generated when C++ functions are invoked indirectly via a class VTable. This logic presumes a considerable amount about how such assembler code is generated, and is very compiler specific. ASX version 3 uses the Itanium C++ ABI standard to extract the namespace and class that objects belong to from these objects' internally mangled names. It attempts to address departures within the GNU compilers from this standard. Earlier versions of ASX did not handle namespaces correctly.
Directories
The directory structure emitted in the output TA is derived directly
from the named source files occurring on the command line. Different
output will be emitted in the TA for source files specified as *.c*
from those specified logically equivalently as ./*.c*. In the former
case the current directory will not be emitted, and so the source files will
appear under the root of the landscape, while in the latter case, the
current directory will be emitted with the source files specified
contained within this directory.
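The difference between *.c* and ./*.c* can be illustrated with a small Python sketch. It assumes, purely for illustration, that the emitted directories are derived textually from each relative path exactly as given; the helper name containing_directories is hypothetical.

```python
def containing_directories(path_as_given):
    """Return the chain of directories implied by a relative path,
    taken literally from the text of the path (no normalisation)."""
    parts = path_as_given.split("/")[:-1]      # drop the file name
    return ["/".join(parts[:i + 1]) for i in range(len(parts))]


if __name__ == "__main__":
    print(containing_directories("main.c"))        # no directory emitted
    print(containing_directories("./main.c"))      # current directory emitted
    print(containing_directories("src/util/a.c"))
```

String splitting is used deliberately here: Python's pathlib would normalise ./main.c to main.c and so hide the very distinction being described.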
An archive or object library is a binary file, containing within it zero
or more object files. The source units contained within it treat an archive
as if it were a specialisation of a directory.
These internal objects represent the logical files processed; be those source
files, assembler files, object files, or virtual files contained within an
archive. The key characteristic of such a file is that it is a basic source
from which facts can be extracted.
Named files represent citations within the assembler code that specify
where the assembler is to be deemed to have been produced from. While
these named files will often agree with the names of the files submitted
for compilation, this will not always be the case, since the preprocessor
inserts #line directives into source code when scanning header files.
Some preprocessors such as yacc and bison also employ such directives
explicitly so that the assembled code is related not to the source that
produced it but to the directives that cause source code to be generated.
A given source file may thus contain multiple distinct named files
within it. Within the assembler these are distinguished by number.
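Assembler produced by gcc with -g -S typically records named files with numbered .file directives and refers back to them by number in .loc directives. The Python sketch below shows how such numbering can be resolved to a file name and line number. It is an illustration, not the asx parser: the function name source_positions is hypothetical, and only the simple forms .file N "name" and .loc N LINE are handled.

```python
import re

# .file 1 "parser.y"   declares named file number 1
FILE_RE = re.compile(r'^\s*\.file\s+(\d+)\s+"([^"]*)"')
# .loc 1 42 0          following code comes from file 1, line 42
LOC_RE = re.compile(r'^\s*\.loc\s+(\d+)\s+(\d+)')


def source_positions(asm_text):
    """Return (names, positions): the number-to-name table and the
    (file name, line number) pair named by each .loc directive."""
    names = {}
    positions = []
    for line in asm_text.splitlines():
        match = FILE_RE.match(line)
        if match:
            names[int(match.group(1))] = match.group(2)
            continue
        match = LOC_RE.match(line)
        if match:
            fileno, lineno = int(match.group(1)), int(match.group(2))
            positions.append((names.get(fileno, "?"), lineno))
    return names, positions


if __name__ == "__main__":
    asm = '''
        .file "parser.c"
        .file 1 "parser.y"
        .file 2 "/usr/include/stdio.h"
        .loc 1 42 0
        call yylex
        .loc 2 7 0
    '''
    print(source_positions(asm))
```

In this made-up fragment the file submitted for compilation is parser.c, yet the code is attributed to parser.y and to a header, which is the situation the paragraph above describes.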
These objects represent the various classes to which functions and variables
belong, as identified from their mangled names. To see all of the
member variables and functions of such a class within lsedit, select the
classes of interest and then use a forward query (F).
These objects represent instances of function signatures. Since the number
of functions may be large in a big compilation, hashing is used to provide
rapid lookup of known functions.
Those functions for which a function declaration has been seen will be
known to occur in one or more specific source units, with each such
declaration generally being associated with a starting line number
(specified in the output TA by the function's lineno attribute) in a
named file (specified in the output TA by the function's file attribute).
When a function signature is declared in multiple source units, this function
will appear within each such source unit within the output TA.
Software may also recover the address of functions, presumably to later permit
these functions to be indirectly invoked by their address. When such
addresses are obtained this is schematically shown by creating an edge from
the accessor to the function address, itself contained as a pseudo member of
this function.
These objects represent instances of variables either declared or used within
the assembler. Unlike functions and function invocations, the assembler
does not preserve the line number or file in which such variables are
logically defined.
Variables may be accessed in a manner which updates them, retrieves their
address, or determines their value. Any operation which may potentially
result in updating a variable, either directly, or through its address,
will generate update edges, rather than read edges.
Since the number of possible interactions between functions may be huge,
hashing is used to provide rapid lookup of known function invocations, and
variable usage.
Each function invocation is represented in the TA by an edge from the
calling function, to the function invoked. This edge is assigned a
file and lineno attribute indicating where this specific function
invocation was first encountered. The number of times a given function
calls a given function within the assembler, is contained in the edge's
freq attribute.
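The edge bookkeeping just described (one edge per caller and callee pair, the file and lineno of the first invocation seen, and a freq count) can be sketched with a hash table keyed on the pair. The class and method names below are hypothetical, and a Python dict stands in for whatever hashing scheme asx actually uses.

```python
class CallEdges:
    """One edge per (caller, callee) pair, looked up by hashing."""

    def __init__(self):
        self._edges = {}          # (caller, callee) -> attribute dict

    def record(self, caller, callee, file, lineno):
        """Register one invocation found in the assembler."""
        key = (caller, callee)
        edge = self._edges.get(key)
        if edge is None:
            # first sighting fixes the file and lineno attributes
            self._edges[key] = {"file": file, "lineno": lineno, "freq": 1}
        else:
            edge["freq"] += 1     # later sightings only bump the count

    def edge(self, caller, callee):
        return self._edges[(caller, callee)]


if __name__ == "__main__":
    edges = CallEdges()
    edges.record("main", "parse", "a.c", 10)
    edges.record("main", "parse", "a.c", 20)
    print(edges.edge("main", "parse"))
```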
Functions which are invoked but never declared in the source code submitted
to asx become, in the output TA, members of the class external
and are placed in a special :library: directory. Edges to such
functions are assigned a different type (and color) from edges between
functions that are both declared.
Because function signatures may be replicated in different files, a certain
amount of resolution is sometimes needed to correctly determine which function
is the actual one being called by a caller. The first preference is to assume
that the appropriate function is the one declared in the same source unit
as the caller. Where no instance of a function occurs in the same source
unit as the calling function, the preference is then to resolve the function
to the earliest declaration of the function seen when processing the input.
Thus if the same function was declared both in an object file and an archive
library, resolution would be to the object file if this appeared earlier in
the command line than the archive library, and otherwise to the declaration of
the function contained in the archive library.
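The two-step preference described above can be written down as a short Python sketch. The function name resolve and the representation of declarations as an ordered list of (function, source unit) pairs are assumptions made for the example.

```python
def resolve(callee, caller_unit, declarations):
    """Pick the source unit whose declaration of `callee` is used.

    declarations: list of (function, source_unit) pairs in the order
    in which the declarations were seen while processing the input.
    """
    candidates = [unit for name, unit in declarations if name == callee]
    if not candidates:
        return None                 # never declared: treated as external
    if caller_unit in candidates:
        return caller_unit          # same source unit as the caller wins
    return candidates[0]            # otherwise the earliest declaration


if __name__ == "__main__":
    seen = [("log", "util.o"), ("log", "libcommon.a"), ("log", "main.c")]
    print(resolve("log", "main.c", seen))     # same unit preferred
    print(resolve("log", "other.c", seen))    # earliest declaration
```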
Edges are also constructed from functions to variables used within these
functions. As with functions, when the same variable name occurs in multiple
input files, resolution of the actual variable referenced is achieved
by preferring always to associate functions with variables that occur
within the same compilation unit.
The objects shown in the output TA as variables are those denoted in the
assembler as being statically defined named objects. Such objects may have
either local or global scope.
Within the assembler code function and variable names may be mangled if
originally derived from C++ code. These mangled names are unmangled by
using a subset of the GNU source code provided in the
binutils-2.20.1 distributable.
For details on how names are mangled, and thus later unmangled,
see the
Itanium C++ ABI and the earlier
GNU V3 ABI specification.
The original mangled name is preserved as an attribute within the output TA.
ASX leverages Binutils both for demangling internal C++ names, and for handling
the interpretation of some of the contents of ELF object files. While
the amount used from Binutils is not large, it seemed undesirable to
attempt to extract from Binutils only those parts used, since this would make it
more difficult to upgrade asx when more recent versions of Binutils are
released. As a precursor to compiling asx, extract the publicly available
Binutils source from
http://ftp.gnu.org/gnu/binutils,
change into the extracted directory, and then follow
the instructions in README to build Binutils. This typically involves
issuing the command "./configure" followed by the command "make".
Modern versions of Binutils themselves depend on the zlib libraries, which can
be readily downloaded from
http://www.zlib.net and compiled.
This code was originally developed to work with C and C++ assembler containing embedded symbolic information
encoded using the Dwarf-3 encoding standard. In the four years since it was developed GCC and G++ have
been enhanced to use the Dwarf-4 standard. Only
ASX-4.0.1 and higher contain the necessary code to
interpret this newer standard. Older versions of ASX
will fail if the assembler source code generated by
GCC/G++ conforms to the Dwarf-4 standard rather than
the Dwarf-3 standard.
When presented with source code this fact extractor works best when the
code is compilable. Compilation is therefore always recommended. Since the
source code is compiled as part of fact extraction, warning and error messages
relating to compilation of this source code may be emitted by this program.
When the code fails to compile, the fact extraction achieved by lexically
scanning the source code is designed to be as forgiving of errors as
possible. However the output produced in this case may be problematic,
particularly if scanning C++ files. Function name extensions that
identify functions with specific classes will not be reported unless
the code explicitly uses the full function name, and function signatures
will not be encoded in the function name, potentially causing problems
when attempting to resolve function names within this software. Thus
invocations to a single function, might visually appear to be invocations
to multiple functions, while invocations to multiple distinct functions
distinguished only by their signatures will appear to be invocations to
the same function.
When presented with assembler and object code that has been previously compiled,
the usefulness of fact extraction will depend critically on whether these
source units were themselves generated in a manner (using the -g option)
that preserved symbolic information and line number information in the
output file.
Problems may arise if function names begin with the specific pair of
characters "_Z", because it is this initial sequence that indicates that
a name has been internally mangled. The risk is that the software will
attempt to unmangle a function name that it recognises as mangled, which
in fact is not a mangled function name.
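The risk can be seen with a one-line prefix test written in Python. This sketch only illustrates the "_Z" convention; the function name looks_mangled and the sample symbol _Zoom_in are made up for the example, and the actual test applied by asx may be stricter.

```python
def looks_mangled(symbol):
    """Treat any symbol beginning with "_Z" as a mangled C++ name."""
    return symbol.startswith("_Z")


if __name__ == "__main__":
    print(looks_mangled("_Z3foov"))    # mangled form of foo()
    print(looks_mangled("_Zoom_in"))   # a C function, wrongly flagged
    print(looks_mangled("main"))
```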
The TA produced by this fact extractor may produce what appear to be
duplicate function declarations. This is initially somewhat confusing.
The problem arises because functions with different internal mangled
names may resolve to the same external function signatures.
In particular, C++ constructors
and destructors are internally supported by two very similar functions,
which appear to be largely duplications of each other.
One of these functions constructs or
destructs instances of a given class, while the other is employed when
constructing or destructing classes derived from this base class. While
this apparent duplication could be resolved by matching functions based on
their external signatures, this would be a somewhat expensive operation to
perform at run time. An obvious solution to this problem would be to make
a minor modification to the name demangling software so that such variations in
the roles of functions encoded within their internal mangled names could be
propagated to the external function signature emitted in the TA. This
would at least help explain the seeming duplication of function declarations.
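In the Itanium C++ ABI the role of a constructor or destructor is encoded by a two character code in the mangled name (C1, C2, D0, D1, D2), which is why _ZN3FooC1Ev and _ZN3FooC2Ev both demangle to Foo::Foo(). The Python sketch below recovers that code so it could be attached to the emitted signature. It is a deliberately minimal illustration, not the Binutils demangler: the function name is hypothetical and only plain nested names without templates or substitutions are handled.

```python
VARIANTS = {
    "C1": "complete object constructor",
    "C2": "base object constructor",
    "D0": "deleting destructor",
    "D1": "complete object destructor",
    "D2": "base object destructor",
}


def ctor_dtor_variant(mangled):
    """Return the constructor/destructor role encoded in a simple
    nested name of the form _ZN<len><id>...<code>E..., or None."""
    if not mangled.startswith("_ZN"):
        return None
    pos = 3
    # skip each length-prefixed identifier (namespace or class name)
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        pos = end + int(mangled[pos:end])
    return VARIANTS.get(mangled[pos:pos + 2])


if __name__ == "__main__":
    for name in ("_ZN3FooC1Ev", "_ZN3FooC2Ev", "_ZN3FooD0Ev", "_ZN3Foo3barEv"):
        print(name, "->", ctor_dtor_variant(name))
```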
Fact Extraction for Wesnoth
Fact extraction of Wesnoth is described here.
Fact Extraction for Quake
Fact extraction of Quake is described here.
Fact Extraction for MySQL
Fact extraction of MySQL is described here.
Fact Extraction for GIT
Fact extraction of GIT is described here.
Developed by
Location
Supported platforms
Future platforms
Caveats
See also
Contact information
For more information on ASX please contact us at .