Dex clone detector

This program tries to detect candidate clones in dex class files by matching pcode against pcode.

Download DEXCD

DEXCD is a DEX clone detector that discovers similarities in pcode by examining Android dex class files.

The latest builds of DEXCD are available below. While effort has been made to test DEXCD, the software may contain bugs.

DEXCD 1.0.1 [dexcd-1.0.1.zip] [dexcd-1.0.1.exe]

DEXCD is being constantly developed and improved; check back often for newer versions.


dexcd [-a/ll {i/nstructions|l/ines}] [-b/est] [-j/avac <program>] [-c/lasspath <path>] [-i/nput <file>] [-d/ynamic] [-f/ilepath <path>] [-h/tml <directory>] [-l/ength <min>] [-L/ength <min>] [-m/ismatch <cost>] [-s/ame <cost>] [-u/se <count> [-I/terations <count>] [-E/lapse <sec>] ] [-t/a <file>] [-v/erbose] [-r/emove] [-R/emove] [-T/op <count>] [-C/olumns <count>] *.[dex]


If the -c option is specified, it should be followed by a list of directories, separated by ';' or ',', where additional dex files may be found. Similarly, if the -f option is specified, it should be followed by a list of directories where source files might be found.

If the -i option is specified then, in addition to processing any java source and/or class files specified directly on the command line, dexcd will also process those files named within the indicated file. The special filename '-' is interpreted as an instruction to read class file names from the standard input. Each source/class file named within this file must begin on a new line and have no blanks before or after it.

If the -d option is specified then, having completed the initial processing of the named java source and/or class files, dexcd will attempt to iteratively find all dex classes cited but not yet processed, and process these class files too. It will search each location named in the classpath provided using the -c option; if none is specified, the default classpath is '.'. This results in an exhaustive search for all cited class files, and the subsequent dynamic processing of these initially unspecified class files. The resulting data capture may be huge. If the -j option is also specified, such dynamically discovered class files will be recompiled before being processed, provided their source is also located. This may prove to be a lengthy process but ensures that all class files examined conform to their corresponding source files, and are compiled using the '-g' option.

The -a option establishes instruction boundaries at which clone detection may be performed. If the line boundary is specified, matching of instructions may only begin at the initial instruction associated with a line of text in the source, assuming that such line boundaries are known as a consequence of having compiled the source with the '-g' option. If the instruction boundary is specified instead, clone detection is permitted to start and end at any instruction.

The -h option greatly improves the readability of massive amounts of output by creating html web pages under the named directory. These pages are individually manageable (though potentially huge in number), and are linked to the much smaller root page created with the name "index.html" in this same directory. Absent this option, all output is emitted serially to a single, potentially unmanageably large file.

The -l option specifies the minimum number of pcode instructions that may constitute a clone. This value currently defaults to 5. Both sides of the comparison must involve at least this number of pcode instructions, be they matching or unmatching instructions.

The -L option provides an alternative minimum length (to the -l option) when the instructions within a clone comprise all the instructions in a function.

The -m option provides, as a positive number, the mismatch cost, which by default is 1. A larger mismatch cost strongly biases the clone detection algorithm against declaring sequences of pcode to be clones of each other when mismatches occur within these sequences. A very large cost effectively insists that clones may contain no mismatches.

The -s option provides, as a positive number, the matching cost, which by default is 1. A larger matching cost makes the clone detection algorithm more forgiving of mismatches as the number of earlier matches seen increases.

The -u option, if not 0, causes hill-climbing to be performed after execution of the greedy algorithm. The value specifies the maximum number of edges that may be unmatched in order to permit an unmatched edge to be matched.

The -I option specifies the maximum number of iterations of the hill-climbing algorithm permitted when attempting to improve clones. If -E is specified with a non-zero number of seconds, hill-climbing for any given clone pair runs for at most this number of seconds before terminating with whatever improvements have been made in that time. This value currently defaults to 60 seconds.

The -b option causes clones to be truncated at that instruction for which the aggregate weight was maximal, rather than for the last instruction whose aggregate weight was non-negative.
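The difference between the two truncation points can be illustrated with a small sketch. This is not dexcd's code: the function name `truncation_points`, the per-instruction weight list, and the tie-breaking rule (earliest maximum wins) are all assumptions made for illustration.

```python
from itertools import accumulate

def truncation_points(deltas):
    """Return (last_non_negative, best) as counts of instructions kept.

    deltas holds one hypothetical weight per instruction: positive for a
    match, negative for a mismatch.
    """
    totals = list(accumulate(deltas))  # aggregate weight after each instruction
    last_non_negative = 0
    for count, total in enumerate(totals, start=1):
        if total < 0:
            break  # detection stops once the aggregate weight goes negative
        last_non_negative = count
    kept = totals[:last_non_negative]
    # Assumption: on ties, the earliest maximum is chosen.
    best = kept.index(max(kept)) + 1 if kept else 0
    return last_non_negative, best

# Three matches, then a run that is mostly mismatches.
default_cut, best_cut = truncation_points([1, 1, 1, -1, -1, 1, -1, -1, -1, -1])
```

Here the default behaviour would keep 8 instructions (the aggregate weight is still 0 there), while the -b style cut keeps only the first 3, where the aggregate weight peaked.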

If the -v option is specified then dexcd reports each class file processed, upon successful completion of that processing. The output will distinguish between class files explicitly processed as per the instructions provided to dexcd, and those class files implicitly processed as a consequence of having earlier been cited in previously read pcode. In addition, any internal compiles performed will be reported on the standard error output.

The -r option indicates that the clone detection output is not to report the corresponding lines of source code that match the pcode examined. This option will reduce run time significantly, since java source code will not be repeatedly scanned to find the lines that must be printed. Note that class files compiled without the '-g' option contain no debug information and hence no line numbers, and thus do not permit recovery of this java source code information.

The -R option indicates that the line by line comparison of pcode is not required in the output. This will substantially reduce the size of the clone detection output, at the expense of considerable loss of information.

The -T option indicates the number of top clones to be reported. These are the longest clones detected.

The -C option specifies how many columns the top clones are to be reported in.

The -t option specifies a file to write TA output to. This output shows graphically the clones and their relationships to the source containing them, as well as to the clones they pair with. It is viewable using lsedit.

Clone detection algorithm

The software begins by reading into memory the machine instructions present in each specified java class file. These instructions are stored within arrays and a certain amount of instruction manipulation is performed to logically add labels to the instruction set, and pseudo instructions that represent the implications of try/catch blocks.

Every pcode instruction is hashed so that the alternative candidate pcode instructions that match a given instruction can be reduced to a manageable number, helping to reduce run time from O(n^2) to O(n).
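As a rough illustration of why hashing helps, the sketch below groups instructions into buckets so that only instructions sharing a bucket are ever compared. The hash key used by dexcd is not documented here; this sketch simply assumes the instruction text is the key, and the function names and sample instructions are hypothetical.

```python
from collections import defaultdict

def bucket_by_hash(instructions):
    """Group instruction indices by a hashable key (assumed: the instruction text)."""
    buckets = defaultdict(list)
    for index, instruction in enumerate(instructions):
        buckets[instruction].append(index)
    return buckets

def candidate_pairs(instructions):
    """Yield (i, j) with i < j for every two instructions sharing a bucket."""
    for indices in bucket_by_hash(instructions).values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                yield indices[a], indices[b]

# Hypothetical pcode listing, for illustration only.
pcode = ["const/4 v0", "invoke-virtual", "move-result v1",
         "const/4 v0", "invoke-virtual"]
pairs = sorted(candidate_pairs(pcode))
```

Building the buckets is a single pass over the instructions; candidate starting points for clone detection are then read from each bucket instead of comparing every instruction with every other one.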

Starting with two distinct pcode instructions established as matching, the clone detection algorithm examines all pairings of subsequent pcode instructions seeking a further match. A greedy algorithm is used to locate the next matching pair of pcode instructions. Specifically, for incrementing numbers of skipped pcode instructions, all pairs of pcode instructions which involve precisely this number of skipped pcode instructions are searched for a match.
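The search order described above can be sketched as follows. This is an interpretation and not dexcd's implementation: the function `next_match`, its arguments, and the use of plain equality to decide a match are assumptions.

```python
def next_match(left, right, i, j):
    """Find the nearest matching pair after the matched pair (i, j).

    Tries 0 skipped instructions first, then 1, then 2, and so on. For each
    total, every split of the skips between the two sequences is tried.
    Returns (i2, j2, skipped), or None when no later pair matches.
    """
    remaining = (len(left) - (i + 1)) + (len(right) - (j + 1))
    for skipped in range(remaining + 1):
        for skip_left in range(skipped + 1):
            skip_right = skipped - skip_left
            i2 = i + 1 + skip_left
            j2 = j + 1 + skip_right
            if i2 < len(left) and j2 < len(right) and left[i2] == right[j2]:
                return i2, j2, skipped
    return None

# Hypothetical sequences: the left side has one extra instruction "x".
left = ["a", "b", "x", "c"]
right = ["a", "b", "c"]
first = next_match(left, right, 0, 0)
second = next_match(left, right, 1, 1)
third = next_match(left, right, 3, 2)
```

In this example the first call finds the adjacent pair with no skips, the second has to skip the single instruction "x" on the left, and the third finds nothing because both sequences are exhausted.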

Every matched pair of pcode instructions is assigned a positive match weight, and every prior unmatched pcode instruction a negative mismatch weight. The pairing of pcode instructions terminates when no further pcode instructions remain to be matched within the method(s) being compared, or when the cumulative weight of all the pcode instructions observed becomes negative. At this point the algorithm discards from the clone any final pcode instructions in the sequence that are not followed by a matching pcode instruction.

Thus if the matching and mismatch weights are the same value, clone detection terminates when more mismatched pcode instructions have been seen within the two sequences than matched ones.
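The termination rule can be checked with a small worked sketch. The encoding of the input as a string of 'M' (matched pair) and 'U' (unmatched instruction) events, and the function name `clone_extent`, are assumptions for illustration; the `same` and `mismatch` parameters stand in for the values given with -s and -m.

```python
def clone_extent(events, same=1, mismatch=1):
    """Return how many leading events end up inside the clone.

    events is a string of 'M' (matched pair) and 'U' (unmatched instruction).
    """
    weight = 0
    end = 0
    for position, event in enumerate(events, start=1):
        weight += same if event == "M" else -mismatch
        if weight < 0:
            break  # cumulative weight went negative: stop extending
        end = position
    # Discard trailing unmatched instructions not followed by a match.
    while end > 0 and events[end - 1] == "U":
        end -= 1
    return end

equal_weights = clone_extent("MMUUUMM")             # weights 1, 2, 1, 0, -1
forgiving = clone_extent("MMUUUMM", same=2)         # weights 2, 4, 3, 2, 1, 3, 5
strict = clone_extent("MMUMM", mismatch=100)        # weights 1, 2, -98
```

With equal weights the third mismatch outnumbers the two matches, so the clone ends after the first two matched pairs. Doubling the match weight lets the same sequence survive to its end, and a very large mismatch weight stops the clone at the first mismatch.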


This program outputs html that can be viewed in any suitable web browser. The output shows sections of pcode that are similar or identical to other sections of pcode. If the class files have been compiled with the '-g' option, similarity of local variables is predicated on their given names. Otherwise it is currently predicated on the position of these local variables on the stack.

If the -h option is used, a three-tier html reporting structure is employed, with the root of this structure being the file 'index.html' within the indicated directory. The top tier hyperlinks to a distinct page for each class file examined. Each such second-tier page describes the clones found in that class file that match later pcode, either in the same class file or in class files examined later. The third tier shows, for each potential clone identified, the actual pcode and source code that justified reporting this clone. Note that this option can potentially produce a huge number of small viewable files.


If an exception or interrupt is raised, such as that generated by keyboard input of Ctrl-C, this program stops its current efforts to detect clones, reports the observed exception, and then terminates in a fashion which permits the results generated so far to be viewed.

Example usage

On Windows one might invoke the command:

dexcd.exe -v -c /swag/lsedit/7.3.12 -d -u 256 -l 24 -L 24 -T 2048 -h /temp/lsedit c:/swag/lsedit/7.3.12/lsedit/LandscapeEditorFrame.dex

while on Unix one might invoke the command:

./dexcd -v -S -c /home/ijdavis -d -u 256 -l 24 -L 24 -T 2048 -h /home/ijdavis/public_html/lsedit ../../lsedit/LandscapeEditorFrame.dex

See also:

acd, javap2, javex, jcd

Developed by

  • Ian Davis.


  • swag:~ijdavis/src/dexcd -- source code

Supported platforms

Any that the source can be compiled on.


  • Subject to change; still under development.

Contact information

For more information on dexcd please contact us at .