Java clone detector

This program tries to detect candidate clones in Java class files by matching pcode against pcode.

Download JCD

JCD is a Java clone detector that discovers similarities in pcode by examining Java class files.

The latest builds of JCD are available below. While effort has been made to test JCD, the software may contain bugs.

JCD 1.0.10 [jcd-1.0.10.zip]
JCD 1.0.9 [jcd-1.0.9.zip] Linux executable
JCD 1.0.8 [jcd-1.0.8.zip] Linux executable
JCD 1.0.6 [jcd-1.0.6.zip] Linux executable
JCD 1.0.5 [jcd-1.0.5.zip] Windows executable
JCD 1.0.4 [jcd-1.0.4.zip] Windows executable
JCD 1.0.3 [jcd-1.0.3.zip] Windows executable
JCD 1.0.2 [jcd-1.0.2.zip]
JCD 1.0.1 [jcd-1.0.1.zip] Windows executable

JCD is being constantly developed and improved; check back often for newer versions.

Synopsis

jcd [-a/ll {i/nstructions|l/ines|s/teps}] [-b/est] [-j/avac <program>] [-c/lasspath <path>] [-i/nput <file>] [-d/ynamic] [-e/xceptions] [-h/tml <directory>] [-l/ength <min>] [-L/ength <min>] [-m/ismatch <cost>] [-s/ame <cost>] [-u/se <count> [-I/terations <count>] [-E/lapse <sec>] ] [-t/a <file>] [-v/erbose] [-r/emove] [-R/emove] [-S/kipcompile] [-T/op <count>] [-C/olumns <count>] [-x/tra] *.[class|java]

Options

If the -j option is specified, jcd will attempt to compile source files presented to it that have the suffix ".java", using the indicated program and classpath, before examining the corresponding class file. This compilation itself uses the '-g' option, ensuring that the resulting class files contain symbolic variable information, source line number information, etc. Note that in the absence of the '-d/ynamic' option, inner classes contained in inner class files will not be processed by jcd unless these inner class files are explicitly named.

If the -c option is specified, it should be followed by a list of directories, separated by ';' or ',', where additional class files may be found. If the -j option is also specified, and Java source files are compiled as a consequence, this classpath must also conform to the syntax required by javac.

If the -i option is specified then, in addition to processing any Java source and/or class files specified directly on the command line, jcd will also process those files named within the indicated file. The special filename '-' is interpreted as an instruction to read class file names from the standard input. Each source/class file named within this file must begin on a new line and have no blanks before or after it.

If the -d option is specified then, having completed the initial processing of the named Java source and/or class files, jcd will attempt to iteratively find all classes cited but not yet processed, and process these class files too. It will search each location named in the classpath provided using the -c option; if none is specified, the default classpath is '.'. This results in an exhaustive search for all cited class files, and the subsequent dynamic processing of these initially unspecified class files. The consequent data capture may potentially be huge. If the -j option is also specified, such dynamically discovered class files will be recompiled, if their source is also located, before themselves being processed. This may prove to be a lengthy process, but ensures that all class files examined conform to their corresponding source files and are compiled using the '-g' option.
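
The iterative discovery described for -d can be pictured as a worklist over cited classes. The following is only a minimal sketch of that idea, not jcd's actual implementation: the names discover_classes, cited_by and available are hypothetical, and the lookup of class files on the classpath is reduced to a membership test.

```python
from collections import deque

def discover_classes(initial, cited_by, available):
    """Return class names in the order a worklist would process them.

    initial   -- class names given explicitly (hypothetical input)
    cited_by  -- hypothetical map: class name -> names of classes it cites
    available -- names of classes assumed to be locatable on the classpath
    """
    start = list(dict.fromkeys(initial))  # drop duplicates, keep order
    seen = set(start)
    queue = deque(start)
    processed = []
    while queue:
        name = queue.popleft()
        processed.append(name)
        for cited in cited_by.get(name, ()):
            # Follow a citation only if it can be found and is not yet queued.
            if cited in available and cited not in seen:
                seen.add(cited)
                queue.append(cited)
    return processed

cited_by = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"]}
order = discover_classes(["A"], cited_by, available={"A", "B", "C"})
print(order)
```

Here class D is cited but cannot be located, so it is never processed, and the cycle from C back to A does not cause repeated work.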

The -a option establishes pcode boundaries at which clone detection may be performed. If line boundary is specified, matching of pcode clones may only begin at the initial pcode associated with a line of text in the source, assuming that such line boundaries are known as a consequence of having compiled the source with the '-g' option. If step boundary is specified, matching of pcode may only begin at pcode instructions that could have been performed given the presence of an empty stack. Matching continues while the amount pushed onto the stack at least equals the amount removed from the stack. If some other boundary is specified using the '-a' option, clone detection is permitted to start and end at any pcode instruction.
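
The step boundary rule can be illustrated by tracking stack depth. The sketch below is one possible reading of the description above, not jcd's code: it assumes each instruction is summarised as a (pops, pushes) pair, treats an instruction that pops nothing as a permissible start, and extends the run while no instruction removes more than the run itself has pushed. The function names and the stack effects used are illustrative.

```python
def step_starts(effects):
    """Indices of instructions that need nothing on the stack (assumed rule)."""
    return [i for i, (pops, _) in enumerate(effects) if pops == 0]

def step_extent(effects, start):
    """Return the end index (exclusive) of the run beginning at `start`.

    effects -- list of (pops, pushes) pairs, one per instruction; these
               values are illustrative and not taken from jcd.
    """
    depth = 0
    end = start
    for pops, pushes in effects[start:]:
        if pops > depth:
            break  # would remove more than this run has pushed
        depth += pushes - pops
        end += 1
    return end

# Roughly: load, load, add, store, pop
effects = [(0, 1), (0, 1), (2, 1), (1, 0), (1, 0)]
print(step_starts(effects))
print(step_extent(effects, 0), step_extent(effects, 1))
```

Starting at the first load, the run covers the add and the store; starting at the second load, the add would consume a value pushed before the run began, so the run stops there.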

If the -e option is specified then the pcode instructions will not be augmented to include information about the presence of try/catch blocks, and distinctions resulting from the presence, absence, or differences between try/catch blocks will not be discovered.

The -h option greatly improves the readability of massive amounts of output by creating html web pages under the named directory, which individually are manageable (though potentially huge in number), and linked to the much smaller root page created with the name "index.html" in this same directory. Absent this option, all output is emitted serially to a single, potentially unmanageably large file.

The -l option specifies the minimum number of pcode instructions that may constitute a clone. This value currently defaults to 5. Both sides of the comparison must involve at least this number of pcode instructions, be they matching or unmatching instructions.

The -L option provides an alternative minimum length (to the -l option) when the instructions within a clone comprise all the instructions in a function.

The -m option provides, as a positive number, the mismatch cost, which by default is 1. A larger mismatch cost strongly biases the clone detection algorithm against declaring sequences of pcode to be clones of each other when mismatches occur within these sequences. A very large cost effectively insists that clones may contain no mismatches.

The -s option provides as a positive number the matching cost, which by default is 1. A larger same cost increases the degree to which the clone detection algorithm is forgiving of mismatches, as the number of earlier matches seen increases.

The -u option, if not 0, causes hill-climbing to be performed after execution of the greedy algorithm. The value specifies the maximum number of edges that may be unmatched in order to permit an unmatched edge to be matched.

The -I option specifies the maximum number of iterations of the hill-climbing algorithm permitted when attempting to improve clones. If -E is specified with a non-zero number of seconds, hill-climbing for any given clone pair runs for at most this number of seconds before terminating with whatever improvements have been made within that time. This value currently defaults to 60 seconds.

The -b option causes clones to be truncated at that instruction for which the aggregate weight was maximal, rather than for the last instruction whose aggregate weight was non-negative.
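
The difference between the two truncation rules can be seen on a small series of weights. This is a hedged sketch only: it assumes a clone is summarised as a list of per-instruction weights (positive for a match, negative for a mismatch) that starts with a match, and the function name truncation_points is hypothetical.

```python
from itertools import accumulate

def truncation_points(weights):
    """Return (default_end, best_end) as indices into `weights`.

    default_end -- last instruction whose aggregate weight is non-negative
    best_end    -- first instruction at which the aggregate weight is maximal
    Assumes the first weight is positive, so both indices exist.
    """
    totals = list(accumulate(weights))
    default_end = max(i for i, t in enumerate(totals) if t >= 0)
    best_end = max(range(len(totals)), key=lambda i: totals[i])
    return default_end, best_end

# Three matches, two mismatches, one match, two mismatches (costs of 1).
weights = [1, 1, 1, -1, -1, 1, -1, -1]
default_end, best_end = truncation_points(weights)
print(default_end, best_end)
```

With these weights the aggregate peaks after the third instruction, so the -b rule would cut the clone there, while the default rule keeps extending to the last instruction whose aggregate is still non-negative.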

If the -v option is specified then jcd reports each class file processed, upon successful completion of that processing. The output will distinguish between class files explicitly processed as per the instructions provided to jcd, and those class files implicitly processed as a consequence of having earlier been cited in previously read pcode. In addition, any internal compiles performed will be reported on the standard error output.

The -r option indicates that the clone detection output is not to report the corresponding lines of source code that match the pcode examined. This option will reduce run time significantly, since Java source code will not be repeatedly scanned to find the lines that must be printed. Note that class files not compiled with the '-g' option contain no debug information, hence no line number information, and thus do not permit recovery of this Java source code information.

The -R option indicates that the line by line comparison of pcode is not required in the output. This will substantially reduce the size of the clone detection output, at the expense of considerable loss of information.

The -S option indicates that jcd need not compile any java file for which the corresponding class file already exists, whenever this corresponding class file has a later creation date than the java source file.
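
The check implied by -S resembles the usual "is the output newer than its source" test. The sketch below is an assumption about how such a test could look, not jcd's code: it uses file modification times as a stand-in for the creation date mentioned above, and needs_compile is a hypothetical name.

```python
import os
import tempfile

def needs_compile(java_path, class_path):
    """True unless a class file exists that is newer than its source."""
    if not os.path.exists(class_path):
        return True
    # Modification time is used here in place of the creation date.
    return os.path.getmtime(class_path) <= os.path.getmtime(java_path)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "Demo.java")
    cls = os.path.join(tmp, "Demo.class")
    open(src, "w").close()
    missing_class = needs_compile(src, cls)   # no class file yet
    open(cls, "w").close()
    os.utime(src, (1000, 1000))
    os.utime(cls, (2000, 2000))
    class_newer = needs_compile(src, cls)     # class file is later
    os.utime(src, (3000, 3000))
    source_newer = needs_compile(src, cls)    # source edited afterwards

print(missing_class, class_newer, source_newer)
```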

The -T option indicates the number of top clones to be reported. These are the longest clones detected.

The -C option specifies how many columns the top clones are to be reported in.

The -t option specifies a file to write TA output to, showing graphically clones and their relationships to the source containing them, as well as to clones they pair with. This output is viewable using lsedit.

The -x option indicates that high-level commentary describing the action of each pcode instruction seen should be shown, either in conjunction with this pcode or as an alternative to it. This option substantially increases the output, but may be useful for those who wish to grasp what the pcode instructions are doing and are not already familiar with the stack-based pcode architecture.

Clone detection algorithm

The software begins by reading into memory the machine instructions present in each specified java class file. These instructions are stored within arrays and a certain amount of instruction manipulation is performed to logically add labels to the instruction set, and pseudo instructions that represent the implications of try/catch blocks.

Every pcode instruction is hashed so that alternative candidate pcode instructions that match a given instruction can be reduced to a manageable number, helping to reduce run time from O(n^2) to O(n).
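
The effect of hashing can be sketched by grouping instruction positions into buckets, so that only instructions sharing a bucket are ever compared. This is an illustration under simplifying assumptions, not jcd's data structure: instructions are plain strings, a Python dictionary stands in for the hash table, and candidate_pairs is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(instructions):
    """Return index pairs of identical instructions, found via buckets."""
    buckets = defaultdict(list)
    for index, instruction in enumerate(instructions):
        buckets[instruction].append(index)  # dict lookup hashes the key
    pairs = []
    for indices in buckets.values():
        # Only positions within the same bucket are paired.
        pairs.extend(combinations(indices, 2))
    return sorted(pairs)

code = ["iload_1", "iload_2", "iadd", "istore_3",
        "iload_1", "iload_2", "imul"]
pairs = candidate_pairs(code)
print(pairs)
```

Instructions that occur only once never appear in a pair, which is where the saving over comparing every instruction with every other one comes from.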

Starting with two distinct pcode instructions established as matching, the clone detection algorithm examines all pairings of subsequent pcode instructions seeking a further match. A greedy algorithm is used to locate the next matching pair of pcode instructions. Specifically, for incrementing numbers of skipped pcode instructions, all pairs of pcode instructions which involve precisely this number of skipped pcode instructions are searched for a match.
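
The search order described above can be sketched as follows. This is a minimal illustration rather than jcd's implementation: the two pcode sequences are lists of strings, the max_skip bound and the order in which a given skip count is split between the two sides are assumptions, and next_match is a hypothetical name.

```python
def next_match(a, b, i, j, max_skip=8):
    """Find the next matching pair after the matched positions (i, j).

    Tries 0 skipped instructions first, then 1, then 2, and so on; for
    each total it tries every split of the skips between the two sides.
    Returns the matching index pair, or None if none is found.
    """
    for skipped in range(max_skip + 1):
        for skip_a in range(skipped + 1):
            skip_b = skipped - skip_a
            ia, jb = i + 1 + skip_a, j + 1 + skip_b
            if ia < len(a) and jb < len(b) and a[ia] == b[jb]:
                return ia, jb
    return None

a = ["iload", "iadd", "istore", "return"]
b = ["iload", "nop", "iadd", "istore", "return"]
first = next_match(a, b, 0, 0)   # must skip the nop in b
second = next_match(a, b, 1, 2)  # adjacent instructions match
last = next_match(a, b, 3, 4)    # nothing left to match
print(first, second, last)
```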

Every matched pair of pcode instructions is assigned a positive match weight, and every prior unmatched pcode instruction a negative mismatch weight. The pairing of pcode instructions performed as part of clone detection terminates when no further pcode instructions exist to be matched within the method(s) being compared, or when the cumulative weight of all the pcode instructions observed as part of clone detection becomes negative. At this point the clone detection discards from the clone any final pcode instructions in the pcode sequence that are not followed by a matching pcode instruction.

Thus if the matching and mismatch weights are the same value, clone detection terminates when more mismatched pcode instructions have been seen within the two sequences than matched ones.
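
The termination rule can be worked through on a short sequence. The sketch below is an assumption-laden illustration, not jcd's code: a clone in progress is reduced to a list of booleans (True for a matched pair, False for an unmatched instruction), the same and mismatch parameters mirror the -s and -m costs, and clone_extent is a hypothetical name.

```python
def clone_extent(events, same=1, mismatch=1):
    """Return how many leading events remain in the clone.

    Stops when the cumulative weight becomes negative, then discards
    trailing unmatched instructions not followed by a match.
    """
    total = 0
    kept = []
    for matched in events:
        total += same if matched else -mismatch
        if total < 0:
            break
        kept.append(matched)
    while kept and not kept[-1]:
        kept.pop()
    return len(kept)

events = [True, True, False, False, False, True]
equal_costs = clone_extent(events)            # same = mismatch = 1
forgiving = clone_extent(events, same=3)      # matches weigh more
print(equal_costs, forgiving)
```

With equal costs the third mismatch drives the cumulative weight below zero, and the two trailing mismatches are trimmed, leaving only the first two matches. With a larger same cost the early matches outweigh the mismatches and the whole sequence is retained.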

Output

This program outputs html that can be viewed in any suitable web browser. The output shows sections of pcode that are similar or identical to other sections of pcode. If the class files have been compiled with the '-g' option, similarity of local variables is predicated on their given names. Otherwise it is currently predicated on the position of these local variables on the stack.

If the -h option is used, a three-tier html reporting structure is employed, with the root of this structure being the file 'index.html' within the indicated directory. The top tier hyperlinks to a distinct page for each class file examined. Each such second-tier page describes the clones found in that class file that match later pcode, either in the same class file or in later class files examined. The third tier shows, for each identified potential clone, the actual pcode and source code that justified reporting this clone. Note that this option can potentially produce a huge number of small viewable files.

Exceptions

If an exception or interrupt is raised, such as one generated by keyboard input of Ctrl-C, this program stops its current efforts to detect clones and, after reporting the observed exception, terminates in a fashion that permits the results generated so far to be viewed.

Example usage

On Windows one might invoke the command:

jcd.exe -v -S -c /swag/lsedit/7.3.12 -j "/Program Files/Java/jdk1.6.0_12/bin/javac.exe" -d -u 256 -l 24 -L 24 -S -T 2048 -h /temp/lsedit c:/swag/lsedit/7.3.12/lsedit/LandscapeEditorFrame.java

while on Unix one might invoke the command:

./jcd -v -S -c /home/ijdavis -j /usr/java/jdk1.6.0_10/bin/javac -d -u 256 -l 24 -L 24 -S -T 2048 -h /home/ijdavis/public_html/lsedit ../../lsedit/LandscapeEditorFrame.java

See also:

acd, javap2, javex

Developed by

  • Ian Davis.

Location

  • swag:~ijdavis/src/jcd -- source code

Supported platforms

Any that the source can be compiled on.

Caveats

  • Subject to change; still under development.

Contact information

For more information on jcd please contact us at .