AA Clusterer

The AA cluster is provided with a set of zero or more values (strings) associated with each node to be considered for clustering. These values are generated by the rules provided to LSEdit when AA is invoked, and can be viewed as constructed in the input file submitted to AA. Each line of this input file has the format <node id> <value>*.

Clustering of the different nodes identified by id, is achieved by grouping nodes whose similarity metric when applied to the collection of values associated with a node is high.

Similarity Metrics

Similarity metrics award nodes having more similarity in the values associated with these nodes a higher metric. Nodes are more likely to be clustered together if they share a high similarity metric. In all cases division by 0 produces a similarity metric of 0.

Jaccard

For two sets A and B this metric is |A&B|/(|A&B| + |A&~B| + |B&~A|). That is the number of items in common divided by the items in common plus the items in A not in B, plus the items in B not in A.

SimpleMatching

For two sets A and B this metric is (|A&B| + |~A&~B|)/(|A&B| + |~A&~B| + |A&~B| + |B&~A|). That is the number of items in common plus those items in neither, divided by the items in common plus those items in neither plus the items in A not in B, plus the items in B not in A.

Sorensen-Dice

For two sets A and B this metric is 2*|A&B|/(2*|A&B| + |A&~B| + |B&~A|). That is twice the number of items in common divided by twice the items in common plus the items in A not in B, plus the items in B not in A.

Algorithms

All algorithms compute similarity directly from the above similarity metric when both items being compared are leaf nodes. They differ only in how a similarity metric is computed when one or both of the items being compared are themselves binary trees representing clusters of two or more nodes. Without loss of generality assume the first item is in this case not a leaf node, reversing when necessary the two items being compared if it is.

SingleLinkage

The similarity is the minimum of the two children of the first item when compared to the second.

CompleteLinkage

The similarity is the maximum of the two children of the first item when compared to the second.

WAvg

The similarity is the the average of the similarities of the two children of the first item when compared to the second. If the first item compared is F1 and the second F2, let C1 be the first child of F1, and C2 the other child of F1. Then the similarity is (similarity(C1,F2) + similarity(C2,F2))/2.

UAvg

Let L = number of leaf nodes under the first child (C1) of F1, and R = the number of leaf nodes under second child (C2) of F1. Set L to 1 if zero and likewise R. The resulting similarity is (L*similarity(C1, F2)+R*similarity(C2,F2))/(L + R).