6 June, 2005 at 03:55 Leave a comment

Copy Paste Detector – Finding Duplicate Document Segments

The main idea here is to detect “similar” segments or paragraphs or snippets in a collection of documents. We can use it to detect duplicates and prune the collection. Or do it for hypertexting.
Amazingly, it has been done although this is for absolute string matching.  Have to see how we can re-use this for our own problems of immense duplication on Furl, Self-Email, Mozblog etc. Anyway Here is the original article –

Copy pasted code is usually bad. But it can be hard to find, especially in a large project. So we wrote a utility – CPD – to find it for us. It’s been through three major incarnations:

  • First we wrote it using a variant of Michael Wise’s Greedy String Tiling algorithm (our variant is described here)
  • Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform
  • Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm.

Each rewrite made it much faster, and now it can process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).

Here’s a screenshot of CPD after running on the JDK java.lang package.

Note that CPD works with Java, C, C++, and PHP code.

If you have Java Web Start, you can run CPD by clicking here.

Here are the duplicates CPD found in the JDK 1.4 source code.

Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).

Andy Glover wrote an Ant task for CPD; here’s how to use it:

<target name="cpd">
<taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
<cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
<fileset dir="/home/tom/tmp/ant">
<include name="**/*.java"/>
</fileset>
</cpd>
</target>

There’s a ignoreLiterals="true" option which makes CPD ignore literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run PMD with this option off to start with and then switch it on to see what it turns up. There’s also a ignoreIdentifiers="true" option that does the same thing with identifiers; i.e., variable names, methods names, and so forth. The same guidelines apply.

Also, you can get verbose output from this task by running ant with the -v flag; i.e., ant -v -f mybuildfile.xml cpd.

There’s also a JavaSpaces version available for splitting the CPD effort across a farm of machines. I usually post news on that here and the releases are here. This project is pretty much dead, though, since the current code is fast enough to just run it on one machine.

Advertisements

Entry filed under: Uncategorized.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


Calendar

June 2005
M T W T F S S
« May   Jul »
 12345
6789101112
13141516171819
20212223242526
27282930  

Tweets


%d bloggers like this: