12 May, 2005 at 15:20

Mozilla Thunderbird Duplicate Remover

two of each

[He] held one finger up directly in front of Yossarian and demanded, “How many fingers do you see?” “Two”, said Yossarian. “How many fingers do you see now?” asked the doctor, holding up two. “Two”, said Yossarian. “And how many now?” asked the doctor, holding up none. “Two”, said Yossarian. The doctor’s face wreathed with a smile. “By Jove, he’s right,” he declared jubilantly. “He does see everything twice.”

In Catch-22, by Joseph Heller.

Just when I thought I would have to do something about the duplicates in my mail account, I found this. Boy, these Firefox people are some smart guys, judging by the number and quality of the extensions that are out there. Methinks Firefox's popularity owes less to the browser itself than to the extensions; they are what give it its appeal. Why, one can do almost everything inside Firefox. It just does not have a good blogging tool like [Mozblog]; otherwise, I would gladly be using it instead of Mozilla. In any case, here is the story:

– For some mysterious reason, Thunderbird occasionally downloads my emails twice from the POP server. I think it may be related to the fact that I use two email clients (home and office) to pull emails from the server. But I also remember seeing this happen in Outlook in a previous life, so maybe it is the server’s fault rather than Thunderbird’s. Anyway, I decided to create an extension for Thunderbird that would delete duplicate emails automatically: something that could be triggered from the context menu of a folder or account (or, why not, as soon as an email arrives). The approach goes like this.

The program iterates over every email in a folder or account and computes a hash for each one, using some function such as MD5. It then stores the hash in a map, associating it with the email (or rather, with a reference to the email; we don’t want to hold the contents of every email in memory). If the map already contains an entry for that hash, we are probably facing a duplicate. The program then compares the contents of both emails to rule out the chance that they share a hash by coincidence (very unlikely, but possible). So far so good. After a few cycles of writing and testing the JavaScript code (writing Mozilla extensions is messy, I tell you), I got to a point where it could detect duplicate emails (no deletion yet).
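The approach described above could be sketched roughly as follows. This is only an illustration, not the extension's actual code: the `{ id, body }` message shape and the names `findDuplicates` and `fnv1a` are assumptions made for the example, and a small FNV-1a-style hash stands in for MD5 so the sketch runs without a library. For simplicity the bodies sit in memory here, whereas the extension would keep references and re-read the contents only when two hashes match.

```javascript
// Sketch only: `messages` is a hypothetical array of { id, body } objects,
// and `hashFn` is any string -> string hash (the post uses MD5).
function findDuplicates(messages, hashFn) {
  const seen = new Map(); // hash -> messages already seen with that hash
  const duplicates = [];

  for (const msg of messages) {
    const key = hashFn(msg.body);
    const candidates = seen.get(key);
    if (candidates === undefined) {
      seen.set(key, [msg]);
      continue;
    }
    // Same hash: compare contents to rule out a coincidental collision.
    const original = candidates.find((c) => c.body === msg.body);
    if (original) {
      duplicates.push({ duplicate: msg.id, original: original.id });
    } else {
      candidates.push(msg); // genuine collision, so keep both messages
    }
  }
  return duplicates;
}

// Stand-in hash (FNV-1a style over UTF-16 code units), used here in place
// of MD5 purely so the example is self-contained.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}
```

Because the contents are compared whenever two hashes match, the result does not depend on how good the hash is; a weaker hash only means more comparisons.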

To my surprise, when I ran it, Thunderbird locked up for a few seconds. I eventually discovered what the problem was, and I’ll get to it after I describe how the program is structured. There’s a main function, deleteDuplicateMessages(), that is invoked when the menu item is clicked. This function iterates over the set of message headers (Thunderbird objects that contain metadata about each email) and, for each one, invokes an asynchronous method to stream the contents of the email. That method takes a callback object as an argument. I believe they had to implement it like that to support IMAP, where getting the contents of an email could take a while. The callback object gathers all the pieces of the email as it is streamed, strips out the headers and computes the MD5 hash. It then checks whether an entry for that hash already exists in the hash map and exits.
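The callback object described above might look something like this. All names here (`makeBodyCollector`, `onData`, `onDone`, `fakeStream`) are hypothetical and chosen for the example; this is not Thunderbird's actual streaming interface. The fake streamer simply feeds the raw text to the listener in small pieces, the way a real streaming call would deliver it over time.

```javascript
// Hypothetical names throughout; this is not Thunderbird's streaming API.
// Builds a listener that gathers streamed pieces and hands the body
// (the message with its headers stripped) to `onBody` once complete.
function makeBodyCollector(onBody) {
  const chunks = [];
  return {
    onData(chunk) {
      chunks.push(chunk);
    },
    onDone() {
      const raw = chunks.join("");
      // Headers end at the first blank line (CRLF or bare LF endings).
      const match = /\r?\n\r?\n/.exec(raw);
      const body = match ? raw.slice(match.index + match[0].length) : "";
      onBody(body);
    },
  };
}

// Stand-in for the asynchronous streaming call: delivers the raw message
// in fixed-size pieces, then signals completion.
function fakeStream(rawMessage, listener, chunkSize = 16) {
  for (let i = 0; i < rawMessage.length; i += chunkSize) {
    listener.onData(rawMessage.slice(i, i + chunkSize));
  }
  listener.onDone();
}
```

The pieces are joined before the headers are stripped, so it does not matter if the blank line separating headers from body happens to be split across two chunks. The `onBody` callback is where the hash would be computed and looked up in the map.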

The reason for the lockup is two-fold. First, there’s the MD5 library I’m using, which takes a few seconds to compute the hash of really long strings. Second, JavaScript code in Mozilla seems to execute in the UI thread, even when it is a callback function invoked by a native component. So the computation of the MD5 hash runs in the UI thread and prevents Thunderbird from responding to user events. I’ve been trying to find out how to make that callback run in a separate thread, without much luck. An alternative would be to write the callback object in C++ as an XPCOM component, but I wanted to make the extension as portable as possible, and a native XPCOM component would have to be compiled for every platform I want the extension to support. Too bad. The quest goes on.

1 Comment:

Anonymous said…

Or just keep a hash of the Message-ID field, which is unique for each message. In your case especially, where you are downloading the exact same message multiple times, this should be enough. It should also work in cases where you’re on the recipient list multiple times, such as a reply to a mailing list message on which you were also cc’d personally. The advantage of using the Message-ID field is that it will be the same for those two messages, even though other headers (“Received”, etc.) may be different (and thus your MD5s would be different). It should also be a lot faster!
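The commenter's suggestion could be sketched like this, assuming the header block of each message is available as plain text. The function names `getMessageId` and `findDuplicatesById` are made up for the example, and since the Message-ID values are short, the sketch keeps them in a set directly rather than hashing them first.

```javascript
// Sketch: `rawHeaders` is the header block of one message as plain text.
function getMessageId(rawHeaders) {
  // Unfold continuation lines (a line break followed by a space or tab).
  const unfolded = rawHeaders.replace(/\r?\n[ \t]+/g, " ");
  // Header names are case-insensitive, hence the `i` flag.
  const match = /^Message-ID:[ \t]*(.*)$/im.exec(unfolded);
  return match ? match[1].trim() : null;
}

// Returns the indexes of messages whose Message-ID was already seen.
function findDuplicatesById(headerBlocks) {
  const seen = new Set();
  const duplicateIndexes = [];
  headerBlocks.forEach((raw, index) => {
    const id = getMessageId(raw);
    if (id === null) return; // no Message-ID: leave the message alone
    if (seen.has(id)) duplicateIndexes.push(index);
    else seen.add(id);
  });
  return duplicateIndexes;
}
```

Messages without a Message-ID header are skipped rather than treated as duplicates of each other, which seems the safer choice when the end goal is deletion.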


Entry filed under: Uncategorized.
