Writing GDS Plugins

27 June, 2005 at 11:24

Google Desktop Search (GDS) is a tool created by Google that indexes the files on your (Microsoft Windows-based) computer and then lets you search them. It indexes files written to disk (text files, web pages, media files, etc.), email, instant messages, and the web pages and media files you visit on the Web. GDS adds a deskbar to the toolbar, which enables quick searching on criteria you specify. It returns search results by directing your default browser to a web server running on your own machine. The browser-based search interface has an obviously Googlish look and feel. Interestingly, if you have GDS installed, you will have a Desktop search option (in addition to Web, Images, Groups, News, Froogle, and Local) when you visit Google. When you perform a search on the main Google page, GDS matches for that search may also show up as the first search result, in the form of “results stored on your computer”.

As cool as this is, an even cooler aspect of GDS is that it is an extensible framework. Google has released an SDK so developers can write plugins for GDS. One such plugin is Kongulo, a web spider. Kongulo provides a command-line interface to crawl, starting at a specified URL, and index the resources it finds there within GDS. Command-line options include depth, URL match, loop, sleep time between loops, and passwords. Kongulo can be a useful tool for indexing intranets or private wikis … or to see an example of a good plugin written for GDS.

1) First things first. How does a plugin tie into GDS? The answer is the Component Object Model, or COM. As I mentioned above, GDS is an application for MS Windows systems. On Friday, May 27, 2005, Google released the source code for Kongulo. Here is the meat of how Kongulo pushes spidered web pages to GDS. (The pieces of the code that pertain specifically to spidering are interesting, but this article won’t detail that aspect of Kongulo.)

2) First, Kongulo creates an event factory object attached to the ‘crawler’ object, like this:
self.event_factory = win32com.client.Dispatch('GoogleDesktopSearch.EventFactory')
Note that Kongulo uses the win32com libraries, so if you plan on running the source code, install the Win32 extensions for Python or use the ActiveState Python distribution.
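
As a minimal sketch of this step (not Kongulo's own code), the factory lookup can be wrapped in a small function. The `dispatch` parameter is a hypothetical injection point added here so the function can be exercised on a machine without pywin32 or GDS; the ProgID string is the one quoted above.

```python
GDS_EVENT_FACTORY_PROGID = 'GoogleDesktopSearch.EventFactory'


def create_event_factory(dispatch=None):
    """Return the GDS event factory COM object.

    `dispatch` is a hypothetical hook for this sketch: pass any callable
    that takes a ProgID. When omitted, win32com.client.Dispatch is used.
    """
    if dispatch is None:
        # Imported lazily so this module still loads where pywin32 is absent.
        import win32com.client
        dispatch = win32com.client.Dispatch
    return dispatch(GDS_EVENT_FACTORY_PROGID)
```

On a Windows machine with GDS and the Win32 extensions installed, calling `create_event_factory()` with no arguments should be equivalent to the line Kongulo uses.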

3) Next, every time Kongulo wants GDS to index a page, it has to create an event from the event factory like this:
event = self.event_factory.CreateEvent(_GUID, 'Google.Desktop.WebPage')
The first argument the crawler passes into ‘CreateEvent’ is the ‘guid’ that Kongulo registers for itself the first time it runs. The second argument is a text string containing the fully qualified name of the type of event. Kongulo only uses ‘Google.Desktop.WebPage’, but other options include ‘Google.Desktop.Indexable’ (which is the parent of all of the following indexable resources), ‘Google.Desktop.Email’, ‘Google.Desktop.IM’, ‘Google.Desktop.File’, and ‘Google.Desktop.MediaFile’.

4) The next steps entail adding properties. The ‘event’ object has an ‘AddProperty’ method that takes two arguments: a property name and a property value. The crawler adds the following four properties to all pages it finds:
event.AddProperty('format', doctype)
event.AddProperty('content', content)
event.AddProperty('uri', url)
event.AddProperty('last_modified_time', pywintypes.Time(time.time() + time.timezone))
‘doctype’ is the document type, pulled from the HTTP headers. Kongulo will only index documents of the type ‘text/html’ or ‘text/plain’.
‘content’ is the body of the web page.
‘uri’ is the web location of the resource, and
‘last_modified_time’ is actually the current local time, but there is a note in the source code to use the ‘last-modified’ HTTP header instead.
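
The following is a rough sketch, not taken from Kongulo, of how the four property values might be assembled from an HTTP response. It applies the ‘text/html’ / ‘text/plain’ filter and follows the source-code note by preferring the ‘last-modified’ header over the current time. The function name, the `headers` dict with lowercase keys, and the `now` parameter are all assumptions made for illustration; the timestamp is returned as epoch seconds and would still need converting (for example with pywintypes.Time) before being handed to GDS.

```python
import time
from email.utils import parsedate_to_datetime

INDEXABLE_TYPES = ('text/html', 'text/plain')


def page_properties(url, headers, body, now=None):
    """Build the four per-page properties, or return None for a skipped type.

    `headers` is assumed to be a dict keyed by lowercase header names.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    doctype = headers.get('content-type', '').split(';')[0].strip().lower()
    if doctype not in INDEXABLE_TYPES:
        return None

    # Prefer the server's Last-Modified header; fall back to the current time.
    modified = time.time() if now is None else now
    raw = headers.get('last-modified')
    if raw:
        try:
            modified = parsedate_to_datetime(raw).timestamp()
        except (TypeError, ValueError):
            pass  # Unparseable header: keep the fallback value.

    return {
        'format': doctype,
        'content': body,
        'uri': url,
        'last_modified_time': modified,  # epoch seconds, not yet a COM date
    }
```

Returning None for unsupported types keeps the filtering decision in one place, so the caller only has to check the result before creating an event.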

5) The crawler adds the following property for HTML pages that contain a title:
event.AddProperty('title', title)
Interestingly, Kongulo uses regular expressions to find titles, frames, and links, as opposed to using an HTML parser. The Kongulo team felt this would make the processing of web pages less strict.
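
To illustrate the regular-expression approach, here is a small sketch of title extraction. The pattern is my own guess at something suitably forgiving, not the expression Kongulo actually uses, and `extract_title` is a hypothetical name.

```python
import re

# Assumed pattern (not Kongulo's): tolerate attributes, mixed case, newlines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace, including line breaks inside the tag.
    title = ' '.join(match.group(1).split())
    return title or None
```

A regular expression like this will still find a title in a page with unclosed or badly nested tags elsewhere, which is the kind of leniency a strict parser may not offer.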

6) The final step is to send the page to GDS, like this:
event.Send(0x01)
‘Send’ expects a bitwise OR of the following values:
EventFlagIndexable = 0x00000001 (indicates an event that GDS should index)
EventFlagHistorical = 0x00000010 (indicates a historical event as opposed to an event that is currently happening in realtime)
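
As a quick sketch of how the two values combine, the helper below builds the argument for ‘Send’. The constant values are the ones listed above; the Python names and the `send_flags` helper are hypothetical.

```python
EVENT_FLAG_INDEXABLE = 0x00000001
EVENT_FLAG_HISTORICAL = 0x00000010


def send_flags(historical=False):
    """Return the bitwise OR of the flags to pass to Send."""
    flags = EVENT_FLAG_INDEXABLE
    if historical:
        flags |= EVENT_FLAG_HISTORICAL
    return flags
```

With these values, `send_flags()` gives 0x01, matching the call shown in step 6, and `send_flags(historical=True)` gives 0x11.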

7) The Kongulo source code indicates that if the crawler passes in the historical flag, GDS will not process the event until the user’s system becomes idle. At this point, GDS has the web page and it is available for searching. That’s all there is to it.
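
Putting the steps together, the whole exchange might look like the sketch below. ‘CreateEvent’, ‘AddProperty’, and ‘Send’ are the calls described in this article; `send_page` and the two recording classes are hypothetical stand-ins I have added so the flow can be tried without GDS or COM, and the GUID in the usage note is a placeholder, not a registered plugin ID.

```python
def send_page(event_factory, guid, properties, flags=0x01):
    """Create a WebPage event, attach each property, and send it."""
    event = event_factory.CreateEvent(guid, 'Google.Desktop.WebPage')
    for name, value in properties.items():
        event.AddProperty(name, value)
    event.Send(flags)
    return event


class RecordingEvent:
    """Hypothetical stand-in for a GDS event; records calls, indexes nothing."""

    def __init__(self, guid, schema):
        self.guid = guid
        self.schema = schema
        self.properties = {}
        self.sent_flags = None

    def AddProperty(self, name, value):
        self.properties[name] = value

    def Send(self, flags):
        self.sent_flags = flags


class RecordingFactory:
    """Hypothetical stand-in for the GDS event factory."""

    def __init__(self):
        self.events = []

    def CreateEvent(self, guid, schema):
        event = RecordingEvent(guid, schema)
        self.events.append(event)
        return event
```

For example, `send_page(RecordingFactory(), '{00000000-0000-0000-0000-000000000000}', {'uri': 'http://example.com/', 'format': 'text/html'})` returns an event whose recorded flags are 0x01. Against a real installation, the factory would instead be the COM object obtained in step 2.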

The GDS team has done an excellent job of providing a great tool that is easy to extend. The more I play with GDS, the more it impresses me. Of course, I would play with it more if it ran on Linux (hint, hint). Likewise, the Kongulo team has done an excellent job of providing a useful plugin to GDS, but more importantly, of providing clean, readable source code (being written in Python doesn’t hurt its readability) to serve as an example of how to write a plugin for GDS. While there are plenty of plugins already available for GDS, this ease of creating a plugin makes me expect many more in the future.

Jeremy Jones is a script monkey who works for The Weather Channel as a software quality assurance engineer.
