Real-time Search – Query People – Hybridosphere

6 December, 2009 at 14:40

John Battelle, who thinks about the intersection of search, media, technology and more, wrote a nice post, “From Static to Real-time Search”, in 2008. It is a blurb of the original “Shifting Search from Static to Real-time” in the LookSmart Thought Leadership Series. The most interesting part of the article(s) was not the article(s) themselves (in hindsight, of course, because we are in 2009 and most of the vision has panned out) but the really insightful comments left behind, which came into being after people read, processed and took the pains to express their thoughts in bits and bytes. This human activity made me ponder whether what we should really be talking about is a shifting of search or a shifting of the web. I am betting on the latter because, with the advent of Web 2.0, user generated content, social networking, micro-blogging and the unignorable “giant global graph” article by the so-called father of the WWW, it seems obvious to me that the Internet is gradually transitioning from computer-centric to document-centric to people-centric to object-centric. We are currently in the people-centric phase, with the blogosphere, twittosphere, histosphere[3] et al. holding bright candles to the sheer amount of user generated content pouring out of hordes of masses, even if the signal-to-noise ratio is getting lower by the day. Holding that thought, and as evidenced by the outburst of companies like Collecta (and its kin like OneRiot), Infoaxe (and its ilk like Wowd), Aardvark (and its type like Mahalo), BlueOrganizer/GetGlue (and its comrades like Microsoft Live Labs’ Pivot), DotSpots (and its siblings like ReframeIt) et al., I can only say that ‘querying the people-centric web’, which is commonly, if mistakenly, confused with “real-time search”, is not entirely unfamiliar vis-a-vis ‘querying the document-centric web’ (or “static/reference/archive search”). Earlier, I was drowned by millions of documents; today, I am inundated by millions of people chattering away. A quick doodle…

To illustrate more clearly, let us have a peek into what Collecta (a self-proclaimed ‘real-time search engine’) is doing. It primarily scours blogs, tweets, comments and media from the social media landscape and exposes a simple search box on top of the index, with results scrolling in on every tick of time. While Collecta still does not crawl other activity streams like Delicious, Evernote, Hooeey, Digg etc. (due to lack of API or traction or both), I was very quickly inundated by opinions of people talking about something that I was interested in knowing, although what I thought was happening in the background was a very simple keyword play using data stream algorithms. In other words, I was firing a query and, figuratively, getting people and their noisy thoughts, not documents per se, as results. I was pretty amused (even the UI is cute) but not impressed. Quite. There should be a semantic-people-web out there[2]. As for the object-centric web (or 3.0 or 4.0) of the future, someone has to invent it, but there are some chains of thought on what searching on that web will look like. I have some ideas too (hey, they’re free) but let us not get ahead of ourselves; let us stick to the real-time search topic and come back to the Battelle article, shall we? Till then, here are some choice quotes from the article(s) and the available commentary. It is at best incomplete because a lot of debate must have ensued, ideas spawned, hundreds of blog posts (and tweets and comments) written, yada yada yada, downstream, but we will never know about them ALL even if citations and trackbacks are supposed to be transitive. If B mentions A and C mentions B, then A should carry over to C (or the commentary should attach to A). This is what I want to work on using the multitude of available APIs (ThoughtReactions seems to be a good name for such an endeavour, no?) but that is a project for the proverbial another day which never ever seems to dawn. Am rambling again. Before the commentary, a quick sketch of that ‘keyword play’ conjecture…
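This is a minimal sketch, in Python, of what I imagine Collecta-style matching to be: standing queries matched against items the moment they arrive on the stream, with no crawl and no ranking. The stream format, field names and matching logic here are my own guesses, not anything Collecta has published.

```python
import re
from collections import deque

class RealTimeKeywordIndex:
    """A toy 'real-time search engine': no crawl, no ranking; standing
    queries are matched against items the moment they arrive on the stream."""

    def __init__(self, max_results=50):
        self.queries = {}                          # query_id -> compiled regex
        self.results = deque(maxlen=max_results)   # rolling window of recent hits

    def register_query(self, query_id, keywords):
        # One standing query = one word-bounded alternation of its keywords.
        pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"
        self.queries[query_id] = re.compile(pattern, re.IGNORECASE)

    def on_stream_item(self, item):
        # Assumed item shape: {"author": ..., "text": ..., "ts": ...}
        for query_id, pattern in self.queries.items():
            if pattern.search(item["text"]):
                self.results.append((query_id, item))
                yield query_id, item               # one 'tick' of the scrolling UI

# A query figuratively returns people and their chatter, not documents.
index = RealTimeKeywordIndex()
index.register_query("q1", ["Canon EOS", "EOS 5D"])
stream = [
    {"author": "@shutterbug", "text": "The Canon EOS 5D autofocus is superb", "ts": 1},
    {"author": "@baker", "text": "Fresh sourdough out of the oven", "ts": 2},
]
for item in stream:
    for qid, hit in index.on_stream_item(item):
        print(qid, hit["author"], hit["text"])
```

Out with the commentary…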

Google was the ultimate interface for stuff that had already been said – a while ago. When you queried Google, you got the popular wisdom – but only after it was uttered, edited into HTML format, published on the web, and then crawled and stored by Google’s technology. It’s inarguable that the web is shifting into a new time axis. Blogging was the first real indication of this. Technorati tried to be the search engine for the “live web” but failed[1]. Twitter can succeed because it is quickly gaining critical mass as a conversation hub. But there is ambient data more broadly, in particular as described by John Markoff’s article (posted here). All of us are creating fountains of ambient data, from our phones, our web surfing, our offline purchasing, our interactions with tollbooths, you name it. Combine that ambient data (the imprint we leave on the digital world from our actions) with declarative data (what we proactively say we are doing right now) and you’ve got a major, delicious, wonderful, massive search problem, er, opportunity.

Let’s say you are in the market to buy something – anything. You get a list of top pages for “Canon EOS”, and you are off on a major research project. Imagine a service that feels just like Google but, instead of gathering static-web results, it gathers live-web results – what people are saying, right now, about “Canon EOS”. And/or, you could post your query to that engine and get real-time results that were created – by other humans – directly in response to you. Add in your social graph (what your friends, and your friends’ friends, are saying), far more sophisticated algorithms and a critical mass of data – and those results could be truly game-changing. OneRiot just launched, and I believe we’re tackling a piece of the problem by finding the pulse of the web – the content people are talking about today – by having over 2 million people share their activity data, processing it in real-time to create the first real-time index: the web as it is today, now, tackling news first, followed by videos and products next.

How much journalism these days is spotting patterns from the real-time web? How much is mining the static web? There is another form of journalism, which involves spending time in the real world, but it may be falling out of fashion. I’m not sure that there’s a huge great wobbly lump of wondermoney sitting at the end of the real-time web search rainbow. And if there is, I wonder if it’s much bigger than the one sitting a day further down the line, where the massive outpouring of us auto-digitising hominids has been filtered by the mechanisms we have, more or less, in place now. Google’s big problem isn’t that it can’t be Google a day earlier, it’s that it can’t be cleverer about imparting meaning to what it filters. For now, and until AI gets a lot better, the new worth of the Web is how we humans organise, rank and connect it. The good stuff takes time and thought, and so far nobody’s built an XML-compliant thought accelerator – Rupert Goodwins

Do you think live feeds will be treated similarly to how newswires were 30 years ago – considered a pay-for service? You’ve described how Twitter could start making money (via its search) and made me think of the possibility of Google buying Twitter. How different from Twitter should Google’s indexing of Twitter be? Their blog search is dismal because they’re serving the good mixed in with the junk. Look at those Twitter results: I am wondering exactly what utility they actually bring. I mean, what value to the user? To be frank, I care less about what my friends think of the Canon EOS than about the opinions of professional photographers. In that regard, there needs to be some method for improving authority. My social graph is my social graph – it’s of dubious value to me for making buying decisions. All the same, great post, as it continues to generate lots of discussion in our office. The point you raise about what this feels like to users is especially dear to me – it’s one thing to bring back real-time results, and another thing entirely to present them in dynamic, useful ways.

I’m not all that concerned about which twit Twittered what in the last 24 hours, and I think that most of the people who are, are twits. For instance, if I were researching a camera or a car, I’d be interested in the best stuff written about it in the last year or so, not in the last five minutes. Sure, a public relations flack might want to keep track of bad things people say on Twitter so they can have their lawyers send them nastygrams, but for ordinary people, it’s just a waste of time. Entertaining, maybe, but a waste of time. Right on. It is not just real-time search; there’s a lot more that can cash in on this (and provide a great user experience in the process). There will also be a goodly sum of what Rupert calls “wondermoney” racing at lightspeed toward the bank account of the company that best provides the means to protect the privacy of hundreds of millions who have absolutely no need for, nor any desire to see, the dots of their every action and comment connected and delivered to “the matrix”.

This is definitely the next big thing in search. Your articulation of it is perfect. I say this because I experienced this same thing over the last several weeks when I created a new Twitter account for our new products and wanted to track what people are saying. A quick Twitter search was the answer, and a few replies later I had some conversations going and new followers as well. The benefits of the real-time web will far outweigh those of the archived web, at least for certain types of information. Journalism was the original search engine, albeit with a rather baroque query interface. Being a notoriously Darwinist entity, it tends to adopt the most efficient use of people and technology to produce good data, and it’s quite good at adapting quickly – it hasn’t taken long for blogs to make their mark. It’s a good thing to track if you want to sniff out utility on the Web – after all, journalism is the first draft of history.

Marketers would love that ambient data, but that is a backwards approach to search. I don’t see the usefulness of, or appetite for, people querying about what their friends are doing – especially when it’s already being delivered to them. You really need to see what’s going on in FriendFeed more to grok the real-time nature of the web. Look at my realtime feed for just a small taste – that’s 4,800 hand-picked people being displayed in real-time. So, I think evolution is the wrong word. Search is not evolving; what you are speaking of already exists and has existed. Perhaps the right word is “rediscovery”, or “mass public revelation”, or “adoption”, or something like that. The future was here 15 to 50 years ago. It just wasn’t (to quote the popular phrase) evenly distributed. So maybe all you’re saying is that this particular aspect of search, i.e. routing and filtering, or SDI, or whatever we may call it, is finally “growing” or “spreading”. But “growing” != “evolving”.

We are talking about “text filtering”, which sounds exactly like an idea that has been around for 40+ years. Here is a description of the problem from http://trec.nist.gov/pubs/trec11/papers/OVER.FILTERING.pdf: “a text filtering system sifts through a stream of incoming information to find documents relevant to a set of user needs represented by profiles. Unlike the traditional search query, user profiles are persistent, and tend to reflect a long term information need. With user feedback, the system can learn a better profile, and improve its performance over time. The TREC filtering track tries to simulate on-line time-critical text filtering applications, where the value of a document decays rapidly with time. This means that potentially relevant documents must be presented immediately to the user. There is no time to accumulate and rank a set of documents. Evaluation is based only on the quality of the retrieved set.” Filtering differs from search in that documents arrive sequentially over time. This overview paper was from 2002, but the TREC track itself goes back to the 90s, and the idea goes back even further. In fact, now that I think of it, I remember talking with a friend at Radio Free Europe (anyone else remember that?) in Prague back in 1995, and he was describing a newswire system they had that did this online, real-time filtering. So maybe there is a shift from static to real-time search in the public, consumer web. But there have been systems (and research) in other circles that have been doing this for a while.
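[An aside from me: the TREC description above is nearly pseudocode already. Below is a minimal sketch of such an adaptive filter in Python, using bag-of-words profiles, cosine similarity and a Rocchio-style feedback update; the threshold and weights are illustrative choices of mine, not TREC’s.]

```python
import math
from collections import Counter

class AdaptiveFilter:
    """TREC-style adaptive filtering: judge each arriving document at once
    against a persistent profile, then learn a better profile from feedback."""

    def __init__(self, seed_terms, threshold=0.2, alpha=1.0, beta=0.5, gamma=0.15):
        self.profile = Counter(seed_terms)   # the long-term information need
        self.threshold = threshold           # accept/reject cut-off
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    @staticmethod
    def _vec(text):
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def judge(self, text):
        # No accumulate-and-rank step: each document is judged on arrival.
        return self._cosine(self.profile, self._vec(text)) >= self.threshold

    def feedback(self, text, relevant):
        # Rocchio update: pull the profile toward relevant documents,
        # push it away from non-relevant ones.
        weight = self.beta if relevant else -self.gamma
        for term, count in self._vec(text).items():
            self.profile[term] = self.alpha * self.profile[term] + weight * count

# A persistent profile, unlike a one-shot query:
f = AdaptiveFilter(["canon", "eos", "camera"])
doc = "new canon eos firmware rumoured for next camera season"
if f.judge(doc):
    f.feedback(doc, relevant=True)   # the user said yes; the profile sharpens
```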

You may note that the link refers to a machine called ‘Memex’, which Vannevar Bush (one of the first visionaries of “automated” information storage and retrieval schemes) wrote about decades before Luhn wrote about SDI. But you could go back a couple of millennia, too – for example, the ancient Greeks argued whether words were real or ideal, representations or hoaxes for “actual observation” (and such disputation persisted throughout the Middle Ages [Occam’s Razor] to this very day [one of the most renowned philosophers of the 20th century, Ludwig Wittgenstein, probably immensely influenced the AI community without their even being “aware” of it]). The issue that such “gizmos” as SDI and/or AI in general cannot deal with is that the world keeps changing: change is the only constant. Everything is in flux – always! As it always has been, no? The ideas and technology for all search were around way before AltaVista, and then Google, popularized them.

[1] Technorati is a cautionary tale, but then, most blog search engines (Technorati, Icerocket, Tailrank) have not made an impact because the value of pure-play search is in doubt. No one wants to go to another search box when the triumvirate of Google, Wikipedia and the Browser Search Bar is around. Even Google is neglecting the area (cue: Google Blog Search sucks). Sad, really, because I feel that blogs empowered the first and therefore most impressionable, pioneering wave of the citizen journalism and democratization-of-media phenomenon (Podcasts, YouTube, Seesmic, Qik etc. followed), a promising and enticing field which got washed away while still raw by Twitter (which can still be seen as lazy blogging if one is really looking hard) and the search companies the statusosphere spawned (OneRiot, Topsy, Collecta, Scoopler, TweetMeme). Maybe it is the ‘path of least resistance’ or ‘journalism is not for everybody’ at play here, or just that something is missing, like, say, attention data, which can today be sucked from various places (e.g. the “implicit web”). Some blogosphere companies still exist and have survived, nay, thrived, because they were smart enough to change their technology, business and operational models, like Sphere (where I worked) and another promising company, Twingly (working on ideas such as ‘Channels’ and integrating with the rest of mainstream Web 2.0). Am not a betting person, but if life depended on it, I would predict a revenge of the hybridosphere (blogs plus history, status and trails) when the Twitter fad, just another phase of tripe (Facebook has 40 times more updates), cools down as well. We are already seeing it, because Twitter is becoming yet another ego-URL store and copy-cat social network where it is increasingly difficult to separate the genuine article from the millions of pretenders, spammers and, worst of all, marketeers.

[2] Between the extremes of organized mainstream professional media and the unstructured freestyle frivolous noise of jibber-jabber, there is a small yet significant band of the people-centric web which offers a truly multi-opinionated clairvoyance to the world. An analogy is ye faithful human eye, which can only see a very small portion of the electromagnetic spectrum. Sure, it would be nice to be able to see the ultraviolet and infrared frequencies, but the most interesting things happen in the visible band because it is so colourful and vibrant. There has to be an evolutionary benefit to the eye having settled into its current state. Getting out of the metaphor, this narrow band of semi-professional, passionate, implicit-explicit human generated content (call it the ‘hybridosphere’ if you like), if captured and processed intelligently, can be made to do some very magical and wonderful things (search, direct and indirect such as ‘related articles’, is just one of the many applications that can be built on top of this foundation; as proof, look at the crowd-powered news site Insttant and sentiment analysis companies like Clara and Infegy) for all stakeholders, but most of all for the general public, who just want to see the web as a collective of nice people living harmoniously in a wee global village, free from the shackles of big media, opening up a world of discovery from all parts of our little blue marble in the sky. It is a matter of time and effort (the luck would be to work on RSSCloud, ThoughtReactions, Histosphere[3] and other neologisms) before we see such WebFountain’ish hybrid companies (a data mix of blogs, status, history, conversation, bookmarks, attention, trails, media, objects etc.) claiming their rightful place in the Web 2.0 (or 3.0 or 4.0) pecking order, bringing to the fore badly needed innovation to excavate the people-centric web diamond mine. In my vision, searching in such a world looks figuratively like this…

This is inspired by a scene in “The Time Machine” (2002) where the protagonist, Alexander, encounters the Vox System in the early 21st century. The virtual assistant (played by Orlando Jones) is seen on a series of glass fibre screens, offering to help the hero using a “photonic memory core” linked to every database in the world, effectively making it a compendium of all human knowledge. Since this scene must have been thoroughly researched, it is safe to rip it off, and suffice it to say that an immersive search experience is one where the searcher is virtually forwarded to experts in the area who might have the answer he/she seeks. [edit: 20091214] Apparently, such a thing has been pondered before. Obvious, really. It is Battelle again, writing for the BingTweets Blog, “Decisions are Never Easy – So Far. Part-3”:

Normally, a 30-minute conversation is a whole lot better for any kind of complex question. What is it about a conversation? Why can we, in 30 minutes or less, boil down what otherwise might be a multi-day quest into an answer that addresses nearly all our concerns? And what might that process teach us about what the Web lacks today and might bring us tomorrow? The answer is at once simple and maddeningly complex. Our ability to communicate using language is the result of millions of years of physical and cultural evolution, capped off by 15-25 years of personal childhood and early adult experience. But it comes so naturally, we forget how extraordinary this simple act really is. I once asked Larry Page of Google what his dream search engine looked like. His answer: the Computer from Star Trek – an omnipresent, all-knowing machine with which you could converse. We’re a long way from that – and when we do get there, we’re bound to arrive with a fair amount of trepidation – after all, every major summer blockbuster seems to burst with the bad narrative of machines that out-think humans (Terminator, Battlestar Galactica, 2001: A Space Odyssey, The Matrix, I, Robot… you get the picture).

Allow me to wax a bit philosophical. While the search and Internet industry focus almost exclusively on leveraging technology to get to better answers, we might take another approach. Instead of scaling machines to the point where they can have a “human” conversation with us (a la Turing), perhaps instead (or, as well) we might leverage machines to help connect us to just the right human with whom we might have that conversation. Let me go back to my classic car question to explain – and this will take something of a leap of faith, in that it will require that we, as a collective culture, adapt to the web platform as a place where we’re perfectly comfortable having conversations with complete strangers. Imagine I have at my fingertips a service that allows me to ask a question about which classic car to buy and how, and that engine instantly connects me to an expert – or a range of experts that can be filtered by criteria I and others can choose (collective intelligence and feedback loops are integrated, naturally). Imagine Mahalo crossed with Aardvark and Squidoo, at Google and Facebook scale.
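How might that routing work? A minimal sketch, again under my own assumptions: each expert has a bag-of-topics profile, a question is matched by term overlap, and the community’s feedback loop nudges a reputation multiplier up or down. The names and scoring here are illustrative, not Aardvark’s (or anyone’s) actual method.

```python
from collections import Counter

class ExpertRouter:
    """Route a question to likely humans instead of documents: score each
    expert by topic overlap with the question, scaled by a reputation that
    the community's feedback loop adjusts over time."""

    def __init__(self):
        self.profiles = {}     # expert -> Counter of topic terms
        self.reputation = {}   # expert -> feedback-driven multiplier

    def add_expert(self, name, topics):
        self.profiles[name] = Counter(topics)
        self.reputation[name] = 1.0

    def route(self, question, k=3):
        terms = Counter(question.lower().split())
        scored = []
        for name, profile in self.profiles.items():
            overlap = sum(min(terms[t], profile[t]) for t in terms)
            scored.append((overlap * self.reputation[name], name))
        return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]

    def feedback(self, name, helpful):
        # Collective intelligence: a good answer raises the expert's odds of
        # being routed the next, similar question; a bad one lowers them.
        self.reputation[name] *= 1.1 if helpful else 0.9

router = ExpertRouter()
router.add_expert("vintage_vic", ["classic", "car", "restoration", "mustang"])
router.add_expert("lens_lisa", ["canon", "eos", "lens", "photography"])
print(router.route("which classic car should i buy"))  # ['vintage_vic']
router.feedback("vintage_vic", helpful=True)
```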

An ‘expert’, of course, is still undefined, and the jury is still out on what such an entity constitutes. Hey! I never said I have all the answers. Besides, don’t things like call centres, web sites with live chat etc. already handle this human-on-line business? And communication is always a problem. So, good luck with that. Live long and prosper.

[3] Let us talk about the histosphere. The concept is fairly simple. There are several companies (Hooeey, Google, Infoaxe, Thumbstrips, WebMynd, Iterasi, Timelope, Cluztr, Wowd, Nebulus etc.) that are collecting the browsing history of users, mainly through the mechanism of toolbars. On an individual basis, ‘web memory’ has utility, and so users can be convinced that it is a good tool to have and that it is a good idea to share their surf logs with the public at large, not unlike the case made for social bookmarking. This collective social history (also count Opera Mini, whose web proxy servers are collecting 500 million URLs per day, and Mozilla Weave, which will have similar numbers soon) is what I call the ‘histosphere’ (parallel words being the blogosphere and the criminally underexploited bookmarkosphere). A simple theory is that the histosphere is a proper superset of the blogosphere and the bookmarkosphere, and hence it is as useful as, if not more so than, both combined. There is a trickle effect at play here: not all history gets bookmarked, and not all bookmarks get blogged. So, the narrow band we talked about above is really narrow, but as any signal processing engineer would vouch, we should also count the haze or radiation to make sense of the quasar. Therefore, the same business and technology models of the blogosphere (example: Sphere) and the bookmarkosphere (example: Digg) can be replicated for the histosphere, but given the noisy nature of surf logs, one should apply filters (like ‘engagement metrics’) and use properties of attention data (like ‘observer neutrality’) to deliver better experiences (a toy sketch of such a filter follows below). Google is already trying to do this, serving personalized search results if one is logged in, but those results suck in the rare, one-off cases they are even visible. A use-case is to combine web memory with the side-effect of identity provided by toolbars to customize the whole web experience. Everywhere you go, your web memory follows, sifting through the cacophony. For example, if I am using Infoaxe and go to NYT or WSJ, the publishers will detect that it is me@infoaxe and deliver relevant content (and also ads, sic). Whichever search engine (reference or blog or real-time) loads history (and other streams) onto its cart will no doubt upset the shifting gravy train. Go Hybridosphere!
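The promised sketch: an engagement-metrics filter over hypothetical surf-log records of (user, url, dwell seconds). The record shape and thresholds are my assumptions; the point is only that drive-by page loads drop out and the collective attention signal survives.

```python
from collections import defaultdict

# Hypothetical surf-log records: (user, url, dwell_seconds)
surf_log = [
    ("u1", "nyt.com/article-a", 240),
    ("u2", "nyt.com/article-a", 180),
    ("u1", "ad-tracker.example/px", 1),
    ("u3", "wsj.com/markets", 95),
]

def histosphere_signal(log, min_dwell=30, min_users=2):
    """Engagement filter: a URL counts only if enough distinct users
    actually dwelt on it, separating reading from mere loading."""
    dwell = defaultdict(float)
    users = defaultdict(set)
    for user, url, seconds in log:
        if seconds >= min_dwell:         # drop drive-by hits outright
            dwell[url] += seconds
            users[url].add(user)
    # Rank the survivors by total attention spent on them.
    return sorted(
        (url for url in dwell if len(users[url]) >= min_users),
        key=lambda url: dwell[url],
        reverse=True,
    )

print(histosphere_signal(surf_log))  # ['nyt.com/article-a']
```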


Entry filed under: Business, Citizen-Journalism, Computers/ICT, CWorks, Life-Theories, Projects, Research, WebXP.
