Just mumblings and grumblings.

5.6.05

Searching, Sharing and Stumbling

Over the last couple of weeks I've been nosing around the internet looking for places to announce my website (KindaKarma.com) and did a lot of musing about how people find things on the internet.

When you really think about it, websites are just a collection of isolated islands in a vast ocean of information. The island inhabitants build bridges from time to time from one island to another, but such bridges are one-way, and even some of the coolest websites are doomed to anonymity and an early death.

Fortunately, there are a few good ways to get around the archipelago:
  1. Search Engines (Google, AskJeeves)
  2. Web Directories (Zeal, Yahoo Directory, DMOZ)
  3. Blogs (Slashdot, Joe Blow Blog)
  4. Social Bookmarks (StumbleUpon, del.icio.us)

Search Engines

Everyone knows what a good search engine does. It finds websites based on the keywords they contain. That's not quite enough though, since you'd just end up with a huge list of links of which perhaps only a handful are really interesting. So search engines also sort results by relevance. Google makes a good effort there by assuming that pages with lots of links to them have higher relevancy.
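
To make the "keywords first, then sort by links" idea concrete, here is a minimal sketch. It is not how Google actually ranks pages; it simply counts inbound links on a made-up four-page web. The page names, the PAGES data and the function names are all assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical toy web: each page has some text and a list of outbound links.
PAGES = {
    "a.example": {"text": "karma points and reputation", "links": ["b.example"]},
    "b.example": {"text": "karma explained", "links": []},
    "c.example": {"text": "karma karma karma", "links": ["b.example", "a.example"]},
    "d.example": {"text": "cooking recipes", "links": ["b.example"]},
}


def inbound_link_counts(pages):
    """Count how many other known pages link to each page (self-links ignored)."""
    counts = Counter()
    for source, page in pages.items():
        for target in set(page["links"]):
            if target != source and target in pages:
                counts[target] += 1
    return counts


def search(pages, keyword):
    """Return pages whose text contains the keyword, most linked-to first."""
    counts = inbound_link_counts(pages)
    matches = [
        url for url, page in pages.items()
        if keyword.lower() in page["text"].lower().split()
    ]
    # Sort by inbound links (descending), then by URL to keep ties stable.
    return sorted(matches, key=lambda url: (-counts[url], url))


print(search(PAGES, "karma"))
```

Note that c.example repeats the keyword three times but still lands last, because nobody links to it. That is the whole point of link-based ranking, and also why a brand new site starts at the bottom.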

Search engines aren't a panacea though. New sites, even very worthy ones, have a long road ahead before they can come anywhere near to the top of the pile. Compounding this problem, there are also a lot of junk sites out there that exploit how search engines work to garner higher placement even though they might not be particularly relevant to the user.

Web Directories

Web directories are basically just large lists of websites sorted into categories. This provides a sort of rudimentary index to the web. Since two-thirds of the web is spam, most directories go through a filtering process to ensure that their listings are relevant and accurate.

This is all well and good, but at last glance there were more than four and a half million websites listed in DMOZ and that's AFTER filtering. Such an overwhelming list is bad all around because it's a phenomenal amount of work for editors, a bitch to wade through for users and even the best websites are little more than a link among many similar links.

Blogs

Blogs come in many flavors, from the communal news sharing that is Slashdot to the Joe Blow man on the street's personal diary. A perhaps unintended consequence of blogs is that they tend to be textual (read: keyword-rich) repositories of links to interesting or useful information. They make for a rich source of cross-linked data for search engines to chew through. This results in better search results, and if users happen to find a blog they particularly like, they get a human-edited list of interesting links.

For the webmaster though, this is a zero-sum game. While many blogs allow users to submit their own posts, blog communities are continually bombarded by promotional posts, which makes editors extremely wary of posting anything even remotely commercial. Users, meanwhile, sometimes get stuck wading through banal, redundant or downright moronic blogs for information.

Social Bookmarking

Social bookmarking (wikipedia article) is a relatively new concept that is being approached in a variety of ways. In essence, users share lists of links with each other, usually in some kind of category or keyword framework. The categories and keywords keep these lists relevant, while users keep out spam by only submitting worthy links.

One site, StumbleUpon, takes a pretty novel approach to this by using collaborative filtering in combination with social bookmarking. Users install a toolbar in their browser that gives them 'stumble', 'I like it' and 'not-for-me' buttons. Users click 'stumble' to get a randomly selected page based on what similar users liked, then vote that page up or down. It works quite well, though it's a much more time-intensive approach for users, as many of the websites you visit will not be interesting.
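
For the curious, collaborative filtering can be sketched in a few lines. This is only a guess at the general technique, not StumbleUpon's actual algorithm: the vote values (+1 for 'I like it', -1 for 'not-for-me'), the similarity measure, the user names and the function names are all assumptions made up for this example.

```python
import random

# Hypothetical votes: +1 for 'I like it', -1 for 'not-for-me'.
VOTES = {
    "ann": {"site1": 1, "site2": 1, "site3": -1},
    "bob": {"site1": 1, "site2": 1, "site4": 1},
    "cat": {"site1": -1, "site3": 1, "site5": 1},
}


def similarity(a, b):
    """Agreements minus disagreements on pages both users voted on."""
    return sum(a[page] * b[page] for page in a.keys() & b.keys())


def candidate_scores(votes, user):
    """Score pages the user has not voted on, weighted by voter similarity."""
    mine = votes[user]
    scores = {}
    for other, theirs in votes.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        if weight <= 0:
            continue  # skip users with no overlap or opposite taste
        for page, vote in theirs.items():
            if page not in mine:
                scores[page] = scores.get(page, 0) + weight * vote
    return scores


def stumble(votes, user, rng=random):
    """Pick a random unseen page that similar users liked, or None."""
    scores = candidate_scores(votes, user)
    liked = sorted(page for page, score in scores.items() if score > 0)
    return rng.choice(liked) if liked else None


print(stumble(VOTES, "ann"))
```

Here ann agrees with bob on two pages and disagrees with cat on two, so only bob's taste counts and ann gets stumbled to site4. A user with no like-minded neighbors gets nothing, which hints at why such a system needs a lot of votes before it becomes useful.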

Another interesting site is digg, which adds a rating and ranking system that provides robust relevance filtering. Each bookmark also gets a sizable write-up, allowing the user to skim lists before clicking. Sadly, the site appears to be a victim of its own success, as page load times are extremely slow as of now.

Unfortunately, all these systems are in their infancy. Many suffer from terrible UI design that makes browsing a frustrating affair. Others lack any kind of moderation, which leads to less and less relevant links (I've yet to see a significant amount of spam on these systems, but I feel it's only a matter of time). For the webmaster, they represent a great new way to share links, but they'll need to be active members of each community to give their links any kind of prominence.

Best of All Worlds

So is there a best of all worlds solution that would cure all these ills? Perhaps. Let's lay out what our goals would be in creating a SupraSearch website.
  1. Relevance. Any link would need to have high relevance to users' searches.
  2. Webmaster Friendly. The site would need to have tools for webmasters to promote their sites.
  3. User-Friendly. This means a lot of things, but generally we want the user to have to click on as few things as possible to get where they want to go.
  4. Fairness. We want all websites, new and old, to have a fair chance to vie for a user's attention, and any advantages they receive should be on merit.
  5. Abuse-proof. Spammers are everywhere, and we don't want them to destroy any chance of achieving our other goals by exploiting the system.

So let's take a stab at a solution.

An obvious first step is to hybridize all the different approaches into a single mutant super-website. How about a classic search engine that only crawls websites that are in a moderated list of links (search engine and directory, check)? The list of links would be built and managed by a community of civic-minded users and webmasters who contribute links, filter out spam and just generally keep relevance high (blogs and social bookmarking, check).
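
The "only crawl what's on the moderated list" part is simple enough to sketch. This is a rough illustration under my own assumptions: the APPROVED_HOSTS set, the rule that "www." is ignored, and the function names are all hypothetical, and a real crawler would need far more than this.

```python
from urllib.parse import urlsplit

# Hypothetical moderated list of approved sites, maintained by the community.
APPROVED_HOSTS = {"kindakarma.com", "example.org"}


def is_approved(url, approved_hosts=APPROVED_HOSTS):
    """True if the URL's host is on the moderated list (a leading 'www.' is ignored)."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in approved_hosts


def crawl_queue(candidate_urls, approved_hosts=APPROVED_HOSTS):
    """Keep only the URLs the crawler is allowed to fetch, in their original order."""
    return [url for url in candidate_urls if is_approved(url, approved_hosts)]


print(crawl_queue([
    "http://www.KindaKarma.com/about",
    "http://spam.example.net/buy-now",
    "http://example.org/articles/1",
]))
```

The crawler never even fetches a page from an unlisted host, so link farms can't sneak into the index just by existing. They would have to get past the moderators first.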

Every user would get an anonymous profile through which they search, submit, and moderate. Each website would have a profile, associated with its authors' profiles, that users could track, comment on, and score.

At the heart of such a system would be a relevancy score, which would drive the ranking of results in searches and directory listings. But relevancy is a slippery and subjective concept, so let's break it down a little further:
  • Contextual Relevance (CR). How well a result matches the context of what a user is looking for. Keywords and categories can be used to calculate this component.
  • Popular Relevance (PR). A key measure of how relevant a link is. How often people click it, and how often other people embed it in their own pages.
  • Author Relevance (AR). How highly regarded the author of the website is. Known spammers score low, important scholars or companies score high.
  • Quality Relevance (QR). Not all links are created equal, some are better than others.
  • Freshness Relevance (FR). Information often degrades with time. A philosophical dissertation might be relevant always, but a technical paper might become obsolete in just a few years.

Our SupraSearch has to start somewhere, so let's try this:
Link Relevance Score = (A * CR) + (B * PR) + (C * AR) + (D * QR) + (E * FR);
A-E are just constants to allow for tweaking. All variables are scaled from 0 to 1 and the sum of A-E must equal 1.
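
Here is the formula as a small runnable sketch. The weights and component scores in the example are numbers I made up purely to show the arithmetic; nothing here suggests what good values would be. The function name and the validation checks are my own assumptions too.

```python
# Each constant A-E is paired with the component it multiplies in the formula.
TERMS = (("A", "CR"), ("B", "PR"), ("C", "AR"), ("D", "QR"), ("E", "FR"))


def link_relevance_score(weights, components):
    """Compute (A * CR) + (B * PR) + (C * AR) + (D * QR) + (E * FR)."""
    if abs(sum(weights[w] for w, _ in TERMS) - 1.0) > 1e-9:
        raise ValueError("the sum of A-E must equal 1")
    for _, c in TERMS:
        if not 0.0 <= components[c] <= 1.0:
            raise ValueError(f"{c} must be scaled from 0 to 1")
    return sum(weights[w] * components[c] for w, c in TERMS)


# Illustrative values only.
weights = {"A": 0.3, "B": 0.3, "C": 0.1, "D": 0.2, "E": 0.1}
components = {"CR": 1.0, "PR": 0.5, "AR": 0.5, "QR": 1.0, "FR": 0.0}

print(round(link_relevance_score(weights, components), 3))
```

Because every component is between 0 and 1 and the constants sum to 1, the final score also lands between 0 and 1, which makes scores comparable from one link to the next.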

CR and PR are the classic keyword and backlink relevance scoring done by almost all current search engines.

AR, QR and FR would be managed by the community of moderators. For simplicity's sake, let's assume that every user gets a vote on each score, and the total score is just the average of all users' votes. An obvious next step would be to create a relevancy score for each USER and use a weighted average, but this is another discussion altogether.
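
Both flavors of averaging are easy to sketch. The user names, vote values and per-user relevancy numbers below are invented for illustration, and how a user's relevancy would actually be earned is left open, just as it is above.

```python
def average_score(votes):
    """Plain mean of all users' votes (each vote scaled from 0 to 1)."""
    if not votes:
        return None
    return sum(votes.values()) / len(votes)


def weighted_score(votes, user_relevance):
    """Mean of votes weighted by each voter's relevancy; unknown users weigh 0."""
    total_weight = sum(user_relevance.get(user, 0.0) for user in votes)
    if total_weight == 0:
        return None
    weighted_sum = sum(
        vote * user_relevance.get(user, 0.0) for user, vote in votes.items()
    )
    return weighted_sum / total_weight


# Hypothetical quality votes on one link; 'shill' is promoting their own site.
votes = {"ann": 0.25, "bob": 0.25, "shill": 1.0}
user_relevance = {"ann": 1.0, "bob": 1.0, "shill": 0.0}

print(average_score(votes))
print(weighted_score(votes, user_relevance))
```

With a plain average, one enthusiastic shill drags a mediocre link up to 0.5. With the weighted version, a voter the community has scored at zero has no pull at all and the link stays at 0.25.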

A-E you either tweak to get the best results, or you might even allow each user to set their own factors search-by-search.

Summary & Conclusion

There are endless other design elements we could discuss, but the important features here are that we've hopefully excluded spam sites by using an 'In List' and added a human element to the calculation of website relevancy.

Users get more relevant results. Webmasters get something better than the one-line submission form most search engines offer, plus the opportunity to influence their scores directly through the moderation community.

Like all things, SupraSearch still has some chinks in its armor:
  • Performance of this type of system might be horrendous since it takes into account so many factors for each search.
  • Abuse prevention still isn't addressed fully.
  • Calculation of the different relevance scores might be very difficult to balance.
  • Will the company that owns SupraSearch be scrupulous enough to not use paid listings or user accounts?

Still seems worth a shot to me...