Why I love Full Text Search

I’ve spent a big part of my life working on structured storage engines, first in DB2 and later in SQL Server. And yet, even though I fully understand the value of fully schematized data, I love full text search and view it as a vital access method for all content wherever it’s stored. There are two drivers of this opinion: 1) I believe, as an industry, we’re about ¼ of the way into a transition from primarily navigational access patterns to personal data to ones based upon full text search, and 2) getting agreement on broad, standardizing schema across diverse user and application populations is very difficult.

On the first point, for most content on the web, full text search is the only practical way to find it. Navigational access is available but it’s just not practical for most content. There is simply too much data and there is no agreement on schema so more structured searches are usually not possible. Basically structured search is often not supported and navigational access doesn’t scale to large bodies of information. Full text search is often the only alternative and it’s the norm when looking for something on the web.

Let’s look at email. Small amounts of email can be managed by placing each piece of email you choose to store in a specific folder so it can be found later navigationally. This works fine but only if we keep only a small portion of the email we get. If we never bothered to throw out email or other documents that we come across, the time required to folderize would be enormous and unaffordable. Folderization just doesn’t scale. When you start to store large amounts of email or just stop (wasting time) aggressively deleting email, then the only practical way to find most content is full text search. As soon as 5 to 10GB of un-folderized and un-categorized personal content is accumulated, it’s the web scenario all over again: search is the only practical alternative. I understand that this scenario is not supported or encouraged by IT or legal organizations at most companies but that is the way I choose to work. There is no technical stumbling block to providing unbounded corporate email stores and the financial ones really don’t stand up to scrutiny. Ironically, most expensive corporate email systems offer only tiny storage quotas while most free, consumer-based services are effectively unbounded. Eventually all companies will wake up to the fact that knowledge workers work more efficiently with all available data. And, when that happens, even corporate email stores will grow beyond the point of practical folderization.

The second issue is the difficulty of standardizing schema across many different stores and many different applications. The entire industry has wanted to do this over the past couple of decades and many projects have attempted to make progress. If they were widely successful, it would be wonderful but they haven’t been. If we had standardized schema, we would have quick and accurate access to all data across all participating applications. But it’s very hard to get all content owners to cooperate or even care. Search engines attempt to get to the same goal but they choose a more practical approach: they use full text search and just chip away at the problem. They work hard on ranking. They infer structure in the content where possible and exploit it where it’s found. Where structure can’t be found, at least there is full text search with reasonably good ranking to fall back upon.

Strong or dominant search engine providers have considerable influence over content owners, and so weak forms of schema standardization become more practical. For example, a dominant search engine provider can offer content owners opportunities to get better search results for their web site if they supply a web site map (standard schema showing all web pages in the site). This is already happening and web administrators are participating because it brings them value. A web site’s ranking with the important search engine providers is vital and a chance to lift your ranking even slightly is worth a fortune. Folks will work really hard where they have something to gain. So, if adopting common schema can improve ranking, there is a significant chance something positive actually could happen.
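To make the site map idea concrete, here is a minimal sketch of what a content owner might generate. It assumes the XML format published at sitemaps.org is the kind of site map meant here; the helper name build_sitemap and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be the format in question).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    last_modified may be None when the site owner does not know it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages; a real site would enumerate its own URLs.
xml_text = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/articles/full-text-search", None),
])
print(xml_text)
```

The schema is deliberately weak: a list of URLs plus a few optional hints. That is roughly the level of structure a content owner will supply when the only incentive is better ranking.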

The combination of providing full text search over all content, motivating content providers to participate in full or partial schema standardization, and having the search engine infer schema where it’s not supplied feels like a practical approach to richer search. I love full text search and view it as the underpinning to finding all information, structured or not. The most common queries will include both structured and non-structured components, but the common element will be that full schema standardization isn’t required, nor is it required that a user understand schema to be able to find what they need. Over time, I think we will see incremental participation in standardized schemas but this will happen slowly. Full text search with good ranking and relevance, assisted by whatever schema can be found or inferred in the data, will be the underpinning to finding most content over the near term.
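A query with both structured and non-structured components might look something like the sketch below. It is a toy, not how any production search engine works: the TinyIndex class, the simple term-frequency times inverse-document-frequency score, and the sample messages are all assumptions made for illustration. Every document gets full text search, and metadata fields are used only where they happen to exist.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    """Toy inverted index: full text for every document, metadata only where known."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.meta = {}                     # doc_id -> known or inferred fields

    def add(self, doc_id, text, **fields):
        """Index the text of a document along with whatever fields are available."""
        self.meta[doc_id] = fields
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query, **filters):
        """Rank documents by a simple TF-IDF sum, then apply any structured filters."""
        n_docs = len(self.meta)
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + n_docs / len(docs))  # rarer terms count for more
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        hits = [d for d in scores
                if all(self.meta[d].get(k) == v for k, v in filters.items())]
        # Highest score first; ties broken by document id for stable output.
        return sorted(hits, key=lambda d: (-scores[d], d))

# Hypothetical mailbox: two messages carry a sender field, one has no metadata.
index = TinyIndex()
index.add("m1", "Quarterly budget review for the storage team", sender="alice")
index.add("m2", "Storage engine design notes, budget not discussed", sender="bob")
index.add("m3", "Lunch on Friday?")

print(index.search("storage budget"))                # pure full text query
print(index.search("storage budget", sender="bob"))  # full text plus a structured filter
print(index.search("lunch"))                         # still findable with no schema at all
```

The message with no metadata is still reachable by its text, and the structured filter narrows results only when the field is present. The user never has to know the schema to get an answer.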

–jrh

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com

2 comments on “Why I love Full Text Search”
  1. Mike’s (as usual) 100% right. It’s about ALL the data, so start with access to it all with full text search and do better as more structure is stated or inferred.

    –jrh

  2. Neil Conway says:

    James, you might be interested in "data spaces":

    http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf

    The idea is to provide a unified architecture for access to data that scales all the way from unstructured, unintegrated data (over which only full-text search is provided) to completely structured and integrated data (over which you might allow SQL). Rather than requiring up-front schema integration, the idea is to allow the system to be brought on-line providing only full-text search, and then allow schema integration to be performed incrementally and as needed.
