Exclusive: Microsoft Patents The Search Engine
Microsoft has received a patent covering a search engine platform based on a "bag-of-words" model and an "essential pages" ranking system intended to make searches more efficient.
While it is clear that this patent is the foundation for Bing, it is somewhat stunning to see how this patent collides in its details with what could also be claimed to be the foundation of Google's search engine.
The technology field is littered with silly patents that arguably should never have been awarded in the first place and that do not serve their intended purpose: to foster innovation and protect the intellectual property of ingenious minds. One of those recent patents may have been Microsoft's "Eureka!" moment, the idea that the operating system shutdown should be generally owned by the company.
Lately, we are seeing much more activity in patents that cover a wide range of technologies, many of which are critical to the future of IT. For example, I have recently noted that Google patented cloud browser sync as well as threaded email management, while Microsoft patented GPU-accelerated video encoding and voice search through a search engine. Indeed, it is especially the patent battle between Microsoft and Google that highlights much of the IP likely to determine key technologies of the future Internet platform.
Microsoft has now bagged a patent that is labeled as "a search engine platform", which - by its general nature - is sure to raise some interest across the industry. Exactly what search engine has Microsoft patented here?
Patent filings always have a general and a detailed description, and as you dive deeper, this patent gets much more interesting. Microsoft's general claim is:
"Systems and methods to perform efficient searching for web content using a search engine are provided. In an illustrative implementation, a computing environment comprises a search engine computing application having an essential pages module operative to execute one or more selected selection algorithms to select content from a cooperating data store. In an illustrative operation, the exemplary search engine executes on a received search query to generate search results. Operatively, the retrieved results can be generated based upon their joint coverage of the submitted search query by deploying a selected sequential forward floating selection (SFFS) algorithm executing on the essential pages module. In the illustrative operation, the SFFS algorithm can operate to iteratively add one and delete one element from the set to improve a coverage score until no further improvement can be attained. The resultant processed search results can be considered essential pages."
As I read through the patent, I learned that what Microsoft describes is a search engine technology that aims to increase the likelihood of finding certain content with fewer mouse clicks. The idea builds on traditional search engine spidering techniques, a ranking system, and secondary information from neighboring search results, which is used to re-rank the results. In Microsoft's words:
"In addition to relevance, existing practices also consider diversity of Web-search results as an additional factor for ordering documents. A re-ranking technique based on maximum marginal relevance criterion to reduce redundancy from search results as well as presented document summarizations has been considered. Additionally, an affinity ranking scheme to re-rank search results by optimizing diversity and information richness of the topic and query results has been developed. Such practices model the variance of topics in groups of documents.
The herein described systems and methods provide a modeling of the overall knowledge space for a specific query and improving the coverage of this space by a set of documents. In an illustrative implementation a "bag-of-words" model for representing knowledge spaces is provided. Additionally, in the illustrative implementation, a formal notion of coverage over the "bag-of-words" is provided and a simple but systematic algorithm to select documents that maximize coverage is derived to allow relevance to the search topic."
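To make the idea of "coverage over the bag-of-words" more concrete, here is a minimal Python sketch. It is my own illustration and not code from the patent: the function names `bag_of_words` and `coverage`, the tokenization rule, and the choice to weight each term by how often it appears in the knowledge space are all assumptions.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, ignoring sentence structure and formatting."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(knowledge_space: Counter, documents: list) -> float:
    """Share of the knowledge space (weighted by term count) that is
    mentioned by at least one of the given document bags.
    The weighting scheme is an assumption made for this sketch."""
    total = sum(knowledge_space.values())
    if total == 0:
        return 0.0
    covered = sum(
        weight
        for term, weight in knowledge_space.items()
        if any(term in doc for doc in documents)
    )
    return covered / total


# Hypothetical knowledge space for an ambiguous query.
space = bag_of_words("jaguar car jaguar cat speed")
doc_cat = bag_of_words("The jaguar is a big cat")
doc_car = bag_of_words("Jaguar car top speed")
```

In this toy setup, either document alone covers only part of the knowledge space, while the two together cover all of it. That is the intuition behind judging results by their joint coverage rather than one page at a time.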
Microsoft considers a web page as a "bag-of-words" where keywords are filtered, extracted and counted to arrive at a valuation of that document. The result is basically a document that lists keywords. Microsoft's patented search engine platform relies on a bag-of-words approach in which "a document is processed as a collection of statistics over a set (i.e., bag) of words used in it, without explicit semantic constructions such as sentences, formatting, etc." This bag-of-words representation provides the foundation for what Microsoft calls "essential pages," which are selected against the bag-of-words and are said to eliminate certain less relevant results from a search query, so that a user needs fewer mouse clicks.
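The abstract quoted earlier says the essential pages are chosen by a sequential forward floating selection (SFFS) algorithm that adds one element and deletes one element until the coverage score stops improving. The following Python sketch shows that add-one, delete-one loop under loudly stated assumptions: the names `score` and `select_essential_pages` are hypothetical, pages are reduced to plain sets of terms, and the per-page `penalty` is my own addition so that dropping a redundant page can actually raise the score. The patent text quoted here does not spell out its scoring function.

```python
def score(selected, pages, target, penalty=0.5):
    """Number of target terms covered by the selected pages, minus a
    per-page cost. The cost term is an assumption made for this sketch."""
    covered = set()
    for name in selected:
        covered |= pages[name] & target
    return len(covered) - penalty * len(selected)


def select_essential_pages(pages, target, penalty=0.5):
    """Simplified floating selection: alternately add and delete one page
    for as long as either move strictly improves the score."""
    selected = set()
    best = score(selected, pages, target, penalty)
    while True:
        improved = False

        # Forward step: add the single page that raises the score the most.
        remaining = [name for name in pages if name not in selected]
        if remaining:
            pick = max(
                remaining,
                key=lambda n: score(selected | {n}, pages, target, penalty),
            )
            candidate = score(selected | {pick}, pages, target, penalty)
            if candidate > best:
                selected.add(pick)
                best = candidate
                improved = True

        # Backward step: delete one page if the set scores better without it.
        if selected:
            drop = max(
                sorted(selected),
                key=lambda n: score(selected - {n}, pages, target, penalty),
            )
            candidate = score(selected - {drop}, pages, target, penalty)
            if candidate > best:
                selected.remove(drop)
                best = candidate
                improved = True

        if not improved:
            return selected, best


# Hypothetical toy data: three pages and the terms each one mentions.
pages = {
    "p1": {"b", "c", "d", "e"},
    "p2": {"a", "b", "c"},
    "p3": {"d", "e", "f"},
}
target = set("abcdef")
chosen, value = select_essential_pages(pages, target)
```

On this toy data the loop first picks `p1` because it covers the most terms on its own, then adds `p2` and `p3`, and finally deletes `p1` once the other two pages cover everything it contributed. That late deletion is the "floating" part: a plain greedy selection would have kept all three pages.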
The indexing and processing of the bag-of-words is a highly complex process that involves interpreting and processing each word, including identifying the root of the word (word stemming). For example, Microsoft removes word endings as well as words that do not describe context semantics, such as "as," "is," or "be." According to Microsoft, this process will provide more "pertinent search results."
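As a rough illustration of that preprocessing, the sketch below drops a few stop words and strips common English endings. It is deliberately crude and is not the stemmer described in the patent: the suffix list, the minimum stem length, and every stop word other than "as," "is," and "be" are my own assumptions. A production system would more likely use an established stemming algorithm.

```python
import re

# "as", "is" and "be" come from the article; the rest are assumed examples.
STOP_WORDS = {"as", "is", "be", "the", "a", "an", "of", "and"}

# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 4  # assumed guard so short words are not over-trimmed


def crude_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list:
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]


keywords = normalize("Searching is as fast as the indexed pages")
```

Here `keywords` ends up as `["search", "fast", "index", "page"]`, which is the kind of reduced keyword list that could then be counted into a bag-of-words.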
So, how is this patent different from what Google does today? Microsoft applied for this patent in March 2008, about one year before the company provided a first glimpse of its search engine. The concept comes down to keyword generation, extraction and storage, as well as the way those keywords are applied to a search query. The description largely matches what Google has been doing for several years, as well as a keyword practice that basic search engine optimization efforts have long relied on. And even if Microsoft's patent differs in certain details from Google's approach, it is somewhat surprising that this idea made it through the U.S. Patent and Trademark Office in the record time of less than three years. Legally, Microsoft may have some leverage against Google, even if it is questionable whether Microsoft would really go after Google at this time in an area as critical as keyword extraction and interpretation.
Google's lawyers, on the other hand, may want to look at this patent more closely and figure out whether Microsoft has invaded Google territory with it. Somehow I feel that this is not the last we have heard of this patent.