How A Search Engine Works

The Internet, once the prerogative of a select few, has seeped into lives around the globe. It would not be an exaggeration to say that every piece of information one could ever need sits somewhere, in bits and bytes, on the complex maze of web pages that comprises the Internet.

There is no exact figure for the total number of web pages out there, but rough estimates run into tens or even hundreds of billions. Retrieving the required information from so many pages would be a Herculean task, but thanks to search engines the process has been simplified. Let us take a look at where search engines came from.

History of Search Engines

The grandfather of all search engines, Archie, can be traced back to 1990, when it was created by Alan Emtage, a student at McGill University in Montreal. At that time, the primary means of sharing files across the network was the File Transfer Protocol (FTP): anyone who wanted to share a file would run a service called an FTP server.

A user requiring the file would connect to that machine using another program called an FTP client. The availability of files for sharing was announced by posting on discussion forums or mailing lists, in what could be termed the Internet equivalent of word of mouth. Later, anonymous FTP sites came into being, allowing users to post or retrieve files.

Archie changed all this with a script-based gatherer that scoured FTP sites, building an index of the files it found. A regular-expression matcher then allowed users to query its database.

If Archie is the grandfather of search engines, then Veronica can rightly be called the grandmother. This search engine was developed in 1993 at the University of Nevada System Computing Services Group. It was much like Archie but worked on Gopher files; Gopher is a service akin to FTP but serves plain-text documents. Matthew Gray’s World Wide Web Wanderer, the mother of search engines, was the first to employ the “robot” concept.

A robot is essentially a software program designed to access all web pages by following the links found in pages it has already visited. Though it was first designed simply to count the number of web servers, the Wanderer later began capturing URLs as it went along, creating the first web database, called Wandex. Matthew Gray’s Wanderer fueled the development of many more robot-based search engines, many of which power today’s search engines.

Types of Search Engines

All the search engines prevalent today can be categorized into two types:

• Crawler-based search engines: A crawler-based search engine employs software programs known as “spiders”. These spiders crawl through the web, collecting the links found in web pages as the source for further searches. All the pages referenced by those links are stored in a database of indexes that relate to the textual content of the pages. This index is then used to display the pages that match the queries entered by a user.

• Directory-based search engines: Unlike crawler-based search engines, directory-based ones are entirely human driven. Here, a directory containing links to pages is maintained by the site’s creator, and links can be submitted by webmasters or added by the people reviewing a particular site.

Owing to their speed and autonomous nature, crawler-based search engines are the more popular of the two, so let us take a closer look at what makes them tick. A crawler-based search engine is essentially composed of three components: a) the spider or crawler, b) the indexer, and c) the ranking software.

Spider

This is the component that relentlessly crawls the web, accessing and downloading pages. Spiders work on the principle that all pages are linked through hyperlinks. They usually start off with a list of URLs (Uniform Resource Locators, commonly known as web addresses) and then expand their reach by following the hyperlinks found in each page they visit.
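As a rough illustration of this principle, here is a minimal crawler sketch in Python. It assumes the third-party requests and beautifulsoup4 packages are available, and the seed URL is hypothetical; a real spider would add the policy checks described below (politeness delays, robots.txt handling, persistent storage).

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a list of seed URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # URLs already fetched
    pages = {}                    # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue              # skip unreachable pages
        pages[url] = response.text

        # Follow the hyperlinks found in this page.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                frontier.append(link)
    return pages

# Hypothetical usage:
# pages = crawl(["https://example.com/"], max_pages=10)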

Since it is not feasible to access every single web page out there, spiders are programmed to filter and to decide where to stop and which routes to take while traversing the web, which speeds up the search. The frequency with which a spider visits a web page also depends on how often that page is updated. This behavior is governed by a set of rules programmed into the spider, called crawling policies.

These include: 1) a selection policy (which pages to download), 2) a re-visit policy (when to check for changes to the pages), 3) a politeness policy (how to avoid overloading web sites), and 4) a parallelization policy (how to coordinate distributed web crawlers).

• Selection policy: The explosive growth of the web makes a good selection policy for choosing which URLs to visit essential. A study by Lawrence and Giles found that, of all the pages available on the web, only about 16 percent are actually downloaded and indexed.

With such meager coverage by search engines, it becomes necessary to prioritize pages so that only the most relevant ones are shown to users. The significance of a page can be judged by its intrinsic quality, the number of inbound links, the number of visits it receives and, most importantly, the relevance of its information.
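One simple way to express such a prioritization, sketched here under the assumption that importance can be approximated as a weighted mix of inbound links, visit counts and a quality score, is a priority queue over the crawl frontier. The weights are invented for illustration and do not come from any real search engine.

import heapq

def priority(inbound_links, visit_count, quality_score):
    # Higher score means more important; negate because heapq is a min-heap.
    return -(0.5 * inbound_links + 0.3 * visit_count + 0.2 * quality_score)

frontier = []   # priority queue of (priority, url) pairs

def enqueue(url, inbound_links=0, visit_count=0, quality_score=0.0):
    heapq.heappush(frontier, (priority(inbound_links, visit_count, quality_score), url))

def next_url():
    return heapq.heappop(frontier)[1] if frontier else None

enqueue("https://example.com/popular", inbound_links=120, visit_count=80)
enqueue("https://example.com/obscure", inbound_links=2, visit_count=1)
print(next_url())   # the better-linked page is visited first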

• Re-visit policy: Given the depth and expanse of the web, the time required to navigate even a small fraction of its pages can run into months. The web is also dynamic, with additions, deletions and modifications happening constantly. This puts additional pressure on search engines from the freshness and relevance standpoint. If freshness indicates whether the locally downloaded copy of a page still matches the live site, age indicates how long that local copy has been out of date.

A good search engine strives for high freshness and low age in its search results. Two re-visit policies are commonly used: a uniform policy, in which all pages are visited at the same rate irrespective of how often they change, and a proportional policy, in which the frequency of updates to a page determines how often it is revisited.
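To make the freshness and age notions concrete, here is a small sketch assuming we know (or can estimate) when each page was last downloaded locally and when it last changed on the remote site; the proportional-style scheduling helper at the end is likewise only illustrative.

from datetime import datetime

def freshness(last_downloaded, last_modified_on_site):
    """1 if the local copy still matches the live page, 0 otherwise."""
    return 1 if last_downloaded >= last_modified_on_site else 0

def age(now, last_downloaded, last_modified_on_site):
    """How long (in seconds) the local copy has been out of date; 0 if fresh."""
    if last_downloaded >= last_modified_on_site:
        return 0.0
    return (now - last_modified_on_site).total_seconds()

def next_visit_interval(base_interval_hours, estimated_changes_per_week):
    """Proportional-style scheduling: pages that change more often are revisited sooner."""
    return base_interval_hours / max(1, estimated_changes_per_week)

now = datetime(2024, 1, 10)
print(freshness(datetime(2024, 1, 8), datetime(2024, 1, 9)))          # 0: the copy is stale
print(age(now, datetime(2024, 1, 8), datetime(2024, 1, 9)) / 3600.0)  # hours out of date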

• Politeness policy: When spiders rove the web, they affect its overall performance, albeit often imperceptibly. For example, network resources come under pressure because spiders consume a considerable amount of bandwidth by the very nature of their functioning. They also often work in parallel, further increasing the load on a server.

They can overload servers if they access them too frequently, and badly written spiders that cannot process a web page correctly can even crash a server or a router. A partial solution to this problem is the Robots Exclusion Protocol, which allows web site administrators to specify which pages may be visited and what information may be accessed.

A file called robots.txt, placed at the root of a web site, implements this Robots Exclusion Protocol. Another method is to include the authorization information in the META tags of a web page; such a tag can tell the spider that the current page should neither be indexed nor its links followed. Currently, very few spiders support this, and it is completely up to each spider to decide whether or not to obey these directives.
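For reference, a polite spider written in Python could honour robots.txt with the standard library's urllib.robotparser module, roughly as sketched below; the site and user-agent name are hypothetical.

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")   # hypothetical site
parser.read()                                      # fetch and parse robots.txt

if parser.can_fetch("MySpider", "https://example.com/private/data.html"):
    print("Allowed to crawl this page")
else:
    print("robots.txt asks spiders to stay away")

# The page-level alternative mentioned above is an HTML tag such as
# <meta name="robots" content="noindex, nofollow"> placed in the page's head.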

• Parallelization policy: A parallel crawler is one that runs multiple crawling processes at once, so as to cover more ground while searching the web. Though the goal is to increase the download rate, the crawler has to minimize the overhead caused by parallelization and prevent multiple processes from downloading the same page. To contain this redundant downloading, there are policies for assigning the newly discovered URLs to the individual crawling processes.

• Dynamic assignment: Here, a central server takes responsibility for assigning URLs to each crawling process dynamically, which allows the load to be balanced evenly across the available crawlers.

• Static assignment: In this policy, a fixed rule, stated from the beginning of the crawl, defines how new URLs are assigned to the crawlers (see the sketch after the list below). For the assignment function to be effective, it has to satisfy three properties:

a) Each crawling process should receive approximately the same number of hosts
b) If the number of crawling processes increases, the load per process should reduce
c) It must be possible to add or remove crawling processes dynamically with minimal impact on the overall functioning of the system.
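A common static rule, sketched here as an assumption rather than a prescribed method, is to hash the host name of each URL and take the result modulo the number of crawler processes; hashing the host rather than the full URL keeps all pages of one site on the same crawler, which also helps with politeness.

import hashlib
from urllib.parse import urlparse

def assign_crawler(url, num_crawlers):
    """Map a URL to one of num_crawlers processes based on its host name."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

print(assign_crawler("https://example.com/page1", 4))   # same value...
print(assign_crawler("https://example.com/page2", 4))   # ...for the same host

Note that a plain modulo rule reshuffles most hosts whenever a crawler is added or removed, so schemes such as consistent hashing are often preferred to satisfy property c) above.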

Indexer

This component is responsible for the indexing stage of the search process. Every page the spider has encountered and deemed worth downloading has to be indexed in a way that lets users receive relevant results quickly. The indexer performs a number of functions: it reads the repository of web pages downloaded by the spider, decompresses the documents and parses them.

Each document is converted into a set of word occurrences called “hits”. A hit records the word, its position in the document, an approximation of its font size, its capitalization, and so on. The indexer also parses out all the links in every web page and stores important information about them in an anchor file. This file contains enough information to determine where each link points from and to, as well as the text of the link itself.
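A toy version of this indexing step might look like the following sketch, assuming the downloaded pages are already available as plain text; a real indexer would also record font size, capitalization and anchor text as described above.

import re
from collections import defaultdict

def build_index(pages):
    """pages: dict of url -> text. Returns word -> list of (url, position) hits."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((url, position))   # one "hit"
    return index

index = build_index({
    "https://example.com/a": "Search engines crawl and index the web",
    "https://example.com/b": "Spiders crawl the web following links",
})
print(index["crawl"])   # every page and position where "crawl" occurs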

Page Ranking

The success of a search engine is measured by the relevance of its search results. The final job of assembling and ordering the results you see in your web browser is handled by the page-ranking component of a search engine. People who dabble in search engine optimization make it their task to figure out how to craft web pages so that they show up within the first 10 results of a search query.

Hence, unraveling the secrets of the algorithms employed by search engines can help enthusiasts create pages that achieve high rankings.

Although the inner details of the algorithms employed by search engines are not publicly available, the manner in which results are displayed gives some insight into the criteria used for ranking pages. These tell-tale signs include the title of the web page, its headings, the titles of links within a page, the frequency and prominence of a word on a page, the popularity of its links, the freshness of the information on the page, and so on.
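Purely for illustration, a scoring function combining a few of these signals might look like the sketch below; the weights and field names are invented and do not reflect any real engine's algorithm.

def score_page(query_terms, page):
    """page is a dict with 'title', 'headings', 'body', 'inbound_links', 'days_old'."""
    score = 0.0
    for term in query_terms:
        if term in page["title"].lower():
            score += 3.0                                    # title matches weigh heavily
        if any(term in heading.lower() for heading in page["headings"]):
            score += 2.0                                    # headings matter too
        score += 0.1 * page["body"].lower().count(term)     # term frequency
    score += 0.05 * page["inbound_links"]                   # link popularity
    score -= 0.01 * page["days_old"]                        # stale pages are demoted
    return score

example = {"title": "Window styles", "headings": ["Casement windows"],
           "body": "Prices for home windows vary by style and material...",
           "inbound_links": 40, "days_old": 12}
print(score_page(["windows", "styles"], example))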

Some people employ devious means to make pages appear at higher ranks for a search query even though the pages might not contain relevant information. This is known as “spamdexing”, and those who practice it are called search engine spammers.

There is a subtle difference between search engine optimization and spamdexing. Spamdexing is a crooked art aimed at misleading search engines, while SEO is the art of adding quality content to a page so that it gets listed on its own merit.

There is a downside to spamdexing that makes spammers wary: if a search engine concludes that a particular page has resorted to deceitful means to enhance its ranking, the site can be blacklisted and ignored in the future. The golden rule is that a page gets listed only if it has original, quality content.

Future of search engines

Though the current breed of search engines is adequately efficient at retrieving results, they are still poor at interpreting the actual intent of the user. For example, if a user types “windows” hoping to find prices and styles for windows to be fitted in their house, the results would probably be dominated by “Microsoft Windows”, with the information on home window styles buried somewhere within.

This means the user has to painfully sift through a plethora of results to find the relevant answer to their query. Life would be much easier if search engines could decipher the actual intent of the query rather than taking everything at face value. This is exactly the area where the search engine powerhouses are directing their efforts: personalization.

To interpret what a query means, a search engine needs to track the surfing habits of the user, which means the user must part with some privacy to obtain better results. Returning to our earlier example, if a search engine were smart enough to notice that the user sifted through many irrelevant results and selected the one on architectural windows, which is what the user was looking for, it would save this information linked to that particular user.

Then, the next time the user types a similar query, the engine could return more relevant results than it did before. On a deeper level, a search engine could “profile” a user with questions about living habits and preferences, and use this to add further relevance and accuracy to its results. The possibilities for enhancing search engines are limitless, and over time, future search engines could make today’s generation of web search systems look archaic and clunky.
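As a purely speculative sketch of such personalization, a search engine could re-rank results by boosting those that match terms the user has favoured in past clicks; the profile format and scores below are invented for illustration.

def personalize(results, user_profile):
    """results: list of (url, base_score, text); user_profile: term -> affinity."""
    reranked = []
    for url, base_score, text in results:
        boost = sum(affinity for term, affinity in user_profile.items()
                    if term in text.lower())
        reranked.append((base_score + boost, url))
    return [url for _, url in sorted(reranked, reverse=True)]

profile = {"architecture": 2.0, "home": 1.5}   # affinities learned from past clicks
results = [
    ("https://example.com/ms-windows", 5.0, "Microsoft Windows updates and downloads"),
    ("https://example.com/home-windows", 4.0, "Home window architecture and styles"),
]
print(personalize(results, profile))   # the architectural page now ranks first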
