How Search Engine Indexes Work

Anas Shad, search-engines, July 21, 2011 via Flickr, Creative Commons License

To answer the question “how do search engine indexes work?” you first need to understand what they do. The designers of search engines build on academic algorithms to collect, sort, and store data. Those algorithms draw on mathematics, cognitive psychology, physics, linguistics, computer science, and a handful of other disciplines, which lets the software return accurate information in response to each request.

Think in terms of a cookbook. There are hundreds of recipes in the average cookbook. If you want a recipe that contains cantaloupe, you could take the time to look under the general headings of Fruit Dishes or Salads. If you want to move quickly, you could go to the back of the cookbook and look in the index. From here you would be able to find out exactly which recipes contain the ingredient and where they are located. This same concept is used on a larger scale with search engines to track trillions of bytes of digital information.
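To make the cookbook analogy concrete, here is a minimal sketch of that back-of-the-book index in Python. The recipes and the build_index helper are made up for illustration; a real search engine does the same kind of lookup on a vastly larger scale.

```python
from collections import defaultdict

# Hypothetical cookbook: recipe name -> ingredients (made-up data).
recipes = {
    "Melon Salad": ["cantaloupe", "mint", "lime"],
    "Fruit Smoothie": ["banana", "cantaloupe", "yogurt"],
    "Garden Salad": ["lettuce", "tomato", "cucumber"],
}

def build_index(recipes):
    """Map each ingredient to the sorted list of recipes that use it."""
    index = defaultdict(set)
    for name, ingredients in recipes.items():
        for ingredient in ingredients:
            index[ingredient].add(name)
    return {ingredient: sorted(names) for ingredient, names in index.items()}

index = build_index(recipes)

# One lookup replaces paging through every recipe.
print(index["cantaloupe"])  # ['Fruit Smoothie', 'Melon Salad']
```

The work happens once, when the index is built. After that, finding every recipe with cantaloupe is a single lookup instead of a read through the whole book.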

Some search engines index whole sentences of content that help you find particular information while other search engines index individual words, which gives you a larger pool of information to draw from. All of the indexes are finely tuned so that users can access the most relevant information available according to the search terms entered. Here’s a look at the process.

Web Crawling

Before a search engine can provide you with an answer to your query, it must gather data by “crawling” websites. To do this, software bots called “spiders” are sent out into cyberspace to collect data. The spiders usually start at the most popular sites and work their way through to the least popular ones. Google is an example of how a once-small academic search engine can become a giant one. When Google started, it would send out three spiders at a time, each capable of keeping about 300 connections to web pages open at once. These spiders were able to “crawl” roughly 100 pages per second, generating about 600 kilobytes of data every second. It turns out that spiders are hungry beasties and require a lot of territory. As the World Wide Web expanded, so did the search indexes: the spiders kept venturing further, collecting more information, and expanding the database.
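The crawl described above can be sketched in a few lines of Python. This is only an illustration: the tiny TOY_WEB dictionary stands in for the real web, the example domains are made up, and a production spider would fetch pages over the network, respect robots.txt, and run many connections in parallel.

```python
from collections import deque

# Hypothetical miniature "web": page URL -> links found on that page.
TOY_WEB = {
    "popular.example": ["news.example", "shop.example"],
    "news.example": ["blog.example", "popular.example"],
    "shop.example": ["blog.example"],
    "blog.example": ["obscure.example"],
    "obscure.example": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return URLs in visit order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

# Start at the "popular" site and follow links outward.
order = crawl(["popular.example"], lambda url: TOY_WEB.get(url, []))
print(order)
# ['popular.example', 'news.example', 'shop.example', 'blog.example', 'obscure.example']
```

Starting from a well-linked seed and following links breadth-first means the spider reaches the best-connected pages early and the out-of-the-way ones last.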

Meta Tags

Meta tags were developed to fine-tune the information gathered by the spiders. Meta tags allow website owners to provide specific keywords and concept phrases, which lets the spiders categorize and index the pages more efficiently. Meta tags, however, are not perfect. An unscrupulous website owner can add meta tags to her site that reflect popular searches but actually have nothing to do with the website itself. This creates errors within the index. Companies like Google have taken recent measures to eliminate this problem by reprogramming spiders to gather data more organically.
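Here is a rough sketch of how a spider might read a keywords meta tag, using Python's standard html.parser module. The sample page and the unsupported_keywords check are assumptions made for illustration; the check is a crude stand-in for the idea of comparing claimed keywords against what the page actually says, not a description of how any real engine detects abuse.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the keywords meta tag and the text of a page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            for keyword in content.split(","):
                if keyword.strip():
                    self.keywords.append(keyword.strip().lower())

    def handle_data(self, data):
        # Title and body text both arrive here.
        self.text_parts.append(data.lower())

def unsupported_keywords(parser):
    """Keywords claimed in the meta tag that never appear in the page text."""
    text = " ".join(parser.text_parts)
    return [keyword for keyword in parser.keywords if keyword not in text]

# Made-up sample page.
page = """<html><head>
<title>Melon recipes</title>
<meta name="keywords" content="Cantaloupe, fruit salad, Recipes">
</head><body>Chilled cantaloupe soup and more.</body></html>"""

parser = MetaKeywordParser()
parser.feed(page)
print(parser.keywords)               # ['cantaloupe', 'fruit salad', 'recipes']
print(unsupported_keywords(parser))  # ['fruit salad']
```

The second result hints at why meta tags alone cannot be trusted: nothing stops a page from listing keywords that its own text never backs up.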

Creating the Index

Now that the spiders are done gathering data (technically, a spider’s work is never done), they send it home. The home base sorts the data according to two things: the information stored with the data and the method by which that information is indexed.

The most basic method would store just a word and the URL where it is found. Life is not that simple, and neither is information. Search engine databases track not only the word and its location but also how many times the word appears there and at other locations, and those counts establish a “ranking.” The ranking allows a search engine to list the most relevant pages first. To further refine the indexing process, the engine might assign a “weight” to a word based on its importance to the document or website.
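As a small illustration of counts and weights, the sketch below scores each page for each word and ranks the results. The pages are made up, and the weighting rule (a word in the title counts three times as much as a word in the body) is an assumption chosen only for this example; real engines use far more elaborate formulas.

```python
import re
from collections import defaultdict

# Hypothetical pages: URL -> (title, body). Made-up data.
PAGES = {
    "a.example/melons": ("Cantaloupe salad", "Dice the cantaloupe and chill it."),
    "b.example/fruit": ("Fruit basics", "Apples, pears and one cantaloupe."),
    "c.example/bread": ("Bread basics", "Flour, water, salt and yeast."),
}

TITLE_WEIGHT = 3  # assumed: a word in the title counts three times
BODY_WEIGHT = 1

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Build word -> {url: weighted count}."""
    index = defaultdict(lambda: defaultdict(int))
    for url, (title, body) in pages.items():
        for word in tokenize(title):
            index[word][url] += TITLE_WEIGHT
        for word in tokenize(body):
            index[word][url] += BODY_WEIGHT
    return index

def search(index, word):
    """Return (url, score) pairs, highest score first, ties broken by URL."""
    postings = index.get(word.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))

index = build_index(PAGES)
print(search(index, "cantaloupe"))
# [('a.example/melons', 4), ('b.example/fruit', 1)]
```

The first page wins because the word appears in both its title and its body, which is the kind of signal a weight is meant to capture.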

Engines also use a technique called hashing, which assigns a numerical value to each word, in order to spread index entries evenly. If you were to look in the dictionary you would find that the “S” section is much thicker than the “X” section. Since there are so many more “S” words than “X” words, a strictly alphabetical lookup would take longer for words beginning with “S.” Hashing spreads words across the index regardless of their first letter, so every lookup takes about the same amount of time.
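A minimal sketch of the hashing idea follows. It uses CRC-32 from Python's zlib module as the hash function purely because it is stable and built in; the bucket count, the word list, and the HashedIndex class are all assumptions for illustration, not how any particular engine does it.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # assumed bucket count, chosen only for the sketch

# Made-up vocabulary, deliberately heavy on "s" words.
WORDS = ["salad", "salt", "sauce", "soup", "spice", "sugar", "syrup", "xylitol"]

def bucket_for(word):
    """Turn a word into a bucket number with a stable checksum (CRC-32)."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS

class HashedIndex:
    """Store word -> URLs in buckets chosen by hash, not by first letter."""

    def __init__(self):
        self.buckets = [dict() for _ in range(NUM_BUCKETS)]

    def add(self, word, url):
        self.buckets[bucket_for(word)].setdefault(word, set()).add(url)

    def lookup(self, word):
        # Only one bucket is ever examined, whatever the word starts with.
        return self.buckets[bucket_for(word)].get(word, set())

index = HashedIndex()
for word in WORDS:
    index.add(word, "recipes.example/" + word)

by_letter = Counter(word[0] for word in WORDS)
by_bucket = Counter(bucket_for(word) for word in WORDS)
print("by first letter:", dict(by_letter))
print("by hash bucket: ", dict(by_bucket))
print(index.lookup("soup"))
```

Grouped by first letter, seven of the eight words pile up under “s.” Grouped by hash, a word's bucket has nothing to do with its spelling. With only eight words the split may still look lumpy, but over a large vocabulary the buckets tend to even out.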

Creating the Search

Creating a search requires users to make queries. A query can be as simple as a single word, or it can be more complex and use Boolean operators such as “and”, “or”, “followed by”, “near”, and “not.” The act of asking for information helps engines decide which topics carry more interest than others.
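Boolean operators map neatly onto set operations, as the sketch below shows. The three documents are made up, and the docs_with and near helpers are hypothetical names for this example. “Near” needs word positions, so the index here records where each word occurs as well as which document it is in.

```python
import re
from collections import defaultdict

# Hypothetical documents (made-up data).
DOCS = {
    "doc1": "fresh cantaloupe salad with mint",
    "doc2": "cantaloupe soup served cold",
    "doc3": "garden salad with tomato",
}

def build_positions(docs):
    """Build word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[word][doc_id].append(position)
    return index

INDEX = build_positions(DOCS)

def docs_with(word):
    """Set of documents that contain the word."""
    return set(INDEX.get(word, {}))

def near(first, second, max_gap):
    """Documents where the two words occur within max_gap positions."""
    hits = set()
    for doc_id in docs_with(first) & docs_with(second):
        if any(abs(p - q) <= max_gap
               for p in INDEX[first][doc_id]
               for q in INDEX[second][doc_id]):
            hits.add(doc_id)
    return hits

print(sorted(docs_with("cantaloupe") & docs_with("salad")))  # AND -> ['doc1']
print(sorted(docs_with("cantaloupe") | docs_with("salad")))  # OR  -> ['doc1', 'doc2', 'doc3']
print(sorted(docs_with("salad") - docs_with("cantaloupe")))  # NOT -> ['doc3']
print(sorted(near("cantaloupe", "mint", 3)))                 # NEAR -> ['doc1']
```

A “followed by” operator could be built the same way by requiring the second word's position to be greater than the first's.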

Search engines can operate on a literal, conceptual, or natural language basis. A literal search can provide misleading results. For example, someone searching for “bed” may come up with results having nothing to do with what they are looking for. A “bed” can be something you sleep in, a place to plant flowers, a nest for hens to lay eggs, the back of a pick-up truck, and many other things.

A conceptual search is based on statistical analysis; it has yet to be fully implemented but is being actively researched. An example of a concept-based search is Google’s anticipatory search. Google collects data on account users based on the places they frequent, the emails they receive, and the literal searches they perform. Google then uses the information to anticipate what you are looking for.

For natural language queries, think Ask Jeeves, where you ask a question just as you would ask a person and the engine produces results based on how you and others have asked the same question. On many levels this is very efficient.

Faults

These indexes are not without flaws. Sometimes there are problems relating to language barriers and newly published documents. After all, words that mean one thing in one language can mean something else in another (especially if they are used as slang). Consequently, you might end up with less refined search results. It’s also possible for new information to be too new. A search engine may be indexing the new information but might not have had ample time to prepare it for searching. The result is that the searcher sees older information until the spiders have finished indexing the new document.

Search engines and their indexes are constantly evolving, adding new data and developing new ways to bring users that information more quickly. Search engine designers are seeking better ways to identify the data while providing more accurate intelligence, and it is these ambitions that will usher in an age of effortless inquiry at a user’s fingertips.

  • http://www.insoft.com/ Sudhakar

    Hi, I have one doubt about Google indexing. I have two different sites with different themes, but the main content of the two sites is the same. Will that cause any problem with Google indexing? If it does, please advise how to solve my issue. I don't want to use canonical tags or 301 redirects because the two are different sites; only the main content is the same.

    • namedotcom

      Hey Sudhakar, you'll definitely want to make sure that you have unique content on both sites; otherwise you'll get dinged. The simplest way to think of SEO is that if you genuinely position yourself as a thought leader and produce relevant, unique content, you'll come out on top. We're good friends with Perry of SEO Sherpas (http://www.seosherpas.com/), and he may be able to help you with more specifics.