Instructional Module W17a

Search Engine Basics


Overview: Anatomy of a Search Engine

Web search engines are made of three parts:

  1. User Interface, or "front-end"
    This is the part we see. Its function is to take our request for information, get the information from the database, put it in order, and send it to us.
  2. Database, or "back-end"
    Here, vast amounts of information are processed and stored in carefully-organized form to make it quick to retrieve.
  3. Crawlers, also known as "spiders" or "robots"
    These agents systematically move around the Web harvesting information.

Figure 1 shows the relationship between the three parts.

Figure 1: Anatomy of a Search Engine
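One common way to organize a back-end database for quick retrieval is an inverted index: a table that maps each word to the pages containing it. The sketch below is only an illustration of that idea, not the design of any particular search engine; the page addresses and text are made up for the example.

```python
from collections import defaultdict

# Hypothetical pages, standing in for what the crawlers send back.
PAGES = {
    "example.org/cats": "cats are small furry animals",
    "example.org/dogs": "dogs are loyal animals",
    "example.org/cars": "cars are fast",
}

def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for word in text.lower().split():
            index[word].add(address)
    return index

index = build_index(PAGES)

# Looking up a word is now a single dictionary access,
# however many pages were stored.
print(sorted(index["animals"]))
```

The work of reading every page is done once, when the index is built, so that each later lookup is fast.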


Crawlers
What They All Do

Crawlers are software programs that run on computers belonging to the search engine. (They don't actually move from one computer to another.) The job of the crawlers is to automatically harvest information. This is the general pattern of their actions:

  1. Start at a page determined by humans to be useful.
  2. Send the page back to the Search Engine to be processed.
  3. Follow links to other pages, sending back information and following links on them.

In this way, information from millions of Web pages is sent back to the search engines.
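The three steps above can be sketched in a few lines of Python. To keep the example runnable without a network connection, it assumes a tiny made-up "Web" held in a dictionary (the page names are hypothetical); a real crawler would download each page and extract its links instead.

```python
from collections import deque

# A tiny made-up "Web": each address maps to the links found on that page.
TINY_WEB = {
    "start": ["a", "b"],
    "a": ["b", "c"],
    "b": ["start"],
    "c": [],
    "unlinked": ["a"],  # no page links here, so the crawler never finds it
}

def crawl(web, start_page):
    """Visit every page reachable from start_page by following links."""
    sent_back = []              # pages "sent back" to the search engine
    seen = {start_page}         # step 1: start at a chosen page
    to_visit = deque([start_page])
    while to_visit:
        page = to_visit.popleft()
        sent_back.append(page)          # step 2: send the page back
        for link in web.get(page, []):  # step 3: follow its links
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return sent_back

print(crawl(TINY_WEB, "start"))
```

The `seen` set keeps the crawler from visiting the same page twice when pages link to each other. Note that the page named "unlinked" is never reached, because no other page points to it.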

Differences


Search engine crawlers aren't all programmed the same. There are a number of ways in which their designers try to make them better, or more specialized.

  • Where they start: Crawlers are generally started on pages with lots of links for them to follow. For example, starting a crawler on the Yahoo home page would give it links to many categories of Web sites, including Autos, Finance, Games, Groups, HotJobs, Maps, Mobile Web, Movies, Music, Personals, Real Estate, Shopping, Sports, Tech, Travel, TV, and Yellow Pages. A crawler could spend many happy hours following all those links!
  • What sort of information do they bring back from each page? Some once brought back only a summary, but almost all now bring back the entire page, including images.
  • How far do they go when following links? Some go to the first few pages on a site; most go as far as they can.
  • What sort of site do they visit - all, or only certain kinds? The best-known search engines visit all the sites they can find. Some are specialized, searching only certain types of site, such as government sites or health-related sites. (Specialized search engines can save you time!)
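Two of these differences, how far a crawler goes and what sort of site it visits, can be pictured as settings on the crawler. The following is a sketch under stated assumptions: the link graph and site names are invented, and `max_depth` and `allowed_hosts` are hypothetical parameter names, not options of any real crawler.

```python
from collections import deque
from urllib.parse import urlparse

# Made-up link graph using full URLs so that sites can be told apart.
LINKS = {
    "http://health.example/": ["http://health.example/flu", "http://shop.example/"],
    "http://health.example/flu": ["http://health.example/flu/vaccine"],
    "http://health.example/flu/vaccine": [],
    "http://shop.example/": ["http://shop.example/deals"],
    "http://shop.example/deals": [],
}

def crawl(links, start, max_depth=None, allowed_hosts=None):
    """Breadth-first crawl with an optional depth limit and site filter."""
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # "how far": do not follow links from this page
        for link in links.get(url, []):
            host = urlparse(link).hostname
            if allowed_hosts is not None and host not in allowed_hosts:
                continue  # "what sort of site": skip other sites
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# A general crawler goes everywhere; a specialized one stays on health sites.
print(len(crawl(LINKS, "http://health.example/")))
print(crawl(LINKS, "http://health.example/", allowed_hosts={"health.example"}))
```

With no limits the crawler reaches all five pages; with the site filter it reaches only the three pages on the health site.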
Learning Options

Learn more about any of these topics by preparing a PowerPoint presentation:

  • Compare crawlers - Compare any two crawlers, researching and listing the differences between:
    • where they start
    • what sort of information they bring back
    • how far they go
    • what sort of site they visit
  • Caution: this topic may require extensive searching.


User Interface
What They All Do

What is a hit?
A hit is an entry in the database that matches one or more of the search words entered by the user.


The User Interface (UI) is what we see when we use a search engine. It's more than a pretty page, of course! The UI is responsible for these jobs:

  • Welcome the user and make it easy to figure out how to use the system;
  • Accept user input and parse it into recognizable words and commands (such as quoted strings, "+", "-", AND, OR);
  • Find relevant hits in the database;
  • Compute the degree of relevance of each hit;
  • Sort the hits into order by relevance (and possibly slip a "sponsored link" in the right place);
  • Show the list of hits to the user.
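The parsing job in the list above can be sketched as follows. This is a simplified illustration, not the grammar of any real search engine: it assumes that quoted strings stay together as one term, that a leading "+" means "required" and a leading "-" means "excluded", and that only upper-case AND and OR count as commands. The function name `parse_query` and the labels it returns are made up for the example.

```python
import re

# A quoted phrase (optionally prefixed by + or -), or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)"([^"]*)"|(\S+)')

def parse_query(query):
    """Split a query into (kind, text) pairs.

    kind is "operator", "require", "exclude", or "term".
    """
    parsed = []
    for match in TOKEN.finditer(query):
        prefix, phrase, bare = match.groups()
        if bare is not None:
            if bare in ("AND", "OR"):
                parsed.append(("operator", bare))
                continue
            # Peel a leading + or - off a bare word such as -cheap.
            if len(bare) > 1 and bare[0] in "+-":
                prefix, text = bare[0], bare[1:]
            else:
                prefix, text = "", bare
        else:
            text = phrase  # the words inside the quotes, kept together
        kind = {"+": "require", "-": "exclude"}.get(prefix, "term")
        parsed.append((kind, text.lower()))
    return parsed

for item in parse_query('"new york" +hotels -cheap OR motels'):
    print(item)
```

Once the request has been broken down this way, the remaining jobs (finding hits, scoring them, and sorting them) can work with clean terms and commands rather than raw text.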
Differences


Differences in the UI are the most obvious differences between search engines. They include:

  • Complexity of the starting page;
  • Type of search commands recognized;
  • How relevance is computed:
    • Frequency of search terms in the pages
    • Number of links pointing to pages
  • Order in which hits are presented
    • Relevance only: most search engines
    • Relationship of one to another: Kartoo
  • Which facts are presented about each Web page: Originally, search engines presented different selections of facts about hits; now they mainly opt for simplicity, and include only:
    • Page title
    • Two lines with the context of hit words
    • URL
    • Size
    • Whether or not available in cache
  • Visual style of the page and additional material presented there.
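The two relevance factors listed above, frequency of search terms and number of links pointing to a page, can be combined into a single score. The sketch below simply adds them together; that formula, the `link_weight` parameter, and the sample pages are assumptions made for illustration. Real search engines use far more elaborate and closely guarded formulas.

```python
# Made-up pages: the text of each page and the pages it links to.
PAGES = {
    "a": {"text": "search engines crawl the web", "links": ["b"]},
    "b": {"text": "web search web crawlers web index", "links": ["c"]},
    "c": {"text": "cooking recipes", "links": ["b"]},
}

def term_frequency(text, words):
    """Count how many times any of the search words appears in the text."""
    tokens = text.lower().split()
    return sum(tokens.count(word) for word in words)

def inbound_links(pages, target):
    """Count how many other pages link to the target page."""
    return sum(
        target in page["links"]
        for name, page in pages.items()
        if name != target
    )

def rank_hits(pages, query, link_weight=1.0):
    """Return the names of matching pages, most relevant first."""
    words = query.lower().split()
    scored = []
    for name, page in pages.items():
        frequency = term_frequency(page["text"], words)
        if frequency == 0:
            continue  # not a hit: none of the search words appear
        score = frequency + link_weight * inbound_links(pages, name)
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically by page name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored]

print(rank_hits(PAGES, "web search"))
```

Page "b" comes first because it mentions the search words more often and has two pages linking to it; page "c" is not a hit at all.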
Learning Options

Learn more about any of these topics by preparing a PowerPoint presentation:

  • Compare search engine user interfaces - choose one search engine and one meta-search service. Compare the differences in their:
    • complexity
    • type of search command recognized
    • order in which hits are presented
    • facts presented on the page
    • visual style
    • overall user friendliness
  • Caution: this topic may require extensive searching.
 
Extending Search
Meta Services


With all the differences between search engines, it's often a good idea to ask more than one. This is where the meta-search services come in. They submit your search to multiple search engines, and organize the results for you.
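Here is a minimal sketch of what "submit to multiple search engines and organize the results" might look like. The two engine functions are stand-ins that return fixed lists of made-up URLs, and the merging rule (add up 1/position across engines) is just one simple choice; it is not claimed to be how any of the services named in this module work.

```python
from collections import defaultdict

# Stand-ins for real search engines: each returns a ranked list of URLs.
def engine_one(query):
    return ["http://a.example", "http://b.example", "http://c.example"]

def engine_two(query):
    return ["http://b.example", "http://d.example", "http://a.example"]

def meta_search(query, engines):
    """Send the query to every engine and merge the ranked lists."""
    scores = defaultdict(float)
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            # A hit near the top of any list earns more credit.
            scores[url] += 1.0 / position
    # Each URL appears once, ordered by its combined score.
    return sorted(scores, key=lambda url: (-scores[url], url))

for url in meta_search("anything", [engine_one, engine_two]):
    print(url)
```

A page that both engines rank highly rises to the top of the merged list, and duplicates are shown only once.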

There are two kinds of meta services:

  • on-line: they look just like a regular search engine, through a Web browser;
  • desktop: these are tools that are downloaded and installed on your computer

On-Line Meta Search

  • Clusty: clusters results by topic
  • Dogpile: returns results from most of the standard search engines, in standard format
  • Ixquick: returns results from most of the standard search engines, in standard format
  • Kartoo: shows relations between topics visually
  • Mamma: free on-line service from Copernic; returns results from most of the standard search engines, in standard format
  • Seekz: returns results from most of the standard search engines, in standard format
  • SuperCrawler: includes topic-specific directories; returns results from most of the standard search engines, in standard format
  • MyWebSearch: use Ask or Yahoo (your choice); select topics from list of categories; simple listing

Desktop Meta Search

Two main kinds:

  • Web search: you can usually set them up to search automatically, on a schedule, to keep you updated on topics important to you
  • Hard drive search: lets you do a search of your own computer, as you would search the Web
Learning Options

Learn more about any of these topics by preparing a PowerPoint presentation:

  • Meta-search engines - select any three meta-search services and evaluate what value they add to the search process for you, compared to individual search engines.
Desktop Web Search


The market for these is shrinking, as on-line services provide similar features:

  • Copernic: aimed at corporate desktop users; three versions of which the Basic is free
  • LemmeFind: free Web search
Desktop Hard Drive Search


These are increasing in popularity as hard drive capacity increases, and it gets harder to find files you saved last year - or yesterday.
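The basic idea of a hard drive search can be sketched with Python's standard library. This is an illustration only: it scans a throwaway folder of made-up files each time it is asked, whereas real desktop search tools typically build an index ahead of time. The function name `search_files` is hypothetical.

```python
import os
import tempfile

def search_files(root, word):
    """Return paths under root whose file name or text contents mention word."""
    word = word.lower()
    matches = []
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            found = word in filename.lower()
            # Only look inside plain text files in this sketch.
            if not found and filename.lower().endswith(".txt"):
                try:
                    with open(path, encoding="utf-8", errors="ignore") as handle:
                        found = word in handle.read().lower()
                except OSError:
                    pass  # unreadable file: skip it
            if found:
                matches.append(path)
    return sorted(matches)

# Demonstration on a temporary folder rather than a real hard drive.
SAMPLE_FILES = {
    "budget-2008.txt": "rent and groceries",
    "notes.txt": "remember the budget meeting",
    "todo.txt": "buy milk",
}
with tempfile.TemporaryDirectory() as root:
    for name, text in SAMPLE_FILES.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    found = [os.path.basename(p) for p in search_files(root, "budget")]

print(found)
```

The search finds one file by its name and another by its contents, which is what makes this kind of tool useful when you have forgotten what a file was called.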

Dark Matter


Lots of information on the Web is not available to search engines. Why not?

Increasingly, large sites do not store information directly on the Web. Instead, they store the information in databases. People can get access to this information by going to the site's home page and typing in a question or request. The site's Web server sends the question to the database, which returns the data. The data is formatted in a helpful way that looks like a Web page, but is not accessible by following links, and so is not easily available to search engines.

Dark matter
the information on the Web that is not accessible to search engines.
This database information is informally known as dark matter, using a metaphor from astronomy. In astronomy, dark matter is the part of the universe that doesn't emit energy we can detect. Calculations have shown that the mass of the universe is much greater than what we can actually detect, so the majority must be "dark matter".

What do search engines do about dark matter? Some of it can be retrieved by special crawlers equipped with lists of common search terms, or by closely guarded techniques. Most of it is accessible simply by going to a Web site and searching it yourself.
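The difference between an ordinary crawler and one equipped with a list of search terms can be shown with a small model. Everything here is made up for illustration: the site, its pages, its book database, and the function names. The "closely guarded techniques" mentioned above are not public and are not represented.

```python
# A made-up site: a few linked pages plus a catalog stored in a database.
LINKED_PAGES = {
    "/": ["/about", "/search"],
    "/about": [],
    "/search": [],  # the search form itself has no links to the records
}
DATABASE = {
    "moby dick": "Moby Dick, by Herman Melville",
    "emma": "Emma, by Jane Austen",
    "walden": "Walden, by Henry David Thoreau",
}

def site_search(term):
    """What the site's server does when a visitor types a request."""
    return DATABASE.get(term.lower())

def crawl_links(start="/"):
    """An ordinary crawler: it only sees pages reachable by links."""
    seen, to_visit = set(), [start]
    while to_visit:
        page = to_visit.pop()
        if page not in seen:
            seen.add(page)
            to_visit.extend(LINKED_PAGES[page])
    return seen

def probe_with_terms(terms):
    """A special crawler: it tries a list of common search terms on the form."""
    results = {}
    for term in terms:
        record = site_search(term)
        if record is not None:
            results[term] = record
    return results

found_by_links = crawl_links()
found_by_probing = probe_with_terms(["emma", "hamlet", "walden"])
print(sorted(found_by_links))
print(sorted(found_by_probing))
```

Following links finds the three ordinary pages but none of the books. Probing finds two of the books, but "moby dick" stays dark because that term was not on the crawler's list.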

Since most e-commerce sites use databases but want their "dark matter" to be found, many of them have special arrangements with the major search engines to bring their content up in hit lists. For example, you can search for the title of a book, and Amazon.com will provide a link to their information about the book (hoping you'll order a copy from them!).

Learning Options

Learn more about any of these topics by preparing a PowerPoint presentation:

  • Dark matter on the Web - research and present information on "dark matter": information accessible to people surfing the Web, but difficult for search engines to access. Include information on:
    • What kinds of information are difficult for search engines to access?
    • How do the major search engines attempt to get access to this information?
    • What arrangements are made by e-commerce sites to make their information come up in search engine hit lists?
  • Caution: this topic may require extensive searching.
About Cache

Have you noticed the little link that says cached at the end of many hits?

What does cache mean?
According to Webster:
Pronounced \kash\
1 a: a hiding place especially for concealing and preserving provisions or implements
b: a secure place of storage
2: something hidden or stored in a cache
vt: to place, hide, or store in a cache

When a search engine crawls a page, it brings back the contents for indexing. Most search engines also save the page in their cache and make it available through the link in the hit entry.

Why do this? Web pages change quickly. You can't assume that a page listed in the hit list is actually there. It might have been moved to a new URL, or it might have been removed because of being old.
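The idea can be modeled in a few lines. In this sketch the "live Web" and the cache are both plain dictionaries, and the URL, page text, and date are invented; it only illustrates that the cached copy survives after the original page is gone.

```python
# Made-up live Web; pages can disappear from it at any time.
live_web = {"http://news.example/story": "Original story text"}
cache = {}

def crawl_page(url, crawl_date):
    """Fetch a page for indexing and keep a dated copy in the cache."""
    content = live_web[url]
    cache[url] = {"content": content, "saved": crawl_date}
    return content

def view_cached(url):
    """What the 'cached' link shows: the copy saved at crawl time."""
    entry = cache.get(url)
    return entry["content"] if entry else None

crawl_page("http://news.example/story", "2009-08-01")
del live_web["http://news.example/story"]  # the page is later removed

print("http://news.example/story" in live_web)
print(view_cached("http://news.example/story"))
```

After the page is deleted from the live Web, the cached link still shows the text as it was on the day it was crawled.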

Sometimes, Web pages are removed because their content has caused controversy. In those cases, the cached version on a search engine may be the only way to get information about the controversy. These caches themselves are controversial, though. Often, the people who remove a Web page want to completely prevent people from getting the information. In those cases, they may request the search engines to remove the cached pages. Should the search engines comply? That's the controversy!

Learning Options

Learn more about any of these topics by preparing a PowerPoint presentation:

  • Controversial Caching - research and report on controversies involving cached information on Web search engines.
    • Which search engines cache pages?
    • What is each search engine's policy on removing cached pages?
    • How long are cached pages kept on each?
    • Has legal action been taken against search engines to compel them to remove cached pages? Have the courts sided with the search engines or the plaintiffs?


About This Document
Audience

This module is for people who want to know how search engines work.

 

Objectives

On successful completion of this module, you will be able to:

  1. List the three main components of any search engine, and explain the purpose of each;
  2. Explain how a search engine finds information;
  3. Discuss the factors that distinguish search engines in how they find information;
  4. List the types of information that may be stored in search engine databases;
  5. Define "hit" in the context of search engine use;
  6. Discuss differences between the way search engines handle search terms, including normal, advanced, and the use of boolean operators;
  7. Explain the concept of ranking hits;
  8. Discuss the impact of commercial interests on search engines;
  9. Describe the “advanced search” functions of search engines.

Module W17a: Search Engine Basics
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, W: World Wide Web. This document has been used in the following classes: INP 160.
History:
Original: 16 October 2003, by Laurence J. Krieg
Last modification: Monday, 31-Aug-2009 11:48:01 EDT
Copyright:
Copyright © 2003-2008, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.
