Building Your Own Search Engine From Scratch

What Underlies Modern Search? Simple Concepts.

David Yastremsky
Dev Genius


How many times do you search the web daily? 5, 20, 50? If Google is your search engine of choice, you can look back through your full search history in your Google account.

Despite how deeply search underlies our daily activities and interactions with the world, few of us understand how it works. In this post, I aim to illuminate the underpinnings of search, drawing on my experience implementing a search engine based on the original Google design.


First, we will look at a preliminary step: understanding web servers. What is a client-server infrastructure? How does your computer connect to a website?

You will get to see what happens when a search engine connects with your computer, a website, or anything else.

Then, we will look through the three base parts of a search engine: the crawler, indexer, and PageRank algorithm. Each of these will give you a rich understanding of the connections that make up the spider web that is the internet.

Finally, we will go through how those components are combined to produce our final prize: the search engine! Ready to dive in? Let’s go!

Part 0: Web Servers

The mighty web server! A web server is what your computer contacts every time you search a URL in your browser. Your browser acts as a client, sending a request, similar to a business client. The server is the sales representative who takes all those requests, processing them in parallel.

The requests are plain text. The server knows how to read them because it expects them in a specific structure (the most widely supported protocol for this is HTTP/1.1).

A sample request:

GET /hello HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: www.sample-why-david-y.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: user=why-david-y

The request can carry arguments, such as query parameters in the URL and the cookies listed in its headers. It may also have a body with more information. The response follows a similar format, allowing the server to return headers and a body for the client to read. As the web has become more interactive, much of the hard work of generating content happens on the server before the response is sent back.
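For illustration, a minimal response to the request above might look like this (real responses usually carry more headers):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive

<html>Hello!</html>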

If you ever want to program a web server or client, there are many libraries available to do most of the parsing and low-level work for you. Just keep in mind that client requests and server responses are simply structured text. This means we are all speaking a common language!
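As a minimal sketch, here is a toy server built on Python's standard http.server module. It answers the GET /hello request shown earlier and nothing else; the hostname and port are arbitrary choices for illustration:

# toy_server.py: a minimal web server sketch using Python's standard library
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer GET /hello with a tiny HTML body; return 404 for anything else.
        if self.path == "/hello":
            body = b"<html>Hello!</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=UTF-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    # Visit http://localhost:8000/hello in a browser to see the response.
    HTTPServer(("localhost", 8000), HelloHandler).serve_forever()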

Part 1: Crawlers

If you are building a search engine, the crawler is where you spend a good chunk of your time. The crawler browses the open internet, starting with a predefined list of seeds (e.g. Wikipedia.com, WSJ.com, NYT.com). It reads each page, saves it, and adds any new links to its URL frontier, the queue of links it still has to crawl.


Many domains also have a robots.txt file, such as google.com/robots.txt. This file specifies rules the crawler must respect to stay in site owners' good graces and avoid being treated as a spammer. For example, certain paths may be off-limits to crawlers, and there may be a minimum time required between visits.
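A simplified robots.txt might look like the following (the paths are made up, and the Crawl-delay directive is honored by some crawlers but not all):

User-agent: *
Disallow: /private/
Disallow: /search
Crawl-delay: 10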

Why is so much time spent here? The internet is very much unstructured, an anarchist’s dream. Sure, we may have some norms that we agree on, but you will never realize how much they get broken until you write a crawler.

For example, say your crawler reads HTML pages, since those have structure. An author could still put non-links inside link tags, breaking implicit assumptions in your program. There could be email addresses (test@test.com), sentences, and other text that your link validation may miss.

What if a page generates dynamic content and looks different on every visit, for example by including the current time? What if page A redirects to B, B redirects to C, and C redirects back to A? What if a calendar has endless links to future years or days?

These are some of the many cases that can arise when crawling millions of pages, and every edge case needs to be handled or at least recoverable.
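Putting these ideas together, a bare-bones crawler loop might look like the sketch below. It uses only Python's standard library; the seed URL, the page limit, and the commented-out save_page call are assumptions for illustration, and a production crawler would add politeness delays, deduplication, and far more error handling:

# crawler_sketch.py: a bare-bones crawler loop (illustrative only)
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url):
    # Consult the domain's robots.txt; assume allowed if it cannot be read.
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
        return rp.can_fetch("*", url)
    except Exception:
        return True

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # URL frontier: queue of links to visit
    seen = set(seeds)                # avoid queueing the same URL twice
    while frontier and max_pages > 0:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # broken links, timeouts, non-HTML, etc.
        max_pages -= 1
        # save_page(url, html)       # hypothetical: persist the page for indexing
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

if __name__ == "__main__":
    crawl(["https://en.wikipedia.org/wiki/Web_crawler"])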

Part 2: Indexing

Once you have the crawled content saved in a database, next comes indexing! When a user searches for a term, they want accurate results quickly, which is why indexing is so important. You decide which metrics matter most to you, then pull them from the crawled documents. Here are some common ones:

  • Forward Index: This is a data structure holding a list of documents with their associated words, in order. For example:
document1, <word1, word2, word3>
document2, <word2, word3>
  • Inverted Index: This is a data structure holding a list of words, each with the documents containing that word. For example:
word1, <document2, document3, document4>
word2, <document1, document2>
  • Term Frequency (TF): This is a metric stored for each unique word in each document. It is commonly calculated as the number of occurrences of that word divided by the number of words in a document, resulting in a value between 0 and 1. Some words may get weighted more heavily (e.g. special tags) and the TF may be normalized, preventing extreme values.
  • Inverse Document Frequency (IDF): This is a metric stored for each unique word. It is commonly calculated as the logarithm of the total number of documents divided by the number of documents containing that word, so rarer words carry more weight. Given that it requires the total number of documents, it is usually calculated after crawling or at query time. It may be normalized to prevent extreme values.
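To make those definitions concrete, here is a small sketch that builds a forward index, an inverted index, and TF/IDF values over two toy documents (the documents and the log-based IDF formula are illustrative choices, not the exact formulas any particular engine uses):

# index_sketch.py: toy forward index, inverted index, TF, and IDF
import math
from collections import defaultdict

documents = {
    "doc1": "earth day is celebrated every year",
    "doc2": "earth orbits the sun every year",
}

forward_index = {}                       # doc -> list of words, in order
inverted_index = defaultdict(set)        # word -> set of docs containing it
tf = defaultdict(dict)                   # doc -> {word: term frequency}

for doc, text in documents.items():
    words = text.split()
    forward_index[doc] = words
    for word in words:
        inverted_index[word].add(doc)
    for word in set(words):
        tf[doc][word] = words.count(word) / len(words)

# IDF: log of (total documents / documents containing the word)
total_docs = len(documents)
idf = {word: math.log(total_docs / len(docs))
       for word, docs in inverted_index.items()}

print(inverted_index["earth"])           # {'doc1', 'doc2'} (order may vary)
print(tf["doc1"]["earth"], idf["earth"]) # 1/6 ≈ 0.167, log(2/2) = 0.0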

With these four values, you can design an indexer that returns accurate results. Thanks to the optimizations in modern databases, the results will also be reasonably fast. Using MongoDB, our project returned results in approximately two seconds, even for longer queries. You can do even more with just these four metrics, for example supporting exact-match queries.

These were essential metrics used by search engines in the early days. Now, search engines use these plus many more to fine-tune their results further.

How do we combine these to generate results? We will discuss that in the integration section.

Part 3: PageRank

PageRank is an algorithm that determines the authoritativeness of a page on the internet. Say someone is searching “Earth Day.” We need to look at how trustworthy a page is. If we do not, our search engine could send them to a random blog page that says “Earth Day” over and over, rather than a Wikipedia or EarthDay.org page. With the prevalence of SEO and marketers attempting to drive traffic to a page, how do we ensure users get quality results?

PageRank looks at links between pages, treating the web as a directed graph (a set of nodes connected by edges). Each edge is a connection between two nodes in the direction it points (source URL to destination URL).

In each iteration, the algorithm looks at all the URLs pointing to a page, say Google.com. Each referring page passes Google.com a share of its own PageRank, split across all of the pages it links to. After a few iterations, the PageRank values become relatively stable and the algorithm terminates.

PageRank example graph (source: https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg)

There are other tricks as well, like the random surfer model, which assumes that some percentage of the time the user gets bored and jumps to a random page instead of following a link. These tricks help avoid corner cases in PageRank. For example, sinks are pages with no outbound links, which would otherwise absorb PageRank without passing any of it along.
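Here is a minimal sketch of that iteration over a tiny hand-made link graph, using the common damping factor of 0.85 to model the random surfer (the graph and constants are illustrative, and dangling sinks would need extra handling):

# pagerank_sketch.py: simplified PageRank iteration with a damping factor
links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=20):
    n = len(links)
    rank = {page: 1.0 / n for page in links}           # start with equal rank
    for _ in range(iterations):
        # Every page keeps a small base amount (the random surfer jumping in).
        new_rank = {page: (1 - damping) / n for page in links}
        for page, outlinks in links.items():
            # Split this page's rank evenly among the pages it links to.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))   # C ends up with the highest rank: both A and B point to it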

Putting It All Together

You have the main pieces for a search engine now.

When a user searches a phrase, you look up which documents have each of the query terms in them. Your database returns documents that match all terms.

For every matching document, you can take the TF-IDF (TF * IDF) of each query term and sum them together. Then combine that sum with the page's PageRank (e.g. by multiplying them together). This is more art than science, so leave time to see what works, fine-tuning as you go.
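As a sketch of one such combination, here is a toy scoring function with made-up tf, idf, and pagerank lookups in the shape of the structures built earlier (multiplying the TF-IDF sum by PageRank is just one reasonable choice, not a prescribed formula):

# ranking_sketch.py: combine TF-IDF with PageRank to rank matching documents
# Toy lookups; in practice these come from the indexer and the PageRank stage.
tf = {"doc1": {"earth": 0.17, "day": 0.17}, "doc2": {"earth": 0.14}}
idf = {"earth": 0.0, "day": 0.69}
pagerank = {"doc1": 0.6, "doc2": 0.4}

def score(doc, query_terms):
    # Sum TF * IDF over the query terms, then weight by the page's PageRank.
    tfidf_sum = sum(tf[doc].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)
    return tfidf_sum * pagerank.get(doc, 0.0)

def search(query):
    terms = query.lower().split()
    # Keep only documents containing every query term, then sort by score.
    matches = [doc for doc in tf if all(t in tf[doc] for t in terms)]
    return sorted(matches, key=lambda d: score(d, terms), reverse=True)

print(search("earth day"))   # ['doc1']: only doc1 contains both terms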

Your engine can now return sorted results whenever a client makes a request to your server’s URL. As discussed in part 0, all this work is being done within the server. The client makes the request, and your server returns the results to the client in the expected format.

From here, you can:

  • Add new indexing metrics to your database to create higher-quality results
  • Optimize your queries to speed up query times
  • Build new features in your engine

Congratulations, you can now build your own search engine!
