Do you know how search engines like Google find, crawl, and rank the trillions of web pages out there in order to serve up the results you see when you type in a query?
While the details of the process are actually quite complex, knowing the (non-technical) basics of crawling, indexing and ranking can put you well on your way to better understanding the methods behind a search engine optimization strategy.
A Massive Undertaking
At the time of writing, Google says it knows of more than 130 trillion pages on the web. In actuality, it’s probably far more than that number. There are many pages that Google keeps out of the crawling, indexing and ranking process for various reasons.
To keep results as relevant as possible for users, search engines like Google have a well-defined process for identifying the best web pages for any given search query. This process evolves over time as the search engines work to make results even better.
Basically, we’re trying to answer the question: “How do Google search results work?” In a nutshell, this process involves the following steps:
- Crawling – Following links to discover the most important pages on the web
- Indexing – Storing information about the crawled pages for later retrieval
- Ranking – Determining what each page is about, and how it should rank for relevant queries
Let’s look closer at a simplified explanation of each …
Crawling the Web
Search engines have crawlers (aka spiders) that “crawl” the World Wide Web to discover the pages that exist, which helps identify the best web pages to evaluate for a query. Crawlers travel by way of website links.
These website links bind together pages in a website and websites across the web, and in doing so, create a pathway for the crawlers to reach the trillions of interconnected website pages that exist.
How about a visual example? In the figure below, you can see a screenshot of the home page of USA.gov:
Crawling the entire web each day would be too big an undertaking, so Google typically spreads its crawl over a number of weeks. In addition, as mentioned earlier, search engines like Google don’t crawl each and every web page that exists.
Instead, they start with a trusted set of websites that serve as the basis for determining how other websites measure up, and by following the links they see on the pages they visit, they expand their crawl across the web.
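To make the idea concrete, here is a minimal sketch of link-following discovery. It is not how Google's crawler is built; it is a toy breadth-first crawl over a made-up, in-memory "web". The URLs, the `TOY_WEB` dictionary, and the `crawl` and `toy_links` functions are all hypothetical names invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "https://seed.example/": ["https://seed.example/about", "https://other.example/"],
    "https://seed.example/about": ["https://seed.example/"],
    "https://other.example/": ["https://other.example/deep"],
    "https://other.example/deep": [],
    "https://orphan.example/": [],  # nothing links here, so it is never discovered
}


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then follow links to unseen pages."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # queue each page only once
                seen.add(link)
                queue.append(link)
    return visited


def toy_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_WEB.get(url, [])


discovered = crawl(["https://seed.example/"], toy_links)
print(discovered)
```

Notice that the orphan page never shows up in the results: with no links pointing to it, a crawler that travels by links has no path to reach it.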
Indexing the Data
Indexing is the act of adding information about a web page to a search engine’s index. The index is a collection of web pages—a database—that includes information on the pages crawled by search engine spiders.
The index catalogs and organizes:
- Detailed data on the nature of the content and topical relevance of each web page
- A map of all the pages that each page links to
- The clickable (anchor) text of any links
- Other information about links, such as if they are ads or not, where they are located on the page, and other aspects of the context of the link and what that implies about the page receiving the link
… and more.
The index is the database in which search engines like Google store data, and from which they retrieve it when a user types a query into the search engine. Before deciding which web pages to show from the index and in what order, search engines apply algorithms to help rank those web pages.
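As a rough illustration of the kinds of data described above, here is a small sketch of an index that records each page's words, the pages it links to, and the anchor text of those links. Real search indexes are far more elaborate; the `ToyIndex` and `PageRecord` classes, their fields, and the example URLs are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    words: list                                    # words found on the page
    outlinks: dict = field(default_factory=dict)   # target URL -> anchor text


class ToyIndex:
    def __init__(self):
        self.pages = {}                            # URL -> PageRecord
        self.postings = defaultdict(set)           # word -> URLs containing it
        self.inbound_anchors = defaultdict(list)   # target URL -> anchor texts

    def add_page(self, url, text, links):
        """Store a page's words, its outbound links, and their anchor text."""
        words = text.lower().split()
        self.pages[url] = PageRecord(words=words, outlinks=dict(links))
        for word in words:
            self.postings[word].add(url)
        for target, anchor in links.items():
            self.inbound_anchors[target].append(anchor)

    def lookup(self, word):
        """Return the URLs of pages that contain the given word."""
        return sorted(self.postings.get(word.lower(), set()))


index = ToyIndex()
index.add_page(
    "https://a.example/",
    "Fresh bread recipes",
    {"https://b.example/": "sourdough guide"},
)
index.add_page("https://b.example/", "Sourdough bread starter guide", {})

print(index.lookup("bread"))
print(index.inbound_anchors["https://b.example/"])
```

Keeping a word-to-pages mapping means a query doesn't require rereading every stored page, and the anchor text recorded for a link gives a hint about the page on the receiving end.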
Ranking the Results
In order to serve up results to the search engine’s end user, search engines must perform some critical steps:
- Interpreting the intent of the user query
- Identifying web pages in the index related to the query
- Ranking and returning those web pages in order of relevance and importance
This is one of the major areas where search engine optimization comes in. Effective SEO helps influence the relevance and importance of those web pages for related queries.
So, what do relevance and importance mean, anyway?
- Relevance: The degree to which the content on a web page matches the intent of the searcher (intent is what searchers are trying to accomplish with that search, which is no small undertaking for search engines—or SEOs—to figure out).
- Importance: Web pages are considered more important the more they are cited elsewhere (think of these citations as a vote of confidence for that web page). Traditionally, this comes in the form of links from other websites to that web page, but there could be other factors that come into play as well.
To assign relevance and importance, the search engines use complex algorithms that take into account hundreds of signals about any given web page.
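To see how relevance and importance might combine, here is a deliberately simplistic sketch. It is not any search engine's actual algorithm: it uses just two made-up signals (the share of query words found on the page, and a count of inbound links) where real engines weigh hundreds. The `rank` function, the scoring formula, and the sample data are assumptions for illustration only.

```python
def rank(query, pages, inbound_links):
    """Order pages by a toy score: query-term overlap times (1 + inbound links)."""
    terms = set(query.lower().split())
    scored = []
    for url, text in pages.items():
        words = set(text.lower().split())
        # Relevance: fraction of the query's words that appear on the page.
        relevance = len(terms & words) / len(terms) if terms else 0.0
        if relevance == 0:
            continue  # pages unrelated to the query are left out
        # Importance: more citations (links) from elsewhere means a higher score.
        importance = 1 + inbound_links.get(url, 0)
        scored.append((relevance * importance, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


PAGES = {
    "https://a.example/": "easy bread recipe",
    "https://b.example/": "bread recipe with sourdough starter",
    "https://c.example/": "car repair tips",
}
INBOUND = {"https://a.example/": 1, "https://b.example/": 5}

results = rank("bread recipe", PAGES, INBOUND)
print(results)
```

Both bread pages match the query equally well here, so the page with more inbound links comes out on top, while the car repair page is dropped for lack of relevance.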
These algorithms often change as search engines work to improve their methods of serving up the best results to their users. And even though they are constantly being tweaked, some of the fundamentals of what the search engines are looking for are pretty well understood.
We’ll probably never know the complete list of signals that search engines like Google use in their algorithms. That’s a closely guarded secret, and for good reason: spammers could use that knowledge to game the system. Still, the search engines have revealed some of the basics through knowledge sharing with the web publishing community, and we can use that knowledge to create lasting SEO strategies.
How Search Engines Evaluate Content
As part of the ranking process, a search engine needs to be able to understand the nature of the content of each web page it crawls. In fact, Google puts a lot of weight on the content of a web page as a ranking signal.
In 2016, Google confirmed what many of us already believed: content is among the Top 3 ranking factors for web pages.
In order to understand what the page is about, search engines analyze the words and phrases that appear on it, and then build a map of that data, known as a “semantic map”—which helps define the relationship between the concepts on a web page.
What Search Engines Can “See” on a Web Page
In order to evaluate content, search engines parse the data found on a web page to make sense of it. Since search engines are software programs, they “see” web pages very differently than we do.
Search engine crawlers see web pages in the form of the DOM (the Document Object Model, the structured representation of a page’s HTML). As a human, if you’re trying to see what the search engines see, one thing you can do is look at the source code of the page. To do this, you can start by right-clicking on the web page in your browser and choosing an option such as “View Page Source.”
This will show you the source code of the web page, which might look like this:
In addition to the unique content on the page, there are other elements on a web page that search engine crawlers find that help the search engines understand what the page is about.
This includes things like:
- The web page’s metadata, including the title tag and meta description tag, found in the HTML code. Though not readily viewable on the web page that humans see, these tags serve as the title and description of the web page in the search results, and should be maintained by website owners.
- The alt attributes for images on a web page. These are descriptions that website owners should maintain to describe what the image is about. Since search engines can’t “see” images, alt attributes help them better understand the content on the page. They also serve an important role for people with disabilities who use screen-reading programs to describe the content on a web page.
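If you want to check these elements on your own pages, here is a short sketch that pulls the title tag, meta description, and image alt attributes out of a page's HTML using Python's standard `html.parser` module. The `PageSignals` class and the sample HTML are made up for illustration, and this is only an approximation of what a crawler might read, not a description of how any search engine parses pages.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collects the title, meta description, and image alt text from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.image_alts = []      # one entry per <img>; None if alt is missing
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "img":
            self.image_alts.append(attrs.get("alt"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


SAMPLE_HTML = """<html><head>
<title>Sourdough Basics</title>
<meta name="description" content="A beginner's guide to sourdough.">
</head><body>
<img src="loaf.jpg" alt="A crusty sourdough loaf">
<img src="spacer.gif">
</body></html>"""

signals = PageSignals()
signals.feed(SAMPLE_HTML)
signals.close()

print(signals.title)
print(signals.description)
print(signals.image_alts)
```

The second image has no alt attribute, so it shows up as `None` in the list. That makes it easy to spot images that give crawlers and screen readers nothing to work with.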
What Search Engines Cannot “See” on a Web Page
It’s also important to understand which elements on a web page search engines can’t see, so that you can tailor your website’s content to help crawlers better understand it.
We already mentioned images, and how alt attributes help crawlers understand what those images are about. Other elements that cannot be seen by search engines include:
- Flash files: Google has said that it can extract some information from Adobe Flash files, but it’s difficult because Flash is a pictorial medium. When designers use Flash to design websites, they typically don’t insert text that would help explain what the files are about. Many designers have moved to HTML5 as a search engine friendly alternative to Adobe Flash.
- Audio and video: Just like images, it’s hard for search engines to understand what audio or video is about without context. There are some exceptions; for example, search engines can extract limited data from the ID3 tags within MP3 files. This is one of the reasons many publishers accompany audio and video with transcripts on a web page to help give search engines more context.
- iframes: An iframe tag is typically used to embed content from elsewhere on your own website into the current web page, or to embed content from another site into your web page. Google may not treat this content as if it is part of your page, especially if it’s being sourced from a third-party website. Historically, Google has ignored content within an iframe, but there may be exceptions to that general rule.
At face value, search engines seem so simple: type a query into the search box, and poof! Your results await. But this instant gratification is powered by a complex set of processes behind the scenes that help identify the most relevant data to the end user, so she can do things like find a recipe, research a product or get an answer to her question.
Why should you care?
Knowing the fundamental principles of crawling, indexing and ranking helps website owners tune their sites so that search engines can easily read and understand them, and better target those sites to the right search results.
Need help with fine-tuning your site for better search engine results? Here’s how we do SEO at Perficient Digital.
Art of SEO Series
This post is part of our Art of SEO series. Here are other posts in the series you might enjoy:
- 15 Crucial Elements of an SEO Audit
- Machine Learning and Search: Doing SEO When the Future Is Now
- Everything You Need to Know About Subfolders, Subdomains, and Microsites for SEO
- Keyword Cannibalization and SEO: What You Need to Know
Eric Enge leads the Digital Marketing practice for Perficient Digital. He designs studies and produces industry-related research to help prove, debunk, or evolve assumptions about digital marketing practices and their value. Eric is a writer, blogger, researcher, teacher, and keynote speaker and panelist at major industry conferences. Partnering with several other experts, Eric served as the lead author of The Art of SEO.