Does Google Respect Robots.txt NoIndex and Should You Use It? | Perficient Digital

The availability of the Robots.txt NoIndex directive is little known among webmasters largely because few people talk about it. Matt Cutts discussed Google’s support for this directive back in 2008. More recently, Google’s John Mueller discussed it in this Google Webmaster Hangout. In addition, Deepcrawl wrote about it on their blog.

Given the unique capabilities of this directive, the IMEC team decided to run a test. We recruited 13 websites that would be willing to take pages on their site and attempt to remove them from the Google index using robots.txt NoIndex. Eight of these created new pages, and five of them offered up existing pages. We waited until all 13 pages were verified as being in the Google index, and then we had the webmasters add the NoIndex directive for that page to their Robots.txt file.

This post will tell you whether or not it works, explore how it gets implemented by Google, and help you decide whether or not you should use it.

Difference Between Robots Metatag and Robots.txt NoIndex

This is a point that confuses many, so I am going to take a moment to lay it out for you. When we talk about a Robots Metatag, we are talking about something that exists on a specific webpage. For example, if you don’t want your www.yourdomain.com/contact-us page in the Google index, you can put the following code in the head section of that webpage:

<meta name="robots" content="noindex">

For each page on your website that you don’t want indexed, you can use this directive. Once Google recrawls the page and sees that directive, it should remove the page from their index. However, implementation of this directive, and Google’s (or Bing’s) removing it from their index, does not tell them to not recrawl the page. In fact, they will continue to crawl the page on an ongoing basis, though search engines may, over time, choose to crawl that page somewhat less often.

A common mistake that many people make is implementing a Robots Metatag on a page while also blocking crawling of that page in Robots.txt. The problem with this approach is that search engines can’t read the Robots Metatag if they are told not to crawl the page.
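As a rough way to spot that mistake, the sketch below checks whether a page carries a meta noindex while Robots.txt blocks it from being crawled. It is only an illustration using Python's standard library: the names RobotsMetaFinder and noindex_is_hidden are hypothetical, the page HTML and Robots.txt text are passed in as strings rather than fetched, and Python's rule matching is not guaranteed to mirror Googlebot's.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def noindex_is_hidden(robots_txt, page_url, page_html, agent="Googlebot"):
    """Return True if the page has a meta noindex that a blocked crawler cannot read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse rules from text, no network access
    finder = RobotsMetaFinder()
    finder.feed(page_html)
    return "noindex" in finder.directives and not rules.can_fetch(agent, page_url)
```

If this returns True for a page, the meta tag and the Disallow rule are working against each other, and one of them probably needs to go.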

In contrast, implementing a NoIndex directive in Robots.txt works a bit differently. If Google, in fact, supports this type of directive, it would allow you to combine the concept of blocking a crawl of a page and NoIndex-ing it at the same time. You would do that by implementing directive lines on Robots.txt similar to these two:

Disallow: /page1/
Noindex: /page1/

Since reading the NoIndex instruction does not require loading the page, the search engine would be able to both keep it out of the index AND not crawl it. This is a powerful combination! There is one major downside that would remain though, which is that the page would still be able to accumulate PageRank, which it would not be able to pass on to other pages on your site (since crawling is blocked, Google can’t see the links on the page to pass the PageRank through).
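To audit which paths carry both directives, a small script can read the two kinds of lines straight out of Robots.txt. The sketch below is a simplification under stated assumptions: it ignores User-agent groups and wildcards, treats every rule as a plain path prefix, and the function names are hypothetical. It uses a hand-rolled parser because Python's built-in urllib.robotparser does not read Noindex lines.

```python
def parse_robots_rules(robots_txt):
    """Return the Disallow and Noindex path prefixes found in a robots.txt string.

    Simplifying assumption: User-agent groups and wildcards are ignored.
    """
    rules = {"disallow": [], "noindex": []}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in rules and value:
            rules[field].append(value)
    return rules


def blocked_and_noindexed(robots_txt, path):
    """True if the path matches both a Disallow prefix and a Noindex prefix."""
    rules = parse_robots_rules(robots_txt)

    def matches(prefixes):
        return any(path.startswith(prefix) for prefix in prefixes)

    return matches(rules["disallow"]) and matches(rules["noindex"])
```

With the two lines shown above in a Robots.txt file, blocked_and_noindexed(robots_txt, "/page1/") would report that the page is covered by both rules.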

Raw Results

On September 22, 2015, we asked the 13 sites to add the NoIndex directive to their Robots.txt file. All of them complied with this request, though one of them had a problem with it: implementing the directive caused their web server to hang. Since this server crash was unexpected, I tested this on PerficientDigital.com for a given page, and it also caused a problem for our server, resulting in this message:

Internal Server Error

Later on, I retested this, and the problem dawned on me. I had implemented the NoIndex directive in my .htaccess file instead of robots.txt. Evidently, this will hang your server (or at least some servers). But it was an “Operator Error”! I have since tested implementing it in Robots.txt without any problems. However, I discovered this only after the testing was completed, so that means we had 12 sites in the test.

Our monitoring lasted for 31 days. Over this time, we tested each page every day to see if it remained in the index. Here are the results that we saw:

[Figure: Rate at Which URLs Dropped Out of the Index]

Now that’s interesting! 11 of the 12 pages tested did drop from the index, with the last two taking 26 days to finally drop out. This is clearly not a process that happens the moment that Google loads Robots.txt. So what’s at work here? To find out, I did a little more digging.

Speculation on What Google is Doing

My immediate speculation was that it seems like Google is only executing the Robots.txt NoIndex directive at the time it recrawls the page. To find that out, I decided to dig into the log files of some of the sites tested. The first thing that you notice is that Googlebot is loading the Robots.txt files for these sites many times per day. What I did next was review the log files for two of the sites, starting with the day that the site implemented the Robots.txt NoIndex, and ending on the day that Google finally dropped the page from the index.
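For anyone who wants to run a similar review on their own logs, here is a minimal sketch of the counting step. It is not the script used for this test. It assumes the access log is in the common Apache/Nginx "combined" format, identifies Googlebot by user-agent string only (which can be spoofed), and uses hypothetical names throughout.

```python
import re
from collections import Counter

# Assumed layout: ... [30/Sep/2015:06:12:01 +0000] "GET /path HTTP/1.1" 200 123 "referer" "user agent"
LOG_LINE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" .*"(?P<agent>[^"]*)"\s*$'
)


def googlebot_hits_per_day(log_lines, paths):
    """Count requests per (day, path) for the given paths made with a Googlebot user agent."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        if "Googlebot" in match.group("agent") and match.group("path") in paths:
            hits[(match.group("day"), match.group("path"))] += 1
    return hits
```

Calling it with the lines of a log file and a set such as {"/robots.txt", "/page1/"} gives a per-day tally of how often Googlebot loaded Robots.txt versus the target page, which is the comparison made in the rest of this section.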

The goal was to test my theory, but what I saw surprised me. First was the site that never came out of the index during the time of our test. For that one, I was able to get the log files from September 30 through October 26. Here is what the log files showed me:

[Figure: Googlebot Access for Page That Never Came Out of Index]

Remember that for this site the target page was never removed from the index. Next, let’s look at the data for one of the sites where the target page was removed from the index. Here is what we saw for that page:

[Figure: Googlebot Access for Page Removed From the Index]

Now that’s interesting. Robots.txt is regularly accessed as with the other site, but the target page was never crawled by Google, and yet it was removed from the index. So much for my theory!

So then, what is going on here? Frankly, it’s not clear. Google is not acting on the NoIndex directive every time it reads a Robots.txt file, nor is it under any obligation to do so. That’s what led to my speculation that Google might wait until it next crawls the page and consider NoIndex-ing the page then, but clearly that’s not the case.

After all, the two sets of log files I looked at both contradicted that theory. On the site that never had the target page removed from the index, the page was crawled five times during the test. For the other site, where the page was removed from the index, the target page was never crawled.

What we do know is that there is some conditional logic involved; we just don’t know what those conditions are.

Summary

Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index. That’s pretty useful in concept. However, our tests didn’t show 100 percent success, so it does not always work.

Further, bear in mind, even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice if you prefer that term). And PageRank still matters. The latest Moz Ranking Factors results still weigh different aspects of links as the two most important factors in ranking.

In addition, don’t forget what John Mueller said, which was that you should not depend on this approach. Google may remove this functionality at some point in the future, and the official status for the feature is “unsupported.”

When to use it, then? Only in cases where 100 percent removal is not a total necessity, and in which you don’t mind losing the PageRank to the pages you are removing in this manner.

Thanks to the IMEC board: Rand Fishkin, Mark Traphagen, Cyrus Shepard, and David Minchala. And, for completeness, here is my Twitter handle.

Thanks also to the publishers participating in this test. Per agreement among the IMEC team, participants remain anonymous, but you guys are awesome! You know who you are.

31 responses to “Does Google Respect Robots.txt NoIndex and Should You Use It?”

  1. Matthew Diehl says:

    Thanks for sharing the results Eric!
    How did the time to remove compare between the two “types” of pages – new vs. existing?
    Would be interesting to know if preexisting PR/juice/value of a page may be an influence on Google’s time to act.

  2. Evgeniy says:

    A month ago we tested this issue and asked John Mueller whether it would just be a validation error in Search Console, or whether Google would accept this directive. He answered (https://goo.gl/ge1yBB) that this directive is not officially supported and that relying on it isn’t recommended. But our tests were successful too, like yours.

  3. Tobias says:

    Hey Eric,
    question: Do you think this might also work for subdomains? We somehow made it possible to get our staging environment in the index.
    So would an addition like
    disallow: submain.domain.com
    noindex: submain.domain.com
    work?
    Of course, we would add this in this robots.txt file of that specific subdomain.
    Thank you for getting back to me.

  4. Stephen Watts says:

    When to use this feature? In those occasional cases where a CMS won’t allow you to NoIndex a page in the traditional way. I’ve found that with some enterprise-level CMS’s that it can be difficult to NoIndex a single page without development work. This would be maybe the only time that using the Robots NoIndex is a better option than using the traditional methods (HTTP header or tag).

  5. Eric Enge says:

    Tobias – your subdomain should have its own robots.txt file, and you should put the NoIndex in there.

  6. Eric Enge says:

    Hi Matthew – it does not look like we saw any clear difference there. Of course, we did look only at 12 pages, so I wouldn’t be able to draw a firm conclusion. But, on the 12 pages we looked at, some of them did NoIndex quickly, and some took a long time, from both classes of pages (new and existing).

  7. Eric Enge says:

    Good to know, thanks!

  8. Eric Enge says:

    Stephen – that might be a use for it. I’d still be careful as Google considers the feature unsupported.

  9. Mikhail says:

    I would bet the time it takes to deindex a page is similar to the time it takes to index a page – that is, it’s probably related to how popular the site is.
    You’ve probably noticed that a new page on reddit (for instance) can often be indexed in Google within minutes of its creation, whereas a page on another random website can take hours or even days. I’m guessing this would also ring true for deindexation, but people aren’t intentionally deindexing their pages that often so it wouldn’t be so well known.

  10. Alec Bertram says:

    Good test. I recently ran a similar one but with nosnippet and noarchive – the jury is still out on noarchive, but it looks as though Google also obeys nosnippet in robots.txt. It may be worth running a larger scale test on all robots directives with IMEC?
    The results of my tests are at http://abertram.com/technical-seo/nosnippet-and-noarchive-inside-robots-txt/

  11. Eric Enge says:

    Hi Alec – interesting idea. I will see what the IMEC team wants to do!

  12. Eric Enge says:

    Reasonable guess!

  13. Brian Jensen says:

    Hey Eric,
    Very interesting study. Just curious, what role if any XML Sitemaps played in the test i.e. were the pages that were successfully dropped from the index listed in a Sitemap that was listed in the robots.txt and vice versa?

  14. Brijesh says:

    Hi Eric, I think John Mueller and other Googlers have said on record many times that when it comes to removing a page, they want to know the webmaster’s absolute intention. Google doesn’t act the first time on what a webmaster has said through such coding. For example, even a 301 redirect takes some time to come into effect. Google waits to confirm that the webmaster hasn’t done it accidentally.

  15. Jakob says:

    From my own experience I can tell that the robots file is not paid respect to every time…

  16. These are to be used prior to Googlebot spidering which is why you have a delay in the results. You would use Google URL removal tool to remove a URL from your website verified in search console to remove a URL. not a noindex meta.. that is for pages that require canonicals or have a limited visibility and expiration date…
    robots.txt would be used to block directories not pages, of course this must be done prior to launching the website, or you would use parameter filter tools in search console? no?

  17. Tim Etler says:

    When doing the test, did you ask the test sites to do both Noindex and Disallow, or just Noindex?

  18. Eric Enge says:

    Just NoIndex

  19. Interesting post Eric
    I have some pages on my site that display backlink reports for some of my clients’ competitors.
    All of these pages were noindexed, nofollowed and disallowed in robots.txt. There were no links to these pages from any other page on my site and none were included in my sitemap. None of the pages appeared in Google’s index.
    Last week, I received a warning from Google for malicious links on these pages (now resolved).
    It seems it doesn’t matter much how you instruct Googlebot, it will still crawl every page it can possibly find.
    I have now put these pages behind a login which is probably what I should have done in the first place. 🙂

  20. John Cliff says:

    Thanks for sharing the results Eric!

  21. james says:

    I recently found a mistake on a website in which a meta noindex tag was used on a product page that was in the main navigation of the website. I checked the index and it was still there. So now I’m wondering if multiple PPC landing pages with duplicate content and the same topic are hurting the website even though they have noindex tags on them.

  22. Eric Enge says:

    Interesting to hear that the page was still in the index. The NoIndex tag is supposed to be a directive, which means that Google is obligated to respect it. How did you verify that the page was still in the index?

  23. james says:

    I stuck the url into google search and it appeared. I also have rank tracking setup and would have noticed if it fell out of the index.

  24. John says:

    Hi Eric,
    Great post. Just want to make sure I understand one of the details here. I was perhaps under the wrong impression that any page blocked from robots.txt with a disallow:/* will NOT pass any equity that it accumulates to the rest of the site. Sure that page itself may have some page rank, but with no one allowed to crawl or see the links on the page, it stops there and does not benefit the rest of your site. Is that not the case? Something you said above made me think that equity would be pass on to the rest of the site.
    “Further, bear in mind, even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice if you prefer that term). And PageRank still matters.”
    Do you mean “block” via the on-page meta declaration or block within Robots.txt?

  25. Eric Enge says:

    I was referring to within robots.txt. Agree on the point that they can still accumulate PageRank, but at least they’re out of the index! This helps a lot when we decide that we want to remove pages from the index, and we want to relieve crawl budget at the same time.

  26. John says:

    Thanks for the reply! Does that mean I shouldn’t be worried about the passage of equity from pages that are blocked via disallow in robots.txt? Is google still able to see the links on that page and pass equity to them? I get that equity would still pass to let’s say /page
    but if /page is blocked in robots and links to /page1, /page2, and /page3 internally, wouldn’t 1, 2, and 3 miss out on the equity that’s behind /page if crawlers are disallowed?
    Sorry if I’m beating the same point over and over. I just worry that robots.txt as my main weapon is leaving equity on the table where a robots ON page would likely be better.

  27. Jeroen says:

    Nice blog Eric! One more question: when to use disallow and when noindex in the robots.txt? What is the difference?

  28. Khurram says:

    Robots.txt is the absolute last means to use for blocking pages. Do not block a page with robots.txt unless you have tried all other options. A more appropriate method of keeping a page out of the index is the noindex tag.

  29. Eric Enge says:

    As noted in the article, this is about using a tag within Robots.txt whose impact is identical to the NoIndex tag that you are referring to.

  30. Khurram says:

    Thanks for the clarification Eric! Didn’t get that at first sight.

  31. David says:

    Hi Eric,
    Great test and thank you for sharing it. It would be yet more insightful to run the same test with pages/urls that boast some external inbound links. I find Google totally ignores the disallow robots directive even when the page boasts one single backlink. I’d be happy to contribute to such experiment.
    Thanks again for this great post.
