It’s a common practice to block entire staging environments via robots.txt: roughly 50% of our website migration clients do it. Here’s why it’s not a good idea.
When preparing a website migration, webmasters often add a robots.txt file with the following content to the staging environment of the new website:
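```
User-agent: *
Disallow: /
```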
These instructions ask robots to stay away from all pages on the site. The idea behind such an implementation is to prevent the content of the staging environment from being indexed by search engines.
Use the “real” robots.txt file
The short and simple robots.txt file we saw above obviously isn’t the one that will be needed once the new website goes live. The “real” robots.txt will be a bit more complex. It’s a good idea to put this final robots.txt on the staging environment right away, for the following reasons:
- First, you can simulate real crawling behaviour on the staging environment with a crawler that respects robots.txt, without any custom robots.txt settings. This way you can check, for instance, whether all pages are linked correctly internally even when blocked pages are not crawled, or whether crawl resources are wasted on pages that should be blocked via robots.txt but aren’t yet.
- Second, by using the “real” robots.txt file on the staging environment, you eliminate the risk of going live with a robots.txt file that blocks your entire site. Most SEOs have probably heard of several cases where this happened and caused a lot of pain.
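As a side note, the way a crawler interprets such rules can be sketched with Python’s standard-library robots.txt parser. The URL and the `Disallow` rule below are placeholders standing in for a “real” robots.txt that blocks only a specific section, not the whole site:

```python
from urllib import robotparser

# Hypothetical "real" robots.txt: blocks one section, not the entire site
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Regular pages stay crawlable, the blocked section does not
print(rp.can_fetch("*", "https://staging.example.com/page"))       # True
print(rp.can_fetch("*", "https://staging.example.com/private/x"))  # False
```

Most crawling tools apply the same matching logic, which is what makes a pre-migration crawl against the final robots.txt meaningful.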
Don’t use “noindex” for all staging pages either
A very similar (but less common) habit is setting all pages on the staging environment to “noindex”. Again, the idea is to prevent them from being indexed, but the problems are the same as the ones described above:
You want to be able to check which pages are correctly set to “noindex” before the website goes live, or which pages need to be set to “noindex”, but aren’t yet. And you definitely don’t want your new website to go live with all pages set to “noindex”.
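For context, “noindex” is usually set via a robots meta tag in the page’s head (an equivalent X-Robots-Tag HTTP response header also exists):

```html
<meta name="robots" content="noindex">
```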
Just like blocking all pages on your staging environment via robots.txt, setting them all to “noindex” is also a bad idea.
How to properly protect your staging environment
It is of course crucial to make sure your staging environment doesn’t get indexed by search engines, but blocking all pages via robots.txt or setting them all to “noindex” is not the right way to go. Staging environments should be protected with HTTP authentication (a username and password), or access should be limited to certain IP addresses.
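As an illustration, here is a minimal sketch of both approaches for nginx; the hostname, file paths, and IP address are placeholders, and the same can be achieved in Apache via .htaccess:

```nginx
server {
    server_name staging.example.com;  # hypothetical staging host

    location / {
        # Option 1: HTTP basic authentication
        auth_basic "Staging";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Option 2: restrict access to whitelisted IPs
        # (e.g. your office and your crawler's static IP)
        # allow 203.0.113.10;
        # deny all;
    }
}
```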
In both cases, the crawlers you use for your pre-migration analysis can still access the staging environment if configured correctly. Almost all crawling tools can authenticate against a staging environment protected by an HTTP username and password, and most of them can also crawl your staging environment from a static IP that you can whitelist.
By using the robots.txt and “noindex” settings that you will need when the site goes live, you can make sure to simulate real crawling behaviour on the staging environment. Another advantage of the methods recommended here is that they are a lot more secure: robots.txt and “noindex” instructions can always be ignored by crawlers, while HTTP authentication and IP restriction reliably keep them out.
What are your thoughts and experiences?
What do you think about blocking entire staging environments via robots.txt files? We’d love to hear your opinion in the comments.