It’s common practice to block entire staging environments via robots.txt: roughly 50% of our website migration clients do it. Here’s why it’s not a good idea.
When preparing a website migration, webmasters often decide to add a robots.txt file with the following content to the staging environment of the new website:
User-agent: *
Disallow: /
These instructions ask robots to stay away from all pages on the site. The idea behind such an implementation is to prevent the content of the staging environment from being indexed by search engines.
Use the “real” robots.txt file
The short and simple robots.txt file we saw above obviously isn’t the one that will be needed once the new website goes live. The “real” robots.txt will be a bit more complex. It’s a good idea to add this final robots.txt to the staging environment already, for the following reasons:
- First, you can simulate real crawling behaviour on the staging environment right away, using a crawler that respects the robots.txt file and needs no custom robots.txt settings. This lets you check, for instance, whether all pages are still linked correctly internally when blocked pages are not crawled, or whether crawl resources are wasted on pages that should be blocked via the robots.txt file but aren’t yet.
- Second, by using the “real” robots.txt file on the staging environment, you eliminate the risk of going live with a robots.txt file that blocks your entire site. Most SEOs have probably heard of several cases where this happened and caused a lot of pain.
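Both checks from the list above can be scripted with Python’s standard urllib.robotparser module. The sketch below is only an illustration: the rules in REAL_ROBOTS_TXT and the sample paths are hypothetical, and the module uses simple prefix matching, so its verdicts may differ from a particular search engine’s parser when wildcard rules are involved.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; use the robots.txt planned for the live site.
REAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""

# The blanket block often found on staging environments.
BLOCK_ALL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def blocked_paths(robots_txt, paths, user_agent="*"):
    """Return the paths a robots.txt-respecting crawler would skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


def blocks_homepage(robots_txt, user_agent="*"):
    """True if even "/" is disallowed, which usually signals a blanket block."""
    return bool(blocked_paths(robots_txt, ["/"], user_agent))


sample_paths = ["/", "/products/shoes", "/cart/checkout", "/search?q=shoes"]
print(blocked_paths(REAL_ROBOTS_TXT, sample_paths))
print(blocks_homepage(BLOCK_ALL_ROBOTS_TXT))
```

Running blocks_homepage against the robots.txt that is about to be deployed is a cheap pre-launch check against going live with a file that blocks the whole site.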
Don’t use “noindex” for all staging pages either
A very similar (but less common) habit is setting all pages on the staging environment to “noindex”. Again, the idea is to prevent them from being indexed, but the problems are the same as the ones described above:
You want to be able to check which pages are correctly set to “noindex” before the website goes live, or which pages need to be set to “noindex”, but aren’t yet. And you definitely don’t want your new website to go live with all pages set to “noindex”.
Just like blocking all pages on your staging environment via robots.txt, setting them all to “noindex” is also a bad idea.
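The “noindex” check described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, assuming “noindex” is set via a robots meta tag in the HTML (it does not look at HTTP response headers); the page paths, the HTML snippets and the expected_noindex set are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_mismatches(pages, expected_noindex):
    """Compare crawled pages (path -> HTML) with the paths that should be noindex."""
    missing = sorted(p for p in pages if p in expected_noindex and not is_noindex(pages[p]))
    unexpected = sorted(p for p in pages if p not in expected_noindex and is_noindex(pages[p]))
    return missing, unexpected


# Hypothetical crawl result and expectations.
pages = {
    "/": "<html><head><title>Home</title></head></html>",
    "/products": '<html><head><meta name="robots" content="noindex"></head></html>',
    "/thank-you": '<html><head><meta name="robots" content="noindex, follow"></head></html>',
    "/internal-search": '<html><head><meta name="robots" content="index, follow"></head></html>',
}
expected_noindex = {"/thank-you", "/internal-search"}

missing, unexpected = noindex_mismatches(pages, expected_noindex)
print("Should be noindex but is not:", missing)
print("Is noindex but should not be:", unexpected)
```

If every staging page were set to “noindex”, the second list would contain the whole site and the first would always be empty, which is exactly why the blanket setting hides the problems you want to find.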
How to properly protect your staging environment
It is of course essential to make sure your staging environment doesn’t get indexed by search engines, but blocking all pages via robots.txt or setting all pages to “noindex” is not the right way to do it. Staging environments should be protected with HTTP authentication (a username and password), or access should be limited to certain IP addresses.
In both cases, the crawlers you use for your pre-migration analysis can still access the staging environment if they are configured correctly. Almost all crawling tools can authenticate against a staging environment protected by an HTTP username and password, and most of them can also crawl from a static IP address that you can whitelist.
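For a custom crawling script, HTTP Basic authentication amounts to sending one extra request header. The following is a minimal sketch using Python’s standard library; the URL, the credentials and the ExampleCrawler user agent are placeholders, not values from any real setup.

```python
import base64
import urllib.request


def build_staging_request(url, username, password, user_agent="ExampleCrawler/1.0"):
    """Build a request that carries HTTP Basic credentials for the staging site."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "User-Agent": user_agent},
    )


# Placeholder URL and credentials.
request = build_staging_request("https://staging.example.com/", "staging", "change-me")

# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_header("Authorization"))
```

Because Basic authentication only encodes the credentials, the staging environment should be served over HTTPS.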
By using the robots.txt and “noindex” settings that you will need when the site goes live, you can simulate real crawling behaviour on the staging environment. Another advantage of the methods recommended here is that they are a lot more secure: robots.txt and “noindex” instructions can always be ignored by crawlers, while HTTP authentication and IP restriction keep them out.
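To illustrate why these two methods keep crawlers out while robots.txt merely asks them to stay away, here is a sketch of the decision a server makes before it serves any staging page. In practice this is normally configured in the web server or a proxy rather than written by hand; the network range, username and password below are hypothetical.

```python
import base64
import hmac
import ipaddress

# Hypothetical values for illustration only.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
STAGING_USER = "staging"
STAGING_PASSWORD = "change-me"


def ip_is_allowed(client_ip):
    """IP restriction: only clients inside an allowed network get through."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


def credentials_are_valid(authorization_header):
    """HTTP Basic authentication: check the 'Authorization' request header."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        encoded = authorization_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return False
    username, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(username.encode(), STAGING_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), STAGING_PASSWORD.encode())
    return user_ok and password_ok


header = "Basic " + base64.b64encode(b"staging:change-me").decode("ascii")
print(ip_is_allowed("203.0.113.42"), ip_is_allowed("198.51.100.7"))
print(credentials_are_valid(header), credentials_are_valid(None))
```

A request that fails the check never receives the page content, so there is nothing for a search engine to index, regardless of whether the crawler honours robots.txt.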
What are your thoughts and experiences?
What do you think about blocking entire staging environments via robots.txt files? We’d love to hear your opinion in the comments.
Join the discussion 5 Comments
Hello, Eoghan. Thanks for your answer. I am thinking about these uses:
– I would like to test the new website with Fetch & Render
– I would like to remove from the index many pages that were indexed earlier, before the IP wall was built
Tunneling is a great solution I did not know about. Thank you!
Hi, Eoghan, I have a question that goes further. Suppose that I want to do some tests with GSC on a staging environment: developing.example.com. I would need to have a robots.txt that allows the URL I am testing, wouldn’t I? If I am not wrong, I cannot do this with either the HTTP user and password method or the IP restriction method. Any idea/solution? Thanks in advance!!
Thank you for your comment. What would you like to test with GSC on a staging environment? GSC is interesting for live pages, but I cannot think of many use cases for a staging environment. What are your ideas?
If you’re thinking about tools like the Structured Data Testing Tool or the Mobile-Friendly Test Tool, tunneling might be an option for you. Does this help?
Hi Eoghan! Thanks for this nice post.
I think that you are on point when saying that the best solution is just allowing the staging website to be accessed from certain IPs, or with a user and password, although that adds some management complexity too.
With IPs, if you have remote workers (w/o a common VPN), external workers, automated tests through 3rd-party services, etc., it can become a pain in the ass to manage. User and password also adds some complexity for crawling and testing, but we should be OK with the tools that exist today.
When I’m working with good, distributed tech teams w/ automated tests, we use only robots.txt, as it’s safe enough in most scenarios. When I know the team can do some strange stuff and there is a slow time-to-deploy, I use user and password to protect the project.
Thanks for sharing!
Thank you very much for sharing your thoughts. I agree that IP restriction can be a pain, especially when you have people involved that are not on a static IP, or if you want to use tools that don’t offer this feature. My favourite solution is also HTTP user and password.