Note: This article was first published on rebelytics.com in 2015 and has since been updated and moved to this blog.
Spam traffic in Google Analytics has been a major issue in the digital marketing community lately. Since the introduction of Universal Analytics in particular, the amount of spam traffic has increased dramatically. This is because Universal Analytics accounts are much easier to spam than classic Google Analytics accounts, a problem that is discussed later in this article.
A lot has been written about spam traffic in Google Analytics and there are plenty of useful resources to help you tackle the problem. Nevertheless, some of the suggested solutions seem to be misleading due to a lack of understanding of the problem. This is why this article focuses on explaining the problem in easy terms, before discussing some solutions that have been suggested to eliminate spam traffic and adding some new ideas.
Let’s start by having a look at how spammers actually get data into your Google Analytics accounts. There are two ways of spamming Google Analytics accounts that I am aware of:
- Web crawlers that visit sites with Google Analytics tracking codes
- Direct data insertion into Google Analytics accounts via the Measurement Protocol
We will first discuss the theory behind the two ways of spamming Google Analytics data, before talking about possible solutions.
Spamming Google Analytics accounts with web crawlers
Web crawlers, or bots, are software programmes that automatically visit large numbers of websites across the internet. Most web crawlers have useful functions. Googlebot, for example, crawls all websites it can find and helps Google index the web. Visits by web crawlers are normally not captured by Google Analytics, because crawlers identify themselves as crawlers, not as real users, and because they are not interested in executing tracking codes on the websites they visit.
Some spammers use web crawlers to manipulate the data in your Google Analytics account. They send a web crawler to your website that identifies itself as a real user, executes tracking codes, and is therefore captured by Google Analytics. On top of that, the web crawler pretends to be coming from a link that points from another domain to your site. These links normally don’t exist, but the domains that the links are supposed to be on are real and belong to the spammers.
But what do spammers gain from this? By simulating visits from their own domains to thousands, or even millions, of Google Analytics accounts, they generate a significant amount of traffic to their own websites. Google Analytics users want to find out where all of their new, unexpected traffic is coming from, so they check out the domains that appear in their referral reports. I’m sure that you have done this yourself at least once.
The spammers are happy about all of this traffic, because they make money with the ads they show on their websites. It’s as simple as that!
Let’s now have a look at a new, easier, and more sophisticated way of spamming Google Analytics accounts.
Spamming Google Analytics accounts by inserting data via the Measurement Protocol
As I mentioned at the beginning of this article, spam traffic has become an even bigger problem since the introduction of Universal Analytics. One of the best innovations of Universal Analytics is the Measurement Protocol, an interface that allows you to insert data into your Google Analytics account from any given system, without requiring the classic tracking code we all know from website tracking.
The Measurement Protocol makes Google Analytics a lot more powerful than it was before, because it enables an easy integration of different systems into your website tracking. One example of a tracking feature that has become a lot smoother to implement since the introduction of the Measurement Protocol is phone call tracking. Call tracking providers can now insert data into your Google Analytics account via a simple HTTP request and they can include all the dimension and metric data that a normal page view or event on your website would include.
Maybe you already know where this is heading: While the Measurement Protocol is the most powerful innovation of Universal Analytics, it is also its biggest vulnerability. Anybody can send anything to your Google Analytics account! All they need is your tracking ID, and off we go.
So most of the spam visits you see in your Google Analytics account didn’t actually happen on your website. Somebody just sent data to your Google Analytics account via the Measurement Protocol through a simple HTTP request. By using this method, the spammers can manipulate any dimension or metric they like. This is why, with Measurement Protocol spam, you see spam domains not only in your referral report, but also in your events report or in your organic search keywords report.
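To make the “simple HTTP request” more concrete, here is a minimal Python sketch of how a Universal Analytics pageview hit can be assembled. It only builds the request URL and does not send anything. The tracking ID, client ID, hostname and referrer values are placeholders I made up for illustration, and the helper name `build_hit` is hypothetical.

```python
from urllib.parse import urlencode

# Collection endpoint of the Universal Analytics Measurement Protocol (v1).
ENDPOINT = "https://www.google-analytics.com/collect"


def build_hit(tracking_id: str, client_id: str, hostname: str,
              page: str, referrer: str) -> str:
    """Return the URL of a single pageview hit. Nothing is sent."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property that receives the hit
        "cid": client_id,    # arbitrary client identifier
        "t": "pageview",     # hit type
        "dh": hostname,      # document hostname, freely chosen by the sender
        "dp": page,          # document path
        "dr": referrer,      # document referrer, freely chosen by the sender
    }
    return f"{ENDPOINT}?{urlencode(params)}"


# Placeholder values only: "UA-XXXXXXX-1" and the .test domains are not real.
hit_url = build_hit("UA-XXXXXXX-1", "555", "example.test", "/",
                    "http://spam-domain.test/")
print(hit_url)
```

Note that nothing in the request proves where it comes from: the hostname and the referrer are just parameters, which is why spam can show up in so many different reports.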
If you want to explore how the Measurement Protocol works in detail, I suggest you play around with Google’s Measurement Protocol Hit Builder a bit. I challenge you to send a spam message to the Google Analytics data of this website!
The goal of the spammers that use the Measurement Protocol method is the same as the goal of those that use web crawlers: They want to make you curious about their websites by making their domain names show up in all kinds of different places in your Google Analytics account. When you visit their websites, they make money with the ads they show.
How can you prevent your Google Analytics accounts from being spammed?
Now that we have discussed how spammers push data into your Google Analytics account and why they do it, let us have a look at some solutions to get rid of the unwanted traffic in your statistics. We will have a look at some of the advice that can be found across the web (including some of the really bad advice) and I will present the solution that I believe is best suited for tackling spam traffic inserted via the Measurement Protocol.
Let’s start with some of the bad advice, so that you know what NOT to do about spam traffic right from the start.
Do NOT use the referral exclusion list to exclude referral spam
The referral exclusion list, like the Measurement Protocol, is another feature of Google Analytics that has been introduced with Universal Analytics and did not exist in the classic version. Its main function is to prevent a new session from starting when users leave the tracked website to perform an action that is hosted on a different domain and are then referred back to the tracked website.
A classic application for this is payment via external providers. If you send your website visitors to paypal.com to pay for their purchases in your online shop, and PayPal then sends them back to your website, their return will show up as a new visit from paypal.com, and your referral report will look as if PayPal were sending you lots of buying customers.
To prevent this from happening, you can include paypal.com and the domains of other payment providers you work with in your referral exclusion list. Now, when a user comes to your page from paypal.com, Google Analytics will check whether this user has already started a session on your website. If so, the open session will be continued and the return of the user to your website will not be counted as a new visit and its source will not be noted as paypal.com, but as the source of the session that has already been started.
If, on the other hand, Google Analytics detects a user who comes from paypal.com and has not recently started a session on your website, the visit will be counted as a new visit and the referral “paypal.com” will be omitted. The visit will thus be counted as a direct visit.
And this is why you should never, never, ever, include spam referrals in your referral exclusion list! The spam visits will still be counted, but instead of counting them as referral visits from spam domains, Google Analytics will count them as direct visits. This actually makes the problem worse, instead of making it better. Now you won’t even be able to distinguish spam visits from real direct visits or real visits from other sources that are counted as direct visits for technical reasons.
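The behaviour described above can be sketched in a few lines of Python. This is a simplified model of the logic as explained in this article, not Google’s actual implementation, and the function name `attribute_source` and the `.test` domain are made up for illustration.

```python
from typing import Optional


def attribute_source(referrer_domain: str,
                     exclusion_list: set[str],
                     open_session_source: Optional[str]) -> str:
    """Return the source a hit is attributed to (simplified model)."""
    if referrer_domain not in exclusion_list:
        # Not excluded: the referrer is reported as the source.
        return referrer_domain
    if open_session_source is not None:
        # Excluded and a session is open: the session simply continues.
        return open_session_source
    # Excluded and no open session: the referrer is dropped,
    # but the visit is still counted, as a direct visit.
    return "(direct)"


# A returning PayPal customer keeps the original source of the session.
print(attribute_source("paypal.com", {"paypal.com"}, "google"))
# A spam hit from an "excluded" spam domain is still counted, now as direct.
print(attribute_source("spam-domain.test", {"spam-domain.test"}, None))
```

The second call shows the problem: the spam visit does not disappear, it just loses the one label that made it recognisable.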
So, whatever you do to fight spam traffic in Google Analytics, do NOT use your referral exclusion list to tackle the problem, even if this piece of advice can be found in the most reputable sources. It will not help you, but make your spam problem worse instead.
Let us now have a look at another piece of bad advice that can be found in resources dealing with the problem of spam traffic in Google Analytics accounts.
Do NOT use a country filter to exclude spam traffic
Like the advice about the referral exclusion list, this is another very bad idea I have read about in various otherwise reputable sources. Some digital marketers seem to think that simply filtering out traffic from “obscure” countries will solve the problem. What they do not realise is that there are real internet users in those countries who might be interested in your website and services, and that a lot of spam traffic shows up as traffic from your own country.
Using a hostname filter can be risky and is not really necessary
One of the solutions that is suggested across most of the resources on this topic is using a hostname filter for your Google Analytics data that only includes valid hostnames. This is a pretty good solution, but it is far from perfect and comes with some risks. It will help you eliminate most of the Measurement Protocol spam, because the spammers that push data into your Google Analytics account do not actually know your domain name. They generate Google Analytics tracking IDs randomly and use random hostnames, or often their own domains, in their hits.
The hostname filter solution suggests that you only include hits in your Google Analytics data that have your own hostname(s), along with some other “good” hostnames, such as Google Translate. And this is where the solution becomes extremely unreliable. Who knows which other “good” hostnames will appear in future? Other companies may launch services similar to Google Translate, where your content is served from a different hostname, for the benefit of the user and with no harm to your business. A strict include filter would silently drop those hits.
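Here is a small Python sketch of what such a hostname include filter does and where it goes wrong. The pattern is an assumption for illustration: `example.test` stands in for your own domain, `translate.googleusercontent.com` is assumed to be the hostname under which translated pages report, and `new-reader-service.test` is a made-up future service.

```python
import re

# Hypothetical include pattern: own hostname plus one known "good" hostname.
VALID_HOSTNAMES = re.compile(
    r"^(www\.)?example\.test$|^translate\.googleusercontent\.com$"
)


def passes_hostname_filter(hostname: str) -> bool:
    """Return True if a hit with this hostname would be kept."""
    return VALID_HOSTNAMES.search(hostname) is not None


for hostname in ["www.example.test",         # your own site: kept
                 "spam-domain.test",         # spam: dropped, as intended
                 "new-reader-service.test"]: # harmless new service: also dropped
    print(hostname, passes_hostname_filter(hostname))
```

The last hostname is the risk: a legitimate service you have never heard of is treated exactly like spam until somebody notices and updates the pattern.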
Nobody really needs this solution, as there is a much better way of eliminating Measurement Protocol spam that works very effectively and comes with no risk for the quality of your data. Let us have a look at this solution now.
Get rid of Measurement Protocol spam once and for all
At the digital marketing agency I used to work for, we came up with a quick, clean and easy solution for eliminating Measurement Protocol spam. All you need to set up this solution is the Google Tag Manager.
If you are not using Google Tag Manager yet and if you are still placing Google Analytics tracking codes directly in the source code of your website, you should change that NOW. There are few arguments for not using Google Tag Manager with Google Analytics. You can do lots of great things to your Google Analytics configuration when you use Google Tag Manager and the quality of your data improves significantly.
Setting up a Measurement Protocol spam traffic filter with Google Tag Manager is easier than it sounds. All you need to do is clearly identify all hits that you control and exclude all other hits that do not carry this identification. You can achieve this by passing a certain value in a custom dimension with every hit that happens on your website. Hits include pageviews, events, transactions and all other interactions of your tracking code with the Google Analytics servers.
In Google Tag Manager, you just add this value, which you define yourself, to a custom dimension in all of your Google Analytics tags. Think of this value as your password, although it need not be cryptic or safe. It is just an identifier that will help you recognise all the hits that you are in control of.
Now you set up this custom dimension in your Google Analytics account and create a filter for the data view you are working with that only includes hits that carry your value in that custom dimension. You will see that from now on, Measurement Protocol spam will not show up in your data anymore, because the spammers don’t know your password and Google Analytics filters out the hits they create.
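The effect of the include filter can be simulated in Python. This is only a sketch: hits are modelled as plain dictionaries of Measurement Protocol parameters, the custom dimension is assumed to have index 1 (parameter `cd1`), the value `my-site-hits` is a made-up password, and the comparison is an exact match rather than a filter pattern.

```python
# Assumptions for illustration: custom dimension index 1, made-up password.
PASSWORD_DIMENSION = "cd1"
PASSWORD_VALUE = "my-site-hits"


def include_hit(hit: dict[str, str]) -> bool:
    """Keep a hit only if it carries the password in the custom dimension."""
    return hit.get(PASSWORD_DIMENSION) == PASSWORD_VALUE


hits = [
    # Sent by your own tags via Google Tag Manager: carries the password.
    {"t": "pageview", "dh": "www.example.test", "dp": "/", "cd1": "my-site-hits"},
    # Measurement Protocol spam: the spammer does not know about the dimension.
    {"t": "pageview", "dh": "spam-domain.test", "dr": "http://spam-domain.test/"},
    # Spam with a guessed value: still filtered out.
    {"t": "event", "dh": "www.example.test", "ec": "spam", "cd1": "guess"},
]

kept = [hit for hit in hits if include_hit(hit)]
print(len(kept))
```

Only the first hit survives, regardless of which hostname or referrer the spam hits claim to have.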
Here’s a step-by-step guide for what I just described:
Step-by-step guide for setting up a spam filter with Google Tag Manager
In Google Analytics, add a new custom dimension and call it “Password” or something similar.
Once you’ve saved the custom dimension, memorise the index number Google Analytics has given it. You will need this number for the Google Tag Manager part in the next step.
In Google Tag Manager, if you are using a Google Analytics settings variable, set up a custom dimension with the index number you memorised in the previous step, and a password you choose. Copy the password to your clipboard.
If you are not using a Google Analytics settings variable, simply make sure you set up this custom dimension in every Google Analytics tag you are using.
Next, go back to Google Analytics and set up a filter for your data view that only includes hits that have the value of the password you have copied to your clipboard in the custom dimension you have set up.
That’s it! Your Google Analytics data is now protected from spam traffic inserted through the Measurement Protocol, as only hits that contain your password will show up in your filtered data view.
If you have any questions about setting this up, please don’t hesitate to leave a comment under this article. I’ll be happy to help!
Important note: If you yourself are using the Measurement Protocol to push data to Google Analytics for certain tracking features, like phone call tracking, you have to make sure you also include your identification value (password) in those hits.
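As a rough sketch of what this means in practice, the following Python snippet adds the password to the parameters of a server-side hit before it is encoded. The helper `with_password`, the dimension index 1, the password `my-site-hits` and the event values are all assumptions for illustration.

```python
from urllib.parse import urlencode


def with_password(params: dict[str, str], dimension_index: int,
                  password: str) -> dict[str, str]:
    """Return a copy of the hit parameters with the password dimension added."""
    tagged = dict(params)                       # leave the original untouched
    tagged[f"cd{dimension_index}"] = password   # custom dimension parameter
    return tagged


# Hypothetical phone call event sent by a call tracking system.
call_hit = {"v": "1", "tid": "UA-XXXXXXX-1", "cid": "555",
            "t": "event", "ec": "phone", "ea": "call"}

payload = urlencode(with_password(call_hit, 1, "my-site-hits"))
print(payload)
```

Without the extra `cd1` parameter, your own call tracking hits would be filtered out together with the spam.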
As you can see, it is very easy to exclude spam traffic that is pushed into your account via the Measurement Protocol. But what about the other type of spam traffic that we have discussed? Web crawler spam traffic is much more difficult to tackle, but let’s not give up! We will now have a look at the options for tackling this type of spam traffic.
Using referral exclusion filters to eliminate crawler spam
Once you have set up the Measurement Protocol spam traffic filter solution discussed above, you will see that the amount of spam traffic in your account decreases dramatically, but some visits from obscure referrals will keep showing up. These are visits that actually happened on your site, but they were caused by crawlers that pretend to be real users, not by real users.
The solution I am using at the moment is identifying those referrals on a weekly or monthly basis (depending on the size of the account) and excluding them from the data view, custom reports and dashboards using filters. This works well, but it is obviously a pain because it takes up a lot of time, so I am looking for an automated solution.
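Part of this manual work can at least be made less error-prone. The following Python sketch turns a list of spam referral domains into one escaped regular expression that could be pasted into an exclude filter, and checks sources against it. The function names and the `.test` domains are made up for illustration, and how well the resulting pattern fits into a single filter field is something you would need to check yourself.

```python
import re


def build_exclude_pattern(domains: list[str]) -> str:
    """Join the domains into one alternation, escaping regex characters."""
    return "|".join(re.escape(domain) for domain in sorted(set(domains)))


def is_spam_referral(source: str, pattern: str) -> bool:
    """Return True if the source matches one of the listed spam domains."""
    return re.search(pattern, source) is not None


# Hypothetical spam domains collected from the referral report.
pattern = build_exclude_pattern(["spam-domain.test", "free-traffic.test",
                                 "spam-domain.test"])

print(is_spam_referral("spam-domain.test", pattern))
print(is_spam_referral("google", pattern))
```

Escaping matters here: an unescaped dot would match any character, so a pattern built by hand can exclude more than intended.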
One tool that is worth checking out and that promises to solve the problem of having to check your referral report manually and clean it up regularly is Simo Ahava’s spam filter insertion tool. Let me know if you have tested how well it works!
What about the Google Analytics setting “Exclude all hits from known bots and spiders”?
This Google Analytics standard feature, which can be found in the data view settings in the admin area of your Google Analytics account, is quite a useful idea, but in reality, it does not have much of an impact. You can test it yourself by creating one data view with this feature activated and another data view without it. You will see that the difference is marginal. Using this setting is definitely a good idea, but it leaves you far from solving the problem.
What does the solution of the future look like?
In future, I hope for a solution that does not consist in fighting spam, but in identifying real users better. If we can identify real users on a website by the way they behave, we can look at real users only and ignore spam completely.
There are already some very nice and helpful scripts out there that measure user behaviour (and at the same time, without it being their main purpose, help us identify real users), like the brilliant Riveted by Rob Flaherty.
If we manage to develop a tool like this that works 100% reliably on all device types, we will not have to worry about spam traffic in Google Analytics anymore. We will just create segments with our real users and analyse what they are doing on our pages.
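The idea of such a segment can be sketched in Python. This is a toy model, not a description of how any existing script works: the `Session` structure and its `engagement_events` field (a count of scroll, click or keypress signals per session) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    pageviews: int
    engagement_events: int  # hypothetical count of scroll/click/key signals


def split_sessions(sessions: list[Session]) -> tuple[list[Session], list[Session]]:
    """Split sessions into those with and without signs of real user behaviour."""
    real = [s for s in sessions if s.engagement_events > 0]
    suspect = [s for s in sessions if s.engagement_events == 0]
    return real, suspect


sessions = [
    Session("a", pageviews=3, engagement_events=12),  # scrolls and clicks
    Session("b", pageviews=1, engagement_events=0),   # loads a page, nothing else
]

real, suspect = split_sessions(sessions)
print([s.session_id for s in real], [s.session_id for s in suspect])
```

The point is the change of perspective: instead of maintaining a list of known spammers, you analyse only the sessions that show positive signs of a real person.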
Join the discussion 23 Comments
I was able to find “Custom Dimensions” in only one of my Tags on Google Tag Manager account. It’s for the Universal Analytics Tag which has “Tracking Type” set as “Pageview”. However, I can’t see Custom Dimensions option in any of the other tags that I have for various events on my website which have “Track Type” set as “Event”.
Can you advise what I am missing here?
Thanks in advance!
You might be using a Google Analytics settings variable for the other GA tags. In this case, you have set the custom dimensions in the GA settings variable, instead of the tags themselves.
I hope this answers your question! Please let me know if anything remains unclear.
Hi! Thank you for this article! When you say the following:
“If you are not using a Google Analytics settings variable, simply make sure you set up this custom dimension in every Google Analytics tag you are using.”..
do you mean selecting it from the drop down list for the Non Interaction Hit?
I think I have to add it to all of my tags so just wondering. Thanks!
Thank you for your question. That sentence refers to two different options you have in GTM for adjusting different settings in GA tags. A few years ago, before GA settings variables existed, you had to set up things like custom dimensions or IP anonymisation in every single GA tag (pageviews, events, etc.) you were using. Now, with GA settings variables, you can manage all settings in one place. That sentence in my article is just for people who are still not using GA settings variables.
I hope this helps! Please let me know if you have any further questions.
We just started getting “traffic” from a variety of non-US countries, mainly India and Pakistan, for a new page on our site. We usually get about 100-150 sessions daily, this morning at 8AM, we are over 300, mostly from India. It is all from Google/Organic or Direct, without any new referral traffic. Does this sound like the same issue you’ve described? We would have to set up GTM to try the solution.
Thanks in advance!
Thank you very much for your question. It’s hard to tell without any additional info if this traffic is entered via the Measurement Protocol, if it’s a bot that’s executing your tracking code, or if there’s another reason. Feel free to send me some more information (screenshots of your GA data etc.) if you want me to have a look. I understand you’d have to start using GTM just to try out the solution described above, so it would make sense to make sure that this is actually Measurement Protocol spam first.
Hi, I am a newbie here.
I am getting a lot of traffic from many domains like this one:
Is all the stuff you guys are talking about here good to filter this type of traffic?
oops I made a mistake:
many domains like these:
It’s hard to tell from just looking at the domains if it’s spam traffic and what kind of spam traffic it is. If you like, you can send me an e-mail with more information and I’ll be happy to have a look.
What is the field’s name? I can’t find name password 🙁
In the new view filter I mean…
Once you’ve created the new custom dimension for the password and given it a name (“password” in the example above, but you can use another name), you will be able to find this name in the view filter field at the very bottom of the dropdown menu (custom dimensions always show right at the end).
I hope this helps! Please let me know if you have any further questions.
Brilliant suggestion. I’m working with an e-commerce client who currently gets about 150-200 hits per day, mostly new users. They reached $1 million in sales last year but that was unprofitable PPC (literally no profit and the PPC firm kept saying the answer was more spend!) I’m trying to get them back to that level but keep money in their pocket, so I’m doing my best to get everything as clean as possible so my experiments have legit data to validate bigger investments in SEO and PPC as well as a site design overhaul. I set up a hostname filter as well as referral spam filters for campaign source using regex expressions from Carlos Escalera.
Because the traffic isn’t great enough to see a difference immediately from those filters, I set up a segment to test and ran it on historical data. To my surprise, the segment only found about 7 users who were identified as spam in the 3 months I’ve been working with this client. It did find over 500 in the last year, but I think more of that was click fraud, since they were running more adwords before I came on board.
The thing is, I don’t believe there have been only 7 spam users on the site in the last three months. So I followed your suggestion, too. Rather than setting it up as a filter, I started with a segment. It’s giving me zero data so far, so I think I did something wrong, because people have hit the site in the time this is running my “password-protected” dimension in analytics/tag manager.
Regardless if I did that correctly (I’ll go through it again), I haven’t found anyone talking about the issue I’m trying to solve: I have users who come by one to four times per day, usually business days, sometimes hundreds of times over months and never buy. (It’s an industrial supply site for wire and cable– no one window shops this stuff. For price, sure, but at some point, you’d quit shopping us if we were always too expensive.) How many of those can be real users? Some bounce every time they show up, and some seem to have legit behavior as seen in the User Explorer, but why would they come to the site and search for a different product almost every day?! Doesn’t make sense. I feel like there’s a bot or a type of sweat-shop traffic scheme hitting the site that I can’t figure out.
Invalid hostnames are small. Invalid referral traffic is pretty low, too. Where do I start? I have this crazy-even distribution of users hitting a few out of thousands of pages one time and bouncing (and like I said, some users don’t bounce every time but do hang out and view more than one page), but it’s lots of users. I feel like there’s an army out there tag teaming the whole site. One user even spends about 25 minutes on the site once every few days, like it’s following an algorithm for passing the sniff test against spammer behavior. Gah! This is frustrating. Have you seen this?
Thank you very much for your interesting question. I guess you will find some users with unusual behaviour on most websites: Employees of the company, developers and contractors working on the website, clients or business partners that visit the website for other reasons than buying, etc… And there are also some crawlers and bots that execute the GA tracking script (most don’t, so 99% of bots don’t even show up in GA).
I often find it difficult to ignore this kind of stuff and focus on the metrics that are actually important, so I absolutely understand your concerns.
One method I have had good experience with is using scripts that measure and identify real user behaviour (e.g. scrolling activity), like this one: https://riveted.parsnip.io/
Once you’ve set this up, you can create a segment with sessions from real users (ones that show real user behaviour) and another one without (probably bots that execute GA). And even if this might not help you identify the bots and crawlers, at least it’s a way to focus on good data and exclude the rest from your view.
I hope this helps. Please let me know if there is anything else I can do for you.
Thanks, Eoghan. Another helpful answer. I suppose I am a little overly concerned with this user behavior, and since there is no pattern that others are seeing with respect to similar illegitimate traffic, I have to assume it’s real. So your suggestion to use a script to measure scrolling is a great idea. I’ll keep at it.
Very interesting. Never thought of that.
I’ve tried this on my website and it seems to work as a segment. But not as a filter (=0 traffic).
Any idea why this is the case? I’ve followed your steps carefully 🙂
If the solution works as a segment, this means that the data is being tracked correctly in Google Analytics, with the “password” value in a custom dimension. So the reason that the filter isn’t working might be due to an error in the filter configuration.
Did you apply the filter to an existing data view and the view then stopped tracking traffic? Or did you create a new view with this filter and there wasn’t any traffic in the view? In the latter case, this might just be due to the fact that new views never contain any historical data. If you wait a bit, you should see data starting from the moment you created the view.
If the above doesn’t help, please feel free to share more information and I’ll be happy to have a closer look.
With the “Step-by-step guide for setting up a spam filter with Google Tag Manager”, that will not stop spammers if they inspect your page, correct? They could see the index there and simply add it to their Measurement Protocol hit to pass the spam through to Google Analytics.
Yes, that’s absolutely correct! It would be possible to find the password on the website and get past the filter. As those spammers normally just generate tracking IDs automatically, and don’t really care about the websites they spam, this should normally not be an issue though.
I found a tool that seems to work well here:
Previously I was using:
but it seems to have stopped working now for some reason, I don’t know if it’s just me.
Thank you very much, Tony!
Hi Tony — I’m the author of the Quantable spam filter, which indeed was broken, not just for you (my firewall was blocking the oauth pass-back from Google). It should work again now. Thanks!
My spam filter checks what hostnames are receiving traffic so that does help reduce the risk of a hostname filter.
Good article, I agree with the recommendations. Maintaining a big list of bad actors as my filter does (and requires updating) is a lot more ongoing work than adding a key via GTM.
Thank you, Jason. It’s a great honour to have you commenting here. I’ve been a big fan of your work for a long time.
That’s a pretty cool solution, nice!
You’re referring to Simo Ahava’s spam tool, but he’s taken down the tool and said that there are better ways to tackle the problem.
Thank you very much for the heads-up! I hadn’t noticed that the tool was taken down. I will check it out and update the article as soon as I get the chance.