Google Analytics is great for gathering data on who uses your web
application, but becomes worthless if spam sessions start infesting your
data. Here’s how we’ve tried to combat the problem for oddbird.net.
Like many websites, we use Google Analytics to gather data about our
users – what OS and browser they used, how they came to our site, etc.
But a number of months ago we started seeing lots of this:
It’s not a new problem, but it’s particularly problematic for smaller
sites that don’t receive lots of traffic. On a given day, spam hits were
accounting for anywhere from ten to ninety (!) percent of our sessions.
There are many solutions out there; since we mostly saw spam in the
“referral” field, we wanted a simple way to block spam referrals from
being included in our analytics data.
One common approach is to disallow any site visits where
document.referrer matches a known spam domain. There are free
services that create the necessary Google Analytics “filters” for you,
but they must be re-configured frequently as new spammers are added to
the list.
Instead, we tried spam-referrals-blocker, which is a script that
blocks referrals found on a community-contributed list of referrer
spammers. Rather than relying on the owner of the script to update it
periodically with the latest disallowed-list – or maintaining our own fork of
the repo – we decided to fetch the latest list as part of our
build/deploy process, using gulp and gulp-download:
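The same download step can be sketched without any gulp plugins. The version below is a minimal, hypothetical stand-in that uses Node's built-in `fetch` (Node 18+) in place of gulp-download. The function name `downloadSpamList`, the placeholder URL, and the one-domain-per-line format check are all assumptions for illustration, not the values from our actual gulpfile.

```javascript
const fs = require('node:fs/promises');

// Placeholder (assumption): replace with the raw URL of the list you rely on.
const SPAM_LIST_URL = 'https://example.com/spammers.txt';

// Fetch the list, sanity-check it, and write it where the client code imports it.
// `fetchImpl` is injectable so the step can be exercised without network access.
async function downloadSpamList(url, destPath, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // Fail the build rather than shipping a stale or missing list.
    throw new Error(`Failed to fetch spam list: HTTP ${response.status}`);
  }
  const body = await response.text();
  const entries = body
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
  // Assumed format: one bare domain per line. Anything else suggests the file changed.
  const malformed = entries.filter((entry) => !/^[^\s/]+\.[^\s/]+$/.test(entry));
  if (entries.length === 0 || malformed.length > 0) {
    throw new Error('Spam list is empty or not in the expected one-domain-per-line format');
  }
  await fs.writeFile(destPath, `${entries.join('\n')}\n`, 'utf8');
  return entries.length;
}
```

Because gulp 4 treats a task that returns a promise as finished once the promise resolves, a function like this could be exported from a gulpfile, for example as `exports.fetchSpamList = () => downloadSpamList(SPAM_LIST_URL, 'src/spammers.txt');` (the output path is also a placeholder).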
Once we have an up-to-date disallowed-list, we import it with the webpack raw-loader and block any referrer found on the list:
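import spammers from 'raw-loader!./spammers.txt';

window.isSpamReferral = function () {
  // Split on any whitespace so both space- and newline-separated lists work.
  const list = spammers.split(/\s+/);
  const currentReferral = document.referrer;
  if (currentReferral) {
    for (const spammer of list) {
      // Skip empty entries, then look for the domain anywhere in the referrer.
      if (spammer && currentReferral.indexOf(spammer) !== -1) {
        return true;
      }
    }
  }
  return false;
};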
And in our HTML, after the JS file has been executed:
<script>
  if (!window.isSpamReferral()) {
    // ... initialize Google Analytics
  }
</script>
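One limitation of the substring check in `isSpamReferral` is that `indexOf` matches anywhere in the referrer URL, so a legitimate referrer whose path or query string happens to contain a listed domain would also be blocked. A stricter variant, sketched here as an assumption rather than what we deployed, compares hostnames only. The function name `isSpamReferrer` and the example domains are hypothetical.

```javascript
// Return true when the referrer's hostname is a listed domain or one of its subdomains.
function isSpamReferrer(referrer, spamDomains) {
  if (!referrer) {
    return false; // Direct visits have an empty referrer.
  }
  let hostname;
  try {
    hostname = new URL(referrer).hostname.toLowerCase();
  } catch (err) {
    return false; // Not a parseable URL, so there is nothing to match against.
  }
  return spamDomains.some((domain) => {
    const candidate = domain.trim().toLowerCase();
    return (
      candidate !== '' &&
      (hostname === candidate || hostname.endsWith(`.${candidate}`))
    );
  });
}

const exampleList = ['spam-one.example', 'spam-two.example'];
isSpamReferrer('https://www.spam-one.example/offer', exampleList); // true
isSpamReferrer('https://legit.example/?q=spam-one.example', exampleList); // false
```

Taking the referrer and the list as arguments, instead of reading `document.referrer` and the imported file directly, also makes the check easy to exercise outside a browser.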
Bonus: Excluding Internal Traffic
Without much extra work, we can also exclude internal traffic from our
analytics data:
const devHosts = [
  // List your local development servers
  'oddsite.hexxie.com:3000',
  'localhost:3000',
  '127.0.0.1:3000',
];

window.isDevelopment = () => devHosts.indexOf(window.location.host) !== -1;
And our modified HTML:
<script>
  if (!window.isSpamReferral() && !window.isDevelopment()) {
    // ... initialize Google Analytics
  }
</script>
This approach has worked relatively well – in the first two weeks, we
only saw nine spam sessions sneak through. But we weren’t entirely
thrilled with it, either.
First of all, a disallowed-list of domains-to-block is much more difficult to
maintain than an allowed-list of domains-to-allow (even if we’ve off-loaded
most of the maintenance to the community). And second, there’s something
less-than-ideal about fetching a raw .txt file directly from someone
else’s GitHub repo, making assumptions about the format of the file
contents, and then relying on it as part of our build/deploy process.
We haven’t been using this technique for long, but so far the results
have been positive. If it continues to work well, we’ll likely remove
the referral-blocking code entirely.
If you use Google Analytics, how have you tackled the problem of spam
infecting your data? Let us know via Twitter!