October 11, 2012

Battling Bots: comScore’s Ongoing Efforts to Detect and Remove Non-Human Traffic

By: Brian Pugh

Last year I wrote about the important challenge of staying one step ahead of invalid traffic in the audience measurement business. Today, I’m hoping to revisit the subject and help further explain the impact of some forms of invalid, non-user-initiated traffic. Audiences and impressions form the basis for the monetization of digital media, and it is paramount that what we measure reflect the behavior of actual persons and not of bots or other forms of non-human traffic (NHT). It is a responsibility we at comScore take very seriously.

The challenge of filtering out NHT is always evolving, particularly as the landscape becomes more social. In today’s world of digital media, content flows seamlessly across the digital landscape, bouncing from server to server, being exposed to multitudes of audiences and quickly being absorbed into the ether that is the Internet. Publishers invest heavily in making sure their media is continuously being ‘shared’, ‘liked’ and ‘fanned’ so that it may reach and attract new audiences that can be monetized. Such efforts would appear to be paying dividends, with ever-increasing numbers of visitors and events being observed.

But how do we really know for sure that there is an actual, living person accessing this content? It turns out that computer software specifically designed to mimic human behavior online, commonly known as bots, has massively inflated the number of media impressions associated with digital content as a direct result of the social sharing revolution. The increase in activities like registration, voting, commenting and sharing has contributed to NHT increasing from approximately 6% of all web traffic in 2011 to a whopping 36% this year.

This trend is similar in many ways to the spam epidemic that consumed the web back in the early 2000s when email emerged as the standard means of communication. Over the past decade, our industry has developed new techniques and become pretty effective at combating spam. Today, we face the challenge of combating NHT just as effectively.

How NHT Manifests Itself Today
Here is a typical scenario that an Internet user might experience:

“My browser seems to open automatically in the background and plays ads. No Adware or Anti-Virus programs I have seem to stop it. I don't actually SEE my browser running, I just hear the ads and see the browser.exe process running. Even if I quit the browser.exe process, it just comes back about 10 minutes later.”

Scenarios like the above are increasingly common today, and they can generate browsing activity without the user’s intent. Full-page pop-ups, pop-unders and browser hijacks that, following a search click, redirect the user to a site other than the one they intended to visit are some examples of how unintended traffic can be generated. Hidden background processes spawn a chain of events that results in multiple redirects, hopping from server to server before finally reaching a publisher site.

Bots are only the tip of the iceberg with the NHT problem, and the deeper you dig the more interesting it gets. comScore has observed secondary sets of activity, generated from users’ computers while they consume media, that exist in a parallel thread on the user’s computer and that the user never sees. These processes have an explicit intent of driving incremental usage to publisher sites and can therefore wreak havoc on both publishers and advertisers, as it becomes harder to tell whether that content, and the monetization associated with it, is being delivered to an actual human.

Often NHT begins with a user’s machine being infected with some form of malware, either from a site they visited or bundled with a free application they downloaded. A typical malware call would first reach a traffic re-seller. In some cases, the traffic re-seller would then attempt to assess the quality of the received traffic by routing the call through its own test servers and algorithms. The call is then redirected either to another traffic re-seller who serves as a middleman or directly to a publisher site. Even though the server call is completely hidden from the user and doesn’t occur within the user’s browser, it is able to trigger all the elements from the publisher’s web page, including web analytics calls and ads. One indicator of such an event’s deleterious effects: a recent comScore vCE study, which measures validated ad campaign delivery against human audiences, showed that just 2.8% of ads co-occurring with malware processes running on users’ machines were viewable to an actual web user.
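The call path described above (malware, then one or more traffic re-sellers, then the publisher) can be sketched in a few lines of Python. This is only an illustrative toy, not comScore's detection method: the `PageRequest` record, the `in_visible_browser` flag, the `KNOWN_RESELLERS` list and every host name are hypothetical, and a real system would have to derive such signals from forensic analysis.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical re-seller hosts, invented for this example only.
KNOWN_RESELLERS = {"traffic-reseller-a.example", "traffic-reseller-b.example"}


@dataclass
class PageRequest:
    publisher: str
    # Hosts the call passed through before reaching the publisher.
    redirect_chain: List[str] = field(default_factory=list)
    # Assumed to be observable on the client; hidden calls set this to False.
    in_visible_browser: bool = True


def is_suspect(req: PageRequest) -> bool:
    """Flag a request that never appeared in a visible browser window
    or that was routed through a known re-seller on its way in."""
    via_reseller = any(host in KNOWN_RESELLERS for host in req.redirect_chain)
    return (not req.in_visible_browser) or via_reseller


requests = [
    # An ordinary direct visit.
    PageRequest("news.example"),
    # A hidden call that hopped through two re-sellers.
    PageRequest(
        "news.example",
        ["traffic-reseller-a.example", "traffic-reseller-b.example"],
        in_visible_browser=False,
    ),
    # A visit that followed a search click.
    PageRequest("news.example", ["search.example"]),
]

flags = [is_suspect(r) for r in requests]
print(flags)
```

Only the second request is flagged. The hard part in practice is obtaining the two signals this sketch takes for granted, since the publisher's own page sees all three requests as ordinary page loads.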

In addition, malware processes don’t simply trigger web analytics calls on a publisher site but frequently generate a new cookie for the same user upon each call. When this NHT activity gets counted, it can significantly inflate the publishers’ web analytics reports for both unique cookies/visitors and pages viewed. These calls have evolved to such an advanced level that publishers are simply incapable of distinguishing this NHT from quality user-intended traffic and clicks.
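To make the inflation effect concrete, here is a minimal Python sketch. The log format, the field names (`machine_id`, `cookie_id`, `is_malware`) and the sample numbers are all hypothetical and chosen purely for illustration; in particular, the `is_malware` flag is assumed to be known, which is exactly what a publisher's own analytics cannot determine.

```python
from collections import namedtuple

# Hypothetical log record: one analytics call fired on a publisher page.
Call = namedtuple("Call", ["machine_id", "cookie_id", "is_malware"])


def build_log():
    """Build a toy log: 3 human users, one of whose machines is infected."""
    log = []
    # Three human users, each keeping a single cookie across 4 page views.
    for machine in ("m1", "m2", "m3"):
        for _ in range(4):
            log.append(Call(machine, f"cookie-{machine}", False))
    # A malware process on machine m3 makes 10 hidden calls and
    # presents a brand-new cookie on every one of them.
    for i in range(10):
        log.append(Call("m3", f"cookie-bot-{i}", True))
    return log


def summarize(calls):
    """Return (unique_cookies, page_views) for a list of calls."""
    return len({c.cookie_id for c in calls}), len(calls)


log = build_log()
raw = summarize(log)
filtered = summarize([c for c in log if not c.is_malware])

print("raw (unique cookies, page views):", raw)
print("filtered (unique cookies, page views):", filtered)
```

In this toy log, a single infected machine is enough to push the unique-cookie count from 3 to 13 and the page views from 12 to 22, even though only three people were ever present.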

Why Combating NHT is So Important
So why should we even care about NHT? The reality is that if not properly accounted for, this traffic gets counted as audiences and impressions for which marketers end up footing the bill. Recent studies have estimated bot traffic to be anywhere from 4% to 31% of total web traffic in the U.S., which translates to anywhere between $650 million and $4.7 billion of wasted marketing spend.
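As a back-of-the-envelope check on those figures, the short Python sketch below backs out the total spend that each end of the range implies. It rests on an assumption not stated in the studies cited above, namely that wasted spend is simply the bot share multiplied by total spend.

```python
# Figures quoted in the paragraph above.
low_share, high_share = 0.04, 0.31      # bot share of U.S. web traffic
low_waste, high_waste = 650e6, 4.7e9    # estimated wasted spend, in dollars


def implied_spend(waste, share):
    """Total spend implied if `waste` equals `share` of that total.

    Assumes waste scales linearly with the bot share of traffic.
    """
    return waste / share


low_base = implied_spend(low_waste, low_share)
high_base = implied_spend(high_waste, high_share)

print(f"implied base at the low end:  ${low_base / 1e9:.2f}B")
print(f"implied base at the high end: ${high_base / 1e9:.2f}B")
```

Under that assumption, both ends of the range point to a base of roughly $15 billion to $16 billion, so the two dollar figures appear to scale one underlying spend estimate by the two traffic shares.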

Invalid traffic exists to the detriment of the entire industry. Even those who might experience short-term benefits, such as being able to claim a higher audience to advertisers, need to understand that invalid practices are being actively identified and that they might face the reputational downside of knowingly engaging in such practices. One of the ways to tackle this problem is to have a third-party company develop robust NHT detection systems dedicated to filtering out this activity so that audience and impression reporting reflect the behavior of actual people.

It is in the digital ecosystem’s collective interest to actively employ forensic web analysis to ensure that no one is unintentionally supporting non-user-generated traffic. If such activity is discovered, publishers should immediately re-examine relationships with companies that might be enabling this activity. comScore has always taken proactive measures to ensure that NHT doesn’t count towards its digital audience estimates, and more than a dozen years later we continue to innovate our detection methods to make sure we’re staying two steps ahead of the game.
