- February 21, 2008

The Ugly Reality of Using Site Server Data for Media Planning

Marv Pollack

Kevin Mannion’s January 25 blog about engagement metrics on MediaPost’s Online Metrics Insider raises some interesting issues about online measurement, but fails to address the inability of site server data to provide any of the key people-based metrics that are needed for online media planning and analysis. In the world of online media, I never cease to be amazed at how easy it is for some to forget or ignore that basic reality of advertising. Earlier in my career, I spent 14 years at Leo Burnett and can assure you that we didn’t plan our media efforts around advertising to TV sets. No, we advertised to the people watching the TVs. And so it is with the Internet. Online media planners, just like their offline counterparts, fundamentally need to know how many unique people are visiting a particular site, with what frequency per person, and they need to know this by demographic segment. I have yet to see server log data that comes anywhere close to providing this people-based information.

It was left to a comment posted on Mannion’s blog by John Grono of GAP Research in Australia to set the record straight, highlighting some of the key problems with site server data and revealing the ugly reality of what happened in Australia, where the use of server data produced an estimate of that country’s online population more than twice the size of the entire Australian population! As John notes, the Australian experience has shown unequivocally that site server data grossly overstate the true number of site visitors.

The most deleterious problem with site server data is cookie deletion. An important comScore study published last year showed -- beyond a shadow of a doubt -- that 31% of Internet users delete their cookies in a given month, and that the average cookie deleter does so four times per month, resulting in five different cookies for a single site being placed on that user’s computer over the course of the month. Every time an Internet user returns to the same site after deleting cookies, he or she is counted as a new user. Using site server data, this alone can exaggerate the size of a site’s audience by as much as 2.5 times. To make the problem even more acute, the rate of cookie deletion and the frequency of visitation vary so widely from site to site that it is essentially impossible to build a model that predicts the degree of exaggeration for any individual site based on its server data.
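The arithmetic behind that inflation is easy to sketch. The snippet below is a hypothetical back-of-the-envelope model built only from the figures cited above (31% deleters, four deletions per month), not comScore's actual methodology; it assumes the worst case in which every deleter revisits the site after each deletion.

```python
# Hypothetical back-of-the-envelope model of cookie-deletion inflation.
# Assumptions (from the figures cited above): 31% of users delete cookies,
# the average deleter does so 4 times a month, and each deleter revisits
# the site after every deletion, accumulating 5 distinct cookies.

def inflation_factor(deleter_share=0.31, cookies_per_deleter=5):
    """Ratio of cookies counted (what server logs see) to real people."""
    keeper_share = 1 - deleter_share
    cookies_counted = keeper_share * 1 + deleter_share * cookies_per_deleter
    real_people = 1.0
    return cookies_counted / real_people

print(round(inflation_factor(), 2))  # 2.24 under these assumptions
```

Under these assumptions a server-based count overstates unique visitors by roughly 2.2x from cookie deletion alone, in line with the "as much as 2.5 times" figure; sites whose deleters visit more often would see a larger multiple.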

Beyond the cookie deletion problem, there are other problems with server data:

  1. Cookies (or page tagging / beacon approaches) are incapable of accurately counting the true number of individual users:
    1. The same person may use different computers (e.g., a work machine and a home machine) to visit the same site in a day and will be counted as two visitors.
    2. Different people using the same computer to visit the same site will be counted as one visitor.
  2. Server data cannot reliably determine if the visitor is a real person or a computer. In fact, some industry observers have estimated that “bot” traffic at a wide variety of sites now accounts for more than 30% of all traffic to those sites.
  3. Using server data, there is often no reliable way to identify the geographic location of a visitor. For many U.S. sites with large numbers of international visitors, this can inflate server-based estimates of U.S. audiences by as much as 4.5X, incremental to the overstatement caused by cookie deletion.
  4. Last, but by no means least, server data provide no information on the demographic characteristics of site visitors.
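The first problem above -- that cookies count browsers, not people -- can be illustrated with a toy simulation. The visit data below are entirely hypothetical and purely illustrative:

```python
# Toy illustration: a cookie-based counter sees devices, not people.
# Each visit is (person, device); all names here are made up.
visits = [
    ("alice", "alice-work-pc"),   # Alice visits from work...
    ("alice", "alice-home-pc"),   # ...and again from home: 2 cookies, 1 person
    ("bob",   "family-pc"),       # Bob and Carol share one machine:
    ("carol", "family-pc"),       # 1 cookie, 2 people
]

cookie_count = len({device for _, device in visits})   # what server logs report
people_count = len({person for person, _ in visits})   # what planners need

print(cookie_count, people_count)  # 3 3
```

Here the two errors happen to offset exactly (three cookies, three people), but that is pure coincidence: multi-device use inflates the count while shared machines deflate it, and server data give no way of knowing how the two net out for any given site.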

It is challenging for publishers to reconcile the conflicting data they get from multiple metrics sources, and understandable in today’s competitive environment that there is a desire to tout the highest audience count possible to both advertisers and investors. However, as comScore and other industry bodies have worked to educate users on the fundamental differences between server data from web analytics providers and panel data from audience measurement companies, it is encouraging that publishers are increasingly relying on the audience measurement data when touting audience size. This is as it should be.

Intriguing though the concept may appear, the claim that server data can be integrated with panel data in order to obtain accurate online audience data is fundamentally flawed. No amount of “black box” manipulation can correct for the fact that server data are incapable of knowing who is on the computer visiting the site. You simply can’t make a silk purse out of a sow’s ear.

I fail to see how Quantcast can claim that panels are incapable of measuring the mid to long tail of the Internet (“the fragmentation of the Web kills the utility of the panel”) while at the same time saying that they start with panel data to adjust the server data. You can’t have it both ways. If the panel’s sample sizes are too small to measure these sites, how on earth can the same panel data be used to correct for all the errors inherent in server data when it comes to counting people? No, the simple fact is that this methodology will end up relying on the server data and exaggerating the size of sites’ visitor bases.

As a final point, I think it’s important to reiterate John Grono’s observation that some of the largest and most important sites in Australia decided not to provide their server data to third parties. This would appear to be a “deal killer” for any integration initiative before it even gets started. If many of the most popular Web sites in Australia decided to “not play ball,” I think we can take it for granted that in the U.S. market many more sites will decide that their competitive position will NOT be enhanced by sharing their data with third parties.
