Comscore Crawler

A crawler, also known as a spider or a bot, is the software Comscore uses to visit and access the content of webpages.

The Comscore crawler:

Identifies itself,
Only downloads the static, textual content,
Honors the rules of a robots.txt,
Doesn't execute JavaScript to generate ad impressions,
Crawls at a slow rate by default.

FAQ

Here are answers to the most common questions. If you need to know more, please contact us.

Why does the crawler visit my site?

Comscore's contextual content analysis enables advertising partners to determine the best matching campaign for a page's content.

When does the crawler visit my site?

When an ad is about to be served, the crawler visits the page and analyzes its content contextually. How often a page is visited depends on many factors, such as the type of content, how often the content changes, and the number of ad elements on the page.

Sites may also be crawled in a linear fashion to provide site-level analysis to advertising partners who are interested in a specific site.

How does the crawler identify itself?

The crawler identifies itself with the user-agent:

Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)
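Publishers who want to recognize these requests in their own logs or middleware could start from a minimal Python sketch like the one below. The function name is hypothetical, and matching on the proximic token alone is an assumption. A User-Agent header can be spoofed, so this is a convenience check, not authentication.

```python
def is_comscore_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header carries the 'proximic' token.

    Assumption: a case-insensitive substring match is good enough for
    log filtering. It does not prove the request came from Comscore.
    """
    return "proximic" in user_agent.lower()


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)"
    print(is_comscore_crawler(ua))
```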

How can I whitelist the crawler?

Many premium publishers explicitly allow our crawler to access their sites. Publishers benefit from our analysis and gain deep insights on their inventory to optimize direct sales and to accurately target campaigns.

To whitelist our crawler, add a separate paragraph to your robots.txt like this:

User-agent: proximic
Disallow:
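To check that a separate proximic paragraph behaves as intended, the rules can be tested locally with Python's standard urllib.robotparser module. The sketch below is illustrative only: the sample robots.txt (which blocks all other bots), the otherbot name, and the example URL are assumptions, and Python's parser may differ in details from the crawler's own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every bot is blocked except proximic.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: proximic
Disallow:
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://www.domain.com/news/article.html"
    print(is_allowed(ROBOTS_TXT, "proximic", page))
    print(is_allowed(ROBOTS_TXT, "otherbot", page))
```

An empty Disallow line means nothing is disallowed for that user-agent, which is why the proximic paragraph overrides the general block.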

Can the crawler send custom authentication headers?

For those who need a more secure method of whitelisting, we can also add custom headers for requests to your site.

Does the crawler extract any of my content?

The crawler does not extract or store any source code. It only provides data about the publicly available content of the page, such as the content language, the content's rating (G, PG13, R) and the relevant IAB categories of the content (e.g. "Real Estate::Buying/Selling Homes").

This analysis helps advertisers place topically relevant campaigns in a safe environment. Relevance drives CPM, which benefits publishers.

Why does the crawler access invalid URLs?

In general, this should not happen. Unfortunately, some advertisers strip the URL parameters, which means a working URL like www.forum.com/showthread.php?t=123 is reduced to something like this: www.forum.com/showthread.php?
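The effect of parameter stripping can be reproduced in a few lines of Python, which may help when tracing such requests in server logs. Both function names are hypothetical, and the exact way an advertiser strips parameters is an assumption based on the example above.

```python
def strip_parameters(url: str) -> str:
    """Drop everything after the first '?' but keep the '?' itself,
    mirroring the truncated URL shown in the example above."""
    base, separator, _ = url.partition("?")
    return base + separator


def looks_stripped(url: str) -> bool:
    """Heuristic: a URL ending in a bare '?' probably lost its parameters."""
    return url.endswith("?")


if __name__ == "__main__":
    print(strip_parameters("www.forum.com/showthread.php?t=123"))
```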

How do I exclude this crawler?

If you want to keep our crawler out of specific sections of your site, add a separate paragraph to your robots.txt and specify the path you'd like to exclude:

User-agent: proximic
Disallow: /path/

Make sure that the robots.txt is in the correct location. It must be in the top directory, e.g. www.domain.com/robots.txt.

Placing the file in a subdirectory won't have any effect. Also note that the IP addresses used by the crawler change from time to time, and that it may take up to a day for changes in robots.txt to propagate across all systems.
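The location rule and the path exclusion can both be checked with a short Python sketch using only the standard library. The helper names and the example URLs are hypothetical, and Python's robots.txt parser is used here as a stand-in. It is not the crawler's own implementation.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: proximic
Disallow: /path/
"""


def robots_txt_url(page_url: str) -> str:
    """Return the top-level robots.txt URL for the site hosting `page_url`."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_excluded(robots_txt: str, url: str, agent: str = "proximic") -> bool:
    """Return True if the rules in `robots_txt` keep `agent` away from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(robots_txt_url("https://www.domain.com/blog/post.html"))
    print(is_excluded(ROBOTS_TXT, "https://www.domain.com/path/page.html"))
    print(is_excluded(ROBOTS_TXT, "https://www.domain.com/other/page.html"))
```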

How do I control the rate at which the bot visits my site?

Our bot usually makes about one request per second. However, you can slow it down by adding the Crawl-Delay directive to your robots.txt. A Crawl-Delay setting tells the bot how many seconds to wait between two consecutive requests.

User-agent: proximic
Crawl-Delay: 2

With the above setting, our bot will make no more than one request every 2 seconds.
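A Crawl-Delay value can be read back with Python's urllib.robotparser to confirm that the file parses as expected and to see what request rate it implies. This is a sketch under assumptions: the helper names are hypothetical, and the rate calculation simply restates the arithmetic above (one request per delay interval).

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: proximic
Crawl-Delay: 2
"""


def crawl_delay_for(robots_txt: str, agent: str = "proximic"):
    """Return the Crawl-Delay (in seconds) set for `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


def max_requests_per_minute(delay_seconds: float) -> float:
    """Upper bound on requests per minute for a given delay between requests."""
    return 60 / delay_seconds


if __name__ == "__main__":
    delay = crawl_delay_for(ROBOTS_TXT)
    print(delay, max_requests_per_minute(delay))
```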