A spider, also known as a crawler or a bot, is the software Proximic uses to visit and access the content of webpages. Proximic's spiders:
- are friendly and identify themselves,
- only download the static, textual content,
- honor the rules of a robots.txt,
- crawl at a slow rate by default.
Here are answers to the most common questions. If you need to know more, please contact us.
Why does the Proximic spider visit my site?
Proximic's content analysis enables advertising partners to determine the best matching campaign for a page's content to achieve the highest CPM for you as a publisher. Proximic works with many advertising partners and it is very likely that one of them is serving ads to your site.
When does the Proximic spider visit my site?
When an ad is about to be served, the spider crawls the page, our system processes the page's content, and the page-level analysis is provided to the requesting advertiser. How often a page is crawled depends on many factors, such as the type of content, how often the content changes, and the number of ad elements, so crawl frequency varies from site to site.
Sites may also be crawled in a linear fashion to provide site-level analysis to advertising partners who are interested in a specific site.
How does the Proximic spider identify itself?
The spider identifies itself with the user-agent: Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)
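As a minimal sketch of how a publisher might recognize this user-agent in server logs or request handling, the Python function below matches the `proximic` token from the string quoted above. The function name and the exact matching rule are assumptions made for illustration. User-agent headers can be set by any client, so a match is a hint rather than proof of origin.

```python
import re

# Pattern built from the user-agent string quoted above.
# The matching rule itself is an assumption, not an official specification.
PROXIMIC_TOKEN = re.compile(r"\(compatible;\s*proximic;", re.IGNORECASE)


def is_proximic_user_agent(user_agent: str) -> bool:
    """Return True if the header value looks like the Proximic spider's user-agent."""
    return bool(PROXIMIC_TOKEN.search(user_agent or ""))


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
    print(is_proximic_user_agent(ua))
```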
How can I whitelist the Proximic spider?
Large publishers like PC World explicitly allow our spiders to crawl their content. Publishers benefit from our analysis and gain deep insights on their inventory to optimize direct sales or accurately target campaigns.
To whitelist our spiders, please add a separate paragraph to the robots.txt like this:
User-agent: proximic
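To check what a whitelist paragraph would permit before publishing it, one option is to run the rules through Python's standard `urllib.robotparser`. The sketch below is illustrative only: the empty `Disallow:` line and the catch-all `User-agent: *` paragraph are assumptions added to make the example complete, `www.example.com` is a placeholder, and the standard-library parser stands in for the spider's own robots.txt handling, which may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: the empty "Disallow:" (allow everything) and the
# catch-all paragraph are illustrative additions, not quoted from Proximic.
WHITELIST_RULES = """\
User-agent: proximic
Disallow:

User-agent: *
Disallow: /
"""


def robots_allows(robots_txt: str, agent_token: str, url: str) -> bool:
    """Evaluate robots.txt text for a user-agent token using the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    page = "https://www.example.com/article.html"  # placeholder URL
    print("proximic:", robots_allows(WHITELIST_RULES, "proximic", page))
    print("otherbot:", robots_allows(WHITELIST_RULES, "otherbot", page))
```

The function takes the short token `proximic` rather than the full user-agent header, because that is the name the robots.txt paragraph is keyed on.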
Can the spider send custom authentication headers?
For those who need a more secure method of whitelisting, we can also add custom headers for requests to your domain. Please contact us and we'll be happy to set these up for you.
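On the publisher side, a custom header could be verified along these lines. This is a sketch under stated assumptions: the header name `X-Crawler-Token` and the secret value are hypothetical placeholders, and the actual header would be whatever is agreed with Proximic.

```python
import hmac
from typing import Mapping

# Hypothetical header name and value; the real ones would be agreed on request.
HEADER_NAME = "X-Crawler-Token"
EXPECTED_TOKEN = "replace-with-agreed-secret"


def has_valid_crawler_header(headers: Mapping[str, str]) -> bool:
    """Return True if the request carries the agreed custom header value."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(HEADER_NAME.lower(), "")
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(supplied.encode(), EXPECTED_TOKEN.encode())


if __name__ == "__main__":
    print(has_valid_crawler_header({"x-crawler-token": "replace-with-agreed-secret"}))
    print(has_valid_crawler_header({"User-Agent": "Mozilla/5.0"}))
```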
Does the spider extract any of my content?
We do not extract and store any source code, but only provide data about the page to our advertising partners, such as the content language, the content's rating (G, PG13, R) and relevant IAB categories of the content (e.g. "Real Estate::Buying/Selling Homes").
This analysis helps the advertiser place topically relevant campaigns in a safe environment. Relevance drives CPM, which is your win.
Why does the spider access invalid URLs?
In general this should not happen. Please contact us and we will find out what is causing it.
One known cause is that some advertisers strip the URL parameters, so a working URL like www.forum.com/showthread.php?t=123 is reduced to something like this: www.forum.com/showthread.php?
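To tell whether an invalid request in your logs matches this parameter-stripping pattern, a small check like the following may help. It is a sketch: the function name is hypothetical, and the set of required parameters (here `t`, from the forum example) depends on your own site.

```python
from urllib.parse import parse_qs, urlsplit


def missing_params(url: str, required: set) -> set:
    """Return the required query parameters that are absent from the URL."""
    query = urlsplit(url).query
    # parse_qs drops parameters with blank values by default, so "t=" counts as missing.
    present = set(parse_qs(query).keys())
    return set(required) - present


if __name__ == "__main__":
    working = "http://www.forum.com/showthread.php?t=123"
    stripped = "http://www.forum.com/showthread.php?"
    print(missing_params(working, {"t"}))
    print(missing_params(stripped, {"t"}))
```

An empty result means all required parameters are present; a non-empty result on a crawled URL suggests the parameters were removed before the URL reached the spider.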
How do I exclude this spider?
We work successfully with many large publishers. Please feel free to contact us if you have any concerns or questions. If you want to exclude our spiders from specific parts of your site, please add a separate paragraph to the robots.txt and specify the path you'd like to exclude:
User-agent: proximic
Disallow: /path/
Make sure that the robots.txt is in the correct location. It must be in the top directory, e.g. www.domain.com/robots.txt.
Placing the file in a subdirectory won't have any effect. Furthermore, please note that the IP addresses used by the spiders change from time to time, and that it may take up to a day for changes in robots.txt to propagate to all of our spiders.
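The location rule and the exclusion paragraph can be sanity-checked with a short Python sketch. It derives the top-level robots.txt address for any page URL and evaluates the `Disallow: /path/` rule with the standard-library parser. The function names are hypothetical, and the parser is a stand-in for the spider's own rule handling.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# The exclusion paragraph described above.
EXCLUDE_RULES = """\
User-agent: proximic
Disallow: /path/
"""


def robots_txt_url(page_url: str) -> str:
    """Return the top-directory robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def path_is_crawlable(robots_txt: str, agent_token: str, url: str) -> bool:
    """Evaluate robots.txt text for one URL using the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    print(robots_txt_url("http://www.domain.com/blog/post.html?id=1"))
    print(path_is_crawlable(EXCLUDE_RULES, "proximic", "http://www.domain.com/path/page.html"))
    print(path_is_crawlable(EXCLUDE_RULES, "proximic", "http://www.domain.com/other/page.html"))
```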