Pantheon Community

Tips for making site traffic stats more accurate

Both of those blocks I mention use AJAX. Would this run up the numbers on Cloudflare, too? Because the numbers there are sky high as well.

Thanks for the help Josh. I really appreciate it.

1 Like

Yesterday we were notified that Pantheon is moving us from Performance Small to Large, increasing our monthly cost by 260% based on your secret metrics that we have no evidence to believe are accurate.

We will appeal and request an audit, but that kind of individual solution is not an effective way to address a problem that is clearly impacting many customers. Charging us based on mysterious black box statistics is an irresponsible business practice, and failing to offer transparency into what is behind these statistics is potentially a large security hole.

I’ve always been such a fan of Pantheon and have been referring people to you for years and years. I just can’t express how disappointed I am about this.

5 Likes

@ruby - the metrics are definitely accurate. Our pricing is based on the amount of traffic served by the platform. Every request to the platform is logged and this is the source of the numbers. This does not (and won’t ever) align with other data sources that attempt to measure views. It’s apples and oranges.

@johndubo - Yes, the AJAX requests are likely a big part of your numbers. As per the above, AJAX requests (as well as RSS, XML, JSON, etc) are all going to show up in our traffic stats because they are requests we serve. However, they’re not going to be in any “pageview” metrics like Google Analytics.

Thanks for everyone’s feedback on this thread. I understand that this is difficult for some customers, and I especially empathize with folks who are operating under budget constraints, particularly long-time customers for whom this is coming as a surprise. It’ll frankly be a lot easier for new customers who find out right away that they’re on the wrong plan.

Even though it’s difficult, our pricing structure has to be enforced or else it’s not really meaningful. Our goal is to do this consistently. Happy to take additional feedback on what you think would make the process more fair or equitable.

1 Like

Is there an ETA on when we’ll be able to see these metrics on a request by request basis? Does Fastly provide raw access logs?

@jfoust - more detailed logging is going to take some time. The data is based on the raw logs, but those are platform-wide, which is billions of requests a day. Aggregating and storing those on a site-by-site basis for all sites for all days would be quite costly in terms of both compute and storage, so we need to build a data pipeline that can produce reports that are per-site on demand. It’s a little complicated.

Hey everyone, the Sales Engineering Team at Pantheon has been working with several of our contract customers to help them understand what’s going on with their traffic patterns and we wanted to share some of our findings with you.

First, some clarity
Before we dig into types of requests that count towards your metrics, let’s first explain how we define a request, how requests are calculated, and what types of requests we count.

  • There are two types of measurements, visits and pages served.
    • A visit is a unique combination of IP address and User Agent within a 24-hour period. For example, if you are at home and visit your website from your laptop, and then again from your phone, those are counted as two unique visits. Conversely, 10 users in a campus computer lab who share one public IP address and use the same browser could register as 1 unique visit.
    • Each page served is a single request against your site, which could be a standard HTML response, an API endpoint, or an RSS feed.
  • We don’t count the common bots (or any that identify as crawlers) that regularly hit your site, considering bots like GoogleBot, Yahoo, Bing, SEMRush, etc. We also don’t count bots that identify as uptime monitors like Pingdom or New Relic.
  • We don’t count static assets that are not generated by PHP: for example, icons, documents, or images stored in a theme folder or uploads directory (wp-content/uploads, sites/default/files, etc.). If the asset is dynamically generated, such as when Responsive Images in Drupal renders an image style for the first time, then we do count that call.
  • We don’t count redirects or errors (301/302, 4xx, 5xx).
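
As a rough illustration of the rules above, here is a minimal Python sketch of how visits and pages served could be tallied from request records. The `Request` fields, the `BOT_KEYWORDS` list, the use of a calendar day as the 24-hour window, and the choice to derive visits only from counted requests are all assumptions made for the example; this is not the platform's actual billing pipeline.

```python
from dataclasses import dataclass

# Hypothetical, simplified request record; the field names are assumptions.
@dataclass(frozen=True)
class Request:
    day: str                    # calendar day, standing in for the 24-hour window
    ip: str
    user_agent: str
    path: str
    status: int
    served_by_php: bool = True  # False for static files served directly

# Illustrative keywords only; the real exclusion list is not given in this thread.
BOT_KEYWORDS = ("googlebot", "bingbot", "semrush", "pingdom", "crawler")

def is_counted(req: Request) -> bool:
    """Apply the exclusion rules listed above to a single request."""
    ua = req.user_agent.lower()
    if any(keyword in ua for keyword in BOT_KEYWORDS):
        return False  # self-identified bots and uptime monitors
    if not req.served_by_php:
        return False  # static assets
    if req.status in (301, 302) or req.status >= 400:
        return False  # redirects and 4xx/5xx errors
    return True

def tally(requests):
    """Return visits (unique day + IP + User Agent) and pages served."""
    counted = [r for r in requests if is_counted(r)]
    visits = {(r.day, r.ip, r.user_agent) for r in counted}
    return {"visits": len(visits), "pages_served": len(counted)}

sample = [
    Request("2019-01-01", "203.0.113.5", "Firefox", "/", 200),
    Request("2019-01-01", "203.0.113.5", "Firefox", "/feed", 200),
    Request("2019-01-01", "203.0.113.5", "Safari", "/", 200),
    Request("2019-01-01", "198.51.100.7", "Googlebot/2.1", "/", 200),
    Request("2019-01-01", "203.0.113.5", "Firefox", "/logo.png", 200, served_by_php=False),
    Request("2019-01-01", "203.0.113.5", "Firefox", "/old", 301),
]
print(tally(sample))  # {'visits': 2, 'pages_served': 3}
```

In the sample, the same IP with two different browsers produces two visits, while the feed request adds to pages served without adding a visit.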

Outside of these basic rules, everything else counts as part of your standard internet traffic. We’ll dive into some of the differences below.

Non-human Traffic
This topic has been discussed previously but warrants a mention here.

If you’ve set up an API endpoint or custom scripts on your site that other pages are calling, that’s going to be counted in your Pages Served. A few examples include:

  • RSS feeds
  • JSON feeds
  • API endpoints
  • PHP scripts
  • Modules or plugins with embedded endpoints / direct scripts
  • AJAX calls

A great example of a module or plugin that counts against your pages served would be the Statistics module in Drupal core. Every node visit will additionally call /core/modules/statistics/statistics.php, which will double the number of requests per view.

These pages will never show up in your Google Analytics reports unless you have done something special with the GA API and Virtual Pageviews.

Bots that are falsifying user agents are also going to count heavily toward your Pages Served but not necessarily toward your visits, since they commonly come from a single IP and user agent. If you look into your logs, you may see some old versions of Chrome (as old as 31.x) requesting a single page from one or two IPs.

Common WordPress Patterns
WordPress sites have two common vectors that bad actors will always try out first.

xmlrpc.php
The first is the XML-RPC endpoint, which WordPress introduced a while back for offline content creation that would sync back to your site when you came online. Most of you aren’t using this, and shutting it down would be a very good thing for the safety of your site. There are plugins out there that will do that for you, but we also offer protected paths on the platform for you to lock it down.

wp-login.php
The second one is the login page. As a platform, we don’t have anything built in to stop your content owners and developers from logging in to your site, but the login page does garner a lot of less desirable attention. For our contract customers, we can help out with a customized CDN configuration that will whitelist (or blacklist) IPs or regions, or even a full WAF implementation. Other techniques could include changing the URL of the login page, whitelisting the pages in PHP, and limiting the number of login attempts. You should also consider enforcing strong passwords and multi-factor authentication.

Common Drupal Patterns
Drupal sites don’t have the same obvious pointy bits that WordPress sites do but we are commonly seeing huge spikes in traffic over search results. This was a common technique to cripple the database in the days before SOLR and Redis, but is rarely of concern in this day and age in terms of site stability. It is, however, probably still malicious and should be addressed. Mitigation is a lot harder in these instances and would probably rely on blacklisting suspicious IPs.

To a lesser degree, we do see some probing going on with the Drupal login pages, so it’s worth mentioning. Mitigation techniques would be the same as for WordPress.


Overall, there are many ways to identify, reduce, and deflect these requests to reduce any impact to your site, but these are still requests that are coming into your site - whether or not they’re being served a cached response from the Global CDN. The only way to reliably offset and reduce the number of malicious requests is through a site specific layer of protection, such as an additional CDN layer or WAF.

I understand that you are measuring something different from what Google Analytics and other services measure. However, it makes sense that there should at least be some sort of proportionality between the two scales.

Especially when Pantheon reports us having 12 times more pageviews than Google does, it indicates that something is not right. It could be some errant code on our site that is putting unnecessary strain on your servers, it could be someone maliciously crawling our site, or it could be a problem with how you are measuring traffic. Given that this issue has been reported by many Pantheon customers, it raises the possibility that the issue is not unique to each one of us.

I have no way of knowing which one it is, and neither do you if you don’t audit the numbers. Last week when we were notified of the up-sizing, we responded appealing the change and asking for an audit of the traffic, which is what @greg recommended I do when I asked in Slack. I haven’t received any response to that ticket yet, which is making this increasingly frustrating and frankly disheartening.

It’s not a question of right or wrong. What this indicates is that you have a fair amount of traffic that isn’t being measured by Google Analytics. That could be for a wide variety of reasons, as is explained in this doc and here on the thread.

We’ll get you some more information and hopefully that can help.

1 Like

We’re having the same issue as @ruby describes at the onset. I’ve read all the comments. A couple of things on my mind:

  1. One thing I haven’t seen mentioned as included or excluded is traffic potentially generated by third-party marketing/tracking services. Is this a concern?

  2. I’m looking for a meaningful way to analyze the traffic, to find that silver bullet, “A-ha! It’s that Monitis ping I have set up to hit the site once every 2 minutes.” I’d need to have that tool in the dashboard, or a tutorial on how to do it with New Relic or Loggly or GoAccess (I too have had issues installing that, @anne). Given that the cost of hosting is determined by visits and pages served, I need a ready way to see what’s behind those numbers so I can determine if the traffic is the traffic we want and, if not, work to turn off a high-hit service, tune it, or block it.

Thank you,
Stephen Musgrave
Partner
Capellic

2 Likes

Me again. I’d like to propose that this topic be covered on the January 8 office hours. @joshkoenig, would you be able to get the right people lined up for that?

Thank you,
Stephen Musgrave
Partner
Capellic

3 Likes

@stephencapellic – I can’t promise that date but I am trying to get some folks together to discuss this and will keep you posted. Thanks for the suggestion.

1 Like

@stephencapellic Have you tried parsing your logs with GoAccess? That tool will give you the top IP addresses that have been requesting your site. Usually, you’ll find some IP address that has been requesting your site way more than other ones.
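
For anyone who wants the same "top requesters" view without a dedicated tool, here is a minimal Python sketch. It assumes log lines in the standard "combined" access-log layout, which may not match what your platform writes, so `LINE_RE` would likely need adjusting. The function name `top_requesters` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumes the standard "combined" access-log layout; adjust the pattern
# if your log format differs.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def top_requesters(lines, n=10):
    """Return the n most frequent (ip, user_agent) pairs among parseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not fit the assumed layout
            counts[(match["ip"], match["ua"])] += 1
    return counts.most_common(n)

# Made-up sample lines using documentation-reserved IP addresses.
sample_lines = [
    '203.0.113.5 - - [01/Jan/2019:00:00:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Mozilla/5.0 Chrome/31.0"',
    '203.0.113.5 - - [01/Jan/2019:00:00:02 +0000] "GET /search?q=b HTTP/1.1" 200 512 "-" "Mozilla/5.0 Chrome/31.0"',
    '198.51.100.7 - - [01/Jan/2019:00:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 Firefox/64.0"',
]

for (ip, ua), hits in top_requesters(sample_lines, n=2):
    print(hits, ip, ua)
```

Grouping by IP and user agent together, rather than IP alone, lines up with how visits are defined earlier in the thread.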

2 Likes

I did give it a go and hit a brick wall during the installation process. Furthermore, based upon this fact from a post above, I’m not inclined to force my way through the GoAccess installation process only to get a partial view of the traffic.

1 Like

I’ve collected a list of traffic that is and is not counted as traffic from within this discussion and in docs. There are some specifics I’m not certain about that I’d like to have addressed. @sparklingrobots, can you tell me if the following are counted or not counted?

  1. Google AMP
  2. Facebook Instant Articles
  3. TrustWave scans
  4. Easycron
  5. Lazy Loading of images
  6. Monitis

Thanks!

1 Like

Hey Stephen! I think I can help with some of these:

  • Google AMP and Facebook Instant Articles are counted when those services fetch from your site, or if the page makes AJAX calls back for more data, but not when pages are served from their own cache, since that cache acts as a CDN in front of our CDN.
  • Cron isn’t specifically excluded, so it will be counted.
  • Lazy Loading images doesn’t count since they’re static assets, unless it’s loading thumbnails that are being generated by PHP.
  • I believe TrustWave and Monitis are counted, but if you can find the User Agent for those services, I can confirm for you.
2 Likes

Monitis user agent:
Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)

The user agent is editable but we’ve only ever used the default which is above. If it has ever changed that would be due to Monitis changing the default.

I’m checking into TrustWave.

1 Like

Kyle,

Would an image like this be counted as a call?

/sites/default/files/styles/140px_by_180px_user_pic_epsa_crop/public/magazine-covers/buzz_wu_mar15cov.jpg?itok=fvcGGrx7

This was generated using Drupal image styles. You can find this image on this page: https://thebuzzmagazines.com/magazine

Thanks.

-John

1 Like

Only the first time the image style derivative is generated. After it is generated, then it is served as a static asset and not counted.

1 Like

What about this paragraph:

“We don’t count the common bots (or any that identify as crawlers) that regularly hit your site, considering bots like GoogleBot, Yahoo, Bing, SEMRush, etc. We also don’t count bots that identify as uptime monitors like Pingdom or New Relic.”

For the “uncommon” bots, do we just need to pick those off one by one if they are pinging our site? That seems very labor intensive. Is this what is expected, though?

How do I know what is a common bot vs. an uncommon bot? AhRef? Common or uncommon? Moatbot? Twitterbot? Seznambot?

A Pantheon Support Manager sent me a log from a 24-hour period from last month, and it is filled with references to bots in the user agent info.
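
One way to get a first overview of such a log is to tally the user agents that identify themselves as automated. The Python sketch below is only a heuristic: the `BOT_MARKERS` substrings are an assumption for illustration and are not the platform's exclusion list, and `ExampleBot/1.0` is a made-up user agent.

```python
from collections import Counter

# Heuristic substrings only; an assumption, not the platform's exclusion list.
BOT_MARKERS = ("bot", "crawl", "spider", "monitor")

def tally_bot_agents(user_agents):
    """Count requests per user agent string that looks self-identified as automated."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        if any(marker in lowered for marker in BOT_MARKERS):
            counts[ua] += 1
    return counts

# Sample input: one user agent string per logged request.
sample_agents = [
    "ExampleBot/1.0",  # hypothetical bot
    "ExampleBot/1.0",
    "Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/70.0",  # ordinary browser, not tallied
]

for ua, hits in tally_bot_agents(sample_agents).most_common():
    print(hits, ua)
```

The resulting list still has to be compared by hand against whatever the platform actually excludes, but it shows which self-identified bots account for the most requests.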

Any help would be appreciated.

Thanks.

2 Likes

Interesting because before this I’ve only been told that files/ is static and so not to worry. That isn’t accurate. So if we’re constantly clearing out image cache then that traffic is counted. Of course we think we’re only doing this once in a blue moon but I wonder if a misconfiguration would cause image cache files to be regenerated unnecessarily. Would be nice to have a tool to see the image cache contribution to traffic.
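
To make the effect concrete, here is a small Python sketch of the behavior as described in this thread: a derivative counts only on the request that generates it, and a flush of the image cache makes the next request for each path count again. The event format and the function name are hypothetical, and the model assumes a flush removes every generated derivative.

```python
def counted_derivative_requests(events):
    """Count requests that would hit PHP to generate an image style derivative.

    events is a sequence of ("get", path) or ("flush", None) tuples.
    """
    generated = set()  # derivatives that already exist as static files
    counted = 0
    for kind, path in events:
        if kind == "flush":
            generated.clear()    # assumption: a flush removes all derivatives
        elif path not in generated:
            generated.add(path)  # first request generates the file and is counted
            counted += 1
    return counted

events = [
    ("get", "styles/thumb/a.jpg"),
    ("get", "styles/thumb/a.jpg"),  # served as a static file, not counted
    ("get", "styles/thumb/b.jpg"),
    ("flush", None),
    ("get", "styles/thumb/a.jpg"),  # regenerated after the flush, counted again
]
print(counted_derivative_requests(events))  # 3
```

Under this model, the counted requests grow with the number of distinct derivatives multiplied by the number of flushes, which is why an unnecessary regeneration loop could matter.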

3 Likes