Pantheon Community

Tips for making site traffic stats more accurate

It sounds like even if you get access to the nginx logs (and I’ve never been able to get sufficient historical data from the log files via SFTP), they aren’t necessarily directly related to the numbers of views and visits Pantheon reports.

It is hard to understand why Pantheon doesn’t provide more transparency here - there should be a straightforward way to see what resource views are counting against limits, especially since they are determining price.

@sparklingrobots - I tried following the Advanced CDN link you posted but it seems to be a general page for professional services. Is there some way to see CDN traffic and block unwanted traffic that could be inflating my page views?

We have a 4x discrepancy between Google and our Pantheon metrics and not having realistic tools to figure out why is incredibly frustrating.

@natepixel Ah, yeah–the Advanced CDN is an offering from our Professional Services team. It’s additional configuration to our existing CDN. Sorry for getting jargon-y on you.

Like I mentioned above, there isn’t currently a way to directly audit/review the CDN traffic. I can’t promise anything, but I’m talking to our product team about whether we might be able to do that.

To block any unwanted traffic, you can look at stacking a CDN over ours (whether it’s our own Advanced CDN offering or another CDN of your own choosing), using a WAF, geo-blocking IPs, or a solution like that.

I hear your frustration–I’m working internally to improve this process as best we can.

@anne : my apologies–this is the script I was referring to (https://pantheon.io/docs/logs#automate-downloading-logs). It’s not the same as the one you found, but it does get existing logs from all application containers.

Re: your site only showing 5 pages served in 60 days…do you have folks logging in regularly to add/edit content? Those authenticated users would likely see uncached content (perhaps unless you have Redis set up, but I’d be surprised to see that so low). If you’re serving entirely anonymous traffic, then yes, we do see the CDN serving huge numbers of cached pages that never hit Pantheon’s appservers. It’s part of what keeps a site up during heavy traffic spikes, even on our smallest plans.

@sparklingrobots, I’ve followed the instructions at the #automate-downloading-logs link, which is essentially the same kind of script as the one in the GitHub project that I linked to earlier. Same results. (Although today, we’re up to 7 lines.)

So far today, I’ve spent more than an hour trying to get GoAccess running. < sigh > Still doesn’t work. I’m putting effort into that because most lines in the log don’t make sense to me. Here’s an example:

163.172.105.97 - - [23/Nov/2019:15:31:18 +0000] "\x04\x01\x00\x19\xBC}I\x1D\x00" 400 166 "-" "-" 0.102 "-"
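
For anyone else puzzling over lines like the one above, here is a minimal Python sketch that splits an access-log line into fields and flags requests that are not HTTP at all. The field layout is an assumption inferred from the sample line, not a documented format, and `parse_line`, `LINE_RE`, and `REQUEST_RE` are hypothetical names. The sample is typed with straight quotes, as it would appear in the raw file. The bytes in that request (`\x04\x01` followed by a port and an address) look like a SOCKS4 connect attempt, i.e. a scanner checking for an open proxy, which nginx rejects with a 400.

```python
import re

# Assumed field order, inferred from the sample line:
# ip - user [time] "request" status bytes "referer" "user agent" request_time ...
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<request_time>[\d.]+)'
)

# A normal request line looks like: GET /some/path HTTP/1.1
REQUEST_RE = re.compile(r'^[A-Z]+ \S+ HTTP/\d(\.\d)?$')


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Flag whether the request field is a well-formed HTTP request line.
    entry["is_http"] = REQUEST_RE.match(entry["request"]) is not None
    return entry


# Raw string so the \x.. escapes stay literal text, as they are in the log file.
sample = (
    r'163.172.105.97 - - [23/Nov/2019:15:31:18 +0000] '
    r'"\x04\x01\x00\x19\xBC}I\x1D\x00" 400 166 "-" "-" 0.102 "-"'
)

entry = parse_line(sample)
print(entry["ip"], entry["status"], entry["is_http"])
```

Lines where `is_http` is false are probes rather than page requests, so they say little about real visitors.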

(Aside: I’ve lost track of how much time I’ve spent trying to follow instructions provided to download all logs, then understand the data that is there, but I’d estimate 8-12 hours for three sites. That’s non-billable time. Frustrating doesn’t even begin to describe it.)

You ask: Re: your site only showing 5 pages served in 60 days…do you have folks logging in regularly to add/edit content?

The answer is no. I’m the only user account. So the site is serving virtually entirely anonymous traffic. If that’s so, and the CDN cache is serving virtually all pages, then how am I supposed to figure out how to identify and block the bots (or whatever they might be) that I’d like to keep out?
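
For the requests that do reach the appserver log, one rough way to spot noisy clients is to count hits per IP address and user agent. This is only a sketch: the log layout is assumed from the sample line earlier in this post, `top_clients` is a hypothetical helper, and the sample lines below are made up for illustration (they use reserved documentation IP ranges). It cannot see anything the CDN served from cache, which is exactly the gap being discussed here.

```python
import re
from collections import Counter

# Assumed layout: ip - user [time] "request" status bytes "referer" "user agent" ...
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \d+ "[^"]*" "(?P<agent>[^"]*)"'
)


def top_clients(lines, n=10):
    """Count requests per (ip, user agent) pair and return the n busiest."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # skip lines that do not fit the assumed layout
            counts[(m["ip"], m["agent"])] += 1
    return counts.most_common(n)


# Hypothetical log lines, for illustration only.
sample_lines = [
    '203.0.113.7 - - [23/Nov/2019:15:31:18 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0" 0.050 "-"',
    '203.0.113.7 - - [23/Nov/2019:15:31:19 +0000] "GET /node/1 HTTP/1.1" 200 734 "-" "ExampleBot/1.0" 0.048 "-"',
    '198.51.100.4 - - [23/Nov/2019:15:32:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0" 0.061 "-"',
    '203.0.113.7 - - [23/Nov/2019:15:32:40 +0000] "GET /node/2 HTTP/1.1" 200 690 "-" "ExampleBot/1.0" 0.052 "-"',
]

for (ip, agent), hits in top_clients(sample_lines):
    print(hits, ip, agent)
```

An address or user agent that dominates the list is a candidate for a block rule, but with only a handful of log lines to work from, the result will not explain traffic counted at the CDN.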

Thanks so much for hearing us out. We do appreciate it!

Thank you so much for trying to find a solution to this, @sparklingrobots!

I feel like there’s a fundamental problem with Pantheon charging based on traffic but not giving us any way to see the traffic you measured. It can’t just be a black box where we’re supposed to trust you to decide how much to charge us, right? That just doesn’t seem like a good way to do business.

Without access to this we have no way of knowing if someone is hitting our site with malicious scans or something else that might not only be inflating our metrics but also possibly endangering the rest of your customers as well as us.

Your point about “no way of knowing” is excellent. One concern is that perhaps there’s something in the code that’s being exploited that’s hard to discover without access to complete logs.

In my previous go-around with this, one suggestion was that I temporarily turn off caching for the entire site so that I could see more in my nginx-access.log. Take that with a grain of salt.

Just wanted to provide a short update: I’ve included all of the folks commenting here on an internal feature request to allow access to the Global CDN logs.

While I can’t say when or if this will come to pass, thanks go to each of you for your honest feedback. I hear your frustration. <3

I’d love to find a solution to this. Not accessing the logs. I can do that via FTP. I need to find a solution to Pantheon’s site traffic metrics being radically higher than Google Analytics. Pantheon has already increased our monthly charge, and the new plan is not tenable. We will have to move to another hosting platform if we can’t figure out what is going on. Any advice would be appreciated, because I’d hate to leave Pantheon, but we cannot afford what we are currently being charged. Help!

In lieu of some way to see what Pantheon’s metrics are based on (I agree with @johndubo that log files ain’t it, even if we could get them more easily), it seems like there should at least be a way to appeal when those stats are having a financial impact on us. And in the case of an appeal, I think it should be Pantheon’s responsibility to investigate for potential errors in their data and signs of malicious activity such as crawlers that we can’t do anything about.

I just don’t see how we can be charged for bandwidth that we aren’t actually using in any way that we are aware of or in control of.

Totally agreed that we customers need more information from the traffic metrics, especially since that’s what the pricing is based on. We should be able to get a detailed report of the traffic breakdown for any time period, including all URLs and traffic counts for those URL requests.

I concur 100% with this request.

Is there any movement on this? I installed Cloudflare in front of our site and I don’t think it is going to make much difference. Pantheon’s CDN metrics are about 10x what New Relic and Google Analytics are giving.

There has to be a way for me to figure out why that is. Has to. I have already called Acquia to move over to them because I cannot afford to have to go to a Performance XL plan for my little site that, according to Google Analytics, had 60,000 pageviews in the month of November. Pantheon logged 1.49 million. What the hell? What. The. Hell?!?!? I am so frustrated.

I’m also very interested in getting better visibility into the CDN metrics. The Pantheon metrics report multiple times what Google Analytics reports.

Hey Ruby - we’re definitely open to appeals, and where we can identify bots, crawlers, and status checkers we do exclude those from the metrics. We do this proactively, but also can’t catch everything, so having customers surface issues is definitely helpful.

However, there’s also just a lot of traffic that won’t show up in Google (see the docs for examples) which we do need to account for in the measurements, so it’s never going to be an exact match. Depending on the site and its usage pattern it can unfortunately differ by quite a lot.

I know that’s difficult when it impacts a budget. Over time we’ll be able to show more and more visibility in the metrics UI, but we do have to be consistent and fair with how our pricing works.

Cloudflare isn’t necessarily a good tool to reduce your usage of Pantheon. I’ve seen it actually increase the number of pages served from Pantheon due to pre-fetching.

I know it’s no fun to have a shocker traffic bill. It sounds pretty unusual to have 60k “pageviews” tracked in GA and 1.5M “pages served” from Pantheon, but there could be a high volume of API calls, or clients that don’t report back to GA, or a number of other causes.

Unfortunately from our end we have to be consistent and fair with what we measure and how we charge. In the future we’ll be able to provide increasing detail and insight in the metrics area, but it will never be perfect, and it also cannot possibly match Google since we’re fundamentally measuring different things.

Thanks Josh. This is a Drupal platform we are using. Could things like having, in the sidebar on every page, a mini calendar and a quicktab (module) with the most popular pages and most recent Disqus comments be driving up our number of “pages served”? I’ve wondered if that has more to do with this issue than any sort of malicious traffic or bots.

If your site makes AJAX or other types of requests, that will run up the traffic count.

Disqus probably not specifically, since those requests will be to them (not Pantheon), but I’ve definitely seen cases where something “programmatic” like this results in off-the-charts kind of numbers.
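
As a rough back-of-the-envelope check, the figures quoted earlier in this thread (60,000 Google Analytics pageviews against 1.49 million in Pantheon's metrics) can be turned into a requests-per-pageview ratio. The sketch below assumes, purely for illustration, that every AJAX request that reaches the platform is counted like a page request and that none are served from a browser cache; both function names are hypothetical.

```python
def requests_per_pageview(pages_served, pageviews):
    """How many counted requests there are for each analytics pageview."""
    return pages_served / pageviews


def estimated_served(pageviews, ajax_requests_per_page):
    """Pages served if each pageview also triggers a fixed number of AJAX calls.

    Assumption: each AJAX call is counted the same as a page request.
    """
    return pageviews * (1 + ajax_requests_per_page)


# Figures quoted in the thread: 60,000 GA pageviews vs 1.49 million counted by Pantheon.
print(round(requests_per_pageview(1_490_000, 60_000), 1))

# Two AJAX blocks on every page (the mini calendar and the quicktab).
print(estimated_served(60_000, 2))
```

Under those assumptions, two AJAX blocks per page would roughly triple the count, which is well short of the ratio in the reported numbers, so something else would have to account for most of the gap.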

Both of those blocks I mention use AJAX. Would this run up the numbers on Cloudflare, too? Because the numbers there are sky high as well.

Thanks for the help Josh. I really appreciate it.

Yesterday we were notified that Pantheon is moving us from Performance Small to Large, increasing our monthly cost by 260% based on your secret metrics that we have no evidence to believe are accurate.

We will appeal and request an audit, but that kind of individual solution is not an effective way to address a problem that is clearly impacting many customers. Charging us based on mysterious black box statistics is an irresponsible business practice, and failing to offer transparency into what is behind these statistics is potentially a large security hole.

I’ve always been such a fan of Pantheon and have been referring people to you for years and years. I just can’t express how disappointed I am about this.
