Pantheon Community

Tips for making site traffic stats more accurate

I know it’s pretty common for Pantheon’s site traffic stats to vary from other analytics that measure actual visitors. But our Pantheon stats are more than 4 times what Google says. This page (pantheon.io/docs/traffic-limits) says:

Analytics suites (e.g. Google Analytics) are measuring fundamentally different things vs Pantheon’s request log. While analytics suites focus on measuring visits, our request log more comprehensively measures traffic.
We track every single request to the platform, whereas analytics tools will typically only track complete “pageviews” where an HTML page including a tracking snippet is completely loaded by a browser and can fire off a subsequent request to the analytics platform.

So I’m wondering if anyone has suggestions for where we can look to identify and hopefully stop any services that may be hitting the site and driving the apparent traffic up. This isn’t just an annoyance, the overcounting could potentially cost us hundreds of dollars a month, and we’re a nonprofit so we’re not making money off this traffic (whether it’s real or not).

Here are the monthly stats from Pantheon and Google Analytics for the past 12 months:

2 Likes

Definitely interested in this thread, as this is something we’ve been dealing with, too. As I understand it, a lot of the traffic we’re not seeing is bot-related. We put CloudFlare on top of our busiest site to try to mitigate some of the numbers, but I haven’t seen our Pantheon numbers drop. :frowning:

The frustrating bit is that those numbers are what Pantheon uses for determining the right tier of hosting package and we’re ending up paying for traffic we can’t really control or want to support.

1 Like

In my previous experience with an issue where the Pantheon metrics were wildly different from GA, the culprit ended up being an uptime checker that someone had enabled and pointed at a site with a very frequent check. I ended up finding it through the nginx access logs when some of the requests were logged because the cache had expired.
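In case it helps anyone hunting for something similar: once you have a copy of nginx-access.log, a quick tally of user agents will usually surface an uptime checker, since Pingdom, StatusCake, and the like identify themselves in the user-agent string. Here’s a rough sketch (not an official Pantheon tool) that assumes the user agent is the third quoted field on each log line:

    # tally_user_agents.py -- rough sketch, not an official Pantheon tool.
    # Counts user agents in a downloaded nginx-access.log to surface uptime
    # checkers and other bots; assumes the user agent is the third quoted
    # field on each line (request, referer, user-agent).
    import re
    from collections import Counter

    counts = Counter()
    with open("nginx-access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)
            if len(quoted) >= 3:
                counts[quoted[2]] += 1

    for agent, hits in counts.most_common(20):
        print(f"{hits:8d}  {agent}")

Anything with Pingdom, StatusCake, UptimeRobot, or similar near the top is worth checking against the check interval you actually configured.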

1 Like

Thank you, that’s a great suggestion! We’ve been using Pingdom for years, since before we were on Pantheon. That could well be it. I’ll report back.

I use StatusCake on all of my Pantheon customers, but at 5 minute intervals. The rogue check was running every 2 minutes on top of my valid checks. That added up quick!
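For anyone curious how fast that adds up, the back-of-the-envelope math (assuming a 30-day month) looks like this:

    # Rough math on uptime-check volume, assuming a 30-day month.
    minutes_per_month = 60 * 24 * 30               # 43,200 minutes
    valid_checks = minutes_per_month // 5          # every 5 minutes ->  8,640 requests
    rogue_checks = minutes_per_month // 2          # every 2 minutes -> 21,600 requests
    print(valid_checks, rogue_checks, valid_checks + rogue_checks)   # 8640 21600 30240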

Well it turns out Pingdom isn’t active on our site anymore, so it’s not that. :thinking:

Any other suggestions? Or tips for how to get to our log files to look at the history?

You can grab your nginx-access.log, but it may not be helpful. If requests aren’t making it past Fastly they won’t be there, and any container rotation the site goes through will also nix those logs.

I’m also trying to gather data on what causes some client sites to have Pantheon metrics up to three times what other tools report. The nginx-access.log was not helpful. Any other suggestions?

General question: If Pantheon does not make raw data available, how do they reckon we’re supposed to conduct traffic audits?

Here’s another general question: The page on traffic counts says:

Pantheon excludes automated traffic from legitimate crawlers and bots that would otherwise count towards your website’s total traffic. We do this by examining the user-agent of traffic, as well as the source IP address.

Source: https://pantheon.io/docs/traffic-limits#what-about-bots

Do we know what those legitimate crawlers and bots are? If Pantheon can identify them, that of course implies there’s also an answer for identifying and blocking the illegitimate ones.

Hi, all–I’m trying to wrangle some answers for ya. Thanks for raising the questions & hopefully we can be a little clearer about this soon.

2 Likes

I’m back! And I’m hopeful I can outline a few ways to understand the traffic reports.

At a high level, we are taking this data from the Global CDN. Any traffic served at the CDN level or the application level counts against those traffic limits.

In our experience, there are typically two types of traffic where the data will get really divergent:

    1. Scans/probes on your site 
    2. Non-human-readable content

Regarding scans/probes on your site, we don’t charge for known beneficial traffic (e.g. Googlebots crawling your site). We know it’s outside of your control, and we also know it will help your site succeed–so we want to be sure it gets through unhindered.

At a platform level, we can determine and block widespread malicious traffic. But we can’t really determine “unwanted” traffic on a site level (what’s the old saying? One site’s trash is another site’s treasure?) So if you are getting traffic that you don’t want, you’ll want to block that—e.g. using a WAF, our Advanced CDN, or some other solution. We typically see a lot of probes on things like CMS-standard login pages. Your logs may also show really outdated browsers or operating systems in the User-agent, or lots of traffic from a location where you don’t have customers.
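As a rough illustration (not an official Pantheon tool), a filter like the one below over a downloaded nginx-access.log will show how many requests are hitting common probe targets. The paths here are just examples; adjust them for your CMS:

    # probe_paths.py -- rough illustration, not an official Pantheon tool.
    # Counts requests to paths that are common scan/probe targets; the
    # SUSPECT_PATHS list is an example, adjust it for your CMS.
    import re
    from collections import Counter

    SUSPECT_PATHS = ("/wp-login.php", "/xmlrpc.php", "/user/login",
                     "/.env", "/phpmyadmin")

    hits = Counter()
    with open("nginx-access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = re.search(r'"(?:GET|POST|HEAD) ([^ "]+)', line)
            if match:
                path = match.group(1)
                for suspect in SUSPECT_PATHS:
                    if path.startswith(suspect):
                        hits[suspect] += 1

    for path, count in hits.most_common():
        print(f"{count:8d}  {path}")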

If you want to pull your logs to look for that kind of behavior, the instructions are here. If you have multiple application containers (Performance Medium & up), you’ll want to be sure you’re pulling from all the application containers—the docs have a script to make that easier.
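Once you’ve pulled the per-container logs, it can help to merge them locally before analyzing. A minimal sketch, assuming the download left you with one subdirectory per app server under ./site-logs/, each containing an nginx-access.log:

    # merge_logs.py -- sketch only; assumes each app container's logs were
    # downloaded into its own subdirectory under ./site-logs/.
    from pathlib import Path

    merged = Path("nginx-access-merged.log")
    with merged.open("w", encoding="utf-8") as out:
        for log in sorted(Path("site-logs").glob("*/nginx-access.log")):
            out.write(log.read_text(encoding="utf-8", errors="replace"))
    print(f"wrote {merged}")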

Note: Your logs only contain traffic that actually hits the application containers—so anything served by the CDN isn’t included here. We don’t have an auditing solution currently on the roadmap (it hasn’t been a big request from customers), but I’ve let the product team know there’s some interest.

Regarding non-human-readable content, we consider traffic like API calls or RSS feeds to be legitimate traffic, but they are almost never tracked by something like Google Analytics. That’s another place you can look to explain the discrepancy. There are also edge cases (prefetching content or users who don’t have Javascript enabled) but those don’t apply to the vast majority of our customers.
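If you want to gauge how much of your logged traffic falls into that bucket, something like this will count requests whose path looks like a feed or API endpoint. The path patterns are assumptions; adjust them to match how feeds and APIs are exposed on your site:

    # feed_api_traffic.py -- rough sketch; the FEED_API_MARKERS are
    # assumptions, adjust them to match your site's feed and API URLs.
    import re

    FEED_API_MARKERS = ("/feed", "/rss", ".xml", "/api/", "/wp-json/", "/jsonapi/")

    total = feed_or_api = 0
    with open("nginx-access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = re.search(r'"(?:GET|POST|HEAD) ([^ "]+)', line)
            if not match:
                continue
            total += 1
            path = match.group(1).lower()
            if any(marker in path for marker in FEED_API_MARKERS):
                feed_or_api += 1

    print(f"{feed_or_api} of {total} logged requests look like feed/API traffic")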

Lots of other details here: https://pantheon.io/docs/traffic-limits

I’m thinking I’ll submit some changes to our documentation to try to clear some of this up, but I’m happy to keep discussing & brainstorming here.

1 Like

We did finally learn about this tool that enables downloading of logs from all containers:

Using sftp as described in https://pantheon.io/docs/logs referenced above is not adequate, as the site may be on multiple containers or may have switched containers. Now that we know about the log retriever script, I’ve searched through all the log documentation pages to see how I missed it. I’ll be doggone if I can find a reference to it.

Anyway, using the log retriever script, we were able to get full logs for a Performance Large client site and hacked our way around to get better data. That was extremely helpful. It is rather frustrating to have to do this manually, and we are still guessing at which log entries show us data we really need to know and which are not counted toward the Page View limit. (Multiply that by the number of sites that we manage as an agency, and doing this manually quickly becomes unwieldy.)

However, I’m still befuddled by a small site that I own. The site dashboard shows an average of 2,500-2,700 Page Views per week, but it’s a little teeny site that gets maybe 10-20 page views per week. We downloaded those logs using the pantheon_log_retriever, just as we did for the large site. Those logs show a total of 5 entries for the last 60 days – yes, the logs say only 5 “pages” were served in 60 days.

Rereading another documentation page, I note this:

Requests served by the Pantheon Global CDN will not hit the nginx webserver and will not be logged in nginx-access.log.
(source https://pantheon.io/docs/nginx-access-log)

So could that be why the small site doesn’t have many log entries - all visitors are seeing cached pages? I’m so confused.

There must be an easier way to understand and reconcile the Metrics, which are the basis for how much you/the client pays for the site.

1 Like

It sounds like even if you get access to the nginx logs (and I’ve never been able to get sufficient historical data from the log files via SFTP), they aren’t necessarily directly related to the numbers of views and visits Pantheon reports.

It is hard to understand why Pantheon doesn’t provide more transparency here - there should be a straightforward way to see which requests count against our limits, especially since they determine the price.

@sparklingrobots - I tried following the Advanced CDN link you posted but it seems to be a general page for professional services. Is there some way to see CDN traffic and block unwanted traffic that could be inflating my page views?

We have a 4x discrepancy between Google and our Pantheon metrics and not having realistic tools to figure out why is incredibly frustrating.

3 Likes

@natepixel Ah, yeah–the Advanced CDN is an offering from our Professional Services team. It’s additional configuration to our existing CDN. Sorry for getting jargon-y on you.

Like I mentioned above, there isn’t currently a way to directly audit/review the CDN traffic. I can’t promise anything, but I’m talking to our product team about whether we might be able to do that.

To block any unwanted traffic, you can look at stacking a CDN over ours (whether it’s our own Advanced CDN offering or another CDN of your own choosing), using a WAF, geo-blocking IPs, or a solution like that.

I hear your frustration–I’m working internally to improve this process as best we can.

@anne : my apologies–this is the script I was referring to: https://pantheon.io/docs/logs#automate-downloading-logs It’s not the same as the one you found, but it does get existing logs from all application containers.

Re: your site only showing 5 pages served in 60 days…do you have folks logging in regularly to add/edit content? Those authenticated users would likely see uncached content (perhaps unless you have Redis set up, but I’d be surprised to see that so low). If you’re serving entirely anonymous traffic, then yes, we do see the CDN serving huge numbers of cached pages that never hit Pantheon’s appservers. It’s part of what keeps a site up during heavy traffic spikes, even on our smallest plans.
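If you want to confirm that for a given page, checking the response headers from the edge is a quick way to see whether it’s being served from cache. A small sketch; the header names (age, x-served-by, x-cache) are what Fastly-backed CDNs typically return, so verify them against your own responses:

    # cache_check.py -- quick sketch to see whether a page is served from
    # cache; the header names are typical for Fastly-backed CDNs, verify
    # against your own site's responses.
    import urllib.request

    URL = "https://www.example.com/"   # replace with a page on your site

    request = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(request) as response:
        for header in ("age", "x-served-by", "x-cache", "cache-control"):
            print(f"{header}: {response.headers.get(header)}")

A growing age value and an x-cache of HIT generally mean the CDN answered without touching your container, so nothing lands in nginx-access.log.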

@sparklingrobots, I’ve followed the instructions on the #automate-downloading-logs link, which is essentially the same kind of script as the github project that I linked to earlier. Same results. (Although today, we’re up to 7 lines.)

So far today, I’ve spent more than an hour trying to get GoAccess running. < sigh > It still doesn’t work. I’m putting effort into that because most lines in the log don’t make sense to me. Here’s an example:

163.172.105.97 - - [23/Nov/2019:15:31:18 +0000] “\x04\x01\x00\x19\xBC}I\x1D\x00” 400 166 “-” “-” 0.102 “-”
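Something like this would at least split the lines that parse as ordinary HTTP requests from binary-looking ones like the one above (my guess is those are probes that nginx answers with a 400):

    # split_requests.py -- rough sketch: separates log lines whose request
    # field parses as an ordinary HTTP request from binary-looking ones,
    # which appear to be probes that nginx answers with a 400.
    import re

    http_like = re.compile(r'"(?:GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) /')

    normal = weird = 0
    with open("nginx-access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if http_like.search(line):
                normal += 1
            else:
                weird += 1

    print(f"{normal} ordinary requests, {weird} lines that don't parse as HTTP")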

(Aside: I’ve lost track of how much time I’ve spent trying to follow instructions provided to download all logs, then understand the data that is there, but I’d estimate 8-12 hours for three sites. That’s non-billable time. Frustrating doesn’t even begin to describe it.)

You ask: Re: your site only showing 5 pages served in 60 days…do you have folks logging in regularly to add/edit content?

The answer is no. I’m the only user account. So the site is serving almost entirely anonymous traffic. If that’s so, and the CDN cache is serving virtually all pages, then how am I supposed to identify and block the bots (or whatever they might be) that I’d like to keep out?

Thanks so much for hearing us out. We do appreciate it!

1 Like

Thank you so much for trying to find a solution to this, @sparklingrobots!

I feel like there’s a fundamental problem with Pantheon charging based on traffic but not giving us any way to see the traffic you measured. It can’t just be a black box where we’re supposed to trust you to decide how much to charge us, right? That just doesn’t seem like a good way to do business.

Without access to this, we have no way of knowing whether someone is hitting our site with malicious scans or something else that might not only be inflating our metrics but also endangering us and the rest of your customers.

4 Likes

Your point about “no way of knowing” is excellent. One concern is that perhaps there’s something in the code that’s being exploited that’s hard to discover without access to complete logs.

2 Likes

In my previous go-around with this, a suggestion was to temporarily turn off caching for the entire site so I could see more in my nginx-access.log. Take it with a grain of salt.

1 Like

Just wanted to provide a short update: I’ve included all of the folks commenting here on an internal feature request to allow access to the Global CDN logs.

While I can’t say when or if this will come to pass, thanks go to each of you for your honest feedback. I hear your frustration. <3

1 Like