@stephencapellic – I can’t promise that date but I am trying to get some folks together to discuss this and will keep you posted. Thanks for the suggestion.
@stephencapellic Have you tried parsing your logs with GoAccess? That tool will give you the top IP addresses that have been requesting your site. Usually, you’ll find some IP address that has been requesting your site way more than the others.
I did give it a go and hit a brick wall during the installation process. Furthermore, based upon this fact from a post above, I’m not inclined to force my way through the GoAccess installation process only to get a partial view of the traffic.
I’ve collected a list of traffic that is and is not counted as traffic from within this discussion and in docs. There are some specifics I’m not certain about that I’d like to have addressed. @sparklingrobots, can you tell me if the following are counted or not counted?
- Google AMP
- Facebook Instant Articles
- TrustWave scans
- Lazy Loading of images
Hey Stephen! I think I can help with some of these:
- Google AMP and Facebook Instant Articles are counted when those services fetch from your site, or if the page makes AJAX calls back for more data, but not when the page is served from their own cache, since that cache acts as a CDN in front of our CDN.
- Cron isn’t specifically excluded, so it will be counted.
- Lazy Loading images doesn’t count since they’re static assets, unless it’s loading thumbnails that are being generated by PHP.
- I believe TrustWave and Monitis are counted, but if you can find the User Agent for those services, I can confirm for you.
Monitis user agent:
Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)
The user agent is editable but we’ve only ever used the default which is above. If it has ever changed that would be due to Monitis changing the default.
I’m checking into TrustWave.
Would an image like this be counted as a call?
This was generated using Drupal image styles. You can find this image on this page: https://thebuzzmagazines.com/magazine
Only the first time the image style derivative is generated. After it is generated, then it is served as a static asset and not counted.
What about this paragraph:
“We don’t count the common bots (or any that identify as crawlers) that regularly hit your site, considering bots like GoogleBot, Yahoo, Bing, SEMRush, etc. We also don’t count bots that identify as uptime monitors like Pingdom or New Relic.”
For the “uncommon” bots, do we just need to pick those off one by one if they are pinging our site? That seems very labor intensive. Is this what is expected, though?
How do I know what is a common bot vs. an uncommon bot? AhRef? Common or uncommon? Moatbot? Twitterbot? Seznambot?
A Pantheon Support Manager sent me a log from a 24-hour period from last month, and it is filled with references to bots in the user agent info.
Any help would be appreciated.
Interesting because before this I’ve only been told that files/ is static and so not to worry. That isn’t accurate. So if we’re constantly clearing out image cache then that traffic is counted. Of course we think we’re only doing this once in a blue moon but I wonder if a misconfiguration would cause image cache files to be regenerated unnecessarily. Would be nice to have a tool to see the image cache contribution to traffic.
I got GoAccess working, motivated by the fact that we had our monthly Trustwave vulnerability scan on Jan 2. I certainly see the hits/visitors spike on the Time Distribution graph during this time. I took a look at the Visitor Hostnames and IPs graph. I thought for sure the top IP, when looked up, would indicate that Trustwave owned it, but not so. Here’s the list of the top 5:
184.108.40.206 < Google Cloud
220.127.116.11 < Google Cloud
18.104.22.168 < Google Cloud
22.214.171.124 < Google Cloud
126.96.36.199 < Google Cloud
I can only imagine that this is traffic from a Pantheon CDN? Can someone at Pantheon explain what I’m seeing?
Right, good question. Most of the search crawlers are covered (that is, not counted), such as Yahoo, Bing, Google, Ahref, SEMRush, Yandex, and other spiders/crawlers (including Applebot for Apple News). On-demand bots that fetch previews or metadata, like Slack and Twitter, currently aren’t covered. I don’t think Moatbot or Seznambot are covered, so they will be counted.
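To make that split easier to apply to a log, here is a minimal Python sketch that sorts user agents into the groups described in this thread. The marker lists are my own assumption, built only from the bot names mentioned here and from a guess at how those names appear in user agent strings; they are not Pantheon's actual filter rules, and the Moatbot/Seznambot entries are tentative, as noted above.

```python
# Hypothetical marker lists, built only from names mentioned in this thread.
# They are NOT Pantheon's real filter; the substrings are guesses at how each
# bot identifies itself in its user agent string.
NOT_COUNTED_MARKERS = (
    "googlebot", "yahoo", "bing", "ahref", "semrush", "yandex",
    "applebot", "pingdom", "newrelic", "new relic",
)
# Described in the thread as currently counted (Moatbot and Seznambot
# only tentatively). TrustWave and Monitis are left out as unconfirmed.
COUNTED_MARKERS = ("slackbot", "twitterbot", "moatbot", "seznambot")


def classify_user_agent(user_agent):
    """Return 'not counted', 'counted', or 'unknown' for a user agent string."""
    ua = user_agent.lower()
    if any(marker in ua for marker in NOT_COUNTED_MARKERS):
        return "not counted"
    if any(marker in ua for marker in COUNTED_MARKERS):
        return "counted"
    return "unknown"


if __name__ == "__main__":
    for ua in ("Twitterbot/1.0", "Mozilla/5.0 (compatible; Googlebot/2.1)"):
        print(ua, "->", classify_user_agent(ua))
```

Anything that comes back as "unknown" still needs a manual look, since the thread does not say how it is treated.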
So in that explanation post above, that’s where we mention files being generated by PHP. While this includes image derivatives (remember, they only have to be generated once), it could also include dynamically generated PDF files or other reports (like invoices rendered through a script):
We don’t count static assets that are not generated by PHP. For example, icons, documents, or images stored in a theme folder or uploads (wp-content/uploads, sites/default/files, etc.). If the asset is dynamically generated, such as using Responsive Images in Drupal to render various image styles (the first time), then we do count that call.
For this piece, the next step would be looking at the User Agent associated with the IP address in the raw log (just open it up in a text editor and find the IP). On some of my recent audits, we’ve been seeing foreign botnets utilizing GCP/AWS IP ranges, so the UA and request volume / spread are key in determining those malicious waves.
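If the log is too big to comb through in a text editor, the same lookup can be scripted. This is only a sketch: it assumes a combined-style access log where the user agent is the third double-quoted field on each line (request, referer, user agent), and the sample lines use made-up documentation addresses and a made-up "ExampleScanner" agent.

```python
import re
from collections import Counter

# Assumption: the user agent is the third double-quoted field on each line,
# as in a combined-style access log ("request", "referer", "user agent", ...).
QUOTED_FIELD = re.compile(r'"([^"]*)"')

# Made-up sample lines using documentation-only IP ranges.
SAMPLE_LINES = [
    '203.0.113.7 - - [02/Jan/2020:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Twitterbot/1.0" 0.120 "198.51.100.9, 203.0.113.7"',
    '203.0.113.7 - - [02/Jan/2020:10:00:01 +0000] "GET /magazine HTTP/1.1" 200 900 "-" "ExampleScanner/2.0" 0.300 "198.51.100.9, 203.0.113.7"',
    '203.0.113.7 - - [02/Jan/2020:10:00:02 +0000] "GET /about HTTP/1.1" 200 700 "-" "ExampleScanner/2.0" 0.250 "198.51.100.9, 203.0.113.7"',
    '203.0.113.7 - - [02/Jan/2020:10:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0" 0.100 "198.51.100.90, 203.0.113.7"',
]


def user_agents_for_ip(lines, ip):
    """Count the user agents seen on log lines that mention the given IP."""
    # Guard against partial matches such as 198.51.100.9 inside 198.51.100.90.
    ip_pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)")
    counts = Counter()
    for line in lines:
        if not ip_pattern.search(line):
            continue
        fields = QUOTED_FIELD.findall(line)
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    for agent, hits in user_agents_for_ip(SAMPLE_LINES, "198.51.100.9").most_common():
        print(hits, agent)
```

A single IP that shows one unfamiliar user agent across a wide spread of URLs is the kind of pattern worth a closer look.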
Twitterbot accounted for nearly 5,000 of the 24,000 entries in the log I was given. That is a boatload. We use Twitter to promote our articles. If I block Twitterbot, isn’t that going to mess up our ability to use it to connect with our site?
What about Facebook? I tested out blocking all known bots with Cloudflare and it really screwed things up for our Facebook and Twitter connectivity.
What do I do about that?
Yeah, that would do it. Because yours is a media site, that metadata fetching is important for previews and embeds in social networks, so this would just be part of the legitimate traffic coming to your site. So I wouldn’t block those social bots, because you’re definitely going to need them.
For your type of site (high-traffic, media industry), and assuming you’re not making a lot of changes to existing content, you could extend your TTL in CloudFlare to reduce the number of trips to Pantheon - basically telling CloudFlare to keep cached content for a day or longer instead of the default 1 or 2 hours. Does that make sense?
It does make sense. I set it to 1 day. I’ll see what kind of difference that makes.
And all this really underscores why we need to be able to see the Pantheon Metric logs. I can’t possibly know which bot counts against me and which don’t without seeing logs.
Yep, I hear you. Like Josh said, it’s something that’s on our radar, but hopefully in the meantime, this thread will serve as a good resource that at least outlines how traffic is currently counted on the platform.
It’s a good resource. But without the logs, I just cannot know what to filter out. The Dec 6 log I was sent had Google bot traffic in it; at least, when I entered the user agent and IP into Cloudflare, it marked those as known bots coming from Google. There wasn’t any straight up “Googlebot” traffic in the Dec 6 log, but clearly some user agents from Google “snuck through” and counted against us. We are a small company, and have seen our plan bumped up twice.
I appreciate that Pantheon needs to charge for the traffic that you all handle, but it feels a little sketchy to be so opaque about what is being counted. Especially when I do finally get a peek at a month-old log and find it filled with bot traffic that I thought wasn’t being counted. To say nothing of the fact that the log had about 24,000 entries in it, and when I looked at the metrics on my dashboard for that day, it recorded over 35,000 pageviews. I’m not sure if I said it here or on the support ticket I have open, but it leaves me feeling very cheated.
So this is what I had mentioned earlier to Stephen, that you can see Google IP addresses, which are likely GCP VMs, that are being used for crawling or bots. What you want to do is isolate those IP addresses, and look at the associated User Agent - that would be the key thing to look at.
For the log entries, are you referring to the log analysis provided by support from the CDN, or from GoAccess / Nginx logs? Feel free to reach out to me here via DM and I’ll take a look!
Stephen, in our nginx-access.log files, the first IP address is an IP address of a routing endpoint to the server. The actual IP of the customer is the first IP address, I believe, in the last section of the log entry, in the quotes. This is mentioned here in our Docs: https://pantheon.io/docs/logs#what-is-the-first-line-in-nginx-accesslog
Not sure if that answers your question, but my thought was that your seeing lots of traffic from Google Cloud IPs could be because you were looking at the wrong IP address. We have some info in our Docs about how to configure GoAccess to help with this as well: https://pantheon.io/docs/nginx-access-log (You may have already seen that.)
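For anyone who wants to check this outside of GoAccess, here is a rough Python sketch of pulling the visitor IP from the last quoted section of each line, as described above. It assumes that section is a comma-separated list of forwarded addresses with the visitor first, which is my reading of the description rather than something confirmed; lines where that field is missing or isn't an IP are skipped.

```python
import ipaddress
import re
from collections import Counter

QUOTED_FIELD = re.compile(r'"([^"]*)"')


def client_ip(line):
    """Return the first IP in the last double-quoted field, or None.

    Assumption: the last quoted field is a comma-separated list of
    forwarded addresses with the visitor's IP first.
    """
    fields = QUOTED_FIELD.findall(line)
    if not fields:
        return None
    candidate = fields[-1].split(",")[0].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return None  # field is missing, "-", or not an IP at all
    return candidate


def top_client_ips(lines, n=5):
    """Return the n most frequent visitor IPs as (ip, hits) pairs."""
    counts = Counter(ip for ip in map(client_ip, lines) if ip)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up line using documentation-only IP ranges.
    sample = '203.0.113.7 - - [02/Jan/2020:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Twitterbot/1.0" 0.120 "198.51.100.9, 203.0.113.7"'
    print(client_ip(sample))
```

If the top addresses from this differ from the ones GoAccess reported, the earlier list was probably built from the routing endpoint address at the start of each line.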