Pantheon Community

Tips for making site traffic stats more accurate

@amit One note about your comment with regard to bots continually trying to access different URLs with wp-login.php in them on your Drupal site…

According to @kyletaylored’s comment from Dec. 18, 2019, those requests should not be counting against your Pantheon metrics, since those would be 404 responses:

We don’t count redirects or errors (301/302, 4xx, 5xx).

@gravelpot They counted because they were hitting URLs for Views that won’t return a 404 because of the way Views works. For example, one of the site’s Views is for a blog at /blog and, as is typical, has category pages at /blog/{category}. So all the attempts by bots to hit /blog/wp-login.php weren’t returning 404s.
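To illustrate why a wildcard path argument swallows these probes, here is a minimal sketch in Python rather than Drupal code. The route table and the `status_for` helper are hypothetical and only mimic the /blog and /blog/{category} behavior described above; this is not how Views is implemented.

```python
import re

# Hypothetical route table: a listing page plus a wildcard category argument,
# similar in spirit to /blog and /blog/{category}.
ROUTES = [
    re.compile(r"^/blog$"),
    re.compile(r"^/blog/(?P<category>[^/]+)$"),
]


def status_for(path: str) -> int:
    """Return 200 if any route pattern matches the path, else 404."""
    return 200 if any(route.match(path) for route in ROUTES) else 404


print(status_for("/blog/news"))          # 200
print(status_for("/blog/wp-login.php"))  # 200, treated as a "category"
print(status_for("/wp-login.php"))       # 404
```

The wildcard has no way of knowing that wp-login.php is not a real category, so the request gets a normal page response instead of an error.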

I’ve now added code to settings.php to look for wp-login.php in any URL and return a 404. But we haven’t seen a big drop in metrics, so we are shooting in the dark.
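The shape of that kind of check, sketched in Python for illustration only (the real thing would be PHP in settings.php, and the names `BLOCKED_FRAGMENTS` and `early_status` are made up here):

```python
from typing import Optional

# Hypothetical list of URL fragments that should never reach normal routing.
BLOCKED_FRAGMENTS = ("wp-login.php",)


def early_status(request_uri: str) -> Optional[int]:
    """Return 404 for probe URLs before routing runs, or None to continue."""
    uri = request_uri.lower()
    if any(fragment in uri for fragment in BLOCKED_FRAGMENTS):
        return 404
    return None


print(early_status("/blog/wp-login.php"))  # 404
print(early_status("/blog/news"))          # None
```

The idea is simply to answer with an error status before the wildcard route ever sees the request, since error responses are the ones said not to count toward the metrics.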

In the meantime, the client’s site was automatically bumped from Performance Small to Performance Large at the end of Feb, despite Pantheon support having told me it would be Performance Medium and having taken a full month to get back to us with a partial analysis. I asked for an escalation & audit on Fri, but haven’t received any acknowledgement.

3 Likes

Well… this is horrifying. What a great example of how hard it is to catch this moving target.

If this is the case, it makes me think that it might be a good best practice for anyone running Drupal on Pantheon to be sure they have enabled fast_404 and added wp-login.php to the list of patterns that function tries to match.

1 Like

Crazy, but can confirm I see that on my views too. A little Googling landed me on the views404 module. Going to take that one apart and see what they are doing to address it!

General note: If you have concerns about a specific plan right-sizing, reply to the email you received and let us know what your concerns and questions are and we’ll take it from there.

^ I am doing this, and also opened a ticket.
My plan bumped from $125 to $450 which there is no way I can afford. I do not have the time to be scrambling to find a new host.
I use Pantheon all day at work and love it – we are a partner – but this is no good.
(this is a personal site of mine and apparently is getting hit by Chinese and other bots).

From that ticket below:

Our colleague worked on getting the data you requested. He ran two reports, for January 23, 2020 and February 6, 2020. On both days, there was a steady amount of foreign bot traffic with identifiable user agents (they identify as older Chinese consumer smartphones with outdated versions of Android). These bots generate a lot of unique visitors because they come from a very wide range of IP addresses, bundled with 4 or more user agents. For January 23, these bots contributed around 98% of the total page visits (unique visitors) recorded.

Unfortunately, we are limited in our ability to block bots at the platform or Global CDN level, as these can be legitimate user agents that are spoofed; what makes them identifiable is the volume of requests and the IP range in a short amount of time.

The best course of action long-term is to implement a WAF. If you have any troubles finding the right WAF for you, you can always reach out to your account manager or the support team about our Advanced CDN product that can be used to filter these types of requests (GeoIP blocking, whitelist / blacklist IP ranges, etc) prior to reaching the Global CDN.

Alternatively, you can always block these requests at the application level, at the cost of a performance trade-off until a WAF is put in place. If you block IP addresses at the application level, it will break the CDN level caching, requiring Drupal to respond to all requests, which could cause performance problems during a flood of requests. But any error codes returned by Drupal (4xx/5xx) would not be counted in your metrics tab.
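The ticket's point about many IP addresses "bundled with 4 or more user agents" can be made concrete with a small sketch. It assumes a visit is counted as a distinct IP and user agent pair, which is a reading of the ticket rather than a confirmed definition, and all addresses, user agent strings, and the `count_visits` helper are invented for the example.

```python
from itertools import product


def count_visits(requests):
    """Count distinct (ip, user_agent) pairs among (ip, user_agent, path) tuples."""
    return len({(ip, ua) for ip, ua, _path in requests})


# One human reader: a single IP and browser loading 200 pages.
human = [("198.51.100.7", "Firefox", f"/blog/post-{n}") for n in range(200)]

# A hypothetical bot swarm: 50 IPs, each rotating through 4 user agents,
# making one request per combination.
bot_ips = [f"203.0.113.{n}" for n in range(50)]
bot_uas = [f"OldAndroid/{v}" for v in ("4.0", "4.1", "4.2", "4.4")]
bots = [(ip, ua, "/") for ip, ua in product(bot_ips, bot_uas)]

print(count_visits(human))  # 1
print(count_visits(bots))   # 200
```

Both groups send 200 requests, but under this assumption the reader is one visit and the swarm is 200, which is how a modest amount of bot traffic could dominate a visit-based metric.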

2 Likes

I received similar information as to why my site was receiving too much traffic. It’s reassuring that someone else is having this same issue, but bad at the same time. I looked into using the Advanced CDN and that was far too much money for a personal site. I can’t remember exactly what they priced it out as, but I know it was in the thousands. So that solution was out for me, and I’m guessing for you as well. Another option is Cloudflare, but you still have to pay money on top of your plan to get that working. My question for Pantheon is: if the IPs are coming from malicious sources, then why are they not stopped right then and there? We can’t control or monitor these issues until it’s too late.

1 Like

I would not mind going the Cloudflare route, if the $20 plan will do what is needed, and as long as there are easy-to-follow instructions on how to do it (these are not that easy): https://pantheon.io/docs/cloudflare

But yes, I am not sure why:

  1. Pantheon includes bot traffic in the pricing model.

  2. Pantheon can’t block the bots.

Probably there is a good reason for #2, not sure there is one for #1. IMHO they should eat the bot traffic and include it in the overall pricing. Raise everything 10% to do that if necessary.

HUGE fan of Pantheon and its corporate culture that has been so great for devs like me and customers in general.

Companies get big, then sometimes the “suits” take over and wreck everything. I hope that is not happening here.

2 Likes

It’s also worth considering that on a more traditional platform, if your site is targeted by a high volume of traffic, you might have some options for scaling the infrastructure up to cope with the bump, or you might have a way of blocking that traffic to reduce the spike, or, worst case, the site is inaccessible for a day or three. Other hosting companies don’t just come back and say “sorry, your site’s traffic has dramatically increased without you being aware of it, you now have to pay us 3x - 10x more, there’s no way for you to be aware that such a traffic spike is happening, there’s no way to know when the traffic drops back so that you could go back to previous rates, and there’s no option around it.” It’s just a really unfortunate way of doing business.

3 Likes

Interesting. Those outdated-Android-using Chinese bots were the same ones driving up visits on one of my sites that was flagged by Pantheon for overages.

1 Like

Has anyone tried using http:bl to block traffic using Project Honey Pot? https://www.drupal.org/project/httpbl

@DamienMcKenna To the point raised by one of the Pantheon support folks that was quoted earlier, wouldn’t this require disabling caching on your site?

Without disabling caching, the GlobalCDN (Fastly) layer wouldn’t be aware of the blacklist, and therefore wouldn’t block that traffic, but pages served from GlobalCDN still count towards Pantheon’s metrics.

Following up to my March 5, 8:21 AM comment:

I have not yet heard back from Pantheon. I will ping them again today.

I am thinking of trying Cloudflare’s Pro Plan ($20/month). I chatted with them this morning, explained the situation, and pointed them to this discussion.

The person I chatted with was quite knowledgeable. He/she said:

there are two options

  • you can get the pro plan ($20) and see if the WAF can help eliminate this.
  • in the case this does not work you can add Rate limiting, if you can create a rule that stops the exact behaviour of those specific Bots, you are sorted

and also:

With the WAF you will be able to do this then
But you will have to create your own rule to block those user agents for instance
You can do this on the pro plan
nonetheless, this is for you to do it, if you are looking for the help of a Solution engineer we have the Enterprise plan for that

If the $20 plan can do this, it’s more than worth it. Not 100% clear how difficult it will be to create the blocking rules, but will see how it goes I guess. Seems like any rule I would do on my Drupal site on Pantheon would disable caching so that’s no good.
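For a sense of what the two kinds of rule from the chat amount to, here is a rough sketch in Python. It is not Cloudflare's rule syntax or API; the user agent pattern, the `RateLimiter` class, and the `should_block` helper are all hypothetical, and the thresholds are arbitrary.

```python
from collections import defaultdict, deque

# Hypothetical user agent fragment to block outright.
BLOCKED_UA_SUBSTRINGS = ("Android 4.",)


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per key."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        recent = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


def should_block(user_agent: str, client_key: str, now: float,
                 limiter: RateLimiter) -> bool:
    """Block on a user agent match first, then fall back to the rate limit."""
    if any(fragment in user_agent for fragment in BLOCKED_UA_SUBSTRINGS):
        return True
    return not limiter.allow(client_key, now)
```

Because these bots reportedly spread across a very wide range of IP addresses, keying the limiter on a single IP may catch very little; a subnet or some other grouping might be needed, which is part of why it is hard to say up front how well such rules would work.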

This doc confuses me a bit: https://pantheon.io/docs/cloudflare …I guess I would do Option 1?

But anyhow, in a December 19 comment above, Josh said this:

CloudFlare isn’t necessarily a good tool to reduce your usage of Pantheon. I’ve seen them actually increase the amount of pages served from Pantheon due to pre-fetching.

So not sure what the answer is, if that ^ is correct.

1 Like

@Rick, I could be wrong, but I don’t think that option 1 on that doc would buy you anything in terms of the WAF protection you’re looking for. That is literally only using the DNS functionality of Cloudflare.

This configuration routes traffic to Pantheon’s Global CDN exclusively. Unless you’re paying for advanced Cloudflare features or if you have custom configurations (e.g. many page rules) you’d like to keep, turn off Cloudflare’s CDN so that only DNS hosting services are used

1 Like

Ah OK, thanks. I will go back and ask Cloudflare about that.

We run Cloudflare in front of one of our Pantheon-hosted sites and use option #2 with no issues, other than the fact that you can’t clear the Cloudflare cache from your Pantheon site dashboard, the way you can with Global CDN. You may want to reduce your cache TTL if content timeliness is an issue for your site.

1 Like

@gravelpot: Yes, that is a consideration.

Ultimately this is something Pantheon needs to fix at the platform level, or the community needs to come to terms with the fact that Pantheon is no longer a good hosting solution for sites with small budgets.

4 Likes

I keep thinking about this quote from a Pantheon support ticket:

Here’s what bugs me about this… there is a direct implication here that with the right tools (Advanced CDN or some other WAF), a customer can block this traffic, but that Pantheon won’t block it because its character as “bot” traffic is dynamically determined by looking at the given request volume in a specific timeframe, not just specific user agents or specific IP addresses.

I’m not very familiar with WAF configuration options, but if Advanced CDN or some other WAF would allow a customer to block that traffic (and I’m assuming here that they don’t mean playing whack-a-mole by constantly having to manually adjust blacklist rules as these traffic patterns mutate), then why is it not possible (and desirable) for Pantheon to do the same thing at the Global CDN/platform level?

If these patterns can be identified with such high confidence, what legitimate customer on the platform actually wants this traffic??

One thing that would be very interesting for Pantheon to provide in the name of transparency would be an analysis of what percentage of total daily traffic to the platform is comprised of this exact same kind of traffic that was confidently identified as bots in @rick’s ticket.

4 Likes

There’s likely a plug-in that you could use, though. We’re thinking of going this route… We might also modify the Pantheon advanced cache plug-in so that it ties into Cloudflare.

You’re saying “plugin,” so I assume you’re on WordPress, and I can’t speak to that. There is a Drupal module, but unfortunately the Drupal 7 version doesn’t work with Cloudflare’s current API.

Hacking the Pantheon advanced page cache module/plugin would certainly also be an option, but also seems unnecessary given why this is even an issue in the first place.

@gravelpot agree, and from my previous comment I am not sure why they cannot do option #1 below. If they really need more income to pay for the bots, raise the price on everyone by X%. The question, I guess, is what is X? If it is 10% or less, that is a no brainer to me.

I am not sure why:

  1. Pantheon includes bot traffic in the pricing model.
  2. Pantheon can’t block the bots.

Probably there is a good reason for #2, not sure there is one for #1. IMHO they should eat the bot traffic and include it in the overall pricing. Raise everything 10% to do that if necessary.