Ahrefs’ “billion dollar savings” and the “on-prem bigots”

2024-05-16

TLDR: On-prem is the right choice for Ahrefs, and not only because of cost. They’re a unique business with a unique set of problems that cloud providers are not optimized to address. However, I disagree with the premise that the cloud is only applicable to niche situations. From my perspective, Ahrefs are the ones in the unusual niche, insulated in an environment that isn’t reflective of the most common problems other companies experience. The “on-prem bigotry” emerges from the assumption that their experience is, or should be, the norm.

As the cloud has increased its share of global IT spend, we’ve seen the predictable response in the rise of what I like to call the “on-prem bigots.” These are the people who have an unnatural need to trash the cloud in general, usually by drawing a strawman based on their own narrow experience, and insisting on applying what works for them to everybody else.

Much of this has been led by David Heinemeier Hansson, often known as DHH. He’s become the archetype of the on-prem bigot. Forest Brazeal wrote a fantastic response to DHH’s rants about how everybody needs to leave the cloud. Forest said it far better than I can, and I’ll refer to his post later, so I’ll just leave that link here.

Last year, one of the technical leaders at Ahrefs posted to their tech blog about the savings they had achieved by being on-prem rather than in the cloud. I noted that they missed the fact that “I” family EC2 instances provide exactly the kind of attached storage that they wanted, and probably would have been a better comparison than the “EC2 instance, plus EBS storage” configuration they used.

I also didn’t think their numbers were that accurate, especially as they declined to include Savings Plans, which can save up to 72% on the EC2 instances, or a normal enterprise discount for an account this large, which would have saved at least 40% on everything. It also didn’t reflect what the costs would be if the application had been architected from the start to make best use of cloud capabilities (by which I don’t mean “take the same application, but make it serverless”). But I didn’t think that even the most radical re-architecture would materially change the fact that on-prem made sense for them; it would only change the magnitude of the difference. The post suffered from the general problem of trying to imagine what might have happened in an alternate reality, but there’s nothing special about that.

At the time I didn’t think of Ahrefs (or at least this one writer) as “on-prem bigots.” Rather, they seemed like a decent business with a fairly unique use case that works best with an unusually homogeneous infrastructure (even by on-prem standards) to back it up. Their general advice was reminding readers that on-prem is still a valid choice that should be considered. Mainframes are also still a valid choice for the right use case, so there’s nothing terribly new or radical about that idea. I saw it as an interesting case study, and included Ahrefs as an example in my recent talk about cost management.

But they’ve popped up again. The latest post makes a headline claim of almost $1b of savings, expanding the analysis to include all of the company’s history rather than just a three year period.

The bulk of this latest post isn’t bad either. It addresses some of the most common (and most uninformed) critiques of the first one, and if that’s where it had stopped, I would still not think of it as needing a response. But the language demonizing the cloud increased in volume and intensity, to the point of comparing cloud providers to drug dealers. The big number in the headline intentionally and knowingly leaves out both the normal enterprise discounts and the savings plans, which, while not changing the outcome, might instantly make the headline less clickbait-worthy. My on-prem-bigot-o-meter was triggered.

I’ve got a few days of recovering from dental surgery, so it’s time to respond. I planned on including some thoughts on what a “radical re-architecture around the cloud” might look like but in the interest of keeping this to a reasonable length, I’ll publish it separately.

What’s special about Ahrefs?

Lack of context in the original blog post is why so many pointless arguments were quickly made against it (like “you should be serverless”). It’s easy to read through it and not understand what Ahrefs’ business is or what their unique challenges are. Without that context, it’s easy to come to incorrect and uninformed knee-jerk conclusions about what “should” be done. (This is the flip side of bigotry: the “cloud bigots” who presume that the entire world is a consumer-facing web app with widely variable utilization patterns and regular significant re-architecture.) Ahrefs’ latest blog entry addresses some of these issues, but it is also context-light. The conclusions make a lot more sense in the context of other things on the Ahrefs tech blog, but you have to search for them, as I did.

Ahrefs is an SEO company, providing tools for marketers to optimize their presence online. They run the third largest web crawler after Google and Bing. They index 5m pages per minute on an ongoing basis, and have 170t rows in the key-value store that is their primary storage. Every day they index another 10m new pages and update metrics for 300m pages in their 15.3b page index. [Numbers as of the time of publication.] To manage all this data, they’ve even created their own database software because they couldn’t find anything else that worked in their environment.

They operate an unusually homogeneous backend. All those servers are configured pretty much the same way (though CPU throughput, RAM, and attached storage per server have presumably increased over time). The architecture of the backend software (database, crawler, etc.) does not seem to have ever changed in a way that required rethinking the basic configuration of the infrastructure underlying it.

You are not Google!
(Except Ahrefs kind of is)

One of the bits of advice I always give companies trying to figure out how to operate is that “you are not Google!” Too many companies choose to follow the lead of the largest tech companies, who usually are trying to solve problems nobody else has. Following them blindly can be incredibly costly.

Ahrefs is an exception. Their crawler and the data it stores exist at Google scale. That alone should give one pause when critiquing their architecture, but it should also give them pause when assuming that their solutions apply anywhere else. The most recent update from Ahrefs dismisses the common critiques, so I won’t bother getting into the details here beyond saying that I agree with all but one of them. Serverless doesn’t make sense for consistent 80+% workloads. Autoscaling doesn’t apply. Spot instances are irrelevant. Containerization is probably a waste of effort. None of the common compute-related optimization strategies make much sense for their use case. If they were operating in the cloud they’d fine tune things like instance types and sizes and would save a few percent, but that would not make a material overall difference.

Their arguments about headcount are also straightforward and as far as I can tell, correct. I think they have a blind spot about how easy it would be for a more conventional company with more conventional staffing and a less homogeneous environment to repeat what they’ve done. As Forest noted in his post, running data centers sucks, most companies don’t do it well, and the kinds of people you need to do it well don’t exist in most places. I started my career in the datacenter. The fact that most companies do it badly isn’t new, and despite some online rants, it isn’t that we’ve “forgotten.” Most of us never knew. It’s one of the reasons companies want to be out of the datacenter business. It’s really hard to do well and the people who do it well are hard to find. Even AWS struggles with this, as I learned the hard way when I failed to keep one datacenter team on a tight enough leash during the pandemic.

One billion dollars!!!???

The proposed cost factors that Ahrefs dismiss are the available discounts. Ahrefs’ accounting for their “savings” is particularly disingenuous for its failure to include those well-known savings for big users. These are dismissed arrogantly with the statement that “for these discounts to make financial sense, they would need to be at least 90%, reducing the costs to just 10% of the original expenditure. You may try… Hopefully, this article will help you articulate better.”

That statement is true. It’s also bullshit.

If you’re trying to make a case using accurate accounting, do accurate accounting. Saying “we didn’t count the biggest thing because it wouldn’t be enough to change our decision anyway,” then headlining your article with “we got a 90% discount on $1b of hardware,” is at best intellectually dishonest. The only reason I can think of to do this is that the writer was more concerned with the “ONE BILLION DOLLARS!” clickbait headline than with taking the otherwise solid accounting through to a more defensible conclusion. Anytime you see big round attention-getting numbers like that in a headline, you should immediately think of a certain caricature villain and take the claim with a large grain of salt.

They also completely neglect that nobody who has long-running EC2 commitments pays full price. Even this blog runs on a reserved instance that saves about 62% off the rack rate. An enterprise would probably use savings plans instead, but the discounts on compute capability are similar.

AWS enterprise discounts are usually applied on top of the savings plan/reserved instance costs, so there’s the potential for an 85% discount or more on the compute, and at least 40% on everything else. Saying “it would still not make sense for us to use the cloud” is valid if that’s what the numbers say. Saying “we can’t be bothered to come up with real numbers, but hey, look at us, we saved a billion dollars” is straight-up lying.
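To make the stacking concrete, here is a minimal sketch of the arithmetic, assuming the enterprise discount multiplies onto the already-discounted savings plan price. The function names are mine, and the percentages are just the headline figures quoted in this post (up to 72%, at least 40%, and Ahrefs’ 90% threshold), not anybody’s actual contract terms.

```python
def stacked_discount(*discounts: float) -> float:
    """Effective discount when each one applies to the already-discounted price."""
    remaining = 1.0
    for d in discounts:
        if not 0.0 <= d < 1.0:
            raise ValueError(f"discount must be in [0, 1): {d}")
        remaining *= 1.0 - d  # fraction of list price still being paid
    return 1.0 - remaining


def required_second_discount(first: float, target: float) -> float:
    """Second-layer discount needed to reach `target` overall, given `first`."""
    return 1.0 - (1.0 - target) / (1.0 - first)


# Illustrative inputs taken from the figures quoted in this post (assumptions).
SAVINGS_PLAN = 0.72  # "up to 72%" on EC2
ENTERPRISE = 0.40    # "at least 40%" enterprise discount
BREAK_EVEN = 0.90    # the threshold Ahrefs says a discount would need to reach

compute = stacked_discount(SAVINGS_PLAN, ENTERPRISE)
print(f"stacked compute discount: {compute:.1%}")
print(f"enterprise discount needed for 85%: "
      f"{required_second_discount(SAVINGS_PLAN, 0.85):.1%}")
print(f"enterprise discount needed for break-even: "
      f"{required_second_discount(SAVINGS_PLAN, BREAK_EVEN):.1%}")
```

With these inputs the stacked compute discount comes out at about 83%. Getting to 85% would take an enterprise discount of roughly 46%, and reaching the 90% break-even would take roughly 64%. That is consistent with both halves of the argument: the discounts don’t flip the decision, but they are far too large to leave out of the headline number.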

I think Ahrefs made the right call, and that wouldn’t change if the number was “only” $400m savings (which is roughly where they would end up with an enterprise discount alone). Why bother going through such extensive analysis then throw away reality at the end? It may be innocuous, but when $400m of savings are left off because “it wouldn’t be enough to matter,” it’s hard not to think there are ulterior motives, which may be as simple as “get a cool headline, more discussion on Hacker News, and impress my boss.”

I thought well of Ahrefs a year ago. But this makes me think less of them today. Maybe it’s just me. I hate headline-chasers and number-inflators.

Is there a general conclusion?

Reading through Ahrefs’ latest conclusions, I see the same arguments made by DHH and other on-prem bigots. They are substituting their own experience for the general case to a degree that I wonder if they even understand what the normal case is.

Forest Brazeal’s argument is instructive here, in that he addresses “real world IT” as I experienced it prior to AWS and since. But there’s something I’ve changed in how I use his analysis, and it’s worth diving into. (When I showed Forest the change at SCaLE two months ago, he generally agreed.) In his 2×2 matrix, I’ve replaced “growth” with “I.T. change.” The two are often correlated, particularly by those who live in the tech world bubble, but they are not the same thing.

*Original by Forest Brazeal, changes mine*

It is possible for a company’s business to change even as their IT environment remains relatively stable. In my experience, the more digital your business is, the more likely it is that I.T. change correlates directly to business change and business growth. If you’re selling physical products, you can see huge change in the products even as many systems remain the same. [You aren’t likely to change your accounting system as you pivot from lawnmowers to children’s toys, for example.] In the tech world, you will find companies like 37 Signals or Ahrefs where the infrastructure hasn’t needed to change much, but they are the exception. If your environment doesn’t need to change much over time, the cloud kind of tilts against you.

Second, those of us in the tech-centric world often minimize (and often demean) those who, for reasons including geography and budget, just aren’t going to be in the “high IT competence” category. By definition, half of businesses must be below average, yet many of us pretend that the entire world can have the same characteristics as the top 10% where most tech commentators exist. Ahrefs is likely in that top 10%, and shouldn’t be assuming that what their infrastructure team has been able to do is the norm. My previous business mostly catered to companies that couldn’t even afford (and didn’t need) a full-time IT infrastructure team. Every change, every incident, every question was a painful experience with the labor billed by the minute. For companies like that, not having to do it themselves is a godsend even if the infrastructure costs are higher. If you’re not a tech company operating at scale, infra cost may not be something you should be chasing.

The cloud providers’ business isn’t all that different from IBM’s 40 years ago. Their scale means they can acquire capacity long-term at favorable costs, and lease it short term at higher cost. It’s a lot easier to do this today because you no longer need to move massive equipment racks from place to place, but the idea is the same. “Buy long term, rent out short” is a time-honored business strategy in every capital-intensive business, and there’s no reason tech should be different. There’s nothing inherently wrong with short-term commitments if that’s what the nature of the business requires. The mistake both sides of this argument make is assuming their circumstances apply to everybody and telling the others that they’re the equivalent of drug addicts. Economics will always favor the longest term commitment that is consistent with business reality, but business realities and planning timeframes vary from weeks to decades.

Follow up: where the money is

In a follow-up post, I’ll consider what a parallel-universe Ahrefs in a different timeline might have done in the cloud. I’ll look beyond the simplistic “we’ll run our existing application architecture on somebody else’s hardware” and see how much the needle might theoretically be moved.

Tags:ahrefs, AWS, cloud, cloud architecture, efficiency, finops, tech cost
