Why you should deploy on Fridays
My episode of the Ship It! podcast has just been published, and I thought I would dive a bit deeper into one of the things we discussed, one that has been very much on everybody’s mind since the CrowdStrike unpleasantness.
Ship It! 114: Deploying on a Friday – Listen on Changelog.com
Question: Should you deploy on Fridays?
Short answer: You probably shouldn’t.
Longer answer: You probably shouldn’t, unless there’s a strong business case for doing so.
Full answer: Read on.
In the beginning, we always deployed on Fridays. By “we” I mean “the industry and the handful of companies I was familiar with at the beginning of my career.” My impression from talking to many people who were in the tech world at the time is that it was not at all unusual.
At the time I was working on financial transaction processing. Not the cool stuff that traders and others do on the front end to execute trades, but the boring behind-the-scenes stuff that has to happen to ensure that securities and money move from one place to another at the end of the day. (More realistically, that the accounting entries determining security ownership are updated, since even in my youth, the US was long past the point of actual pieces of paper changing hands.) At the time, clearing a security trade in the US typically took five days (aka T+5 settlement); it quickly moved to T+3 and has continued to shorten over time. In May of this year, we adopted T+1 as the standard.
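To make the T+N arithmetic concrete, here’s a minimal sketch, in Python, of how a settlement date falls out of a trade date. This is purely illustrative, not anything we actually ran, and it skips exchange holiday calendars for brevity:

```python
from datetime import date, timedelta

def settlement_date(trade_date: date, n: int) -> date:
    """Return the settlement date for a T+n trade.

    Counts forward n business days, skipping weekends.
    A real implementation would also consult an exchange
    holiday calendar; that's omitted here for brevity.
    """
    d = trade_date
    remaining = n
    while remaining > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return d

# A Friday trade under T+5 settles the following Friday;
# under T+1 it settles the next Monday.
print(settlement_date(date(2024, 7, 19), 5))  # 2024-07-26
print(settlement_date(date(2024, 7, 19), 1))  # 2024-07-22
```

Under T+5, a Friday trade settles a full week later; under T+1, the following Monday. The window for catching and correcting errors shrinks accordingly.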
I won’t get into the finer details, beyond saying that for the purposes of institutional trades (which are the bulk of the volume on all exchanges), securities and cash are often not held by the broker that does the trade on your behalf; they’re in a bank or depository trust. This allows the institution to work with multiple brokers. It also means that the broker may not always know in real time what the customer actually owns, leading to the possibility of errors and failures. As a result, there is a lot of back-and-forth after the trade is done to ensure that everything is right. T+5 settlement allowed those information exchanges to happen overnight, one step per day. In the really old days, this might involve somebody running a tape with data over to some clearinghouse’s datacenter and picking up a tape with the results the next morning. (I suppose, in the really, really old days, this would have involved sheets of paper being run around.) Today, information is exchanged much closer to real time, which has allowed the whole process to be compressed from five days down to one, and I can foresee a world where we eventually have same-day settlement.
What you should take away from this is that however much it’s compressed, it’s still a multi-step process requiring data and systems integrity from start to finish, across multiple independent business entities. If there is an error early in the process, it can affect downstream data in your own company’s internal systems, and potentially also data that is exchanged with partners, customers, and counterparties. “We found an error, so we’re going to quickly revert to the previous version of the code” isn’t sufficient. You need to be able to back out bad information that may have proliferated throughout the financial system. The more we compress the processing and move towards real time, the more critical it is that we get it right, and the harder it is to back out errors. Even with safeguards and runbooks in place, recovery can be time-consuming and difficult. In the worst case, the lost time can impact your ability to do business.
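To give a flavor of what “backing out” means, here’s a toy sketch of the compensating-entry pattern that systems like these rely on (entirely illustrative; a real settlement system reconciles across many firms, not one in-memory list). The key idea is that you never delete or edit a posted record; you append a reversal plus a corrected entry, and every downstream consumer has to process both:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    account: str
    security: str
    quantity: int  # positive = receive, negative = deliver
    note: str

ledger: list[Entry] = []

def post(entry: Entry) -> None:
    """Posted entries are immutable; corrections are new entries."""
    ledger.append(entry)

def reverse(entry: Entry, note: str) -> None:
    """Back out a bad entry by posting its exact opposite."""
    post(Entry(entry.account, entry.security, -entry.quantity, note))

# A bad trade booked 1,000 shares to the wrong account...
bad = Entry("ACCT-A", "XYZ", 1_000, "original (erroneous) booking")
post(bad)

# ...is corrected by a reversal plus a fresh, correct booking.
reverse(bad, "reversal of erroneous booking")
post(Entry("ACCT-B", "XYZ", 1_000, "corrected booking"))

# Net positions reflect the correction; the audit trail stays intact.
for acct in ("ACCT-A", "ACCT-B"):
    net = sum(e.quantity for e in ledger if e.account == acct)
    print(acct, net)  # ACCT-A 0, ACCT-B 1000
```

In a multi-firm flow, each of those corrections may also need to be propagated to counterparties, which is exactly what makes recovery slow.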
So we deployed most things on Fridays. Yes, it meant that in the worst case scenario, we were working weekends. Friday overnight processing was always the most likely to fail. But it gave us a full weekend to recover from something bad that might have impact far beyond our firm. It might suck to have the entire team lose their weekend due to somebody’s bad code, but from a business perspective, it sucks a lot less than traders not being able to function the next morning because all their position reporting was completely screwed up. I can think of several times in my early career when those 48 hours made the difference between “open for business on the next business day” and not.
I’ve made much of the term “You are not Google!” on this blog and elsewhere. Yet much of the most common advice about DevOps practices originates with pioneering work at Google and Facebook (Meta), which are unlike anyplace else you’re likely to work. We can certainly learn from them, but we should be cautious about over-relying on their experience.
There is a subtle difference between completely digital businesses like Google or Meta and much of the rest of the world. Those big online companies are… online! Unlike, say, a bank, Google has no reason to exist if it is not online.
The subtle difference is that while banks are utterly dependent on technology, they view it as a tool to enable them to perform the money-making functions that they have always performed and would still be performing (albeit much more slowly) if the computer had never been invented. The needs of the tech people (who are a minority of their total employees) come somewhere behind, and may be in conflict with, the needs of the traders and others who need to show up in the morning to reliable and accurate systems. Places like Meta, Google, and AWS, which are tech-first, have no such inherent conflict. You can prioritize “what’s best for the tech organization” and generally be aligned with “what’s best for the business and customers.”
So let’s take the recent CrowdStrike mess as an example. Almost immediately, we heard cries of “don’t deploy on Fridays” from the ranks of the technorati online. For many of us, it meant a lost weekend, and that sucked. But is it the right lesson to take away from this?
Imagine an alternate timeline in which we were CrowdStruck on a Tuesday morning rather than a Friday morning. It might still have taken days to recover. And for businesses that are closed, or at best partially operational, on weekends, a mid-week failure would have turned a one-day outage into a multi-day one. Having a weekend to recover probably made a big difference to companies whose workers weren’t going to be using the systems over the weekend anyway. On balance, there was probably less business impact from a failure ahead of a weekend than from one during the week.
In My Most Important Lesson, I noted that we in the tech space often see our own importance very differently from the people who run the businesses we work for. If recovery can be lengthy or complex due to downstream impact that may not be in your control, and if the business cannot afford to have that kind of recovery happening during business hours, then you’re going to deploy ahead of a break in the business. That may mean Friday afternoon. Or a Tuesday evening. Or some other time you’d probably prefer not to be on call.
If that’s a problem for you, find a pure tech business, or at least a tech-first business, to work for. Anywhere else, your concerns and needs are likely to be a lesser consideration than the needs of the people who are generating money for the business.
I said at the top that you “probably” should not deploy on a Friday. Most of the applications we deploy have little if any downstream impact, and the potential impact is easy to contain. In those cases, small changes that are easily backed out, combined with a solid CI/CD infrastructure, are the right approach. Many of the apps we build are consumer-facing and expected to be available 24×7, so there’s no “downtime” around which to schedule. Critical periods are best handled by blocking deployments when impact would be greatest and otherwise allowing them to flow normally. (For example, Amazon doesn’t allow non-critical changes to most code ahead of Prime Day and the Black Friday retail peak days.) For most of us, the guideline makes sense even if you aren’t Google.
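As an illustration of that kind of blocking, here’s a hypothetical pre-deploy gate a pipeline could run before promoting a change. The freeze windows and the script itself are my own invention for this post, not a description of Amazon’s actual tooling:

```python
#!/usr/bin/env python3
"""Hypothetical CI/CD gate: refuse to deploy inside a freeze window."""
import sys
from datetime import date

# Illustrative freeze windows as (start, end, reason), dates inclusive.
# Real windows would come from configuration, not hard-coded literals.
FREEZE_WINDOWS = [
    (date(2024, 11, 25), date(2024, 12, 2), "Black Friday / Cyber Monday peak"),
    (date(2024, 7, 15), date(2024, 7, 18), "Prime Day"),
]

def deploy_allowed(today: date) -> bool:
    """Return False if today falls inside any freeze window."""
    for start, end, reason in FREEZE_WINDOWS:
        if start <= today <= end:
            print(f"Deploy blocked: {reason} freeze ({start} to {end})")
            return False
    return True

if __name__ == "__main__":
    # Exit non-zero so the pipeline step fails and the deploy stops.
    sys.exit(0 if deploy_allowed(date.today()) else 1)
```

The point is that the freeze is data, not folklore: the pipeline enforces it mechanically, and changes flow normally the rest of the year.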
The era I described in my Ship It! episode is one in which almost everything we did was core transactional functionality. Today, that functionality is a minority of what a bank’s software does. The vast bulk of what my friends at banks work on today is useful, customer-friendly applications that are far from the number-crunching, transaction-processing functionality that remains at the core of what banks do. As tech has grown to offer more and more functionality to our customers, the percentage of things that “you can’t get wrong, even for a few seconds” has declined significantly. For most of us, deploying on a Friday is a bad idea.
But as with all guidelines, there are exceptions. I’m not convinced that an antivirus update wasn’t one of them, though my opinion on that remains very loosely held. But if you’re working on core transactional systems that can have major downstream impact, if that impact extends beyond your organization and into the non-digital world, and if the nature of the business allows you to schedule changes so that recovery happens while the business is closed and the impact is minimized, then Friday (or, more generally, off-day) deployments should remain a valid weapon in your arsenal.