Thoughts on Resilient Cloud Architecture

In this blog post I’d like to share a few thoughts on how to build and architect resilient services in the cloud. Much of this is good practice for any architecture, but some of it is cloud-specific. I should point out that this is solely my opinion.

Maximum Utilisation Of Machines

One could say that from the cost-optimisation perspective we should target our machines to be maximally utilised: CPU burning at 100%, memory almost full, and disk space as well. But I don’t think that’s a reasonable target, as it leaves no room for potential hiccups or a sudden growth in the number of users. Relying on auto-scaling can also be counter-productive, as a new machine usually takes more than a minute to be ready. Will you degrade performance for all users during that time?

My rule of thumb is to target around 70-75% utilisation of any resource: CPU, memory, disk space, and network. When you reach those limits, provision more resources or machines to cover your computational requirements. A reasonable utilisation target also accounts for the fact that many applications cannot fully utilise the CPU anyway, mainly because of synchronisation.
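To make the rule of thumb concrete, here is a minimal sketch of a scale-out check. The 0.75 threshold and the psutil-based metrics are illustrative assumptions; in practice the signal would come from your monitoring stack rather than from the machine itself.

```python
import psutil

# Hypothetical threshold based on the 70-75% rule of thumb.
UTILISATION_TARGET = 0.75

def needs_more_capacity() -> bool:
    """Return True if any local resource exceeds the target utilisation."""
    cpu = psutil.cpu_percent(interval=1) / 100        # fraction of CPU in use
    memory = psutil.virtual_memory().percent / 100    # fraction of RAM in use
    disk = psutil.disk_usage("/").percent / 100       # fraction of disk in use
    return any(value > UTILISATION_TARGET for value in (cpu, memory, disk))

if needs_more_capacity():
    print("Provision more machines before the next hiccup arrives.")
```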

Avoid Single Point Of Failure

It’s not good to have a perfectly designed, auto-scalable service that relies on another service running on a single node without failover just because it serves a tiny amount of traffic. The same applies when the required service runs in a container that is regularly rescheduled and is unavailable during that time. So every important service must have at least one failover machine that can take over instantly if the primary machine fails. The same applies to containers.

The best scenario seems to be having three active instances of the service on different machines. The first serves active traffic, the second is the failover, and the third can be used for rolling software upgrades and may be down for maintenance. That way your service always has a failover node available. You should also bear in mind that deploying a corrupted build can trigger unrecoverable changes, so two nodes are simply not enough.
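Here is a minimal sketch of how the three roles could be tracked and how the failover takes over when the active node is lost. The Node structure and promote_failover helper are assumptions for illustration; in a real setup the switch would happen in a load balancer or service-discovery layer.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str  # "active", "failover" or "maintenance"

# Three instances on different machines, as described above.
nodes = [
    Node("node-a", "active"),
    Node("node-b", "failover"),
    Node("node-c", "maintenance"),
]

def promote_failover(nodes: list[Node]) -> None:
    """If the active node is lost, promote the failover so traffic keeps flowing."""
    active = next(n for n in nodes if n.role == "active")
    failover = next(n for n in nodes if n.role == "failover")
    active.role = "maintenance"   # the lost node drops out of rotation
    failover.role = "active"      # instant switch to the failover node
```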

Everything Can Break

You should expect anything and be prepared for your machine or container to be lost at any time, often without any indication of the failure. That’s another reason why I strongly recommend running the three-node deployment: active, failover, deployment/maintenance. Don’t forget to cycle the active, failover, and deployment roles between nodes to avoid rusty instances.
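One way to cycle the roles is a simple rotation after every deployment, so no node sits in a single role long enough to go rusty. This is only a sketch under my own assumptions about the rotation order; the trigger and names are illustrative.

```python
# After each deployment: the freshly upgraded maintenance node takes traffic,
# the old active node becomes the failover, and the old failover goes down
# for the next upgrade. Every node regularly serves real traffic.
ROTATION = {"maintenance": "active", "active": "failover", "failover": "maintenance"}

def rotate_roles(roles: dict[str, str]) -> dict[str, str]:
    """Map each node name to its next role in the cycle."""
    return {node: ROTATION[role] for node, role in roles.items()}

print(rotate_roles({"node-a": "active", "node-b": "failover", "node-c": "maintenance"}))
# -> {'node-a': 'failover', 'node-b': 'maintenance', 'node-c': 'active'}
```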

Monitoring

Nothing is worse than alert fatigue. You may have experienced your phone giving you a zillion notifications before you disabled the unnecessary ones. The same can happen with downtime alerting.

You need to tune alerts so that a normal failover or deployment doesn’t trigger critical alerts. If the expected downtime of a service fires an alert, you should either reconsider your deployment process or raise the alerting threshold beyond the acceptable service-degradation window. Ideally, you should do deployments that don’t produce any downtime for your customers.
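A hedged sketch of that idea: only page when an outage outlasts the acceptable degradation window, so a routine failover or deployment stays quiet. The 120-second window is an assumption for illustration, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical acceptable service-degradation window.
ACCEPTABLE_DEGRADATION = timedelta(seconds=120)

def should_page(outage_started_at: datetime, now: datetime) -> bool:
    """Fire a critical alert only once the outage outlasts the window
    that a normal failover or deployment is allowed to consume."""
    return now - outage_started_at > ACCEPTABLE_DEGRADATION

# Example: an outage that started 30 seconds ago does not page yet.
start = datetime.now(timezone.utc) - timedelta(seconds=30)
print(should_page(start, datetime.now(timezone.utc)))  # False
```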

Conclusion

Building critical services that are available 99.999% of the time isn’t trivial. However, when you focus on your architecture and simulate failures regularly, you can achieve very good results that you can be proud of.