Public cloud cost optimization

Siddharth Ram
QuickBooks Engineering
Jan 30, 2020


Companies migrating to the public cloud often make assumptions that do not pan out. A common one is that moving to public cloud will reduce costs. Often they are shocked by the results: instead of lower costs, they see higher ones. There are several reasons for this. This article covers how to think about public cloud migration in a larger context, and why cost optimization should be treated as a fundamental quality attribute of software delivery, just like availability, scalability and the other ‘ilities’.

The Attraction of Public Cloud

Figure 1. Private hosting vs Public Cloud

Figure 1 illustrates the key difference in cost between private hosting and public cloud. Private hosting generally has a cost forecast built at the beginning of the fiscal year, and purchases are made based on commitments from teams. But the model is generally inflexible. If actual usage is under plan (overcapacity), money has been wasted; if it is over plan, the plan cannot accommodate it and customers suffer from under-provisioning. Worse, if a sudden rise in demand is followed by a reduction, you have paid for excess hardware that now sits unused and must be written off (the falling-demand region of the chart).

Public cloud, on the other hand, has elasticity as a primary attribute, so capacity can track demand instead of being over- or under-provisioned. This is very attractive to organizations. However, this road has pitfalls, some of which are outlined below.
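
The gap between a fixed plan and real demand can be made concrete with a small sketch. The demand numbers and the planned capacity below are invented for illustration; they are not Intuit data. The function simply counts how many capacity units a fixed plan wastes and how many units of demand it fails to serve.

```python
def fixed_capacity_outcome(demand, planned_capacity):
    """Return (wasted_units, unmet_units) for a fixed capacity plan.

    demand: capacity units actually needed in each period (hypothetical).
    planned_capacity: units purchased up front for every period.
    """
    # Capacity paid for but not used in periods where demand is under plan.
    wasted = sum(max(planned_capacity - d, 0) for d in demand)
    # Demand that could not be served in periods where demand is over plan.
    unmet = sum(max(d - planned_capacity, 0) for d in demand)
    return wasted, unmet


# Hypothetical demand curve: a rise followed by a fall.
demand = [40, 60, 120, 150, 70, 50]
wasted, unmet = fixed_capacity_outcome(demand, planned_capacity=100)
# wasted == 180 units paid for but idle, unmet == 70 units customers did not get
```

An idealized elastic model would drive both numbers to zero. In practice that only happens if teams actually scale with demand, which is where the pitfalls come in.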

Why do optimistic cost projections for public cloud fail?

Public cloud migration comes with some gotchas. Not paying attention to them will result in a phone call from your CFO. Here are some of the common pitfalls:

1. The Double Bubble

The Double Bubble reflects the fact that migration takes time. While you are still migrating, you are running two data centers and moving data back and forth between them, so you pay two bills: one for the captive data center and one to the public cloud vendor. Depending on the scale and complexity of the migration, the double bubble can last multiple years.

The double bubble (X axis: time)

Lesson: Be aware of the double bubble and plan for it. Work with engineers to minimize the double bubble period.
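
To plan for the double bubble, it helps to put a rough number on it. The sketch below is a simplification under two stated assumptions that do not come from this article: the captive data center bill stays flat until migration finishes, and the cloud bill ramps up linearly as workloads move. All figures are hypothetical.

```python
def double_bubble_cost(onprem_monthly, cloud_monthly_at_full, migration_months):
    """Total spend across the migration window.

    Assumes (hypothetically) a flat on-prem bill and a cloud bill that
    grows linearly until it reaches cloud_monthly_at_full in the last month.
    """
    total = 0.0
    for month in range(1, migration_months + 1):
        cloud_share = month / migration_months  # fraction of workloads moved
        total += onprem_monthly + cloud_monthly_at_full * cloud_share
    return total


# Hypothetical: 100 per month on-prem, 80 per month in cloud, 4-month migration.
total = double_bubble_cost(100, 80, 4)   # 600.0
onprem_only = 100 * 4                    # 400 if you had never started
overlap_premium = total - onprem_only    # 200.0 extra paid during the overlap
```

Shortening the migration window is the lever that shrinks the premium most directly, which is why the lesson above asks engineers to minimize the double bubble period.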

2. Grokking Elasticity

Even after the double bubble ends, engineering teams that are new to public cloud need time to understand its elastic nature. Resources should be scaled up when demand rises and scaled down during low-traffic periods. Mature companies plot out expected traffic months ahead and automatically scale to meet expected demand (see also pitfall 3 below).

While a company is still immature in its cloud practice, teams that do not understand elasticity provision hardware for peak traffic and leave it running, which means excessive spend.

Lesson: Ensure that teams understand elasticity prior to cloud migration — not after.
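
A minimal sketch of forecast-driven scaling, compared with provisioning for peak, is shown here. The traffic forecast, the per-instance throughput, the headroom percentage and the minimum instance count are all assumptions chosen for illustration.

```python
def instances_needed(expected_rps, rps_per_instance, headroom_pct=20, min_instances=2):
    """Instances required for the expected traffic plus a safety headroom.

    Integer arithmetic is used so the rounding up is exact.
    """
    numerator = expected_rps * (100 + headroom_pct)
    denominator = 100 * rps_per_instance
    required = -(-numerator // denominator)  # ceiling division
    return max(min_instances, required)


# Hypothetical forecast of requests per second for six time slots.
forecast_rps = [200, 200, 1000, 1500, 600, 200]
schedule = [instances_needed(rps, rps_per_instance=100) for rps in forecast_rps]
# schedule == [3, 3, 12, 18, 8, 3]

elastic_instance_slots = sum(schedule)                 # 47
peak_instance_slots = max(schedule) * len(schedule)    # 108 if sized for peak all day
```

In this made-up example, sizing for peak around the clock uses more than twice the instance time of the scaled schedule.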

3. Lack of infrastructure support

A lack of proper infrastructure will get in the way of cost optimization. By infrastructure, I mean allowing engineers to use elasticity without having to design a solution themselves. At Intuit, a large part of the solution is our Kubernetes infrastructure, which scales up and down based on traffic conditions without product teams being involved. Lacking this, teams often have to make sizing decisions themselves (‘Do I choose the 2x instance or the 16x instance? I am not sure, so let me choose the larger one’), wasting capacity and money.

Lesson: Use infrastructure like Kubernetes (or serverless) that allows transparent scale-up and scale-down. This includes the use of spot instances for workloads that are suitable for spot.
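
As one illustration of what "transparent" scaling looks like, the Kubernetes Horizontal Pod Autoscaler documents its core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below restates that rule in Python. This article does not say which autoscaling mechanism Intuit's platform uses, so treat this as a generic example. The min/max bounds are illustrative defaults, and the real autoscaler's tolerance band and stabilization behavior are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the documented HPA scaling rule.

    current_metric and target_metric are in the same unit, for example
    average CPU utilization percent across pods.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(4, current_metric=90, target_metric=60)    # 6
scale_down = desired_replicas(4, current_metric=30, target_metric=60)  # 2
capped = desired_replicas(4, current_metric=600, target_metric=60)     # 10
```

The point for cost is that nobody on the product team had to guess between the 2x and the 16x instance; the platform adjusts capacity from observed load.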

4. Transparency

Private hosting is opaque. Costs are visible only to finance teams and to the leaders responsible for hosting; engineers know nothing about them. This comes from annual forecasts being developed up front, with everyone working within the commitments made at the beginning of the fiscal year. By the same token, costs are comprehensively controlled: no change takes place without the participation of finance and engineering leaders.

This model is turned on its head in the cloud world. Engineers can see how many servers are provisioned, which services are being consumed, and sometimes what the bill is. They can spin up new instances and use new services with no involvement from leaders or finance.

This is empowering for engineers. Unfortunately, it can also have a significant downside: overutilization of public cloud and an unexpectedly large bill.

Intuit’s Model: Freedom and responsibility

To address this problem, Intuit treats cost optimization as a quality attribute. Teams get full visibility into their cloud spend. They work with their leader to determine what their cost should be and then monitor it, no different from how we think about performance, resilience or scalability.
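
Treating cost like any other quality attribute means checking it against an agreed target, much as a latency check would. The following is a hypothetical sketch; the function name, the tolerance and the status labels are assumptions, not a description of Intuit's tooling.

```python
def cost_status(actual, planned, tolerance_pct=10):
    """Classify spend against the plan agreed between a team and its leader.

    Returns "over", "under" or "on_plan". The tolerance is a hypothetical
    band around the plan within which no action is needed.
    """
    if actual * 100 > planned * (100 + tolerance_pct):
        return "over"
    if actual * 100 < planned * (100 - tolerance_pct):
        return "under"
    return "on_plan"


# Hypothetical monthly spend checked against a plan of 1000.
statuses = [cost_status(spend, planned=1000) for spend in (1200, 1050, 800)]
# statuses == ["over", "on_plan", "under"]
```

Flagging "under" as well as "over" is a design choice: spend well below plan may mean the plan was padded, which is worth knowing for the next planning cycle.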

There is a tradeoff between cost, availability and performance to be considered. More expensive hardware will result in better performance, but is that the right cost/performance ratio? Deploying in multiple regions can result in higher availability (and regional DR), but is that worth the cost and complexity? Intuit’s approach is to define the principles for making the decision at a leadership level, but push the responsibility for the decision down to scrum teams. This ensures that every engineer weighs cost alongside availability and other design patterns when choosing the optimal path. Not doing so can lead teams down the wrong path. Examples of considerations are:

  • Is a service important enough to be in multiple AZs?
  • Should this service be multi-region?
  • Will the configuration result in proper performance?

To allow teams proper visibility, Intuit offers a flexible cost dashboard that gives visibility from the enterprise level all the way down to a particular service. Leaders at all levels can view just their own expenses, compare them with their plans, and make adjustments as needed. Treating cost optimization as a quality attribute has empowered engineers to think about the tradeoffs between availability, performance and costs and make the right calls.
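
The enterprise-to-service view described above is, at its core, a hierarchical roll-up. Here is a minimal sketch of that idea. The hierarchy levels (business unit, team, service), the names and the costs are all hypothetical, and this says nothing about how Intuit's dashboard is actually built.

```python
from collections import defaultdict


def roll_up(service_costs):
    """Aggregate per-service costs to every level of the hierarchy.

    service_costs: iterable of (business_unit, team, service, monthly_cost).
    Returns a dict keyed by path tuple; the empty tuple is the enterprise total.
    """
    totals = defaultdict(int)
    for unit, team, service, cost in service_costs:
        path = (unit, team, service)
        # Add the cost to the enterprise, unit, team and service levels.
        for depth in range(len(path) + 1):
            totals[path[:depth]] += cost
    return dict(totals)


# Hypothetical cost records.
costs = [
    ("payments", "ledger", "ledger-api", 300),
    ("payments", "ledger", "ledger-db", 500),
    ("payments", "billing", "invoice-api", 200),
    ("tax", "filing", "efile", 400),
]
totals = roll_up(costs)
# totals[()] == 1400, totals[("payments",)] == 1000, totals[("payments", "ledger")] == 800
```

A leader then only needs the entry for their own path to compare against plan.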

In addition, an off-the-shelf cost optimization tool is integrated into the CI process. It runs loads on various hardware configurations and makes recommendations for a target (cost/performance ratio) set by a team. This allows teams to quickly take advantage of newer hardware, or of software changes that let a service run optimally on different hardware.
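
The tool itself is not named here, so the sketch below only illustrates the general idea of recommending a configuration against a cost/performance target. The configuration names, hourly prices and throughput figures are invented, and "hourly cost per sustained request per second" is just one possible way to define the ratio.

```python
def recommend(benchmarks, target_cost_per_rps):
    """Pick the configuration with the best cost/performance ratio.

    benchmarks: {config_name: (hourly_cost, sustained_rps)} from load tests.
    Returns the name of the cheapest-per-rps configuration that meets the
    team's target, or None if no configuration meets it.
    """
    best_name, best_ratio = None, None
    for name, (hourly_cost, sustained_rps) in benchmarks.items():
        ratio = hourly_cost / sustained_rps
        if ratio <= target_cost_per_rps and (best_ratio is None or ratio < best_ratio):
            best_name, best_ratio = name, ratio
    return best_name


# Hypothetical load-test results for three hardware configurations.
benchmarks = {
    "small": (0.10, 100),
    "medium": (0.20, 260),
    "large": (0.80, 700),
}
choice = recommend(benchmarks, target_cost_per_rps=0.0009)     # "medium"
no_choice = recommend(benchmarks, target_cost_per_rps=0.0005)  # None
```

Running a check like this in CI means a new instance family or a performance improvement in the code shows up as a changed recommendation without anyone having to go looking for it.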

Public cloud is about time to market, not cost management

Public cloud allows teams to address changes in demand quickly while managing costs transparently. Leaders need to think about the top line (‘Can I address changing demand quickly?’) as the reason to move to public cloud, as opposed to ‘Will it save me money?’ At Intuit, public cloud migration is more about the top line than the bottom line, and cost optimization enables efficient long-term investment in the top line.
