Spiky is Better! Efficient Scaling of Cloud Resources

Daniel Cross
Apr 12, 2018

Updated: Nov 28, 2022

We’re in the middle of our dive into techniques that are critical to getting you the compute power you need, at a fraction of the cost. The areas of focus are:

Efficient on/off scaling of cloud resources
Storage optimization
Use of preemptible/low-priority/spot instances

Today’s topic: efficient scaling of cloud resources.

This is likely the most critical component of cost-effective use of the cloud. This is, after all, the promise of cloud itself, isn’t it? Pay for what you use, and only what you use? Well, that’s easier said than done. As more and more studios transition to the cloud, we’ve seen common practices waste upwards of 40% of their render budget, just by how and when they spin up cloud resources.

So, why is this? Well, cloud is obviously a metered service, so just like metered parking, you don’t want to start putting money in the slot before your car is even parked, right?!

Seems obvious, but that’s exactly what some cloud rendering methods do. They calculate the requirements and begin to “pre-allocate” compute resources, turning on a certain number of standard instances, perhaps another set of preemptible instances, etc., until their virtual farm is “ready”. This is an easy mistake to make when transitioning from local resources, where one tends to think in terms of local resource availability: allocating machines, pushing data to those machines, starting up the render submission, and perhaps troubleshooting dependencies, all while the cost clock is ticking.

For cloud computing, there’s a simple principle: the spikier, the better! By “spiky” we’re referring to the shape of your compute usage over time. You want to go big, really big, for as little time as humanly possible. Taking you back to your days of calculus, we want to minimize the area under the curve (instances running, plotted against time), because in this case the area under the curve is cost! Take a look at the comparison example below. Here, we’re showing a gradual ramp of resources and rendering against a very quick, “spiky” render.

You can see that the slow ramp of the light red render covers almost twice the area of the spiky blue render, and thus costs almost twice as much.
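To make the “area under the curve” idea concrete, here is a minimal Python sketch. Every number in it is hypothetical and chosen only for illustration: the hourly rate, the instance counts, and the ramp schedule are assumptions, not measurements from a real render or prices from any provider. It treats a usage profile as a list of (instances, hours) segments and adds up the instance-hours billed.

```python
def instance_hours(profile):
    """Area under the usage curve: sum of instances * hours per segment."""
    return sum(instances * hours for instances, hours in profile)


def cost(profile, rate_per_instance_hour):
    """Billed cost, assuming a flat hourly rate per instance."""
    return instance_hours(profile) * rate_per_instance_hour


# Hypothetical flat rate (dollars per instance-hour), for illustration only.
RATE = 1.0

# Spiky: 1000 instances come up together and render for one hour.
spiky = [(1000, 1.0)]

# Gradual ramp: instances are switched on in batches while data is pushed
# and the farm is made "ready"; the same one hour of rendering only
# happens in the final segment.
ramp = [(250, 0.5), (500, 0.5), (750, 0.5), (1000, 1.0)]

spiky_cost = cost(spiky, RATE)
ramp_cost = cost(ramp, RATE)

print(f"spiky: {instance_hours(spiky):.0f} instance-hours, ${spiky_cost:.2f}")
print(f"ramp:  {instance_hours(ramp):.0f} instance-hours, ${ramp_cost:.2f}")
print(f"ramp costs {ramp_cost / spiky_cost:.2f}x the spiky run")
```

Under these made-up numbers, both runs do the same rendering, but the ramp is billed for every hour its instances sat waiting for the farm to be ready.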

So, how do you go about achieving this spiky efficiency? There are a few things that will put you well on your way to maximum performance for the minimum cost:

Autoscaling: This feature allows you to quickly spin up large amounts of resources for “burst” requirements such as rendering. Manually spinning up instances is slow, and therefore costly. The more tasks you run in parallel (see Parallelization below), the spikier your usage and the sooner your work is done. Each cloud provider has some version of this. Services like Conductor also automate this for you.
File caching… just don’t do it: Caching is a method whereby you create quick access to local files for cloud operations. In theory, this is a great strategy, because it’s minimally invasive compared to running locally. In practice, however, it causes the cloud compute resources to sit idle while files are accessed, read, re-read, etc. Caching alone can lead to over 20% wasted spend in the cloud. Pull your files cloud-side, and then run your jobs. Simple as that.
Parallelization: The beauty of render jobs, in general, is that they are excellent candidates for parallel processing. Aside from some cases of frame-to-frame dependency, frames can be run separately and simultaneously, offering massive time-to-completion benefits. In the cloud, there is zero penalty for complete parallelization. You don’t pay more to spin up 1000 instances versus 500 if you can finish in half the time. It’s simple math: cost is the number of instances multiplied by the time they run. Grab as many machines as you need for as little time as you need them. Again, the spikier, the better!
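The parallelization arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a pricing model: the frame count, per-frame render time, and rate are hypothetical, frames are assumed to be independent and equally expensive, and billing is assumed to be proportional to running time with no minimum charge or startup overhead.

```python
import math


def render_run(total_frames, hours_per_frame, instances, rate_per_instance_hour):
    """Return (wall_clock_hours, cost) for an evenly split render job.

    Assumes independent frames of equal cost and billing that is
    proportional to running time (hypothetical simplifications).
    """
    frames_per_instance = math.ceil(total_frames / instances)
    wall_clock_hours = frames_per_instance * hours_per_frame
    cost = instances * wall_clock_hours * rate_per_instance_hour
    return wall_clock_hours, cost


# Hypothetical job: 1000 frames at half an hour each, $1 per instance-hour.
FRAMES, HOURS_PER_FRAME, RATE = 1000, 0.5, 1.0

for n in (500, 1000):
    hours, dollars = render_run(FRAMES, HOURS_PER_FRAME, n, RATE)
    print(f"{n} instances: done in {hours} h for ${dollars:.2f}")
```

With these assumptions, doubling the instance count halves the wall-clock time and leaves the bill unchanged. Real bills can differ where a provider applies minimum billing increments or where instances take time to start before they render.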

So, hopefully this gives some additional insight into just how you can realize the promise of financial efficiency in the cloud. Massive scale without the massive cost is at your fingertips, if you follow these simple steps.