Load testing at a glance
For a while now I’ve been wanting to talk about load testing and why we do it. Recently I was approached with a problem where some users were having a very slow experience and I needed to fix it. I went about solving this by checking what kind of response times I could expect for the particular path and actions those users were taking. By investigating the problem in this manner I was able to pinpoint the particular part of our system that was behaving slowly and begin to improve it.
It is for problems like these that we can use load testing techniques to diagnose issues and put plans in place to fix them.
What is load testing?
The term “load testing” can be used in a couple of different ways. The first describes a general class of testing methods that seek to uncover how a particular system behaves and reacts to changes in usage or traffic volume. The second refers to a specific type of test, one in which we target a system with normal amounts of traffic to see if it can handle the expected load.
When we are performing a load test we are simulating traffic to a system using some sort of tool and then monitoring our key metrics to decide whether the system stays in an acceptable state or the metrics become unacceptable. For example, if we have a grocery list app, we might decide that if the response time to add an item to the list exceeds 200 milliseconds it produces an unacceptable experience for the user. In this example the key metric is the response time for adding an item and the threshold of acceptability is 200 milliseconds. When designing and executing a load test, it is important that you take time to understand what your key metrics are and what values you would deem unacceptable for those metrics.
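As a minimal sketch of the key metric and threshold idea, the check below flags response times over the 200 millisecond limit from the grocery list example. The function name, constant name, and sample values are all made up for illustration and do not come from any particular load testing tool.

```python
# Threshold from the grocery list example; the name is hypothetical.
ADD_ITEM_THRESHOLD_MS = 200.0


def is_acceptable(response_times_ms, threshold_ms=ADD_ITEM_THRESHOLD_MS):
    """Return True when no observed response time exceeds the threshold."""
    return all(t <= threshold_ms for t in response_times_ms)


# Made-up measurements for the "add item" action, in milliseconds.
samples_ms = [120.0, 95.5, 180.2, 210.7]

# Collect the measurements that break the threshold so they can be inspected.
slow = [t for t in samples_ms if t > ADD_ITEM_THRESHOLD_MS]

print("acceptable:", is_acceptable(samples_ms))
print("over threshold:", slow)
```

Checking every single request is the strictest possible rule. In practice you may prefer to compare a percentile against the threshold instead.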
Stress Testing
A stress test is a type of load test that seeks to find the point where a system starts to degrade. This might be the volume of traffic at which a system starts to gracefully degrade features, or it could be the point at which a system becomes completely unresponsive. One way to execute a stress test is to keep increasing the volume of simulated traffic to the system until the key metrics you are monitoring become unacceptable. Note that all systems will probably have a point where they degrade or fail. This is important to realize because, in our grocery list example, it might be ok for the app to fail at 10 thousand users if we don’t expect to have that many concurrent users. When designing a stress test you may want to have multiple points of acceptability. In our grocery list example this may look like:
- 100 users: Adding items must happen in under 200ms
- 1000 users: Adding items can happen in under 500ms
- 5000 users: Adding items can happen in under a second
- 10000 users: Adding items can take any amount of time
Of course we want our system to perform well, but we should be looking at reasonable amounts of load. In this case we might be expecting a couple hundred users. If you focus on amounts of traffic that you aren’t expecting any time soon, you will find yourself over-optimizing.
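The acceptability tiers listed above can be written down as data and checked automatically. This is only a sketch: treating each tier as covering every user count up to its limit (so 800 users fall under the 1000-user tier) is an assumption, and the function names are hypothetical.

```python
# Tiers from the grocery list example: (max concurrent users, limit in ms).
# None means "any amount of time is acceptable".
TIERS = [
    (100, 200.0),
    (1000, 500.0),
    (5000, 1000.0),
    (10000, None),
]


def limit_for(users):
    """Return the limit (ms) of the smallest tier that covers `users`."""
    for max_users, limit_ms in TIERS:
        if users <= max_users:
            return limit_ms
    return None  # beyond the last tier no requirement is defined


def meets_tier(users, observed_ms):
    """True when the observed response time is under the tier's limit."""
    limit_ms = limit_for(users)
    return limit_ms is None or observed_ms < limit_ms


# Made-up observations: (concurrent users, response time in ms).
for users, observed_ms in [(100, 150.0), (800, 620.0), (10000, 4000.0)]:
    print(users, observed_ms, meets_tier(users, observed_ms))
```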
Soak Testing
Soak testing is another form of load testing where we do not increase the volume of traffic over the duration of the test but instead maintain a relatively high load for a longer duration. In our grocery list example we might be expecting 100 users a day, but we want to make sure that if the app gains popularity we can support up to 500 users a day. To check this we could run a soak test in which we simulate 500 users for multiple days to see how our system behaves. This form of testing can help find bottlenecks in your system that need attention before your next stage of growth. Perhaps your simple queues handle your current user base fine but can’t handle that amount of load. These are the types of things you want to know ahead of time so you can fix them before your user base grows.
Spike Testing
Some types of applications have very spiky traffic patterns and others don’t. You might find that your application has 100 users at all times, but every Friday at 5pm traffic skyrockets to 1000 users. Once you know this you could account for it by using bigger servers or more powerful infrastructure, but this comes at a cost. Wouldn’t it be great if we could run with some lower amount of capacity most of the week and still handle the load on Friday evening? This scenario is what spike testing tries to simulate: you run a load test with some baseline amount of traffic and then increase the traffic sharply to see how your system behaves against your key metrics. Spike testing isn’t just useful for recurring spiky traffic but also for preparing for unexpected events. For example, we might find that our grocery list app has baseline traffic of 100 users all day every day, but then we get featured in Groceries Weekly, the premier grocery publication, and our traffic soars within a couple of hours. Depending on your application it might be important to make sure you can handle this rapid increase in traffic. Imagine trying to get your startup off the ground and having everyone who read your first PR article see an error page.
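One way to describe a spike to a load testing tool is as a function from elapsed time to the number of simulated users. The sketch below uses the 100 and 1000 user figures from the Friday example. The timing parameters and the linear ramp are assumptions chosen for illustration, and `target_users` is a hypothetical helper, not part of any tool.

```python
def target_users(t_seconds, baseline=100, peak=1000,
                 spike_start=600, ramp=60, hold=300):
    """Simulated user count at time t for a baseline, spike, baseline profile."""
    if t_seconds < spike_start:
        return baseline
    if t_seconds < spike_start + ramp:
        # Linear ramp from baseline to peak over `ramp` seconds (an assumption).
        fraction = (t_seconds - spike_start) / ramp
        return round(baseline + fraction * (peak - baseline))
    if t_seconds < spike_start + ramp + hold:
        return peak
    return baseline  # the spike is over, traffic returns to normal


# Print the profile at a few points in time.
for t in (0, 600, 630, 660, 900, 1000):
    print(t, target_users(t))
```

A short `ramp` relative to the rest of the test is what makes this a spike test. A long, steady ramp with no upper bound would look more like a stress test.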
Why do we load test?
We’ve covered some basic types of load tests and what key metrics and acceptable values are, but why would we want to load test in the first place? To summarize the examples above, we may want to know how our systems will behave under unforeseen amounts of load or rapid changes in traffic. We might also want to see whether we are over-provisioning our infrastructure relative to the regular usage of a system, or whether our infrastructure can respond to changes in load. These questions don’t need to be short term either. We might want to perform a load test now to see if we can handle the number of users we predict in six months’ time, to make sure we are ready for that slow but steady increase in traffic or usage. Some common occasions for load testing are any seasonal or recurring changes of load in your application’s domain (Valentine’s Day for flower shops, for example), any upcoming press releases or other traffic-generating events, new feature releases (to make sure the new feature can handle the baseline traffic), and any increase in complaints of poor system performance from your users.
How can I load test?
There are various tools that you can use to load test systems, ranging from paid services to free open source tools. Disclaimer: I haven’t used any of the paid services and won’t recommend any here, but you can find them with a quick web search. Of the free tools, I have used Vegeta, Locust, Bees With Machine Guns, and Apache JMeter. Each of these tools has its own pros and cons and it is important to find the one that will work best for your use case. The other option, if you are working in a specialized environment or testing some particularly tedious user flows, is to write your own tool. However, this can be an expensive task if you’re not going to be load testing often.
When performing a load test it is very important to do two things:
- Do not run the test off your laptop: This is important because you are unlikely to be able to simulate enough load from one consumer machine. At the very least, spin up a beefier machine on a cloud provider and use that, or multiple machines, to run the tests.
- Do simulate common traffic patterns: Many load testing tutorials show you how to test one particular path of a user. They log in, then they hit the dashboard page, then they fill out a form, but this isn’t how your users actually behave. You need to look at the patterns your users have and design your test appropriately to make sure you are maximizing its usefulness. If 70% of your traffic is logging in and hitting the dashboard and 30% is creating new accounts, your load test should be simulating that.
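The 70/30 split described above could be sketched with a weighted random choice over user flows. The flow names here are hypothetical placeholders; a real test would map each name to the requests that make up that flow. Tools such as Locust offer their own way of weighting tasks, so treat this only as an illustration of the idea.

```python
import random

# Traffic mix from the example above; the flow names are hypothetical.
FLOWS = ["login_and_dashboard", "create_account"]
WEIGHTS = [70, 30]


def pick_flows(n, seed=None):
    """Choose n user flows at random, in proportion to WEIGHTS."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    return rng.choices(FLOWS, weights=WEIGHTS, k=n)


flows = pick_flows(10_000, seed=42)
share = flows.count("login_and_dashboard") / len(flows)
print(f"login_and_dashboard share: {share:.1%}")
```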
What metrics do I care about?
When we are performing a load test it is important to keep track of your system’s metrics. Most of the time the metric you will be monitoring is the response time to the user, since this is usually tied directly to their experience of your system. If it isn’t, monitor whatever is. When you are evaluating your key metrics there are a variety of statistics to keep an eye on. The first are the minimum and maximum. In the case of response time these give you a good idea of the spread of experiences users will see, but they aren’t the best statistics to monitor because you don’t know how many users are seeing each extreme. A much better way to evaluate how your system is performing is to look at the 95th and higher percentiles (P95 = 95th percentile). The 95th percentile is the value that 95% of measurements fall at or below. This is much better for understanding what the vast majority of your users are experiencing. An additional upside of monitoring these high percentiles is that they tend to exclude outliers from your data, such as a handful of very slow requests caused by network latency. The higher the percentile for which you can achieve acceptable values, the better.
When monitoring your metrics you will probably want to graph them over the duration of your test. This will help you understand how rapidly they change based on the amount of traffic you generate. For example, if you double the traffic and you see your P95 triple, you know there is something in your system that is not scaling linearly.
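To make the percentile idea concrete, here is a small sketch using the nearest-rank definition, which is one of several common ways to compute a percentile (monitoring tools may interpolate and give slightly different numbers). The sample data is made up: nineteen fast requests and one very slow one.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it."""
    if not values:
        raise ValueError("cannot take a percentile of empty data")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up response times in ms: 19 ordinary requests and one outlier.
times_ms = [100] * 19 + [5000]

print("max:", max(times_ms))
print("P95:", percentile(times_ms, 95))
```

Here the maximum is dominated by the single slow request, while the P95 reflects what the other nineteen requests experienced.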
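The doubling and tripling example can be turned into a simple ratio. This is a rough heuristic rather than a formal scalability model, and `scaling_factor` is a hypothetical helper with made-up inputs.

```python
def scaling_factor(load_before, p95_before, load_after, p95_after):
    """Ratio of P95 growth to load growth.

    A value above 1.0 means the P95 grew faster than the traffic did.
    """
    load_ratio = load_after / load_before
    p95_ratio = p95_after / p95_before
    return p95_ratio / load_ratio


# Made-up numbers: traffic doubles (100 -> 200 users), P95 triples (120 -> 360 ms).
factor = scaling_factor(100, 120.0, 200, 360.0)
print(f"scaling factor: {factor:.2f}")
```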
When you are gathering metrics you also want to pick the granularity you require. If you are only looking at aggregates across your whole system, it will be harder to act on your findings. I have found that, in the case of web services, monitoring each URL’s performance tends to show bottlenecks quickly.
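A per-URL breakdown might be sketched like this, grouping raw samples by URL before computing a nearest-rank P95 for each. The URLs and timings are invented for the example, and real tools usually provide this grouping for you.

```python
import math
from collections import defaultdict


def p95_by_url(samples):
    """Group (url, response_ms) pairs by URL and return each URL's P95."""
    grouped = defaultdict(list)
    for url, ms in samples:
        grouped[url].append(ms)

    result = {}
    for url, values in grouped.items():
        ordered = sorted(values)
        # Nearest-rank P95: 1-based rank, never below 1.
        rank = max(math.ceil(95 * len(ordered) / 100), 1)
        result[url] = ordered[rank - 1]
    return result


# Made-up samples: (URL, response time in ms).
samples = [
    ("/items", 80), ("/items", 90), ("/items", 85),
    ("/login", 400), ("/login", 900),
]

for url, p95 in sorted(p95_by_url(samples).items()):
    print(url, p95)
```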
How can I action these metrics?
Hopefully, based on the metrics you gather, you are able to tell what your slowest operations are and decide to focus performance efforts there. When picking what to improve you will usually want to pick whichever metric is furthest from its acceptability threshold, as it has the most to gain and is often the easiest to make gains on. Once you investigate the area of the system that is performing unacceptably, you should note any expensive operations there, including network calls, database operations and known slow functions. If you have tracing available you might be able to see which operations are using up the majority of the time and focus your optimization efforts on those areas.
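Picking the metric with the biggest gap to its threshold can be sketched as a sort. The metric names, observed values, and thresholds below are all hypothetical, and `worst_offenders` is an illustrative helper rather than anything from a real tool.

```python
def worst_offenders(observed_p95_ms, thresholds_ms):
    """Return (metric, gap) pairs for metrics over their threshold,
    sorted with the largest gap first."""
    gaps = {
        name: observed_p95_ms[name] - thresholds_ms[name]
        for name in observed_p95_ms
        if name in thresholds_ms  # skip metrics with no threshold defined
    }
    over = [(name, gap) for name, gap in gaps.items() if gap > 0]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Made-up P95 values and thresholds, in milliseconds.
observed = {"add_item": 350.0, "login": 900.0, "list_items": 150.0}
thresholds = {"add_item": 200.0, "login": 500.0, "list_items": 200.0}

for name, gap in worst_offenders(observed, thresholds):
    print(f"{name}: {gap:.0f} ms over threshold")
```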
Going further
If you’ve made it this far and you’d like to learn more, I recommend trying to load test a system you have available. You might even find some scalability or performance issues you didn’t know you had. This talk by Rob Harrop is great at explaining what load tests are for, and if you have a Pluralsight subscription, Mick Badran has a good course on load testing using Azure DevOps.
If you have any questions please don’t hesitate to contact me or reply in the comments.