When applications slow or break, it can wreak havoc on your business. Imagine if your on-premises server went down, your Internet stopped suddenly, or your server room caught fire. Yikes! Developers need to be one step ahead of potential problems so they can quickly resolve any issues that arise or prevent them from happening in the first place.
One strategy that has been getting a lot of attention is a cloud-native approach to application management. Here’s an example of how being cloud-native recently saved the day for one of our clients.
The Problem
Our client’s website suddenly slowed down to the point of being unusable when users viewed a critical customizable report. I checked a few metrics related to the database server and found a sudden spike in CPU usage. CPU usage never returned to its normal level afterward, even overnight when few users would be using the site.
Strangely, no code or infrastructure updates had been made for several weeks, and the application had been running for years without anything similar happening. With minimal insight into the cause, I began to investigate.
The Attempts to Fix the Server
One of the first things I’ve learned to do when a SQL Server instance is acting up is to run SQL Server Profiler. I also ran sample report queries directly against the database and found no issues with the application code or network connectivity: the query execution time itself was the culprit. Based on the Profiler output, I added indices to several columns and re-ran the reports, only to see a minimal speed increase. I also refactored some of the larger queries, but the overall slowness remained. We needed to find a resolution quickly.
The Fix
After a meeting with the client and their IT department, we decided to take advantage of the database being an RDS instance in AWS. Our solution was to back up the database manually, take the site offline, shut down the old server, and restore the database to a new instance. This approach was a nuclear option, to be sure, but after several days of a site that was slowing down to the point of being unusable, we needed quick and dramatic action.
The backup was easy, since we had already been taking automatic backups every night. We then restored from the snapshot, which is a simple process in AWS. Overall, the most challenging part was coordinating a time that evening to do the work.
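For readers who want to see what a snapshot-and-restore like this can look like in code, here is a minimal sketch using boto3, the AWS SDK for Python. It is not the exact procedure we ran: the function name and the instance and snapshot identifiers are hypothetical, and a real restore would usually also need settings such as the instance class, subnet group, and security groups, which are omitted here.

```python
def replace_instance_from_snapshot(rds, old_id, new_id, snapshot_id):
    """Snapshot the instance old_id, then restore that snapshot as new_id.

    `rds` is expected to be a boto3 RDS client, e.g. boto3.client("rds").
    All identifiers are illustrative placeholders.
    """
    # 1. Take a manual snapshot of the misbehaving instance.
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=old_id,
    )
    # Block until the snapshot is ready to be restored.
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id
    )

    # 2. Restore the snapshot into a brand-new instance.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    # Block until the new instance accepts connections.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=new_id
    )
    return new_id


# Example call (requires boto3 and AWS credentials; names are made up):
#   import boto3
#   replace_instance_from_snapshot(
#       boto3.client("rds"), "reports-db", "reports-db-new", "reports-db-manual"
#   )
```

Passing the client in as a parameter keeps the sketch easy to dry-run against a stub before pointing it at a real account. Once the new instance is up, the application still has to be pointed at its endpoint, and the old instance removed separately.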
In the end, the new instance ran as well as the old one had before the mysterious slowness showed up, and reports that had been taking five minutes to run once again took only a few seconds. We have not experienced a recurrence of the issue.
The Aftermath
After the fix was in place, the development team and client met for a post-mortem to look into options to prevent this from happening in the future. AWS has several metrics that track important data like CPU usage and input/output operations per second (IOPS), as well as alerts that notify you by email or text if the metrics go outside of your defined boundaries.
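As a rough illustration of such an alert, the sketch below defines a CloudWatch alarm on the RDS `CPUUtilization` metric using boto3. The 80% threshold, the 15-minute window, the alarm name, and the SNS topic (which would deliver the email or text) are all assumptions for the example, not the values we settled on.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build arguments for CloudWatch put_metric_alarm (values are illustrative)."""
    return {
        "AlarmName": f"{instance_id}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": instance_id}
        ],
        "Statistic": "Average",
        "Period": 300,             # evaluate in 5-minute buckets
        "EvaluationPeriods": 3,    # 3 consecutive buckets = 15 minutes
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # SNS topic whose subscribers receive the notification.
        "AlarmActions": [sns_topic_arn],
    }


def create_cpu_alarm(cloudwatch, instance_id, sns_topic_arn):
    """Create or update the alarm; `cloudwatch` is a boto3 CloudWatch client."""
    cloudwatch.put_metric_alarm(**cpu_alarm_params(instance_id, sns_topic_arn))


# Example call (requires boto3 and AWS credentials; the ARN is made up):
#   import boto3
#   create_cpu_alarm(
#       boto3.client("cloudwatch"),
#       "reports-db-new",
#       "arn:aws:sns:us-east-1:123456789012:db-alerts",
#   )
```

Requiring several consecutive periods above the threshold is one way to avoid being paged for a short burst while still catching a sustained spike like the one described here.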
Implementing these alerts will immediately notify us if another CPU spike occurs so we can provide a quicker turnaround. Other, more substantial efforts we are working on include updating our version of SQL Server and refactoring the reporting system to be more independent of the rest of the application. All of these solutions are easier to implement with a cloud-native database.
As you can tell, one of the main benefits of cloud-native databases is the set of services and utilities they provide, which make backups and restorations simple. If you’d like to learn more about how being cloud-native can benefit your organization, schedule a time to talk with one of our specialists.