Amazon EC2 offers a flexible and convenient way to run virtual machines in the cloud. With dozens of EC2 instance types available, as well as multiple pricing options, it’s easy to use EC2 to configure the best cloud-based virtual machines for your needs and budget.
One thing that EC2 doesn’t make very easy on its own, however, is monitoring. Although Amazon allows you to collect EC2 metrics through CloudWatch, the native monitoring tool on AWS, CloudWatch offers little guidance about which EC2 metrics to track or how to interpret them. And with dozens of metrics available, it can be hard to know which ones are the most important to follow or how to correlate them with one another.
That’s why it’s important to understand key EC2 metrics in order to monitor EC2 effectively. Keep reading for an overview of the most important EC2 monitoring data to track for typical use cases.
Core Metrics for EC2 Monitoring
There are seven core metrics that you should always track for EC2 monitoring. These metrics fall into three categories (CPU usage, network activity, and disk operations), and together they provide basic visibility into your EC2 instances’ health and performance.
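As a rough sketch of how the seven core metrics might be requested, the Python below builds request parameters in the shape that CloudWatch's GetMetricStatistics API expects (for example, through boto3's get_metric_statistics call). The helper names, the sample instance ID, and the choice of statistic for each metric are assumptions made for illustration, and the code only builds dictionaries without contacting AWS.

```python
from datetime import datetime, timedelta, timezone

# The seven core metrics named in this article, grouped by category.
CORE_EC2_METRICS = {
    "cpu": ["CPUUtilization"],
    "network": ["NetworkIn", "NetworkOut"],
    "disk": ["DiskReadOps", "DiskWriteOps", "DiskReadBytes", "DiskWriteBytes"],
}


def build_metric_request(instance_id, metric_name, statistic,
                         period_seconds=300, hours=1, now=None):
    """Build keyword arguments for one GetMetricStatistics request.

    Hypothetical helper: it only assembles a dict and never calls AWS.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,  # 300 s matches 5-minute monitoring
        "Statistics": [statistic],
    }


def build_core_requests(instance_id, **kwargs):
    """Return one request dict per core metric (seven in total)."""
    requests = []
    for names in CORE_EC2_METRICS.values():
        for name in names:
            # Assumption: average the percentage metric, sum the counters.
            statistic = "Average" if name == "CPUUtilization" else "Sum"
            requests.append(
                build_metric_request(instance_id, name, statistic, **kwargs)
            )
    return requests


# "i-0123456789abcdef0" is a placeholder instance ID, not a real one.
core_requests = build_core_requests("i-0123456789abcdef0")
# With boto3 installed and credentials configured, each dict could be
# passed as: boto3.client("cloudwatch").get_metric_statistics(**request)
```

Keeping request construction separate from the network call makes it easy to review exactly which metrics and periods will be queried before any API usage is incurred.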
CPUUtilization
This metric reports how much of your instance’s CPU capacity is currently in use, expressed as a percentage of the total. When CPUUtilization stays close to 100 percent, you may need to consider increasing the CPU allocation of your instances.
Conversely, if CPUUtilization is always very low – such as under 50 percent, with no spikes in usage above that level – your instances are likely over-provisioned. You can save money by switching to instances with lower CPU allocations.
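The sizing guidance above can be sketched as a small check over a list of CPUUtilization samples. The function name and the 90 percent "high" cutoff are assumptions for illustration; the 50 percent "low" cutoff follows the example in the text, and both should be tuned to your workload.

```python
def classify_cpu(samples, high=90.0, low=50.0):
    """Classify an instance from CPUUtilization samples (percent values).

    Thresholds are illustrative assumptions, not AWS recommendations.
    """
    if not samples:
        raise ValueError("need at least one CPUUtilization sample")
    average = sum(samples) / len(samples)
    peak = max(samples)
    if average >= high:
        # Sustained usage near 100 percent: consider more CPU.
        return "under-provisioned"
    if peak < low:
        # Never spikes above the low cutoff: likely paying for idle CPU.
        return "over-provisioned"
    return "ok"


print(classify_cpu([95.0, 97.0, 92.0]))  # under-provisioned
print(classify_cpu([10.0, 20.0, 30.0]))  # over-provisioned
print(classify_cpu([10.0, 20.0, 70.0]))  # ok
```

Checking the peak rather than the average for the low case mirrors the "no spikes in usage above that level" condition, so a mostly idle instance with occasional bursts is not flagged for downsizing.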
NetworkIn and NetworkOut
These metrics tell you how many bytes of network traffic are entering and exiting your instance during the monitoring period that you have configured in CloudWatch. (The period is 5 minutes with basic monitoring or 1 minute with detailed monitoring, so you’ll need to divide by the period length in seconds if you want to convert NetworkIn and NetworkOut data to bytes per second.)
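The conversion is simple arithmetic, sketched here in Python. It assumes the value you read is the total number of bytes for the period (for example, the Sum statistic); the function names are hypothetical.

```python
def bytes_per_second(total_bytes, period_seconds):
    """Convert a per-period byte total into an average bytes-per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_bytes / period_seconds


def megabits_per_second(total_bytes, period_seconds):
    """Same rate expressed in megabits per second (1 Mbit = 1,000,000 bits)."""
    return bytes_per_second(total_bytes, period_seconds) * 8 / 1_000_000


# 150,000,000 bytes of NetworkIn over a 5-minute (300-second) period:
print(bytes_per_second(150_000_000, 300))     # 500000.0
print(megabits_per_second(150_000_000, 300))  # 4.0
```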
Network traffic can vary widely depending upon multiple factors, such as which types of applications you are running on EC2 and how much exposure they have to the network. There is no universal threshold separating healthy levels of network traffic from problematic ones, so there is no specific number to look for in NetworkIn and NetworkOut metrics in order to determine that your EC2 instances are healthy.
Still, measuring network metrics helps you track fluctuations in demand for applications that are hosted in EC2. By correlating this data with other events, you can achieve greater visibility into your EC2 instances and the applications that they host.
For example, if you notice that network utilization peaks at the same time as CPU utilization, that is normal behavior, because it makes sense for CPU usage to increase at times when more network requests are being handled by your instances. However, a spike in CPU usage that does not correlate with changes in network traffic merits further investigation. It could mean there is a misconfiguration with one of your instances (or an application hosted on it) that is causing CPU usage to fluctuate for reasons unrelated to application demand.
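One way to sketch this correlation check is to flag CPU spikes that have no matching network spike in the same period. The approach below, which marks a sample as a spike when it sits more than two standard deviations above the series mean, is an illustrative assumption and not a CloudWatch feature; the function names are hypothetical.

```python
from statistics import mean, pstdev


def spike_indexes(series, z_threshold=2.0):
    """Return the indexes of samples far above the mean of the series."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return set()  # a perfectly flat series has no spikes
    return {i for i, v in enumerate(series) if (v - mu) / sigma > z_threshold}


def unexplained_cpu_spikes(cpu, network, z_threshold=2.0):
    """Indexes where CPU spiked but network traffic did not."""
    if len(cpu) != len(network):
        raise ValueError("series must cover the same monitoring periods")
    cpu_spikes = spike_indexes(cpu, z_threshold)
    net_spikes = spike_indexes(network, z_threshold)
    return sorted(cpu_spikes - net_spikes)


cpu = [20] * 9 + [95]            # CPUUtilization jumps in the last period
flat_network = [1000] * 10       # NetworkIn stays flat
busy_network = [1000] * 9 + [9000]

print(unexplained_cpu_spikes(cpu, flat_network))  # [9]
print(unexplained_cpu_spikes(cpu, busy_network))  # []
```

The first result points at a period worth investigating, while the second shows the expected pattern of CPU rising together with traffic.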
DiskReadOps, DiskWriteOps, DiskReadBytes, and DiskWriteBytes
These four metrics report information about disk activity. DiskReadOps and DiskWriteOps tell you the total number of read and write operations, respectively, that occurred during your EC2 monitoring period. (Here again, the monitoring period depends on how CloudWatch is configured, so you’ll need to do some conversions to get a read- and write-per-second figure.) DiskReadBytes and DiskWriteBytes record the number of bytes that are read and written.
In all cases, these metrics are based on read and write operations to the instance store volumes that are attached to your EC2 instances. If an instance has no instance store volumes, these metrics will either report a 0 or not be reported at all.
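The per-second conversion for disk metrics can be sketched as follows. The inputs are assumed to be per-period totals of the four disk metrics, and the function name and returned keys are hypothetical.

```python
def disk_rates(read_ops, write_ops, read_bytes, write_bytes, period_seconds):
    """Turn per-period disk totals into per-second rates and average I/O size."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    total_ops = read_ops + write_ops
    total_bytes = read_bytes + write_bytes
    return {
        "read_iops": read_ops / period_seconds,
        "write_iops": write_ops / period_seconds,
        "throughput_bytes_per_s": total_bytes / period_seconds,
        # Guard against division by zero when the disk was idle.
        "avg_io_size_bytes": total_bytes / total_ops if total_ops else 0.0,
    }


# Totals for one 5-minute (300-second) period:
rates = disk_rates(
    read_ops=3000, write_ops=6000,
    read_bytes=12_288_000, write_bytes=24_576_000,
    period_seconds=300,
)
print(rates["read_iops"])               # 10.0
print(rates["write_iops"])              # 20.0
print(rates["throughput_bytes_per_s"])  # 122880.0
print(rates["avg_io_size_bytes"])       # 4096.0
```

Dividing bytes by operations gives the average I/O size, which helps distinguish many small operations from a few large ones when the raw totals look similar.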
Like network traffic metrics, metrics associated with disk operations can vary widely depending on your use cases, and there is no specific number that is considered good when monitoring these EC2 metrics. However, EC2 disk metrics are another set of data points that you can correlate with other activities (such as CPU utilization) in order to gain greater insight into monitoring events and trends.
Other Metrics for EC2 Monitoring
Beyond the seven essential EC2 metrics described above, there are some additional metrics that you may want to track, depending on your goals:
- CPU credit metrics: Amazon uses a concept that it calls CPU credits to allow burstable EC2 instances to “burst,” meaning that they temporarily consume more CPU resources than those that are allocated to them by default. If you are running instances that take advantage of CPU bursting, you can monitor your credit usage and availability with the following metrics: CPUCreditUsage, CPUCreditBalance, CPUSurplusCreditBalance, and CPUSurplusCreditsCharged.
- Status checks: EC2 can report the StatusCheckFailed, StatusCheckFailed_Instance, and StatusCheckFailed_System metrics to help you monitor whether your EC2 instances themselves are up and running. These metrics don’t offer detailed insight into how well EC2 instances are performing; they are essentially pass/fail checks of whether instances are responding or not. Thus, status checks are not very helpful for tracking EC2 performance, but they can be useful in situations where you are concerned that your instances may not start properly or will crash unexpectedly.
- Account usage metrics: If cost is a concern, you can use the ResourceCount metric to monitor how many EC2 resources are running in your AWS account. This type of monitoring helps you prevent cost overages and identify situations where you are consuming more EC2 resources than you intended.
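As an illustration of how the status check metrics in the list above might be read, the sketch below summarizes datapoints for StatusCheckFailed_Instance and StatusCheckFailed_System, each of which reports 0 when the check passes and 1 when it fails. The function name and the wording of the findings are assumptions for illustration.

```python
def summarize_status_checks(instance_failed, system_failed):
    """Summarize 0/1 datapoints from the two EC2 status check metrics.

    instance_failed: datapoints from StatusCheckFailed_Instance
    system_failed:   datapoints from StatusCheckFailed_System
    """
    findings = []
    if any(value >= 1 for value in system_failed):
        findings.append("system status check failed")
    if any(value >= 1 for value in instance_failed):
        findings.append("instance status check failed")
    return findings or ["all status checks passed"]


print(summarize_status_checks([0, 0, 0], [0, 0, 0]))
# ['all status checks passed']
print(summarize_status_checks([0, 1, 0], [0, 0, 0]))
# ['instance status check failed']
```

Because the result is only pass or fail, a summary like this is better suited to alerting on availability than to tracking performance trends.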
Conclusion
Monitoring EC2 effectively requires tracking a set of seven core metrics that offer visibility into CPU utilization, network activity, and disk operation activity. For certain use cases, there are additional metrics that you may wish to track.
While all of these metrics can be monitored through AWS CloudWatch, CloudWatch offers only basic data visualization, data retention, and related features. For full-scale EC2 monitoring, consider using an external tool like Sumo Logic, which can collect EC2 metrics as well as metrics for a range of other Amazon cloud services.
Sumo Logic offers a much more sophisticated set of analytics and visualization features than CloudWatch provides natively. With Sumo Logic, you can be confident that you have full ability to monitor all aspects of EC2 performance and stability.