What is AWS S3 cost optimization?
Amazon Simple Storage Service (Amazon S3) is one of the most popular Amazon Web Services (AWS) offerings with flexible pricing. AWS S3 cost optimization is the process by which an engineering or DevOps team leverages features and strategies to reduce their overall costs for storage.
Key takeaways
- Example costs for S3 depend on storage costs, API costs, and data transfer by region.
- AWS S3 users can save on costs by choosing the same region for S3 and EC2 to minimize data transfer costs.
- To get a more granular per-bucket view of your S3 costs, enable Cost Explorer or enable reporting to an S3 bucket.
AWS S3 pricing: How S3 cost is calculated
Optimizing your AWS S3 costs relies on understanding S3 pricing.
The three major costs associated with Amazon S3 are:
- Storage cost: ~$0.03 / GB / month, charged hourly.
- API cost for operations on files: ~$0.005 / 10,000 read requests; write requests are ten times more expensive.
- Data transfer out of the AWS region: ~$0.02 / GB to a different AWS region, ~$0.06 / GB to the internet.
The actual prices will differ based on volume and region, but optimization techniques will remain the same.
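As a rough sketch of how the three components add up, the snippet below estimates a monthly bill from the approximate prices listed above. The function name and the sample workload are illustrative assumptions, and real prices vary by region and volume.

```python
# Approximate prices from the list above; real prices vary by region and volume.
STORAGE_PER_GB_MONTH = 0.03
READ_PER_10K_REQUESTS = 0.005
WRITE_PER_10K_REQUESTS = 0.05  # writes are ten times more expensive than reads
CROSS_REGION_PER_GB = 0.02
INTERNET_PER_GB = 0.06


def estimate_monthly_cost(stored_gb, reads, writes, cross_region_gb=0.0, internet_gb=0.0):
    """Return the storage, API, and transfer components of a monthly bill in USD."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    api = (reads / 10_000) * READ_PER_10K_REQUESTS + (writes / 10_000) * WRITE_PER_10K_REQUESTS
    transfer = cross_region_gb * CROSS_REGION_PER_GB + internet_gb * INTERNET_PER_GB
    return {"storage": storage, "api": api, "transfer": transfer,
            "total": storage + api + transfer}


# Hypothetical workload: 1 TB stored, 1M reads, 100k writes, some transfer.
bill = estimate_monthly_cost(1000, 1_000_000, 100_000, cross_region_gb=50, internet_gb=10)
for component, usd in bill.items():
    print(f"{component}: ${usd:.2f}")
```

Breaking the bill into components like this makes it easier to see which of the three areas deserves attention first.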
S3 costs basics
One of the most important aspects of Amazon S3 pricing structure is that you only pay for the storage used and not provisioned.
AWS pricing example 1:
For a 1 GB file stored on S3 with 1 TB of storage provisioned, you pay for 1 GB only. By contrast, you pay for provisioned capacity in many other services, such as Amazon EC2, Amazon Elastic Block Store (Amazon EBS) and Amazon DynamoDB.
AWS pricing example 2:
In the case of an Amazon EBS disk, you pay for the full 1 TB of disk even if you store only a 1 GB file. This makes managing S3 costs easier than managing costs for many other services, including Amazon EBS and Amazon EC2. On S3, there is no risk of over-provisioning and no need to manage disk utilization.
Given this, most S3 users don’t need to worry about cost optimization immediately. The best bet is to start simple and worry about the monthly S3 bill after it has crossed a certain threshold.
Choosing the right AWS region for your S3 bucket
Ensure EC2 and S3 are in the same AWS region. The main benefits of having S3 and EC2 in the same region are better performance and lower transfer cost. Data transfer is free between EC2 and S3 in the same region, while downloading files from another AWS region costs $0.02/GB. Companies that process data within the same region can mostly eliminate the S3 to EC2 inter-region data transfer cost. If our S3 bucket were in a different region and each file were downloaded on average three times per month, transfer would add 3 * 0.02 = $0.06 / GB on top of roughly $0.03 / GB for storage, so our S3 costs would triple.
Pick the right AWS naming schema (AWS guide)
Though this doesn’t directly impact the S3 cost, a poor key naming scheme can make S3 so much slower that you need to add a caching layer.
Monitor AWS S3 credential usage
Engineers and developers typically use IAM access keys and secret keys inside applications. While this may be required for users to perform operations directly on S3 and may simplify architecture, it also means that any user can potentially create additional costs, whether maliciously or by simple accident. At a minimum, follow the best practices listed below for S3 credentials:
Use temporary credentials you can revoke later. Give them need-to-know access (minimum rights) to complete the task.
Monitor access keys and credential usage regularly to avoid any surprises.
A good example is any S3 bucket where a third party can upload objects. You should set up a CloudWatch alarm on the “BucketSizeBytes” metric. This helps you notice when malicious users upload terabytes of data to your S3 bucket.
Postpone using Amazon Glacier
Don’t start with an archival storage class such as Amazon Glacier unless you don’t plan to read these objects. Retrieval can become costly and may complicate your overall infrastructure.
How to analyze S3 pricing
The best way to start cost optimization efforts is to review the AWS bill and invoices:
In the AWS console, review your aggregated AWS S3 spend.
- To get a more granular per-bucket view of your S3 costs, enable Cost Explorer or enable reporting to an S3 bucket.
Cost Explorer is the easiest way to get started.
Downloading data from “S3 reports” to a spreadsheet gives you more flexibility.
Once you reach a certain scale (e.g., Sumo Logic bill is over 1 GB / month), using dedicated cost monitoring SaaS such as CloudHealth is the best bet.
Remember that the AWS bill is updated every 24 hours for storage charges, even if you pay for S3 storage by the hour.
- Getting per-object data can be handy but beware of the cost if you require it regularly.
You can enable S3 server access logging, which provides an entry for each API access. Remember that this access log can grow quickly and cost a lot to store.
You can list all objects using the API. Either write your own script or use a third-party GUI, such as an S3 browser.
E.g., 85%+ of AWS S3 costs for Sumo Logic are related to storage. The second group is API calls, which account for around 10% of the S3 cost. However, there are some S3 buckets where API calls are responsible for 50% of the costs. We used to pay for data transfers, but this cost was negligible.
Common S3 cost optimizations
It usually makes sense to focus on areas where you spend the most – storage, API, or data transfer. Some cost optimizations improve your overall efficiency, while others automate waste reductions.
Here are common ideas to consider for reducing your S3 storage costs.
Delete files after a certain date that are no longer relevant.
Many deployments use S3 for log collection but later send the logs to Sumo Logic. You can automate deletion using S3 lifecycle rules, for example by deleting objects seven days after their creation time. Similarly, if you use S3 for backups, it makes sense to delete them after a year.
Delete unused files that you can recreate later.
For example, the same image is often stored in many resolutions for thumbnails or galleries that are rarely accessed. It may make sense to keep only the original image and recreate the other resolutions on the fly.
When using a versioned S3 bucket, use the “lifecycle” feature to delete old versions.
By default, a delete or overwrite in a versioned S3 bucket keeps all previous data forever, and you pay for it forever. In most use cases, you want to keep older versions only for a certain time. You can set up a lifecycle rule for that.
Clean up incomplete multipart uploads.
If you upload a lot of large S3 objects, any interrupted upload may leave behind partial objects that are not visible but that you still pay to store. It almost always makes sense to clean up incomplete uploads after seven days. If you have a petabyte S3 bucket, then even 1% of incomplete uploads may waste terabytes of space.
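The cleanup ideas above (expire old objects, expire old versions, abort incomplete multipart uploads) can be combined in one lifecycle rule. Below is a minimal sketch, assuming the dictionary shape accepted by boto3's `put_bucket_lifecycle_configuration`. The rule ID, the `logs/` prefix, and the day counts are illustrative assumptions, and the snippet only builds and prints the configuration without calling AWS.

```python
import json


def build_lifecycle_configuration(prefix="logs/", expire_days=7,
                                  noncurrent_days=30, abort_multipart_days=7):
    """Build one lifecycle rule covering the three cleanup ideas above."""
    return {
        "Rules": [
            {
                "ID": "expire-old-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Delete current objects N days after creation.
                "Expiration": {"Days": expire_days},
                # In versioned buckets, delete old versions after N days.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Remove the invisible parts left behind by interrupted uploads.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_multipart_days},
            }
        ]
    }


config = build_lifecycle_configuration()
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Review the rule against a test bucket first, because expiration deletes data.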
Lower AWS data transfer costs by compressing data
Use a fast compression algorithm, such as LZ4, which performs well and reduces both your storage requirement and your cost. In many use cases, it also makes sense to use more compute-intensive compression such as GZIP or ZSTD. You usually trade CPU time for better network IO and less spending on S3. For example, most Sumo Logic objects are compressed with GZIP. Most likely, we will migrate to ZSTD, which should give us better performance while using less space.
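To get a feel for the trade-off, the sketch below compresses a synthetic log payload at a fast and a thorough setting. LZ4 and ZSTD need third-party packages, so this example swaps in the standard library's `zlib` at levels 1 and 9 as a stand-in for "fast" versus "compute-intensive" compression. The sample log lines are made up for illustration.

```python
import zlib


def compression_report(payload: bytes) -> dict:
    """Compare payload size before and after fast and thorough compression."""
    fast = zlib.compress(payload, 1)      # favors speed
    thorough = zlib.compress(payload, 9)  # favors size, costs more CPU
    return {"original": len(payload), "fast": len(fast), "thorough": len(thorough)}


# Synthetic, repetitive log lines (an assumption; real ratios depend on your data).
sample = b"".join(
    f"2024-01-01T00:00:{i % 60:02d}Z INFO request_id={i} status=200 path=/api/items\n".encode()
    for i in range(5000)
)

report = compression_report(sample)
for name, size in report.items():
    print(f"{name}: {size} bytes")
```

Since both storage and transfer are billed per GB, the compressed size is what you pay for. Measure on a sample of your own objects before choosing an algorithm.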
Focus on data format for Big Data applications
Using better data structures can enormously impact your application performance and storage size. The biggest changes:
Use a binary format (e.g., AVRO) rather than a human-readable format (e.g., JSON). If you store a lot of numbers, binary formats such as AVRO can store large numbers in less space than JSON. For instance, “1073741007” takes 10 bytes as JSON text versus 4 to 5 bytes in a binary encoding (4 as a fixed-width 32-bit integer, 5 in AVRO's variable-length integer encoding).
Choose between row-based and column-based storage. The general rule of thumb is to use columnar storage for analytics and batch processing, where it can provide better compression and storage optimization. However, this topic deserves its own article.
Decide what to index, what to store as metadata, and what to calculate on the fly. A Bloom filter may remove the need to access some files at all. Some indexes may waste storage with little performance gain, especially if you have to download the whole file from S3 anyway.
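The size difference between text and binary encodings of a number can be checked with the standard library. This is a sketch and does not use an Avro library: `zigzag_varint` is a hand-written helper that follows the variable-length zigzag scheme Avro's specification describes for integers, and the fixed-width encoding uses `struct`.

```python
import json
import struct


def zigzag_varint(value: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag varint (hand-written sketch)."""
    zigzag = (value << 1) ^ (value >> 63)  # map signed values to unsigned
    out = bytearray()
    while True:
        low_bits = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


number = 1073741007
as_json = json.dumps(number).encode("utf-8")  # decimal text
as_int32 = struct.pack("<i", number)          # fixed-width 32-bit integer
as_varint = zigzag_varint(number)             # variable-length encoding

print(len(as_json), len(as_int32), len(as_varint))  # prints: 10 4 5
```

The saving per value is small, but it adds up when objects consist mostly of numbers.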
Use infrequent access data storage class in Amazon S3
The infrequent access (IA) storage class provides you with the same API and performance as regular S3 storage. IA is approximately four times cheaper than S3 standard storage ($0.007 / GB / month vs. $0.03 / GB / month), but the catch is that you pay for retrieval ($0.01 / GB). Retrieval is free on the standard S3 storage class.
If you download objects less than two times a month, you save money using IA. Let’s consider the following scenarios where IA can considerably reduce the cost.
Scenario 1: using IA for disaster recovery
Backup files kept for disaster recovery are rarely read. It makes sense to upload any such object over 128KB directly to IA and save 60% on storage for a year without losing the availability or durability of the data.
Scenario 2: use IA for infrequently accessed data
If some class of S3 objects is downloaded on average 0.2 times a month (that is, 20% of the data is read each month), it makes sense to keep it in IA. For every 1GB, we save $0.021 / month (S3 standard storage cost - IA storage cost - IA access cost = 0.03 - 0.007 - 20% * 0.01). Multiply that by a petabyte, and that’s just the monthly savings.
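The same arithmetic can be sketched as a small helper that also gives the break-even download frequency. It uses the approximate prices quoted in this section and, as a simplifying assumption, ignores IA's minimum object size and minimum storage duration. The function names are illustrative.

```python
STANDARD_STORAGE = 0.03  # $ / GB / month
IA_STORAGE = 0.007       # $ / GB / month
IA_RETRIEVAL = 0.01      # $ / GB retrieved


def ia_monthly_saving_per_gb(downloads_per_month: float) -> float:
    """Saving in $ / GB / month from using IA instead of standard storage."""
    return STANDARD_STORAGE - IA_STORAGE - downloads_per_month * IA_RETRIEVAL


def break_even_downloads_per_month() -> float:
    """Download frequency at which IA and standard storage cost the same."""
    return (STANDARD_STORAGE - IA_STORAGE) / IA_RETRIEVAL


print(f"saving at 0.2 downloads/month: ${ia_monthly_saving_per_gb(0.2):.3f} per GB")
print(f"break-even: {break_even_downloads_per_month():.1f} downloads per month")
```

A negative result from `ia_monthly_saving_per_gb` means the objects are read often enough that standard storage is the cheaper choice.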
IA is great, but when is it not?
IA has restrictions such as a minimum billable object size and a minimum storage duration. IA charges for at least 128KB of data per object and for at least 30 days of storage. In addition, data migration to and from “S3 standard” costs one API call.
However, IA is significantly easier to use than Glacier. Recovery from Glacier can take a very long time, and any increase in speed will increase your cost. If you store 1TB of data on AWS Glacier, you can extract that data for free at the rate of 1.7 GB / day. Recovering 1TB in an hour requires a 998 GB / h peak recovery rate. This will cost 0.01 * 998 * 24 * 30 = $7186! If you decide to recover 1TB in 2 hours, you will pay $3592.
How to save on S3 API costs
Here are some tips on reducing costs for your API access.
API calls cost the same irrespective of the data size
API calls are charged per object, regardless of its size. Uploading 1 byte costs the same as uploading 1GB. So usually, small objects can cause API costs to soar. PUT calls cost $0.005 /1000 calls.
For instance, API cost is negligible if you upload 10GB as a single file. The same data divided into 5MB chunks costs ~$0.01. However, 10KB chunks will cost you ~$5.00. The cost grows in inverse proportion to the chunk size, so it climbs quickly as you upload smaller files.
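The figures above follow from counting one PUT request per chunk. Here is a minimal sketch, assuming the $0.005 per 1000 PUT calls quoted above and binary units (1 GB = 1024 MB); the function name is illustrative.

```python
PUT_PRICE_PER_1000 = 0.005  # $ per 1000 PUT calls, as quoted above

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def put_cost(total_bytes: int, chunk_bytes: int) -> float:
    """API cost of uploading total_bytes as objects of chunk_bytes each."""
    requests = -(-total_bytes // chunk_bytes)  # ceiling division
    return requests * PUT_PRICE_PER_1000 / 1000


for label, chunk in [("single 10GB file", 10 * GB), ("5MB chunks", 5 * MB), ("10KB chunks", 10 * KB)]:
    print(f"{label}: ${put_cost(10 * GB, chunk):.6f}")
```

Halving the chunk size doubles the request count, which is why tiny objects dominate the API portion of the bill.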
Batch objects whenever it makes sense to do so
Usually, a lot of tiny objects can get very expensive very quickly, so it makes sense to batch objects. If you always upload and download all objects together, it is a no-brainer to store them as a single file (using tar). You should design your system to avoid a huge number of small files. It is usually a good pattern to have some clustering that prevents small files. For example, instead of creating a new file for every record, you can keep appending data to the same file and start a new one every 15 seconds or every 10MB, whichever you hit first.
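The "15 seconds or 10MB, whichever comes first" pattern can be sketched as a small in-memory batcher. The class name, the `flush` callback, and the injectable clock are illustrative assumptions. In a real system the callback would upload the batch to S3, and a background timer would also flush batches that stop receiving records.

```python
import time


class Batcher:
    """Group small records into one payload, flushed by size or age."""

    def __init__(self, flush, max_bytes=10 * 1024 * 1024, max_age_seconds=15.0,
                 clock=time.monotonic):
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._records = []
        self._size = 0
        self._started = None

    def add(self, record: bytes) -> None:
        if self._started is None:
            self._started = self._clock()  # age is counted from the first record
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:  # note: age is only checked when a record arrives
            self.flush()

    def flush(self) -> None:
        if self._records:
            self._flush(b"".join(self._records))
        self._records, self._size, self._started = [], 0, None


# Demo with a fake clock and a tiny size limit so both triggers are visible.
uploads = []
now = [0.0]
batcher = Batcher(uploads.append, max_bytes=10, max_age_seconds=15.0, clock=lambda: now[0])
batcher.add(b"aaaa")     # 4 bytes buffered
batcher.add(b"bbbbbbb")  # 11 bytes buffered: size limit reached, one upload
now[0] = 20.0
batcher.add(b"c")        # starts a new batch at t=20
now[0] = 36.0
batcher.add(b"d")        # batch is 16 seconds old: age limit reached, one upload
print(uploads)
```

Four records become two uploads here, and the ratio improves as records get smaller relative to the size limit.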
If you have tiny files, it makes sense to use a database such as DynamoDB or MySQL instead of S3. You can also use a database to group objects and later upload them to S3. 10 writes per second in DynamoDB cost $0.0065 / hour, or $(0.0065 / 3600) / sec. Assuming 80% utilization, DynamoDB costs $0.000226 / 1000 calls ((0.0065 / 3600) * 1000 / (10 * 0.8)) vs. S3 PUT at $0.005 / 1000 calls. That makes DynamoDB 95% cheaper than S3 in this use case. S3 file names are not a database. There needs to be a better design than relying on S3 LIST calls, and using a proper database can typically be 10-20 times cheaper.
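The per-1000-writes comparison above can be reproduced step by step. This is a sketch using only the prices and the 80% utilization assumption quoted in the paragraph; the constant and function names are illustrative.

```python
DYNAMODB_PRICE_PER_HOUR = 0.0065    # $ for 10 provisioned writes per second
PROVISIONED_WRITES_PER_SECOND = 10
UTILIZATION = 0.8                   # share of provisioned writes actually used
S3_PUT_PRICE_PER_1000 = 0.005       # $ per 1000 PUT calls


def dynamodb_price_per_1000_writes() -> float:
    """Effective DynamoDB price per 1000 writes at the assumed utilization."""
    price_per_second = DYNAMODB_PRICE_PER_HOUR / 3600
    effective_writes_per_second = PROVISIONED_WRITES_PER_SECOND * UTILIZATION
    return price_per_second / effective_writes_per_second * 1000


dynamodb = dynamodb_price_per_1000_writes()
saving = 1 - dynamodb / S3_PUT_PRICE_PER_1000
print(f"DynamoDB: ${dynamodb:.6f} per 1000 writes")
print(f"S3 PUT:   ${S3_PUT_PRICE_PER_1000:.6f} per 1000 writes")
print(f"saving:   {saving:.1%}")
```

The result is sensitive to utilization: provisioned capacity that sits idle is still billed, so a lower utilization raises the effective DynamoDB price.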
How to save on AWS data transfer costs
If you do a lot of cross-region S3 transfers, it may be cheaper to replicate your S3 bucket to the other region than to download objects across regions each time.
Suppose 1GB of data in us-west-2 is expected to be transferred 20 times to EC2 in us-east-1. If you download it across regions each time, you will pay $0.40 for data transfer (20 * 0.02).
However, if you first replicate it to a mirror S3 bucket in us-east-1, you pay $0.02 for the transfer and $0.03 for storage over a month, or $0.05 in total. That is about 87% cheaper. This S3 feature is called cross-region replication. You will also get better performance along with the cost benefits.
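A short sketch of this comparison, using the approximate prices from this article ($0.02 / GB inter-region transfer, $0.03 / GB / month storage). The function names are illustrative, and request costs are ignored as a simplifying assumption.

```python
INTER_REGION_PER_GB = 0.02   # $ / GB transferred between regions
STORAGE_PER_GB_MONTH = 0.03  # $ / GB / month


def repeated_download_cost(size_gb: float, downloads: int) -> float:
    """Cost of fetching the data across regions on every download."""
    return size_gb * downloads * INTER_REGION_PER_GB


def replicated_cost(size_gb: float, months: float = 1.0) -> float:
    """Cost of copying the data once and storing the replica locally."""
    return size_gb * (INTER_REGION_PER_GB + STORAGE_PER_GB_MONTH * months)


direct = repeated_download_cost(1, 20)
replica = replicated_cost(1)
print(f"repeated downloads: ${direct:.2f}")
print(f"replicate once:     ${replica:.2f}")
print(f"saving:             {1 - replica / direct:.1%}")
```

With these prices, replication pays off once an object is downloaded across regions more than 2.5 times a month on average.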
If the files your users download are stored in S3 (e.g., images on consumer sites), consider using the AWS content delivery network (CDN), Amazon CloudFront. Depending on the case, CloudFront can be cheaper or more expensive than serving directly from S3. Either way, you gain a lot of performance.
CDN providers, such as Cloudflare, charge a flat fee. If you have a lot of static assets, then a CDN can give huge savings over S3, as just a tiny percentage of the original requests will hit your S3 bucket.
You may also use S3 to save on data transfer between EC2 instances in different availability zones (AZ). Data transfer between two EC2 instances in different AZs costs $0.02/GB, but S3 is free to download from any AZ.
Consider the scenario where 1 GB of data is transferred 20 times from one EC2 server to another in a different availability zone. It will cost $0.40 (20 * 0.02). However, if you upload it to S3, then you pay only for storage ($0.03 / GB / month), and the best part is that data transfer between S3 and EC2 is free. S3 charges per hour per GB. Assuming the data is deleted from S3 after a day, the S3 cost will be $0.001. That is a 99% cost saving on that data transfer by using S3.
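The inter-AZ scenario can be sketched the same way. The prices come from this article, a 30-day month is assumed for prorating storage, and request costs are ignored as a simplification. The function names are illustrative.

```python
INTER_AZ_PER_GB = 0.02       # $ / GB between EC2 instances in different AZs
STORAGE_PER_GB_MONTH = 0.03  # $ / GB / month, prorated by the hour
HOURS_PER_MONTH = 24 * 30    # assumption: 30-day month


def direct_inter_az_cost(size_gb: float, transfers: int) -> float:
    """Cost of sending the data directly between AZs on every transfer."""
    return size_gb * transfers * INTER_AZ_PER_GB


def via_s3_cost(size_gb: float, hours_stored: float) -> float:
    """Cost of parking the data in S3; S3 to EC2 transfer in-region is free."""
    return size_gb * STORAGE_PER_GB_MONTH * hours_stored / HOURS_PER_MONTH


direct_az = direct_inter_az_cost(1, 20)
through_s3 = via_s3_cost(1, hours_stored=24)
print(f"direct inter-AZ: ${direct_az:.3f}")
print(f"via S3 (1 day):  ${through_s3:.3f}")
print(f"saving:          {1 - through_s3 / direct_az:.2%}")
```

The saving shrinks if the data stays in S3 for a long time or is read only once, so delete the staging objects as soon as the consumers are done.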
Sumo Logic: AWS monitoring plus multi-cloud support
Sumo Logic's multi-cloud SaaS analytics platform has integrations for major cloud service providers like Google Cloud, Microsoft Azure and AWS, which continue to take hold as cloud adoption grows.
With Sumo Logic's AWS monitoring capability, users benefit from deep integration with the AWS platform and security services. Sumo Logic's log aggregation capabilities, machine learning, and pattern detection make it easy for enterprise organizations to gain visibility into AWS deployments, manage application performance, maintain cloud security and comply with internal and external standards. If you are already working with AWS or considering leveraging it to amplify your business results, consider Sumo Logic for deep insights into your application performance and security.
Learn more about Sumo Logic's AWS Monitoring Solution.