I've got an issue whereby uploads to and downloads from AWS S3 via the AWS CLI are very slow. By very slow I mean it consistently takes around 2.3s for a 211 KB file, which works out to an average throughput of roughly 90 KB/s: extremely slow for such a small file. My webapp is heavily reliant on internal APIs, and I've narrowed down that the bulk of the APIs' round-trip time is spent uploading and downloading files from S3.
Some details:
Using the latest version of aws cli (aws-cli/1.14.44 Python/3.6.6, Linux/4.15.0-34-generic botocore/1.8.48) on an AWS hosted EC2 instance
Instance is running the latest version of Ubuntu (18.04)
Instance is in availability zone ap-southeast-2a (Sydney region, ap-southeast-2)
Instance is granted role based access to S3 via a least privilege policy (i.e. minimum rights to the buckets that it needs access to)
Type is t2.micro, which should have Internet bandwidth of ~60 Mbit/s or so
S3 buckets are in ap-southeast-2
Same result with encrypted (default) and unencrypted files
Same result regardless of whether the object name contains a random collection of alphanumeric characters
The issue persists: even after multiple cp attempts and after a reboot, the cp consistently takes around 2.3s
This leads me to wonder whether S3 or the EC2 instance (which is using a standard Internet Gateway) is being throttled
I've tested downloading the same file, on the same instance, from a webserver using wget and it takes 0.0008s (i.e. 0.8ms)
So to summarise:
Downloading the file from S3 via the AWS CLI takes 2.3s (i.e. 2300ms)
Downloading the same file from a webserver (> Internet > Cloudflare > AWS > LB > Apache) via wget takes 0.0008s (i.e. 0.8ms)
I need to improve AWS CLI S3 download performance because the API is going to be quite heavily used in the future.
Okay this was a combination of things.
I'd had problems with the AWS PHP API SDK previously (mainly related to orphaned threads when copying files), so I had changed my APIs to use the AWS CLI for simplicity and reliability reasons. Although they worked, I encountered a few performance issues:
Firstly, because my instance had role-based access to my S3 buckets, the AWS CLI was taking around 1.7s just to determine which region my buckets were in. Configuring the CLI with a default region overcame this (see the sketch after these points)
Secondly, because PHP has to spawn a whole new shell when running an exec() command (e.g. exec("aws s3 cp s3://bucketname/objectname.txt /var/app_path/objectname.txt")), that is a very slow exercise. I know it's possible to offload shell commands via Gearman or similar, but since simplicity was one of my goals I didn't want to go down that road
Finally, because the AWS CLI uses Python, it takes almost 0.4s just to initialise before it even begins processing a command. That might not seem like a lot, but when my API is in production it will have quite an impact on users and infrastructure alike
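For reference, the region lookup from the first point can be avoided by pinning a default region for the CLI. A minimal sketch, assuming the standard config layout (the region here matches my Sydney buckets):
aws configure set default.region ap-southeast-2
or, equivalently, in ~/.aws/config:
[default]
region = ap-southeast-2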
To cut a long story short, I've done two things:
Reverted to using the AWS PHP API SDK instead of the AWS CLI
Referred to the correct S3 region name within my PHP code
My APIs are now performing much better, i.e. from 2.3s to an average of around 0.07s.
This doesn't make my original issue go away but at least performance is much better.
I found that if I try to download an object using aws s3 cp, the download hangs close to completion when the object is larger than 500MB.
However, using get-object directly causes no hang or slowdown whatsoever. Therefore, instead of using
aws s3 cp s3://my-bucket/path/to/my/object .
I get the object with
aws s3api get-object --bucket my-bucket --key path/to/my/object out-file
and experience no slowdown.
AWS S3 is slow and painfully complex, and you can't easily search for files. If used with CloudFront it is faster and there are supposed to be advantages, but the complexity shifts from very complex to insanely complex, because caching obscures any file changes, and invalidating the cache is hit and miss unless you change the file name, which in turn means updating the reference to that file in the page that uses it.
In practice, particularly if all or most of your traffic is located in the same region as your load balancer, I have found that even a low-specced web server located in the same region is faster by a factor of 10. If you need multiple web servers attached to a common volume, AWS only provides this in certain regions, so I got around it by using NFS to share the volume across multiple web servers. This gives you a file system mounted on a server you can log in to, list, and search. S3 has become a turnkey solution for a problem that was solved better a couple of decades ago.
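For what it's worth, the NFS sharing mentioned above is just the standard export-and-mount setup. A minimal sketch, assuming a placeholder web root of /var/www and a private subnet of 10.0.0.0/24:
# on the server that owns the volume, add to /etc/exports:
/var/www 10.0.0.0/24(rw,sync,no_subtree_check)
# reload the export table
sudo exportfs -ra
# on each additional web server (10.0.0.10, the exporting server's IP, is a placeholder)
sudo mount -t nfs 10.0.0.10:/var/www /var/www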
You may try using boto3 to download files instead of aws s3 cp.
Refer to Downloading a File from an S3 Bucket
While my download speeds weren't as slow as yours, I managed to max out my ISP's download bandwidth with aws s3 cp by adding the following configuration to my ~/.aws/config:
[profile default]
s3 =
max_concurrent_requests = 200
max_queue_size = 5000
multipart_threshold = 4MB
multipart_chunksize = 4MB
If you don't want to edit the config file by hand, you can set the same values with aws configure set. Have a look at the documentation: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
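For example, the values above can be set from the command line like this (assuming the default profile):
aws configure set default.s3.max_concurrent_requests 200
aws configure set default.s3.max_queue_size 5000
aws configure set default.s3.multipart_threshold 4MB
aws configure set default.s3.multipart_chunksize 4MB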
Related
Use case:
I have one directory on-premises that I want to back up, let's say at every midnight, and I want to be able to restore it if something goes wrong.
This doesn't seem like a complicated task, but reading through the AWS documentation even this can be cumbersome and costly. Setting up a Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting one up on EC2 is costly as well.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a File Gateway (locally or as an EC2 instance):
It just mounts the files to an S3 bucket, and that's it. So my on-premises app would constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a Volume Gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just using a bare S3 bucket and copying my backups there via the AWS API/SDK with a manually created scheduled job.
Solutions:
Using point 1 from above: enable versioning, and the versions of the files will serve as recovery points.
Using point 3
I think I am looking for a mix of the File and Volume Gateways: working at the file level and making an asynchronous scheduled snapshot of the files.
How should this be handled? Isn't there a really easy way to just send a backup of a directory to AWS?
The easiest way to backup a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required, run the aws s3 sync command
For example
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
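Since the use case above calls for a backup at every midnight, a cron entry along these lines would schedule the sync (the folder and bucket names are placeholders, and cron usually needs the full path to the aws binary, shown here as /usr/local/bin/aws):
0 0 * * * /usr/local/bin/aws s3 sync /path/to/folder1 s3://bucketname/folder1/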
If you want to be more fancy and keep multiple backups, you could copy to a different target directory, or create a zip file first and upload the zip file, or even use a backup program like Cloudberry Backup that knows how to use S3 and can do traditional-style backups.
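As a rough sketch of the zip-first option (the archive name, folder, and bucket are placeholders):
BACKUP=backup-$(date +%F).zip
zip -r "$BACKUP" folder1
aws s3 cp "$BACKUP" s3://bucketname/backups/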
I have a use case where CSV files are stored in an S3 bucket by a service. My program, running on a Windows EC2 instance, has to use the CSV files dumped in that bucket. Mounting or copying: which approach is better for using the files, and how should I go about it?
Mounting the bucket as a local Windows drive will just cache info about the bucket and copy the files locally when you try to access them. Either way you will end up having the files copied to the Windows machine. If you don't want to program the knowledge of the S3 bucket into your application then the mounting system can be an attractive solution, but in my experience it can be very buggy. I built a system on Windows machines in the past that used an S3 bucket mounting product, but after so many bugs and failures I ended up rewriting it to simply perform an aws s3 sync operation to a local folder before the process ran.
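For reference, that sync-before-run step is a one-liner; a sketch with a placeholder bucket name and local cache folder:
aws s3 sync s3://my-csv-bucket/dumps C:\csv-cache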
I always suggest copying, whether via the CLI, direct API endpoints, or the SDK (whichever way AWS suggests), rather than mounting.
Actually, S3 is not built to be a filesystem; it's an object storage system. That's not to say you cannot do it, but it is not advisable. The correct way to use Amazon S3 is to put/get files using the S3 APIs.
And if you are concerned about network latency, I would say both approaches will be about the same. If you are thinking about directly modifying/editing a file within the file system: no, you cannot. Since Amazon S3 is designed for atomic operations, objects have to be completely replaced with modified copies.
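To illustrate that last point, "editing" an object really means downloading it, changing it locally, and uploading the whole thing again (bucket and key names are placeholders):
aws s3 cp s3://my-bucket/data.csv .
# ... edit data.csv locally ...
aws s3 cp data.csv s3://my-bucket/data.csv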
Can someone explain to me the best way to transfer data from a hard drive on an EC2 instance (running Windows Server 2012) to an S3 bucket in the same AWS account on a daily basis?
Background to this:
I'm generating a .csv file for one of our business partners daily at 11:00 am and I want to deliver it to S3 (he has access to our S3 bucket).
After that he can pull it out of S3 manually or automatically whenever he wants.
Hope you can help me; I've only found manual solutions with the CLI, but no automated way to do daily transfers.
Best Regards
You can directly mount S3 buckets as drives on your EC2 instances. That way you don't need any triggers or a daily task scheduler plus a third-party service, as objects would be directly available in the S3 bucket.
For Linux, you would typically use Filesystem in Userspace (FUSE). Take a look at this repo if you need it for Linux: https://github.com/s3fs-fuse/s3fs-fuse.
Regarding Windows, there is this tool:
https://tntdrive.com/mount-amazon-s3-bucket.aspx
If these tools don't suit you, or if you don't want to mount the S3 bucket directly, here is another option: whatever you can do with the CLI you should be able to do with the SDK. Therefore, if you are able to code in one of the various languages AWS Lambda supports (C#, Java, Go, PowerShell, Python, Node.js, Ruby), you could automate this using a Lambda function along with a daily scheduled trigger firing at 11 a.m.
Hope this helps!
Create a small application that uploads your file to an S3 bucket (there are some examples here). Then use Task Scheduler to execute your application on a regular basis.
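If the small application ends up being just the AWS CLI, the scheduling part could look roughly like this (the task name, file path, and bucket name are placeholders):
schtasks /Create /TN "DailyCsvToS3" /TR "aws s3 cp C:\exports\partner.csv s3://shared-bucket/" /SC DAILY /ST 11:00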
I'm working on a problem at the moment where I want to export a file from an EC2 instance running a Windows AMI to an S3 bucket at four-hour intervals. Currently, the architecture I'm thinking of is as follows.
1. CloudWatch Events rule using scheduled trigger
2. Rule triggers Lambda function to run
3. Lambda function uses some form of the AWS CLI on the Windows EC2 instance to extract (sync, cp, etc.) the file
4. File is placed in the S3 bucket
Does anyone see a path that's more efficient than this one? I want to ensure that I'm handling this in the most straightforward manner. Thanks in advance for any input!
It is quite difficult to have external code (e.g. an AWS Lambda function) cause something to execute on a Windows computer. You could use Systems Manager Run Command, but that's a rather complex solution.
It would be much simpler to have the Windows computer push the files to Amazon S3:
Create a scheduled task in Windows
Use aws s3 cp or aws s3 sync to copy the files to Amazon S3 (see the sketch below)
Done!
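As a sketch, assuming the AWS CLI is installed on the instance and using placeholder paths and names, a task such as this would push the files every four hours:
schtasks /Create /TN "ExportToS3" /TR "aws s3 sync C:\export s3://my-bucket/export/" /SC HOURLY /MO 4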
Your solution seems solid. Alternatively, you may want to write a daemon-like service (background process) that runs on each EC2 instance and transfers the data from that instance to S3. What I like about your solution is that you can centrally control the scheduling easily. For my distributed alternative you could have the processes read from a central config, but that seems more complicated than the CloudWatch/Lambda solution.
For the EC2 process solution, this may be useful:
How to mount Amazon S3 Bucket as a Windows Drive, but it should be easy (and more scalable) to just use the AWS SDK instead to talk to S3
We have a use case where we need to access millions of files from a Java application. Currently we are storing them on an EBS volume. This is turning out to be an expensive option (we have reached 15 TB now), so we are looking at S3 as the file storage. We are okay with bearing the latency.
One option is to mount S3 using s3fs and access the files that way. But I was exploring AWS Storage Gateway to see whether it can provide better caching and faster access. We have faced quite a few issues with s3fs, so we are looking for alternatives.
Avoid using s3fs if possible because it merely emulates a file system and is likely to run into problems with high utilization.
The best solution is for your application to access the files directly from Amazon S3 via S3 API calls, rather than pretending that S3 is a filesystem. This works very nicely for large-scale applications, and you would have no administration/maintenance overhead because your application communicates directly with S3. You should seriously consider this option.
If you do really need to access the files via a filesystem, consider using AWS Storage Gateway – File Gateway, which can present S3 storage as an NFS share.
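For reference, once a file share has been created on the gateway, mounting it from a Linux client looks roughly like this (the gateway IP, bucket name, and mount point are placeholders):
sudo mount -t nfs -o nolock,hard 10.0.0.50:/my-bucket /mnt/s3files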