Uploading a huge number of files to S3 is very slow - amazon-web-services

I am uploading 1.8 GB of data, made up of 500,000 small XML files, to an S3 bucket.
Uploading from my local machine takes a very long time: about 7 hours.
When I zip the files and upload the archive, it takes 5 minutes.
The problem is that I cannot simply zip them, because I would then need something in AWS to unzip the archive.
Is there any way to make this upload faster? The file names are all different, not sequential numbers.
Transfer Acceleration is enabled.
Please suggest how I can optimize this.

You can upload the zip file to an EC2 instance, unzip it there, and sync the extracted files to the S3 bucket.
The instance role must have permission to put objects into S3 for this to work.
I also suggest configuring an S3 VPC gateway endpoint before doing this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
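As a rough sketch of the "unzip on the instance" step, the Python below reads each member of the archive and uploads it from a thread pool, instead of extracting to disk and running a sync. This is a variation on the answer, not its exact method. The names `upload_zip_members` and `make_s3_uploader` are made up for illustration, and the boto3 part assumes the instance role supplies credentials.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def upload_zip_members(zip_path: str,
                       put_object: Callable[[str, bytes], None],
                       prefix: str = "",
                       workers: int = 32) -> List[str]:
    """Read every file in the archive and pass it to put_object(key, data)."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries, which end with a slash.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        # Members are read one by one here; only the uploads run concurrently.
        # Note: queued data waits in memory, so very large archives may need batching.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(put_object, prefix + name, archive.read(name))
                       for name in names]
            for future in futures:
                future.result()  # re-raise any upload error
    return [prefix + name for name in names]


def make_s3_uploader(bucket: str) -> Callable[[str, bytes], None]:
    """Hypothetical helper: wraps boto3's put_object for one bucket."""
    import boto3  # third-party; imported lazily so the function above stays stdlib-only

    s3 = boto3.client("s3")  # assumption: credentials come from the instance role

    def put(key: str, data: bytes) -> None:
        s3.put_object(Bucket=bucket, Key=key, Body=data)

    return put
```

On the instance this could be called as `upload_zip_members("data.zip", make_s3_uploader("my-bucket"))`, where the archive name and bucket name are placeholders.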

Related

Flutter upload files to AWS s3 faster with upload progress

I am facing problems while uploading one or more files (images/videos) to an AWS S3 bucket using the aws_s3_client plugin:
1. It takes a long time to upload a 10 MB file.
2. I cannot track the upload progress percentage.
3. There is no option to upload multiple files at once (to the same bucket).
4. Every upload has to verify the IAM user's access again. (Why can't we verify once with a single instance and keep the connection persistent/alive until the application is closed?)
I am not familiar with AWS services, so please suggest the best way to upload one or more files to an S3 bucket quickly, with an upload progress percentage, multiple files at once, and persistent/keep-alive verification.
For 1 and 2, use managed uploads: they provide an event for tracking upload progress and make uploads faster by using multipart upload. Beware that multipart upload needs parts of at least 5 MB (except the last part), so it only helps for files larger than that, up to the maximum object size.
For 3, note that S3 does not keep two files with the same name (key) in one bucket: uploading to an existing key overwrites the object. Depending on your requirements, you can turn on versioning for the bucket, which keeps the different versions of the same file.
For 4, you can generate and use pre-signed URLs. Pre-signed URLs have configurable timeouts that you can adjust depending on how long you want the link to be available for an upload.
Use multipart upload. It uploads a file to S3 in parts that can be sent in parallel, which makes large uploads faster.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
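To illustrate the managed/multipart upload with progress tracking, here is a minimal sketch in Python with boto3 rather than Flutter, so treat it as the idea only, not as the plugin's API. `ProgressTracker` and `upload_with_progress` are illustrative names, and the 8 MB threshold and chunk size are arbitrary assumptions. boto3 calls the `Callback` with the number of bytes sent since the previous call.

```python
import os
import threading


class ProgressTracker:
    """Callable that accumulates transferred bytes and exposes a percentage."""

    def __init__(self, total_bytes: int) -> None:
        self.total = total_bytes
        self.seen = 0
        self._lock = threading.Lock()  # callbacks can arrive from several threads

    def __call__(self, bytes_transferred: int) -> None:
        with self._lock:
            self.seen += bytes_transferred

    @property
    def percent(self) -> float:
        return 100.0 * self.seen / self.total if self.total else 100.0


def upload_with_progress(path: str, bucket: str, key: str) -> ProgressTracker:
    """Hypothetical helper: managed upload that switches to multipart for big files."""
    import boto3  # third-party; imported lazily
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=8 * 1024 * 1024,  # assumption: go multipart above 8 MB
        multipart_chunksize=8 * 1024 * 1024,  # size of each part
        max_concurrency=8,                    # parts uploaded in parallel
    )
    tracker = ProgressTracker(os.path.getsize(path))
    boto3.client("s3").upload_file(path, bucket, key, Callback=tracker, Config=config)
    return tracker
```

A UI could poll `tracker.percent` from another thread while the upload runs. Reusing one client for all uploads also avoids setting up credentials again for each file.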

How to unzip (or rather untar) in-place on Amazon S3?

I have lots of .tar.gz files (millions) stored in a bucket on Amazon S3.
I'd like to untar them and create the corresponding folders on Amazon S3 (in the same bucket or another).
Is it possible to do this without me having to download/process them locally?
It's not possible with only S3. You'll have to have something like EC2, ECS or Lambda, preferably running in the same region as the S3 bucket, to download the .tar files, extract them, and upload every file that was extracted back to S3.
With AWS Lambda you can do this! You avoid the extra cost of downloading the data out of AWS and uploading it again.
You can follow this blog post, but unzip instead of zipping!
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
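A rough sketch of the untar step that such a Lambda (or an EC2/ECS task) could run is shown below. It reads the .tar.gz as a stream and writes each extracted file back through a callback, so nothing touches local disk. `untar_stream`, `handler`, and the event fields `bucket`, `key`, and `dest_bucket` are hypothetical names chosen for this example, not something from the linked blog.

```python
import tarfile
from typing import BinaryIO, Callable, List


def untar_stream(source: BinaryIO,
                 put_object: Callable[[str, bytes], None],
                 prefix: str = "") -> List[str]:
    """Extract regular files from a .tar.gz stream and pass each to put_object."""
    keys = []
    # "r|gz" reads the archive as a forward-only stream (no seeking required).
    with tarfile.open(fileobj=source, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            data = archive.extractfile(member).read()
            key = prefix + member.name
            put_object(key, data)
            keys.append(key)
    return keys


def handler(event, context=None):
    """Hypothetical Lambda entry point; the event shape is an assumption."""
    import boto3  # third-party; available in the Lambda Python runtime

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=event["bucket"], Key=event["key"])["Body"]
    folder = event["key"][: -len(".tar.gz")] + "/"  # assumes keys end in .tar.gz
    return untar_stream(
        body,
        lambda k, d: s3.put_object(Bucket=event["dest_bucket"], Key=k, Body=d),
        prefix=folder,
    )
```

Each file is held in memory while it is uploaded, so the function's memory setting needs to cover the largest file inside an archive.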

Download files from Amazon S3 in batches using the AWS Command Line Interface (CLI)

I have a use case with around 40 million objects in an Amazon S3 bucket. I want to download them, process them, and re-upload them to another Amazon S3 bucket. The first issue is that I can't download all of them at once because I don't have enough hard disk space.
I want to download them in batches. For example, I want to download 1 million objects, process them, re-upload them, and delete them from local storage. After that, I will download the next 1 million objects and repeat the same process.
Is it possible to download files from an Amazon S3 bucket in batches using the AWS CLI?
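One way to picture the batch loop described in the question is the sketch below. It uses Python with boto3 instead of the AWS CLI, which is a swap on my part, and `batched` and `iter_keys` are illustrative names. The listing is paginated, so the 40 million keys are never held in memory at once.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def iter_keys(bucket: str, prefix: str = "") -> Iterator[str]:
    """Hypothetical helper: lazily list object keys, one page at a time."""
    import boto3  # third-party; imported lazily

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

The outer loop would then be `for batch in batched(iter_keys("source-bucket"), 1_000_000): ...`, downloading, processing, re-uploading, and deleting the local copies of each batch before moving on. The bucket name is a placeholder.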

How to speed up download of millions of files from AWS S3

I've been trying to download these files all summer from the IRS AWS bucket, but it is excruciatingly slow. Despite a decent internet connection, the files start downloading at about 60 kbps and get progressively slower over time. Admittedly, there are literally millions of files, but each one is very small, approximately 10-50 KB.
The command I use to download the bucket is:
aws s3 sync s3://irs-form-990/ ./ --exclude "*" --include "2018*" --include "2019*"
Is there a better way to do this?
Here is also a link to the bucket itself.
My first attempt would be to provision an instance in us-east-1 with an io-type (Provisioned IOPS) EBS volume of the required size. From what I see, there is about 14 GB of data from 2018 and 15 GB from 2019, so an instance with 40-50 GB should be enough. Or, as pointed out in the comments, you can have two instances, one for the 2018 files and one for the 2019 files. This way you can download the two sets in parallel.
Then you attach an IAM role to the instance that allows S3 access. With this in place, you run your aws s3 sync command on the instance. The traffic between S3 and your instance should be much faster than to your local workstation.
Once you have all the files, you zip them and then download the zip file. Zipping should help a lot, as the IRS files are text-based XML. Alternatively, maybe you could just process the files on the instance itself, without needing to download them to your local workstation.
General recommendations for speeding up transfers between S3 and instances are listed in the AWS blog:
How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?
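For the "zip them and then download the zip file" step, a small sketch using only the Python standard library could look like this. `zip_directory` is an illustrative name, and the archive should be written outside the directory being packed.

```python
import os
import zipfile


def zip_directory(src_dir: str, zip_path: str) -> int:
    """Pack every file under src_dir into one compressed archive; return the file count."""
    count = 0
    # ZIP_DEFLATED compresses the data, which suits text-based XML well.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir so the archive unpacks cleanly.
                archive.write(full_path, arcname=os.path.relpath(full_path, src_dir))
                count += 1
    return count
```

Downloading one archive replaces millions of small requests with a single large transfer, which is the same effect the first question on this page observed when uploading a zip.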

Is there any way to upload 50000 image files to Amazon S3 Bucket from a list of URLs

Is there any way to upload 50,000 image files to an Amazon S3 bucket? The 50,000 image file URLs are saved in a .txt file. Can someone please tell me a good way to do this?
It sounds like your requirement is: For each image URL listed in a text file, copy the images to an Amazon S3 bucket.
There is no built-in capability in Amazon S3 to do this. Instead, you would need to write an app that:
1. Reads the text file and, for each URL:
2. Downloads the image
3. Uploads the image to Amazon S3
Running this on an Amazon EC2 instance would be faster than running it locally, due to the low latency between S3 and EC2.
You could also get fancy and do it via Amazon EMR, which would be faster still thanks to parallel processing, but it requires knowledge of how to use Hadoop.
If you have a local copy of the images, you could order an AWS Snowball and use it to transfer the files to Amazon S3. However, it would probably be faster just to copy the files over the Internet (rough guess: at 1 MB per file, the total volume is 50 GB).
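The read/download/upload app described in this answer might be sketched as follows, with a thread pool added so that several images are in flight at once. All names here (`key_for`, `fetch`, `copy_urls`) and the `images/` key prefix are assumptions for illustration. The upload is passed in as a callable, which could wrap boto3's `put_object`.

```python
import posixpath
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse


def key_for(url: str, prefix: str = "images/") -> str:
    """Derive an object key from the last path segment of the URL."""
    # Caveat: two URLs with the same file name map to the same key.
    return prefix + posixpath.basename(urlparse(url).path)


def fetch(url: str) -> bytes:
    """Download one URL into memory."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def copy_urls(urls: Iterable[str],
              put_object: Callable[[str, bytes], None],
              download: Callable[[str], bytes] = fetch,
              workers: int = 16) -> Dict[str, str]:
    """Download each URL and upload it; returns a mapping of URL to object key."""
    def copy_one(url: str) -> str:
        key = key_for(url)
        put_object(key, download(url))
        return key

    cleaned = [u.strip() for u in urls if u.strip()]  # drop blank lines and newlines
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(cleaned, pool.map(copy_one, cleaned)))
```

Since an open text file iterates line by line, the URL list can be passed directly, for example `copy_urls(open("urls.txt"), put)` where `put` is your upload function and `urls.txt` is a placeholder name.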