I have lots of .tar.gz files (millions) stored in a bucket on Amazon S3.
I'd like to untar them and create the corresponding folders on Amazon S3 (in the same bucket or another).
Is it possible to do this without me having to download/process them locally?
It's not possible with S3 alone. You'll need some compute service such as EC2, ECS, or Lambda, preferably running in the same region as the S3 bucket, to download the .tar.gz files, extract them, and upload each extracted file back to S3.
You can do this with AWS Lambda! And since Lambda runs inside AWS, you avoid extra data-transfer costs: the download and upload stay within the AWS network.
You can follow this blog post, but extract instead of zipping:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
I have a use case with around 40 million objects in an Amazon S3 bucket. I want to download them, process them, and re-upload them to another Amazon S3 bucket. The first issue: I can't download all of them at once due to a shortage of hard disk space.
I want to download them in batches. For example, download 1 million objects, process them, re-upload them, and delete them from local storage; then download the next 1 million objects and repeat the same process.
Is it possible to use the AWS CLI to download files from an Amazon S3 bucket in batches like this?
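The AWS CLI has no built-in "download N at a time" mode, but you can list the keys and walk them in fixed-size batches yourself. A minimal sketch of the batching step; the per-batch download/upload commands in the comments are illustrative, not run here:

```python
from itertools import islice

def batches(keys, batch_size):
    """Yield successive lists of at most batch_size keys."""
    it = iter(keys)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# For each batch you would then (hypothetical wiring, not run here):
#   - download:  aws s3 cp s3://src-bucket/KEY .   (or boto3 download_file)
#   - process the files and re-upload them to the destination bucket
#   - delete the local copies before starting the next batch
```

The key listing itself can come from `aws s3api list-objects-v2` or a boto3 paginator, which return results in pages of up to 1,000 keys.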
What is the best way to get data from a directory on an SFTP server and copy it into an S3 bucket on AWS? I only have read permission on the SFTP server, so rsync is not an option.
My idea is to create a Glue job in Python that downloads this data and copies it into an S3 bucket. The files vary in size: one is about 600 MB, others are 4 GB.
Assuming you are talking about an sFTP server that is not on AWS, you have a few different options that may be easier than what you have proposed (although your solution could work):
Download the AWS CLI onto the sFTP server and copy the files via the AWS s3 cp command.
Write a script using the AWS SDK that takes the files and copies them. You may need to use the multi-part upload with the size of your files.
You can create an AWS managed sFTP server that links directly to your s3 bucket as the backend storage for that server, then use sftp commands to copy the files over.
Be mindful that you will need the appropriate permissions in your AWS account to complete any of these 3 (or 4) solutions.
I am using an API (https://scihub.copernicus.eu/userguide/OpenSearchAPI) to download a large number (100+) of large files (~5GB each) and I want to store these files in an AWS S3 bucket.
My first iteration was to download the files locally and use AWS CLI to move them to an S3 bucket: aws s3 cp <local file> s3://<mybucket>, and this works.
To avoid downloading locally, I used an EC2 instance and did basically the same thing from there. The problem, however, is that the files are quite large, so I'd prefer not to store them at all and instead use my EC2 instance to stream the files into my S3 bucket.
Is this possible?
You can use a byte array to populate an Amazon S3 bucket. For example, assume you are using the AWS SDK for Java V2. You can put an object into a bucket like this:
// Build the request that names the target bucket and key
PutObjectRequest putOb = PutObjectRequest.builder()
        .bucket(bucketName)
        .key(objectKey)
        .metadata(metadata)
        .build();

// Upload the object from an in-memory byte array
PutObjectResponse response = s3.putObject(putOb,
        RequestBody.fromBytes(getObjectFile(objectPath)));
Note the RequestBody.fromBytes method. Full example here:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/PutObject.java
One thing to note, however: if your files are really large, you may want to consider uploading in parts. See this example:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/S3ObjectOperations.java
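The part-size math behind a multipart upload is worth spelling out: S3 requires every part except the last to be at least 5 MiB, and each part is uploaded as its own request against a byte range of the source. A small sketch (in Python, for illustration alongside the Java example above) of computing those part boundaries:

```python
MIN_PART = 5 * 1024 * 1024  # S3 minimum size for every part but the last

def part_ranges(total_size, part_size=100 * 1024 * 1024):
    """Return (offset, length) pairs covering total_size bytes,
    suitable as the byte ranges of a multipart upload."""
    if part_size < MIN_PART:
        raise ValueError("part_size is below the S3 5 MiB minimum")
    ranges = []
    offset = 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges
```

For the ~5 GB files in the question, 100 MiB parts would mean roughly 50 upload requests per file, each of which can be retried independently if it fails.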
I am uploading 1.8 GB of data consisting of 500,000 small XML files to an S3 bucket.
When I upload it from my local machine, it takes a very long time (about 7 hours).
When I zip it first and upload the archive, it takes about 5 minutes.
But I can't simply zip it, because I would then need something in AWS to unzip it later.
So is there any way to make this upload faster? The file names are arbitrary, not sequential numbers.
Transfer Acceleration is enabled.
How can I optimize this upload?
You can always upload the zip file to an EC2 instance then unzip it there and sync it to the S3 bucket.
The Instance Role must have permissions to put Objects into S3 for this to work.
I also suggest you look into configuring an S3 VPC Gateway Endpoint before doing this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
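If you do upload the individual files, note that 7 hours for 500,000 tiny objects is mostly per-request overhead, so concurrency usually helps far more than Transfer Acceleration. The AWS CLI already parallelizes `aws s3 sync`, and its concurrency is configurable; a sketch of the same idea with a thread pool, where `upload_one` stands in for a real per-file upload call (e.g. boto3 `put_object`):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_all(keys, upload_one, workers=32):
    """Run upload_one over many keys concurrently; returns the results
    in input order. upload_one is a placeholder for a real S3 upload."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_one, keys))

# The AWS CLI does the same internally; raising its concurrency also helps:
#   aws configure set default.s3.max_concurrent_requests 64
#   aws s3 sync ./xml-files s3://my-bucket/prefix/
```

Thread-based concurrency is a good fit here because each upload spends nearly all its time waiting on the network.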
Please suggest the best process using the AWS CLI, or any alternative to downloading the archive locally with S3 Browser, extracting it, and re-uploading. (After extracting locally, it is a 60 GB file.)
Amazon S3 is purely a storage service. There is no in-built capability to process data (e.g. to unzip a file).
You would need to download the file, unzip it, and upload the result back to S3. This would best be done via an Amazon EC2 instance in the same region. (AWS Lambda's temporary storage in /tmp is 512 MB by default, so it is not an option for a 60 GB file.)
You could look at setting Content-Type and Content-Encoding metadata on the stored .gz files,
and then let a web browser handle the decompression once it receives the compressed content.
But I'm not totally sure what you are after.
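To make that concrete: browsers transparently gunzip a response when it carries `Content-Encoding: gzip`, with `Content-Type` describing the decompressed data. A small sketch of the metadata you would attach at upload time; the boto3 call in the comment is an assumed wiring, not run here:

```python
def gzip_headers(content_type):
    """S3 object metadata for serving gzip-compressed content that
    browsers will decompress transparently on download."""
    return {
        "ContentType": content_type,   # type of the *decompressed* data
        "ContentEncoding": "gzip",     # tells the browser to gunzip it
    }

# Hypothetical boto3 wiring (not run here):
#   s3.put_object(Bucket="my-bucket", Key="data.json",
#                 Body=gzip_bytes, **gzip_headers("application/json"))
```

This only helps clients that honor Content-Encoding; the objects themselves stay compressed in S3.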