AWS CLI - is there a way to extract tar.gz from S3 to home without storing the tar.gz? - amazon-web-services

To elaborate,
There is a tar.gz file on my AWS S3, let's call it example.tar.gz.
So, what I want to do is download the extracted contents of example.tar.gz to /var/home/.
One way to do it is to simply download the tar.gz, extract it, then delete the tar.gz.
However, I don't want to use space downloading the tar.gz file, I just want to download the extracted version or only store the extracted version.
Is this possible?
Thanks!

What you need is the following:
aws s3 cp s3://example-bucket/file.tar.gz - | tar -xz
This will stream the file.tar.gz from s3 and extract it directly (in-memory) to the current directory. No temporary files, no extra storage and no clean up after this one command.
Make sure you write the command exactly as above.

Today I tested with Python Boto 3 and aws cli and I noticed that tar.gz is extracted automatically when the file is downloaded

There isn't currently a way you can do this with S3.
You could create the following script though and just run it whenever you wish to download the tar. Just as long as you have the IAM role / access keys setup.
!#/bin/bash
aws s3 cp s3://$1/$2 $3
tar -xvf $3
rm $3
Then just call the script using ./myScript BUCKET_NAME FILE_LOCATION OUTPUT_FILE

Related

tar folder in s3 bucket?

Let's say I have a folder on s3:
s3://tmp/folder1
With several folders within. I would like this to now be:
s3://tmp/folder1.tar.gz
in which the contents of folder1 have been tar.gz'd. However, from what I can find, the only way to do this would be to:
Either download folder1 to a local directory or cp/mv to an existing ec2 instance,
run tar czv folder1.tar.gz folder1
Reupload to s3://tmp
Is there a way to do this without having to move/download folder1? In other words, is there an amazon cli command / set of commands to do this without the download / moving?
No.
Amazon S3 does not provide the ability to manipulate the contents of objects.
You would need to copy the data somewhere, run the tar command, then upload it.
Think of it like asking a Hard Disk to tar/zip a file without a computer attached. It doesn't know how to do that.

how to use gsutil rsync. login and download bucket contents to a local directory

I have the following questions.
I got access to a cloud bucket to my email id. Now I want to download the whole bucket folder into a local directory on ubuntu. I installed gsutil from pip.
Is the command correct?
gsutil rsync gs://bucket_name .
the command seems generic how do I give my gmail credentials to it? The file is 1TB of size and I am allowed to download only once so I want to get the command right.
The command is correct if you want your current directory to mirror the contents of the bucket (including deleting any files on the right not found on the left). If you merely want to copy, you might want cp -r instead.
Here are the current docs on how to authenticate when running a standalone gsutil. It looks like you just need to run gsutil config.

How do I download files with AWS CLI based on a list?

I'm trying to download a subset of files from a public s3 bucket that contains millions of IRS files. I can download the entire repository with the command:
aws s3 sync s3://irs-form-990/ ./
But it takes way too long!
I know I should be using the --include / --exclude flags, but I don't know how to use them with a list of values. I have a csv that contains unique identifiers for all the files from 2017 that I'd like, but how do I use it in with AWS CLI? The list itself is half a million IDs long.
Help much appreciated. Thank you.
There is a bash script which can read all the filenames from a file filename.txt.
All you have to do is to convert those IDs in filenames.
#!/bin/bash
set -e
while read line
do
aws s3 cp s3://bucket-name/$line dest-path/
done <filename.txt
This question was asked before and the answer you can find it here

Download list of specific files from AWS S3 using CLI

I am trying to download only specific files from AWS. I have the list of file URLs. Using the CLI I can only download all files in a bucket using the --recursive command, but I only want to download the files in my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you have the s3 urls already in a file (say file.list), like -
s3://bucket/file1
s3://bucket/file2
You could download all the files to your current working directory with a simple bash script -
while read -r line;do aws s3 cp "$line" .;done < test.list
People, I found out a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)

delete s3 files from a pipeline AWS

I would like to ask about a processing task I am trying to complete using a data pipeline in AWS, but I have not been able to get it to work.
Basically, I have 2 data nodes representing 2 MySQL databases, where the data is supposed to be extracted from periodically and placed in an S3 bucket. This copy activity is working fine selecting daily every row that has been added, let's say today - 1 day.
However, that bucket containing the collected data as CSVs should become the input for an EMR activity, which will be processing those files and aggregating the information. The problem is that I do not know how to remove or move the already processed files to a different bucket so I do not have to process all the files every day.
To clarify, I am looking for a way to move or remove already processed files in an S3 bucket from a pipeline. Can I do that? Is there any other way I can only process some files in an EMR activity based on a naming convention or something else?
Even better, create a DataPipeline ShellCommandActivity and use the aws command line tools.
Create a script with these two lines:
sudo yum -y upgrade aws-cli
aws s3 rm $1 --recursive
The first line ensures you have the latest aws tools.
The second one removes a directory and all its contents. The $1 is an argument passed to the script.
In your ShellCommandActivity:
"scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
"scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"
The details on how the aws s3 command works are at:
http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
1) Create a script which takes input path and then deletes the files using hadoop fs -rmr s3path.
2) Upload the script to s3
In emr use the prestep -
1) hadoop fs -copyToLocal s3://scriptname .
2) chmod +x scriptname
3) run script
That pretty much it.
Another approach without using EMR is to install s3cmd tool through ShellCommandActivity in a small EC2 instance, then you can use s3cmd in pipeline to operate your S3 repo in whatever way you want.
A tricky part of this approach is to configure s3cmd through a configuration file safely (basically pass access key and secret), as you can't just ssh into the EC2 instance and use 's3cmd --configure' interactively in a pipeline.
To do that, you create a config file in the ShellCommandActivity using 'cat'. For example:
cat <<EOT >> s3.cfg
blah
blah
blah
EOT
Then use '-c' option to attach the config file every time you call s3cmd like this:
s3cmd -c s3.cfg ls
Sounds complicated, but works.