copy data from s3 to local with prefix - amazon-web-services

I am trying to copy data from S3 to local, filtered by prefix, using the AWS CLI.
But I am getting errors with different wildcard patterns.
aws s3 cp s3://my-bucket-name/RAW_TIMESTAMP_0506* . --profile prod
error:
no matches found: s3://my-bucket-name/RAW_TIMESTAMP_0506*

aws s3 cp s3://my-bucket/ <local directory path> --recursive --exclude "*" --include "<prefix>*"
This will copy only the files with the given prefix.

The above answers do not work properly. For example, I have many thousands of files in a directory organized by date, and I wish to retrieve only the files that are needed, so I tried the correct version per the documentation:
aws s3 cp s3://mybucket/sub /my/local/ --recursive --exclude "*" --include "20170906*.png"
and it did not download just the prefixed files, but began to download everything.
So then I tried the sample above:
aws s3 cp s3://mybucket/sub/ /my/local --recursive --include "20170906*"
and it also downloaded everything. It seems that this is an ongoing issue with the AWS CLI, and they have no intention of fixing it. Here are some workarounds that I found while Googling, but they are less than ideal.
https://github.com/aws/aws-cli/issues/1454

Updated: Added --recursive and --exclude
The aws s3 cp command will not accept a wildcard as part of the filename (key). Instead, you must use the --include and --exclude parameters to define filenames.
From: Use of Exclude and Include Filters
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
So, you would use something like:
aws s3 cp --recursive s3://my-bucket-name/ . --exclude "*" --include "RAW_TIMESTAMP_0506*"

If you don't like silent consoles, you can pipe aws s3 ls through awk and back to aws s3 cp.
Example
# url must be the entire prefix that includes folders.
# Ex.: url='s3://my-bucket-name/folderA/folderB',
# not url='s3://my-bucket-name'
url='s3://my-bucket-name/folderA/folderB'
prefix='RAW_TIMESTAMP_0506'
aws s3 ls "$url/$prefix" | awk '{system("aws s3 cp '"$url"'/"$4 " .")}'
Explanation
The ls part is pretty simple. I'm using variables to simplify and shorten the command. Always wrap shell variables in double quotes to prevent disaster.
awk '{print $4}' would extract only the filenames from the ls output (NOT the S3 key! This is why url must be the entire prefix that includes folders.)
awk '{system("echo "$4)}' would do the same thing, but it accomplishes this by calling another command. Note: I did NOT use a subshell $(...), because that would run the entire ls | awk part before starting cp. That would be slow, and it wouldn't print anything for a looong time.
awk '{system("echo aws s3 cp "$4 " .")}' would print commands that are very close to the ones we want. Pay attention to the spacing. If you try to run this, you'll notice something isn't quite right. This would produce commands like aws s3 cp RAW_TIMESTAMP_05060402_whatever.log .
awk '{system("echo aws s3 cp '$url'/"$4 " .")}' is what we're looking for. This adds the path to the filename. Look closely at the quotes. Remember we wrapped the awk parameter in single quotes, so we have to close and reopen the quotes if we want to use a shell variable in that parameter.
awk '{system("aws s3 cp '"$url"'/"$4 " .")}' is the final version. We just remove echo to actually execute the commands created by awk. Of course, I've also surrounded the $url variable with double quotes, because it's good practice.

Related

Delete files from s3 bucket based on names listed in file using cli

I'm trying to delete multiple (like: thousands of) files from an Amazon S3 bucket.
I have the file names listed in a file like so:
name1.jpg
name2.jpg
...
name2020201.jpg
I tried the following solution:
aws s3 rm s3://test-bucket --recursive --exclude "*" --include "data/*.*"
from this question, but --include only takes one argument.
I tried to get hacky and list names like --include "name1.jpg", but this does not work either.
This approach does not work either:
aws s3 rm s3://test-bucket < file.txt
Can you help?
I figured this out with this simple bash script:
#!/bin/bash
set -e
# delete one object per line listed in files.txt
while read -r line; do
  aws s3 rm "s3://test-bucket/$line"
done < files.txt
Inspired by this answer
Answer is: delete one at a time!
The following approach is actually much faster, since my first answer took ages to complete.
My first approach was to delete one object at a time using the rm command. This is not efficient: after around 15 hours (!) it had deleted only around 40,000 records, which was a fifth of the total.
This approach by Norbert Preining is way faster. As he explains, it uses the s3api command delete-objects, which can bulk-delete objects in storage. It takes a JSON object as an argument. To turn the list of file names into the required JSON object, the script uses the JSON processor jq (read more here). It processes 500 records per iteration.
cat file-with-names | while mapfile -t -n 500 ary && ((${#ary[@]})); do
  # build the {"Objects": [{"Key": ...}, ...]} payload from the next 500 lines
  objdef=$(printf '%s\n' "${ary[@]}" | ./jq-win64.exe -nR '{Objects: (reduce inputs as $line ([]; . + [{"Key":$line}]))}')
  aws s3api --no-cli-pager delete-objects --bucket BUCKET --delete "$objdef"
done
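For reference, the payload that jq builds above has roughly this shape, so you can also hand-write a small one to test the call (the bucket and key names below are placeholders):
aws s3api delete-objects --bucket test-bucket --delete '{
  "Objects": [
    {"Key": "name1.jpg"},
    {"Key": "name2.jpg"}
  ]
}'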

cp replace command is changing the content type

I'm executing a cp in a Visual Studio Online release task to change the --cache-control metadata, but it's also changing the content type of the files to text/plain.
Here's the command:
aws s3 cp s3://sourcefolder/ s3://sourcefolder/ --exclude "*" \
  --include "*.js" --include "*.png" --include "*.css" --include "*.jpg" \
  --include "*.gif" --include "*.eot" --include "*.ttf" --include "*.svg" \
  --include "*.woff" --include "*.woff2" --recursive \
  --metadata-directive REPLACE --cache-control max-age=2592000,private
Before I executed this command, my JavaScript files had the correct content type, text/javascript, but afterwards it changed to text/plain. How can I avoid this?
I can't see a way of doing it for your specific use case, mainly because different files need different content-type values, so I don't think the aws s3 cp or aws s3 sync operations will work for you. The problem is the --metadata-directive REPLACE flag, which essentially removes all of the metadata; since you are not providing a content type, it defaults to text/plain. If, on the other hand, you set it to, say, text/javascript, all of the files will get that value in their metadata, which is clearly not right for images and CSS files.
Instead, I shall propose a solution that should work for you. Please try using the latest version of s3cmd, as it has a modify command available, which you could use as follows:
./s3cmd --recursive modify --add-header="Cache-Control:max-age=25920" \
--exclude "*" \
--include ... \
s3://yourbucket/
More about s3cmd usage and available flags -> s3cmd usage
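If s3cmd is not an option, one hedged workaround with the aws cli (my own assumption, not part of the answer above) is to run one copy pass per extension and set the content type explicitly, since --metadata-directive REPLACE discards the stored value:
# one pass per extension; repeat with the matching --content-type for "*.css", "*.png", etc.
aws s3 cp s3://sourcefolder/ s3://sourcefolder/ --recursive \
  --exclude "*" --include "*.js" \
  --metadata-directive REPLACE \
  --content-type "application/javascript" \
  --cache-control "max-age=2592000,private"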

Download list of specific files from AWS S3 using CLI

I am trying to download only specific files from AWS. I have the list of file URLs. Using the CLI, I can only download all the files in a bucket with the --recursive flag, but I only want to download the files in my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you have the S3 URLs already in a file (say file.list), like:
s3://bucket/file1
s3://bucket/file2
You could download all the files to your current working directory with a simple bash loop:
while read -r line; do aws s3 cp "$line" .; done < file.list
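If the list is long, a hedged variant (assuming an xargs that supports -P and one s3:// URL per line in file.list) is to run several copies in parallel:
xargs -P 8 -I {} aws s3 cp {} . < file.list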
People, I found out a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)

How can I use wildcards to `cp` a group of files with the AWS CLI [closed]

I'm having trouble using * in the AWS CLI to select a subset of files from a certain bucket.
Adding * to the path like this does not seem to work
aws s3 cp s3://data/2016-08* .
To download multiple files from an AWS bucket to your current directory, you can use the --recursive, --exclude, and --include flags.
The order of the parameters matters.
Example command:
aws s3 cp s3://data/ . --recursive --exclude "*" --include "2016-08*"
For more info on how to use these filters: http://docs.aws.amazon.com/cli/latest/reference/s3/#use-of-exclude-and-include-filters
The Order of the Parameters Matters
The --exclude and --include filters must be used in a specific order: exclude first, then include. The reverse will not work.
aws s3 cp s3://data/ . --recursive --include "2016-08*" --exclude "*"
This will fail because the order of the parameters matters here: everything matched by the --include is excluded again by the later --exclude "*".
aws s3 cp s3://data/ . --recursive --exclude "*" --include "2016-08*"
This one will work because we first exclude everything and then include the specific prefix.
Okay, I have to say the example is wrong and should be corrected as follows:
aws s3 cp . s3://data/ --recursive --exclude "*" --include "2016-08*" --exclude "*/*"
The . needs to be right after the cp. The final --exclude makes sure that nothing is picked up from any subdirectories pulled in by --recursive (learned that one by mistake...)
This will work for anyone struggling with this by the time they got here.
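For the download direction, the same trailing --exclude trick from the answer above keeps objects under sub-prefixes out of the copy (a sketch, assuming the date-prefixed keys sit at the top level of the bucket):
aws s3 cp s3://data/ . --recursive --exclude "*" --include "2016-08*" --exclude "*/*"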
If there is an error while using '*', you can also use the --recursive, --include, and --exclude flags, like:
aws s3 cp s3://myfiles/ . --recursive --exclude "*" --include "file*"
As others have pointed out, you can do what you want with the --include and --exclude flags. The reason that you cannot simply use
aws s3 cp s3://data/2016-08* .
is that S3 URIs are not POSIX paths and do not resolve POSIX glob characters as Bash or other shells would. Indeed, the * character and many other glob characters are valid characters in an S3 object URI, so your command is asking to copy a file with the literal name s3://data/2016-08*. If there isn't an object with that name, S3 will respond that it doesn't exist.
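Note that prefix filtering (as opposed to glob matching) is supported server-side, so another hedged option is to list the matching keys with the s3api and copy them one by one (the bucket name and prefix below are placeholders):
# list keys that start with the prefix, then copy each one
aws s3api list-objects-v2 --bucket data --prefix "2016-08" \
  --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
  aws s3 cp "s3://data/$key" .
done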

Downloading pattern matched entries from S3 bucket

I have an S3 bucket in which several log files are stored with the format
index.log.yyyy-mm-dd-01
index.log.yyyy-mm-dd-02
.
.
.
yyyy for year, mm for month and dd for day.
Now I want to download only a few of them. I saw Downloading an entire S3 bucket?. The accepted answer of that post works absolutely fine if I want to download the entire bucket, but what should I do if I want to do some pattern matching? I tried the following commands but they didn't work:
aws s3 sync s3://mybucket/index.log.2014-08-01-* .
aws s3 sync 's3://mybucket/index.log.2014-08-01-*' .
I also tried downloading with s3cmd, following point 7 of the http://fosshelp.blogspot.in/2013/06 article and http://s3tools.org/s3cmd-sync. This is the command that I ran:
s3cmd -c myconf.txt get --exclude '*.log.*' --include '*.2014-08-01-*' s3://mybucket/ .
and a few more permutations of it.
Can anyone tell me why pattern matching isn't happening, or whether there is another tool that I need to use?
Thanks!!
Found the solution to the problem, although I don't know why the other commands were not working. The solution is as follows:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"
Note: --exclude "*" should come before the --include filter; doing the reverse won't copy anything, since the exclude would be applied after the include (unable to find the reference now where I read this).
I needed to grab files from an S3 access logs bucket, and I found the official AWS CLI tool to be very slow for that task, so I looked for alternatives.
https://github.com/peak/s5cmd worked great!
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
is blazing fast, so you can work with buckets that hold S3 access logs without much fuss.