Downloading pattern matched entries from S3 bucket

I have an S3 bucket in which several log files are stored, named in the format
index.log.yyyy-mm-dd-01
index.log.yyyy-mm-dd-02
.
.
.
where yyyy is the year, mm the month and dd the day.
Now I want to download only a few of them. I saw Downloading an entire S3 bucket?. The accepted answer of that post works absolutely fine if I want to download the entire bucket, but what should I do if I want to do some pattern matching? I tried the following commands, but they didn't work:
aws s3 sync s3://mybucket/index.log.2014-08-01-* .
aws s3 sync 's3://mybucket/index.log.2014-08-01-*' .
I also tried downloading with s3cmd, following POINT 7 of the http://fosshelp.blogspot.in/2013/06 article and http://s3tools.org/s3cmd-sync. This is the command I ran:
s3cmd -c myconf.txt get --exclude '*.log.*' --include '*.2014-08-01-*' s3://mybucket/ .
and a few more permutations of it.
Can anyone tell me why the pattern matching isn't happening, or whether there is another tool I should use instead?
Thanks !!

Found the solution to the problem, although I don't know why the other commands were not working. The solution is as follows:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"
Note: --exclude "*" should come before the --include filter; in the reverse order nothing is downloaded, because the filters are applied in the order given, so a trailing 'exclude' overrides the earlier 'include' (unable to find the reference now where I read this).
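This ordering rule can be sketched as a tiny simulation in plain POSIX shell (not the AWS CLI itself; the function name apply_filters is made up for illustration): every key starts out included, each filter is applied in sequence, and the last filter whose pattern matches the key wins.

```shell
# Simulate the AWS CLI's --exclude/--include chain for a single key.
# Usage: apply_filters KEY [exclude|include PATTERN]...
apply_filters() {
  key=$1; shift
  state=include                   # the CLI's default: everything is included
  while [ "$#" -ge 2 ]; do
    action=$1; pattern=$2; shift 2
    case $key in
      $pattern) state=$action ;;  # a later matching filter overrides earlier ones
    esac
  done
  printf '%s\n' "$state"
}

# exclude-then-include: the date pattern decides for matching keys
apply_filters 'index.log.2014-08-01-01' exclude '*' include '*.2014-08-01-*'  # prints: include
# include-then-exclude: the trailing '*' excludes everything again
apply_filters 'index.log.2014-08-01-01' include '*.2014-08-01-*' exclude '*'  # prints: exclude
```

This is why the working command excludes everything first and then re-includes the date pattern.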

I needed to grab files from an S3 access logs bucket, and I found the official AWS CLI tool to be very slow for that task, so I looked for alternatives.
https://github.com/peak/s5cmd worked great!
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is really blazing fast, so you can work with buckets that have S3 access logs without much fuss.

Related

Delete files older than 30 days under S3 bucket recursively without deleting folders using PowerShell

I can delete files and exclude folders with following script
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*"
When I tried to add a pipe to delete only the older files, I was unable to. Please help with the script:
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*" | Where-Object {($_.LastModified -lt (Get-Date).AddDays(-31))}
The approach should be to list the files you need, then pipe the results to a delete call (a reverse of what you have). This might be better managed by a full-blown script rather than a one-line shell command. There's an article on this and some examples here.
Going forward, you should let S3 versioning take care of this, then you don't have to manage a script or remember to run it. Note: it'll only work with files that are added after versioning has been enabled.
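As a sketch of that list-then-delete approach using the s3api subcommands (untested against a live bucket; my-bucket is a placeholder, and GNU date is assumed for the cutoff arithmetic):

```shell
#!/bin/sh
# Cutoff 31 days ago, in the ISO-8601 form S3 uses for LastModified (GNU date).
cutoff=$(date -u -d '31 days ago' +%Y-%m-%dT%H:%M:%S)

# The --query filter is applied by the CLI to the ListObjectsV2 response,
# so only keys older than the cutoff come back; each is then deleted.
aws s3api list-objects-v2 \
  --bucket my-bucket \
  --query "Contents[?LastModified<'$cutoff'].Key" \
  --output text |
tr '\t' '\n' |
while IFS= read -r key; do
  [ -n "$key" ] && [ "$key" != "None" ] &&
    aws s3api delete-object --bucket my-bucket --key "$key"
done
```

A one-off script like this is still something you have to remember to run; an S3 lifecycle expiration rule achieves the same result with no scripting at all.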

Download list of specific files from AWS S3 using CLI

I am trying to download only specific files from AWS S3. I have a list of the file URLs. Using the CLI I can only download all the files in a bucket with the --recursive flag, but I want to download only the files on my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you already have the s3 URLs in a file (say file.list), like -
s3://bucket/file1
s3://bucket/file2
you could download all the files to your current working directory with a simple bash loop -
while read -r line; do aws s3 cp "$line" .; done < file.list
People, I found out a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)
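A slightly more defensive version of that loop (a sketch; the file name file.list and the DRY_RUN switch are my own names), which skips the blank lines that the warning above is about:

```shell
#!/bin/sh
# Download every S3 URL listed in file.list, ignoring blank lines
# (a trailing empty line would otherwise produce `aws s3 cp "" .`).
# Set DRY_RUN=1 to print the commands instead of running them.
download() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo aws s3 cp "$1" .
  else
    aws s3 cp "$1" .
  fi
}

grep -v '^[[:space:]]*$' file.list | while IFS= read -r url; do
  download "$url"
done
```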

copy data from s3 to local with prefix

I am trying to copy data from S3 to local with a prefix using the AWS CLI,
but I am getting an error no matter which wildcard pattern I try.
aws s3 cp s3://my-bucket-name/RAW_TIMESTAMP_0506* . --profile prod
error:
no matches found: s3://my-bucket-name/RAW_TIMESTAMP_0506*
aws s3 cp s3://my-bucket/ <local directory path> --recursive --exclude "*" --include "<prefix>*"
This will copy only the files with the given prefix.
The above answers do not work properly... for example, I have many thousands of files in a directory by date, and I wish to retrieve only the files that are needed. So I tried the correct version per the documents:
aws s3 cp s3://mybucket/sub /my/local/ --recursive --exclude "*" --include "20170906*.png"
and it did not download the prefixed files, but began to download everything,
so then I tried the sample above:
aws s3 cp s3://mybucket/sub/ . /my/local --recursive --include "20170906*"
and it also downloaded everything... It seems to be an ongoing issue with the AWS CLI, and they have no intention to fix it... Here are some workarounds that I found while Googling, but they are less than ideal.
https://github.com/aws/aws-cli/issues/1454
Updated: Added --recursive and --exclude
The aws s3 cp command will not accept a wildcard as part of the filename (key). Instead, you must use the --include and --exclude parameters to define filenames.
From: Use of Exclude and Include Filters
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
So, you would use something like:
aws s3 cp --recursive s3://my-bucket-name/ . --exclude "*" --include "RAW_TIMESTAMP_0506*"
If you don't like silent consoles, you can pipe aws s3 ls through awk and back to aws s3 cp.
Example
# url must be the entire prefix that includes folders.
# Ex.: url='s3://my-bucket-name/folderA/folderB',
# not url='s3://my-bucket-name'
url='s3://my-bucket-name/folderA/folderB'
prefix='RAW_TIMESTAMP_0506'
aws s3 ls "$url/$prefix" | awk '{system("aws s3 cp '"$url"'/"$4 " .")}'
Explanation
The ls part is pretty simple. I'm using variables to simplify and shorten the command. Always wrap shell variables in double quotes to prevent disaster.
awk '{print $4}' would extract only the filenames from the ls output (NOT the S3 key! This is why url must be the entire prefix that includes the folders.)
awk '{system("echo " $4)}' would do the same thing, but it accomplishes this by calling another command. Note: I did NOT use a subshell $(...), because that would run the entire ls | awk part before starting cp. That would be slow, and it wouldn't print anything for a looong time.
awk '{system("echo aws s3 cp "$4 " .")}' would print commands that are very close to the ones we want. Pay attention to the spacing. If you try to run this, you'll notice something isn't quite right. This would produce commands like aws s3 cp RAW_TIMESTAMP_05060402_whatever.log .
awk '{system("echo aws s3 cp '$url'/"$4 " .")}' is what we're looking for. This adds the path to the filename. Look closely at the quotes. Remember we wrapped the awk parameter in single quotes, so we have to close and reopen the quotes if we want to use a shell variable in that parameter.
awk '{system("aws s3 cp '"$url"'/"$4 " .")}' is the final version. We just remove echo to actually execute the commands created by awk. Of course, I've also surrounded the $url variable with double quotes, because it's good practice.
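For comparison, the same job can be done without the nested quoting by piping the listing into a plain while read loop instead of awk's system() (same assumptions as above: $url is the entire prefix that includes the folders):

```shell
url='s3://my-bucket-name/folderA/folderB'
prefix='RAW_TIMESTAMP_0506'

# `aws s3 ls` prints "date time size filename"; read the first three fields
# into throwaway variables and keep everything after them as the filename,
# so names containing spaces survive intact (awk's $4 would truncate them).
aws s3 ls "$url/$prefix" | while read -r _date _time _size name; do
  aws s3 cp "$url/$name" .
done
```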

How can I use wildcards to `cp` a group of files with the AWS CLI [closed]

I'm having trouble using * in the AWS CLI to select a subset of files from a certain bucket.
Adding * to the path like this does not seem to work
aws s3 cp s3://data/2016-08* .
To download multiple files from an aws bucket to your current directory, you can use recursive, exclude, and include flags.
The order of the parameters matters.
Example command:
aws s3 cp s3://data/ . --recursive --exclude "*" --include "2016-08*"
For more info on how to use these filters: http://docs.aws.amazon.com/cli/latest/reference/s3/#use-of-exclude-and-include-filters
The Order of the Parameters Matters
The exclude and include filters must be used in a specific order: we have to exclude first and then include. The reverse will not be successful.
aws s3 cp s3://data/ . --recursive --include "2016-08*" --exclude "*"
This will fail, because the order of the parameters matters in this case: the include is overridden by the later --exclude "*".
aws s3 cp s3://data/ . --recursive --exclude "*" --include "2016-08*"
This one will work, because we first excluded everything and then included the specific prefix.
Okay, I have to say the example above is wrong and should be corrected as follows:
aws s3 cp . s3://data/ --recursive --exclude "*" --include "2016-08*" --exclude "*/*"
The . needs to come right after the cp. The final --exclude is to make sure that nothing is picked up from any subdirectories that --recursive would otherwise pick up (learned that one by mistake...).
This will work for anyone struggling with this by the time they got here.
If you get an error while using '*', you can instead use the recursive, include, and exclude flags, like:
aws s3 cp s3://myfiles/ . --recursive --exclude "*" --include "file*"
As others have pointed out, you can do what you want with the --include and --exclude flags. The reason that you cannot simply use
aws s3 cp s3://data/2016-08* .
is that S3 URIs are not POSIX paths and do not resolve POSIX glob characters as Bash or other shells would. Indeed, the * character and many other glob characters are valid characters in an S3 object URI, so your command is asking to copy a file with the literal name s3://data/2016-08*. If there isn't an object with that name, then S3 will respond that it doesn't exist.
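It's also worth noting that --exclude/--include filtering happens client-side: the CLI still lists every key under the path and filters locally. When your pattern is a pure prefix like 2016-08, you can let S3 filter server-side instead, since the ListObjectsV2 API accepts a key prefix. A sketch (using the bucket name data from the question):

```shell
#!/bin/sh
# ListObjectsV2's --prefix is applied by S3 itself, so only keys starting
# with 2016-08 are ever returned; each one is then copied individually.
aws s3api list-objects-v2 --bucket data --prefix '2016-08' \
  --query 'Contents[].Key' --output text |
tr '\t' '\n' |
while IFS= read -r key; do
  [ "$key" != "None" ] && aws s3 cp "s3://data/$key" .
done
```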

filtered results with s3cmd get

I'm using a command-line tool for AWS S3 called s3cmd. I'm trying to 'get' a set of folders filtered by the front part of the directory name, like '/192.168.*/'. Basically, I have an S3 bucket with a lot of directories and I just need the couple of them that start with a particular string. Here is what I have so far. I will be grateful for any kind of help :) Thank you!
s3cmd get --recursive --include '192.168*' s3://mys3bucket/logfiles/
The code above pulls down all the directories from /logfiles/. :(
s3cmd get --recursive s3://mys3bucket/logfiles/192.168
If you want to "filter" for files that match a certain name, you can do this:
s3cmd get --recursive --exclude '*' --include 'filename*' s3://my-bucket/cheese