filtered results with s3cmd get - regex

I'm using the command-line tool for AWS S3 called s3cmd. I'm trying to 'get' a set of folders filtered by the front part of the directory name, like '/192.168.*/'. Basically I have an S3 bucket with a lot of directories and I just need a couple of them that start with a particular string. Here is what I have so far. I'd be grateful for any kind of help :) Thank you!
s3cmd get --recursive --include '192.168*' s3://mys3bucket/logfiles/
The code above pulls down all the directories from /logfiles/. :(

Try giving the prefix directly as part of the source path:
s3cmd get --recursive s3://mys3bucket/logfiles/192.168

If you want to "filter" for files that match a certain name, you can do this:
s3cmd get --recursive --exclude '*' --include 'filename*' s3://my-bucket/cheese
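Applied to the bucket from the question, that would look something like this (untested sketch): the earlier attempt pulled everything because --include only re-adds names that an --exclude rule has already filtered out, so the --exclude '*' has to come first.
s3cmd get --recursive --exclude '*' --include '192.168*' s3://mys3bucket/logfiles/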

Delete files older than 30 days under S3 bucket recursively without deleting folders using PowerShell

I can delete files and exclude folders with the following script:
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*"
When I tried to add a pipe to delete only older files, I was unable to. Please help with the script.
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*" | Where-Object {($_.LastModified -lt (Get-Date).AddDays(-31))}
The approach should be to list the files you need, then pipe the results to a delete call (the reverse of what you have). This might be better managed by a full-blown script rather than a one-line shell command. There's an article on this and some examples here.
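A rough shell sketch of that list-then-delete approach (the same idea ports to PowerShell); the bucket name and the 30-day cutoff are placeholders, and the loop skips folder placeholder keys so only files are removed. Prefix the rm with echo first to dry-run it:
# List keys older than the cutoff, then delete them one by one.
# Uses GNU date; adjust the date invocation on macOS/BSD.
cutoff=$(date -u -d '-30 days' +%Y-%m-%dT%H:%M:%S)
aws s3api list-objects-v2 \
  --bucket my-bucket \
  --query "Contents[?LastModified<='${cutoff}'].Key" \
  --output text | tr '\t' '\n' |
while read -r key; do
  case "$key" in ''|*/) continue ;; esac   # skip blanks and folder markers
  aws s3 rm "s3://my-bucket/${key}"
done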
Going forward, you should let S3 versioning take care of this, then you don't have to manage a script or remember to run it. Note: it'll only work with files that are added after versioning has been enabled.

cp replace command is changing the content type

I'm executing a cp in a Visual Studio Online release task to change the --cache-control metadata, but it's also changing the content-type of the files to text/plain.
Here's the command:
aws s3 cp s3://sourcefolder/ s3://sourcefolder/ --exclude "*" \
  --include "*.js" --include "*.png" --include "*.css" --include "*.jpg" \
  --include "*.gif" --include "*.eot" --include "*.ttf" --include "*.svg" \
  --include "*.woff" --include "*.woff2" \
  --recursive --metadata-directive REPLACE --cache-control max-age=2592000,private
Before I executed this command, my JavaScript files had the correct content type, text/javascript, but after running it the content type changes to text/plain. How can I avoid this?
I can't see a way of doing it for your specific use case. This is mainly because the different files have different content-type values, so I don't think the aws s3 cp or aws s3 sync operations would work for you. The behavior is caused by the --metadata-directive REPLACE flag, which essentially removes all of the metadata, and since you are not providing a content-type it defaults to text/plain. However, if you set it to, say, text/javascript, all the files will have that in their metadata, which is clearly not right for images and CSS files.
However, I shall propose a solution that should work for you. Please try using the latest version of s3cmd, as it has a modify command available, which you could use as follows:
./s3cmd --recursive modify --add-header="Cache-Control:max-age=25920" \
--exclude "*" \
--include ... \
s3://yourbucket/
More about s3cmd usage and available flags -> s3cmd usage
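For example, with the include list from the question filled in (just a sketch; the bucket name and the max-age value are carried over from the question and would need adjusting for your setup):
s3cmd --recursive modify \
  --add-header="Cache-Control:max-age=2592000,private" \
  --exclude "*" \
  --include "*.js" --include "*.css" --include "*.png" --include "*.jpg" \
  --include "*.gif" --include "*.svg" --include "*.eot" --include "*.ttf" \
  --include "*.woff" --include "*.woff2" \
  s3://sourcefolder/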

Download list of specific files from AWS S3 using CLI

I am trying to download only specific files from AWS. I have the list of file URLs. Using the CLI I can only download all files in a bucket with the --recursive flag, but I only want to download the files in my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you already have the S3 URLs in a file (say file.list), like:
s3://bucket/file1
s3://bucket/file2
You could download all the files to your current working directory with a simple bash script:
while read -r line; do aws s3 cp "$line" .; done < file.list
People, I found out a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)
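If you would rather guard against that trailing empty line than edit the file, a slightly more defensive version of the loop above might be (assuming the list lives in file.list):
while read -r line; do
  [ -z "$line" ] && continue   # skip blank lines so aws s3 cp never gets an empty URL
  aws s3 cp "$line" .
done < file.list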

copy data from s3 to local with prefix

I am trying to copy data from S3 to local with a prefix using the aws-cli.
But I am getting an error with every regex I have tried.
aws s3 cp s3://my-bucket-name/RAW_TIMESTAMP_0506* . --profile prod
error:
no matches found: s3://my-bucket-name/RAW_TIMESTAMP_0506*
aws s3 cp s3://my-bucket/ <local directory path> --recursive --exclude "*" --include "<prefix>*"
This will copy only the files with the given prefix.
The above answers do not work properly. For example, I have many thousands of files in a directory organized by date, and I wish to retrieve only the files that are needed, so I tried the correct version per the documents:
aws s3 cp s3://mybucket/sub /my/local/ --recursive --exclude "*" --include "20170906*.png"
and it did not download the prefixed files, but began to download everything
so then I tried the sample above:
aws s3 cp s3://mybucket/sub/ /my/local --recursive --include "20170906*"
and it also downloaded everything... It seems that this is an ongoing issue with aws cli, and they have no intention to fix it... Here are some workarounds that I found while Googling, but they are less than ideal.
https://github.com/aws/aws-cli/issues/1454
Updated: Added --recursive and --exclude
The aws s3 cp command will not accept a wildcard as part of the filename (key). Instead, you must use the --include and --exclude parameters to define filenames.
From: Use of Exclude and Include Filters
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
So, you would use something like:
aws s3 cp --recursive s3://my-bucket-name/ . --exclude "*" --include "RAW_TIMESTAMP_0506*"
If you don't like silent consoles, you can pipe aws ls through awk and back to aws cp.
Example
# url must be the entire prefix that includes folders.
# Ex.: url='s3://my-bucket-name/folderA/folderB',
# not url='s3://my-bucket-name'
url='s3://my-bucket-name/folderA/folderB'
prefix='RAW_TIMESTAMP_0506'
aws s3 ls "$url/$prefix" | awk '{system("aws s3 cp '"$url"'/"$4 " .")}'
Explanation
The ls part is pretty simple. I'm using variables to simplify and shorten the command. Always wrap shell variables in double quotes to prevent disaster.
awk '{print $4}' would extract only the filenames from the ls output (NOT the S3 Key! This is why url must be the entire prefix that includes folders.)
awk '{system("echo " $4)}' would do the same thing, but it accomplishes this by calling another command. Note: I did NOT use a subshell $(...), because that would run the entire ls | awk part before starting cp. That would be slow, and it wouldn't print anything for a looong time.
awk '{system("echo aws s3 cp "$4 " .")}' would print commands that are very close to the ones we want. Pay attention to the spacing. If you try to run this, you'll notice something isn't quite right. This would produce commands like aws s3 cp RAW_TIMESTAMP_05060402_whatever.log .
awk '{system("echo aws s3 cp '$url'/"$4 " .")}' is what we're looking for. This adds the path to the filename. Look closely at the quotes. Remember we wrapped the awk parameter in single quotes, so we have to close and reopen the quotes if we want to use a shell variable in that parameter.
awk '{system("aws s3 cp '"$url"'/"$4 " .")}' is the final version. We just remove echo to actually execute the commands created by awk. Of course, I've also surrounded the $url variable with double quotes, because it's good practice.

Downloading pattern matched entries from S3 bucket

I have an S3 bucket in which there are several log files stored, having the format
index.log.yyyy-mm-dd-01
index.log.yyyy-mm-dd-02
.
.
.
yyyy for year, mm for month and dd for day.
Now I want to download only a few of them. I saw Downloading an entire S3 bucket?. The accepted answer of that post works absolutely fine if I want to download the entire bucket, but what should I do if I want to do some pattern matching? I tried the following commands but they didn't work:
aws s3 sync s3://mybucket/index.log.2014-08-01-* .
aws s3 sync 's3://mybucket/index.log.2014-08-01-*' .
I also tried using s3cmd for the download, following POINT 7 of the article at http://fosshelp.blogspot.in/2013/06 and http://s3tools.org/s3cmd-sync. This is the command that I ran:
s3cmd -c myconf.txt get --exclude '*.log.*' --include '*.2014-08-01-*' s3://mybucket/ .
and a few more permutations of this.
Can anyone tell me why the pattern matching isn't happening, or if there is any other tool that I need to use?
Thanks !!
Found the solution to the problem, although I don't know why the other commands were not working. The solution is as follows:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"
Note: --exclude "*" should come before --include "..."; doing the reverse won't download anything, since the later filter takes precedence and the trailing --exclude "*" ends up excluding everything (I can't find the reference where I read this now).
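To make the ordering concrete, the first command below downloads only the matching logs, while the second, with the filters reversed, matches nothing because the trailing --exclude "*" takes precedence:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"   # downloads the 2014-08-01 logs
aws s3 sync s3://mybucket . --include "*.2014-08-01-*" --exclude "*"   # downloads nothing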
I needed to grab files from an S3 access logs bucket, and I found the official AWS CLI tool to be very slow for that task, so I looked for alternatives.
https://github.com/peak/s5cmd worked great!
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is really blazing fast, so you can work with buckets that have S3 access logs without much fuss.