I am trying to download only specific files from AWS S3. I have the list of file URLs. With the CLI I can only download all files in a bucket using the --recursive flag, but I only want to download the files in my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you already have the S3 URLs in a file (say file.list), like:
s3://bucket/file1
s3://bucket/file2
You could download all the files to your current working directory with a simple bash one-liner:
while read -r line; do aws s3 cp "$line" .; done < file.list
People, I found a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)
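Relatedly, a hedged guard against that trailing-empty-line problem is to filter blank lines as you read the list (assuming the same file.list as above):
# Skip blank lines so a trailing newline in file.list can't produce an
# invalid `aws s3 cp` call.
grep -v '^[[:space:]]*$' file.list | while read -r line; do aws s3 cp "$line" .; done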
Related
I have a list of files in a bucket in AWS S3, but when I execute the aws s3 cp command it gives me an error saying "Unknown options".
My list:
s3://<bucket>/cms/imagepool/5f84dc7234bf5.jpg
s3://<bucket>/cms/imagepool/5f84daa19b7df.jpg
s3://<bucket>/cms/imagepool/5f84dcb12f9c5.jpg
s3://<bucket>/cms/imagepool/5f84dcbf25d4e.jpg
My bash script is below:
#!/bin/bash
while read line
do
aws s3 cp "${line}" ./
done <../links.txt
This is the error I get:
Unknown options: s3:///cms/imagepool/5f84daa19b7df.jpg
Does anybody know how to solve this issue?
Turns out the solution below worked (I had to include the --no-cli-auto-prompt flag):
#!/bin/bash
while read line
do
aws s3 cp --no-cli-auto-prompt "${line}" ./
done <../links.txt
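If the links file was ever edited on Windows, stray carriage returns can trigger the same "Unknown options" error with otherwise valid URLs. A hedged variant of the loop that strips them (and uses read -r) looks like this:
#!/bin/bash
# Same loop, but strip a trailing carriage return (Windows line endings),
# which can also make the CLI reject an otherwise valid URL.
while read -r line
do
    line="${line%$'\r'}"
    aws s3 cp --no-cli-auto-prompt "${line}" ./
done <../links.txt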
I can delete files and exclude folders with the following script:
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*"
When I tried to add a pipe to delete only older files, I was unable to. Please help with the script.
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*" | Where-Object {($_.LastModified -lt (Get-Date).AddDays(-31))}
The approach should be to list the files you need, then pipe the results to a delete call (a reverse of what you have). This might be better managed by a full blown script rather than a one line shell command. There's an article on this and some examples here.
Going forward, you should let S3 versioning take care of this, then you don't have to manage a script or remember to run it. Note: it'll only work with files that are added after versioning has been enabled.
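If you do script it, here is a minimal bash sketch of the list-then-delete approach, assuming a bucket named my-bucket and GNU date; the --query filter compares LastModified lexically as an ISO-8601 string, which the CLI's JMESPath accepts:
#!/bin/bash
# List keys with LastModified older than the cutoff, then delete each one.
# ("None" is what --output text prints when the query matches nothing.)
cutoff=$(date -d "-31 days" +%Y-%m-%dT%H:%M:%S)
aws s3api list-objects-v2 \
    --bucket my-bucket \
    --query "Contents[?LastModified<'${cutoff}'].Key" \
    --output text |
tr '\t' '\n' | grep -v '^None$' |
while read -r key; do
    aws s3 rm "s3://my-bucket/${key}"
done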
I'm trying to download a subset of files from a public s3 bucket that contains millions of IRS files. I can download the entire repository with the command:
aws s3 sync s3://irs-form-990/ ./
But it takes way too long!
I know I should be using the --include / --exclude flags, but I don't know how to use them with a list of values. I have a CSV that contains unique identifiers for all the files from 2017 that I'd like, but how do I use it with the AWS CLI? The list itself is half a million IDs long.
Help much appreciated. Thank you.
Here is a bash script which can read all the filenames from a file filename.txt.
All you have to do is convert those IDs into filenames.
#!/bin/bash
set -e
# Read one key per line and copy the corresponding object.
while read -r line
do
    aws s3 cp "s3://bucket-name/${line}" dest-path/
done <filename.txt
This question was asked before, and you can find the answer here.
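With half a million IDs, one sequential aws s3 cp call per file will take forever. A hedged tweak, assuming the same filename.txt with one key per line, is to parallelize the loop with xargs:
# Run up to 10 transfers concurrently; {} is replaced by each line of the file.
xargs -P 10 -I {} aws s3 cp "s3://bucket-name/{}" dest-path/ <filename.txt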
I am trying to move S3 bucket files from one folder to an archive folder in the same S3 bucket, using the mv command. While moving, I want to exclude the files already in the archive folder.
I am using the following command:
aws s3 mv s3://mybucket/incoming/ s3://mybucket/incoming/archive/ --recursive --exclude "incoming/archive/*" --include "*.csv"
This command moves the files, but when run multiple times it also creates nested archive folders:
1st run: files moved from /mybucket/incoming/ to /mybucket/incoming/archive/
2nd run: new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/
3rd run: new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/archive/
4th run: new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/archive/archive/
Can someone suggest/advise what exactly I am doing wrong here?
Use:
aws s3 mv s3://bucket/incoming/ s3://bucket/incoming/archive/ --recursive --include "*.csv" --exclude "archive/*"
The order of include/exclude is important, and the references are relative to the path given.
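A simple way to sanity-check the filters before anything actually moves is the --dryrun flag, which only prints the operations that would be performed:
# Preview the move; "archive/*" resolves relative to the source path,
# i.e. it matches s3://bucket/incoming/archive/* and skips archived files.
aws s3 mv s3://bucket/incoming/ s3://bucket/incoming/archive/ \
    --recursive --include "*.csv" --exclude "archive/*" --dryrun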
I have an S3 bucket in which several log files are stored, with the format
index.log.yyyy-mm-dd-01
index.log.yyyy-mm-dd-02
.
.
.
where yyyy is the year, mm the month and dd the day.
Now I want to download only a few of them. I saw Downloading an entire S3 bucket?. The accepted answer of that post works absolutely fine if I want to download the entire bucket, but what should I do if I want to do some pattern matching? I tried the following commands but they didn't work:
aws s3 sync s3://mybucket/index.log.2014-08-01-* .
aws s3 sync 's3://mybucket/index.log.2014-08-01-*' .
I also tried downloading with s3cmd, following point 7 of the article at http://fosshelp.blogspot.in/2013/06 and http://s3tools.org/s3cmd-sync. This is the command I ran:
s3cmd -c myconf.txt get --exclude '*.log.*' --include '*.2014-08-01-*' s3://mybucket/ .
and a few more permutations of this.
Can anyone tell me why pattern matching isn't happening? Or is there any other tool that I need to use?
Thanks!
Found the solution to the problem, although I don't know why the other commands were not working. The solution is as follows:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"
Note: --exclude "*" should come before --include "*.2014-08-01-*"; doing the reverse won't download anything, since the 'exclude' would then be applied after the 'include' (unable to find the reference now where I read this).
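To illustrate (filters are applied in the order given, and a later filter overrides an earlier one):
# Wrong order: the trailing --exclude "*" overrides the include, so nothing matches.
aws s3 sync s3://mybucket . --include "*.2014-08-01-*" --exclude "*"

# Correct order: exclude everything first, then re-include the wanted pattern.
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"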
I needed to grab files from an S3 access logs bucket, and I found the official AWS CLI tool to be very slow for that task. So I looked for alternatives.
https://github.com/peak/s5cmd worked great!
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is blazing fast, so you can work with buckets that have S3 access logs without much fuss.
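s5cmd also has a run subcommand that reads one command per line, which maps nicely onto the download-from-a-list use case above. A minimal sketch, assuming you generate a cmds.txt from your URL list (the keys below are hypothetical):
# cmds.txt contains one s5cmd command per line, e.g.:
#   cp s3://logs-bucket/file1 .
#   cp s3://logs-bucket/file2 .
s5cmd run cmds.txt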