aws s3 cp clobbers files? - amazon-web-services

Um, not quite sure what to make of this.
I am trying to download 50 files from S3 to an EC2 machine.
I ran:
for i in `seq -f "%05g" 51 101`; do (aws s3 cp ${S3_DIR}part-${i}.gz . &); done
A few minutes later, I checked with pgrep -f aws and found 50 processes running. Moreover, all the files were created and had started to download (they are large files, so they were expected to take a while).
At the end, however, I got only a subset of files:
$ ls
part-00051.gz part-00055.gz part-00058.gz part-00068.gz part-00070.gz part-00074.gz part-00078.gz part-00081.gz part-00087.gz part-00091.gz part-00097.gz part-00099.gz part-00101.gz
part-00054.gz part-00056.gz part-00066.gz part-00069.gz part-00071.gz part-00075.gz part-00080.gz part-00084.gz part-00089.gz part-00096.gz part-00098.gz part-00100.gz
Where is the rest??
I did not see any errors, but I did see lines like this for successfully completed files (and those are exactly the files shown in the ls output above):
download: s3://my/path/part-00075.gz to ./part-00075.gz

If you are copying many objects to/from S3, you might try the --recursive option to instruct aws-cli to copy multiple objects:
aws s3 cp s3://bucket-name/ . --recursive --exclude "*" --include "part-*.gz"
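If you prefer to keep per-file cp calls, another option is to bound the concurrency instead of backgrounding ~50 shells at once. A minimal sketch, assuming S3_DIR is set as in the question (e.g. S3_DIR=s3://my/path/):
seq -f "%05g" 51 101 | xargs -P 8 -I {} aws s3 cp "${S3_DIR}part-{}.gz" .
Here xargs runs at most 8 downloads in parallel and substitutes each generated suffix for {}.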

Related

Delete files older than 30 days under S3 bucket recursively without deleting folders using PowerShell

I can delete files and exclude folders with following script
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*"
When I tried to add a pipe to delete only older files, I was unable to. Please help with the script.
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*" | Where-Object {($_.LastModified -lt (Get-Date).AddDays(-31))}
The approach should be to list the files you need, then pipe the results to a delete call (the reverse of what you have). This might be better managed by a full-blown script rather than a one-line shell command. There's an article on this and some examples here.
Going forward, you should let S3 versioning take care of this, then you don't have to manage a script or remember to run it. Note: it'll only work with files that are added after versioning has been enabled.
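If you do go the script route, here is a rough bash/AWS CLI sketch of the "list first, then delete" idea (not PowerShell; my-bucket is a placeholder and GNU date is assumed):
CUTOFF=$(date -d "-31 days" +%Y-%m-%d)
aws s3api list-objects-v2 --bucket my-bucket --output text \
  --query "Contents[?LastModified<'${CUTOFF}'].Key" \
  | tr '\t' '\n' \
  | while read -r key; do
      # text output prints "None" when nothing matches, so guard against it
      [ -n "$key" ] && [ "$key" != "None" ] && aws s3 rm "s3://my-bucket/${key}"
    done
The string comparison on LastModified works because the timestamps are ISO-8601, so lexical order matches chronological order.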

AWS S3 CLI creating multiple directories when using mv command

I am trying to move S3 bucket files from one folder to an archive folder in the same S3 bucket, and I am using the mv command to do this. While moving, I want to exclude files that are already in the archive folder.
I am using the following command
aws s3 mv s3://mybucket/incoming/ s3://mybucket/incoming/archive/ --recursive --exclude "incoming/archive/" --include "*.csv"
but this command moves the files and also creates nested hierarchical archive folders when run multiple times, so:
1st run - files moved from /mybucket/incoming/ to /mybucket/incoming/archive/
2nd run - new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/
3rd run - new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/archive/
4th run - new files moved from /mybucket/incoming/ to /mybucket/incoming/archive/archive/archive/archive/
Can someone suggest/advise what exactly I am doing wrong here?
Use:
aws s3 mv s3://bucket/incoming/ s3://bucket/incoming/archive/ --recursive --include "*.csv" --exclude "archive/*"
The order of include/exclude is important, and the references are relative to the path given.
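To make the ordering concrete, here is an illustrative variant (not the command above verbatim): because the last matching filter wins, excluding everything, re-including *.csv, and then excluding archive/* keeps already-archived files out of later runs:
aws s3 mv s3://bucket/incoming/ s3://bucket/incoming/archive/ \
  --recursive \
  --exclude "*" \
  --include "*.csv" \
  --exclude "archive/*"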

Gsutil creates 'tmp' files in the buckets

We are using automation scripts to upload thousands of files from MAPR HDFS to GCP storage. Sometimes files in the main bucket appear with a tmp~!# suffix, which causes failures in our pipeline.
Example:
gs://some_path/.pre-processing/file_name.gz.tmp~!#
We are using rsync -m and, in certain cases, cp -I:
cat some_file | gsutil -m cp -I '{GCP_DESTINATION}'
gsutil -m rsync {MAPR_SOURCE} '{GCP_DESTINATION}'
It's possible that a copy attempt failed and was retried later from a different machine; eventually we end up with both the file and another one with the tmp~!# suffix.
I'd like to get rid of these files without actively looking for them.
We are on gsutil 4.33; appreciate any leads. Thanks.

Download list of specific files from AWS S3 using CLI

I am trying to download only specific files from AWS. I have the list of file URLs. Using the CLI I can only download all files in a bucket using the --recursive command, but I only want to download the files in my list. Any ideas on how to do that?
This is possibly a duplicate of:
Selective file download in AWS S3 CLI
You can do something along the lines of:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2018-02-06*" --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Since you already have the S3 URLs in a file (say file.list), like:
s3://bucket/file1
s3://bucket/file2
You could download all the files to your current working directory with a simple bash script -
while read -r line;do aws s3 cp "$line" .;done < file.list
People, I found out a quicker way to do it: https://stackoverflow.com/a/69018735
WARNING: "Please make sure you don't have an empty line at the end of your text file".
It worked here! :-)
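If you stay with the read loop from the earlier answer, a small guard sidesteps that empty-line pitfall; a minimal sketch:
while read -r line; do
  # skip blank lines, so a trailing empty line in file.list is harmless
  [ -n "$line" ] && aws s3 cp "$line" .
done < file.list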

Downloading pattern matched entries from S3 bucket

I have a S3 bucket in which there are several log files stored having the format
index.log.yyyy-mm-dd-01
index.log.yyyy-mm-dd-02
.
.
.
yyyy for year, mm for month and dd for date.
Now I want to download only a few of them. I saw Downloading an entire S3 bucket?. The accepted answer of that post works absolutely fine if I want to download the entire bucket, but what should I do if I want to do some pattern matching? I tried the following commands, but they didn't work:
aws s3 sync s3://mybucket/index.log.2014-08-01-* .
aws s3 sync 's3://mybucket/index.log.2014-08-01-*' .
I also tried downloading with s3cmd, using POINT 7 of the http://fosshelp.blogspot.in/2013/06 article and http://s3tools.org/s3cmd-sync. These were the commands that I ran:
s3cmd -c myconf.txt get --exclude '*.log.*' --include '*.2014-08-01-*' s3://mybucket/ .
and a few more permutations of this.
Can anyone tell me why pattern matching isn't happening? Or is there another tool that I need to use?
Thanks!!
Found the solution to the problem, although I don't know why the other commands were not working. The solution is as follows:
aws s3 sync s3://mybucket . --exclude "*" --include "*.2014-08-01-*"
Note: --exclude "*" should come before --include "*.2014-08-01-*"; doing the reverse won't download anything, since the later exclude would override the earlier include (unable to find the reference now where I read this).
I needed to grab files from an S3 access logs bucket, and I found the official AWS CLI tool to be really slow for that task, so I looked for alternatives.
https://github.com/peak/s5cmd worked great!
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is really blazing fast, so you can work with buckets that have S3 access logs without much fuss.