Downloading the latest file in an S3 bucket using AWS CLI? [duplicate] - amazon-web-services

This question already has answers here:
Get last modified object from S3 using AWS CLI
(5 answers)
Closed 2 years ago.
I have an S3 bucket that contains database backups. I am creating a script to download the latest backup (and eventually restore it somewhere else), but I'm not sure how to go about only grabbing the most recent file from a bucket.
Is it possible to copy only the most recent file from an S3 bucket to a local directory using AWS CLI tools?

And here is a bash script created based on error2007s's answer. It takes your AWS profile and bucket name as variables, and downloads the latest object to your ~/Downloads folder:
#!/bin/sh
PROFILE=your_profile
BUCKET=your_bucket
OBJECT="$(aws s3 ls --profile "$PROFILE" "$BUCKET" --recursive | sort | tail -n 1 | awk '{print $4}')"
aws s3 cp "s3://$BUCKET/$OBJECT" ~/Downloads/"$OBJECT" --profile "$PROFILE"

FILE=$(aws s3api list-objects-v2 --bucket "$BUCKET_NAME" --query "reverse(sort_by(Contents[?contains(Key, '$FILE_NAME_FILTER')], &LastModified))[:1].Key" --output=text); aws s3 cp "s3://$BUCKET_NAME/$FILE" .
$BUCKET_NAME - the bucket you want to download from.
$FILE_NAME_FILTER - a string used to filter the key names you want to match.
aws s3 cp " " - the source path is double-quoted so that keys containing spaces are handled correctly.
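The quoting is the easy thing to get wrong here: the JMESPath query must be in double quotes for $FILE_NAME_FILTER to be expanded by the shell, and JMESPath raw strings can use single quotes so no backticks are needed. A minimal offline illustration (the filter value is made up):

```shell
# Made-up filter value, purely to illustrate the quoting pitfall:
FILE_NAME_FILTER="backup"
# Single quotes suppress expansion: the CLI would see the literal text.
single='contains(Key, `$FILE_NAME_FILTER`)'
# Double quotes expand the variable before the CLI sees the query;
# JMESPath raw strings use single quotes, so no backticks are needed.
double="contains(Key, '$FILE_NAME_FILTER')"
echo "$single"
echo "$double"
```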

The above solutions use Bash. If one wants to do the same thing in PowerShell for downloading on Windows, here is the script:
# This assumes AWS CLI exe is in your path.
$s3location = "s3://bucket-name"
$files = $(aws s3 ls $s3location --recursive | sort | select -last 3)
$dlPath = "C:\TEMP"
foreach ($s3FileInfo in $files) {
    $filename = $s3FileInfo.Split()[-1]
    $path = "${s3location}/${filename}"
    aws s3 cp $path $dlPath
    echo("Done downloading ${path} to ${dlPath}")
}

Related

Copy the latest uploaded file from S3 bucket to local machine

I have a cron job set that moves the files from an EC2 instance to S3
aws s3 mv --recursive localdir s3://bucket-name/ --exclude "*" --include "localdir/*"
After that I use aws s3 sync s3://bucket-name/data1/ E:\Datafolder in a .bat file and use Task Scheduler in Windows to run the command.
The issue is that s3 sync command copies all the files in /data1/ prefix.
So let's say I have the following files:
Day1: file1 is synced to local.
Day2: file2 is added; both file1 and file2 are synced to local, because file1 was removed from the local machine's folder.
I don't want them to occupy space on local machine. On Day 2, I just want file2 to be copied over.
Can this be accomplished by AWS CLI commands? or do I need to write a lambda function?
I followed the answer from Get last modified object from S3 using AWS CLI
but on Windows, the | and awk commands are not working as expected.
To obtain the name of the object that has the most recent Last Modified date, you can use:
aws s3api list-objects-v2 --bucket BUCKET-NAME --query 'sort_by(Contents, &LastModified)[-1].Key' --output text
Therefore (using shell syntax), you could use:
object=`aws s3api list-objects-v2 --bucket BUCKET-NAME --prefix data1/ --query 'sort_by(Contents, &LastModified)[-1].Key' --output text`
aws s3 cp s3://BUCKET-NAME/$object E:\Datafolder
You might need to tweak it to get it working on Windows.
Basically, it gets the bucket listing, sorts by LastModified, then grabs the name of the last object in the list.
Modified answer to work in a Windows .bat file (cmd.exe):
for /f "delims=" %%i in ('aws s3api list-objects-v2 --bucket BUCKET-NAME --prefix data1/ --query "sort_by(Contents, &LastModified)[-1].Key" --output text') do set object=%%i
aws s3 cp s3://BUCKET-NAME/%object% E:\Datafolder

AWS S3 File merge using CLI

I am trying to combine/merge the contents of all the files in an S3 bucket folder into a new file, in ascending order of the S3 files' Last Modified time.
I am able to do that manually by having hard coded file names like as follows:
(aws s3 cp s3://bucket1/file1 - && aws s3 cp s3://bucket1/file2 - && aws s3 cp s3://bucket1/file3 - ) | aws s3 cp - s3://bucket1/new-file
But now I want to change the CLI command so that it merges as many files as exist in a folder, sorted by Last Modified. Ideally, the cp command should receive the list of all files in the S3 bucket folder, sorted by Last Modified, and merge them into a new file.
I appreciate everyone's help on this.
Here are some hints.
First, list the files sorted by Last Modified. The merge should be in ascending order, so sort_by without reverse gives the oldest file first:
aws s3api list-objects --bucket bucket1 --query "sort_by(Contents,&LastModified)"
Then you should be fine to attach the rest of the commands:
aws s3api list-objects --bucket bucket1 --query "sort_by(Contents,&LastModified)" | jq -r '.[].Key' | while read -r file
do
  echo "$file"
  # append each object to the merged file:
  # aws s3 cp "s3://bucket1/$file" - >> new-file
done
aws s3 cp new-file s3://bucket1/new-file
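The ascending-by-LastModified ordering the question asks for can be checked offline: ISO-8601 timestamps sort lexically in chronological order, so a plain sort on canned "timestamp key" pairs (all names made up) reproduces the merge order, with echo standing in for the real aws s3 cp append:

```shell
# Canned "LastModified Key" pairs (made-up objects); sort is chronological
# because ISO-8601 timestamps sort lexically.
merged=$(printf '%s\n' \
  '2022-01-03T10:00:00Z part-b' \
  '2022-01-01T10:00:00Z part-a' \
  '2022-01-05T10:00:00Z part-c' \
  | sort \
  | while read -r _ts key; do echo "append s3://bucket1/${key}"; done)
printf '%s\n' "$merged"
```

The oldest object (part-a) comes out first, which is the order you want when appending into the merged file.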

Selective file download in AWS CLI

I have files in an S3 bucket. I was trying to download files based on a date, like 08th Aug, 09th Aug, etc.
I used the following code, but it still downloads the entire bucket:
aws s3 cp s3://bucketname/ folder/file \
    --profile pname \
    --exclude "*" \
    --recursive \
    --include "2015-08-09*"
I am not sure how to achieve this. How can I download only the files for a given date?
This command will copy all files starting with 2015-08-15:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2015-08-15*" --recursive
If your goal is to synchronize a set of files without copying them twice, use the sync command:
aws s3 sync s3://BUCKET/ folder
That will copy all files that have been added or modified since the previous sync.
In fact, this is the equivalent of the above cp command:
aws s3 sync s3://BUCKET/ folder --exclude "*" --include "2015-08-15*"
References:
AWS CLI s3 sync command documentation
AWS CLI s3 cp command documentation
Bash Command to copy all files for specific date or month to current folder
aws s3 ls s3://bucketname/ | grep '2021-02' | awk '{print $4}' | xargs -I {} aws s3 cp s3://bucketname/{} folder
The command does the following:
Lists all the files under the bucket
Filters for files matching 2021-02, i.e. all files from February 2021
Extracts only their names
Runs aws s3 cp on each of those files (via xargs, since aws s3 cp does not read keys from stdin)
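The listing, filtering, and name-extraction stages can be sanity-checked offline by canning the aws s3 ls output (file names made up) and echoing the cp commands instead of running them:

```shell
# Canned `aws s3 ls` output (made-up keys); the object name is the 4th column.
keys=$(printf '%s\n' \
  '2021-02-03 09:15:00       2048 dump-2021-02-03.sql.gz' \
  '2021-03-01 09:15:00       2048 dump-2021-03-01.sql.gz' \
  '2021-02-17 09:15:00       2048 dump-2021-02-17.sql.gz' \
  | grep '2021-02' | awk '{print $4}')
# xargs turns each surviving key into one (echoed) cp command:
printf '%s\n' "$keys" | xargs -I {} echo aws s3 cp "s3://bucketname/{}" folder
```

Only the two February keys survive the grep; the March key is filtered out before any copy would run.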
In case your bucket is large, upwards of 10 to 20 GB (this was true in my own personal use case), you can achieve the same goal by using sync in multiple terminal windows.
All the terminal sessions can use the same token, in case you need to generate a token for the prod environment.
$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date1*" --profile UR_AC_SomeName
and another terminal window (same pwd)
$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date2*" --profile UR_AC_SomeName
and another two for "name_date3*" and "name_date4*"
Additionally, you can also do multiple excludes in the same sync
command as in:
$ aws s3 sync s3://bucket-name/sub-name/another-name my-local-path/ \
    --exclude="*.log/*" --exclude=img --exclude=".error" --exclude=tmp \
    --exclude="*.cache"
This Bash script will copy all files from one bucket to another by modified date using the AWS CLI.
aws s3 ls <BCKT_NAME> --recursive | sort | grep "2020-08" | cut -b 32- > a.txt
Inside Bash File
while IFS= read -r line; do
    aws s3 cp "s3://<SRC_BCKT>/${line}" "s3://<DEST_BCKT>/${line}" --sse AES256
done < a.txt
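The cut -b 32- step depends on the fixed column layout of aws s3 ls --recursive output (date, time, size right-aligned to column 30, then the key from column 32). That can be verified offline with a canned listing (made-up keys) and echo in place of the real copy:

```shell
# Canned `aws s3 ls --recursive` lines; the key starts at byte 32.
printf '%s\n' \
  '2020-08-02 10:00:00       2048 backups/db-2020-08-02.gz' \
  '2020-08-01 10:00:00       1024 backups/db-2020-08-01.gz' \
  | sort | grep '2020-08' | cut -b 32- > a.txt
while IFS= read -r line; do
  echo "would copy s3://SRC/${line} to s3://DEST/${line}"
done < a.txt
```

Note that because the key position is byte-based, this survives keys containing spaces, which the awk '{print $4}' approach elsewhere on this page does not.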
The AWS CLI is really slow at this. I waited hours and nothing really happened, so I looked for alternatives.
https://github.com/peak/s5cmd worked great.
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is blazing fast, so you can work with buckets that have S3 access logs without much fuss.

Get last modified object from S3 using AWS CLI

I have a use case where I programmatically bring up an EC2 instance, copy an executable file from S3, run it and shut down the instance (done in user-data). I need to get only the last added file from S3.
Is there a way to get the last modified file / object from an S3 bucket using the AWS CLI tool?
You can list all the objects in the bucket with aws s3 ls $BUCKET --recursive:
$ aws s3 ls $BUCKET --recursive
2015-05-05 15:36:17 4 an_object.txt
2015-06-08 14:14:44 16322599 some/other/object
2015-04-29 12:09:29 32768 yet-another-object.sh
They're sorted alphabetically by key, but that first column is the last modified time. A quick sort will reorder them by date:
$ aws s3 ls $BUCKET --recursive | sort
2015-04-29 12:09:29 32768 yet-another-object.sh
2015-05-05 15:36:17 4 an_object.txt
2015-06-08 14:14:44 16322599 some/other/object
tail -n 1 selects the last row, and awk '{print $4}' extracts the fourth column (the name of the object).
$ aws s3 ls $BUCKET --recursive | sort | tail -n 1 | awk '{print $4}'
some/other/object
Last but not least, drop that into aws s3 cp to download the object:
$ KEY=`aws s3 ls $BUCKET --recursive | sort | tail -n 1 | awk '{print $4}'`
$ aws s3 cp s3://$BUCKET/$KEY ./latest-object
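The whole pipeline can be dry-run offline by canning the sample listing above in a function, so the extraction logic is testable without credentials:

```shell
# Stand-in for `aws s3 ls $BUCKET --recursive`, using the sample listing above:
listing() {
  printf '%s\n' \
    '2015-05-05 15:36:17          4 an_object.txt' \
    '2015-06-08 14:14:44   16322599 some/other/object' \
    '2015-04-29 12:09:29      32768 yet-another-object.sh'
}
KEY=$(listing | sort | tail -n 1 | awk '{print $4}')
echo "$KEY"   # most recently modified key: some/other/object
```

Swapping the listing function back to the real aws s3 ls call gives the command above unchanged.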
Updated answer
After a while there is a small update on how to do it a bit more elegantly:
aws s3api list-objects-v2 --bucket "my-awesome-bucket" --query 'sort_by(Contents, &LastModified)[-1].Key' --output=text
Instead of an extra reverse function, we can get the last entry from the list via [-1].
Old answer
This command just does the job without any external dependencies:
aws s3api list-objects-v2 --bucket "my-awesome-bucket" --query 'reverse(sort_by(Contents, &LastModified))[:1].Key' --output=text
aws s3api list-objects-v2 --bucket "bucket-name" | jq -r '.Contents | max_by(.LastModified) | .Key'
If this is a freshly uploaded file, you can use Lambda to execute a piece of code on the new S3 object.
If you really need to get the most recent one, you can name your files with the date first, sort by name, and take the first object.
The following is a bash script that downloads the latest file from an S3 bucket. I used the aws s3 sync command instead, so that it would not download the file from S3 if it already exists locally.
--exclude excludes all the files
--include includes all the files matching the pattern
#!/usr/bin/env bash
BUCKET="s3://my-s3-bucket-eu-west-1/list/"
FILE_NAME=$(aws s3 ls "$BUCKET" | sort | tail -n 1 | awk '{print $4}')
TARGET_FILE_PATH=target/datdump/
TARGET_FILE=${TARGET_FILE_PATH}localData.json.gz
echo "$FILE_NAME"
echo "$TARGET_FILE"
aws s3 sync "$BUCKET" "$TARGET_FILE_PATH" --exclude "*" --include "*$FILE_NAME*"
cp "${TARGET_FILE_PATH}${FILE_NAME}" "$TARGET_FILE"
p.s. Thanks @David Murray

How to search an Amazon S3 Bucket using Wildcards?

This stackoverflow answer helped a lot. However, I want to search for all PDFs inside a given bucket.
I click "None".
Start typing.
I type *.pdf
Press Enter
Nothing happens. Is there a way to use wildcards or regular expressions to filter bucket search results via the online S3 GUI console?
As stated in a comment, Amazon's UI can only be used to search by prefix as per their own documentation:
http://docs.aws.amazon.com/AmazonS3/latest/UG/searching-for-objects-by-prefix.html
There are other methods of searching, but they require a bit of effort. Just to name two options: the AWS CLI application, or Boto3 for Python.
I know this post is old but it is high on Google's list for s3 searching and does not have an accepted answer. The other answer by Harish is linking to a dead site.
UPDATE 2020/03/03: AWS link above has been removed. This is a link to a very similar topic that was as close as I could find. https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
AWS CLI search:
In the AWS Console, we can search for objects within a directory only, not across directories, and only by the prefix of the file name (an S3 search limitation).
The best way is to use the AWS CLI with the below command on a Linux OS:
aws s3 ls s3://bucket_name/ --recursive | grep search_word | cut -c 32-
Searching files with wildcards:
aws s3 ls s3://bucket_name/ --recursive | grep '\.pdf$'
You can use the copy function with the --dryrun flag:
aws s3 cp s3://your-bucket/any-prefix/ . --recursive --exclude "*" --include "*.pdf" --dryrun
It would show all of the files that are PDFs.
If you use boto3 in Python it's quite easy to find the files. Replace 'bucket' with the name of the bucket.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    if '.pdf' in obj.key:
        print(obj.key)
The CLI can do this; aws s3 only supports prefixes, but aws s3api supports arbitrary filtering. For s3 links that look like s3://company-bucket/category/obj-foo.pdf, s3://company-bucket/category/obj-bar.pdf, s3://company-bucket/category/baz.pdf, you can run
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?ends_with(Key, '.pdf')]"
or for a more general wildcard
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?contains(Key, 'foo')]"
or even
aws s3api list-objects --bucket "company-bucket" --prefix "category/obj" --query "Contents[?ends_with(Key, '.pdf') && contains(Key, 'ba')]"
The full query language is described at JMESPath.
The documentation using the Java SDK suggests it can be done:
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html
Specifically, the listObjectsV2 call (whose results come back as a ListObjectsV2Result) allows you to specify a prefix filter, e.g. "files/2020-01-02", so you only return results matching today's date.
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ListObjectsV2Result.html
My guess is the files were uploaded from a Unix system and you're downloading to Windows, so s3cmd is unable to preserve file permissions, which don't apply on NTFS.
To search for files and grab them try this from the target directory or change ./ to target:
for i in `s3cmd ls s3://bucket | grep "searchterm" | awk '{print $4}'`; do s3cmd sync --no-preserve "$i" ./; done
This works in WSL on Windows.
I have used this in one of my projects, but it's a bit hard-coded:
import subprocess
bucket = "Abcd"
command = "aws s3 ls s3://" + bucket + "/sub_dir/ | grep '[.]csv'"
listofitems = subprocess.check_output(command, shell=True,)
listofitems = listofitems.decode('utf-8')
print([item.split(" ")[-1] for item in listofitems.split("\n")[:-1]])