How to compare versions of an Amazon S3 object? - amazon-web-services

Versioning of Amazon S3 buckets is nice, but I don't see any easy way to compare versions of a file - either through the console or through any other app I found.
S3Browser seems to have the best versioning support, but no comparison.
Is there a way to compare versions of a file on S3 without downloading both versions and comparing them manually?
--
EDIT:
I just started thinking that some basic automation should not be too hard, see snippet below. Question remains though: is there any tool that supports this properly? This script may be fine for me, but not for non-dev users.
#!/bin/bash
# s3-compare-last-versions.sh
# Requires the AWS CLI and the json command-line tool.
if [[ $# -ne 2 ]]; then
echo "Usage: $(basename "$0") <bucketName> <fileKey>"
exit 1
fi
bucketName=$1
fileKey=$2
# List the versions once and pick the two newest entries from the result
versions=$(aws s3api list-object-versions --bucket "$bucketName" --prefix "$fileKey" --max-items 2)
latestVersionId=$(echo "$versions" | json 'Versions[0].VersionId')
previousVersionId=$(echo "$versions" | json 'Versions[1].VersionId')
aws s3api get-object --bucket "$bucketName" --key "$fileKey" --version-id "$latestVersionId" "$latestVersionId.js"
aws s3api get-object --bucket "$bucketName" --key "$fileKey" --version-id "$previousVersionId" "$previousVersionId.js"
diff "$latestVersionId.js" "$previousVersionId.js"

I wrote a bash script that downloads the last two versions of an object and compares them using colordiff. I stumbled across this question after writing it and thought I would share it here in case anyone wants to use it.
#!/bin/bash
#This script needs awscli, jq and colordiff. Please install them for your environment
#This script also needs the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION.
#Please set them using the export command as follows or set them using envrc
#export AWS_ACCESS_KEY_ID=<Your AWS Access Key ID>
#export AWS_SECRET_ACCESS_KEY=<Your AWS Secret Access Key>
#export AWS_DEFAULT_REGION=<Your AWS Default Region>
set -e
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage:"
echo "version_compare.sh *bucket_name* *file_name*"
echo
echo "Example"
echo "version_compare.sh bucket_name folder/filename.extension"
echo
exit 1;
fi
aws_bucket=$1
file_key=$2
echo Getting the last 2 versions of the file at ${file_key}..
echo
echo Executing:
cat << EOF
aws s3api list-object-versions --bucket ${aws_bucket} --prefix ${file_key} --max-items 2
EOF
echo
versions=$(aws s3api list-object-versions --bucket ${aws_bucket} --prefix ${file_key} --max-items 2)
version_1=$( jq -r '.["Versions"][0]["VersionId"]' <<< "${versions}" )
version_2=$( jq -r '.["Versions"][1]["VersionId"]' <<< "${versions}" )
mkdir -p state_comparison_files
echo Getting the latest version ${version_1} of the file at ${file_key}..
echo
echo Executing:
cat << EOF
aws s3api get-object --bucket ${aws_bucket} --key ${file_key} --version-id ${version_1} state_comparison_files/${version_1}
EOF
aws s3api get-object --bucket ${aws_bucket} --key ${file_key} --version-id ${version_1} state_comparison_files/${version_1} > /dev/null
echo
echo Getting older version ${version_2} of the file at ${file_key}..
echo
echo Executing:
cat << EOF
aws s3api get-object --bucket ${aws_bucket} --key ${file_key} --version-id ${version_2} state_comparison_files/${version_2}
EOF
aws s3api get-object --bucket ${aws_bucket} --key ${file_key} --version-id ${version_2} state_comparison_files/${version_2} > /dev/null
echo
echo Comparing the different versions.
echo If no differences are found, nothing will be shown
colordiff --unified state_comparison_files/${version_2} state_comparison_files/${version_1}
Here's the link to it
https://gist.github.com/mohamednajiullah/3edc88d314291be40f2dd3cf13ea0d7f
Note: It's pretty much the same as the script the question asker himself created except that it uses jq for json parsing and colordiff for showing the difference with different colors like in git diff.
I'm also building an Electron-based desktop app to do exactly this. It is still in development but already usable, and contributions are welcome.
https://github.com/mohamednajiullah/s3_object_version_comparator
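One caveat applies to both scripts above: --prefix matches every key that starts with the given string, so when the object has fewer than two versions, Versions[1] can belong to a different key (for example app.js.map when comparing app.js). A minimal sketch that filters on the exact key instead. The function name is hypothetical, and it assumes the key contains no single quotes:

```shell
#!/bin/bash
# Sketch: print the two newest version IDs of exactly one key, tab-separated.
# Assumes a configured AWS CLI; last_two_version_ids is a made-up name.
last_two_version_ids() {
    local bucket=$1 key=$2
    # --prefix narrows the listing on the server side; the JMESPath filter
    # then keeps only entries whose Key equals the requested key.
    aws s3api list-object-versions \
        --bucket "$bucket" --prefix "$key" \
        --query "Versions[?Key=='$key'] | [:2].VersionId" \
        --output text
}
```

Because the filter runs on the client, the CLI still lists every version under the prefix, so this can be slow for prefixes with many objects. Example usage: ids=$(last_two_version_ids my-bucket path/to/file.js).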

You can't view file contents at all via S3, so you definitely can't compare the contents of files via S3. You would have to download the different versions and then use a tool like diff to compare them.
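The download-then-diff approach can be wrapped so that no files are left behind. This is only a sketch, assuming a configured AWS CLI and that both version IDs are already known; the function name is hypothetical:

```shell
#!/bin/bash
# Sketch: diff two versions of one S3 object using temporary files.
# Exit status: 0 = identical, 1 = different, 2 = download problem.
s3_diff_versions() {
    local bucket=$1 key=$2 old_version=$3 new_version=$4
    local tmp_dir status=0
    tmp_dir=$(mktemp -d) || return 2

    # get-object writes the object body to the file named by the last
    # argument and prints JSON metadata on stdout, which is discarded here.
    aws s3api get-object --bucket "$bucket" --key "$key" \
        --version-id "$old_version" "$tmp_dir/old" > /dev/null &&
    aws s3api get-object --bucket "$bucket" --key "$key" \
        --version-id "$new_version" "$tmp_dir/new" > /dev/null || status=2

    # diff exits 0 when the files are identical and 1 when they differ
    if [ "$status" -eq 0 ]; then
        diff -u "$tmp_dir/old" "$tmp_dir/new" || status=$?
    fi
    rm -rf "$tmp_dir"
    return "$status"
}
```

Example usage: s3_diff_versions my-bucket path/to/file.js OLD_VERSION_ID NEW_VERSION_ID.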

You can use MegaSparkDiff, an open-source tool that compares multiple types of data sources, including S3:
https://github.com/FINRAOS/MegaSparkDiff
The snippet below returns a pair of DataFrames, inLeftButNotInRight and inRightButNotInLeft, which you can save as files or examine in code.
SparkFactory.initializeSparkContext();
AppleTable leftAppleTable = SparkFactory.parallelizeTextSource("S3://file1","table1");
AppleTable rightAppleTable = SparkFactory.parallelizeTextSource("S3://file2","table2");
Pair<Dataset<Row>, Dataset<Row>> resultPair = SparkCompare.compareAppleTables(leftAppleTable, rightAppleTable);
resultPair.getLeft().show(100);
SparkFactory.stopSparkContext();

Related

AWS CLI - SSL check on every bucket

I'm just getting started with the AWS CLI and was wondering whether there is a way to check existing buckets and see if they have SSL enabled.
Many thanks.
buckets=$(aws s3api list-buckets | jq -r '.Buckets[].Name')
for bucket in $buckets
do
#echo "$bucket"
if aws s3api get-bucket-policy --bucket "$bucket" --query Policy --output text &> /dev/null; then
aws s3api get-bucket-policy --bucket "$bucket" --query Policy --output text | jq -r 'select(.Statement[].Condition.Bool."aws:SecureTransport"=="false")' | wc | awk '{print $1}'
fi
done

Retrieving files from s3 with content-type

Is there any proper way to retrieve files from s3 with the Content-type using python or AWS CLI?
I've searched and tried the queries below, but the first one does not work as intended.
aws s3 ls --summarize --human-readable --recursive s3://<Bucket Name> | egrep '*.jpg*'
The following loop seems to work, but it also returns 404 errors.
for KEY in $(aws s3api list-objects --bucket <Bucket Name> --query "Contents[].[Key]" --output text)
do
aws s3api head-object --bucket <Bucket Name> --key $KEY --query "[\`$KEY\`,ContentType]" --output text | awk '$2 == "image/jpeg" { print $1 }'
done
One of the reasons is that the variable is not expanding in the query parameter:
--query "[\`$KEY\`,ContentType]"
For more details, see:
How to expand variable in aws-cli --query parameter
You can try the script below; I just tested it and it seems to work.
#!/bin/bash
ContentType="application/octet-stream"
BUCKET=mybucket
MAX_ITEMS=100
OBJECT_LIST="$(aws s3api list-objects --bucket $BUCKET --query 'Contents[].[Key]' --max-items=$MAX_ITEMS --output text | tr '\n' ' ' )";
for KEY in ${OBJECT_LIST}
do
aws s3api head-object --bucket $BUCKET --key $KEY --query "[\``echo $KEY`\`,ContentType]" --output text | grep "$ContentType"
done
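A likely contributor to the 404 errors in the question is word splitting: for KEY in $(...) breaks any key that contains whitespace into several words, and each fragment is then looked up as its own key. A minimal sketch that reads one key per line instead, assuming a configured AWS CLI (the function name is hypothetical):

```shell
#!/bin/bash
# Sketch: print every key in a bucket whose Content-Type equals the wanted value.
keys_with_content_type() {
    local bucket=$1 wanted=$2 key content_type
    # --output text prints one key per line, so read line by line instead of
    # word-splitting; this keeps keys that contain spaces intact.
    aws s3api list-objects-v2 --bucket "$bucket" \
        --query 'Contents[].[Key]' --output text |
    while IFS= read -r key; do
        # Skip keys whose HEAD request fails (for example with a 404)
        content_type=$(aws s3api head-object --bucket "$bucket" --key "$key" \
            --query ContentType --output text < /dev/null 2> /dev/null) || continue
        if [ "$content_type" = "$wanted" ]; then
            printf '%s\n' "$key"
        fi
    done
}
```

Example usage: keys_with_content_type mybucket image/jpeg. This still issues one head-object request per key, so it is slow on large buckets.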

How to list unused AWS S3 buckets and empty bucket using shell script?

I am looking for a list of S3 buckets that have not been used in the last 90 days, and also for a list of empty buckets.
In order to get it, I have tried writing code as below:
#!/bin/sh
for bucketlist in $(aws s3api list-buckets --query "Buckets[].Name" --output text);
do
listobjects=$(\
aws s3api list-objects --bucket $bucketlist \
--query 'Contents[?contains(LastModified, `2020-08-06`)]')
done
This code prints the following output (I have included the result for only one bucket for reference):
{
"Contents": [
{
"Key": "test2/image.png",
"LastModified": "2020-08-06T17:19:10.000Z",
"ETag": "\"xxxxxx\"",
"Size": 179008,
"StorageClass": "STANDARD"
}
]
}
Expectations:
In the code above, I want to print only the buckets whose objects have not been modified or used in the last 90 days.
I am also looking for a list of empty buckets.
I am not good at programming. Can anyone guide me on this?
Thank you in advance for your support.
I made this small shell script to find empty buckets in my account:
#!/bin/zsh
for b in $(aws s3 ls | cut -d" " -f3)
do
echo -n $b
if [[ "$(aws s3api list-objects-v2 --bucket $b --max-items 1)" == "" ]]
then
echo " BUCKET EMPTY"
else
echo ""
fi
done
I listed the objects using the list-objects-v2 with maximum items of 1. If there are no items - the result is empty and I print "BUCKET EMPTY" alongside the bucket name.
Note 1: You must have access to list the objects.
Note 2: I'm not sure how it'll work for versioned buckets with deleted objects (appears to be empty, but actually contains older versions of deleted objects)
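For the versioned-bucket case in Note 2, one option is to ask for object versions and delete markers rather than current objects. The sketch below assumes a configured AWS CLI, that --max-keys 1 with --no-paginate limits the request to a single small page, and uses a hypothetical function name:

```shell
#!/bin/bash
# Sketch: succeed (exit 0) when the bucket still holds any object version or
# delete marker, exit 1 when it holds none, exit 2 when the listing fails.
bucket_has_any_versions() {
    local bucket=$1 counts
    counts=$(aws s3api list-object-versions --bucket "$bucket" \
        --max-keys 1 --no-paginate \
        --query '[length(Versions || `[]`), length(DeleteMarkers || `[]`)]' \
        --output text) || return 2
    # counts holds two numbers separated by a tab; any non-zero digit
    # means something is still stored in the bucket.
    case $counts in
        *[1-9]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A bucket that looks empty to list-objects-v2 but passes this check cannot be deleted until its old versions and delete markers are removed.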
Here's a script I wrote today. It doesn't change anything, but it does print the command lines you would run to make the changes.
#!/bin/bash
profile="default"
olddate="2020-01-01"
smallbucketsize=10
emptybucketlist=()
oldbucketlist=()
smallbucketlist=()
#for bucketlist in $(aws s3api list-buckets --profile $profile | jq --raw-output '.Buckets[6,7,8,9].Name'); # test this script on just a few buckets
for bucketlist in $(aws s3api list-buckets --profile $profile | jq --raw-output '.Buckets[].Name');
do
echo "* $bucketlist"
if [[ ! "$bucketlist" == *"shmr-logs" ]]; then
listobjects=$(\
aws s3api list-objects --bucket $bucketlist \
--query 'Contents[*].Key' \
--profile $profile)
#echo "==$listobjects=="
if [[ "$listobjects" == "null" ]]; then
echo "$bucketlist is empty"
emptybucketlist+=("$bucketlist")
else
# get size
aws s3 ls --summarize --human-readable --recursive --profile $profile s3://$bucketlist | tail -n1
# get number of files
filecount=$(echo $listobjects | jq length )
echo "contains $filecount files"
if [[ $filecount -lt $smallbucketsize ]]; then
smallbucketlist+=("$bucketlist")
fi
# get number of files older than $olddate
listoldobjects=$(\
aws s3api list-objects --bucket $bucketlist \
--query "Contents[?LastModified<=\`$olddate\`]" \
--profile $profile)
oldfilecount=$(echo $listoldobjects | jq length )
echo "contains $oldfilecount old files"
# check if all files are old
if [[ $filecount -eq $oldfilecount ]]; then
echo "all the files are old"
oldbucketlist+=("$bucketlist")
fi
fi
fi
done
echo -e "\n\n"
echo "check the contents of these buckets which only contain old files"
for oldbuckets in "${oldbucketlist[@]}";
do
echo "$oldbuckets"
done
echo -e "\n\n"
echo "check the contents of these buckets which don't have many files"
for smallbuckets in "${smallbucketlist[@]}";
do
echo "aws s3api list-objects --bucket $smallbuckets --query 'Contents[*].Key' --profile $profile"
done
echo -e "\n\n"
echo "consider deleting these empty buckets"
for emptybuckets in "${emptybucketlist[@]}";
do
echo "aws s3api delete-bucket --profile $profile --bucket $emptybuckets"
done
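The script above uses a fixed date. For the "last 90 days" part of the question, a per-bucket check could compare the newest LastModified value against a cutoff. This is a sketch under assumptions: a configured AWS CLI, hypothetical function names, and ISO 8601 timestamps that sort correctly as plain strings.

```shell
#!/bin/bash
# Sketch: report whether the newest object in a bucket is older than a cutoff.
newest_last_modified() {
    # With text output the query may be applied once per result page, so keep
    # only the latest line; ISO 8601 timestamps sort correctly as strings.
    aws s3api list-objects-v2 --bucket "$1" \
        --query 'sort_by(Contents || `[]`, &LastModified)[-1].LastModified' \
        --output text | sort | tail -n 1
}

# Exit status: 0 = stale, 1 = modified since the cutoff,
# 2 = bucket is empty or the listing failed.
bucket_is_stale() {
    local bucket=$1 cutoff=$2 newest
    newest=$(newest_last_modified "$bucket")
    case $newest in
        ''|None) return 2 ;;
    esac
    [ "$newest" \< "$cutoff" ]
}
```

With GNU date, the cutoff can be computed as cutoff=$(date -u -d '90 days ago' +%Y-%m-%d). Note that LastModified only records writes; detecting buckets that are merely read from would need access logs or similar.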

Undelete folders from AWS S3

I have a S3 bucket with versioning enabled. It is possible to undelete files, but how can I undelete folders?
I know, S3 does not have folders... but how can I undelete common prefixes? Is there a possibility to undelete files recursively?
I created this simple bash script to restore all the files in an S3 folder I deleted:
#!/bin/bash
recoverfiles=$(aws s3api list-object-versions --bucket MyBucketName --prefix TheDeletedFolder/ --query "DeleteMarkers[?IsLatest && starts_with(LastModified,'yyyy-mm-dd')].{Key:Key,VersionId:VersionId}")
for row in $(echo "${recoverfiles}" | jq -c '.[]'); do
key=$(echo "${row}" | jq -r '.Key' )
versionId=$(echo "${row}" | jq -r '.VersionId' )
echo aws s3api delete-object --bucket MyBucketName --key $key --version-id $versionId
done
yyyy-mm-dd = the date the folder was deleted
I found a satisfactory solution here, which is described in more detail here.
To sum up, there is no out-of-the-box tool for this, but a simple bash script that wraps the AWS "s3api" tool achieves the recursive undelete.
The solution worked for me. The only drawback I found is that Amazon seems to throttle the restore operations after about 30,000 files.
You cannot undelete a common prefix. You would need to undelete one object at a time. When an object appears, any associated folder will also reappear.
Undeleting can be accomplished in two ways:
Delete the Delete Marker that will reverse the deletion, or
Copy a previous version of the object to itself, which will make the newest version newer than the Delete Marker, so it will reappear. (I hope you understood that!)
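Both options map onto single s3api calls. The functions below are a sketch with hypothetical names, assuming a configured AWS CLI and that the relevant version IDs were already found with list-object-versions:

```shell
#!/bin/bash
# Option 1: remove the delete marker itself.
# Arguments: bucket, key, version ID of the delete marker.
remove_delete_marker() {
    aws s3api delete-object --bucket "$1" --key "$2" --version-id "$3"
}

# Option 2: copy an older version onto the same key, which creates a new
# latest version that is newer than the delete marker.
# Arguments: bucket, key, version ID of the object version to restore.
restore_by_copy() {
    aws s3api copy-object --bucket "$1" --key "$2" \
        --copy-source "$1/$2?versionId=$3"
}
```

Keys containing special characters need to be URL-encoded in the --copy-source value, which this sketch does not do.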
If a folder and its contents are deleted, you can recover them using the script below, inspired by a previous answer.
The script applies to an S3 bucket where versioning was enabled beforehand. It uses the delete markers to restore files under an S3 prefix.
#!/bin/bash
# Inspired by https://www.dmuth.org/how-to-undelete-files-in-amazon-s3/
# This script can be used to undelete objects from an S3 bucket.
# When run, it will print out a list of AWS commands to undelete files, which you
# can then pipe into Bash.
#
#
# You will need the AWS CLI tool from https://aws.amazon.com/cli/ in order to run this script.
#
# Note that you must have the following permissions via IAM:
#
# Bucket permissions:
#
# s3:ListBucket
# s3:ListBucketVersions
#
# File permissions:
#
# s3:PutObject
# s3:GetObject
# s3:DeleteObject
# s3:DeleteObjectVersion
#
# If you want to do this in a "quick and dirty manner", you could just grant s3:* to
# the account, but I don't really recommend that.
#
# profile = company
# bucket = company-s3-bucket
# prefix = directory1/directory2/directory3/lastdirectory/
# pattern = (.*)
# USAGE
# bash undelete.sh > recover_files.txt
# bash recover_files.txt
read -p "Enter your aws profile: " PROFILE
read -p "Enter your S3 bucket name: " BUCKET
read -p "Enter the S3 directory/prefix to recover from, leave empty to recover the whole S3 bucket: " PREFIX
read -p "Enter the file pattern looking to recover, leave empty for all: " PATTERN
# Make sure Profile and Bucket are entered
[[ -z "$PROFILE" ]] && { echo "Profile is empty" ; exit 1; }
[[ -z "$BUCKET" ]] && { echo "Bucket is empty" ; exit 1; }
# Fill PATTERN to match all if empty
PATTERN=${PATTERN:-(.*)}
# Errors are fatal
set -e
if [ "$PREFIX" = "" ];
# To recover all of the S3 bucket
then
aws --profile "${PROFILE}" --output text s3api list-object-versions --bucket "${BUCKET}" \
| grep -E "$PATTERN" \
| grep -E "^DELETEMARKERS" \
| awk -v PROFILE="$PROFILE" -v BUCKET="$BUCKET" -v PREFIX="$PREFIX" \
-F "[\t]+" '{ print "aws --profile " PROFILE " s3api delete-object --bucket " BUCKET " --key \""$3"\" --version-id "$5";"}'
# To recover a directory
else
aws --profile "${PROFILE}" --output text s3api list-object-versions --bucket "${BUCKET}" --prefix "${PREFIX}" \
| grep -E "$PATTERN" \
| grep -E "^DELETEMARKERS" \
| awk -v PROFILE="$PROFILE" -v BUCKET="$BUCKET" -v PREFIX="$PREFIX" \
-F "[\t]+" '{ print "aws --profile " PROFILE " s3api delete-object --bucket " BUCKET " --key \""$3"\" --version-id "$5";"}'
fi

Does aws-cli confirm checksums when uploading files to S3, or do I need to manage that myself?

If I'm uploading data to S3 using the aws-cli (i.e. using aws s3 cp), does aws-cli do any work to confirm that the resulting file in S3 matches the original file, or do I somehow need to manage that myself?
Based on this answer and the Java API documentation for putObject(), it looks like it's possible to verify the MD5 checksum after upload. However, I can't find a definitive answer on whether aws-cli actually does that.
It matters to me because I'm intending to upload GPG-encrypted files from a backup process, and I'd like some confidence that what's been stored in S3 actually matches the original.
According to the FAQ in the aws-cli GitHub repository, checksums are verified in most cases during upload and download.
Key points for uploads:
The AWS CLI calculates the Content-MD5 header for both standard and multipart uploads.
If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
The AWS CLI will retry this error up to 5 times before giving up and exiting with a nonzero exit code.
The AWS support page How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? describes how to achieve this.
First, determine the base64-encoded MD5 digest of the file you wish to upload:
$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"
Then use the s3api to upload the file:
$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64"
Note the use of the --content-md5 flag, the help for this flag states:
--content-md5 (string) The base64-encoded 128-bit MD5 digest of the part data.
This does not say much about why to use this flag, but we can find this information in the API documentation for put object:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
Using this flag causes S3 to verify on the server side that the file hash matches the specified value. If the hashes match, S3 returns the ETag:
{
"ETag": "\"599393a2c526c680119d84155d90f1e5\""
}
The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).
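Since the ETag is not always an MD5 digest (multipart uploads, for instance, produce an ETag with a -N suffix), a verification step can first check that the ETag has the shape of a plain MD5 before comparing. A minimal sketch with hypothetical function names, assuming md5sum from GNU coreutils is available:

```shell
#!/bin/bash
# Succeed only when the ETag looks like a plain 32-digit hex MD5.
etag_is_plain_md5() {
    # S3 returns the ETag wrapped in literal double quotes; strip them first
    printf '%s' "$1" | tr -d '"' | grep -Eq '^[0-9a-f]{32}$'
}

# Exit status: 0 = file matches the ETag, 1 = mismatch,
# 2 = the ETag is not a plain MD5, so no comparison is possible.
file_matches_etag() {
    local file=$1 etag local_md5
    etag=$(printf '%s' "$2" | tr -d '"')
    etag_is_plain_md5 "$etag" || return 2
    local_md5=$(md5sum < "$file" | awk '{ print $1 }')
    [ "$local_md5" = "$etag" ]
}
```

Example usage: file_matches_etag my-file "$(aws s3api head-object --bucket my-bucket --key my-file-name --query ETag --output text)".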
If the hash does not match the one you specified, you get an error:
A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.
In addition to this you can also add the file md5sum to the file metadata as an additional check:
$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"
After upload you can issue the head-object command to check the values.
$ aws s3api head-object --bucket my-bucket --key my-file-name
{
"AcceptRanges": "bytes",
"ContentType": "binary/octet-stream",
"LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
"ContentLength": 605,
"ETag": "\"599393a2c526c680119d84155d90f1e5\"",
"Metadata": {
"md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="
}
}
Here is a bash script that uses content md5 and adds metadata and then verifies that the values returned by S3 match the local hashes:
#!/bin/bash
set -euf -o pipefail
# assumes you have aws cli, jq installed
# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"
test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"
cd "$tmp_dir" || exit
mkdir "$test_dir"
cd "$test_dir" || exit
wget "$test_file_url"
md5_sum_hex="$( md5sum "$file_name" | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary "$file_name" | base64 )"
echo "$file_name hex = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"
echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"
echo "Verifying sums match"
s3_md5_sum_hex=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//'g )
s3_md5_sum_base64=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )
if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
echo "checksums match"
else
echo "something is wrong checksums do not match:"
cat <<EOM | column -t -s ' '
$file_name file hex: $md5_sum_hex s3 hex: $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM
fi
echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"