We have S3 'folders' (objects with a prefix under a bucket) containing millions and millions of files, and we want to figure out the size of these folders.
Writing my own .NET application to get the list of S3 objects was easy enough, but the maximum number of keys per request is 1000, so it's taking forever.
Using S3Browser to look at a 'folder's' properties is taking a long time too. I'm guessing for the same reasons.
I've had this .NET application running for a week - I need a better solution.
Is there a faster way to do this?
The AWS CLI's ls command can do this: aws s3 ls --summarize --human-readable --recursive s3://$BUCKETNAME/$PREFIX --region $REGION
It seems AWS has since added a menu item in the console where it's possible to see the size.
I prefer using the AWS CLI. I find that the web console often times out when there are too many objects.
replace s3://bucket/ with where you want to start from.
relies on awscli, awk, tail, and some bash-like shell
start=s3://bucket/ && \
for prefix in `aws s3 ls $start | awk '{print $2}'`; do
echo ">>> $prefix <<<"
aws s3 ls $start$prefix --recursive --summarize | tail -n2
done
or in one line form:
start=s3://bucket/ && for prefix in `aws s3 ls $start | awk '{print $2}'`; do echo ">>> $prefix <<<"; aws s3 ls $start$prefix --recursive --summarize | tail -n2; done
Output looks something like:
$ start=s3://bucket/ && for prefix in `aws s3 ls $start | awk '{print $2}'`; do echo ">>> $prefix <<<"; aws s3 ls $start$prefix --recursive --summarize | tail -n2; done
>>> extracts/ <<<
Total Objects: 23
Total Size: 10633858646
>>> hackathon/ <<<
Total Objects: 2
Total Size: 10004
>>> home/ <<<
Total Objects: 102
Total Size: 1421736087
I don't think the ideal solution exists, but here are some ideas you can develop further:
Is the app the only means by which files are written to S3? If so, you can store the file sizes (in a database, a file, or whatever) and sum them when necessary
Do concurrent calls to the LIST API (see the sketch after this list)
Can you switch from an organisation based on folders to one based on buckets? If so, you could query the billing API (yes, the billing) and calculate the size (or an approximation of it) from the cost...
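For the second idea, here is a minimal sketch (assuming the AWS CLI, awk, and xargs) that fans the per-prefix summarize calls from the earlier answer out over eight parallel workers; the workers' output may interleave:
start=s3://bucket/
aws s3 ls "$start" | awk '{print $2}' | \
  xargs -P 8 -I {} sh -c 'echo ">>> {} <<<"; aws s3 ls '"$start"'{} --recursive --summarize | tail -n2'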
If they're throttling you to 1000 keys per request, I'm not certain how PowerShell is going to help, but if you want the size of a bunch of folders, something like this should do it.
Save the following in a file called Get-FolderSize.ps1:
param
(
[Parameter(Position=0, ValueFromPipeline=$True, Mandatory=$True)]
[ValidateNotNullOrEmpty()]
[System.String]
$Path
)
function Get-FolderSize ($_ = (get-item .)) {
Process {
$ErrorActionPreference = "SilentlyContinue"
#? { $_.FullName -notmatch "\\email\\?" } <-- Exclude folders.
$length = (Get-ChildItem $_.fullname -recurse | Measure-Object -property length -sum).sum
$obj = New-Object PSObject
$obj | Add-Member NoteProperty Folder ($_.FullName)
$obj | Add-Member NoteProperty Length ($length)
Write-Output $obj
}
}
Function Class-Size($size)
{
IF($size -ge 1GB)
{
"{0:n2}" -f ($size / 1GB) + " GB"
}
ELSEIF($size -ge 1MB)
{
"{0:n2}" -f ($size / 1MB) + " MB"
}
ELSE
{
"{0:n2}" -f ($size / 1KB) + " KB"
}
}
Get-ChildItem $Path | Get-FolderSize | Sort-Object -Property Length -Descending | Select-Object -Property Folder, Length | Format-Table -Property Folder, @{ Label="Size of Folder" ; Expression = {Class-Size($_.Length)} }
Usage: .\Get-FolderSize.ps1 -Path \path\to\your\folders
Related
I have an S3 bucket with different filenames. I need to download specific files (filenames that start with "impression") that were created or modified in the last 24 hours from the S3 bucket to a local folder using PowerShell.
$items = Get-S3Object -BucketName $sourceBucket -ProfileName $profile -Region 'us-east-1' |
    Sort-Object LastModified -Descending |
    Select-Object -First 1 |
    select Key
Write-Host "$($items.Length) objects to copy"
$index = 1
$items | % {
    Write-Host "$index/$($items.Length): $($_.Key)"
    $fileName = $Folder + ".\$($_.Key.Replace('/','\'))"
    Write-Host "$fileName"
    Read-S3Object -BucketName $sourceBucket -Key $_.Key -File $fileName -ProfileName $profile -Region 'us-east-1' > $null
    $index += 1
}
A workaround might be to turn on access logging; since the access logs contain timestamps, you can collect all access logs from the past 24 hours, de-duplicate the repeated S3 objects, and then download them all.
You can enable S3 access logging in the bucket settings; the logs will be stored in another bucket.
If you end up writing a script for this, just bear in mind that downloading the S3 objects will itself create new access log entries, making the operation irreversible.
If you want something fancy, you could even query the logs and de-duplicate them using AWS Athena.
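For reference, here is a rough sketch of that access-log idea using the AWS CLI (in bash rather than PowerShell; the bucket names are placeholders, and the awk field positions assume the standard S3 server access log format, where the operation and key are the 8th and 9th whitespace-separated fields):
#!/bin/bash
LOG_BUCKET=my-log-bucket        # placeholder: bucket where the access logs are delivered
LOG_PREFIX=s3-access-logs/      # placeholder: log prefix configured on the source bucket
SRC_BUCKET=my-source-bucket     # placeholder: bucket the objects live in
# pull the access logs down, keep PUT-object entries for keys starting with "impression",
# de-duplicate the keys, and download each object once
# (a real script would also filter the log entries down to the last 24 hours)
aws s3 sync "s3://$LOG_BUCKET/$LOG_PREFIX" ./logs --quiet
cat ./logs/* | awk '$8 == "REST.PUT.OBJECT" && $9 ~ /^impression/ {print $9}' | sort -u | \
while read -r key; do
    aws s3 cp "s3://$SRC_BUCKET/$key" ./downloads/
done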
I'm trying to fetch the list of objects in the S3 bucket that were uploaded recently, but only the Contents[?LastModified>='yyyy-mm-dd'] comparison works in the query. When I tried Contents[?LastModified>='yyyy-mm-dd HH:MM:SS'], it compared only the yyyy-mm-dd part, giving the list of everything updated that day; when I tried to fetch files added recently using the HH:MM:SS timestamp, it still gave all the objects added that day.
echo "###################### Previous Run : $previous_run"
dat2=$(date -d "$previous_run" "+%Y-%m-%d %H:%M:%S")
echo $dat2
get_Latest_Files()
{
#get new files from s3
json_var=$(aws s3api list-objects --bucket "$input_bucket" --prefix "$input_prefix" --query "Contents[?LastModified>='$dat2'].{Key: Key,LastModified: LastModified}" --output text)
echo "$json_var"
if [ -z "$json_var" ]
then
echo "No latest files to Process...!"
exit
else
#grep for tgz files
echo $json_var | tr " " "\n" | egrep -i "(\.tgz)|(\.tar\.gz)$" | awk -v prefix="s3://$input_bucket/" '{print prefix $0}' > input_files.txt
cat input_files.txt
fi
}
Solution:
Try using this date format:
"Contents[?LastModified>='2019-07-26T17:49:00.000Z'][].{Key: Key,LastModified: LastModified}"
The AWS Command Line Interface (CLI) allows you to upload a file to AWS Glacier, but there is also a 4GB limit on file uploads in the AWS REST API. If I need to upload a file larger than 4GB through the REST API, I have to use multipart upload.
My question is: does the AWS CLI handle file uploads larger than 4GB internally, or do I need to handle the multipart upload myself for files larger than 4GB? Can I just pass a 20GB file to the upload-archive option of the AWS CLI and it will just work? If the CLI can't handle large file uploads directly, is there any command-line tool that does it for me (freeing me from the trouble of implementing all of the checksum computation, error handling, and retry logic when a part upload fails)?
I understand that the 4GB limit is on the AWS REST API, but I could not find anything about how this limit is handled in the CLI. I could just run a test, but my upload speed is not very fast and I fear wasting a few hours before discovering that it does not work.
I'm using glacier-cmd (https://github.com/uskudnik/amazon-glacier-cmd-interface); it works pretty well, but it seems to be unmaintained lately. Sometimes it times out with big files (~50GB).
The script below works fine. I create the chunks for the tree-hash calculation and for the file-part uploads separately, and it has worked well for me.
#!/bin/bash
date1=$(date +"%s")
byteSize=1073741824
CHUNK_SIZE=1073741824
hashsize=1048576
if [[ -z "${1}" ]]; then
echo "No file provided."
exit 1
fi
ARCHIVE="/mnt/dbfiles/mahipal/splitfiles/${1}"
ARCHIVE_SIZE=`cat "${ARCHIVE}" | wc --bytes`
cd /mnt/dbfiles/mahipal/splitfiles
rm -rf TEMP
rm -rf HASH
mkdir TEMP
mkdir HASH
cd /mnt/dbfiles/mahipal/splitfiles/TEMP
date3=$(date +"%s")
split -d --bytes=${CHUNK_SIZE} "${ARCHIVE}" chunk -a 4
date4=$(date +"%s")
diff2=$(($date4-$date3))
cd /mnt/dbfiles/mahipal/splitfiles/HASH
date5=$(date +"%s")
split -d --bytes=${hashsize} "${ARCHIVE}" chunk -a 5
date6=$(date +"%s")
diff3=$(($date6-$date5))
cd /mnt/dbfiles/mahipal/splitfiles/TEMP
lastpartsize=`expr $(ls -l | tail -1 | awk '{print$5}') + 0`
lastfile=$(ls -l | tail -1 | awk '{print$9}')
cont=$(ls -l | wc -l)
cnt=`expr $cont - 2`
fileCount=$(ls -1 | grep "^chunk" | wc -l)
echo "Total parts to upload: " $fileCount
files=$(ls | grep "^chunk")
init=$(/bin/aws glacier initiate-multipart-upload --account-id - --part-size $byteSize --vault-name final_vault --archive-description "${1}_${ARCHIVE_SIZE}_${byteSize}")
echo "---------------------------------------"
uploadId=$(echo $init | jq '.uploadId' | xargs)
touch commands.txt
i=0
for f in $files
do
byteStart=$((i*byteSize))
byteEnd=$((i*byteSize+byteSize-1))
echo /bin/aws glacier upload-multipart-part --body $f --range "'"'bytes '"$byteStart"'-'"$byteEnd"'/*'"'" --account-id - --vault-name final_vault --upload-id $uploadId >> commands.txt
i=$(($i+1))
if [ "$i" == "$cnt" ]
then
byteEnd=`expr $byteEnd + 1`
byteEnd2=$((i*byteSize+lastpartsize-1))
byteSize=$lastpartsize
echo /bin/aws glacier upload-multipart-part --body $lastfile --range "'"'bytes '"$byteEnd"'-'"$byteEnd2"'/*'"'" --account-id - --vault-name final_vault --upload-id $uploadId >> commands.txt
break
fi
done
parallel --load 100% -a commands.txt --no-notice --bar
cd /mnt/dbfiles/mahipal/splitfiles/HASH
files=$(ls | grep "^chunk")
for f in $files
do
openssl dgst -sha256 -binary ${f} > "hash${f:5}"
done
echo "List Active Multipart Uploads:"
echo "Verify that a connection is open:"
/bin/aws glacier list-multipart-uploads --account-id - --vault-name final_vault >> /mnt/dbfiles/mahipal/splitfiles/TEMP/commands.txt
echo "-------------"
echo "Contents of commands.txt"
cd /mnt/dbfiles/mahipal/splitfiles/TEMP
cat commands.txt
# Calculate tree hash.
cd /mnt/dbfiles/mahipal/splitfiles/HASH
echo "Calculating tree hash..."
while true; do
COUNT=`ls hash* | wc -l`
if [[ ${COUNT} -le 2 ]]; then
TREE_HASH=$(cat hash* | openssl dgst -sha256 | awk '{print $2}')
break
fi
ls hash* | xargs -n 2 | while read PAIR; do
PAIRARRAY=(${PAIR})
if [[ ${#PAIRARRAY[@]} -eq 1 ]]; then
break
fi
cat ${PAIR} | openssl dgst -sha256 -binary > temphash
rm ${PAIR}
mv temphash "${PAIRARRAY[0]}"
done
done
cd /mnt/dbfiles/mahipal/splitfiles/TEMP
echo "Finalizing..."
/bin/aws glacier complete-multipart-upload --account-id=- --vault-name="final_vault" --upload-id="$uploadId" --checksum="${TREE_HASH}" --archive-size=${ARCHIVE_SIZE} >>commands.txt
RETVAL=$?
if [[ ${RETVAL} -ne 0 ]]; then
echo "complete-multipart-upload failed with status code: ${RETVAL}" >>commands.txt
echo "Aborting upload ${uploadId}" >>commands.txt
/bin/aws glacier abort-multipart-upload --account-id=- --vault-name="final_vault" --upload-id="${uploadId}" >>commands.txt
exit 1
fi
echo "--------------"
echo "Deleting temporary commands.txt file"
#rm commands.txt
date2=$(date +"%s")
diff=$(($date2-$date1))
echo "Total Split Duration for Chunk Part Size: $(($diff2/ 3600 )) hours $((($diff2 % 3600) / 60)) minutes $(($diff2 % 60)) seconds" >>commands.txt
echo "Total Split Duration for hash Part Size: $(($diff3/ 3600 )) hours $((($diff3 % 3600) / 60)) minutes $(($diff3 % 60)) seconds" >>commands.txt
echo "Total upload Duration: $(($diff/ 3600 )) hours $((($diff % 3600) / 60)) minutes $(($diff % 60)) seconds" >>commands.txt
echo "Done."
exit 0
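A hedged usage example (the script name below is just a placeholder): the only argument is the file name, the archive is expected to already sit in /mnt/dbfiles/mahipal/splitfiles, and aws, jq, openssl and GNU parallel must be installed.
./glacier_multipart_upload.sh mybackup.tar.gz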
I am trying to figure out what the s3cmd command would be to download files from a bucket by date. For example, I have a bucket named "test", and in that bucket there are different files from different dates. I am trying to get the files that were uploaded yesterday. What would the command be?
There is no single command that will let you do that; you have to write a script, something like this, or use an SDK that allows it. The script below is a sample that downloads S3 files older than the given age (30 days in the usage line); if you want recent files instead, flip the comparison, as shown after the script.
#!/bin/bash
# Usage: ./getOld "bucketname" "30 days"
s3cmd ls s3://$1 | while read -r line; do
createDate=`echo $line|awk {'print $1" "$2'}`
createDate=`date -d"$createDate" +%s`
olderThan=`date -d"-$2" +%s`
if [[ $createDate -lt $olderThan ]]
then
fileName=`echo $line|awk {'print $4'}`
echo $fileName
if [[ $fileName != "" ]]
then
s3cmd get "$fileName"
fi
fi
done;
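As mentioned above, a minimal variation for the original question (objects uploaded in roughly the last day) just flips the comparison; this is a sketch based on the same script:
#!/bin/bash
# Usage: ./getNew "bucketname" "1 day"
s3cmd ls s3://$1 | while read -r line; do
createDate=`echo $line|awk {'print $1" "$2'}`
createDate=`date -d"$createDate" +%s`
newerThan=`date -d"-$2" +%s`
if [[ $createDate -gt $newerThan ]]
then
fileName=`echo $line|awk {'print $4'}`
if [[ $fileName != "" ]]
then
s3cmd get "$fileName"
fi
fi
done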
I like s3cmd, but for a single-command approach I prefer the JSON output of the AWS CLI and the jq JSON processor.
The command will look like:
aws s3api list-objects --bucket "yourbucket" |\
jq '.Contents[] | select(.LastModified | startswith("yourdate")).Key' --raw-output |\
xargs -I {} aws s3 cp s3://yourbucket/{} .
Basically, what the command does is:
list all objects from the given bucket
(the interesting part) jq parses the Contents array and selects the elements whose LastModified value starts with your pattern (which you will need to change), then extracts the Key of each S3 object; --raw-output strips the quotes from the value
pass the result to an aws copy command to download each file from S3
If you want to automate it a bit further, you can get yesterday's date from the command line.
For macOS:
$ export YESTERDAY=`date -v-1d +%F`
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
For Linux (or other flavors of date that I'm less familiar with):
$ export YESTERDAY=`date -d "1 day ago" '+%Y-%m-%d' `
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
Now you get the idea; change the YESTERDAY variable if you need a different kind of date.
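For example, because the filter is just a string-prefix match on LastModified, a coarser window such as a whole month works the same way (a sketch, using GNU date):
$ export LASTMONTH=`date -d "1 month ago" '+%Y-%m'`
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$LASTMONTH\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .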
I have a bucket (versioning enabled); how can I get back objects that were accidentally permanently deleted from my bucket?
I have created a script to restore objects that have a delete marker. You'll have to invoke it like this:
sh Undelete_deletemarker.sh bucketname path/to/certain/folder
Script:
#!/bin/bash
#please provide the bucketname and path to destination folder to restore
# Remove all versions and delete markers for each object
aws s3api list-object-versions --bucket $1 --prefix $2 --output text |
grep "DELETEMARKERS" | while read obj
do
KEY=$( echo $obj| awk '{print $3}')
VERSION_ID=$( echo $obj | awk '{print $5}')
echo $KEY
echo $VERSION_ID
aws s3api delete-object --bucket $1 --key $KEY --version-id $VERSION_ID
done
Happy Coding! ;)
Thank you, Kc Bickey, this script works wonderfully! The only thing I might add for others is to make sure "$VERSION_ID" immediately follows "--version-id" in the delete-object call; the forum sometimes wraps "$VERSION_ID" onto the next line, which causes the script to error until that's corrected.
Script:
#!/bin/bash
#please provide the bucketname and path to destination folder to restore
# Remove all versions and delete markers for each object
aws s3api list-object-versions --bucket $1 --prefix $2 --output text |
grep "DELETEMARKERS" | while read obj
do
KEY=$( echo $obj| awk '{print $3}')
VERSION_ID=$( echo $obj | awk '{print $5}')
echo $KEY
echo $VERSION_ID
aws s3api delete-object --bucket $1 --key $KEY --version-id $VERSION_ID
done
With bucket versioning enabled, to permanently delete an object you need to explicitly specify the version of the object: DELETE Object versionId.
If you've done so, you cannot recover that specific version; you only get access to the previous versions.
When versioning is enabled, a simple DELETE cannot permanently delete an object. Instead, Amazon S3 inserts a delete marker in the bucket, so you can recover from that specific marker; but if the marker itself has been deleted (and you mention it was permanently deleted) you cannot recover the object.
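To illustrate with a placeholder bucket and key: a plain delete only inserts a delete marker, and deleting that marker by its version ID is what brings the object back (which is exactly what the scripts in the earlier answers automate):
# a plain DELETE on a versioned bucket only adds a delete marker
aws s3 rm s3://mybucket/path/file.txt
# look up the delete marker's VersionId, then delete the marker to "undelete" the object
aws s3api list-object-versions --bucket mybucket --prefix path/file.txt \
    --query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' --output text
aws s3api delete-object --bucket mybucket --key path/file.txt --version-id <marker-version-id>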
Did you enable Cross-Region Replication? If so, you can retrieve the object from the other region:
If a DELETE request specifies a particular object version ID to delete, Amazon S3 will delete that object version in the source bucket, but it will not replicate the deletion in the destination bucket (in other words, it will not delete the same object version from the destination bucket). This behavior protects data from malicious deletions.
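If replication was set up, recovery is then just a copy from the replica bucket back to (or instead of) the source, for example (placeholder names):
aws s3 cp s3://my-replica-bucket/path/file.txt s3://my-source-bucket/path/file.txt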
Edit: if you have versioning enabled on your bucket, you should see the Versions Hide/Show toggle button, and when Show is selected you should have an additional Version ID column, as in the screenshot from my bucket.
If your bucket's objects have white spaces in their filenames, the previous scripts may not work properly. This script takes the key including the white spaces.
#!/bin/bash
#please provide the bucketname and path to destination folder to restore
# Remove all versions and delete markers for each object
aws s3api list-object-versions --bucket $1 --prefix $2 --output text |
grep "DELETEMARKERS" | while read obj
do
KEY=$( echo $obj| awk '{indice=index($0,$(NF-1))-index($0,$3);print substr($0, index($0,$3), indice-1)}')
VERSION_ID=$( echo $obj | awk '{print $NF}')
echo $KEY
echo $VERSION_ID
aws s3api delete-object --bucket $1 --key "$KEY" --version-id $VERSION_ID
done
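Usage is the same as the earlier script, i.e. sh Undelete_deletemarker.sh bucketname path/to/certain/folder.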
This version of the script worked really well for me. I have a bucket with a directory containing 180,000 items, and this one chews through them and restores all the files that are in a directory/folder within the bucket.
If you just need to restore all the items in a bucket that aren't under a directory, you can just drop the prefix parameter.
#!/bin/bash
BUCKET_NAME=mybucketname
DIRECTORY=myfoldername
function run() {
aws s3api list-object-versions --bucket ${BUCKET_NAME} --prefix="${DIRECTORY}" --query='{Objects: DeleteMarkers[].{Key:Key}}' --output text |
while read KEY
do
if [[ "$KEY" == "None" ]]; then
continue
else
KEY=$(echo ${KEY} | awk '{$1=""; print $0}' | sed "s/^ *//g")
VERSION=$(aws s3api list-object-versions --bucket ${BUCKET_NAME} --prefix="$KEY" --query='{Objects: DeleteMarkers[].{VersionId:VersionId}}' --output text | awk '{$1=""; print $0}' | sed "s/^ *//g")
echo ${KEY}
echo ${VERSION}
fi
aws s3api delete-object --bucket ${BUCKET_NAME} --key="${KEY}" --version-id ${VERSION}
done
}
# actually invoke the function defined above
run
Note: running this script a second time will run, but it won't do anything; it just returns the same records, so it doesn't really accomplish anything further. If you had a massive bucket, I might set up 3-4 copies of the script that filter by files starting with a certain letter/number (see the sketch below); at least that way you can start working on files deeper down in the bucket.
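As a rough sketch of that idea (purely hypothetical: it assumes the script above is saved as restore.sh and changed so that DIRECTORY is read from the first argument), you could kick off one copy per leading character and let them run in parallel:
# hypothetical driver: assumes restore.sh sets DIRECTORY=$1 instead of hard-coding it
for prefix in a b c d; do
bash restore.sh "myfoldername/${prefix}" &
done
wait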