Amazon S3 filenames: Replace double spaces in all files - amazon-web-services

I have a bucket on Amazon S3 with thousands of files that contain double spaces in their names.
How can I replace all the double spaces with one space?
For example: folder1/folder2/file  name.pdf (two spaces) should become folder1/folder2/file name.pdf (one space).

Option 1: Use a spreadsheet
One 'cheat method' I sometimes use is to create a spreadsheet and then generate commands:
Extract a list of all files with double-spaces:
aws s3api list-objects --bucket bucket-name --query 'Contents[].[Key]' --output text | grep '\ \ ' >file_list.csv
Open the file in Excel
Write a formula in Column B that creates an aws s3 mv command:
="aws s3 mv 's3://bucket-name/"&A1&"' 's3://bucket-name/"&SUBSTITUTE(A1,"  "," ")&"'"
Test it by copying the output and running it in a terminal
If it works, Copy Down to the other rows, copy and paste all the commands into a shell script, then run the shell script
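For reference, the generated shell script ends up being one mv command per row, along these lines (hypothetical bucket and key names):
aws s3 mv 's3://bucket-name/folder1/folder2/file  name.pdf' 's3://bucket-name/folder1/folder2/file name.pdf'
aws s3 mv 's3://bucket-name/reports/annual  report 2021.pdf' 's3://bucket-name/reports/annual report 2021.pdf'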
Option 2: Write a script
Or, you could write a script in your favourite language (eg Python) that will do the following (a bash sketch follows the list):
List the bucket
Loop through each object
If the object Key has double-spaces:
Copy the object to a new Key
Delete the original object
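A minimal bash sketch of that approach, assuming a placeholder bucket name of bucket-name and using aws s3 mv (which performs the copy and the delete in one step):
#!/bin/bash
set -e
# List every key in the bucket, one per line
aws s3api list-objects-v2 --bucket bucket-name --query 'Contents[].[Key]' --output text |
while IFS= read -r key; do
  # Only touch keys that contain a double space
  if [[ "$key" == *"  "* ]]; then
    # Squeeze runs of spaces down to a single space to build the new key
    new_key=$(echo "$key" | tr -s ' ')
    aws s3 mv "s3://bucket-name/$key" "s3://bucket-name/$new_key"
  fi
done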

Following the idea from #john-rotenstein,
I built a bash command that does it in one line:
aws s3 ls --recursive s3://bucket-name | cut -c32- | grep "\/.*  .*" | (IFS='' ; while read -r line ; do aws s3 mv s3://bucket-name/"$line" s3://bucket-name/"$(echo "$line" | xargs)" --recursive; done)
Get the list of object paths in the bucket
Cut the output to keep only the object path
Keep only the paths that contain double spaces
Move each object to the new path with single spaces

Related

How to copy all the files created on a specific date from one bucket to another in GCS?

How can we copy all the files created on a specified date from one directory to another in GCS?
I have an archive folder from which I need to copy all the files that were created on a specified date (e.g. 20 August 2022) to another directory. We can do this by providing a list of filenames in a file and passing it as input to the gsutil cp command; however, I have 500+ files and don't have the names of all of them.
You can use the Data Transfer Service to copy data from one GCS bucket to another.
On the second step, select advanced filters and pick the absolute time range.
It should meet your expectation.
Try this command, which lists the objects with their creation timestamps, keeps the lines matching 2022-08-20, extracts the object URL (the last field), and pipes that list to gsutil cp -I (which reads the URLs to copy from stdin):
gsutil ls -l gs://{SOURCE_BUCKET}/ | grep 2022-08-20 | rev | cut -d ' ' -f1 | rev | gsutil -m cp -I gs://{DESTINATION_BUCKET}/

What is the most efficient and fastest way to get the number of records (no. of lines) in a zip file using the aws s3 command?

I want to get the number of records in a zip file that is present in an S3 bucket. Could you please tell me the fastest way to get the result?
I am running the command below, but it is not working. Please correct me if I am doing anything wrong.
aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/phd_admsrc.txt.gz - | wc -l
The above command gives me 0, but the actual count is 24.
You need to decompress the .gz file:
aws s3 cp s3://bucket/object.gz - | zcat | wc -l
This copies the S3 object to stdout, sends it through zcat to decompress it, then sends the output to wc to count the lines.

Delete files from s3 bucket based on names listed in file using cli

I'm trying to delete multiple (thousands of) files from an Amazon S3 bucket.
I have the file names listed in a file like so:
name1.jpg
name2.jpg
...
name2020201.jpg
I tried following solution:
aws s3 rm s3://test-bucket --recursive --exclude "*" --include "data/*.*"
from this question but --include only takes one arg.
I tried to get hacky and list names like --include "name1.jpg" but this does not work either.
Nor does this approach:
aws s3 rm s3://test-bucket < file.txt
Can you help?
I figured this out with this simple bash script:
#!/bin/bash
set -e
while read -r line
do
  aws s3 rm "s3://test-bucket/$line"
done <files.txt
Inspired by this answer
Answer is: delete one at a time!
The following approach is actually much faster, since my first answer took ages to complete.
My first approach was to delete one object at a time using the rm command. This is not efficient: after around 15h (!) it had deleted only around 40,000 records, which was 1/5 of the total.
This approach by Norbert Preining is way faster. As he explains, it uses the s3api method delete-objects, which can bulk delete objects in storage. This method takes a JSON object as an argument. To turn the list of file names into the required JSON object, the script uses the JSON processor jq (jq-win64.exe is the Windows build; on Linux/macOS the binary is just jq). The script takes 500 records per iteration.
cat file-with-names | while mapfile -t -n 500 ary && ((${#ary[@]})); do
  objdef=$(printf '%s\n' "${ary[@]}" | ./jq-win64.exe -nR '{Objects: (reduce inputs as $line ([]; . + [{"Key":$line}]))}')
  aws s3api --no-cli-pager delete-objects --bucket BUCKET --delete "$objdef"
done

How to make `aws s3 sync` ignore size and only use last modified time

Is there a way to make aws s3 sync only use last modified time?
i.e., even if file sizes differ, only copy if source is newer than destination
If not, is there an easy workaround to get this behaviour?
Thanks
If this sync operation is from a local folder to an S3 destination, you can use a combination of git and the S3 sync CLI's --include and --exclude options to accomplish this.
For example,
#!/bin/bash
set -ex
FILES=()
for i in $( git status -s | sed 's/\s*[a-zA-Z?]\+ \(.*\)/\1/' ); do
FILES+=( "$i" )
done
#echo "${FILES[#]}"
CMDS=()
for i in "${FILES[@]}"; do
CMDS+=("--include=$i""*")
done
#echo "${CMDS[#]}"
echo "${CMDS[#]}" | xargs aws s3 sync . s3://dest.com [-otherflags]"*"

copy data from s3 to local with prefix

I am trying to copy data from s3 to local with prefix using aws-cli.
But I am getting an error with different regexes.
aws s3 cp s3://my-bucket-name/RAW_TIMESTAMP_0506* . --profile prod
error:
no matches found: s3://my-bucket-name/RAW_TIMESTAMP_0506*
aws s3 cp s3://my-bucket/ <local directory path> --recursive --exclude "*" --include "<prefix>*"
This will copy only the files with the given prefix.
The above answers do not work properly... For example, I have many thousands of files in a directory organized by date, and I wish to retrieve only the files that are needed, so I tried the correct version per the documents:
aws s3 cp s3://mybucket/sub /my/local/ --recursive --exclude "*" --include "20170906*.png"
and it did not download the prefixed files, but began to download everything
so then I tried the sample above:
aws s3 cp s3://mybucket/sub/ . /my/local --recursive --include "20170906*"
and it also downloaded everything... It seems that this is an ongoing issue with the aws cli, and they have no intention of fixing it... Here are some workarounds that I found while Googling, but they are less than ideal.
https://github.com/aws/aws-cli/issues/1454
Updated: Added --recursive and --exclude
The aws s3 cp command will not accept a wildcard as part of the filename (key). Instead, you must use the --include and --exclude parameters to define filenames.
From: Use of Exclude and Include Filters
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
So, you would use something like:
aws s3 cp --recursive s3://my-bucket-name/ . --exclude "*" --include "RAW_TIMESTAMP_0506*"
If you don't like silent consoles, you can pipe aws ls thru awk and back to aws cp.
Example
# url must be the entire prefix that includes folders.
# Ex.: url='s3://my-bucket-name/folderA/folderB',
# not url='s3://my-bucket-name'
url='s3://my-bucket-name/folderA/folderB'
prefix='RAW_TIMESTAMP_0506'
aws s3 ls "$url/$prefix" | awk '{system("aws s3 cp '"$url"'/"$4 " .")}'
Explanation
The ls part is pretty simple. I'm using variables to simplify and shorten the command. Always wrap shell variables in double quotes to prevent disaster.
awk '{print $4}' would extract only the filenames from the ls output (NOT the S3 Key! This is why url must be the entire prefix that includes folders.)
awk '{system("echo "$4)}' would do the same thing, but it accomplishes this by calling another command. Note: I did NOT use a subshell $(...), because that would run the entire ls | awk part before starting cp. That would be slow, and it wouldn't print anything for a looong time.
awk '{system("echo aws s3 cp "$4 " .")}' would print commands that are very close to the ones we want. Pay attention to the spacing. If you try to run this, you'll notice something isn't quite right. This would produce commands like aws s3 cp RAW_TIMESTAMP_05060402_whatever.log .
awk '{system("echo aws s3 cp '$url'/"$4 " .")}' is what we're looking for. This adds the path to the filename. Look closely at the quotes. Remember we wrapped the awk parameter in single quotes, so we have to close and reopen the quotes if we want to use a shell variable in that parameter.
awk '{system("aws s3 cp '"$url"'/"$4 " .")}' is the final version. We just remove echo to actually execute the commands created by awk. Of course, I've also surrounded the $url variable with double quotes, because it's good practice.