Regex each line of stdout and push to array in shell/bash

I am using AWS CLI to ls an S3 bucket. The output is:
Austins-MacBook-Pro:~ austin$ aws s3 ls s3://obscured-bucket-name
PRE 2016-02-24-03-42/
PRE 2016-02-25-22-25/
PRE 2016-02-26-00-34/
PRE 2016-02-26-00-42/
PRE 2016-02-26-03-43/
Using either Bash or a shell script, I need to take each line, strip the leading spaces/tabs and the PRE before the prefix name, and put each prefix into an array so I can subsequently rm the oldest folder.
TL;DR:
I need to turn the output of aws s3 ls s3://obscured-bucket-name into an array of values like this: 2016-02-26-03-43/
Thanks for reading!

Under bash, you could:
mapfile myarray < <(aws s3 ls s3://obscured-bucket-name)
echo ${myarray[@]#*PRE }
2016-02-24-03-42/ 2016-02-25-22-25/ 2016-02-26-00-34/ 2016-02-26-00-42/ 2016-02-26-03-43/
or
mapfile -t myarray < <(aws s3 ls s3://obscured-bucket-name)
myarray=( "${myarray[#]#*PRE }" )
printf '<%s>\n' "${myarray[#]%/}"
<2016-02-24-03-42>
<2016-02-25-22-25>
<2016-02-26-00-34>
<2016-02-26-00-42>
<2016-02-26-03-43>
Note: the -t switch removes the trailing newline from each line read.
See help mapfile and/or man -Pless\ +/readarray bash
mapfile was introduced in 2009 with version 4 of bash.
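If the end goal is to remove the oldest prefix, here is a minimal sketch building on the array above (it assumes the bucket name from the question, and that aws s3 rm --recursive is acceptable for deleting a prefix):
mapfile -t myarray < <(aws s3 ls s3://obscured-bucket-name)
myarray=( "${myarray[@]#*PRE }" )
# date-named prefixes sort chronologically, so the first sorted element is the oldest
oldest=$(printf '%s\n' "${myarray[@]}" | sort | head -n 1)
aws s3 rm --recursive "s3://obscured-bucket-name/${oldest}"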

Try this:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//"
So if you want to get the oldest folder:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//" | sort | head -n 1

You could also use awk:
aws s3 ls <s3://obscured-bucket-name>/ | awk '/PRE/ { print $2 }' | tail -n+2
This prints just the prefix names, skipping the first (oldest) one with tail -n +2; the output can then be stored in an array variable, as in the sketch below.
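A sketch of that last step, capturing the prefixes into a bash array (bucket name as in the question):
# read each prefix name into an array, one element per line
mapfile -t folders < <(aws s3 ls s3://obscured-bucket-name/ | awk '/PRE/ { print $2 }')
echo "${folders[0]}"   # first element, i.e. the oldest date-named prefix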

Related

How to determine if a string is located in AWS S3 CSV file

I have a CSV file in AWS S3.
The file is very large: 2.5 gigabytes.
The file has a single column of strings, over 120 million of them:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine if the string xyz.com is located in the file?
I only need to know if the string is there or not; I don't need to retrieve the file.
Also, it would be great if I could pass multiple strings to search for and get back only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com']
Will return => ['xyz.com','ggg.com']
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")
res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    Expression="SELECT * FROM S3Object s WHERE _1 IN ['xyz.com', 'ggg.com']",  # _1 refers to the first column
)
See this AWS blog post for an example with output parsing.
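The same kind of query can also be run from the shell via the s3api layer of the AWS CLI; a rough sketch for a single string (bucket, key, and out.json are placeholders; an empty output file means the string was not found):
aws s3api select-object-content \
    --bucket my-bucket \
    --key my.csv \
    --expression-type SQL \
    --expression "SELECT * FROM S3Object s WHERE s._1 = 'xyz.com'" \
    --input-serialization '{"CSV": {"FileHeaderInfo": "NONE"}}' \
    --output-serialization '{"JSON": {}}' \
    out.json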
If you use the aws s3 cp command you can send the output to stdout:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The dash (-) sends the output to stdout.
Here are two examples of grep checking multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
To learn more about grep, please look at the manual: GNU Grep 3.7
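For the multi-string part of the question, one sketch is to keep the queries in a file and let grep report which of them appear in the CSV (queries.txt is an illustrative name):
printf '%s\n' xyz.com fds.com ggg.com > queries.txt
# -F: fixed strings, -x: whole-line match, -f: read patterns from a file
aws s3 cp s3://yourbucket/foo.csv - | grep -Fxf queries.txt | sort -u
The sorted, de-duplicated output is the subset of queries that exist in the file.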

How do I list WAF objects that do not have any resources using the AWS CLI?

I'd like to list all objects in WAF that do not have resources connected to them using the aws cli in my terminal.
Is there any way I can do this using the aws wafv2 list-web-acls --name --scope <value> AWS CLI command with other parameters?
Thanks
Looks like there's no command for that, so I created a script that places the results in a file. Might come in handy for anyone on here.
#!/bin/bash
#list the web acl objects with their corresponding arn and save it in a file
aws wafv2 list-web-acls --scope REGIONAL | grep "ARN" > output.txt
# Next generate only the ARN numbers and save the output in a separate file
awk -F\" '{print $4}' output.txt > input.txt
#Create a file to store ARN numbers together with their resources attached
touch resources.txt
#loop through each line and generate the resource attached to an ARN object based on its ARN no
while read -r p; do
    echo "$p" >> resources.txt && \
    aws wafv2 list-resources-for-web-acl --web-acl-arn "$p" >> resources.txt && \
    echo ------------------------ >> resources.txt
    #echo -e ' \t ' >> resources.txt
done < input.txt
#remove unwanted files
rm input.txt output.txt
#list webacl objects that do not have resources attached to them
grep -B 3 "\[\]" resources.txt | grep "webacl"
#remove any files left
rm resources.txt
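A more compact sketch of the same idea, without the temporary files, using the CLI's --query option (same REGIONAL scope as above):
# print the ARN of every web ACL that has no resources attached
for arn in $(aws wafv2 list-web-acls --scope REGIONAL --query 'WebACLs[].ARN' --output text); do
    count=$(aws wafv2 list-resources-for-web-acl --web-acl-arn "$arn" --query 'length(ResourceArns)')
    [ "$count" -eq 0 ] && echo "$arn"
done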

copy last modified files from one bucket into another bucket using gsutil

I need to copy last modified files from one GCS bucket to another.
Let's assume that input bucket is :
gs://input-bucket/object
and target bucket is :
gs://target-bucket/object
I want to copy the files that were last modified today:
I wrote:
gsutil ls -l gs://renault-ftt-vll-dfp/complex-files/PAN/TRM | sort -k2n | tail -n5
But this is not complete. My aim is to copy the files which were last modified today from the input bucket to the target bucket.
Any help with this please ?
Many thanks
It's not possible to do this easily in gsutil at the moment but it is feasible using the terminal.
gsutil -m ls -l gs://input-bucket | grep $(date -I) | sed 's/.*\(gs:\/\/\)/\1/' | gsutil cp -I gs://target-bucket/
To break it down:
gsutil -m ls -l gs://input-bucket - This will list all objects within the input-bucket
example line: 29 2018-11-27T15:43:24Z gs://input-bucket/README.md
grep $(date -I) - Finds all lines that contain today's date. (find all objects modified today)
sed 's/.*\(gs:\/\/\)/\1/' - This will remove everything up to where gs:// starts, so it changes the line from 29 2018-11-27T15:43:24Z gs://input-bucket/README.md to gs://input-bucket/README.md
gsutil cp -I gs://target-bucket/ - Copy it to the target storage bucket, the -I option allows us to input the list of files to copy from stdin.
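The same pipeline can be pointed at a different day by changing the grep; for example, yesterday with GNU date (bucket names as above):
gsutil -m ls -l gs://input-bucket | grep "$(date -I -d yesterday)" | sed 's/.*\(gs:\/\/\)/\1/' | gsutil -m cp -I gs://target-bucket/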
It's not possible to do that with gsutil, but I did this beautiful script in python for you:
import subprocess
import re
import datetime

# list the bucket, sorted by modification date
child = subprocess.Popen('gsutil ls -l gs://<YOUR_BUCKET> | sort -k2n', shell=True, stdout=subprocess.PIPE)
output = child.communicate()[0].decode()
datepattern = re.compile(r"\d{4}-\d{2}-\d{2}")
today = datetime.datetime.today().strftime('%Y-%m-%d')
for line in output.splitlines():
    matcher = datepattern.search(line)
    if matcher and matcher.group(0) == today:
        # copy each object modified today to the destination bucket
        filebucket = line[line.index("gs://") + len("gs://"):]
        child = subprocess.Popen("gsutil cp gs://" + filebucket + " gs://<YOUR_DESTINATION_BUCKET>",
                                 shell=True, stdout=subprocess.PIPE)
        print(child.communicate()[0].decode())
Just edit the <YOUR_BUCKET> and <YOUR_DESTINATION_BUCKET> fields and run it; it should copy all the files that have been modified today to your destination bucket.

How to retrieve the most recent file in cloud storage bucket?

Is this something that can be done with gsutil?
https://cloud.google.com/storage/docs/gsutil/commands/ls does not seem to mention any sorting functionality - only filtering by a date - which wouldn't work for my use case.
Hello, this still doesn't seem to exist, but there is a solution from another post.
The command used is this one:
gsutil ls -l gs://[bucket-name]/ | sort -k 2
As it allows you to sort by date, you can get the most recent result in the bucket by retrieving the last line with another pipe if you need:
gsutil ls -l gs://<bucket-name> | sort -k 2 | tail -n 2 | head -1 | cut -d ' ' -f 7
It will not work well if there are fewer than two objects in the bucket, though.
By using gsutil from a host machine this will populate the response array:
response=(`gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Or by gsutil from docker container:
response=(`docker run --name some-container-name --rm --volumes-from gcloud-config -it google/cloud-sdk:latest gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Afterwards, to get the whole response, run:
echo ${response[@]}
will print for example:
33 2021-08-11T09:24:55Z gs://some-bucket-name/filename-37.txt
Or to get a separate piece of info from the response (e.g. the filename),
echo ${response[2]}
will print the filename only
gs://some-bucket-name/filename-37.txt
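That element can be fed straight back to gsutil, for example to download the most recent object to the current directory:
gsutil cp "${response[2]}" .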
For my use case, I wanted to find the most recent directory in my bucket. I number them in ascending order (with leading zeros), so all I need to get the most recent one is this:
gsutil ls -l gs://[bucket-name] | sort | tail -n 1 | cut -d '/' -f 4
list the directory
sort alphabetically (probably unnecessary)
take the last line
tokenise it with "/" delimiter
get the 4th token, which is the directory name

looking for s3cmd download command for a certain date

I am trying to figure out what the s3cmd command would be to download files from a bucket by date. For example, I have a bucket named "test" and in that bucket there are files from different dates. I am trying to get the files that were uploaded yesterday. What would the command be?
There is no single command that will allow you to do that. You have to write a script, something like this, or use an SDK that allows you to do it. The script below is a sample that downloads S3 files older than a given age (here, 30 days).
#!/bin/bash
# Usage: ./getOld "bucketname" "30 days"
s3cmd ls s3://$1 | while read -r line; do
    createDate=`echo $line | awk '{print $1" "$2}'`
    createDate=`date -d "$createDate" +%s`
    olderThan=`date -d "-$2" +%s`
    if [[ $createDate -lt $olderThan ]]
    then
        fileName=`echo $line | awk '{print $4}'`
        echo $fileName
        if [[ $fileName != "" ]]
        then
            s3cmd get "$fileName"
        fi
    fi
done
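For the exact ask in the question (files uploaded yesterday rather than files older than a cutoff), a hedged adaptation of the same loop, using the bucket name from the question and GNU date:
#!/bin/bash
yesterday=`date -d "yesterday" +%Y-%m-%d`
s3cmd ls s3://test | while read -r line; do
    # s3cmd ls lines look like: DATE TIME SIZE s3://bucket/key
    if [[ $line == $yesterday* ]]; then
        fileName=`echo $line | awk '{print $4}'`
        s3cmd get "$fileName"
    fi
done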
I like s3cmd, but to work with a single-line command I prefer the JSON output of the AWS CLI and the jq JSON processor.
The command will look like:
aws s3api list-objects --bucket "yourbucket" |\
jq '.Contents[] | select(.LastModified | startswith("yourdate")).Key' --raw-output |\
xargs -I {} aws s3 cp s3://yourbucket/{} .
Basically, what the command does:
list all objects from the given bucket
(the interesting part) jq parses the Contents array, selects the elements whose LastModified value starts with your pattern (which you will need to change), and prints the Key of each S3 object; --raw-output strips the quotes from the value
pass the result to an aws cp command to download the files from S3
If you want to automate it a bit further, you can compute yesterday's date on the command line.
For macOS:
$ export YESTERDAY=`date -v-1d +%F`
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
For Linux (or another flavor of bash that I am not familiar with):
$ export YESTERDAY=`date -d "1 day ago" '+%Y-%m-%d' `
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
Now you get the idea; change the YESTERDAY variable if you want a different kind of date.