How to retrieve the most recent file in cloud storage bucket? - google-cloud-platform

Is this something that can be done with gsutil?
https://cloud.google.com/storage/docs/gsutil/commands/ls does not seem to mention any sorting functionality - only filtering by a date - which wouldn't work for my use case.

Hello, this still doesn't seem to exist natively, but there is a workaround.
The command used is this one:
gsutil ls -l gs://[bucket-name]/ | sort -k 2
As it sorts the listing by date, you can get the most recent object in the bucket by extracting the last line with another pipe if you need to.

gsutil ls -l gs://<bucket-name> | sort -k 2 | tail -n 2 | head -1 | cut -d ' ' -f 7
It will not work well if there are fewer than two objects in the bucket, though.
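The tail -n 2 | head -1 dance is needed because gsutil ls -l ends its output with a TOTAL summary line. A hedged alternative (just a sketch, not a gsutil feature) is to filter that summary out explicitly, which also behaves sensibly with a single object in the bucket:
gsutil ls -l gs://<bucket-name> | grep -v 'TOTAL:' | sort -k 2 | tail -n 1 | awk '{print $3}'
Here awk '{print $3}' prints the object URL regardless of how the size column is padded, which is a bit sturdier than cut -d ' ' -f 7.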

Using gsutil from a host machine, this will populate the response array:
response=(`gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Or with gsutil from a Docker container:
response=(`docker run --name some-container-name --rm --volumes-from gcloud-config -it google/cloud-sdk:latest gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Afterwards, to get the whole response, run:
echo ${response[@]}
will print for example:
33 2021-08-11T09:24:55Z gs://some-bucket-name/filename-37.txt
Or, to get a single field from the response (e.g. the filename),
echo ${response[2]}
will print the filename only
gs://some-bucket-name/filename-37.txt
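For example (purely illustrative, with a hypothetical destination bucket), that element can be fed straight back into gsutil to copy the newest object elsewhere:
gsutil cp "${response[2]}" gs://some-other-bucket-name/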

For my use case, I wanted to find the most recent directory in my bucket. I number them in ascending order (with leading zeros), so all I need to get the most recent one is this:
gsutil ls -l gs://[bucket-name] | sort | tail -n 1 | cut -d '/' -f 4
list the directory
sort alphabetically (probably unnecessary)
take the last line
tokenise it with "/" delimiter
get the 4th token, which is the directory name
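A quick illustration of the tokenisation, using a hypothetical directory name (0042):
echo 'gs://[bucket-name]/0042/' | cut -d '/' -f 4
# prints: 0042  (fields 1-3 are "gs:", the empty string between the two slashes, and the bucket name)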

Related

PCF: How to find the list of all the spaces where you have Developer privileges

If you are not an admin in Pivotal Cloud Foundry, how will you find or list all the orgs/spaces where you have developer privileges? Is there a command or menu to get that, instead of going into each space and verifying it?
Here's a script that will dump the org & space names of which the currently logged in user is a part.
A quick explanation: it calls the /v2/spaces API, which already filters to only show spaces that the currently logged-in user can see (if you run it with a user that has admin access, it will list all orgs and spaces). We then iterate over the results, take each space's organization_url field and cf curl that to get the organization name (there's a hashmap to cache results).
This script requires Bash 4+ for the hashmap support. If you don't have that, you can remove that part and it will just be a little slower. It also requires jq, and of course the cf cli.
#!/usr/bin/env bash
#
# List all spaces available to the current user
#
set -e

function load_all_pages {
  URL="$1"
  DATA=""
  until [ "$URL" == "null" ]; do
    RESP=$(cf curl "$URL")
    DATA+=$(echo "$RESP" | jq .resources)
    URL=$(echo "$RESP" | jq -r .next_url)
  done
  # dump the data
  echo "$DATA" | jq .[] | jq -s
}

function load_all_spaces {
  load_all_pages "/v2/spaces"
}

function main {
  declare -A ORGS # cache org name lookups
  # load all the spaces & properly paginate
  SPACES=$(load_all_spaces)
  # filter out the name & org_url
  SPACES_DATA=$(echo "$SPACES" | jq -rc '.[].entity | {"name": .name, "org_url": .organization_url}')
  printf "Org\tSpace\n"
  for SPACE_JSON in $SPACES_DATA; do
    SPACE_NAME=$(echo "$SPACE_JSON" | jq -r '.name')
    # take the org_url and look up the org name, cache responses for speed
    ORG_URL=$(echo "$SPACE_JSON" | jq -r '.org_url')
    ORG_NAME="${ORGS[$ORG_URL]}"
    if [ "$ORG_NAME" == "" ]; then
      ORG_NAME=$(cf curl "$ORG_URL" | jq -r '.entity.name')
      ORGS[$ORG_URL]="$ORG_NAME"
    fi
    printf "$ORG_NAME\t$SPACE_NAME\n"
  done
}

main "$@"
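A possible way to run it, assuming you are already logged in with cf login and the script is saved as list-spaces.sh (a file name chosen here purely for illustration):
chmod +x list-spaces.sh
./list-spaces.sh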

copy last modified files from one bucket into another bucket using gsutil

I need to copy last modified files from one GCS bucket to another.
Let's assume that input bucket is :
gs://input-bucket/object
and target bucket is :
gs://target-bucket/object
I want to copy the files that were last modified today.
I wrote
gsutil ls -l gs://renault-ftt-vll-dfp/complex-files/PAN/TRM | sort -k2n | tail -n5 | sort -k2n | tail -n5
But this is not complete. My aim is to copy the files which were last modified today from the input bucket to the target bucket.
Any help with this please ?
Many thanks
It's not possible to do this easily with gsutil alone at the moment, but it is feasible by chaining a few shell commands.
gsutil -m ls -l gs://input-bucket | grep $(date -I) | sed 's/.*\(gs:\/\/\)/\1/' | gsutil cp -I gs://target-bucket/
To break it down:
gsutil -m ls -l gs://input-bucket - This will list all objects within the input-bucket
example line: 29 2018-11-27T15:43:24Z gs://input-bucket/README.md
grep $(date -I) - Finds all lines that contain today's date. (find all objects modified today)
sed 's/.*\(gs:\/\/\)/\1/' - This will remove everything up to where gs:// starts, so it changes the line from 29 2018-11-27T15:43:24Z gs://input-bucket/README.md to gs://input-bucket/README.md
gsutil cp -I gs://target-bucket/ - Copy it to the target storage bucket, the -I option allows us to input the list of files to copy from stdin.
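Only the grep filter needs to change to select a different day. For example (assuming GNU date, which understands -d yesterday), yesterday's files could be copied with:
gsutil -m ls -l gs://input-bucket | grep $(date -I -d yesterday) | sed 's/.*\(gs:\/\/\)/\1/' | gsutil cp -I gs://target-bucket/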
It's not possible to do that with gsutil alone, but I made this beautiful script in Python for you:
import subprocess
import re
import datetime

# List the bucket contents, sorted by modification date (column 2)
child = subprocess.Popen('gsutil ls -l gs://<YOUR_BUCKET> | sort -k2n',
                         shell=True, stdout=subprocess.PIPE)
output = child.communicate()[0].decode()

datepattern = re.compile(r"\d{4}-\d{2}-\d{2}")
today = datetime.datetime.today().strftime('%Y-%m-%d')

for line in output.splitlines():
    matcher = datepattern.search(line)
    if matcher and matcher.group(0) == today:
        # everything after "gs://" is "<bucket>/<object>"
        filebucket = line[line.index("gs://") + len("gs://"):]
        child = subprocess.Popen('gsutil cp gs://' + filebucket + ' gs://<YOUR_DESTINATION_BUCKET>',
                                 shell=True, stdout=subprocess.PIPE)
        print(child.communicate()[0].decode())
Just edit the <YOUR_BUCKET> and <YOUR_DESTINATION_BUCKET> placeholders and run it normally; it should copy all the files that were modified today to your destination bucket.

looking for s3cmd download command for a certain date

I am trying to figure out what the s3cmd command would be to download files from a bucket by date. For example, I have a bucket named "test" and in that bucket there are different files from different dates. I am trying to get the files that were uploaded yesterday. What would the command be?
There is no single command that will allow you to do that. You have to write a script something like this, or use an SDK that allows you to do it. The sample script below downloads the S3 files older than a given age (30 days in the usage example).
#!/bin/bash
# Usage: ./getOld "bucketname" "30 days"
s3cmd ls s3://$1 | while read -r line; do
  createDate=`echo $line | awk '{print $1" "$2}'`
  createDate=`date -d "$createDate" +%s`
  olderThan=`date -d "-$2" +%s`
  if [[ $createDate -lt $olderThan ]]; then
    fileName=`echo $line | awk '{print $4}'`
    if [[ $fileName != "" ]]; then
      s3cmd get "$fileName"
    fi
  fi
done
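Since the question asks specifically for yesterday's uploads, here is a hedged variation of the same idea (assuming GNU date; getYesterday is just a name chosen for illustration) that bounds the upload time to a one-day window instead:
#!/bin/bash
# Usage: ./getYesterday "bucketname"
start=$(date -d "yesterday 00:00" +%s)   # beginning of yesterday
end=$(date -d "today 00:00" +%s)         # beginning of today
s3cmd ls s3://$1 | while read -r line; do
  uploaded=$(date -d "$(echo $line | awk '{print $1" "$2}')" +%s)
  fileName=$(echo $line | awk '{print $4}')
  if [[ -n $fileName && $uploaded -ge $start && $uploaded -lt $end ]]; then
    s3cmd get "$fileName"
  fi
done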
I like s3cmd, but to work with a single-line command I prefer the JSON output of the AWS CLI and the jq JSON processor.
The command will look like this:
aws s3api list-objects --bucket "yourbucket" |\
jq '.Contents[] | select(.LastModified | startswith("yourdate")).Key' --raw-output |\
xargs -I {} aws s3 cp s3://yourbucket/{} .
Basically, what the command does is:
list all objects from the given bucket
(the interesting part) jq parses the Contents array and selects the elements whose LastModified value starts with your pattern (which you will need to change), extracts the Key of each S3 object, and --raw-output strips the quotes from the values
pass the result to an aws copy command to download each file from S3
If you want to automate it a bit further, you can get yesterday's date from the command line.
For macOS (BSD date):
$ export YESTERDAY=`date -v-1d +%F`
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
For Linux (or other platforms with GNU date):
$ export YESTERDAY=`date -d "1 day ago" '+%Y-%m-%d' `
$ aws s3api list-objects --bucket "ariba-install" |\
jq '.Contents[] | select(.LastModified | startswith('\"$YESTERDAY\"')).Key' --raw-output |\
xargs -I {} aws s3 cp s3://ariba-install/{} .
Now you get the idea; just change the YESTERDAY variable if you want a different kind of date.
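As a side note (a hedged sketch, not part of the approach above), the same filter can be expressed without jq via the AWS CLI's built-in JMESPath --query option, reusing the hypothetical bucket and the $YESTERDAY variable:
$ aws s3api list-objects --bucket "yourbucket" \
    --query "Contents[?starts_with(LastModified, '$YESTERDAY')].Key" \
    --output text | tr '\t' '\n' |\
    xargs -I {} aws s3 cp s3://yourbucket/{} .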

Regex each line of stdout and push to array in shell/bash

I am using AWS CLI to ls an S3 bucket. The output is:
Austins-MacBook-Pro:~ austin$ aws s3 ls s3://obscured-bucket-name
PRE 2016-02-24-03-42/
PRE 2016-02-25-22-25/
PRE 2016-02-26-00-34/
PRE 2016-02-26-00-42/
PRE 2016-02-26-03-43/
Using either Bash or shell script, I need to take each line, remove the spaces/tabs and the PRE before the prefix names, and put each prefix in an array so I can use it to subsequently rm the oldest folder.
TLDR;
I need to turn the output of aws s3 ls s3://obscured-bucket-name into an array of values like this: 2016-02-26-03-43/
Thanks for reading!
Under bash, you could:
mapfile myarray < <(aws s3 ls s3://obscured-bucket-name)
echo ${myarray[@]#*PRE }
2016-02-24-03-42/ 2016-02-25-22-25/ 2016-02-26-00-34/ 2016-02-26-00-42/ 2016-02-26-03-43/
or
mapfile -t myarray < <(aws s3 ls s3://obscured-bucket-name)
myarray=( "${myarray[@]#*PRE }" )
printf '<%s>\n' "${myarray[@]%/}"
<2016-02-24-03-42>
<2016-02-25-22-25>
<2016-02-26-00-34>
<2016-02-26-00-42>
<2016-02-26-03-43>
Note: the -t switch removes the trailing newline from each line read.
See help mapfile and/or man -Pless\ +/readarray bash
mapfile was introduced in 2009 with version 4 of bash.
try this:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//"
so if you want to get the oldest folder:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//" | sort | head -n 1
You could also use awk to the rescue
aws s3 ls <s3://obscured-bucket-name>/ | awk '/PRE/ { print $2 }' | tail -n+2
This strips everything but the prefix names and drops the first (oldest) one from the output; wrap the pipeline in an array assignment if you want the remaining folders in an array variable.
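Putting it together for the stated goal of removing the oldest folder, here is a minimal sketch (assuming the same placeholder bucket and that the prefixes sort chronologically, as they do above):
mapfile -t folders < <(aws s3 ls s3://obscured-bucket-name | awk '/PRE/ { print $2 }' | sort)
oldest="${folders[0]}"   # first element is the oldest prefix, e.g. 2016-02-24-03-42/
aws s3 rm "s3://obscured-bucket-name/${oldest}" --recursive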

How to download latest version of software from same url using wget

I would like to download the latest source code of a piece of software (WRF) from a URL and automate the installation process thereafter. A sample URL is given below:
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
In the above URL, the version number may change from time to time as the developers release new versions. I would like to download the latest available version from the main script. I tried the following:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV'
With the above code, I can pull all the available versions of the software. The output of the above code is below:
http://www2.mmm.ucar.edu/wrf/src/WRFV2.0.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Var-do-not-use.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3_OVERLAY_3.0.1.1.TAR.gz
However, I am unable to go further and filter out only the latest version from this list.
Usually, for processing HTML pages I recommend Perl tools, but because this is a directory-index page, it can (probably) be done with bash tools like grep, sed and such...
The following code is divided into several smaller bash functions, for easy changes:
#!/bin/bash

# getdata - should output the html source of the page
getdata() {
  # use wget with output to stdout, or curl, or fetch
  curl -s "http://www2.mmm.ucar.edu/wrf/src/"
  #cat index.html
}

# filter_rows - get the filename and the date columns
filter_rows() {
  sed -n 's:<tr><td.*href="\([^"]*\)">.*>\([0-9].*\)</td>.*</td>.*</td></tr>:\2#\1:p' | grep "${1:-.}"
}

# sort_by_date - probably don't need comment... sorts the lines by date... ;)
sort_by_date() {
  while IFS=# read -r date file
  do
    echo "$(date --date="$date" +%s)#$file"
  done | sort -gr
}

# MAIN
file=$(getdata | filter_rows WRFV | sort_by_date | head -1 | cut -d# -f2)
echo "You want download: $file"
prints
You want download: WRFV3-Chem-3.6.1.TAR.gz
What about adding a numeric sort and taking the top line:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV[0-9]*[0-9]\.[0-9]' | sort -r -n | head -1
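A further hedged variation (assuming GNU sort, whose -V flag does version-aware ordering): stripping the .TAR.gz suffix first lets the version numbers compare cleanly, so that e.g. 3.6.1 sorts after 3.6, and the winner can be fed straight back into wget:
base="http://www2.mmm.ucar.edu/wrf/src"
latest=$(wget -qO- "$base/" \
  | grep -o 'WRFV[0-9][0-9.]*[0-9]\.TAR\.gz' \
  | sed 's/\.TAR\.gz$//' \
  | sort -uV \
  | tail -n 1)
echo "Latest release: $latest.TAR.gz"
wget "$base/$latest.TAR.gz"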