bash download the first matching regex on download page

I want to get the newest (first) download link matching a regex.
URL=https://github.com/sharkdp/bat/releases/ # Need to look at /releases/ even though the downloads are under /releases/download/$REL/$BAT
content=$(wget $URL -q -O -)
# Parse $content for string starting 'https://' and ending "_amd64.deb"
# At the moment, that will be: href="/sharkdp/bat/releases/download/v0.18.3/bat_0.18.3_amd64.deb"
# wget -O to specify the name of the file into which wget dumps the page contents, and then - to get the dump onto standard output. -q (quiet) turns off wget output.
Then I need to somehow grep / match strings that starts https:// and ends _amd64. Then I need to just pick the first one in that list.
How do I grep / match / pick first item in this way?
Once I have that, it's then easy for me to download the latest version on the page, with wget -P /tmp/ $DL

With Bash, you can use
rx='href="(/sharkdp/[^"]*_amd64\.deb)"'
if [[ "$content" =~ $rx ]]; then
echo "${BASH_REMATCH[1]}";
else
echo "No match";
fi
# => /sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb
The href="(/sharkdp/[^"]*_amd64\.deb)" regex matches href=", then captures into Group 1 (${BASH_REMATCH[1]}) /sharkdp/ + zero or more chars other than " + _amd64.deb and then just matches ".
With GNU grep, you can use
link=$(grep -oP 'href="\K/sharkdp/[^"]*_amd64\.deb' <<< "$content" | head -1)
echo "$link"
# => /sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb
Here,
href="\K/sharkdp/[^"]*_amd64\.deb - matches href=", then drops this text from the match, then matches /sharkdp/ + any zero or more chars other than " and then _amd64.deb
head -1 - only keeps the first match.
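Both matches above are site-relative paths, so the host still has to be prepended before wget can fetch the file. A minimal sketch of that last step, assuming the links always start with / and live on https://github.com; the function name first_deb_link and the sample HTML are made up for illustration:

```shell
#!/usr/bin/env bash
# Print the absolute URL of the first href ending in _amd64.deb in the given HTML.
first_deb_link() {
  local html=$1
  local rx='href="(/[^"]*_amd64\.deb)"'
  if [[ $html =~ $rx ]]; then
    # The page uses site-relative links; prepending the host is an assumption.
    printf 'https://github.com%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Sample input standing in for the real page content (hypothetical).
sample='<a href="/sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb">x</a>
<a href="/sharkdp/bat/releases/download/v0.18.3/bat_0.18.3_amd64.deb">y</a>'

DL=$(first_deb_link "$sample") && echo "$DL"
```

With the real page, the same function would be fed "$content" from the question, followed by wget -P /tmp/ "$DL".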

Related

Script to delete old files and leave the newest one in a directory in Linux

I have a backup tool that takes database backup daily and stores them with the following format:
*_DATE_*.*.sql.gz
with DATE being in YYYY-MM-DD format.
How could I delete old files (by comparing YYYY-MM-DD in the filenames) matching the pattern above, while leaving only the newest one.
Example:
wordpress_2020-01-27_06h25m.Monday.sql.gz
wordpress_2020-01-28_06h25m.Tuesday.sql.gz
wordpress_2020-01-29_06h25m.Wednesday.sql.gz
At the end only the last file, meaning wordpress_2020-01-29_06h25m.Wednesday.sql.gz, should remain.

Assuming:
The preceding substring left to _DATE_ portion does not contain underscores.
The filenames do not contain newline characters.
Then you could try the following:
for f in *.sql.gz; do
echo "$f"
done | sort -t "_" -k 2 | head -n -1 | xargs rm --
If your head and cut commands support the -z option, the following code is more robust against special characters in the filenames:
for f in *.sql.gz; do
[[ $f =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]] && \
printf "%s\t%s\0" "${BASH_REMATCH[1]}" "$f"
done | sort -z | head -z -n -1 | cut -z -f 2- | xargs -0 rm --
It makes use of the NUL character as a line delimiter and allows any special characters in the filenames.
It first extracts the DATE portion from the filename, then prepends it to the filename as the first field, separated by a tab character.
Then it sorts the entries by the DATE string, excludes the last (newest) one, cuts the first field off to recover the filenames, and removes those files.
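To try the NUL-delimited pipeline safely, it can be run in a scratch directory first. A minimal sketch, assuming GNU sort, head and cut (for -z) and using the three example filenames from the question; the directory comes from mktemp and is removed at the end:

```shell
#!/usr/bin/env bash
# Scratch directory holding empty stand-ins for the three example backups.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch wordpress_2020-01-27_06h25m.Monday.sql.gz \
      wordpress_2020-01-28_06h25m.Tuesday.sql.gz \
      wordpress_2020-01-29_06h25m.Wednesday.sql.gz

# Same pipeline as above: "DATE<TAB>name" records separated by NUL bytes.
for f in *.sql.gz; do
  [[ $f =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]] &&
    printf '%s\t%s\0' "${BASH_REMATCH[1]}" "$f"
done | sort -z | head -z -n -1 | cut -z -f 2- | xargs -0 rm --

remaining=$(ls)
echo "$remaining"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

Only the file with the newest date in its name should be left before the cleanup step.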

I found this in another question. Although it serves the purpose, it selects files by modification time rather than by the date in their filenames.
ls -tp | grep -v '/$' | tail -n +2 | xargs -I {} rm -- {}

Since the pattern (glob) you present us is very generic, we have to make an assumption here.
assumption: the date pattern, is the first sequence that matches the regex [0-9]{4}-[0-9]{2}-[0-9]{2}
Files are of the form: constant_string_<DATE>_*.sql.gz
a=( *.sql.gz )
unset a[${#a[@]}-1]
rm "${a[@]}"
Files are of the form: *_<DATE>_*.sql.gz
Using this, it is easily done in the following way:
a=( *.sql.gz );
cnt=0; ref="0000-00-00"; for f in "${a[@]}"; do
[[ "$f" =~ [0-9]{4}(-[0-9]{2}){2} ]] \
&& [[ "$BASH_REMATCH" > "$ref" ]] \
&& ref="${BASH_REMATCH}" && refi=$cnt
((++cnt))
done
unset "a[refi]" # drop the newest file (index refi) from the list to delete
rm "${a[@]}"
[[ expression ]] <snip> An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression
source: man bash
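The BASH_REMATCH indexing described in the manual excerpt can be seen on one of the example filenames. A small sketch; the regex with three separate capture groups is chosen here for illustration and is not the one used in the answer:

```shell
#!/usr/bin/env bash
f='wordpress_2020-01-29_06h25m.Wednesday.sql.gz'
rx='_([0-9]{4})-([0-9]{2})-([0-9]{2})_'

if [[ $f =~ $rx ]]; then
  whole=${BASH_REMATCH[0]}   # text matched by the entire regex
  year=${BASH_REMATCH[1]}    # first parenthesized group
  month=${BASH_REMATCH[2]}   # second group
  day=${BASH_REMATCH[3]}     # third group
  printf '%s\n' "$whole" "$year" "$month" "$day"
fi
```

Keeping the regex in a variable and expanding it unquoted on the right of =~ avoids the quoting rule mentioned in the excerpt, where quoted parts are matched as literal strings.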

Go to the folder where you have the *_DATE_*.*.sql.gz files and try the command below
ls -ltr *.sql.gz|awk '{print $9}'|awk '/2020/{print $0}' |xargs rm
or
use
`ls -ltr |grep '2019-05-20'|awk '{print $9}'|xargs rm`
Replace /2020/ with the pattern you want to delete; for example, to match 2020-05-01, use /2020-05-01/.

Using two for loop
#!/bin/bash
shopt -s nullglob ##: This might not be needed but just in case
##: If there are no files the glob will not expand
latest=
allfiles=()
unwantedfiles=()
for file in *_????-??-??_*.sql.gz; do
if [[ $file =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]]; then
allfiles+=("$file")
[[ $file > $latest ]] && latest=$file ##: The > is magical inside [[
fi
done
n=${#allfiles[@]}
if ((n <= 1)); then ##: No files or only one file don't remove it!!
printf '%s\n' "Found ${n:-0} ${allfiles[@]:-*sql.gz} file, bye!"
exit 0 ##: Exit gracefully instead
fi
for f in "${allfiles[@]}"; do
[[ $latest == $f ]] && continue ##: Skip the latest file in the loop.
unwantedfiles+=("$f") ##: Save all files in an array without the latest.
done
printf 'Deleting the following files: %s\n' "${unwantedfiles[*]}"
echo rm -rf "${unwantedfiles[@]}"
Relies heavily on the > test operator inside [[
You can create a new file with lower dates and should still be good.
The echo is there just to see what's going to happen. Remove it if you're satisfied with the output.
I'm actually using this script via cron now, except for the *.sql.gz part, since I only have directories to match but with the same date format, so I have ????-??-??/ as the glob and only ([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) as the regex pattern.
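The > comparison inside [[ is a plain string comparison, so in the script above it orders whole filenames; that matches date order only while every file shares the same prefix. A hedged variant that compares just the extracted date is sketched below; the function name latest_by_date and the drupal/zeta filenames are invented for illustration:

```shell
#!/usr/bin/env bash
# Print the argument whose embedded YYYY-MM-DD date is the greatest.
latest_by_date() {
  local f date best_date='' best=''
  for f in "$@"; do
    [[ $f =~ _([0-9]{4}-[0-9]{2}-[0-9]{2})_ ]] || continue
    date=${BASH_REMATCH[1]}
    # ISO dates sort chronologically when compared as strings.
    if [[ $date > $best_date ]]; then
      best_date=$date
      best=$f
    fi
  done
  [[ -n $best ]] && printf '%s\n' "$best"
}

newest=$(latest_by_date \
  'wordpress_2020-01-29_06h25m.Wednesday.sql.gz' \
  'drupal_2020-02-01_06h25m.Saturday.sql.gz' \
  'zeta_2019-12-31_06h25m.Tuesday.sql.gz')
echo "$newest"
```

Comparing whole names would pick the zeta file here because of its prefix; comparing the dates picks the drupal one.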

You can use my Python script "rotate-archives" for smart delete backups. (https://gitlab.com/k11a/rotate-archives).
An example of starting archives deletion:
rotate-archives.py test_mode=off age_from-period-amount_for_last_timeslot=7-5,31-14,365-180-5 archives_dir=/mnt/archives
As a result, archives from 7 to 30 days old are kept at intervals of 5 days, archives from 31 to 364 days old at intervals of 14 days, and archives 365 days old or older at intervals of 180 days, up to 5 of them.
However, this requires moving the _date_ part to the beginning of the file name, or letting the script add the current date to new files.

How to find and replace special chars within a string in zsh

I'm trying to build a secure copy protocol quick function. When I run the command it will work with a single file OR with the entire directory, but as soon as I put a /* after the local_repo it returns zsh: no matches found: hackingedu/*.
If I put the command scp hackingedu\/\* hackingedu the command works properly. I think I'm on the right track, but can't get it to work.
contains() {
string="$1"
substring="$2"
if test "${string#*$substring}" != "$string"
then
# echo '$substring is in $string'
return 1 # $substring is in $string
else
# echo '$substring is not in $string'
return 0 # $substring is not in $string
fi
}
# Quickly scp files in Workspace to Remote
function scp() {
local_repo="$1"
remote_repo="$2"
# find all the `*` and replace with `/*`
if [ contains $local_repo '*' ]; then
# replace all instances of * with \* <- HOW TO DO
fi
command scp -r $LOCAL_REPOS/$local_repo $ALEX_SERVER_UNAME@$ALEX_SERVER_PORT:$ALEX_REMOTE_ROOT_PATH/$remote_repo
# Description: $1: Local Repo | $2: Remote Repo
# Define ex: scpp local/path/to/file/or/directory/* remote/path/to/file/or/directory/*
# Live ex: scpp alexcory/index.php alexcory/index.php
# Live ex: scpp alexcory/* alexcory/*
#
# This Saves you from having long commands that look like this:
# scp -r ~/Google\ Drive/server/Dev/git\ repositories/hackingedu/* alexander@alexander.com:/home2/alexander/public_html/hackingedu/beta
}
Command trying to execute: scp -r ~/Google\ Drive/server/Dev/git\ repositories/hackingedu/* alexander@alexander.com:/home2/alexander/public_html/hackingedu/beta
Any ideas on how to find and replace an *? If there's a better way to do this please do tell! :)
If you know how to do this in bash I would like your input as well!
References:
How do you tell if a string contains another string in Unix shell scripting?
ZSH Find command replacement
ZSH Find command replacement 2
Using wildcards in commands with zsh

You can either prefix your scp call using noglob (which will turn off globbing for that command, e.g. noglob ls *) or use
autoload -U url-quote-magic
zle -N self-insert url-quote-magic
zstyle -e :urlglobber url-other-schema '[[ $words[1] == scp ]] && reply=("*") || reply=(http https ftp)'
the above should make zsh auto quote * when you use scp.
[...]
BTW, in any case, you should learn that you can easily quote special characters using ${(q)variable_name}, e.g.
% foo='*&$%normal_chars'
% echo $foo
*&$%normal_chars
% echo ${(q)foo}
\*\&\$%normal_chars
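Since the question also asks about bash, here is a rough bash counterpart to the ${(q)...} quoting shown above. It is only a sketch: the variable names are illustrative, and it does not change the fact that the shell expands an unquoted * before the function ever sees it, which is what noglob or quoting the argument addresses.

```shell
#!/usr/bin/env bash
# Succeeds (status 0) when $2 occurs literally inside $1.
contains() {
  [[ $1 == *"$2"* ]]
}

local_repo='hackingedu/*'

# Replace every * with \* using parameter expansion.
if contains "$local_repo" '*'; then
  escaped=${local_repo//\*/\\*}
fi

# printf %q quotes all shell-special characters, not only *.
quoted=$(printf '%q' "$local_repo")

printf '%s\n' "$escaped" "$quoted"
```

Note that contains returns 0 on a match here, the usual shell convention, which is the opposite of the return values in the question's version.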

Filter words starting and ending with hyphen but not when it's found in the middle

I have a list of words I want to filter: only those that start or end with a hyphen, but not those with a hyphen in the middle. That is, to filter entries like: "a-" or "-cefalia" but not "castellano-manchego".
I have tried many options, and the closest thing I've found is grep -E '*\-' minilemario.txt; however, it matches every line containing a hyphen. Could you please provide me with a solution?
a
a-
aarónico
aaronita
amuzgo
an-
-án
ana
-ana
ana-
anabaptismo
anabaptista
blablá
bla-bla-bla
blanca
castellano
castellanohablante
castellano-leonés
castellano-manchego
castellanoparlante
cedulario
cedulón
-céfala
cefalalgia
cefalálgico
cefalea
-cefalia
cefálica
cefálico
cefalitis
céfalo
-céfalo
cefalópodo
cefalorraquídeo
cefalotórax
cefea
ciabogar
cian
cian-
cianato
cianea
cianhídrico
cianí
ciánico
cianita
ciano-
cianógeno
cianosis
cianótico
cianuro
ciar
ciática
ciático
zoo
zoo-
zoófago

Using grep, say:
grep -E '^-|-$' filename
to get the words starting or ending with -. And
grep -v -E '^-|-$' filename
to exclude the words starting or ending with -.
^ and $ are anchors denoting the start and end of line respectively. You used '*\-' which would match anything followed by - (it doesn't say that - is at the end of the line).
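The effect of the two anchored alternatives can be checked on a handful of words. A small sketch using a reduced, ASCII-only word list made up from the question's sample:

```shell
#!/usr/bin/env bash
# A few words from the sample list, one per line.
words=$(printf '%s\n' a a- -ana bla-bla-bla castellano-manchego zoo-)

# Lines that start with - or end with -.
edge=$(grep -E '^-|-$' <<< "$words")

# Everything else, including words with a hyphen only in the middle.
inner=$(grep -v -E '^-|-$' <<< "$words")

echo "edge hyphen:"
echo "$edge"
echo "others:"
echo "$inner"
```

Words such as bla-bla-bla land in the second group because neither anchor matches a hyphen in the middle of the line.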

Here is a bash only solution. Please see the comments for details:
#!/usr/bin/env bash
# Assign the first argument (e.g. a textfile) to a variable
input="$1"
# Bash 4 - read the data line by line into an array
readarray -t data < "$input"
# Bash 3 - read the data line by line into an array
#while read line; do
# data+=("$line")
#done < "$input"
# For each item in the array do something
for item in "${data[@]}"; do
# Line starts with "-" or ends with "-"
[[ "$item" =~ ^-|-$ ]] && echo "$item"
done
This will produce the following output:
$ ./script input.txt
a-
an-
-án
-ana
ana-
-céfala
-cefalia
-céfalo
cian-
ciano-
zoo-

How to check an input string in bash it's in version format (n1.n2.n3)

I've written a script that updates a version in a certain file. I need to check that the user's input is in version format so I don't end up adding numbers that are not needed to those important files. The way I have done it is by adding a new variable, version_checked, in which I delete anything matching my regex pattern, followed by an if check.
version=$1
version_checked=$(echo $version | sed -e '/[0-9]\+\.[0-9]\+\.[0-9]/d')
if [[ -z $version_checked ]]; then
echo "$version is the right format"
else
echo "$version_checked is not in the right format, please use XX.XX.XX format (ie: 4.15.3)"
exit
fi
That works fine for XX.XX and XX.XX.XX but it also allows XX.XX.XX.XX and XX.XX.XX.XX.XX etc.. so if user makes a mistake it will input wrong data on the file. How can I get the sed regex to ONLY allow 3 pairs of numbers separated by a dot?

Change your regex from:
/[0-9]\+\.[0-9]\+\.[0-9]/
to this:
/^[0-9]\+\.[0-9]\+\.[0-9]\+$/

You can do this with bash pattern matching:
$ for version in 1.2 1.2.3 1.2.3.4; do
printf "%s\t" $version
[[ $version == +([0-9]).+([0-9]).+([0-9]) ]] && echo y || echo n
done
1.2 n
1.2.3 y
1.2.3.4 n
If you need each group of digits to be exactly 2 digits:
[[ $version == [0-9][0-9].[0-9][0-9].[0-9][0-9] ]]
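The same check can also be written with the =~ operator and an anchored extended regex, which avoids the sed round trip entirely. A minimal sketch; the function name is_version is an assumption, not something from the question:

```shell
#!/usr/bin/env bash
# Succeeds only for exactly three dot-separated groups of one or more digits.
is_version() {
  local rx='^[0-9]+\.[0-9]+\.[0-9]+$'
  [[ $1 =~ $rx ]]
}

for v in 1.2 1.2.3 1.2.3.4 ..; do
  if is_version "$v"; then
    echo "$v ok"
  else
    echo "$v rejected"
  fi
done
```

The ^ and $ anchors are what reject 1.2.3.4, and using + rather than * is what rejects empty groups such as "..".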

sed regex to match ['', 'WR' or 'RN'] + 2-4 digits

I'm trying to do some conditional text processing on Unix and struggling with the syntax. I want to achieve
Find the first 2, 3 or 4 digits in the string
if 2 characters before the found digits are 'WR' (could also be lower case)
Variable = the string we've found (e.g. WR1234)
Type = "work request"
else
if 2 characters before the found digits are 'RN' (could also be lower case)
Variable = the string we've found (e.g. RN1234)
Type = "release note"
else
Variable = "WR" + the string we've found (Prepend 'WR' to the digits)
Type = "Work request"
fi
fi
I'm doing this in a Bash shell on Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Thanks in advance,
Karl

I'm not sure how you read in your strings but this example should help you get there. I loop over 4 example strings, WR1234 RN456 7890 PQ2342. You didn't say what to do if the string doesn't match your expected format (PQ2342 in my example), so my code just ignores it.
#!/bin/bash
for string in "WR1234 - Work Request Name.doc" "RN5678 - Release Note.doc"; do
[[ $string =~ ^([^0-9]*)([0-9]*).*$ ]]
case ${BASH_REMATCH[1]} in
"WR")
var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
type="work request"
echo -e "$var\t-- $type"
;;
"RN")
var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
type="release note"
echo -e "$var\t-- $type"
;;
"")
var="WR${BASH_REMATCH[2]}"
type="work request"
echo -e "$var\t-- $type"
;;
esac
done
Output
$ ./rematch.sh
WR1234 -- work request
RN5678 -- release note
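The script above only handles upper-case prefixes at the start of the string. A hedged sketch that follows the question's pseudocode more closely (first run of 2 to 4 digits, prefix in either case, everything else treated as a work request) might look like this; the function name classify and the tab-separated output format are choices made for this example:

```shell
#!/usr/bin/env bash
# Print "<ID><TAB><type>" for the first run of 2-4 digits in the argument.
classify() {
  # Optional two-letter prefix, then 2 to 4 digits.
  local rx='([A-Za-z]{2})?([0-9]{2,4})'
  [[ $1 =~ $rx ]] || return 1
  local prefix=${BASH_REMATCH[1]} digits=${BASH_REMATCH[2]}
  case $prefix in
    [Ww][Rr]) printf 'WR%s\twork request\n' "$digits" ;;
    [Rr][Nn]) printf 'RN%s\trelease note\n' "$digits" ;;
    # Any other prefix, or none: prepend WR as the question asks.
    *)        printf 'WR%s\twork request\n' "$digits" ;;
  esac
}

classify 'WR1234 - Work Request Name.doc'
classify 'rn456'
classify '2456'
```

The case patterns match both letter cases explicitly, so the sketch does not rely on shell options or on bash 4 case-conversion syntax.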

I like to use perl -pe instead of sed because Perl has such expressive regular expressions. The following is a bit verbose for the sake of instruction.
example.txt:
WR1234 - Work Request name.doc
RN456
rn456
WR7890 - Something else.doc
wr789
2456
script.sh:
#! /bin/bash
# search for 'WR' or 'RN' followed by 2-4 digits and anything else, but capture
# just the part we care about
records="`perl -pe 's/^((WR|RN)([\d]{2,4})).*/\1/i' example.txt`"
# now that you've filtered out the records, you can do something like replace
# WR's with 'work request'
work_requests="`echo \"$records\" | perl -pe 's/wr/work request /ig' | perl -pe 's/rn/release note /ig'`"
# or add 'WR' to lines w/o a listing
work_requests="`echo \"$work_requests\" | perl -pe 's/^(\d)/work request \1/'`"
# or make all of them uppercase
records_upper=`echo $records | tr '[:lower:]' '[:upper:]'`
# or count WR's
wr_count=`echo "$records" | grep -i wr | wc -l`
echo count $wr_count
echo "$work_requests"

#!/bin/bash
string="RN12344 - Work Request Name.doc"
echo "$string" | gawk --re-interval '
{
if(match ($0,/(..)[0-9]{4}\>/,a ) ){
if (a[1]=="WR"){
type="Work release"
}else if ( a[1] == "RN" ){
type = "Release Notes"
}
print type
}
}'
