can't extract a substring with regex

I'm trying to write a prepare-commit-msg hook for git. The script should do following steps :
Get the current git branch name (working)
Extract the issue-id (not working)
Check if the issue-id is already in the commit msg
If not, insert [issue-id] before the commit message
The issue-id has this pattern [a-zA-Z]+-\d+ and the branch name should be something like feature/issue-id-my-small-description.
But for now, the extraction part is not ok...
Here is my prepare-commit-msg script :
# Regex used to extract the issue id
REGEX_ISSUE_ID="s/([a-zA-Z]+-\d+)//"
# Find current branch name
BRANCH_NAME=$(git symbolic-ref --short HEAD)
# Extract issue id from branch name
ISSUE_ID= $BRANCH_NAME | sed -r $REGEX_ISSUE_ID
# Check if the issue id is already in the msg
ISSUE_IN_COMMIT=$(grep -c "\[$ISSUE_ID\]" $1)
# Check if branch name is not null and if the issue id is already in the commit msg
if [ -n "$BRANCH_NAME" ] && ! [[ $ISSUE_IN_COMMIT -ge 1 ]]; then
# Prefix with the issue id surrounded with brackets
sed -i.bak -e "1s/^/[$ISSUE_ID] /" $1
fi
Edit to add in-/output example
Input $1 is the git commit message which is something like
fix bug on login
or
fix MyIssue-234 which is a bug on login
Output should be the input with the issue id i.e. :
[MyIssue-123] fix bug on login

I'm not sure exactly what you are doing or why, but this is the closest I got by fixing what I thought needed correcting in your code:
# Regex used to extract the issue id
REGEX_ISSUE_ID="s/\[([a-zA-Z]+-[0-9]+)\].*/\1/"
# Find current branch name
BRANCH_NAME=$(git symbolic-ref --short HEAD)
if [[ -z "$BRANCH_NAME" ]]; then
echo "No branch name... "; exit 1
fi
# Extract issue id from branch name
ISSUE_ID=$(echo "$BRANCH_NAME" | sed -r "$REGEX_ISSUE_ID")
# Check if the issue id is already in the msg
ISSUE_IN_COMMIT=$(echo "$@" | grep -c "^\[*$ISSUE_ID\]*")
# Check if branch name is not null and if the issue id is already in the commit msg
if [[ -n "$BRANCH_NAME" ]]; then
if [[ $ISSUE_IN_COMMIT -gt 0 ]]; then
shift # Drop the issue id from the msg
fi
# Prefix with the issue id surrounded with brackets
MESSAGE="[$ISSUE_ID] $@"
fi
echo "$MESSAGE"
where $@ is all the words that you provide after "fix" (e.g. "$@" = "bug" "on" "login"). I hope the rest becomes clear once you compare it with your original code.
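For comparison, here is a minimal sketch of just the extraction and prefixing steps, assuming the issue id appears somewhere in the branch name (as in feature/MyIssue-123-my-small-description). The function names extract_issue_id and prefix_with_issue_id are made up for this illustration, and grep -o is assumed to be available (GNU and BSD grep both provide it). Note that POSIX regular expressions have no \d, so the digits are written as [0-9].

```shell
#!/bin/sh
# Hypothetical helper: print the first issue id found in a branch name.
extract_issue_id() {
    # -o prints only the matched text, -E selects extended regex syntax.
    printf '%s\n' "$1" | grep -oE '[a-zA-Z]+-[0-9]+' | head -n 1
}

# Hypothetical helper: prefix a message with [issue-id] unless already present.
prefix_with_issue_id() {
    case $2 in
        *"[$1]"*) printf '%s\n' "$2" ;;
        *)        printf '[%s] %s\n' "$1" "$2" ;;
    esac
}

issue_id=$(extract_issue_id "feature/MyIssue-123-my-small-description")
prefix_with_issue_id "$issue_id" "fix bug on login"
# prints: [MyIssue-123] fix bug on login
```

In a real hook the message lives in the file passed as $1, so the result would be written back to that file rather than printed.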

Related

Script to delete old files and leave the newest one in a directory in Linux

I have a backup tool that takes database backup daily and stores them with the following format:
*_DATE_*.*.sql.gz
with DATE being in YYYY-MM-DD format.
How could I delete old files (by comparing YYYY-MM-DD in the filenames) matching the pattern above, while leaving only the newest one.
Example:
wordpress_2020-01-27_06h25m.Monday.sql.gz
wordpress_2020-01-28_06h25m.Tuesday.sql.gz
wordpress_2020-01-29_06h25m.Wednesday.sql.gz
At the end, only the last file, meaning wordpress_2020-01-29_06h25m.Wednesday.sql.gz, should remain.

Assuming:
The preceding substring left to _DATE_ portion does not contain underscores.
The filenames do not contain newline characters.
Then would you try the following:
for f in *.sql.gz; do
echo "$f"
done | sort -t "_" -k 2 | head -n -1 | xargs rm --
If your head and cut commands support -z option, following code will be more robust against special characters in the filenames:
for f in *.sql.gz; do
[[ $f =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]] && \
printf "%s\t%s\0" "${BASH_REMATCH[1]}" "$f"
done | sort -z | head -z -n -1 | cut -z -f 2- | xargs -0 rm --
It makes use of the NUL character as a line delimiter and allows any special characters in the filenames.
It first extracts the DATE portion from the filename, then prepends it to the filename as the first field, separated by a tab character.
Then it sorts the entries by the DATE string, excludes the last (newest) one, retrieves the filename by cutting off the first field, and removes those files.
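The same decorate, sort, undecorate idea can be tried as a dry run that only lists the files that would be removed. This is a rough sketch, not the code above: list_old_backups is a made-up name, it reads one filename per line (so it assumes no newlines in filenames), and it uses awk, sed and cut instead of the NUL-delimited tools.

```shell
#!/bin/sh
# Hypothetical helper: read file names on stdin (one per line) and print every
# name except the one carrying the newest _YYYY-MM-DD_ date.
list_old_backups() {
    awk 'match($0, /_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_/) {
             # Put the 10-character date in front, separated by a tab.
             print substr($0, RSTART + 1, 10) "\t" $0
         }' |
        sort |      # ISO dates sort chronologically as plain text
        sed '$d' |  # drop the last line, i.e. the newest file
        cut -f 2-   # strip the date column again
}

printf '%s\n' \
    wordpress_2020-01-29_06h25m.Wednesday.sql.gz \
    wordpress_2020-01-27_06h25m.Monday.sql.gz \
    wordpress_2020-01-28_06h25m.Tuesday.sql.gz | list_old_backups
```

Once the printed list looks right, it could be fed to xargs rm --, with the same caveat about unusual characters in filenames.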

I found this in another question. Although it serves the purpose, it picks the files by modification time rather than by the date in their filenames.
ls -tp | grep -v '/$' | tail -n +2 | xargs -I {} rm -- {}

Since the pattern (glob) you present us is very generic, we have to make an assumption here.
assumption: the date pattern, is the first sequence that matches the regex [0-9]{4}-[0-9]{2}-[0-9]{2}
Files are of the form: constant_string_<DATE>_*.sql.gz
a=( *.sql.gz )
unset 'a[${#a[@]}-1]'
rm "${a[@]}"
Files are of the form: *_<DATE>_*.sql.gz
Using this, it is easily done in the following way:
a=( *.sql.gz );
cnt=0; ref="0000-00-00"; for f in "${a[@]}"; do
[[ "$f" =~ [0-9]{4}(-[0-9]{2}){2} ]] \
&& [[ "$BASH_REMATCH" > "$ref" ]] \
&& ref="${BASH_REMATCH}" && refi=$cnt
((++cnt))
done
unset 'a[refi]'
rm "${a[@]}"
[[ expression ]] <snip> An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression
source: man bash
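To make the BASH_REMATCH indexing concrete, here is a small sketch (bash only). The function name first_date is invented for this example. It pulls the first YYYY-MM-DD date out of a string and prints the whole match followed by the three captured groups.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first YYYY-MM-DD date embedded in a string.
first_date() {
    local re='([0-9]{4})-([0-9]{2})-([0-9]{2})'
    [[ $1 =~ $re ]] || return 1
    # Index 0 is the whole match; 1..3 are the parenthesized groups.
    printf '%s year=%s month=%s day=%s\n' \
        "${BASH_REMATCH[0]}" "${BASH_REMATCH[1]}" \
        "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

first_date "wordpress_2020-01-29_06h25m.Wednesday.sql.gz"
# prints: 2020-01-29 year=2020 month=01 day=29
```

Keeping the pattern in a variable and leaving it unquoted on the right-hand side of =~ avoids the quoting rule mentioned above, where quoted parts are matched as literal strings.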

Go to the folder that contains the *_DATE_*.*.sql.gz files and try the command below:
ls -ltr *.sql.gz|awk '{print $9}'|awk '/2020/{print $0}' |xargs rm
or use:
ls -ltr |grep '2019-05-20'|awk '{print $9}'|xargs rm
Replace /2020/ with the pattern you want to delete; for example, to delete the files dated 2020-05-01, use /2020-05-01/.

Using two for loop
#!/bin/bash
shopt -s nullglob ##: This might not be needed but just in case
##: If there are no files the glob will not expand
latest=
allfiles=()
unwantedfiles=()
for file in *_????-??-??_*.sql.gz; do
if [[ $file =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]]; then
allfiles+=("$file")
[[ $file > $latest ]] && latest=$file ##: The > is magical inside [[
fi
done
n=${#allfiles[@]}
if ((n <= 1)); then ##: No files or only one file don't remove it!!
printf '%s\n' "Found ${n:-0} ${allfiles[@]:-*sql.gz} file, bye!"
exit 0 ##: Exit gracefully instead
fi
for f in "${allfiles[@]}"; do
[[ $latest == $f ]] && continue ##: Skip the latest file in the loop.
unwantedfiles+=("$f") ##: Save all files in an array without the latest.
done
printf 'Deleting the following files: %s\n' "${unwantedfiles[*]}"
echo rm -rf "${unwantedfiles[@]}"
Relies heavily on the > test operator inside [[
You can create a new file with lower dates and should still be good.
The echo is there just to see what's going to happen. Remove it if you're satisfied with the output.
I'm actually using this script via cron now, except for the *.sql.gz part, since I only have directories to match but with the same date format. So I use ????-??-??/ as the glob and only ([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) as the regex pattern.

You can use my Python script "rotate-archives" to delete backups smartly (https://gitlab.com/k11a/rotate-archives).
An example of starting archives deletion:
rotate-archives.py test_mode=off age_from-period-amount_for_last_timeslot=7-5,31-14,365-180-5 archives_dir=/mnt/archives
As a result, the following archives remain: those 7 to 30 days old at intervals of 5 days, those 31 to 364 days old at intervals of 14 days, and those 365 days or older at intervals of 180 days, limited to 5 archives.
However, this requires moving the _date_ part to the beginning of the file name, or letting the script add the current date to new files.

SVN pre-commit hook logic

I'm adding logic to my svn pre-commit hook so that the commit fails if the commit message contains QA (in upper case, preceded by a space). But it's not working. Kindly help me write it properly.
REPOS="$1"
TXN="$2"
# Make sure that the log message contains some text.
SVNLOOK=/usr/bin/svnlook
LOGMSG=$($SVNLOOK log -t "$TXN" "$REPOS")
# check if any comment has supplied by the commiter
if [ -z "$LOGMSG" ]; then
echo "Your commit was blocked because it have no comments." 1>&2
exit 1
fi
#check minimum size of text
if [ ${#LOGMSG} -lt 15 ]; then
echo "Your Commit was blocked because the comments does not meet minimum length requirements (15 letters)." 1>&2
exit 1
fi
# get TaskID by regex
TaskID=$(expr "$LOGMSG" : '\([#][0-9]\{1,9\}[:][" "]\)[A-Za-z0-9]*')
# Check if task id was found.
if [ -z "$TaskID" ]; then
echo "" 1>&2
echo "No Task id found in log message \"$LOGMSG\"" 1>&2
echo "" 1>&2
echo "The TaskID must be the first item on the first line of the log message." 1>&2
echo "" 1>&2
echo "Proper TaskID format--> #123- 'Your commit message' " 1>&2
exit 1
fi
#Check that QA should not be present in log message.
QA=$(expr "$LOGMSG" : '\(*[" "][QA][" "]\)')
if [ "$QA" == "QA" ]; then
echo "" 1>&2
echo "Your log message \"$LOGMSG\" must not contain QA in upper case." 1>&2
echo "" 1>&2
exit 1
fi

The regex is incorrect:
\( starts a capturing group in expr, but you don't need a capturing group for your task
When * follows a \( in a pattern, it tries to match a literal *
[QA] matches a single character, which can be Q or A
The pattern of expr must match from the start of the string
As it is, the regex doesn't correspond to your requirement.
Even if the above points are fixed, a pattern such as " QA ", that is, "QA" with spaces around it, will not match commit messages like this:
"Fix the build of QA"
"Broken in QA, temporarily"
... and so on...
That is, instead of "QA" with spaces around, you probably want to match QA with word boundaries around.
This is easy to do using grep -w QA.
As you clarified in a comment, you really want a space before the "Q".
In that case the -w flag of grep is not suitable,
because that requires a word boundary at both sides of patterns.
There is another way to match word boundaries,
using \< for word start and \> for word end.
So to have a space in front of "Q",
and a word boundary after "A", you can write ' QA\>', like this:
if grep -q ' QA\>' <<< "$LOGMSG"; then
echo
echo "Your log message \"$LOGMSG\" must not contain QA in upper case."
echo
exit 1
fi 1>&2
Notice some other improvements:
Instead of redirecting to stderr every single echo, you can redirect the entire if statement
Instead of echo "" you can write simply echo
Instead of storing the result of a command in a temporary variable, you can write conditionals on the exit code of commands
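If \< and \> are not available in your grep, the same check can be sketched in plain extended regex. This is only an illustration: contains_qa is a made-up name, the sample messages are invented, and ([^[:alnum:]_]|$) is used as a stand-in for the \> word-end anchor.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the message contains " QA" as a whole word.
contains_qa() {
    # ([^[:alnum:]_]|$) stands in for the \> word-end anchor in portable ERE.
    printf '%s\n' "$1" | grep -Eq ' QA([^[:alnum:]_]|$)'
}

for msg in "#12: Fix the build of QA" \
           "#13: Broken in QA, temporarily" \
           "#14: Tune QAT settings"; do
    if contains_qa "$msg"; then
        echo "rejected: $msg"
    else
        echo "accepted: $msg"
    fi
done
```

The first two messages are rejected and the third is accepted, because in "QAT" the letters QA are followed by another word character.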

This could be an error with your regex expression checking for " QA ".
I find using this site pretty useful for testing out regex expressions - RegExr.
I put your (*[" "][QA][" "]) expression into the site, and when I looked at its details (a tab link towards the bottom of the page), it broke down exactly what your regular expression would match. According to it, the expression was looking for the following:
0 or more (
Either a " or a space
Either Q or A (not both)
Either a " or a space
Ending with a )
I put the following expression into it - ( (QA) ) and it was able to find the match in a sample svn message (TEST-117 QA testing message).

Bash regular expression for CRON

Does anyone have any recommendations for the best method to write a regular expression for CRON?
Allow me to explain a little better. I have a config file with individual variables corresponding to the fields in CRON. I need to verify that each field is valid. ie 0-59 for seconds, 0-31 for months etc. I'm using sed to update CRON and if the configuration file has syntax errors (accidental extra characters, letters, anything that CRON doesnt like) the results are disastrous (CRON file is clobbered)
I would need to verify all possible numbers and wildcards and throw an error on anything else. I dont know if im just getting tired or what, but I cant seem to get started logically on this one.
I'm open to any suggestions, not just coding. How to prevent CRON from getting clobbered, maybe editing everything in one string (in config file) for CRON instead of individual variables
Thx for any help
Here is an example of the config. Very simple.
# SUMMARY REPORT FREQUENCY ( * Wildcards acceptable )
MIN="30"
HOUR="*"
DAY="12"
MON="*"
WEEK="*"
* UPDATE *
Ubuntu 12.04 LTS which ships with Bash 4.2.25
and here is the code that is doing the updating.
function REPORT.CHECK {
sleep 1s
if [ "`crontab -l | grep report.sh`" \> " " ]; then
CTMP="$(set -f; crontab -l | grep report.sh)"
if [ "$CTMP" = "$MIN $HOUR $DAY $MON $WEEK cd $DIR && ./report.sh" ]; then
if [ "$DISABLE" = "false" ]; then
RETURN="true"
fi
else
if [ "$DISABLE" = "false" ]; then
CTMPESC=$(sed 's/[\*\.&]/\\&/g' <<<"$CTMP")
DIRESC=$(sed 's/[\*\.&]/\\&/g' <<<"$DIR")
crontab -l | sed "s%$CTMPESC%/$MIN /$HOUR /$DAY /$MON /$WEEK cd $DIRESC \&\& \./report\.sh" | crontab -
RETURN="update"
fi
fi
if [ "$DISABLE" = "true" ]; then
crontab -l | grep -F -v report.sh | crontab -
RETURN="disable"
fi
else
if [ "$DISABLE" = "true" ]; then
RETURN="exit"
else
(crontab -l ; echo "$MIN $HOUR $DAY $MON $WEEK cd $DIR && ./report.sh") | crontab -
RETURN="default"
fi
fi
}
This snip of code actually does quite a bit. It adds the entry to CRON if it doesn't exist. It also kills the script (well returns exit) if this part (the reporting portion) is disabled in the config, it also updates CRON if it sees that what is in CRON is different than whats in the config and finally if the config is identical to whats in CRON, it just ignores and moves on. Those features are not in order. Hopefully that adds enough detail lol.

If you are sticking with the regex-based approach, this set of regexes (regeces?) should get you started. It doesn't support names for days of the week or months, nor "frequency" notation like */5 for every five minutes. But try this (assuming $configfile holds the path to your config file):
min=$(grep -P 'MIN="([0-5]?[0-9]|\*)"' $configfile | grep -oP '([0-5]?[0-9]|\*)')
hour=$(grep -P 'HOUR=\"([1-2]?[0-9]|\*)"' $configfile | grep -oP "([1-2]?[0-9]|\*)")
day=$(grep -P 'DAY=\"([1-3]?[0-9]|\*)"' $configfile | grep -oP "([1-3]?[0-9]|\*)")
mon=$(grep -P 'MON=\"(1?[0-9]|\*)"' $configfile | grep -oP "(1?[0-9]|\*)")
week=$(grep -P 'WEEK=\"([0-7]|\*)"' $configfile | grep -oP "([0-7]|\*)")
After you've collected these values, you can easily check to see if they're in the correct range -- for example, it's possible for the HOUR regex to match 29, which obviously isn't a real hour. But now that the value is saved, you can do:
if [ "$hour" -gt 23 ]; then
#throw an error, exit the test, whatever
fi
Just make sure to quote the variables when you test them! For example, "$hour", not $hour. If you have an * in a variable and don't quote it, the shell will expand it inline to all the filenames in your current directory.
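As a sketch of that range check, assuming each field may only be * or a plain number (no lists, ranges or */5 steps), a small helper can validate one field at a time. valid_cron_field is a made-up name; the bounds are the usual crontab ones (minute 0-59, hour 0-23, day of month 1-31, month 1-12, day of week 0-7).

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is "*" or an integer within [$2, $3].
valid_cron_field() {
    [ "$1" = "*" ] && return 0
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Sample values from the config in the question.
MIN="30" HOUR="*" DAY="12" MON="*" WEEK="*"

if valid_cron_field "$MIN" 0 59 && valid_cron_field "$HOUR" 0 23 &&
   valid_cron_field "$DAY" 1 31 && valid_cron_field "$MON" 1 12 &&
   valid_cron_field "$WEEK" 0 7; then
    echo "config looks valid"
else
    echo "config has an invalid field" >&2
fi
```

Running this check before touching the crontab means a typo in the config is reported instead of being written into CRON.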

How to check an input string in bash it's in version format (n1.n2.n3)

I've written an script that updates a version on a certain file. I need to check that the input for the user is in version format so I don't finish adding number that are not needed in those important files. The way I have done it is by adding a new value version_check which where I delete my regex pattern and then an if check.
version=$1
version_checked=$(echo $version | sed -e '/[0-9]\+\.[0-9]\+\.[0-9]/d')
if [[ -z $version_checked ]]; then
echo "$version is the right format"
else
echo "$version_checked is not in the right format, please use XX.XX.XX format (ie: 4.15.3)"
exit
fi
That works fine for XX.XX and XX.XX.XX but it also allows XX.XX.XX.XX and XX.XX.XX.XX.XX etc.. so if user makes a mistake it will input wrong data on the file. How can I get the sed regex to ONLY allow 3 pairs of numbers separated by a dot?

Change your regex from:
/[0-9]\+\.[0-9]\+\.[0-9]/
to this:
/^[0-9]\+\.[0-9]\+\.[0-9]\+$/

You can do this with bash pattern matching:
$ for version in 1.2 1.2.3 1.2.3.4; do
printf "%s\t" $version
[[ $version == +([0-9]).+([0-9]).+([0-9]) ]] && echo y || echo n
done
1.2 n
1.2.3 y
1.2.3.4 n
If you need each group of digits to be exactly 2 digits:
[[ $version == [0-9][0-9].[0-9][0-9].[0-9][0-9] ]]
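Another way to write the same test, sketched here with grep -E so that it also runs outside bash: anchor the pattern at both ends and require at least one digit per group. is_version is a hypothetical name, and the input is assumed to be a single line.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for exactly three dot-separated numbers.
is_version() {
    # ^ and $ anchor the match to the whole line; + demands at least one digit.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

for version in 1.2 1.2.3 1.2.3.4 ..; do
    if is_version "$version"; then
        echo "$version y"
    else
        echo "$version n"
    fi
done
```

Only 1.2.3 is accepted. The last input, two bare dots, shows why each group needs + rather than *: with * an empty group would also match.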

sed regex to match ['', 'WR' or 'RN'] + 2-4 digits

I'm trying to do some conditional text processing on Unix and struggling with the syntax. I want to acheive
Find the first 2, 3 or 4 digits in the string
if 2 characters before the found digits are 'WR' (could also be lower case)
Variable = the string we've found (e.g. WR1234)
Type = "work request"
else
if 2 characters before the found digits are 'RN' (could also be lower case)
Variable = the string we've found (e.g. RN1234)
Type = "release note"
else
Variable = "WR" + the string we've found (Prepend 'WR' to the digits)
Type = "Work request"
fi
fi
I'm doing this in a Bash shell on Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Thanks in advance,
Karl

I'm not sure how you read in your strings but this example should help you get there. I loop over 4 example strings, WR1234 RN456 7890 PQ2342. You didn't say what to do if the string doesn't match your expected format (PQ2342 in my example), so my code just ignores it.
#!/bin/bash
for string in "WR1234 - Work Request Name.doc" "RN5678 - Release Note.doc"; do
[[ $string =~ ^([^0-9]*)([0-9]*).*$ ]]
case ${BASH_REMATCH[1]} in
"WR")
var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
type="work request"
echo -e "$var\t-- $type"
;;
"RN")
var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
type="release note"
echo -e "$var\t-- $type"
;;
"")
var="WR${BASH_REMATCH[2]}"
type="work request"
echo -e "$var\t-- $type"
;;
esac
done
Output
$ ./rematch.sh
WR1234 -- work request
RN5678 -- release note
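The pseudocode from the question can also be followed more literally. This is a rough sketch under a few assumptions: classify is a made-up name, grep -o is available, and "the first 2, 3 or 4 digits" is read as the first run of 2 to 4 digits (a longer run is cut after 4). A WR or RN prefix in any case is kept as typed; anything else gets WR prepended.

```shell
#!/bin/sh
# Hypothetical helper: print "<id><TAB><type>" for the first 2-4 digit run,
# keeping a WR/RN prefix (any case) when one directly precedes the digits.
classify() (
    id=$(printf '%s\n' "$1" | grep -oiE '(WR|RN)?[0-9]{2,4}' | head -n 1)
    case $id in
        '')        exit 1 ;;                 # no run of 2-4 digits found
        [Ww][Rr]*) kind="work request" ;;
        [Rr][Nn]*) kind="release note" ;;
        *)         id="WR$id"; kind="work request" ;;
    esac
    printf '%s\t%s\n' "$id" "$kind"
)

for string in "WR1234 - Work Request Name.doc" "rn456" "2456"; do
    classify "$string"
done
```

The function body runs in a subshell (parentheses instead of braces) so that id and kind do not leak into the calling script.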

I like to use perl -pe instead of sed because Perl has such expressive regular expressions. The following is a bit verbose for the sake of instruction.
example.txt:
WR1234 - Work Request name.doc
RN456
rn456
WR7890 - Something else.doc
wr789
2456
script.sh:
#! /bin/bash
# search for 'WR' or 'RN' followed by 2-4 digits and anything else, but capture
# just the part we care about
records="`perl -pe 's/^((WR|RN)([\d]{2,4})).*/\1/i' example.txt`"
# now that you've filtered out the records, you can do something like replace
# WR's with 'work request'
work_requests="`echo \"$records\" | perl -pe 's/wr/work request /ig' | perl -pe 's/rn/release note /ig'`"
# or add 'WR' to lines w/o a listing
work_requests="`echo \"$work_requests\" | perl -pe 's/^(\d)/work request \1/'`"
# or make all of them uppercase
records_upper=`echo $records | tr '[:lower:]' '[:upper:]'`
# or count WR's
wr_count=`echo "$records" | grep -i wr | wc -l`
echo count $wr_count
echo "$work_requests"

#!/bin/bash
string="RN1234 - Work Request Name.doc"
echo "$string" | gawk --re-interval '
{
if(match ($0,/(..)[0-9]{4}\>/,a ) ){
if (a[1]=="WR"){
type="Work request"
}else if ( a[1] == "RN" ){
type = "Release Notes"
}
print type
}
}'
