sed regex to match ['', 'WR' or 'RN'] + 2-4 digits - regex

I'm trying to do some conditional text processing on Unix and I'm struggling with the syntax. I want to achieve the following:
Find the first 2, 3 or 4 digits in the string
if 2 characters before the found digits are 'WR' (could also be lower case)
    Variable = the string we've found (e.g. WR1234)
    Type = "work request"
else
    if 2 characters before the found digits are 'RN' (could also be lower case)
        Variable = the string we've found (e.g. RN1234)
        Type = "release note"
    else
        Variable = "WR" + the string we've found (Prepend 'WR' to the digits)
        Type = "Work request"
    fi
fi
I'm doing this in a Bash shell on Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Thanks in advance,
Karl

I'm not sure how you read in your strings, but this example should help you get there. I loop over two example strings. You didn't say what to do if the string doesn't match your expected format (a PQ2342 prefix, say), so my code just ignores such strings.
#!/bin/bash
for string in "WR1234 - Work Request Name.doc" "RN5678 - Release Note.doc"; do
    [[ $string =~ ^([^0-9]*)([0-9]*).*$ ]]
    case ${BASH_REMATCH[1]} in
        "WR")
            var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
            type="work request"
            echo -e "$var\t-- $type"
            ;;
        "RN")
            var="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
            type="release note"
            echo -e "$var\t-- $type"
            ;;
        "")
            var="WR${BASH_REMATCH[2]}"
            type="work request"
            echo -e "$var\t-- $type"
            ;;
    esac
done
Output
$ ./rematch.sh
WR1234 -- work request
RN5678 -- release note
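If the prefix can be lower case and the digit run must be 2 to 4 long, as the question asks, the same idea can be sketched like this. It is only a sketch: classify is a made-up helper name, and it follows the question's pseudocode by defaulting to WR for any prefix other than RN. The regex is kept in a variable and the prefix is upper-cased with tr so that it should also run on older bash 3.x.

```shell
#!/bin/bash
# classify is a hypothetical helper; it prints "<id><TAB><type>" for one string.
classify() {
    # Optional two-letter prefix directly in front of the first run of 2-4 digits.
    local re='([A-Za-z]{2})?([0-9]{2,4})'
    [[ $1 =~ $re ]] || return 1                 # no run of 2-4 digits found
    local prefix digits=${BASH_REMATCH[2]}
    # Upper-case the prefix with tr (avoids the bash 4 ${var^^} expansion).
    prefix=$(printf '%s' "${BASH_REMATCH[1]}" | tr '[:lower:]' '[:upper:]')
    case $prefix in
        RN) printf '%s\t%s\n' "RN$digits" "release note" ;;
        *)  printf '%s\t%s\n' "WR$digits" "work request" ;;   # WR, or default: prepend WR
    esac
}

for s in "WR1234 - Work Request Name.doc" "rn456" "7890" "no digits here"; do
    classify "$s" || echo "skipped: $s"
done
```

Strings without any run of digits make classify return non-zero, so the caller decides what to do with them.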

I like to use perl -pe instead of sed because Perl has such expressive regular expressions. The following is a bit verbose for the sake of instruction.
example.txt:
WR1234 - Work Request name.doc
RN456
rn456
WR7890 - Something else.doc
wr789
2456
script.sh:
#! /bin/bash
# search for 'WR' or 'RN' followed by 2-4 digits and anything else, but capture
# just the part we care about
records="`perl -pe 's/^((WR|RN)([\d]{2,4})).*/\1/i' example.txt`"
# now that you've filtered out the records, you can do something like replace
# WR's with 'work request'
work_requests="`echo \"$records\" | perl -pe 's/wr/work request /ig' | perl -pe 's/rn/release note /ig'`"
# or add 'WR' to lines w/o a listing
work_requests="`echo \"$work_requests\" | perl -pe 's/^(\d)/work request \1/'`"
# or make all of them uppercase
records_upper=`echo "$records" | tr '[:lower:]' '[:upper:]'`
# or count WR's
wr_count=`echo "$records" | grep -i wr | wc -l`
echo count $wr_count
echo "$work_requests"

#!/bin/bash
string="RN12344 - Work Request Name.doc"
echo "$string" | gawk --re-interval '
{
    # capture the two characters at the start of a word that precede 2-4 digits
    if (match($0, /\<(..)[0-9]{2,4}/, a)) {
        type = "Work request"              # default when the prefix is not RN
        if (toupper(a[1]) == "RN") {
            type = "Release Notes"
        }
        print type
    }
}'

Related

Regular Expression to search for a number between two

I am not very familiar with Regular Expressions.
I have a requirement to extract all lines that match an 8 digit number between any two given numbers (for example 20200628 and 20200630) using regular expression. The boundary numbers are not fixed, but need to be parameterized.
In case you are wondering, this number is a timestamp, and I am trying to extract information between two dates.
HHHHH,E.164,20200626113247
HHHHH,E.164,20200627070835
HHHHH,E.164,20200628125855
HHHHH,E.164,20200629053139
HHHHH,E.164,20200630125855
HHHHH,E.164,20200630125856
HHHHH,E.164,20200626122856
HHHHH,E.164,20200627041046
HHHHH,E.164,20200628125856
HHHHH,E.164,20200630115849
HHHHH,E.164,20200629204531
HHHHH,E.164,20200630125857
HHHHH,E.164,20200630125857
HHHHH,E.164,20200626083628
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629122857
HHHHH,E.164,20200630122237
HHHHH,E.164,20200630122351
HHHHH,E.164,20200630122858
HHHHH,E.164,20200630122857
HHHHH,E.164,20200630084722
Assuming the above data is stored in a file named data.txt, the idea is to sort it numerically on the 3rd comma-delimited column (i.e. sort -t, -nk3), and then pass the sorted output through this perl filter, as demonstrated by this find_dates.sh script:
#!/bin/bash
[ $# -ne 3 ] && echo "Expects 3 args: YYYYmmdd start, YYYYmmdd end, and data filename" && exit
DATE1=$1
DATE2=$2
FILE=$3
echo "$DATE1" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: First date is invalid - $DATE1" && exit
echo "$DATE2" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: Second date is invalid - $DATE2" && exit
[ ! -r "$FILE" ] && echo "ERROR: File not found - $FILE" && exit
sort -t, -nk3 "$FILE" | perl -ne '
    BEGIN { $date1 = shift; $date2 = shift }
    print if /164,$date1/ .. /164,$date2/;
    print if /164,$date2/;
' "$DATE1" "$DATE2" | sort -u
Running the command find_dates.sh 20200627 20200629 data.txt will produce the result:
HHHHH,E.164,20200627041046
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627070835
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628125855
HHHHH,E.164,20200628125856
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629053139
HHHHH,E.164,20200629122857
HHHHH,E.164,20200629204531
For the example you gave, between 20200628 and 20200630, you may try:
\b202006(?:2[89]|30)
I might be tempted to make the general comment that regex is not very suitable for finding numerical ranges (whereas application programming languages are). However, in the case of parsing a text log file, regex is what would be easily available.
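To illustrate the point about numeric ranges, here is a minimal sketch that compares the date part as a number instead of building a regex for each pair of boundaries. filter_by_date is a hypothetical helper name, and it assumes the timestamp is always the third comma-separated field with the YYYYmmdd date in its first 8 characters.

```shell
#!/bin/bash
# filter_by_date START END [FILE...]   (hypothetical helper)
# Prints lines whose third comma-separated field starts with a YYYYmmdd
# date between START and END, inclusive. Reads stdin when no FILE is given.
filter_by_date() {
    local start=$1 end=$2
    shift 2
    awk -F, -v start="$start" -v end="$end" '
        {
            day = substr($3, 1, 8) + 0          # first 8 digits, forced to a number
            if (day >= start + 0 && day <= end + 0) print
        }' "$@"
}

printf '%s\n' \
    'HHHHH,E.164,20200626113247' \
    'HHHHH,E.164,20200628125855' \
    'HHHHH,E.164,20200630125856' \
    'HHHHH,E.164,20200701000000' |
    filter_by_date 20200628 20200630
```

Because the boundaries are plain arguments, no sorting is needed and the input order is preserved.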

Regular expression on bash/shell/python for githook pre-commit

I am trying to work with regular expression
I have a string in format
[+/-] Added Feature 305105:WWE-108. Added Dolph Ziggler super star
Let's look on each part of string
1) [+/-] – the square brackets are required. It can be [+], [-], or [+/-], but not "+", "-", or "+/-" without the brackets
2) Added – it can be "Added", "Resolved", "Closed"
3) 305105 – any number
4) Feature – it can be "Feature", "Bug", "Fix"
5) : – very important delimiter
6) WWE-108 – any text followed by the delimiter "-" and then numbers
7) . – very important delimiter
8) Added Dolph Ziggler super star – any text
What I tried to do
Let's try to resolve each part:
1) echo '[+]' | egrep -o "[+/-]+". Yes, it works, but it also matches [+/] or [/], and the result is printed without the brackets
2) echo "Resolved" | egrep -o "Added$|Resolved$|Closed$". It works
3) echo '124214215215' | egrep -o "[0-9]+$". It works
4) echo "Feature" | egrep -o "Feature$|Bug$|Fix$". It works too
5) I have not found how
6) echo "WWE-108" | egrep -o "[a-zA-Z]+-[0-9]+". It works too
7) I have not found how
8) Any text
The main question: how do I combine all these parts in bash, separated by spaces, according to this template: [+/-] Added Feature 305105:WWE-108. Added Dolph Ziggler super star. I am not familiar with regexps, so I'd like to do something like this:
string="[+/-] Added Feature 305105:WWE-108. Added Dolph Ziggler super star"
first=$(echo $string | awk '{print $1}')
if [[ $first == "[+]" ]]; then
echo "ok"
echo $first
elif [[ $first == "[*]" ]]; then
echo "ok2"
echo $first
elif [[ $first == "[+/-]" ]]; then
echo "ok3"
echo "$first"
else
echo "not ok"
echo $first
exit 1
fi
But this does not work. Could you please help me build the regexp in bash? Python is fine for me too.
Why am I doing this? I want to make a pre-commit hook that enforces a format like this:
[+/-] Added Feature 305105:WWE-108. Added Dolph Ziggler super star. That is the reason I am doing this.
Answer from the comments, putting it all together.
egrep '^\[(\+|-|\+/-)\] (Added|Resolved|Closed) (Feature|Bug|Fix) [0-9]+:[a-zA-Z]+-[0-9]+\..+'
As a general rule, with extended regex, the metacharacters .*+^$(|)[]{}\ must be escaped with a backslash to have their literal meaning (except in character sets between [], where the rules are different).
Note, for background, that with basic regex it is the opposite: a backslash is used to enable the special meaning of the regex extensions (|){}+.
grep '^\[\(+\|-\|+/-\)\] \(Added\|Resolved\|Closed\) \(Feature\|Bug\|Fix\) [0-9]\+:[a-zA-Z]\+-[0-9]\+\..\+'
But it's longer and harder to understand.
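The same extended regex can also be used directly in bash with the =~ operator, which avoids spawning grep. The following is a minimal sketch: valid_commit_msg is a made-up function name and the message is a hard-coded sample. In a real Git hook the message would have to be read from the file Git passes to the commit-msg hook.

```shell
#!/bin/bash
# valid_commit_msg is a hypothetical helper; the pattern mirrors the egrep one above.
valid_commit_msg() {
    # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
    local re='^\[(\+|-|\+/-)\] (Added|Resolved|Closed) (Feature|Bug|Fix) [0-9]+:[a-zA-Z]+-[0-9]+\..+'
    [[ $1 =~ $re ]]
}

msg='[+/-] Added Feature 305105:WWE-108. Added Dolph Ziggler super star'
if valid_commit_msg "$msg"; then
    echo "ok: $msg"
else
    echo "rejected: $msg" >&2
fi
```

The function only returns a status, so a hook script could call it and exit 1 on failure to abort the commit.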

Get multiple values in an xml file

<!-- someotherline -->
<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />
I need to grab the values of user id, password, Data Source and Database. Not all lines are in the same format. My desired result would be (username=value1,password=value2, DataSource=datasource1.comapany.com,Database=databasename_compny)
This regex seems a bit too complicated for me. Please explain your answer if possible.
I realised it's better to loop through each line. This is the code I have written so far:
while read p || [[ -n $p ]]; do
#echo $p
if [[ $p =~ .*connectionString.* ]]; then
echo $p
fi
done <a.config
Now inside the if I have to grab the values.
For this solution I am considering:
Some lines can contain no data
No semi-colon ; is inside the data itself (nor field names)
No equal sign = is inside the data itself (nor field names)
A possible solution for your problem would be:
#!/bin/bash
while read p || [[ -n $p ]]; do
    # 1. Only keep what is between the quotes after connectionString=
    filteredLine=`echo $p | sed -n -e 's/^.*connectionString="\(.\+\)".*$/\1/p'`;
    # 2. Ignore empty lines (that do not contain the expected data)
    if [ -z "$filteredLine" ]; then
        continue;
    fi;
    # 3. split each field on a line
    oneFieldByLine=`echo $filteredLine | sed -e 's/;/\r\n/g'`;
    # 4. For each field
    while IFS= read -r field; do
        # extract field name + field value
        fieldName=`echo $field | sed 's/=.*$//'`;
        fieldValue=`echo $field | sed 's/^[^=]*=//' | sed 's/[\r\n]//'`;
        # do stuff with it
        echo "'$fieldName' => '$fieldValue'";
    done < <(printf '%s\n' "$oneFieldByLine")
done <a.xml
Explanations
General sed replacement syntax :
sed 's/a/b/' will replace what matches the regex a by the content of b
Step 1
-n tells sed not to print each line automatically; combined with the p flag, only lines where the substitution succeeded are output. In this case this is useful to ignore useless lines.
^.* - anything at the beginning of the line
connectionString=" - literally connectionString="
\(.\+\)" - capturing group to store anything before the closing quote "
.*$ - anything until the end of the line
\1 tells sed to replace the whole match with only the capturing group (which contains only the data between the quotes)
p tells sed to print out the replacement
Step 3
Replace ; by \r\n ; it is equivalent to splitting by semi-colon because bash can loop over line breaks
Step 4 - field name
Replaces literal = and the rest of the line with nothing (it removes it)
Step 4 - field value
Replaces all the characters at the beginning that are not = ([^=] matches all but what is after the '^' symbol) until the equal symbol by nothing.
Another sed command removes the line breaks by replacing it with nothing.
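For comparison, the same extraction can be sketched with bash built-ins only, under the same assumptions (no ; or = inside names or values). parse_connection_line is a hypothetical helper name, and it prints name=value pairs rather than the quoted format used above.

```shell
#!/bin/bash
# parse_connection_line is a hypothetical helper: prints one "name=value" per line.
parse_connection_line() {
    local re='connectionString="([^"]+)"'
    [[ $1 =~ $re ]] || return 1                 # line has no connection string
    local fields field
    # Split the captured text on ';' into an array.
    IFS=';' read -r -a fields <<< "${BASH_REMATCH[1]}"
    for field in "${fields[@]}"; do
        [[ $field == *=* ]] || continue         # skip empty or malformed pieces
        # Name is everything before the first '=', value everything after it.
        printf '%s=%s\n' "${field%%=*}" "${field#*=}"
    done
}

line='<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />'
parse_connection_line "$line"
```

Lines without a connectionString attribute make the function return non-zero, so a read loop over the file can simply skip them.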

Bash script grep for pattern in variable of text

I have a variable which contains text; I can echo it to stdout so I think the variable is fine. My problem is trying to grep for a pattern in that variable of text. Here is what I am trying:
ERR_COUNT=`echo $VAR_WITH_TEXT | grep "ERROR total: (\d+)"`
When I echo $ERR_COUNT the variable appears to be empty, so I must be doing something wrong.
How to do this properly? Thanks.
EDIT - Just wanted to mention that testing that pattern on the example text I have in the variable does give me something (I tested with: http://rubular.com)
However the regex could still be wrong.
EDIT2 - Not getting any results yet, so here's the string I'm working with:
ALERT line125: Alert: Cannot locate any description for 'asdf' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../hgfd.controls) ALERT line126: Alert: Cannot locate any description for 'zxcv' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../dfhg.controls) ALERT line127: Alert: Cannot locate any description for 'rtyu' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../kjgh.controls) [1] 22280 IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done /tool/pandora/bin/gvim -u NONE -U NONE -nRN -c runtime! plugin/**/*.vim -bg ...
That's the string, so hopefully there should be no ambiguity anymore... I want to extract the number "23" (after "ERROR total: ") into a variable and I'm having a hard time haha.
Cheers
You can use bash's =~ operator to extract the value.
[[ $VAR_WITH_TEXT =~ ERROR\ total:\ ([0-9]+) ]]
Note that you have to escape the spaces, or quote only the fixed parts of the regular expression:
[[ $VAR_WITH_TEXT =~ "ERROR total: "([0-9]+) ]]
since quoting any of the metacharacters causes them to be treated literally.
You can also save the regex in a variable:
regex="ERROR total: ([0-9]+)"
[[ $VAR_WITH_TEXT =~ $regex ]]
In any case, once the expression matches, the parenthesized expression can be found in the BASH_REMATCH array.
ERR_COUNT=${BASH_REMATCH[1]}
(The zeroth element contains the entire matched regular expression; the parenthesized subexpressions are found in the remaining elements in the order they appear in the full regex.)
If you want to use grep, you'll need a version that can accept Perl-style regexes.
ERR_COUNT=$( echo "$VAR_WITH_TEXT" | grep -Po "(?<=ERROR total: )\d+" )
As long as you need to use Perl-style regexes to enable the look-behind assertion, you can replace [0-9] with \d.
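Putting the =~ pieces together, a small sketch that also checks whether the match succeeded before using the result. error_total is a made-up function name and the sample text is a shortened version of the string from the question.

```shell
#!/bin/bash
# error_total is a hypothetical helper: prints the number after "ERROR total: ".
error_total() {
    local re='ERROR total: ([0-9]+)'
    [[ $1 =~ $re ]] || return 1                 # pattern not present in the text
    printf '%s\n' "${BASH_REMATCH[1]}"
}

VAR_WITH_TEXT='IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done'
if ERR_COUNT=$(error_total "$VAR_WITH_TEXT"); then
    echo "errors: $ERR_COUNT"
else
    echo "no error total found" >&2
fi
```

Testing the return status avoids silently ending up with an empty ERR_COUNT when the text has no "ERROR total:" part.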
Your error is in the pattern: (\d+) matches:
'('
a digit
'+'
')'
According to your comment, what you want is \(\d\+\), which:
defines a sub-pattern by \( ... \)
Inside it matches at least one (\+) digit (\d).
In this case, if you don't need a sub-pattern, you can just drop the \( and \).
Note: if your grep doesn't understand \d, you can replace it by [0-9]. Easiest way is to write grep '\d' and test it by writing a couple test lines.
# setting example data
test="adfa\nfasetrfaqwe\ndsfa ERROR total: 32514235dsfaewrf"
one solution:
echo $(sed -n 's/^.*ERROR total: \([0-9]*\).*$/\1/p' < <(echo $test))
32514235
other solution:
# throw away everything up to "ERROR total: "
test=${test##*ERROR total: }
# cut everything from the first non-digit character onward
test=${test%%[!0-9]*}
echo $test
32514235
The \d is probably only recognized as a digit in perl regex mode, you probably want to use grep -P.
If you only want the number you could try:
ERR_COUNT=$(echo $VAR_WITH_TEXT | perl -pe "s/.*ERROR total: (\d+).*/\1/g")
or:
ERR_COUNT=$(echo "$VAR_WITH_TEXT" | sed -n "s/.*ERROR total: \([0-9][0-9]*\).*/\1/p")

How to check an input string in bash it's in version format (n1.n2.n3)

I've written a script that updates a version in a certain file. I need to check that the user's input is in version format so I don't end up adding numbers that don't belong in those important files. The way I have done it is by adding a new variable, version_checked, in which I delete anything matching my regex pattern, followed by an if check.
version=$1
version_checked=$(echo $version | sed -e '/[0-9]\+\.[0-9]\+\.[0-9]/d')
if [[ -z $version_checked ]]; then
echo "$version is the right format"
else
echo "$version_checked is not in the right format, please use XX.XX.XX format (ie: 4.15.3)"
exit
fi
That works fine for XX.XX and XX.XX.XX but it also allows XX.XX.XX.XX and XX.XX.XX.XX.XX etc.. so if user makes a mistake it will input wrong data on the file. How can I get the sed regex to ONLY allow 3 pairs of numbers separated by a dot?
Change your regex from:
/[0-9]\+\.[0-9]\+\.[0-9]/
to this:
/^[0-9]\+\.[0-9]\+\.[0-9]\+$/
You can do this with bash pattern matching:
$ for version in 1.2 1.2.3 1.2.3.4; do
printf "%s\t" $version
[[ $version == +([0-9]).+([0-9]).+([0-9]) ]] && echo y || echo n
done
1.2 n
1.2.3 y
1.2.3.4 n
If you need each group of digits to be exactly 2 digits:
[[ $version == [0-9][0-9].[0-9][0-9].[0-9][0-9] ]]
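Another option is an anchored regex with bash's =~ operator, which does not depend on extended globbing. This is a minimal sketch and is_version is a hypothetical helper name. It accepts exactly three dot-separated groups of one or more digits.

```shell
#!/bin/bash
# is_version is a hypothetical helper: succeeds only for N.N.N with digits in each part.
is_version() {
    # ^ and $ anchor the match so extra groups such as 1.2.3.4 are rejected.
    local re='^[0-9]+\.[0-9]+\.[0-9]+$'
    [[ $1 =~ $re ]]
}

for version in 1.2 1.2.3 1.2.3.4 4.15.3 1..3; do
    if is_version "$version"; then
        echo "$version ok"
    else
        echo "$version rejected"
    fi
done
```

In the update script, the check could then be written as: is_version "$1" || exit 1.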