reading and analyzing a text file with bash script - regex

I want to read a log file and extract the 5-6 digit number written right after the keyword "salary", then check whether that salary is above 2000. If even one salary is above 2000, it is an MNC; otherwise it is unknown. The line mostly ends right after the salary, but sometimes an email address follows it.
My script currently looks like this:
salary=$(grep -o 'salary [1-9][0-9]\+$' tso.txt | grep -o '[0-9]\+')
echo $salary
if [ $salary > 2000 ]; then echo "it is mnc....."; else ":it is unknown....."; fi

This can be done with a simple awk script like this:
awk '
{
for (i=2; i<=NF; ++i)
if ($(i-1) == "salary" && $i+0 > 2000) {
mnc = 1
exit
}
}
END {
print (mnc ? "it is mnc....." : "it is unknown.....")
}' file

As you seem to be using GNU grep, you can get the salary value directly with grep -oP 'salary *0*\K[1-9][0-9]*' and then check it with if [ "$salary" -gt 2000 ].
See the online demo:
#!/bin/bash
tso='salary 23000'
# \K drops "salary" and any leading zeros from the reported match
salary=$(grep -oP 'salary *0*\K[1-9][0-9]*' <<< "$tso")
echo "$salary" # => 23000
if [ "$salary" -gt 2000 ]; then
echo "it is mnc....."
else
echo "it is unknown....."
fi
# => it is mnc.....
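Two details are worth spelling out. Inside [ ], an unquoted > is a shell redirection, so [ $salary > 2000 ] creates a file named 2000 instead of comparing numbers. Also, grep -o prints one match per line, so $salary can hold several values when the log has several salary lines. Here is a minimal sketch that handles both, assuming a grep that supports -o (GNU and BSD grep do). The name salary_class is hypothetical and not part of the original posts.

```shell
#!/bin/sh
# salary_class is a hypothetical helper name used only for this sketch.
salary_class() {
    # grep -o emits one "salary N" match per line, even when an
    # email address follows the number on the same input line.
    grep -oE 'salary +[0-9]+' "$1" |
    awk '$2 + 0 > 2000 { found = 1; exit }   # numeric comparison on the number
         END { print (found ? "it is mnc....." : "it is unknown.....") }'
}
```

Usage would be salary_class tso.txt. The comparison is done in awk, so a value with a leading zero is still treated as a decimal number.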

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man page says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If a gnu-awk solution is acceptable, using RS and RT can reproduce the behavior of grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code.
Tested on AIX/Solaris/FreeBSD/macOS/Linux:
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by the regex engine of awk though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS='^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that can match zero or more characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. Here the first part is true, but the second is false, because the match found is the zero-length one before the first character (RSTART is set to 1). The body of the loop therefore never runs.
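To see the difference on that input, here is a small sketch of the loop that steps over zero-length matches instead of stopping at them. It follows the same idea as the tail-based version earlier in this thread. The name only_matching is hypothetical, and the regex is passed with -v for brevity, so awk will pre-process backslash escapes in it.

```shell
#!/bin/sh
# only_matching is a hypothetical name for this sketch.
only_matching() {
    awk -v ere="$1" '{
        tail = $0
        while (tail != "" && match(tail, ere)) {
            if (RLENGTH > 0) {
                # non-empty match: print it and continue after it
                print substr(tail, RSTART, RLENGTH)
                tail = substr(tail, RSTART + RLENGTH)
            } else {
                # zero-length match: skip one character and retry
                tail = substr(tail, RSTART + 1)
            }
        }
    }'
}

printf '1A2A3\n' | only_matching 'A*'
# A
# A
```

With the original condition (match(...) && RLENGTH > 0) the same input prints nothing, because the first match is the empty one at position 1.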

Else statement is not being executed - Unix

I am trying to run a bash script that has an if/else condition, but for some reason, my else statement is not being executed.
The rest of the script works perfectly. I could try to make it different, but I am trying to understand why this else is not working.
n=1
for ((i=1;i<=GEN;i++))
do
if [ `cat sires${i} | wc -l` -ge 0 ] || [ `cat dams${i} | wc -l` -ge 0 ]; then
cat sires${i} dams${i} > parent${i}
awk 'NR==FNR {a[$1]=$0;next} {if($1 in a) print a[$1]; else print $0}' ped parent${i} >> ped_plus
cat ped_plus | awk '$2!=0 {print $2,0,0}' | awk '!a[$1]++' > tmp_sire
cat ped_plus | awk '$3!=0 {print $3,0,0}' | awk '!a[$1]++' > tmp_dam
((n2=n+i))
awk 'NR==FNR {a[$1];next} !($1 in a) {print $0}' ped_plus tmp_sire > sires${n2}
awk 'NR==FNR {a[$1];next} !($1 in a) {print $0}' ped_plus tmp_dam > dams${n2}
else
echo "Your file looks good."
i=99
fi
done
It should print the message Your file looks good., but this is not happening.
Any idea?
Use -gt, not -ge, when you want to check for more than 0: a line count is never negative, so -ge 0 is always true and the else branch can never run.
Or look at man test, you will find the option -s:
if [ -s sires${i} ] || [ -s dams${i} ]; then
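A short sketch comparing the two tests on an empty file. Both helper names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Both helper names are hypothetical, for illustration only.
check_with_wc() {
    # wc -l prints 0 for an empty file, and 0 -ge 0 is true
    # (the substitution is left unquoted so padding from wc is dropped)
    if [ $(wc -l < "$1") -ge 0 ]; then echo "has data"; else echo "empty"; fi
}
check_with_s() {
    # -s is true only if the file exists and its size is greater than zero
    if [ -s "$1" ]; then echo "has data"; else echo "empty"; fi
}

empty_file=$(mktemp)
check_with_wc "$empty_file"   # has data (the else branch is unreachable)
check_with_s "$empty_file"    # empty
rm -f "$empty_file"
```

Note that -s also returns false for a missing file, whereas cat on a missing file prints an error.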

Escaping special characters with sed

I have a script to generate char arrays from strings:
#!/bin/bash
while [ -n "$1" ]
do
echo -n "{" && echo -n "$1" | sed -r "s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
shift
done
It works great as is:
$ wchar 'test\n' 'test\\n' 'test\123' 'test\1234' 'test\x12345'
{'t','e','s','t','\n',0}
{'t','e','s','t','\\','n',0}
{'t','e','s','t','\123',0}
{'t','e','s','t','\123','4',0}
{'t','e','s','t','\x12345',0}
But because sed considers each new line to be a brand new thing it doesn't handle actual newlines:
$ wchar 'test
> test'
{'t','e','s','t',
't','e','s','t',0}
How can I replace special characters (Tabs, newlines etc) with their escaped versions so that the output would be like so:
$ wchar 'test
> test'
{'t','e','s','t','\n','t','e','s','t',0}
Edit: Some ideas that almost work:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$!ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{test\n\\n\1234\x1234abg\ntest\ntest0}
While removing the !:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{'t','e','s','t','\n','\\','n','\123','4','\x1234ab','g','\n','t','e','s','t',
test0}
This is close...
The first isn't performing the final replacement, and the second isn't correctly adding the last line
You can pre-filter before passing to sed. Perl will do:
$ set -- 'test1
> test2'
$ echo -n "$1" | perl -0777 -pe 's/\n/\\n/g'
test1\ntest2
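If you would rather stay with sed, the join-all-lines loop can also work as a pre-filter. One likely reason the attempts in the question misbehave is that the sed script is in double quotes, so the shell expands $! before sed sees it. Below is a minimal sketch with the script in single quotes. The name escape_newlines is hypothetical, and the behaviour for input without a trailing newline was reasoned through for GNU sed only.

```shell
#!/bin/sh
# escape_newlines is a hypothetical helper name for this sketch.
escape_newlines() {
    # :a / $!{N;ba}  appends every following line to the pattern space;
    # the final s///g turns each embedded newline into backslash + n.
    printf '%s' "$1" | sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n/\\n/g'
}
```

The result can then be piped into the existing character-splitting sed command, which already treats \n as a single escaped character.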
This is a very convoluted solution, but might work for your needs. GNU awk 4.1
#!/usr/bin/awk -f
@include "join"
@include "ord"
BEGIN {
RS = "\\\\(n|x..)"
FS = ""
}
{
for (z=1; z<=NF; z++)
y[++x] = ord($z)<0x20 ? sprintf("\\x%02x",ord($z)) : $z
y[++x] = RT
}
END {
y[++x] = "\\0"
for (w in y)
y[w] = "'" y[w] "'"
printf "{%s}", join(y, 1, x, ",")
}
Result
$ cat file
a
b\nc\x0a
$ ./foo.awk file
{'a','\x0a','b','\n','c','\x0a','\0'}

Netmask validation seems not working using regex in bash script while ip validation working fine

I wrote a simple script to validate an IP address and a netmask, as follows:
#!/bin/bash
validFormatIP()
{
echo $1 | grep -w -E -o '^(25[0-4]|2[0-4][0-9]|1[0-9][0-9]|[1]?[1-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' > /dev/null
if [ $? -eq 0 ]
then
echo "Valid ipaddress"
else
echo "Inalid ipaddress"
fi
}
validNetMask()
{
echo $1 | grep -w -E -o '^(254|252|248|240|224|192|128)\\.0\\.0\\.0|255\\.(254|252|248|240|224|192|128|0)\\.0\\.0|255\\.255\\.(254|252|248|240|224|192|128|0)\\.0|255\\.255\\.255\\.(254|252|248|240|224|192|128|0)' > /dev/null
if [ $? -eq 0 ]
then
echo "Valid netmask"
else
echo "Invalid netmask"
fi
}
setIpAddress()
{
ip_address=`echo $1 | awk -F= '{print $2}'`
validFormatIP $ip_address
}
setNetMask()
{
ip_address=`echo $1 | awk -F= '{print $2}'`
validNetMask $ip_address
}
let "n_count=0"
netmask=""
let "i_count=0"
ipaddess=""
while getopts ":i:n:" OPTION
do
case $OPTION in
n)
netmask="Netmask=$OPTARG"
let "n_count+=1"
;;
i)
ipaddess="IpAddess=$OPTARG"
let "i_count+=1"
;;
?)
echo "wrong command syntax"
;;
esac
done
if [ $i_count -eq 1 ]
then
setIpAddress $ipaddess
exit 0
fi
if [ $n_count -eq 1 ]
then
setNetMask $netmask
exit 0
fi
Using the above script I can successfully filter out invalid IP addresses, but not invalid netmasks. I ran the script with different arguments as follows; the output of each run is shown below the command:
$ ./script.bash -i 192.168.0.1
Valid ipaddress
$./script.bash -i 255.255.0.0
Inalid ipaddress
$./script.bash -n 255.255.255.0
Invalid netmask
As the output shows, the result of the IP address validation is as expected, but why is the netmask rejected even when I enter a valid one such as 255.255.255.0?
Does anyone have an idea what I missed in the netmask validation, or what is wrong in my script?
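Inside single quotes the shell passes backslashes through unchanged, so \\. reaches grep as a literal backslash followed by any character. Escape each dot with a single backslash and it will work: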
validNetMask() {
echo $1 | grep -w -E -o '^(254|252|248|240|224|192|128)\.0\.0\.0|255\.(254|252|248|240|224|192|128|0)\.0\.0|255\.255\.(254|252|248|240|224|192|128|0)\.0|255\.255\.255\.(254|252|248|240|224|192|128|0)' > /dev/null
if [ $? -eq 0 ]; then
echo "Valid netmask"
else
echo "Invalid netmask"
fi
}
Better to use this concise version:
validNetMask() {
grep -E -q '^(254|252|248|240|224|192|128)\.0\.0\.0|255\.(254|252|248|240|224|192|128|0)\.0\.0|255\.255\.(254|252|248|240|224|192|128|0)\.0|255\.255\.255\.(254|252|248|240|224|192|128|0)' <<< "$1" && echo "Valid netmask" || echo "Invalid netmask"
}
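One thing to watch in this pattern is that ^ applies only to the first alternative and nothing anchors the end, so without -w a string such as 1255.255.255.0 still contains a match. Here is a hedged sketch that groups the alternatives and anchors both ends. The name valid_netmask is hypothetical, it returns an exit status instead of printing, and the set of accepted masks is left exactly as in the pattern above.

```shell
#!/bin/sh
# valid_netmask is a hypothetical name; it succeeds (exit 0) for a valid mask.
valid_netmask() {
    # The outer ( ... ) makes ^ and $ apply to every alternative.
    printf '%s\n' "$1" | grep -Eq '^((254|252|248|240|224|192|128)\.0\.0\.0|255\.(254|252|248|240|224|192|128|0)\.0\.0|255\.255\.(254|252|248|240|224|192|128|0)\.0|255\.255\.255\.(254|252|248|240|224|192|128|0))$'
}

if valid_netmask 255.255.255.0; then echo "Valid netmask"; else echo "Invalid netmask"; fi
# Valid netmask
```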
mine:
validate_netmask () {
n_masks=(${1//./ })
[ "${#n_masks[@]}" -ne 4 ] && return 1
for i in ${1//./ }; do
bits=$(echo "obase=2;ibase=10;$i" | bc)
pre=$((8-${#bits}))
if [ "$bits" = 0 ]; then
zeros=00000000
elif [ "$pre" -gt 0 ]; then
zeros=$(for ((i=1;i<=$pre;i++)); do echo -n 0; done)
fi
b_mask=$b_mask$zeros$bits
unset zeros
done
if [ $b_mask = ${b_mask%%0*}${b_mask##*1} ]; then
return 0
else
return 1
fi
}

How can I check the balance of ASCII images using bash?

I have some large ASCII images that I want to check are symmetrical. Say I have the following file:
***^^^MMM
*^**^^MMM
**^^^^^MMMMM
The first line is what I want: the characters are grouped into separate runs and each run has the same length (it doesn't have to be 3 of each every time, though). The next two lines are not what I want. I want to count the number of *'s in a row, and then make sure the same number of ^'s and M's follow them. I'm trying to get some symmetry on each line, so this would be good:
**^^MM
**********^^^^^^^^^^MMMMMMMMMM
****^^^^MMMM
*^M
etc.
How can I scan through a file and maybe grep the problem lines?
I tried a few loops with cat ASCIIfile | sed 's/\^//g' | sed 's/M//g' | wc -c, assigning the output to a variable and then comparing the count to the other character counts, but obviously this doesn't take order into account, so lines like *^*^*M^MM were passing.
Using perl:
perl -ne ' { $l=$_; chomp; ($v)=/^((.)\2*)/; $t=length($v); \
s/M{$t}//;s/\^{$t}//;s/\*{$t}//; \
print $l if length } ' input_file
Using bash/sed:
while IFS= read -r line; do
# quote "$line" so the shell does not glob-expand the * characters
m=$(echo "$line" | sed 's/[^M]*\([M][M]*\)[^M]*/\1/' | wc -c)
s=$(echo "$line" | sed 's/[^*]*\([*][*]*\)[^*]*/\1/' | wc -c)
n=$(echo "$line" | sed 's/[^\^]*\([\^][\^]*\)[^\^]*/\1/' | wc -c)
if [[ $m -ne $s || $m -ne $n ]]; then
echo "- $line $m::$s::$n"
else
echo "+ $line $m::$s::$n"
fi
done < input_file
Pure Bash:
#!/bin/bash
for string in '***^^^MMM' '**^^MM' '****^^MMMM' '*^*M^'
do
flag=true
sym=true
patt=''
prevlen=${#string}
for c in '*' '^' 'M'
do
patt+="*\\$c"
sub="${string##$patt}"
sublen="${#sub}"
if $flag
then
flag=false
((compare = prevlen - sublen ))
else
if (( prevlen - sublen != compare ))
then
printf '%s\n' "$string is NOT symmetrical"
sym=false
break
fi
fi
prevlen=$sublen
done
if $sym
then
printf '%s\n' "$string IS symmetrical"
fi
done
To read from a file, change the first for loop to while read -r string and add < filename after the last done on the same line.
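For comparison, the same check can be sketched in awk, which reads the file directly: peel off the leading run of *, then ^, then M, and require equal run lengths with nothing left over. This is an alternative take, not one of the answers above. The name check_balance and the + / - output markers are assumptions.

```shell
#!/bin/sh
# check_balance is a hypothetical name; it prints "+ line" for balanced
# lines and "- line" for problem lines.
check_balance() {
    awk '{
        rest = $0
        # length of the leading run of each symbol, 0 if the run is missing
        n1 = match(rest, /^[*]+/) ? RLENGTH : 0; rest = substr(rest, n1 + 1)
        n2 = match(rest, /^\^+/)  ? RLENGTH : 0; rest = substr(rest, n2 + 1)
        n3 = match(rest, /^M+/)   ? RLENGTH : 0; rest = substr(rest, n3 + 1)
        ok = (n1 > 0 && n1 == n2 && n2 == n3 && rest == "")
        print (ok ? "+ " : "- ") $0
    }' "$@"
}
```

Running check_balance ASCIIfile | grep '^-' would then list only the problem lines.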