Bash - Find variable in many .txt files and calculate statistics - regex

I have many .txt files in a folder. They are full of statistics, and have a name that's representative of the experiment those statistics are about.
exp_1_try_1.txt
exp_1_try_2.txt
exp_1_try_3.txt
exp_2_try_1.txt
exp_2_try_2.txt
exp_other.txt
In those files, I need to find the values of a variable with a specific name, and use them to calculate some statistics: min, max, avg, std dev and median.
The variable is a decimal value and dot "." is used as a decimal separator. No scientific notation, although it would be nice to handle that as well.
#in file exp_1_try_1.txt
var1=30.523
var2=0.6
#in file exp_1_try_2.txt
var1=78.98
var2=0.4
#in file exp_1_try_3.txt
var1=78.100
var2=1.1
In order to do this, I'm using bash. Here's an old script I made before my bash skills got rusty. It calculates the average of an integer variable.
#!/bin/bash
folder=$1
varName="nHops"
cd "$folder"
grep -r -n -i --include="*_out.txt" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+))|.*/\2/' | awk '{count1+=$1; count2+=$1+1}END{print "avg hops:",count1/NR; print "avg path length:",count2/NR}' RS="\n"
I'd like to modify this script to:
support finding decimal values of variable length
calculate more statistics
In particular std dev and median may require special attention.
Update: Here's my try to solve the problem using only UNIX tools, partially inspired by this answer. It works fine, except it does not calculate the standard deviation. The chosen answer uses Perl and is probably much faster.
#!/bin/bash
folder=$1
varName="var1"
cd "$folder"
grep -r -n -i --include="exp_1_run_*" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+(\.[0-9]*)?))/\2/' | sort -n | awk '
BEGIN {
count = 0;
sum = 0;
}
{
a[count++] = $1;
sum += $1;
}
END {
avg = sum / count;
if( (count % 2) == 1 ) {
median = a[ int(count/2) ];
} else {
median = ( a[count/2] + a[count/2-1] ) / 2;
}
OFS="\t";
OFMT="%.6f";
print avg, median, a[0], a[count-1];
}
'
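A sketch of the missing standard deviation, added to the END block above once avg is known (it assumes the values are already collected in a[] as in the script):
sumsq = 0;
for (i = 0; i < count; i++) {
    sumsq += (a[i] - avg) ^ 2;
}
stddev = sqrt(sumsq / count);        # population std dev
# use sqrt(sumsq / (count - 1)) for the sample std dev instead
print avg, median, stddev, a[0], a[count-1];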

To extract just the values, use the -o and -P grep options:
grep -rioPh --include="*_out.txt" "(?<=${varName}=)[\d.]+" .
That looks for a pattern like nHops=1.234 and just prints out 1.234
Given your sample data:
$ var="var1"
$ grep -oPh "(?<=$var=)[\d.]+" exp_1_try_{1,2,3}.txt
30.523
78.98
78.100
To output some stats, you should be able to pipe those numbers into your favourite stats program. Here's an example:
grep -oPh "(?<=$var=)[\d.]+" f? |
perl -MStatistics::Basic=:all -le '
#data = <>;
print "mean: ", mean(#data);
print "median: ", median(#data);
print "stddev: ", stddev(#data)
'
mean: 62.53
median: 78.1
stddev: 22.64
Of course, since this is perl, we don't need grep or sed at all:
perl -MStatistics::Basic=:all -MList::Util=min,max -lne '
    /'"$var"'\s*=\s*(\d+\.?\d*)/ and push @data, $1
} END {
    print "mean: ", mean(@data);
    print "median: ", median(@data);
    print "stddev: ", stddev(@data);
    print "min: ", min(@data);
    print "max: ", max(@data);
' exp_1_try_*
mean: 62.53
median: 78.1
stddev: 22.64
min: 30.523
max: 78.98
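If GNU datamash happens to be installed (an assumption about your environment), the same extraction pipeline yields all the statistics without Perl:
grep -oPh "(?<=$var=)[\d.]+" exp_1_try_* |
    datamash mean 1 median 1 pstdev 1 min 1 max 1
This prints one tab-separated line with the mean, median, population standard deviation, minimum and maximum of column 1.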

Related

reading and analyzing a text file with bash script

I want to read a log file and extract the 5-6 digit number written right after the keyword "salary". Then I want to check whether the salary is above 2000. If even one salary is above 2000, it is an MNC; otherwise it is unknown. After the salary the line usually ends, but sometimes there is an email option.
My script currently looks like this.
salary=$(grep -o 'salary [1-9][0-9]\+$' tso.txt | grep -o '[0-9]\+')
echo $salary
if [ $salary > 2000 ]; then echo "it is mnc....."; else ":it is unknown....."; fi
This can be done with a simple awk script like this:
awk '
{
    for (i=2; i<=NF; ++i)
        if ($(i-1) == "salary" && $i+0 > 2000) {
            mnc = 1
            exit
        }
}
END {
    print (mnc ? "it is mnc....." : "it is unknown.....")
}' file
As you seem to be using GNU grep, you can get the salary value directly with grep -oP 'salary *0*\K[1-9][0-9]*' and then you can check the salary with if [ "$salary" -gt 2000 ].
See the demo:
#!/bin/bash
tso='salary 23000'
salary=$(grep -oP 'salary *0*\K[1-9][0-9]*' <<< "$tso")
echo $salary # => 23000
if [ "$salary" -gt 2000 ]; then
echo "it is mnc.....";
else
echo "it is unknown.....";
fi
# => it is mnc.....
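If tso.txt can contain several salary lines and you want "MNC if even one is above 2000" as the question states, here is a loop sketch over all matches (the file name and messages are taken from the question):
#!/bin/bash
verdict="it is unknown....."
while read -r salary; do
    if [ "$salary" -gt 2000 ]; then
        verdict="it is mnc....."
        break
    fi
done < <(grep -oP 'salary *0*\K[1-9][0-9]*' tso.txt)
echo "$verdict"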

Search strings from bulk data

I have a folder with many files containing text like the following:
blabla
chargeableDuration 00 01 03
...
timeForStartOfCharge 14 55 41
blabla
...
blabla
calledPartyNumber 123456789
blabla
...
blabla
callingPartyNumber 987654321
I require the output like:
987654321 123456789 145541 000103
I have been trying with the following awk:
awk -F '[[:blank:]:=,]+' '/findstr chargeableDuration|dateForStartOfCharge|calledPartyNumber|callingPartyNumber/ && $4{
if (calledPartyNumber != "")
print dateForStartOfCharge, "NIL"
dateForStartOfCharge=$5
next
}
/calledPartyNumber/ {
for(i=1; i<=NF; i++)
if ($i ~ /calledPartyNumber/)
break
print chargeableDuration, $i
chargeableDuration=""
}' file
Cannot make it work. Please help.
Assuming your text is in a file named "test.txt", the Linux shell command below will do the work for you.
egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
Pretty much like Manish's answer:
tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
The only difference is that it keeps the order of appearance (reversed by tac) instead of sorting the result numerically. For your example both solutions happen to produce the same output, but on other inputs they could differ.
awk '/[0-9 ]+$/{
    x = substr($0, (index($0," ") + 1));
    gsub(" ", "", x);
    a[$1] = x
}
END {
    split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration", b, " ");
    for (i=1; i<=4; i++){
        printf a[(b[i])]" "
    }
}' file
/[0-9 ]+$/ : Find lines ending with numbers, separated with or without spaces.
x=substr($0,( index($0," ") + 1 ) ) : Get the index after the first space in $0 and save the substring after that first space (i.e. the digits) to a variable x.
gsub(" ","",x) : Remove the whitespace in x.
a[$1]=x : Create an array a indexed by $1 (the keyword) and assign x to it.
END:
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ") : Create an array b where indexes 1, 2, 3 and 4 hold your required field names in the order you need.
for (i=1;i<=4;i++){
    printf a[(b[i])]" "
} : A for loop to print the values in array a indexed by b[1], b[2], b[3] and b[4].
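Run against the sample text (saved here as file, and with the program above saved as parse.awk, both assumed names), it prints the four values in the required order:
$ awk -f parse.awk file
987654321 123456789 145541 000103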

Parsing a .csv-like file in bash

I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, which works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
Is there a better and simpler way to do the job?
In particular I don't know how to fix this:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespace and, if so, the second call to gawk will truncate the string.
Nor do I know how to print all the fields "from 2 to NF", keeping the delimiter, which can occur several times in succession.
Thanks very much,
Goodbye
EDIT:
As asked, here there is some sample data:
(It is an exercise, sorry for the contrived data)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't contain a comma.
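If the key can itself contain commas (the "fields 2 to NF" case from the question), here is a sketch that strips only the first field and counts everything after it verbatim:
awk '{ key = $0; sub(/^[^,]*,/, "", key); x[key]++ } END { for (i in x) print x[i] "," i }' input.csv
Note this keys on fields 2 through NF together, so it matches the plain 2nd-column version only when there are exactly two columns.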
You can write your final awk as:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
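For example, on a typical line of uniq -c output:
$ printf '      3 da vinci\n' | sed 's/ *\([0-9]*\) /\1,/'
3,da vinci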
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The @F autosplit array starts at index $F[0], while awk fields start at $1.

filtering some text from line using sed linux

I have a following content in the file:
NAME=ALARMCARDSLOT137 TYPE=2 CLASS=116 SYSPORT=2629 STATE=U ALARM=M APPL=" " CRMPLINK=CHASSIS131 DYNDATA="GL:1,15 ADMN:1 OPER:2 USAG:2 STBY:0 AVAL:0 PROC:0 UKNN:0 INH:0 ALM:20063;1406718801,"
I just want to filter out the NAME, SYSPORT and ALM fields using sed.
Try the sed command below to filter out the NAME, SYSPORT and ALM fields:
$ sed 's/.*\(NAME=[^ ]*\).*\(SYSPORT=[^ ]*\).*\(ALM:[^;]*\).*/\1 \2 \3/g' file
NAME=ALARMCARDSLOT137 SYSPORT=2629 ALM:20063
Why not use grep?
grep -oE 'NAME=\S*|SYSPORT=\S*|ALM:[^;]*'
Test with your text:
kent$ echo 'NAME=ALARMCARDSLOT137 TYPE=2 CLASS=116 SYSPORT=2629 STATE=U ALARM=M APPL=" " CRMPLINK=CHASSIS131 DYNDATA="GL:1,15 ADMN:1 OPER:2 USAG:2 STBY:0 AVAL:0 PROC:0 UKNN:0 INH:0 ALM:20063;1406718801,"'|grep -oE 'NAME=\S*|SYSPORT=\S*|ALM:[^;]*'
NAME=ALARMCARDSLOT137
SYSPORT=2629
ALM:20063
Here is another awk:
awk -F" |;" -v RS=" " '/NAME|SYSPORT|ALM/ {print $1}' file
NAME=ALARMCARDSLOT137
SYSPORT=2629
ALM:20063
Whenever there are name=value pairs in input files, I find it best to first create an array mapping the names to the values, and then operate on the array using the names of the fields you care about. For example:
$ cat tst.awk
function bldN2Varrs( i, fldarr, fldnr, subarr, subnr, tmp ) {
    for (i=2; i<=NF; i+=2) { gsub(/ /,RS,$i) }
    split($0, fldarr, /[[:blank:]]+/)
    for (fldnr in fldarr) {
        split(fldarr[fldnr], tmp, /=/)
        gsub(RS, " ", tmp[2])
        gsub(/^"|"$/, "", tmp[2])
        name2value[tmp[1]] = tmp[2]
        split(tmp[2], subarr, / /)
        for (subnr in subarr) {
            split(subarr[subnr], tmp, /:/)
            subName2value[tmp[1]] = tmp[2]
        }
    }
}
function prt( fld, subfld ) {
    if (subfld) print fld "/" subfld "=" subName2value[subfld]
    else        print fld "=" name2value[fld]
}
BEGIN { FS=OFS="\"" }
{
    bldN2Varrs()
    prt("NAME")
    prt("SYSPORT")
    prt("DYNDATA", "ALM")
}
$ awk -f tst.awk file
NAME=ALARMCARDSLOT137
SYSPORT=2629
DYNDATA/ALM=20063;1406718801,
and if 20063;1406718801, isn't the desired value for the ALM field and you just want some subsection of it, simply tweak the array construction function to suit whatever your criteria are.
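For example, here is a sketch of one such tweak, placed in prt() rather than the construction function and keeping only the count before the ";" (not part of the original script):
function prt( fld, subfld, val ) {
    if (subfld) {
        val = subName2value[subfld]
        sub(/;.*/, "", val)    # drop everything from the ";" on
        print fld "/" subfld "=" val
    }
    else print fld "=" name2value[fld]
}
which would print DYNDATA/ALM=20063 instead.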

How to add ".2" in my bash script?

My bash script is:
read -p "num 1: " num1
read -p "num 2: " num2
tmbk=$(echo $num1 + $num2 | bc | sed '
s/^\./0./ # .2 -> 0.2
s/^-\./-0./ # -.2 -> -0.2
s/\.0*$// # 2.000 -> 2
');
printf "result : %'d\n" $tmbk
I use printf "%'d\n" to group the digits in threes with a separator. If I use printf "%s\n" with a string, the command does not add the separator.
My question:
if I input 0.1 in num1 and 0.1 in num2, why does the result look like this?
printf : 0.2: invalid number
result : 0
I want my bash script to print result: 0.2 and not invalid number
%d is for integers. Try %f instead.
How about doing it this way?
echo "num 1 :"
read num1
echo "num 2 :"
read num2
awk -v a="$num1" -v b="$num2" 'BEGIN{print "result:" a+b}';
If you need a certain output format, you can use printf in awk.
So you want to see "." as a thousands separator but also as a decimal point?
Bad idea, because then you can't determine whether 1.234 is a float or an integer.
Locales are there for handling such things (this requires the locales in question to be installed):
for loc in C en_US de_DE de_CH; do
LC_NUMERIC=$loc
printf "%'d\t%'f\t%s\n" 1234 1234 $loc
done
Result:
1234 1234.000000 C
1,234 1,234.000000 en_US
1.234 1.234,000000 de_DE
1'234 1'234.000000 de_CH
As you can see, none of these locales uses the same character for the thousands separator and the decimal point, and that's good.
Once you've chosen a proper locale, you can only agree with Kent: awk is better than bc if you don't like bc's formatting.
Your requirement is a bit strange: "integers with a point separator (1.000.000)". What have you been working on?
Also, I would make a small addition to the line: echo "scale=4; $num1 + $num2" | bc
for the "invalid number output" :: the printf for bash uses the same formating that's available in the printf() function of C , as part of libc library.... hence
%d , %i : stands for integers
%g , %f : stands for floating point ... likewise ,
this uses the same validations that the printf() would use in a c - program , hence puts the comment "invalid number" on encountring a float where it expects a integer , as in the following :
Kaizen ~
$ printf "result : %'d\n" 2.3
-bash: printf: 2.3: invalid number
result : 2
Kaizen ~
$ printf "result : %'li\n" 2.3
-bash: printf: 2.3: invalid number
result : 2
I do agree with what @Ignacio has suggested: if you are going to print floating point values, then you should use %g, or better %f, in your code. The following should work fine for all scenarios in your code:
Kaizen ~
$ printf "result : %'f\n" 2.3
result : 2.300000
Kaizen ~
$ printf "result : %'g\n" 2.3
result : 2.3
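Putting that together with the original script, here is a sketch; it assumes the en_US.UTF-8 locale is installed and keeps bc for the addition:
#!/bin/bash
read -p "num 1: " num1
read -p "num 2: " num2
sum=$(echo "$num1 + $num2" | bc)
# %'.1f groups thousands per the locale and accepts decimals
LC_NUMERIC=en_US.UTF-8
printf "result : %'.1f\n" "$sum"
With inputs 0.1 and 0.1 this prints result : 0.2, and larger sums get the locale's thousands separator (a comma here, not the point the original script tried to produce).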