Awk: make an average in file with both number and string - regex

here is my problem (not sure my title was clear): I have to display the average of the numbers in a file. However, there are strings in the file too.
file: test
Richie;jack;27 Yo;07Richiej#gmail.com
Cash;tom;29 Yo;Ctom01#gmail.com
Megane;susan;37 Yo;meganeSusan#gmail.com
...
It has to display the average age of the people in my file; I'm not supposed to know how many people there are.
I thought about using a regex to get only the number in my 3rd field, but I got errors each time.
awk 'BEGIN{FS=";"} / /

To compute the average of the number in the third column:
$ awk -F\; '{s+=$3} END{print s/NR}' test
31
How it works
-F\;
This tells awk to use ; as the field separator. Because ; is a shell-active character, we have to either escape it (as shown above) or quote it.
s+=$3
For each line read, this adds the number in the third column to s. Because += is an arithmetic operation, awk converts the third field to a number.
This code also illustrates awk's automatic conversion of fields to numbers:
$ awk -F\; '{printf "field=\"%s\" number=%s\n", $3, $3+0}' test
field="27 Yo" number=27
field="29 Yo" number=29
field="37 Yo" number=37
When we print $3, we get the full string including the Yo. When we print $3+0, the conversion to a number is forced and, as shown above, we just get the number.
END{print s/NR}
After we have reached the end of the file, this prints the total of the third column, saved in s, divided by the number of lines read, NR.
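One edge case worth guarding against: on an empty file NR is 0 and s/NR divides by zero. A sketch of a guarded variant, with the sample data from the question recreated inline (the email field is shortened for brevity):

```shell
# Recreate the sample file's three lines, then average field 3.
# "if (NR)" avoids a division-by-zero on empty input.
printf 'Richie;jack;27 Yo;a\nCash;tom;29 Yo;b\nMegane;susan;37 Yo;c\n' |
awk -F\; '{s+=$3} END{if (NR) print s/NR; else print "no data"}'
# prints 31
```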

Related

Extract multiple independent regex matches per line

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions more rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of following criteria would be a bonus but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost always first, but I am unsure how to check). In the output, however, the XC-string should always come before the XM-string, if at all possible.
The answers here (awk extract multiple groups from each line) come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for lines where the values are reversed, swapping the captured groups so the XC value still prints first:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/;p;' file
The following awk solution may also help:
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear
You can try this sed
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
explanation :
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;tA
# if XC:Z: comes before XM:Z:, the substitution succeeds, so branch to A
x
# otherwise restore the line from the hold space
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XM:Z: comes before XC:Z:; capture the interesting parts and reorder them
b
# that is all for this line
:A
x
# restore the line from the hold space
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XC:Z: comes before XM:Z:; capture the interesting parts
' infile
another awk
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # if didn't match check other other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristics.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last of the repeated matches before the first match of the other. If they appear in pairs, it will print the first pair.
Using POSIX awk, you can only use the string function match(s, ere) as defined by IEEE Std 1003.1-2008:
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character advancing the searched string). When the pattern is found in the line $0, then you only need to extract the important parts, which start 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC, and the third prints the outcome with a comma separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
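A quick run of the command above on two of the sample lines (one with XC leading, one with XC after another tag) shows the extraction:

```shell
# Demo of the POSIX-awk extraction on sample data from the question.
printf 'MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33\nMOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28\n' |
awk '{match($0,/XM:Z:[^[:blank:]]*/); xm=substr($0,RSTART+5,RLENGTH-5)}
     {match($0,/XC:Z:[^[:blank:]]*/); xc=substr($0,RSTART+5,RLENGTH-5)}
     {print xc","xm}'
# TGGTCGGCGCGT,GAGTCCGT
# TCCCCGGGTACA,GGGACGGG
```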
If you are willing to use gawk, then you could use the patsplit function for string operations (Ref. here). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you the requested strings directly, in a single call, including the XC:Z: or XM:Z: part. Afterwards you can easily order them and extract the last parts.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,6)","substr(xm,6)}' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although GNU awk handles it as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
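For example, on the two sample lines (single quotes keep $4 away from the shell):

```shell
# Default FS: any run of blanks separates fields, so column 4 is just $4.
printf '1979 1 -0.176 -0.185 -0.412\n1979 1 -0.176 0.185 -0.412\n' |
awk '{print $4}'
# -0.185
# 0.185
```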
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to value in $4 and does the default action, print.
PS: do not use cat with a program that can read data itself, like awk.
In case field 4 could itself be 0 (which would make $0=$4 false, so nothing would print), you can make it more robust like:
awk '{$0=$4}1' Data.txt
If you're trying to split the input on 3 or 4 spaces, then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as the FS value. This regex sets the field separator to three or four spaces. In regex, {min,max} is called a range (interval) quantifier; it repeats the previous token from min to max times.

Shell script, split string into everything before and after the last whitespace character

I'm having some issues with separating a string in a shell script. I've been trying similar bits of code I've found online for RegEx, perl, awk, grep etc... but I can't seem to get the required result.
Basically I have a number of strings. Most are in the following format:
long string, space, number e.g.
Something!Something_Something_#Something_Something 10
However a small number aren't all the one string (they should be!) but they have spaces instead of underscores, e.g.
Something!Something_Something_#Something Something 10
or
Something!Something - Something_#Something Something 10
Each string is then formatted as follows:
... |awk '{printf "%-100s %10d\n", $1, $2}' > file.out
which prints the correct result for the strings which contain no spaces
Something!Something_Something_#Something_Something 10
However in the case of the first example it only prints the following due to the space delimiter:
Something!Something_Something_#Something 10
So basically I need a way to pull out everything before the last " " space and assign it to $1 in the awk printf statement. Any help would be greatly appreciated!!!
It's a Solaris 5.10 server by the way.
Hackjob, but this will work:
awk '{x=$NF;NF--;printf "%-100s %10d\n", $0, x}'
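A minimal demonstration on a made-up line. Note that rebuilding $0 after decrementing NF is GNU awk behavior (POSIX leaves it unspecified), so test it on your Solaris awk first:

```shell
# The string part may contain spaces; only the trailing number is split off.
# x grabs the last field, NF-- drops it, and $0 is rebuilt without it.
echo 'Something!Something Something 10' |
awk '{x=$NF; NF--; printf "%-100s %10d\n", $0, x}'
```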

Extracting just a phone number from a file

I'm sure the answer to this is already online; however, I don't know what I am looking for. I just started a course in Unix/Linux and my dad asked me to make him something for his work. He has a text file, and on every fourth line there is a 10-digit number somewhere. How do I make a list of just the numbers? I assume the file looks something like this:
Random junk
Random junk fake number 1234567809
Random junk
My phone number is 1234567890 and it is here random numbers 32131;1231
Random junk
Random junk another fake number 2345432345
Random junk
Just kidding my phone number is here 1234567890 the date is mon:1231:31231
I assume it's something like grep [1-9].\{9\} file, but how do I get just lines 4, 8, 12, etc.? I tested it and I get the phone numbers on every line. Also, how do I get just the number, not the whole line?
Any help will be greatly appreciated, even if its pointing me in the right direction so i can research it myself. Thanks.
You can do it in two steps:
$ awk '!(NR%4)' file | grep -Eo '[0-9]{10}'
1234567890
1234567890
awk '!(NR%4)' file prints those lines whose number is multiple of 4. It is the same as saying awk '(NR%4==0) {print}' file.
grep -Eo '[0-9]{10}' prints the runs of exactly 10 digits. Note that -o is for "just print the matches" and -E enables extended regular expressions.
Or also
$ awk '!(NR%4)' file | grep -Eo '[1-9][0-9]{9}' # require a nonzero first digit
Using GNU sed:
sed -nr '0~4{s/.*\b([0-9]{10})\b.*/\1/p}' inputfile
Saying 0~4 produces every 4th line starting from the 0th line, i.e. produces every 4th line in the file. The substitution part is rather obvious.
For your sample input, it'd produce:
1234567890
1234567890
Since you are looking for one number per line, an awk solution would involve
awk '!(NR%4) && match($0, /[[:digit:]]{10}/){print substr($0, RSTART, RLENGTH)}' file
Using perl:
$ perl -nle 'print /([0-9]{10})/ if !($.%4)' file
1234567890
1234567890
To solve this properly, you should first know how long a phone number can be. You should also consider area codes and the digits a phone number may start with, so that your code filters only the most plausible numbers. Even then, if I write "My number is 028 2233 5674... Just kidding, it's 028 2233 9873.", the code will consider both numbers correct. So completely solving this, when the text contains fake numbers, is nearly impossible; an intelligent filter can only keep the candidates most likely to be real.

Print line after multiline match with sed

I am trying to create a script to pull an account code out of a file. The file itself is long and contains a lot of other data, but I have included below an excerpt of the part I am looking at (there is other content before and after this excerpt).
The section of the file I am interested in sometimes look like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
VIN No.
AAAAAA01 9999 1000 30 days
and sometimes it looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
AAAAAA01 9999 1000 30 days
(one field cut off the end, where that field had been wrapping down onto its own line)
I know I can use | tr -s ' ' | cut -d ' ' -f 1 to pull the code once I have the line it is on, but it is not on a fixed line number (the content before this section is dynamic).
I am starting by trying to handle the case with the extra field; I figure it will be easy enough to make that an optional match with ?
The number of spaces used to separate the fields can change as this is essentially OCRed.
A few of my attempts so far - (assume the file is coming in from STDIN)
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s\+VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\n\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\n\s*VIN No\.\s*/{n;p;}'
These all failed to match whatsoever
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*/,/\s\*VIN No\.\s*/{n;p;}'
This at least matched something, but frustratingly printed the VIN No. line, followed by every second line after it. It also seems like it would be more difficult to mark as an optional part of the expression.
So, given an input of the full file (including either of the above excerpts), I am looking for an output of either
AAAAAA01 9999 1000 30 days
(which I can then trim to the required data) or AAAAAA01 if there is an easier way of getting straight to that.
This might work for you (GNU sed):
sed -n '/Account/{n;/VIN No\./n;p}' file
Use sed with the -n switch; this makes sed act like grep, i.e. it only prints lines explicitly printed with the commands P or (in this case) p.
/Account/ match a line with the pattern Account
For the above match only:
n normally this would print the current line and then read the next line into the pattern space, but as the -n is in action no printing takes place. So now the pattern space contains the next line.
/VIN No\./n If the current line contains VIN No., discard the pattern space and read in the next line (again without printing, because of -n).
p print whatever is currently in the pattern space.
So this is a condition within a condition: when we encounter Account, print either the following line or the line following that.
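A quick check of both layouts from the question (with and without the wrapped VIN No. line):

```shell
# With the wrapped "VIN No." line present:
printf 'junk\nAccount Customer Order No.\nVIN No.\nAAAAAA01 9999 1000 30 days\n' |
sed -n '/Account/{n;/VIN No\./n;p}'
# AAAAAA01 9999 1000 30 days

# Without it:
printf 'junk\nAccount Customer Order No.\nAAAAAA01 9999 1000 30 days\n' |
sed -n '/Account/{n;/VIN No\./n;p}'
# AAAAAA01 9999 1000 30 days
```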
awk '/^\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.$/ {
getline;
if (/^\s*VIN No\.$/) getline;
print;
exit;
}'
Going strictly off your input, in both cases the desired field is on the last line. So to print the first field of the last line,
awk 'END {print $1}'
Result
AAAAAA01
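A quick sanity check of that behavior (it relies on $0 still holding the last record inside END, which GNU awk guarantees):

```shell
# In END, $1 refers to the first field of the last line read.
printf 'Account Customer Order No.\nAAAAAA01 9999 1000 30 days\n' |
awk 'END {print $1}'
# AAAAAA01
```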