count of extracted text for each number - regex

I have a text file with a lot of SQL queries that look something like this...
select * from sometable where customernos like '%67890%';
select name, city from sometable where customernos like '%67890%';
select * from othertable where customernos like '%12345%';
I can get the count for a single number using a command like this...
grep 67890 file.txt | wc -l
But is there any way I can get a report with the count for every customer number, like...
12345 1
67890 2

Could you please try the following.
awk '
match($0,/%[0-9]{5}/){
val[substr($0,RSTART+1,RLENGTH-1)]++
}
END{
for(i in val){
print i,val[i]
}
}' Input_file
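For the shown samples, the output will be as follows (awk does not guarantee the order of a for-in loop, so pipe the result through sort if the order matters).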
12345 1
67890 2
Explanation: Adding explanation for above.
awk ' ##Starting awk program from here.
match($0,/%[0-9]{5}/){ ##Using match function to match a % followed by 5 digits here.
val[substr($0,RSTART+1,RLENGTH-1)]++ ##Creating val with index of sub-string of matched regex above.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through val here.
print i,val[i] ##Printing value of i and value of array val with index i here.
}
}' Input_file ##Mentioning Input_file name here.
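As a minimal sketch of the same counting idea, the snippet below runs against the three sample queries from the question. It is a variation rather than the answer's exact program: the regex is written without `{5}` on the assumption that some older awk builds lack interval expressions, it requires the closing `%` as well, and the temp file and the `counts` variable are names made up for this demo. The final `sort -n` is there because the for-in order is unspecified.

```shell
# Sample queries from the question, written to a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
  "select * from sometable where customernos like '%67890%';" \
  "select name, city from sometable where customernos like '%67890%';" \
  "select * from othertable where customernos like '%12345%';" > "$tmp"

# Match %NNNNN% and count each 5-digit number (the text between the % signs).
counts=$(awk '
match($0, /%[0-9][0-9][0-9][0-9][0-9]%/) {
    val[substr($0, RSTART + 1, RLENGTH - 2)]++
}
END {
    for (i in val) print i, val[i]
}' "$tmp" | sort -n)
rm -f "$tmp"

echo "$counts"
```

This only counts the first customer number on each line, which matches the sample data.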

This might work for you (GNU grep,sort,uniq and awk):
grep -Eo '\b[0-9]{5}\b' file | sort -n | uniq -c | awk '{print $2,$1}'
Find 5 digit numbers, sort them, filter and count them and then reverse the columns.
Just for fun, here is a sed solution:
sed -nE 'H;$!d;x;s/[^0-9]/ /g;s/ +/ /g;
:a;x;s/.*/1/;x;tb;
:b;s/^(( \S+\b).*)\2\b/\1/;Tc;x;s/.*/expr & + 1/e;x;tb;
:c;G;s/^ (\S+)(.*)\n(.*)/\1 \3\n\2/;/^[0-9]{5} /P;s/.*\n//;/\S/ba' file
Slurp the file into memory.
Space separate numbers.
Reduce multiple occurrences of the first number to one and count the occurrences.
Print the first number and its occurrences if it fits the criteria.
Repeat with all other numbers.

Related

Edited: Grep/Awk- Print specific info from table

(This example is edited, following a user's recommendation, considering a mistake in my table display)
I have a .csv table from where I need certain info. My table looks like this:
Name, Birth
James,2001/02/03 California
Patrick,2001/02/03 Texas
Sarah,2000/03/01 Alabama
Sean,2002/02/01 New York
Michael,2002/02/01 Ontario
From here, I would need to print only the unique birthdates, in an ascending order, like this:
2000/03/01
2001/02/03
2002/02/01
I have thought of a regular expression to identify the dates, such as:
awk '/[0-9]{4}/[0-9]{2}/[0-9]/{2}/' students.csv
However, I'm getting a syntax error in the regex, and I wouldn't know how to follow from this step.
Any hints?
Use cut and sort with -u option to print unique values:
cut -d' ' -f2 students.csv | sort -u > out_file
You can also use grep instead of cut:
grep -Po '\d\d\d\d/\d\d/\d\d' students.csv | sort -u > out_file
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
SEE ALSO:
perlre - Perl regular expressions
Here is a gnu awk solution to get this done in a single command:
awk 'NF > 2 && !seen[$2]++{} END {
PROCINFO["sorted_in"]="@ind_str_asc"; for (i in seen) print i}' file
2000/03/01
2001/02/03
2002/02/01
Using any awk and whether your names have 1 word or more and whether blank chars exist after the commas or not:
$ awk -F', *' 'NR>1{sub(/ .*/,"",$2); print $2}' file | sort -u
2000/03/01
2001/02/03
2002/02/01
With your shown samples, could you please try following. Written and tested in GNU awk, should work in any awk though.
awk '
match($0,/[0-9]{4}(\/[0-9]{2}){2}/){
arrDate[substr($0,RSTART,RLENGTH)]
}
END{
for(i in arrDate){
print i
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(\/[0-9]{2}){2}/){ ##using match function to match regex to match only date format.
arrDate[substr($0,RSTART,RLENGTH)] ##Creating array arrDate which has index as sub string of matched one.
}
END{ ##Starting END block of this awk program from here.
for(i in arrDate){ ##Traversing through arrDate here.
print i ##Printing index of array here.
}
}
' Input_file ##Mentioning Input_file name here.
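The syntax error in the question comes from the unescaped slashes: inside an awk `/.../` regex literal, each `/` in the pattern has to be written as `\/`. The sketch below shows one way to correct the original attempt, using the sample table from the question. The digit classes are spelled out instead of using `{4}` and `{2}` on the assumption that the awk in use might not support interval expressions; the temp file and the `dates` variable are only for illustration.

```shell
# Sample table from the question, written to a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
  'Name, Birth' \
  'James,2001/02/03 California' \
  'Patrick,2001/02/03 Texas' \
  'Sarah,2000/03/01 Alabama' \
  'Sean,2002/02/01 New York' \
  'Michael,2002/02/01 Ontario' > "$tmp"

# Slashes inside the regex literal are escaped; only the matched date is printed.
dates=$(awk '
match($0, /[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]/) {
    print substr($0, RSTART, RLENGTH)
}' "$tmp" | sort -u)
rm -f "$tmp"

echo "$dates"
```

Because the dates are in year/month/day form with zero padding, a plain `sort -u` also puts them in ascending date order.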

Extract Number from Constant Output in Bash

I have a script that produces this kind of output stream in an infinite loop:
m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91 gpu/11 29.08 gpu/12 29.68 [A1484+5:R0+0:F0] Time: 04:19
I want to extract the integer after "Speed", which is 377 in this case. So far I have, suppose the string is named string:
$string | grep -oP '(?<=Speed).*'
I got
377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91
I want to get rid of the trailing string by executing:
$string | grep -oP '(?<=Speed).*' | grep -o -E '[1-9][0-9][0-9]*'
but that regular expression is wrong, it doesn't come out with anything. How can I fix this?
regards
You may use
grep -Po 'Speed\s*\K\d+'
Or, to also get the fractional part if it is necessary
grep -Po 'Speed\s*\K\d+(\.\d+)?'
See the online demo
Details
Speed - a literal substring
\s* - 0+ whitespaces
\K - a match reset operator (discarding all text matched so far from the match value)
\d+ - 1+ digits
(\.\d+)? - an optional sequence of a . and 1+ digits
If the output is always like that (i.e. no extra lines in between), a simple cut -d' ' -f6 will do the job.
awk 'match($0,"Speed [0-9]+.?[0-9]*"){print substr($0,RSTART+6,RLENGTH-6)}'
sed '/Speed/s/.*Speed \([^ ]*\).*/\1/'
and if each line is always the same way formatted, you can do:
awk '{print $6}' file
This means, that every line always has the word speed in column 5 and you want to print column 6.
Could you please try the following, assuming your Input_file is the same as the shown samples.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file
In case you want to save the output into Input_file itself, try the following.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file > temp_file && mv temp_file Input_file
Explanation: Adding explanation too here.
awk ' ##awk script starts from here.
{
sub(/.*Speed /,"") ##Using sub for substitution operation which will substitute from starting of line to till Speed string with NULL for current line.
sub(/ .*/,"") ##Using sub for substitution of everything starting from space to till end in current line with NULL.
}
1 ##Mentioning 1 will print edited/non-edited lines in Input_file.
' Input_file ##Mentioning Input_file name here.
sed works too.
$: echo $string | sed -En '/ Speed /{ s/.* Speed ([0-9]+).*/\1/; p; }'
377
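Several of the answers above depend on the speed sitting in a fixed column. A sketch that does not make that assumption is to look for the field that is exactly `Speed` and print the integer part of the field after it. The sample line is a shortened copy of the one in the question, and the `line` and `speed` variable names are made up for the example.

```shell
# Shortened sample line from the question.
line='m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 [A1484+5:R0+0:F0] Time: 04:19'

# Find the field "Speed" and print the next field truncated to an integer.
speed=$(printf '%s\n' "$line" | awk '{
    for (i = 1; i < NF; i++)
        if ($i == "Speed") { print int($(i + 1)); next }
}')

echo "$speed"
```

On a continuous stream the same awk program prints one integer per line that contains the `Speed` field and silently skips the rest.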

grep line with exact pattern in first column

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that in foo.txt, when I have for instance Contig1, the script will return the column 2 of the file my_annot even if the pattern found is Contig12 and not Contig1 only!
I tried with $ at the end of the pattern but the problem is that it corresponds to end of line while this expression I search is in column 1 and therefore not end of line.
How can I tell to search this EXACT pattern and not those that contain this pattern?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' $1 $annot > out
It allows me to give the list of expression I want to find as first argument doing ./myscript.sh mylist
And I redirect the result in a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read line; do grep -w $line my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.
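To see the exact-match behaviour of the two-file awk approach on concrete data, here is a small sketch. The file contents are invented for the demo (the question gives no sample data), and it assumes whitespace-separated columns with the name in column 1.

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'Contig1 hugo' 'Contig12 paul' 'Contig2 anna' > "$dir/my_annot"
printf '%s\n' 'Contig1' > "$dir/foo.txt"

# First file: remember each wanted name as an array key.
# Second file: print column 2 only when column 1 is exactly one of those keys.
exact=$(awk 'NR == FNR { want[$1]; next } $1 in want { print $2 }' \
    "$dir/foo.txt" "$dir/my_annot")
rm -r "$dir"

echo "$exact"
```

Contig12 is not printed because `$1 in want` is a whole-string key lookup, not a substring or regex match. Note that the `NR == FNR` idiom relies on the first file being non-empty.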

Basic grep/sed/awk script to find duplicates

I'm starting out with regular expressions and grep and I want to find out how to do this. I have this list:
1. 12493 6530
2. 12475 5462
3. 12441 5450
4. 12413 5258
5. 12478 4454
6. 12416 3859
7. 12480 3761
8. 12390 3746
9. 12487 3741
10. 12476 3557
...
And I want to get the contents of the middle column only (so NF==2 in awk?). The delimiter here is a space.
I then want to find which numbers are there more than once (duplicates). How would I go about doing that? Thank you, I'm a beginner.
Using awk :
awk '{count[$2]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file
But you don't have duplicate numbers in the 2nd column.
the second column in awk is $2
count[$2]++ increments an array value with the current number as key
the END block is executed at the end, and we test each array value to find those greater than 1
More concisely (credit to jthill):
awk '++count[$2]==2{print $2}' file
Using perl:
perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }'
Iterate the lines and build a hash (%h/$h{...}) with the count (++) of the second column values ($F[1]), and after that (END{ ... }) say all hash keys with count ($h{$_}) which is > 1.
With the data stored in test,
Using a combination of awk, sort, uniq and sed commands
awk -v x=2 '{print $x}' test | sort | uniq -c | sed '/^ *1 /d' | awk -v x=2 '{print $x}'
Explanation:
awk -v x=2 '{print $x}'
selects the 2nd column
sort | uniq -c
counts the appearances of each number (uniq only collapses adjacent lines, hence the sort)
sed '/^ *1 /d'
deletes all the entries with only one appearance (uniq -c pads the count with leading spaces, so the pattern allows for them)
awk -v x=2 '{print $x}'
removes the count again, leaving only the number
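Since the list in the question has no repeated values, here is a sketch of the one-line awk version on made-up data that does contain duplicates. The input lines and the `dups` variable are invented for illustration.

```shell
# Hypothetical input in the same "N. value value" layout, with repeats in column 2.
dups=$(printf '%s\n' \
  '1. 12493 6530' \
  '2. 12475 5462' \
  '3. 12493 5450' \
  '4. 12413 5258' \
  '5. 12475 4454' \
  '6. 12493 3859' |
  awk '++count[$2] == 2 { print $2 }')

echo "$dups"
```

Testing for `== 2` rather than `> 1` means each duplicated value is printed once, at the moment its second occurrence is seen, even if it appears three or more times.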

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
thanks to kurumi i managed to write something like this:
awk '{
for(i=1; i<=NF; i++) {
tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
if(tmp) {
print $i
}
}
}' $1
and this is what i needed :) thanks a lot!
This is the very basic
awk '/pattern/{ print $0 }' file
ask awk to search for pattern using //, then print out the line, which by default is called a record, denoted by $0. At least read up the documentation.
If you only want to print out the matched word:
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
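One possible way to handle multiple matches on the same line is to call match() in a loop and cut off the part of the line that has already been consumed. This is only a sketch: the sample lines and the `rest` and `all` variable names are invented, and it assumes the regex never matches the empty string (otherwise the loop would not advance).

```shell
# Print every match of /a.t/ on every line, one match per output line.
all=$(printf '%s\n' 'cat bat art' 'no hit' 'a-t act' | awk '{
    rest = $0
    while (match(rest, /a.t/)) {
        print substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)   # continue after this match
    }
}')

echo "$all"
```

The first line yields only `art`, because `cat` and `bat` have no character between the `a` and the `t`.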
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another nice solution using a lookbehind regex in grep instead of awk. This solution has lower requirements to your installation:
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Slightly off topic: this can also be done with grep. Posting it here in case anyone is looking for a grep solution.
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace line with matched group "yyy" from line):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
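If only an exact match on the column is wanted, anchoring the regex (or comparing with `==`) avoids those extra hits. A minimal sketch on the three lines above, with `only` as a made-up variable name:

```shell
# Only the line whose second column is exactly "yyy" produces output.
only=$(printf '%s\n' 'xxx yyy zzz' 'xxx yyyz zzz' 'xxx zyyyz' |
  awk '$2 ~ /^yyy$/ { print $2 }')

echo "$only"
```

For a fixed string, `$2 == "yyy"` does the same job without involving a regex at all.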
echo "abc123def" | awk '
function MATCH(haystack, needle, ltrim, rtrim)
{
if(ltrim == 0 && !length(ltrim))
ltrim = 0;
if(rtrim == 0 && !length(rtrim))
rtrim = 0;
return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
print $0 " - " MATCH($0, "123"); # 123
print $0 " - " MATCH($0, "[0-9]*d", 0, 1); # 123
print $0 " - " MATCH($0, "1234"); # Nothing printed
}'