Counting a unique string in a line - uniq

I'm trying to use "uniq -c" to count unique values in the second field of each line.
My file A has around 500,000 lines and looks like this:
File_A
30-Nov 20714 GHI 235
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 20714 GHI 235
...
The command I used
sort -k 2 File_A | uniq -c
The counts in the result I get don't match the lines.
How can I fix this problem, or is there another way to count unique strings in a line?
The result I get looks similar to this (I just made the numbers up):
70 30-Nov 10005 ABC 101
5 30-Nov 10355 DEF 111
55 30-Nov 20714 GHI 235

You also need to tell uniq to consider only that field, the same way you did with sort. Perhaps you can use -f or --skip-fields for that. The problem you then have is that uniq doesn't take a "number of fields to check".
Otherwise, if you don't need to keep the original string you can just:
cut -d' ' -f2 File_A | sort | uniq -c

Here are a couple - or three - other ways to do it. These solutions have the benefit that the file does not need to be sorted; instead they rely on hashes (associative arrays) to keep track of unique occurrences.
Method 1:
perl -ane 'END{print scalar keys %h,"\n"}$h{$F[1]}++' File_A
The "-ane" makes Perl loop through the lines in File_A, and sets elements of the array F[] equal to the fields of each line as it goes. So your unique numbers end up in F[1]. %h is a hash. The hash element indexed by $F[1] is incremented as each line is processed. At the end, the END{} block is run, and it simply prints the number of elements in the hash.
Method 2:
perl -ane 'END{print "$u\n"}$u++ if $h{$F[1]}++==1' File_A
Similar to the method above, but this time a variable $u is incremented each time incrementing the hash results in it becoming 1 - i.e. the first time we see that number.
I am sure @mpapec or @fedorqui could do it in half the code, but you get the idea!
Method 3:
awk 'FNR==NR{a[$2]++;next}{print a[$2],$0}END{for(i in a)u++;print u}' File_A File_A
Result:
2 30-Nov 20714 GHI 235
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
2 30-Nov 20714 GHI 235
3
This uses awk and runs through your file twice - that is why the filename appears twice at the end of the command. On the first pass, the block after "FNR==NR" runs: it increments the element of associative array a[] indexed by field 2 ($2), essentially counting the number of times each id in field 2 is seen. On the second pass, the second block runs and prints the total count for the current line's id, followed by the line itself. At the end, the END{} block counts the elements in associative array a[] and prints that number.
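If you only want one summary line per id, in the style of uniq -c, a single pass is enough. A minimal awk sketch, assuming the same File_A layout as above and keeping one representative line per key (output order is unspecified):
awk '{count[$2]++; line[$2]=$0} END {for (k in count) print count[k], line[k]}' File_A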

If your intention is to count the unique values in the second column, the one that has 20714, 10005, ... in it, then you need to extract it first using cut.
cut -d' ' -f 2 File_A | sort | uniq -c

Related

Using awk, how can I find the max value in one column, print it, then print the matching value in another column?

Let's say I have this data:
1 text1 1 1 5
2 text2 2 2 10
3 text3 3 3 15
4 text4 4 4 50
5 text5 5 5 25
I obtain the max value of column #5 with this code:
awk 'BEGIN {a=0} {if ($5>0+a) a=$5} END{print a}' data.txt
My question is, how do I add more parameters to that code in order to find the associated value in whatever column I choose (just one)? For example, I want to find the max value of column #5 and the associated value from column #2.
The output I want is:
50 text4
I don't know how to add more parameters in order to obtain the matching value.
The right way to do this is with this awk:
awk 'NR==1 || $5>max { max=$5; val=$2 } END { print max, val }' file
50 text4
This sets max=$5 and val=$2 for the first record, or whenever $5 is greater than the current max. In other words, whenever you find a new max, you save both the new max and the associated value from column #2.
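If you would rather choose the columns at run time than hard-code them, awk's -v option can pass them in. A small sketch (the variable names m, the column to maximize, and p, the column to print alongside it, are just for illustration):
awk -v m=5 -v p=2 'NR==1 || $m > max { max = $m; val = $p } END { print max, val }' data.txt
50 text4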
One idea, along with some streamlining of the current code:
$ awk '$5>(a+0) { a=$5; col2=$2 } END {print a, col2}' data.txt
50 text4
NOTE:
This assumes that at least one value in column #5 is positive. If all values in column #5 are negative, then $5>(a+0) will always be false and a (and col2) will never get set, which in turn means print a, col2 will print a line with a single space. A better solution would be to set a to the first value processed and then go from there (see anubhava's answer for an example).
An alternative using sort:
% sort -nk 5 file | tail -1 | awk '{print $5, $2}'
50 text4
With your shown samples, please try the following sort + awk option. GNU sort sorts the file numerically, in reverse, on the 5th column; its output is piped to awk, which reads the very first line (the one containing the max value), prints the required fields, and exits immediately to save time.
sort -s -rnk5 file1 | awk 'FNR==1{print $NF,$2;exit}'
50 text4

grep: single digit occurring exactly once in a line

I need help with one grep command:
- a single digit occurs exactly once in the line
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep colors only the second line. I want grep to color every line, because in each line some digit occurs exactly once: in the first line it is 5, in the second line it is 5, and in the third line it is 7.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all ten digits. The matched digit is captured in a group.
(Demo on regex101, with flags g and m.)
I'm really unsure what the expected output should be (please update the question properly), but here is an attempt using GNU awk. First, the test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
  delete a
  for (i = 1; i <= NF; i++)
    if ($i ~ /[0-9]/)
      a[$i]++
  for (i in a)
    if (a[i] == 1 && match($0, "[^" i "]*" i "[^" i "]*")) {
      print $0
      next # second data line has 2 matches
    }
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
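Here gsub(i, i, $0) replaces every occurrence of the digit i with itself and returns the number of replacements made, so the ==1 test is true exactly when that digit appears once on the line. Run against the test file foo above, it prints all four lines:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43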
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses 10 identical alternations, one for each digit. Each alternation:
- matches zero or more of anything but the digit at the start of the line,
- matches the one allowed digit,
- matches zero or more of anything but the digit up to the end of the line.
See it here at regex101.

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, made as simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on passing BASH/SHELL variables, but I don't want to learn Bash programming just to pass a value to a variable.
Bonus: I actually have to search for 125 values in each document, with 40 documents to evaluate. It can't hurt to ask for a bit more, but how would I take a separate file of those 125 values and pass them individually to someVariable?
I have all sorts of ways to do this in Bash, but I don't understand them, and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps with an array, since I don't believe awk has lists).
Thank you as I am tired of beating my head into a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
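As for the original, single-value form of the question: awk's -v option passes a shell value into an awk variable, so no Bash programming is needed beyond that. A minimal sketch, reusing the names from the question:
awk -v pat='^CHEM' '$0 ~ pat {print NR, $0}' file.txt
or, for an exact match on the first field:
awk -v key='CHEM' '$1 == key {print NR, $0}' file.txt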
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here, the lines between the two exclamation points are the input ranges, and as the data file I used the numbers 1..100. Notice the double quotes instead of single quotes around the awk script: they let the shell expand $b and $e inside it.
If you have the start and end values in the file ranges, and your data in the file data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this
while read b e; do awk "/^$b$/,/^$e$/ {print > $b$e}" data; done < ranges
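One caveat with that last form: the shell expands $b$e before awk parses the program, so awk ends up with a bare token such as 15 as the redirection target. That happens to work for numeric ranges (awk stringifies the number into a filename), but quoting the filename explicitly is safer - a sketch:
while read b e; do awk "/^$b$/,/^$e$/ {print > \"$b$e\"}" data; done < ranges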
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
    count = 1
    while (("cat file1" | getline) > 0)   # > 0 guards against looping forever on a read error (-1)
    {
        a[count] = $3
        count++
    }
}
The rest continues in much the same way. Anyway, maybe that works for you as well.
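A related sketch, if you prefer reading the file directly instead of spawning cat: getline can read from a file, and the > 0 test distinguishes end-of-file from a read error (this assumes a file named file1, as above):
BEGIN {
    count = 1
    while ((getline line < "file1") > 0) {
        a[count] = line   # the whole line; use split() if you need a single field
        count++
    }
    close("file1")
}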

How to print all lines matching the first field of last line

I've been trying to do this for the last two days. I read a lot of tutorials and learned a lot of new things, but so far I haven't managed to achieve what I'm trying to do. Let's say this is the command-line output:
Johnny123 US 224
Johnny123 US 145
Johnny123 US 555
Johnny123 US 344
Robert UK 4322
Robert UK 52
Lucas FR 344
Lucas FR 222
Lucas FR 8945
I want to print the lines whose first field matches the first field of the last line (Lucas).
So, I want to print out:
Lucas FR 344
Lucas FR 222
Lucas FR 8945
Notes:
The lines I'm trying to print have a different count each time, so I can't do something like returning only the last 3 lines.
The first field doesn't follow a specific pattern that I could use to match.
Here is another way using tac and awk:
tac file | awk 'NR==1{last=$1}$1==last' | tac
Lucas FR 344
Lucas FR 222
Lucas FR 8945
The last tac is only needed if the order is important.
awk 'NR==FNR{key=$1;next} $1==key' file file
This reads the file twice: the first pass simply remembers the first field of the last line in key, and the second pass prints every line whose first field matches it.
Or, if you prefer a single pass:
awk '{val[$1]=val[$1] $0 RS; key=$1} END{printf "%s", val[key]}' file
This accumulates lines under their first-field key while tracking the key of the last line; the END block prints the saved group.
This might work for you (GNU sed):
sed -nr 'H;g;/^(\S+\s).*\n\1[^\n]*$/!{s/.*\n//;h};$p' file
Store lines with duplicate keys in the hold space. At each change of key, discard the previously held lines. At end-of-file, print what remains.

grep -c: count NH:i:1 only, for every line in the file, not also NH:i:12

cat samtry.txt | grep -c NH:i:1
See an example of three lines below; the bold text is what's important:
HWI-ST697:178:D1U9CACXX:1:2111:12787:5687 153 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DCDDDDDDDDDDDEEEEEEEEFGHGJIHGHFHJIJIJJIJJJJIHJJIJIIIFJJIGGGIJJJIIJJHIGJIJJJGHJJIJIJIGFJJGHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:1**
HWI-ST697:178:D1U9CACXX:3:1310:18383:72540 89 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DDDDDDDDDDDDDEEEEEEFFFHHHIIJJIIIJIJJJJJJJJJJHJJJJJJJJJJJJJIJJJJJJJJIJJJIJJIJJJJJJJJIHFJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:11**
HWI-ST697:178:D1U9CACXX:7:1212:17559:76798 89 scaffold_1 33007 50 101M * 0 0 CTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACAAG DDDDDDDDDDDDDEEEECDFFHGHIGJIIHJJJIIJJJJJJHHJJJJJJJJJJJIIIJJJJGIIGBJJIJJJJIJJJJJIHHHFJJIJHHHHGFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:16T26G57YT:Z:UU **NH:i:1**
I am trying to use a shell script to count all the lines in a tab-delimited file (test file samtry.txt, which contains 10 lines to test on) that contain the regular expression NH:i:1.
The problem, of course, is that I get the matches I wanted, but grep also counts lines containing NH:i:1x (where x is any digit 0-9).
The NH:i:x tag (x = any number up to around 50) sits in field 20 of every line; it is not the last field. Every line has 23 fields.
Does anyone know how to do this with grep or another tool?
I have around 100 files, each around 3GB in size, and I don't know how to solve this problem.
I hope I've given enough information; I'm grateful for every answer.
Try grep with word boundaries:
grep -c '\<NH:i:1\>' samtry.txt
or with grep -w (match whole words only):
grep -wc 'NH:i:1' samtry.txt
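Since the question states that the tag sits in field 20 of a tab-delimited file, an exact field comparison is another option. A sketch in awk, assuming that layout holds for every line:
awk -F'\t' '$20 == "NH:i:1" {c++} END {print c+0}' samtry.txt
The c+0 ensures a 0 is printed even when nothing matches.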