grep line with exact pattern in first column - regex

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that when foo.txt contains, for instance, Contig1, the script returns column 2 of my_annot even if the pattern found is Contig12 and not just Contig1!
I tried putting $ at the end of the pattern, but $ matches the end of the line, while the expression I am searching for is in column 1 and therefore not at the end of the line.
How can I tell grep to search for this EXACT pattern and not for strings that merely contain it?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' "$1" "$annot" > out
It lets me pass the list of expressions I want to find as the first argument, e.g. ./myscript.sh mylist
The result is redirected to a file called out.
Thank you guys !!!!

You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
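To make the exact-match behaviour concrete, here is a minimal sketch on hypothetical sample data (the file contents and the array name want are assumptions, not taken from the question). It shows that Contig1 no longer picks up the Contig12 row:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Contig1' > "$workdir/foo.txt"
printf '%s\n' 'Contig1 hugo' 'Contig12 paul' 'Contig2 anna' > "$workdir/my_annot"

# First file: remember every wanted name as an array key.
# Second file: print column 2 only when column 1 is exactly one of those keys.
result=$(awk 'NR == FNR { want[$0]; next } $1 in want { print $2 }' \
    "$workdir/foo.txt" "$workdir/my_annot")

echo "$result"    # prints: hugo
rm -rf "$workdir"
```

Because `$1 in want` is an array lookup rather than a regex match, names that contain regex metacharacters need no escaping.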

So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' my_annot

I think this is what you want
while read -r line; do grep -w "$line" my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.
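One case where -w may not be enough: it matches the word anywhere on the line, not only in column 1. The sketch below uses hypothetical data, and assumes whitespace-separated columns and names without regex metacharacters, to compare -w with a pattern anchored to the start of the line:

```shell
# Hypothetical sample data: the third row has Contig1 in column 2.
workdir=$(mktemp -d)
printf '%s\n' 'Contig1' > "$workdir/foo.txt"
printf '%s\n' 'Contig1 hugo' 'Contig12 paul' 'Contig3 Contig1' > "$workdir/my_annot"

# -w rejects Contig12 but still accepts Contig1 in the second column.
loose=$(while IFS= read -r name; do
    grep -w -- "$name" "$workdir/my_annot" | awk '{print $2}'
done < "$workdir/foo.txt")

# Anchoring at line start and requiring whitespace after the name
# restricts the match to column 1.
strict=$(while IFS= read -r name; do
    grep -E -- "^${name}[[:space:]]" "$workdir/my_annot" | awk '{print $2}'
done < "$workdir/foo.txt")

printf 'with -w:\n%s\nanchored:\n%s\n' "$loose" "$strict"
rm -rf "$workdir"
```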

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string, using the awk command:
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of each column? And instead of repeating the substr call, how can I use it only once to print specific characters in different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} iterates over all fields in the current record: i is the field number, $i is the field value, and the last character of each field (retrieved with substr($i,length($i))) is appended to the variable r.
END{print r} prints r once awk has finished processing the input.
In the second solution, r is cleared at the start of each line, and its value is printed after all fields in the current record have been processed.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' <<< "$s"
Output:
GMUCHOS
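As a further check, here is a small sketch of the per-line variant on hypothetical input whose words have different lengths, including single-letter words. The input lines are assumptions chosen for illustration:

```shell
# Hypothetical input: two lines, words of varying length.
input=$(printf '%s\n' 'UDACBG UYAZAM DJSUBU' 'SINGLE X LETTER Z')

result=$(printf '%s\n' "$input" | awk '{
    r = ""                                # start a fresh string for each line
    for (i = 1; i <= NF; i++)
        r = r substr($i, length($i))      # last character of field i
    print r
}')

echo "$result"
# prints:
# GMU
# EXRZ
```

Using length($i) for each field, rather than a fixed position, is what makes the varying word lengths work.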
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator (RS) to any character followed by whitespace or the end of the value/line. RT then holds the text that matched RS; strip the whitespace from it and append it to val. Once awk has read the whole Input_file, print val.
2nd solution: Set the record separator to the empty string and use match() with the regex (.[[:space:]]+)|(.$) to pick out the last letter of each word. Each match is appended to a variable, and the END block strips the whitespace and prints the variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
Using several standard tools:
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
Separate the words onto lines, reverse each one so that the last character can be picked as the first, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Tell AWK to treat any alphabetic character at the end of a word as a field (FPAT), and use the empty string as the output field separator. $1=$1 triggers rebuilding of the line with the specified OFS. If you want to know more about the start/end-of-word operators, read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk '(n = patsplit($0, a, /[[:alpha:]]\>/)) { for (i = 1; i <= n; i++) printf "%s", a[i]; print "" }' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or, more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage of both is that single-letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where some of the other commands shown here do not (the first fails only because it uses length($1) instead of length($i)):
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

If string is found in line, replace with string from previous line, sed

I have a few thousand lines of code spread out across multiple files that I need to update. Right now they are like so:
active_data(COPIED),
I need to replace all instances of COPIED (only in these lines) with the text inside the parentheses on the previous line. So in total the code currently might look like:
current_data(FOO|BAR|TEXT),
active_data(COPIED),
and I want it to look like:
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
I can find the lines to be replaced easily enough, and could replace them with some static string with no problem, but I'm not sure how to pull the data from the previous line and use that. I'm sure it's pretty simple but I can't quite figure it out. Thanks for the help.
(I could see using AWK or something else for this too if sed won't work but I figure sed would be the best solution for a one time change).
sed could work but awk is more natural:
$ awk -F'[()]' '$2 == "COPIED" {sub(/COPIED/, prev)} {prev=$2;} 1' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
-F'[()]'
Use open or close parens as the field separator.
$2 == "COPIED" {sub("COPIED", prev)}
If the second field is COPIED, then replace it with prev.
prev=$2
Update prev.
1
This is cryptic shorthand which means print the line. It is equivalent to {print $0;}.
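One caveat worth knowing: in the replacement string of sub(), an unescaped & stands for the matched text, so a previous value containing & would be mangled. The following is a minimal sketch, on hypothetical data, that splices the previous value in by position instead. The sample file and the variable name start are assumptions for illustration:

```shell
# Hypothetical sample where the previous value contains an ampersand.
workdir=$(mktemp -d)
printf '%s\n' 'current_data(FOO&BAR),' 'active_data(COPIED),' > "$workdir/sample.txt"

result=$(awk -F'[()]' '
    $2 == "COPIED" {
        # Splice prev in by position so no character in it is treated specially.
        start = index($0, "(COPIED)")
        if (start > 0)
            $0 = substr($0, 1, start) prev substr($0, start + length("(COPIED"))
    }
    { prev = $2 }    # fields are re-split after $0 changes, so this is the new value
    1
' "$workdir/sample.txt")

echo "$result"
# prints:
# current_data(FOO&BAR),
# active_data(FOO&BAR),
rm -rf "$workdir"
```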
How awk sees the fields
$ awk -F'[()]' '{for (i=1;i<=NF;i++)printf "Line %s Field %s=%s\n",NR,i,$i;}' file
Line 1 Field 1=current_data
Line 1 Field 2=FOO|BAR|TEXT
Line 1 Field 3=,
Line 2 Field 1=active_data
Line 2 Field 2=COPIED
Line 2 Field 3=,
Changing in-place all files in a directory
for file in *
do
awk -F'[()]' '$2 == "COPIED" {sub("COPIED", prev)} {prev=$2;} 1' "$file" >tmp$$ && mv tmp$$ "$file"
done
Or, if you have a modern GNU awk:
awk -i inplace -F'[()]' '$2 == "COPIED" {sub("COPIED", prev)} {prev=$2;} 1' *
This might work for you (GNU sed):
sed 'N;s/\((\([^)]*\)).*\n.*(\)COPIED/\1\2/;P;D' file
This keeps a moving window of 2 lines open throughout the length of the file and uses pattern matching to effect the required result.
With GNU awk for FPAT:
$ awk -v FPAT='[(][^)]*[)]|[^)(]*' -v OFS= '$2=="(COPIED)"{$2=prev} {prev=$2; print}' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
With other awks:
$ awk '
match($0,/\([^)]*\)/) {
curr = substr($0,RSTART,RLENGTH)
if (curr == "(COPIED)") {
$0 = substr($0,1,RSTART-1) prev substr($0,RSTART+RLENGTH)
}
else {
prev = curr
}
}
{ print }
' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),

Is there a way to obtain the current pattern searched in an AWK script?

The basic idea is this. Suppose that you want to search a file for multiple patterns from a pipe with awk :
... | awk -f - '{...}' someFile.txt
* '...' is just short for some code
* '-f -' indicates the pattern is taken from pipe
Is there a way to know which pattern is searched at each instant within the awk script?
(Like you know $1 is the first field, is there something like $PATTERN that contains the current pattern
searched, or a way to get something like it?)
More Elaboration:
if I have 2 files:
someFile.txt containing:
1
2
4
patterns.txt containing:
1
2
3
4
running this command:
cat patterns.txt |awk -f - '{...}' someFile.txt
What should I type between the braces such that only the pattern in patterns.txt that
has not been matched in someFile.txt is printed?(in this case the number 3 in patterns.txt is not matched)
Under the requirements that patterns.txt be supplied as stdin and that the processing be done with awk:
$ cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
3
This was tested using GNU awk.
Explanation
We want to print the lines of patterns.txt that do not match anything in someFile.txt. To do this, we first read someFile.txt and collect its lines in one string. Next, we treat each line of patterns.txt as a regular expression and print it only if it does not match that string.
FNR==NR{p=p "\n" $0;next;}
NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: someFile.txt. We save all such lines in the newline-separated variable p. We then tell awk to skip the remaining commands and jump to the next line.
p !~ $0
If we got here, then we are now reading the second named file on the command line, which is - for stdin. This boolean condition evaluates to either true or false. If it is true, the line is printed. If not, it is skipped. In other words, the above is awk's cryptic shorthand for:
p !~ $0 {print $0}
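Note that p !~ $0 is a regex test, so a pattern such as 1 also counts as matched when someFile.txt only contains 12. The sketch below compares it with an exact-line variant. The data is hypothetical (it differs from the question's), and the array name seen is an assumption:

```shell
# Hypothetical data: someFile.txt holds 12 and 4, patterns.txt holds 1 to 4.
workdir=$(mktemp -d)
printf '%s\n' 12 4 > "$workdir/someFile.txt"
printf '%s\n' 1 2 3 4 > "$workdir/patterns.txt"

# Regex test: 1 and 2 both match inside "12", so only 3 is reported.
regex_unmatched=$(awk 'FNR == NR { p = p "\n" $0; next } p !~ $0' \
    "$workdir/someFile.txt" - < "$workdir/patterns.txt")

# Exact-line test: a pattern counts as matched only if an identical line exists.
exact_unmatched=$(awk 'FNR == NR { seen[$0]; next } !($0 in seen)' \
    "$workdir/someFile.txt" - < "$workdir/patterns.txt")

printf 'regex:\n%s\nexact:\n%s\n' "$regex_unmatched" "$exact_unmatched"
rm -rf "$workdir"
```

Which of the two is wanted depends on whether the lines of patterns.txt are meant as regular expressions or as literal values.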
cmd | awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($0 ~ p) delete pats[p]} END{ for (p in pats) print p }' - someFile.txt
Another way in awk
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -

Extract substring from rows with regex and remove rows with duplicate substring

I have a text file with some rows in the following form
*,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
I would like to remove duplicate rows that have the same value for * (case insensitive), i.e. anything left of ,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
For example here's a sample text file
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
The resulting text file should have the duplicate foo removed (order does not matter to me so long as the duplicates are removed, leaving 1 unique)
test,bar,log,dog,0,0,0
one
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
What's the simplest bash command I could do to achieve this?
awk -F, '!seen[tolower($1)]++' file
You can do this with awk like this (since you don't care which of the duplicates gets kept):
awk -F, '{lines[tolower($1)]=$0}END{for (l in lines) print lines[l]}' file
If you wanted to keep the first instead:
awk -F, '{if (lines[tolower($1)]!=1) { print; lines[tolower($1)]=1 } }' file
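The awk one-liners above deduplicate every line by its first field, which also drops the second "one" row from the sample. If rows that do not have the seven-field shape should pass through untouched, as in the question's expected output, a sketch along these lines may help. It assumes the first field itself contains no commas:

```shell
# Sample rows from the question, written to a scratch file.
workdir=$(mktemp -d)
printf '%s\n' \
    'test,bar,log,dog,0,0,0' \
    'one' \
    'foo,bar,log,dog,0,0,0' \
    '/^test$/,bar,log,dog,0,0,0' \
    'one' \
    'FOO,,frog,frog,1,1,1' > "$workdir/rows.txt"

result=$(awk -F, '
    # Rows without the trailing animal/flag fields are printed as they are.
    $0 !~ /,(dog|log|frog),(dog|log|frog),[01],[01],[01]$/ { print; next }
    # Otherwise keep only the first row for each lower-cased first field.
    !seen[tolower($1)]++
' "$workdir/rows.txt")

echo "$result"
rm -rf "$workdir"
```

This keeps the first of the foo/FOO pair rather than the last, which the question allows since either duplicate may survive.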
Search for
(?:(?<=\n)|^)(.*)((?:,(?:d|l|fr)og){2}(?:,[01]){3})(?=\n)([\s\S]*)(?<=\n).*\2(?:\n|$)
...and replace with
$1$2$3
#!/bin/bash
# Print each line of the file given as $1 unless its first field
# (compared case-insensitively) has already been seen.
keys=()
while IFS= read -r line
do
    key=$( echo "${line%%,*}" | awk '{print tolower($0)}')
    found=0
    for k in "${keys[@]}" ; do [[ "$k" == "$key" ]] && found=1 && break ; done
    (( found )) && continue
    printf '%s\n' "$line"
    keys=( "${keys[@]}" "$key" )
done < "$1"
This uses a plain array instead of an associative array (hash), which is less efficient, but it seems to work.
This might work for you (GNU sed):
cat -n file |
sort -fk2,2 |
sed -r ':a;$!N;s/^.{7}([^,]*),[^,]*(,(d|l|fr)og){2}(,[01]){3}\n(.{7}\1,[^,]*(,(d|l|fr)og){2}(,[01]){3})$/\5/i;ta;P;D' |
sort -n |
sed -r 's/^.{7}//'
Number each line.
Sort by the first key (ignoring case)
Remove duplicates (based on specific criteria)
Sort reduced file back into original order
Remove line numbers

awk print matching line and line before the matched

Following is what I am trying to do using awk. Get the line that matches the regex and the line immediately before the matched and print. I can get the line that matched the regex but not the line immediately before that:
awk '{if ($0!~/^CGCGGCTGCTGG/) print $0}'
In this case you could easily solve it with grep:
grep -B1 foo file
However, if you need to use awk:
awk '/foo/{if (a && a !~ /foo/) print a; print} {a=$0}' file
/abc/{if(a!="")print a;print;a="";next}
{a=$0}
Use a more straightforward pattern search:
gawk '{if (/^abc$/) {print x; print $0};x=$0}' file1 > file2
I created the following awk script. It prints the matching line as well as the previous 2 lines. You can make it more flexible from this idea.
search.awk
# Print each line matching /message received/, preceded by the two lines before it.
/message received/ {
    if (NR > 2) print prev2
    if (NR > 1) print prev1
    print
}
# Remember the last two lines for the next record.
{ prev2 = prev1; prev1 = $0 }
Usage:
awk -f search.awk LogFile.log
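The idea of remembering earlier lines can be generalized to any number of preceding lines with a small ring buffer. This is a sketch only: the variable names n, pat and buf and the sample log contents are assumptions, and n is expected to be at least 1.

```shell
# Hypothetical log file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'boot' 'connect' 'message received' 'idle' 'message received' \
    > "$workdir/LogFile.log"

result=$(awk -v n=2 -v pat='message received' '
    $0 ~ pat {
        # Print up to n buffered lines, oldest first, then the match itself.
        for (i = NR - n; i < NR; i++)
            if (i > 0) print buf[i % n]
        print
    }
    { buf[NR % n] = $0 }    # keep only the last n lines
' "$workdir/LogFile.log")

echo "$result"
rm -rf "$workdir"
```

As with the two-line version, a matching line that falls inside the context of a later match is printed again as context.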
Why not use grep -EB1 '^CGCGGCTGCTGG' file?
The awk to do the same thing is very long-winded, see Marco's answer.
Maybe slightly off-topic, but I used the answer from belisarius to create my own variation of the above solution, that searches for the Nth entry, and returns that and the previous line.
awk -v count=1 '/abc/{{i++};if(i==count){print a;print;exit}};{a=$0}' file