Here is what I am trying to do with awk: print each line that matches a regex together with the line immediately before it. I can get the line that matches the regex but not the line immediately before it:
awk '{if ($0!~/^CGCGGCTGCTGG/) print $0}'
In this case you could easily solve it with grep:
grep -B1 foo file
However, if you need to use awk:
awk '/foo/{if (a && a !~ /foo/) print a; print} {a=$0}' file
/abc/{if(a!="")print a;print;a="";next}
{a=$0}
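To see the remember-the-previous-line idea from the snippets above in action, here is a minimal sketch. The sample lines and the pattern abc are made up for illustration, and the NR > 1 guard is my own assumption for the case where the very first line matches.

```shell
# Hypothetical sample input, not taken from the original question.
input='alpha
beta
abc one
gamma
abc two'

# Print each line matching /abc/ together with the line right before it.
# "prev" always holds the previous record; NR > 1 guards the first line.
result=$(printf '%s\n' "$input" | awk '
    /abc/ { if (NR > 1) print prev; print }
    { prev = $0 }
')
printf '%s\n' "$result"
```

If two matching lines are adjacent, the first one is printed twice (once as a match, once as the previous line of the next match); the a !~ /foo/ test in the earlier one-liner avoids that.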
Use a more straightforward pattern search:
gawk '{if (/^abc$/) {print x; print $0};x=$0}' file1 > file2
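I created the following awk script. It prints the matching line as well as the previous 2 lines, but note that it reads the input in blocks of three lines, so a match is only detected when it falls on the third line of a block. You can make it more flexible from this idea.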
search.awk
{
a[0]=$0;
for(i=0;i<2;i++)
{
getline;
if(i==0){
a[1]=$0;
}
if(i==1){
if($0 ~ /message received/){
print a[0];
print a[1];
print $0;
}
}
}
}
Usage:
awk -f search.awk LogFile.log
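One way to make this more flexible, sketched under the assumption that you want the n lines before every match wherever it occurs: keep the last n lines in a small ring buffer. The variable names n and buf and the sample log lines are hypothetical.

```shell
# Hypothetical log lines for illustration.
log='start
connect
handshake
message received
idle
ping
message received'

# n is the number of context lines to print before each match.
result=$(printf '%s\n' "$log" | awk -v n=2 '
    /message received/ {
        # Replay up to n buffered lines, oldest first.
        for (i = NR - n; i < NR; i++)
            if (i >= 1) print buf[i % n]
        print
    }
    { buf[NR % n] = $0 }   # ring buffer: slot NR % n holds line NR
')
printf '%s\n' "$result"
```

Unlike the getline version, this checks every line for a match, and changing n changes the amount of context.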
Why not use grep -EB1 '^CGCGGCTGCTGG'?
The awk to do the same thing is very long-winded, see Marco's answer.
Maybe slightly off-topic, but I used the answer from belisarius to create my own variation of the above solution, that searches for the Nth entry, and returns that and the previous line.
awk -v count=1 '/abc/{{i++};if(i==count){print a;print;exit}};{a=$0}' file
I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string using the awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of a column? And instead of repeating the substr command, how can I use it only once to print specific characters in different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i)) - iterate over all fields in the current record, i is the field ID, $i is the field value, and all last chars of each field (retrieved with substr($i,length($i))) are appended to r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' <<< "$s"
Output:
GMUCHOS
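For reference, a sketch of the per-line variant run on two lines, the second one containing single-letter words. It uses only POSIX awk features; the second sample line is just for illustration.

```shell
lines='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
SINGLE X LETTER Z'

# For every input line, collect the last character of each field and print the result.
result=$(printf '%s\n' "$lines" | awk '{
    r = ""
    for (i = 1; i <= NF; i++)
        r = r substr($i, length($i))   # last character of field i
    print r
}')
printf '%s\n' "$result"
```

Using length($i) rather than a fixed position is what makes words of any length, including one-letter words, come out right.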
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator (RS) to any single character followed by whitespace or by the end of the input. For each record, RT holds the text that matched RS; strip the whitespace from it and append what remains to val. Once the whole Input_file has been read, the END block prints val.
2nd solution: Set the record separator to the empty string (paragraph mode) and use match() with the regex (.[[:space:]]+)|(.$) to find the last letter of each word. Append each match to a variable, then strip the whitespace and print the variable in the END block.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
Using several tools:
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
Separate the words onto lines, reverse each one so that the last character is easy to pick as the first, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most straightforward one.
I would use GNU AWK for this as follows. Let the content of file.txt be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Tell AWK to treat every alphabetic character at the end of a word as a field (FPAT) and to use the empty string as the output field separator. $1=$1 forces the record to be rebuilt with the specified OFS. To learn more about start/end-of-word operators, see GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
Here gensub() keeps the last character of each word and gsub() removes the whitespace between them.
or using patsplit():
awk '(n = patsplit($0, a, /[[:alpha:]]\>/)) { for (i = 1; i <= n; i++) printf "%s", a[i]; print "" }' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or, more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
The length($1) variant of the accepted answer and the first gensub answer do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG
I want to return lines from awk with a pattern "C," or ".,C" or ".,C,.*".
For example:
Valid
C,G
G,C
G,C,A
Invalid
G,CC
My code is below:
echo G,CC | awk '$0 ~ /^C,+.*|.*,C,*.*/ {print $0}'
output:
G,CC
I expected it to return nothing. Unfortunately, it returns "G,CC".
How do I solve this problem?
Edit:
Based on the answers from @Emma and @perreal, I used a shorter command line to solve my question:
awk '$0 ~ /^C,.*|.*,C,.*|.*,C$/ {print $0}'
Until now, it works well. Thanks for your help!!
Could you please try the following:
awk '!/CC/ && /^C,+.*|.*,C,*.*/' Input_file
The + is not necessary in ^C,+.*, since you already match the comma and also match whatever comes after.
The * right after the second comma is not correct in .*,C,*.*. It makes the comma optional so it can also match G,CC (.*, matches G, and C,* matches CC).
This should work:
awk '$0 ~ /^[GCA](,[GCA])*$/ && /C/ {print $0}'
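The effect of the optional comma described above can be checked directly. This sketch runs the question's original regex and an anchored alternative, (^|,)C(,|$), against the four sample lines; the variable names loose and strict are just for illustration.

```shell
samples='C,G
G,C
G,C,A
G,CC'

# Original pattern: the "*" after the second comma makes that comma optional,
# so ",C" followed by anything (including another C) is accepted.
loose=$(printf '%s\n' "$samples" | awk '/^C,+.*|.*,C,*.*/')

# Anchored alternative: C must be a whole comma-separated item.
strict=$(printf '%s\n' "$samples" | awk '/(^|,)C(,|$)/')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The loose pattern prints all four lines, including G,CC; the anchored one prints only the three valid ones.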
My guess is that maybe this would also work:
awk '$0 ~ /^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/ {print $0}'
Advice
Mr. Rankin advises that:
It is equivalent to awk '/^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/'. Printing the record is the default action when the pattern matches.
$ awk '/(^|,)C(,|$)/' file
C,G
G,C
G,C,A
More alternatives
In other words, you want to select lines with "C" as a word? If yes, here are 2 solutions:
grep -w C
grep -E '\<C\>'
The first one tells grep to match only whole words. The second uses the begin-word and end-word patterns. These patterns can be used with GNU awk too:
awk '/\<C\>/ {print}'
A completely different solution (different from the other answers too) is to add commas at both ends before comparing against ,C,:
awk '("," $0 ",") ~ /,C,/ {print}'
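If the data is always comma-separated, another option (a sketch, not taken from the answers above) is to let awk split on commas and compare each field as a plain string. The extra sample line CC,G is hypothetical.

```shell
samples='C,G
G,C
G,C,A
G,CC
CC,G'

# Split on commas and print the line as soon as one field is exactly "C".
result=$(printf '%s\n' "$samples" | awk -F, '{
    for (i = 1; i <= NF; i++)
        if ($i == "C") { print; next }   # exact string comparison, no regex
}')
printf '%s\n' "$result"
```

Unlike the patterns in the question, this would also print a line that consists of a single C with no comma at all.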
I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that in foo.txt, when I have for instance Contig1, the script will return the column 2 of the file my_annot even if the pattern found is Contig12 and not Contig1 only!
I tried with $ at the end of the pattern but the problem is that it corresponds to end of line while this expression I search is in column 1 and therefore not end of line.
How can I tell to search this EXACT pattern and not those that contain this pattern?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' "$1" "$annot" > out
It allows me to give the list of expressions I want to find as the first argument, by running ./myscript.sh mylist
And I redirect the result in a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
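A small worked example of the two-file idiom described above. The file names mirror the ones in the question; the Contig3/anna row and the array name wanted are hypothetical.

```shell
# Build hypothetical sample files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Contig1' 'Contig3' > "$tmpdir/foo.txt"
printf '%s\n' 'Contig1 hugo' 'Contig12 paul' 'Contig3 anna' > "$tmpdir/my_annot"

# While awk reads the first file, NR == FNR holds, so its lines become array keys.
# For the second file, print column 2 only when column 1 is exactly one of those keys.
result=$(awk 'NR == FNR { wanted[$0]; next } $1 in wanted { print $2 }' \
    "$tmpdir/foo.txt" "$tmpdir/my_annot")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Contig12 is skipped because the lookup is an exact key match, not a substring or regex match. One caveat: if the first file is empty, NR == FNR is also true while reading the second file, so nothing is printed.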
Of course I have made a guess that the format of your data is the same as in the other answer.
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read -r line; do grep -w "$line" my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.
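One case where grep -w may not be exact enough is a name followed by punctuation, since -w only requires a non-word character after the match. A sketch of an exact comparison on column 1 instead, passing the search term to awk as a variable; the Contig1.5 row and the variable name key are hypothetical.

```shell
annot='Contig1 hugo
Contig12 paul
Contig1.5 eve'

key='Contig1'
# Compare the key as a plain string against column 1, so neither Contig12
# nor Contig1.5 is selected.
result=$(printf '%s\n' "$annot" | awk -v key="$key" '$1 == key { print $2 }')
printf '%s\n' "$result"
```

Note that awk processes backslash escapes in values passed with -v, so keys containing backslashes would need extra care.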
I have a few thousand lines of code spread out across multiple files that I need to update. Right now they are like so:
active_data(COPIED),
I need to replace all instances of COPIED (only in these lines) with the text inside the parentheses on the previous line. So in total the code currently might look like:
current_data(FOO|BAR|TEXT),
active_data(COPIED),
and I want it to look like:
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
I can find the lines to be replaced easily enough, and could replace them with some static string with no problem, but I'm not sure how to pull the data from the previous line and use that. I'm sure it's pretty simple but I can't quite figure it out. Thanks for the help.
(I could see using AWK or something else for this too if sed won't work but I figure sed would be the best solution for a one time change).
sed could work but awk is more natural:
$ awk -F'[()]' '$2 == "COPIED" {sub(/COPIED/, prev)} {prev=$2;} 1' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
-F'[()]'
Use open or close parens as the field separator.
$2 == "COPIED" {sub("COPIED", prev)}
If the second field is COPIED, then replace it with prev.
prev=$2
Update prev.
1
This is cryptic shorthand which means print the line. It is equivalent to {print $0;}.
How awk sees the fields
$ awk -F'[()]' '{for (i=1;i<=NF;i++)printf "Line %s Field %s=%s\n",NR,i,$i;}' file
Line 1 Field 1=current_data
Line 1 Field 2=FOO|BAR|TEXT
Line 1 Field 3=,
Line 2 Field 1=active_data
Line 2 Field 2=COPIED
Line 2 Field 3=,
Changing in-place all files in a directory
for file in *
do
awk -F'[()]' '$2 == "COPIED" {sub("COPIED", prev)} {prev=$2;} 1' "$file" >tmp$$ && mv tmp$$ "$file"
done
Or, if you have a modern GNU awk:
awk -i inplace -F'[()]' '$2 == "COPIED" {sub("COPIED", prev)} {prev=$2;} 1' *
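A slightly more defensive sketch of the temporary-file loop above, assuming mktemp is available. It only overwrites a file when awk exits successfully, and it writes through cat so the original file keeps its permissions. The scratch directory and a.txt are hypothetical.

```shell
# Work in a scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
printf '%s\n' 'current_data(FOO|BAR|TEXT),' 'active_data(COPIED),' > "$workdir/a.txt"

for file in "$workdir"/*.txt; do
    tmp=$(mktemp) || exit 1
    # Overwrite only when awk succeeds; otherwise the original stays untouched.
    if awk -F'[()]' '$2 == "COPIED" { sub(/COPIED/, prev) } { prev = $2 } 1' "$file" > "$tmp"
    then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
done

result=$(cat "$workdir/a.txt")
rm -rf "$workdir"
printf '%s\n' "$result"
```

Because prev is used as the replacement text of sub(), an & or a backslash inside the parentheses would be treated specially, so check your data for those first.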
This might work for you (GNU sed):
sed 'N;s/\((\([^)]*\)).*\n.*(\)COPIED/\1\2/;P;D' file
This keeps a moving window of 2 lines open throughout the length of the file and uses pattern matching to effect the required result.
With GNU awk for FPAT:
$ awk -v FPAT='[(][^)]*[)]|[^)(]*' -v OFS= '$2=="(COPIED)"{$2=prev} {prev=$2; print}' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
With other awks:
$ awk '
match($0,/\([^)]*\)/) {
curr = substr($0,RSTART,RLENGTH)
if (curr == "(COPIED)") {
$0 = substr($0,1,RSTART-1) prev substr($0,RSTART+RLENGTH)
}
else {
prev = curr
}
}
{ print }
' file
current_data(FOO|BAR|TEXT),
active_data(FOO|BAR|TEXT),
Please refer to the file contents below.
#HD VN:1.0 SO:unsorted
#SQ SN:Chr1 LN:30427680
#PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?
Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up until LN:, and print what's left, and only if a substitution did take place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
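As one example of a stricter pattern, this sketch only looks at #SQ lines and only accepts a field of the form LN: followed by digits, wherever that field appears on the line. Both the $1 == "#SQ" test and the digits-only check are assumptions about the file layout; loosen them if your headers differ.

```shell
header='#HD VN:1.0 SO:unsorted
#SQ SN:Chr1 LN:30427680
#PG ID:bowtie2 PN:bowtie2 VN:2.1.0'

# Scan the fields of each #SQ line and print the number after "LN:".
result=$(printf '%s\n' "$header" | awk '
    $1 == "#SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^LN:[0-9]+$/) print substr($i, 4)   # drop the "LN:" prefix
    }
')
printf '%s\n' "$result"
```

This avoids picking up other numbers such as the version fields, which the plain digit-matching grep could return on different input.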
Using grep:
grep -oP 'LN:\K.*' filename
Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.
Using perl :
perl -lne 'print $& if /LN:\K.*/' filename
or
perl -lne 'print $1 if /LN:(.*)/' filename
Another awk
awk -F"LN:" 'NF>1 {print $2}' file