printing a column after a specific pattern from a text file

printing a column after a specific pattern from a text file - regex

I have these lines of data. I want to just print the numbers that follows A, C and D. Is there a way to print those columns?
(((A 0.0400213 B 0.0400213 0.00737467 C 0.047396 0.03466 D 0.082056 ;
((C 0.0275027 (A 0.023266 B 0.023266 0.00423667 0.0131187 D 0.0406213 ;
((C 0.0310553 (B 0.0210907 A 0.0210907 0.00996467 0.0222647 D 0.05332 ;
((C 0.03491 (A 0.020474 B 0.020474 0.014436 0.0221067 D 0.0570167 ;
((C 0.0485573 (A 0.00938133 B 0.00938133 0.039176 0.00335133 D 0.0519087 ;
If I want to get the numbers following A
My expected out come will be :
0.0400213,
0.023266,
0.0210907,
0.020474,
0.00938133

This should work:
awk '{for(i=1;i<=NF;i++) if($i~/A|C|D/) printf $++i FS; print ""}' file
Output:
$ awk '{for(i=1;i<=NF;i++) if($i~/A|C|D/) printf $++i FS; print ""}' file
0.0400213 0.047396 0.082056
0.0275027 0.023266 0.0406213
0.0310553 0.0210907 0.05332
0.03491 0.020474 0.0570167
0.0485573 0.00938133 0.0519087
More verbose:
$ awk '{for(i=1;i<=NF;i++) if($i~/A|C|D/) { gsub(/[(]/,"",$i) ; printf $i " is " $++i FS }; print ""}' file
A is 0.0400213 C is 0.047396 D is 0.082056
C is 0.0275027 A is 0.023266 D is 0.0406213
C is 0.0310553 A is 0.0210907 D is 0.05332
C is 0.03491 A is 0.020474 D is 0.0570167
C is 0.0485573 A is 0.00938133 D is 0.0519087
Update:
$ awk '{for(i=1;i<=NF;i++)if($i~/A/){gsub(/[(]/,"",$i);print $++i}}' file
0.0400213
0.023266
0.0210907
0.020474
0.00938133

Code for GNU sed:
$sed -r 's/.*(A\s)([0]\.[0-9]+).*/\2/' file
0.0400213
0.023266
0.0210907
0.020474
0.00938133

Related

Compare line by line two files with multiple columns using bash using comperatives

I have two files:
file1:
a b 30 d
a b 50 d
and file2:
a b 20 d
a b 60 d
the preferred output file file_output:
a b 20 d
a b 50 d
I want to compare each file line by line and I want to print in the output file, the line in which the number in the 3rd column is smaller using awk.
I tried something like:
awk ' NR==FNR { arr[$3];next} arr[$3] < $3 ' file1 file2 > file_output
but it prints only the smaller for file1 and not for file2

awk 'FNR==NR {f3[FNR]=$3;a[FNR]=$0;next} FNR in a {print ($3<f3[FNR])?$0:a[FNR]}' file1 file2

How to print a line with no field separator in awk?

I have data like this (file is called list-in.dat)
a ; b ; c ; i
d
e ; f ; a ; b
g ; h ; i
and I want a list like this (output file list-out.dat) with all items, in alphabetically order (case insensitive) and each unique item only once.
a
b
c
d
e
f
g
h
i
My attempt is:
awk -F " ; " ' BEGIN { OFS="\n" ; } {for(i=0; i<=NF; i++) print $i} ' file-in.dat | uniq | sort -uf > file-out.dat
But I end up with all antries except those lines which has only one item:
a
b
c
e
f
g
h
i
How can I get all (unique, sorted) items no matter how many items are in one line / if the field separator is missing?

Using gnu-awk:
awk -F '[[:blank:]]*;[[:blank:]]*' '{
for (i=1; i<=NF; i++) uniq[$i]
}
END {
PROCINFO["sorted_in"]="#ind_str_asc"
for (i in uniq)
print i
}' file
a
b
c
d
e
f
g
h
i
For non-gnu awk use:
awk -F '[[:blank:]]*;[[:blank:]]*' '{for (i=1; i<=NF; i++) uniq[$i]}
END{for (i in uniq) print i}' file | sort

awk -F' ; ' -v OFS='\n' '{$1=$1} 1' ip.txt | sort -fu
-F' ; ' sets space followed by ; followed by space as field separator
-v OFS='\n' sets newline as output field separator
{$1=$1} change $0 as per new OFS
1 print $0
sort -fu sort uniquely ignoring case in alphabetic order

Could you please try following, awk + sort solution, written and tested with shown samples. In case you want to use ignorecase then add IGNORECASE=1 in awk code.
awk '
BEGIN{
FS=" ; "
}
{
for(i=1;i<=NF;i++){
if(!a[$i]++){ print $i }
}
}
' Input_file | sort
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=" ; " ##Setting field separator as space semi-colon space here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop till NF here for each line.
if(!a[$i]++){ print $i } ##Checking condition if current field is NOT present in array a then printing that field value here.
}
}
' Input_file | sort ##Mentioning Input_file name here and passing it to sort as Input to sort the data.

awk to print specific fields based on regex match in file

I am trying to use awk to look in input for keywords and in found print specified fields. The awk below does run but does not produce the desired output. What is supposed to happen is that if TYPE=ins or TYPE=del is found in the line then $1,$2,$4,$5, and the LEN= prints. The LEN= is also a field in the line with a number after the =. Thank you :).
input
chr1 1647893 . C CTTTCTT 31.9545 PASS AF=0.330827;AO=179;DP=695;FAO=132;FDP=399;FR=.;FRO=267;FSAF=67;FSAR=65;FSRF=124;FSRR=143;FWDB=0.0145873;FXX=0.00249994;HRUN=1;LEN=6;MLLD=190.481;OALT=TTTCTT;OID=.;OMAPALT=CTTTCTT;OPOS=1647894;OREF=-;PB=0.5;PBP=1;QD=0.320346;RBI=0.0146526;REFB=-0.0116875;REVB=0.00138131;RO=471;SAF=85;SAR=94;SRF=236;SRR=235;SSEN=0;SSEP=0;SSSB=-0.0324817;STB=0.528856;STBP=0.43;TYPE=ins;VARB=0.0222858 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:31:695:399:471:267:179:132:0.330827:94:85:236:235:65:67:124:143
chr1 1650787 . T C 483.012 PASS AF=0.39;AO=181;DP=459;FAO=156;FDP=400;FR=.;FRO=244;FSAF=100;FSAR=56;FSRF=162;FSRR=82;FWDB=-0.00931067;FXX=0;HRUN=1;LEN=1;MLLD=210.04;OALT=C;OID=.;OMAPALT=C;OPOS=1650787;OREF=T;PB=0.5;PBP=1;QD=4.83012;RBI=0.018986;REFB=-0.0114993;REVB=-0.0165463;RO=276;SAF=116;SAR=65;SRF=184;SRR=92;SSEN=0;SSEP=0;SSSB=-0.0305478;STB=0.515311;STBP=0.652;TYPE=snp;VARB=0.019956 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:483:459:400:276:244:181:156:0.39:65:116:184:92:56:100:162:82
chr1 17034455 . CGCGCGCGT C 50 PASS AF=0.205882;AO=56;DP=272;FR=.;LEN=8;OALT=-;OID=.;OMAPALT=C;OPOS=17034456;OREF=GCGCGCGT;RO=216;SAF=27;SAR=29;SRF=112;SRR=104;TYPE=del GT:GQ:DP:RO:AO:SAF:SAR:SRF:SRR:AF 0/1:99:272:216:56:27:29:112:104:0.205882
awk
awk '/TYPE=ins/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} /TYPE=del/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} 1' input > out
desired output
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

You can use this awk command:
awk 'function find(str) {
return substr($0, match($0, str "=[^; \t]+"), RLENGTH);
}
/TYPE=(ins|del)/ {
print $1, $2, $4, $5, find("TYPE"), find("LEN")
}' file
Output:
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

Here is an awk-solution:
awk '$0~"TYPE=del" || $0~"TYPE=ins"{max=split($0,ar,";")
len=""
type=""
for(i=1; i<=max; i++){
if(ar[i]~"LEN="){len=ar[i]}
if(ar[i]~"TYPE="){type=ar[i]}
}
print $1,$2,$4,$5,type,len}' input
Output:
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

grep a range of N to N tokens

I would like to grep (I can accept non-grep answers but it is what I am most used to for this) lines which have a range of tokens delimited by a whitespace and with the ability to ignore punctuation marks. This means that if I want three to five tokens I would get lines with three, four or fives tokens but not one, two, six or twenty tokens. I have periods at the end and sometimes commas in the middle which I things I would like to account for if possible. Also the real data is actually words so I would like an answer with clear instructions for allowing characters which are not necessarily a-zA-Z, for example the word "can't".
My data is like this:
aa .
aa bb'b , c ddd e f gg .
aa bb .
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aa bb'b cc dd e f .
aaaaa bb'b c .
I tried this:
grep -e "[a-zA-Z']* ,*\{3,5\}"
What I expected to get was this:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

With GNU grep:
grep -E "^([a-zA-Z']+ *,* ){3,5}\.$" file
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

I think awk can make this task simple, because it has a variable NF that counts number of fields (separated by blanks) in each line, so:
awk 'NF >= 4 && NF <= 6' infile
I incremented its value to take into account last period. It yields:
a b c d e .
a b c d .
a b c .
EDIT: To ignore commas, use the FS variable (Field Separator) with a regular expression:
awk 'BEGIN { FS = "[[:blank:],]+" } NF >= 4 && NF <= 6' infile
It yields:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

Here's a sed example to add to the mix:
sed -n "/^\([a-zA-Z',]* \)\{3,5\}\.$/p"
Output:
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

Another possibility:
awk '/aaa+/' file
aaa bb'b cccc dddd e .
aaaa bb'b cccc , dddd .
aaaaa bb'b c .

How do I match across newlines in a perl regex?

I'm trying to work out how to match across newlines with perl (from the shell). following:
(echo a b c d e; echo f g h i j; echo l m n o p) | perl -pe 's/(c.*)/[$1]/'
I get this:
a b [c d e]
f g h i j
l m n o p
Which is what I expect. But when I place an /s at the end of my regex, I get this:
a b [c d e
]f g h i j
l m n o p
What I expect and want it to print is this:
a b [c d e
f g h i j
l m n o p
]
Is the problem with my regex somehow, or my perl invocation flags?

-p loops over input line-by-line, where "lines" are separated by $/, the input record separator, which is a newline by default. If you want to slurp all of STDIN into $_ for matching, use -0777.
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -pe 's/(c.*)/[$1]/s'
a b [c d e
]f g h i j
l m n o p
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -0777pe 's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]
See Command Switches in perlrun for information on both those flags. -l (dash-ell) will also be useful.

The problem is that your one-liner works one line at a time, your regex is fine:
use strict;
use warnings;
use 5.014;
my $s = qq|a b c d e
f g h i j
l m n o p|;
$s =~ s/(c.*)/[$1]/s;
say $s;

There's More Than One Way To Do It: since you're reading "the entire file at a time" anyway, I'd personally drop the -p modifier, slurp the entire input explicitly, and go from there:
echo -e "a b c d e\nf g h i j\nl m n o p" | perl -e '$/ = undef; $_ = <>; s/(c.*)/[$1]/s; print;'
This solution does have more characters, but may be a bit more understandable for other readers (which will be you in three months time ;-D )

Actually your one-liner looks like this:
while (<>) {
$ =~ s/(c.*)/[$1]/s;
}
It's mean that regexp works only with first line of your input.

You're reading a line at a time, so how do you think it can possibly match something that spans more than one line?
Add -0777 to redefine "line" to "file" (and don't forget to add /s to make . match newlines).
$ (echo a b c d e; echo f g h i j; echo l m n o p) | perl -0777pe's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

printing a column after a specific pattern from a text file - regex

Code for GNU sed: $sed -r 's/.(A\s)([0]\.[0-9]+)./\2/' file 0.0400213 0.023266 0.0210907 0.020474 0.00938133

Related

Compare line by line two files with multiple columns using bash using comperatives

How to print a line with no field separator in awk?

awk to print specific fields based on regex match in file

grep a range of N to N tokens

How do I match across newlines in a perl regex?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

printing a column after a specific pattern from a text file - regex

Code for GNU sed: $sed -r 's/.*(A\s)([0]\.[0-9]+).*/\2/' file 0.0400213 0.023266 0.0210907 0.020474 0.00938133

Related

Compare line by line two files with multiple columns using bash using comperatives

How to print a line with no field separator in awk?

awk to print specific fields based on regex match in file

grep a range of N to N tokens

How do I match across newlines in a perl regex?

Categories

Resources

Code for GNU sed: $sed -r 's/.(A\s)([0]\.[0-9]+)./\2/' file 0.0400213 0.023266 0.0210907 0.020474 0.00938133