Can you see where my problem is? I would like to sum the values in column $4 over all rows, except the rows whose first column is chrY or chrX; all other rows should be counted.
awk ' {if ($1 != "chrY" || $1 != "chrX") sum+=$4} END{print sum}' "$i"_pool018_2.tsv
Thank you.
Filip
if ($1 != "chrY" || $1 != "chrX")
should be
if ($1 != "chrY" && $1 != "chrX")
With a logical OR the condition is true for every row: $1 cannot be equal to both "chrY" and "chrX" at once, so at least one of the two inequality tests always holds. That is, the chrX and chrY entries are added to the sum variable as well.
example:
kent$ awk 'BEGIN{x=5;if(x!=3||x!=5)print "OK"}'
OK
If this doesn't solve your problem, you should paste input/output examples.
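The difference between the two operators can be checked on a small made-up table. The sample rows below are hypothetical, not the asker's data, and the variable names sum_or and sum_and are only for illustration:

```shell
# Hypothetical 4-column sample; only the first and fourth columns matter here.
data='chr1 a b 10
chrX a b 20
chrY a b 30
chr2 a b 5'

# With ||, the test is true for every row, so nothing is excluded.
sum_or=$(printf '%s\n' "$data" | awk '$1 != "chrY" || $1 != "chrX" { sum += $4 } END { print sum }')

# With &&, rows whose first column is chrX or chrY are skipped.
sum_and=$(printf '%s\n' "$data" | awk '$1 != "chrY" && $1 != "chrX" { sum += $4 } END { print sum }')

echo "or: $sum_or  and: $sum_and"
```

On this sample the || version sums all four rows (65), while the && version sums only chr1 and chr2 (15).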
I have 2 files,
file1:
YARRA2
file2:
59204.9493055556
59205.5930555556
So, file1 has 1 line and file2 has 2 lines. If file1 has 1 line, and file2 has more than 1 line, I want to repeat the lines in file1 according to the number of lines in file2.
So, my code is this:
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
but the output is
YARRA2
Any idea? I have also tried with
awk '{ if( NR<2 && $eprows>1 ) {print} {print}}' file1
but it is the same
You may use this awk solution:
awk '
NR == FNR {
++n2
next
}
{
s = $0
print;
++n1
}
END {
if (n1 == 1)
for (n1=2; n1 <= n2; ++n1)
print s
}' file2 file1
YARRA2
YARRA2
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
Oops! You stepped hip-deep in mixed languages.
The eprows variable is a shell variable. It's not accessible to other processes except through the environment, unless explicitly passed somehow. The awk program is inside single-quotes, which would prevent interpreting eprows even if used correctly.
The value of a shell variable is obtained with $, so
echo $eprows
2
One way to insert the value into your awk script is by interpolation:
awk '{ if( NR<2 && '"$eprows"'>1 ) {print} {print}}' file1
That uses a lesser known trick: you can switch between single- and double-quotes as long as you don't introduce spaces. Because double-quoted strings in the shell are interpolated, awk sees
{ if( NR<2 && 2>1 ) {print} {print} }
Awk also lets you pass values to awk variables on the command line, thus:
awk -v eprows="$eprows" '{ if( NR<2 && eprows >1 ) {print} {print}}' file1
but you'd have nicer awk this way:
awk -v eprows="$eprows" 'NR < 2 && eprows > 1 {print} {print}' file1
whitespace and brevity being elixirs of clarity.
That works because in the awk pattern / action paradigm, pattern is anything that can be reduced to true/false. It need not be a regex, although it usually is.
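As a rough sketch of the same idea, the count can be passed with -v and used in a loop, so that the single line is repeated once per line of file2 rather than just doubled. The scratch directory and the awk variable name n are assumptions made for this example:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf 'YARRA2\n' > "$tmp/file1"
printf '59204.9493055556\n59205.5930555556\n' > "$tmp/file2"

# Shell side: count the lines of file2.
eprows=$(wc -l < "$tmp/file2")

# awk side: n is an awk variable set from the shell value via -v.
# The first line is printed n times when n > 1; any other line is printed once.
repeated=$(awk -v n="$eprows" 'NR == 1 && n > 1 { for (i = 1; i <= n; i++) print; next } { print }' "$tmp/file1")

printf '%s\n' "$repeated"
rm -rf "$tmp"
```

With the two-line file2 from the question this prints YARRA2 twice. Quoting "$eprows" keeps the shell from splitting or globbing the value before awk sees it.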
One awk idea:
awk '
FNR==NR { cnt++; next } # count number of records in 1st file
# no specific processing for 2nd file => just scan through to end of file
END { if (FNR==1 && cnt >=2) # if 2nd file has just 1 record (ie, FNR==1) and 1st file had 2+ records then ...
for (i=1;i<=cnt;i++) # for each record in 1st file ...
print # print current (and only) record from 2nd file
}
' file2 file1
This generates:
YARRA2
YARRA2
I am writing an if-then-else statement using awk in a bash script.
What I would like to do is identify lines with col 1 values not matching a particular string (rs or chr) and append a prefix (chr) to the col 1 values for those identified lines. All lines with the matched string should print as they were - no appending.
My line of code so far is:
awk '{if (! ($1 ~ /rs/ || $1 ~ /chr/)) {($1 == "chr"$1); print $0}}; else {print $0}' filename > newfilename
I keep on receiving syntax error messages with this code.
I can perform the identification and the appending successfully on their own but am having problems combining them into one command.
With idiomatic awk you can rewrite this as
awk '$1!~/rs/ && $1!~/chr/ {$1="chr"$1}1'
or if you like
awk '!($1 ~ /rs/ || $1 ~ /chr/) {$1="chr"$1}1'
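or, if rs and chr only ever occur at the very start of the line (these patterns are anchored and test the whole line, not just $1)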
awk '!(/^rs/ || /^chr/) {$1="chr"$1}1'
you can avoid assignment since there is no further action other than printing with
awk '!(/^rs/ || /^chr/) {print "chr"$0;next}1'
Sometimes, writing the code across multiple lines helps you spot the mistake:
'{
if (! ($1 ~ /rs/ || $1 ~ /chr/)) {
($1 == "chr"$1);
print $0
}
};
else {print $0}'
You will see that the else is outside the {...} block.
To stay with your code, this fixes the problem (note also that == is a comparison; the assignment needs a single =):
'{
    if (! ($1 ~ /rs/ || $1 ~ /chr/)) {
        $1 = "chr"$1;
        print $0
    } else
        print $0
}'
For improvements to the code, check karakra's answer.
The way to write your code syntactically correctly is:
awk '!($1 ~ /rs/ || $1 ~ /chr/) {$1="chr"$1} 1' filename > newfilename
but be warned that the assignment might change the white space in your file so what you probably really want is:
awk '!($1 ~ /rs/ || $1 ~ /chr/) {sub(/^[[:space:]]*/,"&chr")} 1' filename > newfilename
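The white-space effect can be seen on a single made-up line. This is only a sketch: the input line is hypothetical, and [ \t] is used here as a portable stand-in for [[:space:]] in case the local awk lacks POSIX character classes.

```shell
# Hypothetical input line with runs of spaces between fields.
line='7   100    A'

# Assigning to $1 makes awk rebuild the record, joining fields with single spaces.
rebuilt=$(printf '%s\n' "$line" | awk '!($1 ~ /rs/ || $1 ~ /chr/) { $1 = "chr" $1 } 1')

# sub() edits the record text in place, so the original spacing survives.
preserved=$(printf '%s\n' "$line" | awk '!($1 ~ /rs/ || $1 ~ /chr/) { sub(/^[ \t]*/, "&chr") } 1')

printf '[%s]\n[%s]\n' "$rebuilt" "$preserved"
```

The first result has its fields separated by single spaces; the second keeps the original runs of spaces after the new chr prefix.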
I would like to use awk to extract some information from my data.
As an example, I have data with 5 columns.
I would like to extract lines based on col1 and col2:
extract all lines where col1 is 'a' and col2 starts with 'LINE' or 'SINE' or 'ERV'.
I tried
awk '{if ($1 == "a" && $2 ~ /SINE/ || $2 ~ /LINE/ || $2 ~ /ERV/ ) print $0}' myData.txt
Somehow this is not working
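In your attempt, && binds more tightly than ||, so the $1 == "a" test only applies to the SINE alternative; the regexes are also unanchored, so they match anywhere in col2. You can use: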
awk '$1 == "a" && $2 ~ /^(LINE|SINE|ERV)/' myData.txt
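The effect of the grouping can be checked by counting matches on a few made-up rows. The rows and the variable names loose and strict are assumptions for this sketch:

```shell
# Hypothetical rows: only the first one should be selected.
rows='a LINE/L1 x
b SINE/Alu x
b ERV1 x
a DNA x'

# Without parentheses, && binds tighter than ||, so this reads as
# ($1 == "a" && $2 ~ /SINE/) || $2 ~ /LINE/ || $2 ~ /ERV/.
loose=$(printf '%s\n' "$rows" | awk '$1 == "a" && $2 ~ /SINE/ || $2 ~ /LINE/ || $2 ~ /ERV/ { n++ } END { print n + 0 }')

# Grouping the alternatives (and anchoring them) applies the col1 test to all three.
strict=$(printf '%s\n' "$rows" | awk '$1 == "a" && $2 ~ /^(LINE|SINE|ERV)/ { n++ } END { print n + 0 }')

echo "loose: $loose  strict: $strict"
```

Here the ungrouped condition also lets through the "b ERV1" row, giving 2 matches, while the grouped one selects only the "a LINE/L1" row.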
I have the following statement in awk
$ gawk '$2 == "-" { print $1 }' file
I was wondering what exactly this instruction does, because I can't extract exactly the words I need.
Edit: How can I skip the lines before the following asterisks?
Let's say I have the following lines:
text
text
text
* * * * * * *
line1 -
line2 -
And then I want to filter just
line1
line2
with the command I posted above...
$ gawk '$2 == "-" { print $1 }' file
Thanks for your time and response!
This will find all lines on which the second column (Separated by spaces) is a -, and will then print the first column.
The first part ($2 == "-") checks for the second column being a -, and then if that is the case, runs the attached {} block, which prints the first column ($0 being the whole line, and $1, $2, etc being the first, second, ... columns.)
Spaces are the separator here simply because they are the default separator in awk.
Edit: To do what you want to do now, try the following (not the most elegant, but it should work). The flag p is set once the line of asterisks has been seen and stays set from then on:
gawk 'BEGIN { p = 0 } { if (p != 0 && $2 == "-") { print $1 } else if ($0 == "* * * * * * *") { p = 1 } }' file
Spread over more lines for clarity on what's happening:
gawk 'BEGIN { p = 0 }
{ if (p != 0 && $2 == "-")
    { print $1 }
  else if ($0 == "* * * * * * *")
    { p = 1 }
}' file
Answer to the original question:
If the second column in a line from the file is exactly the string "-", then it prints out the first column of that line. Columns are separated by whitespace by default.
This would match and print out one:
one - two three
This would not:
one two three four
Answer to the revised question:
This code should do what you need after the match of the given string:
awk '/\* \* \* \* \* \* \*/{i++}i && $2 == "-" { print $1 }' data2.txt
Testing on this data gives the following output:
2two
2two
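For comparison, here is a minimal sketch of the flag approach run on the sample lines from the question. The variable names input, names and seen are chosen for this example only:

```shell
# The sample input from the question.
input='text
text
text
* * * * * * *
line1 -
line2 -'

# seen stays 0 until a line equal to the asterisk marker is read; after that,
# lines whose second field is "-" have their first field printed.
names=$(printf '%s\n' "$input" | awk '$0 == "* * * * * * *" { seen = 1; next } seen && $2 == "-" { print $1 }')

printf '%s\n' "$names"
```

This prints line1 and line2. Comparing $0 with the marker as a string avoids having to escape each asterisk in a regex.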
my csv data file is like this
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
Now, as you can see, I want to filter out all records like lines 2 and 3 (i.e. a valid name must contain no white space and be at least 3 characters long):
MRS.,RAJ KUMAR,male
MR.,N,Male
and place them in a file called rejected_list.csv; all the rest go in a file called clean_list.csv.
Here is my gawk script for it:
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that this script does not recognise the '\S' character class (any character except white space). It selects all names that start with or contain an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add a limit of {3,} the script fails.
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+' etc. but I can't get what I want.
Can anyone tell me what the problem is?
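Use [[:graph:]] (the POSIX class [:graph:] inside a bracket expression) instead of \S to match visible, non-space characters. \S is a gawk extension that older gawk versions do not recognise, so there it is read as a literal S and will not work.
Additionally, in older gawk versions the {3,} interval expression only works in --posix or --re-interval mode.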
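I added a rejection condition: not exactly 3 fields. The title and gender patterns are also anchored and their dots escaped, so that they match whole fields only.
gawk -F, '
BEGIN {
titles = "^(MRS|MR|MS|MISS|MASTER|SMT|DR|BABY|PROF)\\.$"
genders = "^(M|F|Male|Female)$"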
}
$1 !~ titles || $2 ~ /[[:space:]]/ || length($2) < 3 || $3 !~ genders || NF != 3 {
print > "rejected_list.csv"
next
}
{ print > "clean_list.csv" }
' < DATA_file.csv
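A sketch of such a filter run end to end on the sample data from the question follows. It is written for plain awk rather than gawk, works in a scratch directory, and uses anchored patterns with escaped dots; those choices, and the variable names clean and rejected, are assumptions of this example rather than part of the original script.

```shell
# Recreate the sample CSV in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/DATA_file.csv" <<'EOF'
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
EOF

(
  cd "$tmp" || exit 1
  awk -F, '
    BEGIN {
      # Anchored so the whole field must match; "\\." is a literal dot.
      titles  = "^(MRS|MR|MS|MISS|MASTER|SMT|DR|BABY|PROF)\\.$"
      genders = "^(M|F|Male|Female)$"
    }
    # Reject on wrong field count, bad title, blank in name, short name, bad gender.
    NF != 3 || $1 !~ titles || $2 ~ /[ \t]/ || length($2) < 3 || $3 !~ genders {
      print > "rejected_list.csv"
      next
    }
    { print > "clean_list.csv" }
  ' DATA_file.csv
)

clean=$(cat "$tmp/clean_list.csv")
rejected=$(cat "$tmp/rejected_list.csv")
rm -rf "$tmp"

printf 'clean:\n%s\nrejected:\n%s\n' "$clean" "$rejected"
```

On this input the header line, the RAJ KUMAR record and the N record end up in rejected_list.csv, and the other three records in clean_list.csv.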