I am writing an if-then-else statement using awk in a bash script.
What I would like to do is identify lines with col 1 values not matching a particular string (rs or chr) and append a prefix (chr) to the col 1 values for those identified lines. All lines with the matched string should print as they were - no appending.
My line of code so far is:
awk '{if (! ($1 ~ /rs/ || $1 ~ /chr/)) {($1 == "chr"$1); print $0}}; else {print $0}' filename > newfilename
I keep on receiving syntax error messages with this code.
I can perform the identification and the appending successfully on their own but am having problems combining them into one command.
With idiomatic awk you can rewrite this as
awk '$1!~/rs/ && $1!~/chr/ {$1="chr"$1}1'
or if you like
awk '!($1 ~ /rs/ || $1 ~ /chr/) {$1="chr"$1}1'
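or, anchoring the patterns to the start of the line (equivalent only if rs or chr cannot occur later in the first field and lines have no leading whitespace)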
awk '!(/^rs/ || /^chr/) {$1="chr"$1}1'
Since nothing happens afterwards except printing, you can avoid the assignment altogether:
awk '!(/^rs/ || /^chr/) {print "chr"$0;next}1'
Sometimes, writing the code over multiple lines helps you spot the mistake:
'{
if (! ($1 ~ /rs/ || $1 ~ /chr/)) {
($1 == "chr"$1);
print $0
}
};
else {print $0}'
You will see that the else is outside the {...} block.
Staying with your code, this fixes the problem (note that $1 == "chr"$1 is only a comparison; the assignment needs a single =):
'{
    if (! ($1 ~ /rs/ || $1 ~ /chr/)) {
        $1 = "chr"$1
        print $0
    } else
        print $0
}'
For further improvements to the code, see karakra's answer.
The way to write your code syntactically correctly is:
awk '!($1 ~ /rs/ || $1 ~ /chr/) {$1="chr"$1} 1' filename > newfilename
but be warned that the assignment might change the white space in your file so what you probably really want is:
awk '!($1 ~ /rs/ || $1 ~ /chr/) {sub(/^[[:space:]]*/,"&chr")} 1' filename > newfilename
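To see the whitespace difference mentioned above, here is a small sketch that feeds both commands a tab-separated line. The sample line is made up, and the sketch uses [ \t] in place of [[:space:]] only to stay portable to older awks without POSIX character classes:

```shell
# Hypothetical sample input: two fields separated by a tab.
line=$(printf '1\t12345')

# Assigning to $1 makes awk rebuild $0 with OFS (a single space by default).
assigned=$(printf '%s\n' "$line" |
    awk '!($1 ~ /rs/ || $1 ~ /chr/) {$1 = "chr" $1} 1')

# sub() edits $0 in place, so the original tab survives.
substituted=$(printf '%s\n' "$line" |
    awk '!($1 ~ /rs/ || $1 ~ /chr/) {sub(/^[ \t]*/, "&chr")} 1')

printf '%s\n' "$assigned" "$substituted"
```

The first result has the tab replaced by a space, while the second keeps it.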
I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If a gnu-awk solution is acceptable, using RS and RT can give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers I think that I have a working, robust, and (maybe) efficient code now:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS='^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that can match zero characters. Consider this simple example: let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. Here the first part is true, but the second is false because the match found is the zero-length one before the first character (RSTART is set to 1), so the loop body never runs.
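The difference can be reproduced without grep. The sketch below runs the loop from the question and, for comparison, a variant that steps over an empty match by one character (the same idea as the tail loop in the final version posted in this thread); the variable names stops_early and keeps_going are only for illustration:

```shell
# The loop from the question: it stops as soon as a match is empty.
stops_early=$(printf '1A2A3\n' | awk -v ERE='A*' '{
    while (match($0, ERE) && RLENGTH > 0) {
        print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + 1)
    }
}')

# Variant that skips one character after an empty match and keeps going.
keeps_going=$(printf '1A2A3\n' | awk -v ERE='A*' '{
    tail = $0
    while (tail != "" && match(tail, ERE)) {
        if (RLENGTH > 0) {
            print substr(tail, RSTART, RLENGTH)
            tail = substr(tail, RSTART + RLENGTH)
        } else {
            tail = substr(tail, RSTART + 1)
        }
    }
}')

printf 'question loop: [%s]\n' "$stops_early"
printf 'skipping loop: [%s]\n' "$keeps_going"
```

The first loop prints nothing for this input; the second prints the two A matches, one per line.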
I have 2 files,
file1:
YARRA2
file2:
59204.9493055556
59205.5930555556
So, file1 has 1 line and file2 has 2 lines. If file1 has 1 line, and file2 has more than 1 line, I want to repeat the lines in file1 according to the number of lines in file2.
So, my code is this:
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
but the output is
YARRA2
Any idea? I have also tried with
awk '{ if( NR<2 && $eprows>1 ) {print} {print}}' file1
but it is the same
You may use this awk solution:
awk '
NR == FNR {
++n2
next
}
{
s = $0
print;
++n1
}
END {
if (n1 == 1)
for (n1=2; n1 <= n2; ++n1)
print s
}' file2 file1
YARRA2
YARRA2
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
Oops! You stepped hip-deep in mixed languages.
The eprows variable is a shell variable. It's not accessible to other processes except through the environment, unless explicitly passed somehow. The awk program is inside single-quotes, which would prevent interpreting eprows even if used correctly.
The value of a shell variable is obtained with $, so
echo $eprows
2
One way to insert the value into your awk script is by interpolation:
awk '{ if( NR<2 && '"$eprows"'>1 ) {print} {print}}' file1
That uses a lesser known trick: you can switch between single- and double-quotes as long as you don't introduce spaces. Because double-quoted strings in the shell are interpolated, awk sees
{ if( NR<2 && 2>1 ) {print} {print} }
Awk also lets you pass values to awk variables on the command line, thus:
awk -v eprows="$eprows" '{ if( NR<2 && eprows >1 ) {print} {print}}' file1
but you'd have nicer awk this way:
awk -v eprows="$eprows" 'NR < 2 && eprows > 1 {print} {print}' file1
whitespace and brevity being elixirs of clarity.
That works because in the awk pattern / action paradigm, pattern is anything that can be reduced to true/false. It need not be a regex, although it usually is.
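A self-contained way to try this is sketched below. It recreates the two sample files from the question in a scratch directory (the directory and the variable names unset_in_awk and passed_in are only for illustration) and runs the same awk program with and without -v:

```shell
# Recreate the two sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf 'YARRA2\n' > "$dir/file1"
printf '59204.9493055556\n59205.5930555556\n' > "$dir/file2"

eprows=$(wc -l < "$dir/file2")

# Without -v, awk sees its own unset variable eprows, which compares as 0.
unset_in_awk=$(awk '{ if (NR < 2 && eprows > 1) {print} {print} }' "$dir/file1")

# -v copies the shell value into an awk variable of the same name.
passed_in=$(awk -v eprows="$eprows" '{ if (NR < 2 && eprows > 1) {print} {print} }' "$dir/file1")

printf 'without -v:\n%s\nwith -v:\n%s\n' "$unset_in_awk" "$passed_in"
rm -r "$dir"
```

Without -v the line is printed once; with -v it is printed twice.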
One awk idea:
awk '
FNR==NR { cnt++; next } # count number of records in 1st file
# no specific processing for 2nd file => just scan through to end of file
END { if (FNR==1 && cnt >=2) # if 2nd file has just 1 record (ie, FNR==1) and 1st file had 2+ records then ...
for (i=1;i<=cnt;i++) # for each record in 1st file ...
print # print current (and only) record from 2nd file
}
' file2 file1
This generates:
YARRA2
YARRA2
I am having trouble passing a parameter to a regex inside an awk command. What seems to be the problem here? Does the regex read the parameter name instead of the value? Thanks
FILE=*some file here*
TEST_STRING1=test
awk -v testString1="$TEST_STRING1" 'BEGIN {
}
{
##Sample REGEX HERE
if ( $0 ~ "^testString1.* - \[.*\] - .*$") {
##DO SOMETHING HERE
}
}
END{}
' $FILE
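You need to use awk string concatenation; inside the quotes, testString1 is just literal text. Because the regex is built from a string, each backslash has to be doubled:
if ( $0 ~ "^" testString1 ".* - \\[.*\\] - .*$" ) {
Or do the variable substitution in the shell. The quoting is a bit tricky, and -v also processes escape sequences, so the backslashes stay doubled:
awk -v regex="^${TEST_STRING1}"'.* - \\[.*\\] - .*$'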
Then, in awk
if ($0 ~ regex) ...
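Here is a minimal sketch of the concatenation approach on made-up log lines (the three sample lines are assumptions, not data from the question). Note that the contents of testString1 are still interpreted as a regex, so any metacharacters in it stay active:

```shell
TEST_STRING1=test   # same placeholder value as in the question

# Made-up sample lines; only the first has the "test... - [...] - ..." shape.
matched=$(printf '%s\n' \
        'test run - [INFO] - started' \
        'other - [INFO] - started' \
        'test run - INFO - started' |
    awk -v testString1="$TEST_STRING1" '
        # "\\[" in the string becomes "\[" in the regex, i.e. a literal bracket.
        $0 ~ ("^" testString1 ".* - \\[.*\\] - .*$")
    ')

printf '%s\n' "$matched"
```

Only the first sample line is printed.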
I would like to use awk to extract some information from my data.
As an example I have a data with 5 column
I would like to extract based on col1 and col2
Extract all lines where col1 is 'a' and col2 starts with 'LINE' or 'SINE' or 'ERV'
i tried
awk '{if ($1 == "a" && $2 ~ /SINE/ || $2 ~ /LINE/ || $2 ~ /ERV/ ) print $0}' myData.txt
Somehow this is not working
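In awk, && binds more tightly than ||, so your condition applies the col1 test only to the SINE case. Group the alternatives and anchor them to the start of the field: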
awk '$1 == "a" && $2 ~ /^(LINE|SINE|ERV)/' myData.txt
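The effect of the operator precedence can be checked on a few made-up rows (the five-column sample data below is an assumption, since the question does not show any):

```shell
# Made-up five-column rows; only the first should be selected.
rows='a LINE/L1 x y z
b SINE/Alu x y z
a DNA/hAT x y z
b ERV1 x y z'

# Condition from the question: && binds tighter than ||, so it is read as
# ($1 == "a" && $2 ~ /SINE/) || $2 ~ /LINE/ || $2 ~ /ERV/
loose=$(printf '%s\n' "$rows" |
    awk '$1 == "a" && $2 ~ /SINE/ || $2 ~ /LINE/ || $2 ~ /ERV/')

# Grouping the alternatives applies the col1 test to all of them.
strict=$(printf '%s\n' "$rows" | awk '$1 == "a" && $2 ~ /^(LINE|SINE|ERV)/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The ungrouped condition also lets the "b ERV1" row through, while the grouped one keeps only the "a LINE/L1" row.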
my csv data file is like this
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
As you can see, I want to reject rows like lines 2 and 3 (i.e. the name must contain no whitespace and be at least 3 characters long):
MRS.,RAJ KUMAR,male
MR.,N,Male
and place it in a file called rejected_list.csv, rest all go in a file called clean_list.csv
hence here is my gawk script for it
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that the script does not recognise the '\S' character class (any character except whitespace): it selects all names that contain an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add an interval such as {3,} the script fails:
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+' etc. but I can't get what I want.
Can anyone tell me what the problem is?
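Use [[:graph:]] (the [:graph:] class has to sit inside a bracket expression) instead of \S to match any visible character. Older gawk releases (before 4.0, as far as I know) do not recognize \S, so there it is read as a plain S.
Additionally, in those older releases the {3,} interval expression only works in --posix or --re-interval mode.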
I added a rejection condition: not exactly 3 fields
gawk -F, '
BEGIN {
titles = "MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF."
genders = "M|F|Male|Female"
}
$1 !~ titles || $2 ~ /[[:space:]]/ || length($2) < 3 || $3 !~ genders || NF != 3 {
print > "rejected_list.csv"
next
}
{ print > "clean_list.csv" }
' < DATA_file.csv
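Note that the title and gender patterns above are unanchored and treat the dot as "any character", so a value such as MRX would also pass. Below is a sketch of a stricter variant, assuming whole-field matches are what is wanted. It prints a label to stdout instead of writing the two files, uses index() for the space test to avoid depending on character-class support, and includes one made-up row (MRX) to show the difference:

```shell
# Sample rows from the question plus one made-up row ("MRX") to show anchoring.
data='MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRX,SHASHI,Female'

result=$(printf '%s\n' "$data" | awk -F, '
    BEGIN {
        # Anchored, with the dot escaped, so "MRX" no longer passes as a title.
        titles  = "^(MRS|MR|MS|MISS|MASTER|SMT|DR|BABY|PROF)\\.$"
        genders = "^(M|F|Male|Female)$"
    }
    NF != 3 || $1 !~ titles || index($2, " ") || length($2) < 3 || $3 !~ genders {
        print "rejected: " $0
        next
    }
    { print "clean: " $0 }
')

printf '%s\n' "$result"
```

Only the MADHU row is labelled clean; the other three are rejected for a space in the name, a name that is too short, and an unknown title respectively.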