Awk to skip the blank lines - regex

The output of my script is tab delimited, using awk as:
awk -v variable=$bashvariable '{print variable"\t single\t" $0"\t double"}' myinfile.c
The awk command is run in a while loop which updates the variable value and the file myinfile.c for every cycle.
I am getting the expected results with this command.
But if myinfile.c contains a blank line (which it can), the command prints a line with no relevant information. Can I tell awk to ignore blank lines?
I know it can be done by removing the blank lines from myinfile.c before passing it on to awk.
I know the sed and tr ways, but I want awk to do it within the above command itself, not as a separate step or a piped solution such as:
sed '/^$/d' myinfile.c
tr -s "\n" < myinfile.c
Thanks in advance for your suggestions and replies.

There are two approaches you can try to filter out blank lines:
awk 'NF' data.txt
and
awk 'length' data.txt
Just put these at the start of your command, i.e.,
awk -v variable=$bashvariable 'NF { print variable ... }' myinfile
or
awk -v variable=$bashvariable 'length { print variable ... }' myinfile
Both of these act as gatekeepers/if-statements.
The first approach works by only printing lines where the number of fields (NF) is not zero (i.e., greater than zero).
The second method looks at the line length and acts only if the length is not zero (i.e., greater than zero).
You can pick the approach that is most suitable for your data/needs.
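The difference between the two shows up on lines that contain only whitespace: with the default FS such a line has no fields (NF is 0), but it still has a nonzero length. A quick check, using a hypothetical input whose middle line is a single space:
$ printf 'a\n \nb\n' | awk 'NF'
a
b
$ printf 'a\n \nb\n' | awk 'length'
a
 
b
(The middle output line of the second command is the lone space.)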

You could just add
/^\s*$/ {next;}
to the front of your script. It will match blank lines (including lines that contain only whitespace) and skip the rest of the awk matching rules for them. Put it all together:
awk -v variable=$bashvariable '/^\s*$/ {next;} {print variable"\t single\t" $0"\t double"}' myinfile.c
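Note that \s is a GNU awk extension; with other awk implementations, write the same rule as /^[[:space:]]*$/ {next;} or /^[ \t]*$/ {next;}.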

Maybe you could try this out:
awk -v variable=$bashvariable '$0{print variable"\t single\t" $0"\t double"}' myinfile.c
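One caveat with $0 as the condition: awk evaluates a line consisting of just the number 0 as false, so such a line would be skipped along with the blank ones.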

Try this:
awk -v variable=$bashvariable '/^.+$/{print variable"\t single\t" $0"\t double"}' myinfile.c

I haven't seen this solution suggested yet: awk '!/^\s*$/{print $1}' will run the block for all non-blank lines.
\s metacharacter is not available in all awk implementations, but you can also write !/^[ \t]*$/.
https://www.gnu.org/software/gawk/manual/gawk.html
\s Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.

Based on Levon's answer, you may just add | awk 'length { print $0 }' to the end of the command.
So change
awk -v variable=$bashvariable '{ whatever }' myinfile.c
to
awk -v variable=$bashvariable '{ whatever }' myinfile.c | awk 'length { print $0 }'
In case this doesn't work, use | awk 'NF { print $0 }' instead.

Another awk way to trim out only the truly zero-length lines, while keeping the ones that contain just spaces or tabs, is this:
awk 8 RS=
Here 8 is simply an always-true pattern (so the default print action runs), and the empty RS puts awk into paragraph mode, where records are separated by runs of truly empty lines. Just doing awk NF trims out both line 3 (zero length) and line 5 (spaces and tabs) of this input:
1 abc
2 def
3
4 3591952
5
6 93253
awk NF gives:
1 abc
2 def
3 3591952
4 93253
but the RS= approach keeps the whitespace-only line for you:
1 abc
2 def
3 3591952
4
5 93253
Note: lines containing only \013 (\v, VT), \014 (\f, FF), or \015 (\r, CR) aren't skipped by awk NF with the default FS (" "), despite those characters also belonging to POSIX [[:space:]].

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string, using this awk command:
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of a given column? And instead of repeating the substr command, how can I use it only once to print specific characters from different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} - iterates over all fields in the current record, where i is the field ID and $i is the field value; the last character of each field (retrieved with substr($i,length($i))) is appended to the r variable
END{print r} prints the r variable once the awk script finishes processing.
In the second solution, r is cleared at the start of each line's processing, and its value is printed after all fields in the current record have been processed.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: set the record separator to any character followed by whitespace OR the end of the line. Then, as per the OP's requirement, strip the unnecessary newlines/spaces from each fetched value (RT) and keep appending it to val; finally, once awk is done reading the whole Input_file, print the variable's value.
2nd solution: set the record separator to null (paragraph mode) and use the match function to look for the regex (.[[:space:]]+)|(.$), which captures each word's last letter; for every match found, keep adding the matched value to a variable, then print that variable's value in the END block of the awk program.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines are the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
Separate the words onto lines, reverse each so that we can pick the first character easily, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most trivial one...
I would harness GNU AWK for this as follows; let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: inform AWK to treat any alphabetic character at the end of a word as a field, and to use the empty string as the output field separator. $1=$1 is used to trigger line rebuilding with the specified OFS. If you want to know more about the start/end-of-word operators, read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() extracts the characters here, and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split out just the word-final characters and keep that content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatic:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage of both of these is that single-letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

Add delimiters at specific indexes

I want to add a delimiter at some indexes in each line of a file.
I have a file with data:
10100100010000
20200200020000
And I know the offset of each column (2, 5 and 9)
With this sed command: sed 's/\(.\{2\}\)/&,/;s/\(.\{6\}\)/&,/;s/\(.\{11\}\)/&,/' myFile
I get the expected output:
10,100,1000,10000
20,200,2000,20000
but with a large number of columns (~200) and rows (300k) it is really slow.
Is there an efficient alternative?
1st solution: With GNU awk, could you please try the following:
awk -v OFS="," '{$1=$1}1' FIELDWIDTHS="2 3 4 5" Input_file
2nd solution: Using sed, try the following.
sed 's/\(..\)\(...\)\(....\)\(.....\)/\1,\2,\3,\4/' Input_file
3rd solution: awk solution using substr.
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10,5)}' Input_file
In the above substr solution I have taken 5 digits/characters in substr($0,10,5); in case you want to take all the characters starting from the 10th position, use substr($0,10), which will print the rest of the line.
Output will be as follows.
10,100,1000,10000
20,200,2000,20000
Modifying your sed command to make it add all the separators in one shot would likely make it perform better:
sed 's/^\(.\{2\}\)\(.\{3\}\)\(.\{4\}\)/\1,\2,\3,/' myFile
Or with extended regular expression:
sed -E 's/(.{2})(.{3})(.{4})/\1,\2,\3,/' myFile
Output:
10,100,1000,10000
20,200,2000,20000
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='2 3 4 *' -v OFS=',' '{$1=$1}1' file
10,100,1000,10000
20,200,2000,20000
You'll need a newer version of gawk for * at the end of FIELDWIDTHS to mean "whatever's left"; with older versions, just choose a large number like 999.
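With ~200 columns you won't want to write the widths by hand. Here is a minimal bash sketch (my own addition, with arbitrary variable names, not code from the answers) that derives the FIELDWIDTHS string from the cut offsets given in the question:
offsets=(2 5 9)              # cut positions from the question
widths=""
prev=0
for o in "${offsets[@]}"; do
  widths+="$((o - prev)) "   # each width is the distance from the previous cut
  prev=$o
done
widths+="*"                  # "whatever's left" (newer gawk; use 999 on older ones)
awk -v FIELDWIDTHS="$widths" -v OFS=',' '{$1=$1}1' myFile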
If you start the substitutions from the back, you can use the number flag to s to specify which occurrence of any character you'd like to append a comma to:
$ sed 's/./&,/9;s/./&,/5;s/./&,/2' myFile
10,100,1000,10000
20,200,2000,20000
You could automate that a bit further by building the command with a printf statement:
printf -v cmd 's/./&,/%d;' 9 5 2
sed "$cmd" myFile
or even wrap that in a little shell function so we don't have to care about listing the columns in reverse order:
gencmd() {
local arr
# Sort arguments in descending order
IFS=$'\n' arr=($(sort -nr <<< "$*"))
printf 's/./&,/%d;' "${arr[@]}"
}
sed "$(gencmd 2 5 9)" myFile

Is there a way to obtain the current pattern searched in an AWK script?

The basic idea is this. Suppose that you want to search a file for multiple patterns coming from a pipe with awk:
... | awk -f - '{...}' someFile.txt
* '...' is just short for some code
* '-f -' indicates the pattern is taken from pipe
Is there a way to know which pattern is being searched at each instant within the awk script?
(Just as you know $1 is the first field, is there something like $PATTERN that contains the
current pattern being searched, or a way to get something like it?)
More Elaboration:
if I have 2 files:
someFile.txt containing:
1
2
4
patterns.txt containing:
1
2
3
4
running this command:
cat patterns.txt |awk -f - '{...}' someFile.txt
What should I type between the braces so that only the patterns in patterns.txt that have
not been matched in someFile.txt are printed? (In this case, the number 3 in patterns.txt is not matched.)
Under the requirements that patterns.txt be supplied as stdin and that the processing be done with awk:
$ cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
3
This was tested using GNU awk.
Explanation
We want to remove from patterns.txt anything that matches a line in someFile.txt. To do this, we first read in someFile.txt and create patterns from it. Next, we print only the lines from patterns.txt that do not match any of the patterns from someFile.txt.
FNR==NR{p=p "\n" $0;next;}
NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: someFile.txt. We save all such lines in the newline-separated variable p. We then tell awk to skip the remaining commands and jump to the next line.
p !~ $0
If we got here, then we are now reading the second named file on the command line, which is - for stdin. This boolean condition evaluates to either true or false. If it is true, the line is printed. If not, it is skipped. In other words, the above is awk's cryptic shorthand for:
p !~ $0 {print $0}
Another approach: read the patterns into an array first, delete each one as it matches a line, and print whatever is left:
cmd | awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($0 ~ p) delete pats[p]} END{ for (p in pats) print p }' - someFile.txt
Another way in awk:
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -
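For the sample files, both of these alternatives also print 3.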

awk how to set record separator as multiple consecutive empty lines or lines only include space and/or tab characters?

I know I can use RS="" to set the record separator to multiple consecutive empty lines. However, if those lines contain space or tab characters, it will not work. I'm thinking of setting RS to some kind of regular expression to do the match, but that's hard, since in this case \n is often used as the field separator FS. Any suggestions?
Here is a way to do it:
awk '!NF {$0=""}1' file | awk -v RS="" '{print NR,$0}'
The first awk counts the fields on each line. NF will be 0 for blank lines and for lines containing only spaces and tabs; those lines are then set to the empty string. After this you can use RS="" in the second awk.
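A quick check of the two-pass pipeline, using a hypothetical three-line input whose middle line is a single space:
$ printf 'a\n \nb\n' | awk '!NF {$0=""}1' | awk -v RS="" '{print NR,$0}'
1 a
2 b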
Here is a gnu awk version (due to multiple characters in RS):
awk -v RS="\n([[:space:]]*\n)+" '{print NR,$0}' file
It may work without the parentheses, but I am not sure all cases would be covered then:
awk -v RS="\n[[:space:]]*\n+" '{print NR,$0}' file
With GNU awk for multi-char RS:
awk -v RS='\n(([[:space:]]*\n)+|$)' '{print NR, "<" $0 ">"}' file
e.g.
$ awk '{print NR, "<" $0 ">"}' file
1 <a>
2 < b>
3 < >
4 < c>
$ awk -v RS='\n(([[:space:]]*\n)+|$)' '{print NR, "<" $0 ">"}' file
1 <a
b>
2 < c>

awk, a field doesn't match but it should match

I have a file structured as a record list, where the field separator is \t.
I want to extract only records where the second field is a number from 1 to 9, but my awk script doesn't work.
The awk script is
cat file |awk -v FS="\t" '$2 ~ /[0-9]{1}/ {print $0;}'
or this
cat file |awk -v FS="\t" '$2 ~ /.{1}/ {print $0;}' # because all the second fields of my file are numbers
Why don't these scripts work? Isn't the regex a good regex?
Update
Even with the interval {1}, you are still going to match a field like 23 because the 2 matches a single number. What you really want to use are anchors and forget about intervals:
awk '$2 ~ /^[0-9]$/{print}' FS="\t" file
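A quick check of the anchored version against a hypothetical tab-separated sample (using -F for brevity):
$ printf 'a\t5\nb\t23\n' | awk -F'\t' '$2 ~ /^[0-9]$/{print}'
a	5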
The problem is the use of the interval {1}: gawk before version 4 doesn't support intervals by default, but it will if you add the following flag: --re-interval
Try this:
awk --re-interval '$2 ~ /[0-9]{1}/{print}' FS="\t" file
Some other things to note:
Built-in vars such as FS can be assigned after the program text without the need for -v
You can use just print rather than print $0, as that is its default behavior
Useless use of cat: awk can take a filename as an argument, so use that instead
If you want to ensure the 2nd field is a single-digit number, you don't really need a regex:
awk '1 <= $2 && $2 <= 9 {print}'
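A quick check with a hypothetical tab-separated sample (23 is out of range, 0 is below it; the default FS also splits on tabs):
$ printf 'a\t5\nb\t23\nc\t0\n' | awk '1 <= $2 && $2 <= 9 {print}'
a	5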