AWK: dynamically change FS or RS - regex

I cannot seem to get the trick to interchange the FS/RS variables dynamically, so that I get the following results from the input:
Input_file
header 1
header 2
{
something should not be removed
}
50
(
auto1
{
type good;
remove not useful;
}
auto2
{
type good;
keep useful;
}
auto3
{
type moderate;
remove not useful;
}
)
Output_file
header 1
header 2
{
something that should not be removed
}
50
(
auto1//good
{
type good;//good
}
auto2//good
{
type good;//good
keep useful;
}
auto3//moderate
{
type moderate;//moderate
}
)
The key things are:
No change happens when the code block {...} is not preceded by an autoX (X can be 1, 2, 3, etc.).
The changes should happen when autoX is followed by a code block {...}.
The value inside the code block and after autoX is modified with the addition of //good or //moderate, which needs to be read from the {...} itself.
The whole line should be removed from {...} if it contains the phrase remove.
HINT: It might be something that can use regex and the idea explained here, with this particular example.
For now, I have only been able to meet the last requirement, with the following code:
awk ' {$1=="{"; FS=="}";} {$1!="}"; gsub("remove",""); print NR"\t\t"$0}' Input_file
Thanks in advance, for your skill & time, to tackle this problem with awk.

Here is my attempt to solve this problem:
awk '
FNR==NR{
if($0~/auto[0-9]+/){
found1=1
val=$0
next
}
if(found1 && $0 ~ /{/){
found2=1
next
}
if(found1 && found2 && $0 ~ /type/){
sub(/;/,"",$NF)
a[val]=$NF
next
}
if($0 ~ /}/){
found1=found2=val=""
}
next
}
found3 && /not useful/{
next
}
/}/{
found3=val1=""
}
found3 && /type/{
sub($NF,$NF"//"a[val1])
}
/auto[0-9]+/ && $0 in a{
print $0"//"a[$0]
found3=1
val1=$0
next
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
FNR==NR{ ##FNR==NR will be TRUE when first time Input_file is being read.
if($0~/auto[0-9]+/){ ##Check condition if a line is having auto string followed by digits then do following.
found1=1 ##Setting found1 to 1 which makes sure that the line with auto is FOUND to later logic.
val=$0 ##Storing current line value to variable val here.
next ##next will skip all further statements from here.
}
if(found1 && $0 ~ /{/){ ##Checking condition if found1 is SET and line has { in it then do following.
found2=1 ##Setting found2 value as 1 which tells program further that after auto { is also found now.
next ##next will skip all further statements from here.
}
if(found1 && found2 && $0 ~ /type/){ ##Checking condition if found1 and found2 are SET AND line has type in it then do following.
sub(/;/,"",$NF) ##Substituting semi colon in last field with NULL.
a[val]=$NF ##Creating array a with index val whose value is the last column of the current line.
next ##next will skip all further statements from here.
}
if($0 ~ /}/){ ##Checking if line has } in it then do following, which basically means previous block is getting closed here.
found1=found2=val="" ##Nullify all variables value found1, found2 and val here.
}
next ##next will skip all further statements from here.
}
/}/{ ##Statements from here will be executed when 2nd time Input_file is being read, checking if line has } here.
found3=val1="" ##Nullifying found3 and val1 variables here.
}
found3 && /type/{ ##Checking if found3 is SET and line has type keyword in it then do following.
sub($NF,$NF"//"a[val1]) ##Substituting last field value with last field and array a value with index val1 here.
}
/auto[0-9]+/ && $0 in a{ ##Searching string auto with digits and checking if current line is present in array a then do following.
print $0"//"a[$0] ##Printing current line // and value of array a with index $0.
found3=1 ##Setting found3 value to 1 here.
val1=$0 ##Setting current line value to val1 here.
next ##next will skip all further statements from here.
}
1 ##1 will print all edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
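The FNR==NR two-pass idiom that drives this answer can be seen in isolation with a tiny sketch (the temp file and the //known suffix are made up for the demo):

```shell
tmp=$(mktemp)                       # throwaway demo file
printf 'auto1\nauto2\n' > "$tmp"
# Pass 1 (FNR==NR is true only for the first file): remember every line.
# Pass 2: act on what was remembered during pass 1.
awk 'FNR==NR { seen[$0]; next } $0 in seen { print $0 "//known" }' "$tmp" "$tmp"
rm -f "$tmp"
```

This prints each line with //known appended, because every line was collected on the first read of the same file.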

You can use two newlines as record separator and process each record which may contain one
autoX
{
...
...
}
block.
awk '
BEGIN{
RS="\n\n" # set record separator RS to two newlines
a["good"]; a["moderate"] # create array a with indices "good" and "moderate"
}
{
sub(/\n[ \t]+remove[^;]+;/, "") # remove line containing "remove xxx;"
for (i in a){ # loop array indices "good" and "moderate"
if (index($0, i)){ # if value exists in record
sub(i";", i";//"i) # add "//good" to "good;" or "//moderate" to "moderate;"
match($0, /(auto[0-9]+)/) # get pos. RSTART and length RLENGTH of "autoX"
if (RSTART){ # RSTART > 0 ?
# set prefix including "autox", "//value" and suffix
$0=substr($0, 1, RSTART+RLENGTH-1) "//"i substr($0, RSTART+RLENGTH)
}
break # stop looping (we already replaced "autoX")
}
}
printf "%s", (FNR==1 ? "" : RS)$0 # print modified record, prefixed by RS if not the first record
}
' Input_file
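As a quick sanity check of the record splitting alone (a sketch; note that a multi-character RS is a gawk/mawk extension, not POSIX):

```shell
# Three blank-line-separated blocks -> NR is 3 at the end
printf 'a\nb\n\nc\nd\n\ne\n' | awk 'BEGIN{RS="\n\n"} END{print NR}'
```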

Related

awk, skip current rule upon sanity check

How to skip current awk rule when its sanity check failed?
{
if (not_applicable) skip;
if (not_sanity_check2) skip;
if (not_sanity_check3) skip;
# the rest of the actions
}
IMHO, it's much cleaner to write code this way than,
{
if (!not_applicable) {
if (!not_sanity_check2) {
if (!not_sanity_check3) {
# the rest of the actions
}
}
}
}
1;
I need to skip the current rule because I have a catch all rule at the end.
UPDATE, the case I'm trying to solve.
There are multiple match points in a file that I want to match & alter; however, there's no other obvious sign for me to match what I want.
Hmmm... let me simplify it this way: I want to match & alter the first match, and skip the rest of the matches and print them as-is.
As far as I understood your requirement, you are looking for if, else if here. You could also use the switch statement available in newer versions of gawk.
Let's take an example of a Input_file here:
cat Input_file
9
29
Following is the awk code here:
awk -v var="10" '{if($0<var){print "Line " FNR " is less than var"} else if($0>var){print "Line " FNR " is greater than var"}}' Input_file
This will print as follows:
Line 1 is less than var
Line 2 is greater than var
So if you look at the code carefully, it's checking:
First condition: if the current line is less than var, the if block runs.
Second condition, in the else if block: if the current line is greater than var, print it there.
I'm really not sure what you're trying to do, but if I focus on just that last sentence in your question, "I want to match & alter the first match and skip the rest of the matches and print them as-is" ... is this what you're trying to do?
{ s=1 }
s && /abc/ { $0="uvw"; s=0 }
s && /def/ { $0="xyz"; s=0 }
{ print }
e.g. to borrow @Ravinder's example:
$ cat Input_file
9
29
$ awk -v var='10' '
{ s=1 }
s && ($0<var) { $0="Line " FNR " is less than var"; s=0 }
s && ($0>var) { $0="Line " FNR " is greater than var"; s=0 }
{ print }
' Input_file
Line 1 is less than var
Line 2 is greater than var
I used the boolean flag variable name s for "sane", since you mentioned in your question that the conditions tested are sanity checks; each condition can then be read as "is the input sane so far, and is this next condition true?".

awk to sum values among grouped lines after specific str match and header

I've got this program in awk:
BEGIN {
FS="[>;]"
OFS=";"
}
function p(a, i)
{
for(i in a)
print ">" i, "*nr=" ln
}
/^>/ {p(out);ln=0;split("",out);next}
/[*]/ {idx=$2 OFS $3; out[idx]}
{ln++}
END {
if (ln) p(out)
}
it works on a file like this:
>Cluster 300
0 151nt, >last238708;size=1... *
>Cluster 301
0 141nt, >last103379;size=1... at -/99.29%
1 151nt, >last104482;size=1... *
>Cluster 302
0 151nt, >last104505;size=1... *
>Cluster 303
0 119nt, >last325860;size=1... at +/99.16%
1 122nt, >last106751;size=1... at +/99.18%
2 151nt, >last284418;size=1... *
3 113nt, >last8067;size=3... at -/100.00%
4 122nt, >last8102;size=3... at -/100.00%
5 135nt, >last14200;size=2... at +/99.26%
>Cluster 304
0 151nt, >last285146;size=1... *
What I need is for the program to print, for each cluster, the id (lastxxxxxx) of the line with the asterisk, and to compute the sum of all the "size=" numbers. For example, for Cluster 303 it has to output this:
>last284418;nr=11
and for Cluster 304:
>last285146;nr=1
For the moment, my code is only able to count the lines and sum them, but it doesn't take into account the "size=" value.
Thanks for your help!
Could you please try the following, written and tested with the shown samples only in GNU awk.
awk '
/^>Cluster [0-9]+/{
if(sum){
print clus_line ORS val_line" = "sum
}
val_line=sum=clus_line=""
clus_line=$0
next
}
{
match($0,/size=[0-9]+/)
line=substr($0,RSTART,RLENGTH)
sub(/.*size=/,"",line)
sum+=line
}
/\*$/{
match($0,/>last[^;]*/)
val_line=substr($0,RSTART+1,RLENGTH-1)
}
END{
if(sum){
print clus_line ORS val_line" = "sum
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^>Cluster [0-9]+/{ ##Checking condition if line starts with >Cluster followed by digits then do following.
if(sum){ ##Checking if variable sum is NOT NULL then do following.
print clus_line ORS val_line" = "sum ##Printing values of clus_line ORS(new line) val_line space = space and sum here.
}
val_line=sum=clus_line="" ##Nullifying val_line, sum and clus_line here.
clus_line=$0 ##Assigning current line to clus_line here.
next ##next will skip all further statements from here.
}
{
match($0,/size=[0-9]+/) ##Using match function to match size= digits in line.
line=substr($0,RSTART,RLENGTH) ##Creating line which has sub-string for current line starts from RSTART till RLENGTH.
sub(/.*size=/,"",line) ##Substituting everything till size= keyword here with NULL in line variable.
sum+=line ##Keep on adding value of digits in line variable in sum here.
}
/\*$/{ ##Checking condition if a line ends with * then do following.
match($0,/>last[^;]*/) ##Using match function to match >last till semi-colon comes here.
val_line=substr($0,RSTART+1,RLENGTH-1) ##Creating val_line which has sub-string of current line from RSTART+1 till RLENGTH-1 here.
}
END{ ##Starting END block of this program from here.
if(sum){ ##Checking if variable sum is NOT NULL then do following.
print clus_line ORS val_line" = "sum ##Printing values of clus_line ORS(new line) val_line space = space and sum here.
}
}' Input_file ##Mentioning Input_file name here.
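The key extraction step (pulling the number after size= out of a line) can be isolated in a small sketch:

```shell
# match() sets RSTART/RLENGTH; "size=" is 5 characters, so skip past it
printf '0 119nt, >last325860;size=7... at +/99.16%%\n' |
awk '{ match($0, /size=[0-9]+/); print substr($0, RSTART+5, RLENGTH-5) }'
```

This avoids the intermediate sub() call in the main answer, at the cost of hard-coding the prefix length.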

In awk, Divide values in to array and count then compare

I have a csv file in which column-2 has certain values with the delimiter "," and some values in column-3 with the delimiter "|". Now I need to count the values in both columns and compare them. If both are equal, column-4 should print passed; if not, it should print failed. I have written the awk script below but am not getting what I expected.
cat /tmp/test.csv
awk -F '' 'BEGIN{ OFS=";"; print "sep=;\nresource;Required_packages;Installed_packages;Validation;"};
{
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";}'/tmp/test.csv
my csv file looks:
resource Required_Packages Installed_packages
--------------------------------------------------
Vm1 a,b,c,d a|b|c
vm2 a,b,c,d b|a
vm3 a,b,c,d c|b|a
my expected file:
resource Required_packages Installed_packages Validation
------------------------------------------------------------------
Vm1 a,b,c,d a|b|c Failed
vm2 a,b,c,d b|a Failed
vm3 a,b,c,d c|b|a|d Passed
Your code doesn't match the input/output data (where are the dashes printed, etc.), but
this code segment
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";
can be replaced with
print $1,$2,$3,(split($2,a,",")==split($3,a,"|")?"Passed":"Failed")
Also, just checking the counts may not be enough, I think you should be checking the matches as well.
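A sketch of that stricter check, comparing membership as well as counts (the sample rows here are made up for the demo):

```shell
printf 'vm1 a,b,c a|b|c\nvm2 a,b,c b|a\n' |
awk '{
  split("", seen)                          # clear the lookup set for each line
  n2 = split($2, req,  ",")                # required packages
  n3 = split($3, inst, "|")                # installed packages
  for (i = 1; i <= n2; i++) seen[req[i]]
  ok = (n2 == n3)                          # counts must match...
  for (j = 1; j <= n3; j++)
    if (!(inst[j] in seen)) ok = 0         # ...and every installed pkg must be required
  print $1, (ok ? "Passed" : "Failed")
}'
```

The split("", seen) trick is a portable way to empty an array (splitting an empty string first deletes all elements).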
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR<=2{
print
next
}
{
num=split($2,array1,",")
num1=split($3,array2,"|")
for(i=1;i<=num;i++){
value[array1[i]]
}
for(k=1;k<=num1;k++){
if(array2[k] in value){ count++ }
}
if(count==num){ $(NF+1)="Passed" }
else { $(NF+1)="Failed" }
count=num=num1=""
delete value
}
1
' Input_file | column -t
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR<=2{ ##Checking condition if line number is less than or equal to 2 then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
{
num=split($2,array1,",") ##Splitting 2nd field into array named array1 with field separator of comma and num will have total number of elements of array1 in it.
num1=split($3,array2,"|") ##Splitting 3rd field into array named array2 with field separator of | and num1 will have total number of elements of array2 in it.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num here.
value[array1[i]] ##Creating value which has key as value of array1 who has key as variable i in it.
}
for(k=1;k<=num1;k++){ ##Starting a for loop from from 1 to till value of num1 here.
if(array2[k] in value){ count++ } ##Checking condition if array2 with index k is present in value then increase variable of count here.
}
if(count==num){ $(NF+1)="Passed" } ##Checking condition if count equal to num then adding Passed to new last column of current line.
else { $(NF+1)="Failed" } ##Else adding Failed into new last field of current line.
count=num=num1="" ##Nullify variables count, num and num1 here.
delete value ##Deleting array value here, to reset it for the next line.
}
1 ##1 will print current line.
' Input_file | column -t ##Mentioning Input_file and passing its output to column command here.

Do not print if $previous_line matches $current_line.* [closed]

I would like to be able to use awk in place of a while loop to remove subdomains from an input string if it also contains the main domain.
Source file:
1234.f.dsfsd.test.com
abc.test.com
ad.sdk.kaffnet.com
amazon.co.uk
analytics.test.dailymail.co.uk
bbc.co.uk
bbc.test.com
dailymail.co.uk
kaffnet.com
sdk.kaffnet.com
sub.test.bbc.co.uk
t.dailymail.co.uk
test.amazon.co.uk
test.bbc.co.uk
test.com
test.dailymail.co.uk
Desired Output:
amazon.co.uk
bbc.co.uk
dailymail.co.uk
kaffnet.com
test.com
Solution: @EdMorton
Check the last part of a domain and see which string is the shortest one among them:
BEGIN{FS="."}
{
ind=$(NF-1) FS $NF;
if (!(ind in size) || (size[ind] > length)) {
size[ind]=length # check the minimum size for this domain
domain[ind]=$0 # store the string with the minimum size on this domain
}
}
END {for (ind in domain) print domain[ind]}
As a one-liner:
$ awk 'BEGIN{FS="."} {ind=$(NF-1) FS $NF; if (!(ind in size) || (size[ind] > length)) { size[ind]=length; domain[ind]=$0}} END {for (ind in domain) print domain[ind]}' file
test.com
bbc.co.uk
Previous approach, that works for top level domains:
Just make use of the field separator and set it to the dot. This way, it is just a matter of storing the penultimate and last one as a string and check how many different ones you find:
$ awk -F. '{a[$(NF-1) FS $NF]} END{for (i in a) print i}' file
test.com
How does this work? a[] is an array to which we keep adding indices. The index is defined as the penultimate field followed by a dot and the last field. This way, any new bla.test.com will still have the same index and not add extra info into the array.
With other inputs:
$ cat file
1234.f.dsfsd.test.com
abc.test.com
bbc.test.com
test.com
bla.com
another.bla.com
$ awk -F. '{a[$(NF-1) FS $NF]} END{for (i in a) print i}' file
test.com
bla.com
New answer based on new requirements and new sample input file:
$ cat tst.awk
{ doms[$0] }
END {
for (domA in doms) {
hasSubDom = 0
for (domB in doms) {
if ( index(domA,domB ".") == 1 ) {
hasSubDom = 1
}
}
if ( !hasSubDom ) {
print domA
}
}
}
$ rev file | awk -f tst.awk | rev
bbc.co.uk
dailymail.co.uk
amazon.co.uk
kaffnet.com
test.com
$ rev file | sort |
awk -F'.' 'index($0,prev FS)!=1{ print; prev=$1 FS $2 }' |
rev
bbc.co.uk
test.com
The above just implements the algorithm you described in your question. It reverses the chars on each line and then sorts the result, just like you were already doing. Then, if the previous line was foo.bar.stuff, prev is foo.bar, so if the current line is foo.bar.otherstuff, the call to index() WILL find that foo.bar. DOES occur at the start (index position 1) of the current line, and so we will NOT print that line, and prev will remain as-is. (Note the . at the end: adding that last . to the comparison is important so that foo.bar doesn't falsely match foo.barristers.wig.) If, on the other hand, the current line is my.sharona.song, then prev (foo.bar) DOES NOT occur at the start of that line, so that line IS printed and prev gets set to my.sharona. Finally, it just reverses the chars on each output line back to their original order.
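The trailing-dot detail can be checked on its own; index() does plain substring search (no regex), so the dots are literal:

```shell
awk 'BEGIN {
  print index("foo.bar.stuff",      "foo.bar.")   # 1: true prefix
  print index("foo.barristers.wig", "foo.bar.")   # 0: the dot blocks the false match
}'
```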
You can test a dynamic regex inside awk if you build a variable with the ~ operator
awk 'NR==1{a=$0} NR>1{if(length(a)>0){regex="^"a;if($0~regex){print a}}a=$0}'
Example (using tac and rev to facilitate the reversion)
The problem with your method is that you need at least 2 lines for the domain because you only display the previous line, but what if you did not have a previous line? Maybe that is not an issue for you if your domains always come with at least 2 lines.
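A quick sketch of the dynamic-regex comparison on two reversed sample lines (note the stored line is used as a regex, so its dots match any character):

```shell
# After rev, "test.com" and "abc.test.com" become "moc.tset" and "moc.tset.cba"
printf 'moc.tset\nmoc.tset.cba\n' |
awk 'NR==1{a=$0} NR>1{ regex = "^" a; if ($0 ~ regex) print a; a=$0 }'
```

The reversed domain prints because the following line starts with it, i.e. the original domain had a subdomain.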
For what it's worth, here is a version that works without requiring reversing and sorting the input.
awk -F. 'BEGIN {
SLDs = "co.uk,gov.uk,add.others" # general-use second-level domains we recognize
split(SLDs, slds, /,/);
for (i in slds) slds[slds[i]] = 1
}
/./ {
tld = $(NF-1) "." $(NF)
if (NF > 2 && tld in slds) tld = $(NF-2) "." tld
lines[NR] = $0
tlds[NR] = tld
if (tld == $0) existing_tlds[tld] = 1
}
END {
for (i = 1; i <= length(lines); i++) {
line = lines[i]; tld = tlds[i]
if (!(tld in existing_tlds) || tld == line) print(line)
}
}' input_file
This goes through the file and builds an array of existing TLDs. In the END block it prints a line only when it is a TLD itself or its TLD does not exist in said array.
When input_file is
1234.f.dsfsd.test.com
abc.test.com
amazon.co.uk
bbc.co.uk
bbc.test.com
sub.test.bbc.co.uk
test.amazon.co.uk
test.bbc.co.uk
test.com
it prints
amazon.co.uk
bbc.co.uk
test.com

AWK - Search for a pattern-add it as a variable-search for next line that isn't a variable & print it + variable

I have a given file:
application_1.pp
application_2.pp
#application_2_version => '1.0.0.1-r1',
application_2_version => '1.0.0.2-r3',
application_3.pp
#application_3_version => '2.0.0.1-r4',
application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp
#application_5_version => '3.0.0.1-r8',
application_5_version => '3.0.0.2-r9',
I would like to be able to read this file and search for the string
".pp"
When that string is found, it adds that line into a variable and stores it.
It then reads the next line of the file. If it encounters a line preceded by a # it ignores it and moves onto the next line.
If it comes across a line that does not contain ".pp" and doesn't start with #, it should print out that line next to the last stored variable in a new file.
The output would look like this:
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
I would like to achieve this with awk. If somebody knows how to do this and it is a simple solution i would be happy if they could share it with me. If it is more complex, it would be helpful to know what in awk I need to understand in order to know how to do this (arrays, variables, etc). Can it even be achieved with awk or is another tool necessary?
Thanks,
I'd say
awk '/\.pp/ { if(NR != 1) print line; line = $0; next } NF != 0 && substr($1, 1, 1) != "#" { line = line " " $0 } END { print line }' filename
This works as follows:
/\.pp/ { # if a line contains ".pp"
if(NR != 1) { # unless we just started
print line # print the last assembled line
}
line = $0 # and remember this new one
next # and we're done here.
}
NF != 0 && substr($1, 1, 1) != "#" { # otherwise, unless the line is empty
# or a comment
line = line " " $0 # append it (space-separated) to the line we're building
}
END { # in the end,
print line # print the last line.
}
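A quick run of the same idea on a trimmed sample (with an explicit space when appending, so the stored .pp line and the version line don't run together):

```shell
printf '%s\n' 'application_1.pp' 'application_2.pp' \
  "#application_2_version => '1.0.0.1-r1'," \
  "application_2_version => '1.0.0.2-r3'," |
awk '/\.pp/ { if (NR != 1) print line; line = $0; next }
     NF != 0 && substr($1, 1, 1) != "#" { line = line " " $0 }
     END { print line }'
```

The commented version line is dropped, and the live version line is joined onto its .pp line.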
You can use sed:
#n
/\.pp/{
h
:loop
n
/[^#]application.*version/{
H
g
s/\n[[:space:]]*/\t/
p
b
}
/\.pp/{
x
p
}
b loop
}
If you save this as s.sed and run
sed -f s.sed file
You will get this output
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
Explanation
The #n suppresses normal output.
Once we match the /\.pp/, we store that line into the hold space with h, and start the loop.
We go to the next line with n
If it matches /[^#]application.*version/, meaning it doesn't start with a #, then we append the line to the hold space with H, then copy the hold space to the pattern space with g, and substitute the newline and any subsequent whitespace for a tab. Finally we print with p, and skip to the end of the script with b
If it matches /\.pp/, then we swap the pattern and hold spaces with x, and print with p; the final b loop then jumps back to n to read the next line.