How to use awk or sed to replace specific word under certain profile - regex

A file contains data in the following format. Now I want to change the value of showfirst under the [XYZ] section. How can I achieve that with sed, awk, or grep?
I thought of using the line number or the second appearance, but neither will stay constant; in future the file can contain hundreds of such profiles, so the solution has to be based on the profile name.
I know that I can extract the first line after the 'XYZ' pattern, but I want it to be field based.
Thanks for the help.
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3

With sed:
sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' file
How it works:
/^\[XYZ\]/,/^showfirst *=/: an address range selecting the lines from [XYZ] up to the next line starting with showfirst.
// repeats the most recently used regex, so 0,// is a sub-range running from the start of input to the first line where that regex matches; here it ends on the [XYZ] line itself. Negating it with 0,//! therefore applies the command only to the remaining line of the outer range, i.e. the showfirst=10 line.
s/.*/showfirst=20/: replaces that entire line with showfirst=20.
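As a quick sanity check, the command can be run on the sample file from the question (GNU sed is assumed, since 0,// addressing is a GNU extension; the file name is illustrative):

```shell
# Recreate the sample file from the question (name is illustrative).
cat > sample.conf <<'EOF'
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3
EOF

# Only the showfirst line inside the [XYZ] section is rewritten.
sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' sample.conf
```

The [ABC] section is left untouched because the range only opens at [XYZ].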

You can use awk like this:
awk -v val='50' 'BEGIN{FS=OFS="="} /^\[.*\]$/{ flag = ($1 == "[XYZ]"?1:0) }
flag && $1 == "showfirst" { $2 = val } 1' file
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=50
showlast=3
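The section name is easy to parameterize as well; here is a minimal sketch of the same flag technique (the file name, section, and value are illustrative):

```shell
# Sample INI-style input (name is illustrative).
cat > sample.conf <<'EOF'
[ABC]
showfirst=0
[XYZ]
showfirst=10
EOF

# Pass the target section and new value as awk variables.
awk -v sec='ABC' -v val='99' 'BEGIN{FS=OFS="="}
  /^\[.*\]$/ { flag = ($0 == "[" sec "]") }
  flag && $1 == "showfirst" { $2 = val }
  1' sample.conf
```

Only the showfirst line inside [ABC] is updated; [XYZ] passes through unchanged.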

In awk, all parameterized:
$ awk -v l='XYZ' -v k='showfirst' -v v='666' ' # parameters: label, key, value
BEGIN { FS=OFS="=" }     # delimiters
/\[.*\]/ { f=0 }         # flag down: new label
$1=="[" l "]" { f=1 }    # flag up: matching label
f==1 && $1==k { $2=v }   # replace value when flag is up and key matches
1' file                  # print
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=666
showlast=3
Man, even I feel confused looking at that code.

If you set the record separator (RS) to the empty string, awk will read a whole record at a time, assuming the records are separated by blank lines.
So for example you can do something like this:
awk -v k=XYZ -v v=42 '$1 ~ "\\[" k "\\]" { sub("showfirst *=[^\n]*", "showfirst=" v) } 1' RS= ORS='\n\n' infile
Output:
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=42
showlast=3
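A minimal check of the paragraph-mode behavior on blank-line-separated blocks (the file name is illustrative):

```shell
# Two blocks separated by a blank line.
printf '[ABC]\nshowfirst=0\n\n[XYZ]\nshowfirst=10\n' > blocks.conf

# In paragraph mode each block is one record; sub() only fires in the [XYZ] block.
awk -v k=XYZ -v v=42 '$1 ~ "\\[" k "\\]" { sub("showfirst *=[^\n]*", "showfirst=" v) } 1' \
    RS= ORS='\n\n' blocks.conf
```

The [ABC] block is printed unchanged, and showfirst in the [XYZ] block becomes showfirst=42.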

With sed, the following command solved my problem for every profile:
sed '/.*\[ABC\]/,/^$/ s/.*showfirst.*/showfirst=20/' input.conf
The syntax goes as sed [address] command.
/.*\[ABC\]/,/^$/: generates an address range covering the region from the [ABC] line up to the next blank line. The search for the string is done in this particular range only.
s/.*showfirst.*/showfirst=20/: searches for any line containing showfirst in that range and replaces the entire line with showfirst=20.
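The blank-line-delimited range is easy to verify on a small sample (both GNU and BSD sed accept this form; the file name is illustrative):

```shell
# Two sections separated by a blank line.
cat > input.conf <<'EOF'
[ABC]
showfirst=0
showlast=10

[XYZ]
showfirst=10
EOF

# The substitution only applies between [ABC] and the first blank line.
sed '/.*\[ABC\]/,/^$/ s/.*showfirst.*/showfirst=20/' input.conf
```

The showfirst line under [XYZ] stays at 10 because it lies outside the range.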


Interval expressions in gawk to awk

I hope this is an easy fix.
I originally wrote a clean and simple script that used gawk; I went with it first and foremost because it was what I found when solving the original issue. I now need to adapt it to use only awk.
sample file.fasta:
>gene1
>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT
>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT
>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
what I know works is awk is the following:
awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
the culprit therefore is the interval expression {21,}
What I want it to do is match each line that contains at least 21 nucleotides to the left of my "GG" match.
Can anyone help?
Edit:
Thanks for all the help:
There are various solutions that worked. To reply to some of the comments, here is a more basic example of the initial input and the desired effect achieved...
Prior to awk command:
cat file1.fasta
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Following codes all produced the same desired output:
original code
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
slight modification that adds interval support to the original; for awk (gawk) version 3.x:
awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
Allows for modification of the count and produces correct output; untested, but should work with older versions of awk:
awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta
awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS} /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS }' file1.fasta
Desired output: only grab the names and sequences of genes that have at least 21 nucleotides prior to a GG match
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Lastly just to show the discarded lines
>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo
Hope this helps others!
EDIT: As per the OP's comment, the gene ids need to be printed too; in that case try the following.
awk '
/gene/{
id=$0
next
}
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print id ORS $0
}
id=""
}
' Input_file
OR one-liner form of above solution as per OP's request:
awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file
Could you please try the following; written and tested with the shown samples only.
awk '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print
}
}
' Input_file
OR a more generic approach, with a variable in which the user can set how many nucleotides must be present before GG.
awk -v usr_count="21" '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=usr_count){
print
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/.*GG/){ ##Using Match function to match everything till GG in current line.
val=substr($0,RSTART,RLENGTH-2) ##Storing sub-string of current line from RSTART till RLENGTH-2 into variable val here.
if(gsub(/[ACTG]/,"&",val)>=21){ ##Checking condition if global substitution of ACTG(with same value) is greater or equal to 21 then do following.
print ##Printing current line then.
}
}
' Input_file ##Mentioning Input_file name here.
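On a two-line toy file (one sequence long enough, one too short) the filter behaves as described; the file name and sequences are illustrative:

```shell
# First line has 21 nucleotides before its final GG; second has only 3.
printf 'ATGCCTTAACTTTCAATAACTGG\nATGGG\n' > seqs.txt

# Count the nucleotides preceding the last GG and keep lines with >= 21.
awk 'match($0, /.*GG/){
  val = substr($0, RSTART, RLENGTH-2)
  if (gsub(/[ACTG]/, "&", val) >= 21) print
}' seqs.txt
```

Only the first line survives; ATGGG is matched by /.*GG/ but has just 3 nucleotides before the GG.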
GNU awk has accepted interval expressions in regular expressions since version 3.0. However, only from version 4.0 are interval expressions enabled by default. If you have awk 3.x, you have to use the flag --re-interval to enable them.
awk --re-interval '/a{3,6}/{print}' file
There is an issue that people often overlook with FASTA files and awk: when you have multi-line sequences, it is possible that your match spans multiple lines. To handle this you need to combine the sequence lines first.
The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that for correct processing the sequence should be stored as one continuous string, without any newlines or whitespace. A generic awk for processing FASTA looks like this:
awk '/^>/ && seq { **process_sequence_here** }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { **process_sequence_here** }' file.fasta
In the presented case, your sequence processing looks like:
awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS}
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS }' file.fasta
Sounds like what you want is:
awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file
but this is probably all you need given the sample input you provided:
awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
They'll both work in any awk.
Using your updated sample input:
$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

Linux Replace With Variable Containing Double Quotes

I have read the following:
How Do I Use Variables In A Sed Command
How can I use variables when doing a sed?
Sed replace variable in double quotes
I have learned that I can use sed "s/STRING/$var1/g" to replace a string with the contents of a variable. However, I'm having a hard time finding out how to replace with a variable that contains double quotes, brackets, and exclamation marks.
Then, hoping to escape the quotes, I tried piping my input through sed 's/\"/\\\"/g', which gave me another error: sed: -e expression #1, char 7: unknown command: `E'. I was hoping to escape the problematic characters and then do the variable replacement with sed "s/STRING/$var1/g", but I couldn't get that far either.
I figured you guys might know a better way to replace a string with a variable that contains quotes.
File1.txt:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
Variable:
var1=$(cat file1.txt)
Example:
echo "STRING" | sed "s/STRING/$var1/g"
Desired output:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
using awk
$ echo "STRING" | awk -v var="$var1" '{ gsub(/STRING/,var,$0); print $0}'
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
-v var="$var1": passes the shell variable into awk.
gsub(/STRING/,var,$0): globally substitutes all occurrences of "STRING" in the whole record $0 with var.
Special case: if your var has & in it, say at the beginning of the line, it will create problems with gsub, as & has a special meaning in the replacement and refers to the matched text instead.
To deal with this situation we have to escape & as follows:
$ echo "STRING" | awk -v var="$var1" '{ gsub(/&/,"\\\\&",var); gsub(/STRING/,var,$0);print $0}'
&Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
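The effect of the unescaped & is easy to reproduce with a small illustrative replacement string:

```shell
var1='Fish & Chips'

# Unescaped: '&' in the replacement expands to the matched text itself.
printf 'STRING\n' | awk -v var="$var1" '{ gsub(/STRING/, var); print }'
# prints: Fish STRING Chips

# Escaping '&' in var first keeps it literal.
printf 'STRING\n' | awk -v var="$var1" '{ gsub(/&/, "\\\\&", var); gsub(/STRING/, var); print }'
# prints: Fish & Chips
```

The first gsub turns each & in var into \&, which the second gsub's replacement treats as a literal ampersand.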
The problem isn't the quotes. You're missing the "s" command, leading sed to treat /STRING/ as a line address, and the value of $var1 as a command to execute on matching lines. Also, $var1 has unescaped newlines and a / character that'll cause trouble in the substitution. So add the "s", and escape the relevant characters in $var1:
var1escaped="$(echo "$var1" | sed 's#[\/&]#\\&#; $ !s/$/\\/')"
echo "STRING" | sed "s/STRING/$var1escaped/"
...but realistically, @batMan's answer (using awk) is probably a better solution.
Here is one awk command that gets the text-to-be-replaced from a file that may contain all kinds of special characters such as & or \ etc.:
awk -v pat="STRING" 'ARGV[1] == FILENAME {
# read replacement text from first file in arguments
a = (a == "" ? "" : a RS) $0
next
}
{
# now run a loop using index function and use substr to get the replacements
s = ""
while( p = index($0, pat) ) {
s = s substr($0, 1, p-1) a
$0 = substr($0, p+length(pat))
}
$0 = s $0
} 1' File1.txt <(echo "STRING")
To handle all kinds of special characters properly, this command avoids any regex-based functions and uses plain text-based functions such as index and substr instead.
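A quick run with a replacement payload full of regex metacharacters shows why the index/substr route is safe (the file names are illustrative):

```shell
# Replacement text containing &, a backslash, and brackets.
printf 'a&\\[x]\n' > repl.txt
printf 'before STRING after\n' > target.txt

# First file supplies the replacement text; the second is the text to edit.
awk -v pat="STRING" 'ARGV[1] == FILENAME {
  a = (a == "" ? "" : a RS) $0
  next
}
{
  s = ""
  while (p = index($0, pat)) {
    s = s substr($0, 1, p-1) a
    $0 = substr($0, p + length(pat))
  }
  $0 = s $0
} 1' repl.txt target.txt
```

Every special character in repl.txt comes through verbatim, since no regex engine ever sees it.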

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed, or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these, where the word I want can be named anything:
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3
call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3
call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
Also I need the filename and line information for each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy, and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named like this and can be hard-coded within my expression.
Last word of code blocks
Split file in blocks using awk's record separator RS. A record will be defined as a block of text, records are separated by double newlines.
A record consists of fields, each two consecutive fields are separated by white space or a single newline.
Now all we have to do is print the last field for each record, resulting in following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS: the field separator, set to either a newline, a tab or a space: [\n\t ].
RS: the record separator, set to a double newline: \n\n.
print $NF: prints the field with index NF, a variable containing the number of fields; hence this prints the last field.
Note: to capture the final paragraph the file should end in a double newline, which can easily be achieved by preprocessing the file with: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant and simpler solution is as follows:
awk -v RS='' '{ print $NF }' file
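A two-paragraph toy file confirms the behavior (the sample text is illustrative):

```shell
# Two paragraphs separated by a blank line.
printf 'call f(x)\nreturning a,\n b\n\ncall f(y)\nreturning c\n' > blocks.txt

# One record per paragraph; $NF is its last whitespace-delimited word.
awk -v RS='' '{ print $NF }' blocks.txt
```

This prints the last word of each paragraph, one per line.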
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
$NF gets the value of the last "word", where NF stands for the number of fields. The last variable always stores the last word of a line, and it is printed when an empty line is encountered, representing the end of a paragraph.
New version which also checks the function1 condition:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.

Regex match as many of strings as possible

I don't know if this is possible or makes sense, but what I'm trying to do is grep or awk a file, matching multiple strings, but only showing the line that matches the most of those strings.
So I would have a file like:
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
I want to match a single line that includes the most of these strings: cat|dog|table|key|wall. Not necessarily having to include all of them, but whatever line matches the most, print it.
So for example, I would want it to display this output:
cat,dog,key
Since it is the line that includes most of the strings that are being searched for.
I've tried using:
cat filename \
|egrep -iE 'cat' \
|egrep -iE 'dog' \
|egrep -iE 'table' \
|egrep -iE 'key' \
|egrep -iE 'wall'
But it will only display lines that show ALL strings, I have also tried:
egrep -iE 'cat|dog|table|key|wall' filename
But that shows any line that matches any one of those strings.
Is regex possible of doing something like this?
Use awk, and increment a counter for each word that matches. If the counter is higher than the highest count, save this line.
awk 'BEGIN {max = 0}
{ count=0;
if (/\bcat\b/) count++;
if (/\bdog\b/) count++;
...
if (count > max) { saved = $0; max = count; }
}
END { print saved; }'
$ awk -F, -v r='^(cat|dog|table|key|wall)$' '{c=0;for (i=1;i<=NF;i++)if ($i~r)c++; if (c>max){max=c;most=$0}} END{print most}' file
cat,dog,key
How it works
-F,
This sets the field separator to a comma.
-v r='^(cat|dog|table|key|wall)$'
This sets the variable r to a regex matching your words of interest. The regex begins with ^ and ends with $. This assures that only whole words are matched.
c=0;for (i=1;i<=NF;i++)if ($i~r)c++
This sets the variable c to the number of matches on the current line.
if (c>max){max=c;most=$0}
If the number of matches on the current line, c, exceeds the previous maximum, max, then update max and set most to the current line.
END{print most}
When we are done reading the file, print the line with the most matches.
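Running it on the sample data from the question reproduces the expected line (the file name is illustrative):

```shell
cat > words.txt <<'EOF'
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
EOF

# Count whole-field matches per line; remember the best line seen so far.
awk -F, -v r='^(cat|dog|table|key|wall)$' '
  { c = 0; for (i = 1; i <= NF; i++) if ($i ~ r) c++
    if (c > max) { max = c; most = $0 } }
  END { print most }' words.txt
# prints: cat,dog,key
```

On ties, the first line with the maximum count wins, since only c > max updates the saved line.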
To make the problem more interesting I created two input files:
InFile1 ...
cat|dog|table|key|wall
InFile2 ...
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,dog,banana
Note that InFile2 differs from the original post
in that it contains two lines each with three matches.
Hence, there is a "tie" for first place and both are
reported.
This code ...
awk -F, '{if (NR==FNR) r=$0; else {count=0
for (j=1;j<=NF;j++) if ($j ~ r) count++
a[FNR]=count" matching words in "$0
if (max<count) max=count}}
END{for (j=1;j<=FNR;j++) if (1==index(a[j],max)) print a[j]}' \
$InFile1 $InFile2 >$OutFile
... produced this OutFile ...
3 matching words in cat,dog,key
3 matching words in table,key,dog,banana
Daniel B. Martin

how to replace the next string after match (every) two blank lines?

Is there a way to do this kind of substitution in awk, sed, ...?
I have a text file with sections divided by two blank lines:
section1_name_x
dklfjsdklfjsldfjsl


section2_name_x
dlskfjsdklfjsldkjflkj


section_name_X
dfsdjfksdfsdf
I would like to replace every "section_name_x" with "#section_name_x"; that is, how can I replace the next string after (every) two blank lines?
Thanks,
Steve,
awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl


#section2_name_x
dlskfjsdklfjsldkjflkj


#section_name_X
dfsdjfksdfsdf
hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#&/' file > newFile && mv newFile file
(The & in the replacement re-inserts the matched text, so the line is prefixed with # rather than replaced outright.) Some seds support sed -i or sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
The regex says section must be at the beginning of the line, optionally followed by a number or no number at all.
IHTH
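A short check of the prefix rewrite, using & in the replacement to re-insert the matched line (the file name is illustrative):

```shell
# Two sections separated by two blank lines.
printf 'section1_name_x\ndata\n\n\nsection2_name_x\nmore\n' > sections.txt

# '&' in the replacement stands for the whole matched line.
sed 's/^section[0-9]*_name.*/#&/' sections.txt
```

Each section header gains a leading # while the data lines pass through untouched.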
In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to @EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuilt, which is not good in this case, since any sequence of whitespace between fields is replaced by a single space, and by the null string at the beginning and end of the record.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file