How to print from word A to word B? (awk) - regex

I need to print from the word 'update' to ';'.
file.txt:
-- Host (first) kkk (queen1)
-- prince princess#/king 1/1
update schema.table_name t set "A=123","B=234" where "C=222" and "D=333"
and "F=2342";
-- Host (first) ddd (queen2)
-- prince princess#/king 2/2
update schema2.table_name2 t set "A=123","B=234" where "C=222" and "D=333"
and "F=2342";
With the awk script below I can select the block to parse, but I'm not sure how to print the statement from update up to the semicolon (;).
file.awk:
BEGIN {
}
/-- Host/,/;/ {
if (/-- Host/) printf "%s#%s#",$3,$5;
if (/update /) printf ??????????????;
}
END {
}
This is how I execute it:
awk -f file.awk -F'[ ()]+' file.txt
Can you let me know of any idea?

I guess the problem is that the update statement is wrapped over multiple lines. This one-liner joins the lines from update up to the line ending in a semicolon; you may have to adjust it a little to fit into your script.
awk 'p||/^update/{p=1;printf "%s",$0}/;$/&&p{p=0;print ""}' file
with your file as input, it outputs:
update schema.table_name t set "A=123","B=234" where "C=222" and "D=333"and "F=2342";
update schema2.table_name2 t set "A=123","B=234" where "C=222" and "D=333"and "F=2342";

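The idiomatic awk way of doing this is a range pattern, which prints every line from one matching /update/ through the next one matching /;/: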
awk '/update/,/;/' file.txt
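If the goal is the host#name#statement line that the question's script was heading towards, the two ideas can be combined. This is only a minimal sketch, assuming each statement starts on a line beginning with "update " and ends on a line ending in ";". The function name join_updates and the variables prefix, stmt and collecting are made up for illustration:

```shell
# Hypothetical helper: print "host#name#statement" for each block,
# joining a statement that was wrapped over several lines.
join_updates() {
  awk -F'[ ()]+' '
    /^-- Host/ { prefix = $3 "#" $5 "#" }       # e.g. "first#queen1#"
    /^update / { stmt = ""; collecting = 1 }    # a new statement starts here
    collecting {
      stmt = (stmt == "" ? $0 : stmt " " $0)    # re-join wrapped lines with a space
      if (/;$/) { print prefix stmt; collecting = 0 }
    }
  ' "$@"
}
```

It would be called as join_updates file.txt; with no argument it reads standard input.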

Related

Interval expressions in gawk to awk

I hope this is an easy fix.
I originally wrote a short script that uses gawk, because that is what I found when I was solving the original issue. I now need to adapt it to use plain awk only.
sample file.fasta:
>gene1
>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT
>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT
>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
What I know works in awk is the following:
awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
The culprit is therefore the interval expression {21,}.
What I want is to match each line that contains at least 21 nucleotides to the left of my "GG" match.
Can anyone help?
Edit:
Thanks for all the help:
Various solutions worked. To reply to some of the comments, here is a more basic example of the initial input and the desired output.
Prior to awk command:
cat file1.fasta
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Following codes all produced the same desired output:
original code
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
A slight modification of the original that enables interval expressions in gawk 3.x.x:
awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
Lets the required count be set through usr_count and gives the correct output; untested, but should work with older versions of awk:
awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta
awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS} /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file1.fasta
Desired output: only the gene names and sequences that have at least 21 nucleotides before a matching GG
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Lastly just to show the discarded lines
>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo
Hope this helps others!
EDIT: As per the OP's comment, to print the gene IDs as well, try the following.
awk '
/gene/{
id=$0
next
}
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print id ORS $0
}
id=""
}
' Input_file
OR one-liner form of above solution as per OP's request:
awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file
Could you please try the following; it was written and tested with the shown samples only.
awk '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print
}
}
' Input_file
OR a more generic approach, with a variable in which the user sets how many nucleotides must be present before GG.
awk -v usr_count="21" '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=usr_count){
print
}
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Start of the awk program.
match($0,/.*GG/){ ##Use match() to match everything up to the last GG in the current line.
val=substr($0,RSTART,RLENGTH-2) ##Store the matched text without the trailing GG in the variable val.
if(gsub(/[ACTG]/,"&",val)>=21){ ##gsub() returns the number of substitutions, so replacing each A, C, T or G with itself counts them; continue if there are at least 21.
print ##Print the current line.
}
}
' Input_file ##Name of the input file.
GNU awk has accepted interval expressions in regular expressions since version 3.0. However, they are enabled by default only from version 4.0 onwards. With gawk 3.x.x you have to pass the flag --re-interval to enable them.
awk --re-interval '/a{3,6}/{print}' file
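Where neither a recent gawk nor --re-interval is available, the interval can be expanded by hand: repeating the bracket expression 21 times gives a regex that matches the same lines as [ACTG]{21,}GG. A minimal sketch, assuming one sequence per line as in the sample; filter_pre_gg and min are illustrative names, not part of any tool:

```shell
# Hypothetical helper. Usage: filter_pre_gg FILE [MIN]
filter_pre_gg() {
  awk -v min="${2:-21}" '
    BEGIN {
      # Build "[ACTG][ACTG]...GG" with min copies of the bracket expression.
      for (i = 0; i < min; i++) re = re "[ACTG]"
      re = re "GG"
    }
    /^>/    { id = $0; next }      # remember the header line
    $0 ~ re { print id; print }    # dynamic regex match, no interval needed
  ' "$1"
}
```

For example, filter_pre_gg file1.fasta uses the default of 21, and filter_pre_gg file1.fasta 30 asks for at least 30 nucleotides before GG.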
There is an issue that people often overlook when using awk on FASTA files. When sequences span multiple lines, a match may cover more than one line, so you need to join each sequence first.
The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time a full sequence has been read, you can process it. Note that, for best results, the sequence should be stored as one continuous string without any newlines or whitespace. A generic awk program for processing FASTA looks like this:
awk '/^>/ && seq { **process_sequence_here** }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { **process_sequence_here** }' file.fasta
In the presented case, your sequence processing looks like:
awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS}
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file.fasta
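The two copies of the match logic above can be folded into one awk function, and the interval can be spelled out by repetition so that the program also runs in awks without interval support. A sketch under the assumption that header lines start with ">" and every other line is sequence data; filter_fasta, report and min are names invented here:

```shell
# Hypothetical helper. Usage: filter_fasta FILE [MIN]
filter_fasta() {
  awk -v min="${2:-21}" '
    function report() {               # called once per complete record
      if (seq ~ re) print name ORS seq
    }
    BEGIN {                           # "[ACTG]" repeated min times, then "GG"
      for (i = 0; i < min; i++) re = re "[ACTG]"
      re = re "GG"
    }
    /^>/ { if (seq != "") report(); name = $0; seq = ""; next }
    { seq = seq $0 }                  # join wrapped sequence lines
    END { if (seq != "") report() }
  ' "$1"
}
```

Because the lines are joined before matching, a sequence whose GG falls right after a line break is still found.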
Sounds like what you want is:
awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file
but this is probably all you need given the sample input you provided:
awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
They'll both work in any awk.
Using your updated sample input:
$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

Is there a way to use regex with awk to execute a command only when the pattern is matched?

I'm trying to write a bash script that gets user input, checks a .txt for the line that contains that input then plugs that into a wget statement to commence a download.
In testing the functionality awk seems to print out every line, not just pattern matched lines.
chosen=DSC01985
awk -v c="$chosen" 'BEGIN {FS="/"; /c/}
{print $8, "found", c}
END{print " done"}' ./imgLink.txt
The above should read imgLink.txt, search for the pattern and report that it was found. Instead it prints the 8th field of every line in the file.
I have tried moving /c/ out of the BEGIN block, but to no avail.
What's going on here?
Example input:
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01533.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01536.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01543.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01558.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01565.jpg
etc.
Example output:
...
DSC02028.jpg found DSC01985
DSC02030.jpg found DSC01985
DSC02032.jpg found DSC01985
DSC02038.jpg found DSC01985
DSC02042.jpg found DSC01985
etc.
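You were close. You can't search for the value of an awk variable with /c/: that is a regex constant matching the literal letter c, and inside BEGIN no input line has been read yet, so it has no effect. The main block then has no pattern and runs for every line. Use a dynamic regex match ($0 ~ c) instead. Please try the following, assuming the string you are looking for appears in the URLs that you have replaced with xxxx in your post.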
awk -v c="$chosen" -F'/' '$0 ~ c{print $NF " found " c}' Input_file
I'm not sure why you print done in your END block; you can add it back if you need it. Also, $NF is the last field of the current line, so you can print whatever you need instead.
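One caveat with $0 ~ c: the variable is used as a regular expression, so characters such as . or + in the user's input keep their regex meaning. If the input should be matched literally, index() is an alternative. A small sketch, with find_image and hit as illustrative names:

```shell
# Hypothetical helper. Usage: find_image NAME FILE
find_image() {
  awk -v c="$1" -F'/' '
    index($NF, c) { print $NF " found " c; hit = 1 }   # literal substring test on the last field
    END { exit !hit }                                  # non-zero exit status when nothing matched
  ' "$2"
}
```

The exit status lets the calling script decide whether to go on to the download, for example: find_image "$chosen" ./imgLink.txt || echo "not found".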

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename: everything before the date and suffix. I am not sure of the best way to do it in a Bash script. Regex?
The names are part of the filename. I am trying to store the result in a shell variable. The prefixes will not contain strange characters. The suffix will always be the same. The files are stored in a directory, and I will use a loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
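For reference, a small sketch of how the four operators behave on the sample name: % and %% trim from the end, # and ## from the start, and the doubled forms remove the longest match. The variable names file and prefix are only illustrative:

```shell
file=EXAMPLE_FILE_2_2017-10-12.out

echo "${file%_*}"     # shortest suffix matching _* removed -> EXAMPLE_FILE_2
echo "${file%%_*}"    # longest suffix matching _* removed  -> EXAMPLE
echo "${file#*_}"     # shortest prefix matching *_ removed -> FILE_2_2017-10-12.out
echo "${file##*_}"    # longest prefix matching *_ removed  -> 2017-10-12.out

# Stricter variant: only strip when the name really ends in _YYYY-MM-DD.out
prefix=${file%_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].out}
echo "$prefix"        # -> EXAMPLE_FILE_2
```

If the pattern does not match, the expansion returns the name unchanged, so comparing the result with the original is a cheap way to detect files that do not follow the naming scheme.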
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character ([a-zA-Z0-9_]) one or more times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
Could you please try an awk solution too, which handles all the .out files. Note that this was written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of GNU awk you do not need it.
Explanation and non-one-liner form of the solution:
awk --re-interval '##--re-interval enables interval expressions in my old gawk; with a recent version it can be dropped.
FNR==1{ ##When the first line of each input file is read, do the following.
if(val){ ##If val is set, i.e. a previous file has already been processed,
close(val) ##close that file, so that we do not run into a "too many open files" problem.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Split FILENAME (the name of the current input file) into array, using underscore, 4 digits, dash, 2 digits, dash, 2 digits (the _YYYY-MM-DD part) as the separator.
print array[1]; ##Print the first element, which is the part of the file name before the date.
val=FILENAME; ##Remember the current file name in val so that it can be closed later.
nextfile ##Skip the remaining lines of the current file and move on to the next file, to save some CPU cycles.
}
' *.out ##All *.out files are passed as input.

Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed in which I can recognize all of the text in between two indicators and then replace it with a place holder.
For instance, the 1st indicator is a list of words
(no|noone|haven't)
and the 2nd indicator is a list of punctuation
(.|,|!)
From an input text such as
"Noone understands the plot. There is no storyline. I haven't
recommended this movie to my friends! Did you understand it?"
The desired result would be:
"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I
haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX
friends_AFFIX! Did you understand it?"
I know that there is the following sed:
sed -n '/WORD1/,/WORD2/p' /path/to/file
which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
I have also considered using awk, such as
awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile
yet still, it does not allow me to append the affix.
Does anyone have a suggestion to do so, either with awk or sed?
A slightly more compact awk (gensub() requires GNU awk):
$ awk 'BEGIN{RS=ORS=" ";s="_AFFIX"}
/[.,!]$/{f=0; $0=gensub(/(.)$/,s"\\1","g")}
f{$0=$0s}
/Noone|no|haven'\''t/{f=1}1' story
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
Perl to the rescue!
perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
join " ", map "${_}_AFFIX", split " ", $1/egi
' infile > outfile
\K matches what's on its left, but excludes it from the replacement. In this case, it verifies the 1st indicator. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
Here is a more verbose awk command for the same (GNU awk, because of IGNORECASE and \y):
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
a=0
for (i=2; i<=NF; i++) {
if ($(i-1) ~ "\\y(" kw ")\\y")
a=1
if (a && $i ~ "(" pct ")$") {
p = substr($i, length($i), 1)
$i = substr($i, 1, length($i)-1)
}
if (a)
$i=$i "_AFFIX" p
if(p) {
p=""
a=0
}
}
} 1' <<< "$s"
Output:
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
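For awks without gensub(), \y or IGNORECASE, the same idea can be written with tolower() and plain string comparisons. This is only a sketch, assuming words are separated by whitespace and the closing punctuation is attached to the last word of a clause; affix_negated, active, word, punct and key are invented names:

```shell
# Hypothetical helper: reads text on standard input.
affix_negated() {
  awk '
  {
    active = 0
    for (i = 1; i <= NF; i++) {
      word = $i
      punct = ""
      if (word ~ /[.,!]$/) {                     # split off closing punctuation
        punct = substr(word, length(word))
        word  = substr(word, 1, length(word) - 1)
      }
      key = tolower(word)                        # case-insensitive comparison
      if (active) $i = word "_AFFIX" punct       # inside a marked span
      if (punct != "") active = 0                # punctuation closes the span
      else if (key == "no" || key == "noone" || key == "haven'\''t") active = 1
    }
    print
  }'
}
```

Comparing whole words avoids accidental hits on words that merely contain "no". Note that assigning to a field makes awk rebuild the line with single spaces between words.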

Awk print if no match

I am using the following statement in awk with text piped to it from another command:
awk 'match($0,/(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)/) && NR<11 {print substr($0,RSTART,RLENGTH)}'
which is almost working for what I need (find one of the words in the regex within the first 10 lines of the input and print that word). The main thing I need to do is to output something if there is no match. For instance, if none of those words are found in the first ten lines it would output UNKNOWN.
I also need to limit the output to the first match, as I need to ensure a single line of output per input file. I can do this with head or ask another question if needs be, I only include it here in case it affects how to output the no-match text.
I am also not tied to awk as a tool - if there is a simpler way to do this with sed or something else I am open to it.
You just need to exit at the first match, or on line 11 if no match
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print substr($0,RSTART,RLENGTH)
exit
}
NR == 11 {print "UNKNOWN"; exit}
'
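The version above prints nothing when the input has fewer than 11 lines and no match, and a match on line 11 itself would still be reported. A possible variant that covers both cases moves the fallback into an END block. It is a sketch only: doc_type and found are illustrative names, and the word list is the one from the question:

```shell
# Hypothetical helper: reads the document text on standard input.
doc_type() {
  awk '
    NR > 10 { exit }                 # only the first 10 lines count
    match($0, /QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT/) {
      print substr($0, RSTART, RLENGTH)
      found = 1
      exit                           # first match only
    }
    END { if (!found) print "UNKNOWN" }   # END still runs after exit
  '
}
```

It would be used at the end of the existing pipeline, for example: some_command | doc_type, where some_command stands for whatever produces the text. Either way exactly one line is printed per input.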
I like glenn jackman's answer, however, if you wish to print matches for all 10 lines then you can try something like this:
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print NR " ---> " substr($0,RSTART,RLENGTH)
flag=1
}
flag==0 && NR==11 {
print "UNKNOWN"
exit
}'
You can do this:
( head -n 10 | grep -Eo '(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)' || echo "UNKNOWN" ) | head -n 1