I have a file containing lines that look like
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
My goal is to find lines containing n occurences of the pattern "$long".
Anyone knowing the grep regex for this match?
You don't need a regex for this. With awk you can use $long as field separator and check how many fields each line has:
awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' file
Test
$ awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' a
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v count=4 'BEGIN {FS="\\$long"} NF==(count+1)' a
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
$ awk -v count=5 'BEGIN {FS="\\$long"} NF==(count+1)' a
$
awk solution by Fedorqui should work fine. You can also use grep for this:
grep -E '(.*\$long){4}' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
but if $long can occur in contexts other than as a field of it's own, e.g.:
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
and you only want to count it when it's in a field of it's own then you'll need something more like:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
e.g.:
$ cat file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Wrong:
$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Right:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
$ awk -F, -v n=4 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like that:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?
Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The stuff that might be non-obvious to newcomers but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of it's default one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of it's default newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
First sed command turns output from someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
Second sed command reads it as input expression (-f - reads sed commands from file -, i.e. gets it from pipe) and applies to file file1.txt.
Remark for other readers:
OP wants to use sed, but as notified in comments, it may not be the easiest way to solve this question. Deleting lines with awk could be simpler. Another (easy) solution could be to use grep with -v (invert match) and -f (read patterns from files) options, in this way:
someawkcommand | grep -v -f - file1.txt
Edit: Following #rici's comments, here is a new command that takes output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't do it home. Grown-ups are strongly encouraged to consider avoiding sed for that.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it will add an empty line instead of removed pattern. Can't fix it easily (problems if pattern is at beginning/end of file). Add a substitution to remove it if you really feel like it.
This is can more easily be done in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0; next}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt
This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does a gobal substitution(s) using the contents of file1. Finally it removes any blank lines at the end file caused by the contents deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2
Suppose I have this text
The code for 233-CO is the main reason for 45-DFG and this 45-GH
Now I have this regexp \s[0-9]+-\w+ which matches 233-CO, 45-DFG and 45-GH.
How can I display just the third match 45-GH?
sed -re 's/\s[0-9]+-\w+/\3/g' file.txt
where \3 should be the third regexp match.
Is it mandatory to use sed? You could do it with grep, using arrays:
text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
matches=( $(echo "$text" | grep -o -m 3 '\s[0-9]\+-\w\+') ) # store first 3 matches in array
echo "${matches[0]} ${matches[2]}" # prompt first and third match
To find the last occurence of your pattern, you can use this:
$ sed -re 's/.*\s([0-9]+-\w+).*/\1/g' file
45-GH
if awk is accepted, there is an awk onliner, you give the No# of match you want to grab, it gives your the matched str.
awk -vn=$n '{l=$0;for(i=1;i<n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' file
test
kent$ echo $STR #so we have 7 matches in str
The code for 233-CO is the main reason for 45-DFG and this 45-GH,foo 004-AB, bar 005-CC baz 006-DDD and 007-AWK
kent$ n=6 #now I want the 6th match
#here you go:
kent$ awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' <<< $STR
006-DDD
This might work for you (GNU sed):
sed -r 's/\b[0-9]+-[A-Z]+\b/\n&\n/3;s/.*\n(.*)\n.*/\1/' file
s/\b[0-9]+-[A-Z]+\b/\n&\n/3 prepend and append \n (newlines) to the third (n) pattern in question.
s/.*\n(.*)\n.*/\1/ delete the text before and after the pattern
With grep for matching and sed for printing the occurrence:
$ egrep -o '\b[0-9]+-\w+' file | sed -n '1p'
233-CO
$ egrep -o '\b[0-9]+-\w+' file | sed -n '2p'
45-DFG
$ egrep -o '\b[0-9]+-\w+' file | sed -n '3p'
45-GH
Or with a little awk passing the occurrence to print using the variable o:
$ awk -v o=1 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
233-CO
$ awk -v o=2 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-DFG
$ awk -v o=3 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-GH