Finding lines that contain n occurrences of a certain pattern - regex

I have a file containing lines that look like
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
My goal is to find lines containing n occurrences of the pattern "$long".
Does anyone know a grep regex for this match?

You don't need a regex for this. With awk you can use $long as field separator and check how many fields each line has:
awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' file
Test
$ awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' a
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v count=4 'BEGIN {FS="\\$long"} NF==(count+1)' a
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
$ awk -v count=5 'BEGIN {FS="\\$long"} NF==(count+1)' a
$

The awk solution by Fedorqui should work fine. You can also use grep for this, though note that the pattern matches lines with at least 4 occurrences, not exactly 4:
grep -E '(.*\$long){4}' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
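Because the grep pattern above is unanchored, it also matches lines with more than four occurrences. One possible way to get exactly n with grep alone is to keep the lines with at least n occurrences and then drop those with at least n+1. A minimal sketch; the function name exactly_n and the sample file are made up for illustration:

```shell
#!/bin/sh
# exactly_n N FILE: print lines of FILE with exactly N occurrences of the
# literal string $long. (Hypothetical helper name, not from the answers.)
exactly_n() {
    n=$1
    file=$2
    # Keep lines with at least N occurrences, then drop those with at least N+1.
    grep -E "(.*\\\$long){$n}" "$file" | grep -vE "(.*\\\$long){$((n + 1))}"
}

# Sample data: the quoted 'EOF' keeps the shell from expanding $long.
sample=$(mktemp)
cat > "$sample" <<'EOF'
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
EOF

exactly_n 3 "$sample"   # prints only the line starting with A,B
exactly_n 4 "$sample"   # prints only the line starting with E,F
```

Like the original grep command, this counts every occurrence of $long, including ones embedded in a longer field.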

$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
but if $long can occur in contexts other than as a field of its own, e.g.:
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
and you only want to count it when it's in a field of its own then you'll need something more like:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
e.g.:
$ cat file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Wrong:
$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Right:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
$ awk -F, -v n=4 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2

Related

How to print matching pattern only

I need to print [PR:XXXXX] only.
Ex: [Test][PR:John][Finished][Reviewer:SE] to [PR:John] only. (PR tag)
Note: the strings other than [PR:XXXXX] may change from time to time.
Ex:
[Test][PR:Cook][Completed]
[Test][Finished][PR:Russell][Reviewer:SE]
[Dump][Reviewer:SE][Complete][PR:Arnold]
Note: there are no multi-line inputs, and every input contains exactly one PR tag.
So far I have come up with the following sed command, but it did not work:
sed "s/\[PR:[^]]*\]//"
You might use bash for this:
s='[Test][PR:Cook][Completed]'
regex='\[PR:[^]]*]'
[[ "$s" =~ $regex ]] && echo "${BASH_REMATCH[0]}"
# => [PR:Cook]
You may use grep:
grep -o '\[PR:[^]]*]'
Or, you can use this sed:
sed -n 's/.*\(\[PR:[^]]*]\).*/\1/p'
Or, you can use awk:
awk 'match($0,/\[PR:[^]]*]/) {print substr($0,RSTART,RLENGTH)}'
If more than one occurrence of [PR:...] per line needs to be printed, try the following:
awk '{while(match($0,/\[PR:[^]]*\]/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}' Input_file
In short, awk's match function finds each [PR:...] block in the line; the loop prints the match and then continues with the rest of the line until no match is left.
If you only have 1 such field max per line and want a blank line printed if no such field exists on the line then using GNU awk for FPAT:
$ awk -v FPAT='[[]PR:[^]]+]' '{print $1}' file
[PR:Cook]
[PR:Russell]
[PR:Arnold]
If you can have 0 to N such fields per line, e.g.:
$ cat file
[Test][PR:Cook][Completed]
[Test][Finished][PR:Russell][Reviewer:SE]
[Test][Finished][Reviewer:SE]
[Test][Finished][PR:Jack][PR:Russell][Reviewer:SE]
[Dump][Reviewer:SE][Complete][PR:Arnold]
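then here are some of the options, depending on your requirements (for the line with no PR tag, the first and last commands print an empty line, which is not reproduced in the output below):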
$ awk -v FPAT='[[]PR:[^][]+]' '{print $1}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' 'NF{print $1}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' '{for (i=1; i<=NF; i++) print $i}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Russell]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' '{$1=$1} 1' file
[PR:Cook]
[PR:Russell]
[PR:Jack] [PR:Russell]
[PR:Arnold]
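Use this perl command line, which deletes every bracketed tag that does not start with PR: (the negative lookahead (?!PR:) is safer than [^P][^R][^:], which would keep tags such as [Pass] and mishandle tags shorter than three characters):
perl -pe 's/\[(?!PR:)[^]]*\]//g' your_file
Test Below:
$ echo "[Test][Finished][PR:Russell][Reviewer:SE][PR:Rachel]"|perl -pe 's/\[(?!PR:)[^]]*\]//g'
[PR:Russell][PR:Rachel]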
Another perl:
perl -lne 'print join "", grep {/^\[PR:/} /\[.+?\]/g' file
This will accommodate multiple PR tags on one line.
Here is a shorter gnu-awk solution (using the same input file as in Ed's answer):
awk -v RS='\\[PR:[^]]+]' 'RT {print RT}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Russell]
[PR:Arnold]
Here is a Ruby version:
ruby -lne 'puts $_.scan(/\[PR.+?\]/).join("")' file
This accommodates multiple PR tags per line.

awk Extract Text Nth Occurrence Square Brackets (Containing Line Break In File Text)

I have 0.txt and 1.txt files. In the 0.txt file the content is as shown below:
[{A,B,C},{D,E,F}][{G,H,I}]
The contents of the 1.txt file is as shown below:
[{A,B,C},{D,E,F}]
[{G,H,I}]
That is, the difference between 0.txt and 1.txt is that in the 1.txt file there is a line break.
What I want is to extract all the text between '[' and ']' at the Nth occurrence. I am using awk -F'[][]' -v n=2 '{ print $(2*n) }' 1.txt > 2.txt (extract the text between the nth occurrence of square brackets), and I need it to work for the text layout shown in the 1.txt file.
So for n=2 the expected content of 2.txt is:
{G,H,I}
So far I have only managed this when there is no line break, as in 0.txt.
I need to know how to extract the text between '[' and ']' when there is a line break, as in 1.txt.
The output of awk -F'[][]' -v n=2 '{ print $(2*n) }' 1.txt > 2.txt has been all the content of 0.txt (except the square brackets) instead of only {G,H,I}. That is, the 2.txt content is as below:
{A,B,C},{D,E,F}
{G,H,I}
Edit Update 01:
The solution must work the same way for a third input file such as the one below when asking for the third occurrence, that is, [{J,K,L}]. So the expected output should be {J,K,L}.
[{A,B,C},{D,E,F}]
[{G,H,I}]
[{J,K,L}]
In general, for the nth occurrence of [{x, y, z, ..}] (assuming no text outside [ and ], and no blank lines), any solution should write exactly {x, y, z, ..} to the output file.
In short: how to extract the text between [ and ] for a given occurrence number?
You may try this gnu-awk command, which works irrespective of the presence of a line break between bracket pairs:
awk -v n=2 -v RS='\\[[^]]*]' 'RT && NR == n {print substr(RT, 2, length(RT)-2)}' file
{G,H,I}
Since we are using a custom RS of [...], it will print the correct record no matter whether the 2nd pair of [...] is on the first line or the second line.
With GNU awk for FPAT:
$ awk -v FPAT='[^][\n]+' -v RS='^$' -v n=2 '{print $n}' 0.txt
{G,H,I}
$ awk -v FPAT='[^][\n]+' -v RS='^$' -v n=2 '{print $n}' 1.txt
{G,H,I}
With any awk and assuming you don't have any blank lines in the input:
$ awk -v RS= -F '][[:space:]]*[[]|^[[]|]$' -v n=2 '{print $(n+1)}' 0.txt
{G,H,I}
$ awk -v RS= -F '][[:space:]]*[[]|^[[]|]$' -v n=2 '{print $(n+1)}' 1.txt
{G,H,I}
Here's an alternate approach assuming your input doesn't have any non-newline characters outside of [] delimiters. This will work with any awk.
$ tr '[]' '\n' <ip.txt | awk -v RS= -v ORS= 'NR==2'
{G,H,I}
The tr command will replace all [] characters with newline characters. The awk command uses 2 or more consecutive newlines as record separator. Any excess newlines at the beginning of the input will be ignored. So, you can now just use the record number to get the desired output.
If you preprocess the data with grep, the extraction becomes trivial, e.g.:
n=3
<0.txt grep -oE '\{[^}]+\}' | sed -n ${n}p
<1.txt grep -oE '\{[^}]+\}' | sed -n ${n}p
Output :
{G,H,I}
{G,H,I}
Edit - Change in OPs requirements
If what you want is the contents of the square-brackets, then a minor change to this solution would still work, e.g.:
n=3
<new.txt grep -oE '\[[^]]+\]' | tr -d '[]' | sed -n ${n}p
Output:
{J,K,L}

Pipe awk's results to sed (deletion)

I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like that:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?
Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The parts that might be non-obvious to newcomers but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of its default one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of its default newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
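One caveat with the command above: sub(rmv,"") treats the contents of file1 as a regular expression, so characters such as . or * in the data would be interpreted as metacharacters. Below is a sketch of a literal-string variant that should run in any POSIX awk, using index() and substr() instead of sub(). The function name remove_block and the temporary files are illustrative, not part of the original answer:

```shell
#!/bin/sh
# remove_block BLOCKFILE FILE: print FILE with every literal copy of the
# contents of BLOCKFILE removed. (Illustrative helper; assumes both files
# fit in memory and BLOCKFILE is not empty.)
remove_block() {
    awk '
        NR == FNR { block = block $0 "\n"; next }  # slurp the first file
        { text = text $0 "\n" }                    # slurp the second file
        END {
            # index() does a plain string search, so no regex escaping is needed.
            while ((pos = index(text, block)) > 0)
                text = substr(text, 1, pos - 1) substr(text, pos + length(block))
            printf "%s", text
        }
    ' "$1" "$2"
}

block=$(mktemp)
full=$(mktemp)
printf '%s\n' '>Genome1' 'ATGCAAAAG' 'CAATAA' > "$block"
printf '%s\n' '>Genome1' 'ATGCAAAAG' 'CAATAA' '>Genome2' 'ATGAAAAA' \
    'AAAAAAAA' 'CAA' '>Genome3' 'ACCC' > "$full"

remove_block "$block" "$full"
```

The search is not anchored to the start of a line, so a block that happens to begin in the middle of a longer line would also be removed; whether that matters depends on the data.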
Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
First sed command turns output from someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
Second sed command reads it as input expression (-f - reads sed commands from file -, i.e. gets it from pipe) and applies to file file1.txt.
Remark for other readers:
OP wants to use sed, but as noted in the comments, it may not be the easiest way to solve this question. Deleting lines with awk could be simpler. Another (easy) solution could be to use grep with the -v (invert match) and -f (read patterns from file) options, in this way:
someawkcommand | grep -v -f - file1.txt
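Note that grep -f treats each input line as a regular expression that may match anywhere in a line, so a pattern such as CAA would also delete CAATAA. Assuming whole-line, literal matching is what is wanted, the standard -F (fixed strings) and -x (whole line) options can be added. A small sketch in which printf stands in for someawkcommand; the helper name and data are made up, and reading patterns from stdin with -f - is assumed to be supported, as in the command above:

```shell
#!/bin/sh
# delete_exact_lines FILE: read literal lines on stdin and print FILE without
# any line that is exactly equal to one of them. (Hypothetical helper name.)
delete_exact_lines() {
    grep -v -x -F -f - "$1"
}

data=$(mktemp)
printf '%s\n' '>Genome2' 'ATGAAAAA' 'CAATAA' 'CAA' > "$data"

# Substring/regex matching: removes both CAATAA and CAA.
printf 'CAA\n' | grep -v -f - "$data"

# Whole-line fixed-string matching: removes only the CAA line.
printf 'CAA\n' | delete_exact_lines "$data"
```

Either way, grep deletes matching lines wherever they occur, not only when they appear together as one block.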
Edit: Following @rici's comments, here is a new command that takes the output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't try this at home. Grown-ups are strongly encouraged to consider avoiding sed for that.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it leaves an empty line in place of the removed pattern. This can't be fixed easily (there are problems if the pattern is at the beginning or end of the file). Add a substitution to remove it if you really feel like it.
This can more easily be done in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0; next}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt
This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1 |
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does global substitution(s) using the contents of file1. Finally it removes any blank lines at the end of the file caused by the deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2

Extract numbers with a regex and grep

I have a file which contains:
abc:12345
def:56323
I want to extract the numbers with grep:
grep -o "[0-9]"
but it does not give the expected result:
12345
56323
Thanks for any help.
Maybe you missed the * quantifier, i.e. [0-9]*:
$ grep -o "[0-9]*" file
12345
56323
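The difference comes from how -o works: it prints each match on its own line, so [0-9] yields one digit per line, while a pattern that matches a run of digits yields one number per line. A sketch using the extended-regex + quantifier, which avoids relying on how empty matches of * are handled; extract_numbers is an illustrative name, and a grep that supports -o is assumed:

```shell
#!/bin/sh
# extract_numbers: print every run of digits found on stdin, one per line.
extract_numbers() {
    grep -oE '[0-9]+'
}

# Single-digit pattern: each digit ends up on its own line (first three shown).
printf 'abc:12345\ndef:56323\n' | grep -o '[0-9]' | head -n 3

# One-or-more digits: each number ends up on its own line.
printf 'abc:12345\ndef:56323\n' | extract_numbers
```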
Note that for this particular case, you can also make use of other tools:
while IFS=: read text number
do
echo "$number"
done < file
Or cut, sed or awk:
cut -d: -f2 file
sed 's/^[^:]*://' file
awk -F: '{print $2}' file

How can I display the second matched regex in sed

Suppose I have this text
The code for 233-CO is the main reason for 45-DFG and this 45-GH
Now I have this regexp \s[0-9]+-\w+ which matches 233-CO, 45-DFG and 45-GH.
How can I display just the third match 45-GH?
sed -re 's/\s[0-9]+-\w+/\3/g' file.txt
where \3 should be the third regexp match.
Is it mandatory to use sed? You could do it with grep, using arrays:
text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
matches=( $(echo "$text" | grep -o -m 3 '\s[0-9]\+-\w\+') ) # store first 3 matches in array
echo "${matches[0]} ${matches[2]}" # print first and third match
To find the last occurrence of your pattern, you can use this:
$ sed -re 's/.*\s([0-9]+-\w+).*/\1/g' file
45-GH
If awk is accepted, here is a (GNU) awk one-liner: you give it the number of the match you want to grab, and it gives you the matched string.
awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' file
test
kent$ echo $STR #so we have 7 matches in str
The code for 233-CO is the main reason for 45-DFG and this 45-GH,foo 004-AB, bar 005-CC baz 006-DDD and 007-AWK
kent$ n=6 #now I want the 6th match
#here you go:
kent$ awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' <<< $STR
006-DDD
This might work for you (GNU sed):
sed -r 's/\b[0-9]+-[A-Z]+\b/\n&\n/3;s/.*\n(.*)\n.*/\1/' file
s/\b[0-9]+-[A-Z]+\b/\n&\n/3 prepend and append \n (newlines) to the third (n) pattern in question.
s/.*\n(.*)\n.*/\1/ delete the text before and after the pattern
With grep for matching and sed for printing the occurrence:
$ egrep -o '\b[0-9]+-\w+' file | sed -n '1p'
233-CO
$ egrep -o '\b[0-9]+-\w+' file | sed -n '2p'
45-DFG
$ egrep -o '\b[0-9]+-\w+' file | sed -n '3p'
45-GH
Or with a little awk passing the occurrence to print using the variable o:
$ awk -v o=1 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
233-CO
$ awk -v o=2 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-DFG
$ awk -v o=3 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-GH
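The grep and sed combination shown earlier generalizes to any occurrence number. A sketch, assuming a grep that supports -o (GNU and BSD grep do); the function name nth_match is made up, and the POSIX class [[:alnum:]_] stands in for \w:

```shell
#!/bin/sh
# nth_match N: print the Nth match of <digits>-<word> found on stdin.
nth_match() {
    # grep -o emits one match per line; sed prints line N and stops reading.
    grep -oE '[0-9]+-[[:alnum:]_]+' | sed -n "${1}{p;q;}"
}

text='The code for 233-CO is the main reason for 45-DFG and this 45-GH'
printf '%s\n' "$text" | nth_match 3   # prints 45-GH
```

If there are fewer than N matches, nothing is printed.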