How to print matching pattern only - regex

I need to print only the [PR:XXXXX] tag.
Ex: [Test][PR:John][Finished][Reviewer:SE] should become [PR:John] only (the PR tag).
Note: The strings other than [PR:XXXXX] may change from time to time.
Ex:
[Test][PR:Cook][Completed]
[Test][Finished][PR:Russell][Reviewer:SE]
[Dump][Reviewer:SE][Complete][PR:Arnold]
Note: There are no multi-line inputs, and each input contains exactly one PR tag.
So far I have come up with the following sed command, but it did not work:
sed "s/\[PR:[^]]*\]//"

You might use bash for this:
s='[Test][PR:Cook][Completed]'
regex='\[PR:[^]]*]'
[[ "$s" =~ $regex ]] && echo "${BASH_REMATCH[0]}"
# => [PR:Cook]
You may use grep:
grep -o '\[PR:[^]]*]'
Or, you can use this sed:
sed -n 's/.*\(\[PR:[^]]*]\).*/\1/p'
Or, you can use awk
awk 'match($0,/\[PR:[^]]*]/) {print substr($0,RSTART,RLENGTH)}'
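The sed attempt in the question fails because s/\[PR:[^]]*\]// replaces the tag with nothing, so it deletes exactly the part that should be kept. A minimal sketch of the difference, assuming one PR tag per line as stated in the question (the function names strip_pr and keep_pr are made up for illustration):

```sh
# What the command from the question does: delete the PR tag, keep the rest.
strip_pr() {
    sed 's/\[PR:[^]]*\]//'
}

# Keep only the PR tag: capture it, match the rest of the line around it,
# and print only the lines where the substitution succeeded.
keep_pr() {
    sed -n 's/.*\(\[PR:[^]]*]\).*/\1/p'
}

printf '%s\n' '[Test][PR:John][Finished][Reviewer:SE]' | strip_pr
# => [Test][Finished][Reviewer:SE]
printf '%s\n' '[Test][PR:John][Finished][Reviewer:SE]' | keep_pr
# => [PR:John]
```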

If a single line can contain more than one occurrence of [PR:...] and you want all of them printed, try the following.
awk '{while(match($0,/\[PR:[^]]*\]/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}' Input_file
A simple explanation: awk's match function finds the next [PR:...] block in the line; each occurrence is printed and the line is then cut off just past it, and this repeats until no occurrences remain on the line.
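The same loop, spread over several lines with comments, might look like the sketch below. It works on a copy of the line (the variable rest) instead of overwriting $0, and the wrapper name all_pr_tags is only an illustrative choice:

```sh
# Print every [PR:...] tag, one per output line, using only POSIX awk features.
all_pr_tags() {
    awk '{
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost match in rest
        while (match(rest, /\[PR:[^]]*\]/)) {
            print substr(rest, RSTART, RLENGTH)
            # drop everything up to and including this match, then search again
            rest = substr(rest, RSTART + RLENGTH)
        }
    }' "$@"
}

printf '%s\n' '[Test][Finished][PR:Jack][PR:Russell][Reviewer:SE]' | all_pr_tags
# => [PR:Jack]
# => [PR:Russell]
```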

If you only have 1 such field max per line and want a blank line printed if no such field exists on the line then using GNU awk for FPAT:
$ awk -v FPAT='[[]PR:[^]]+]' '{print $1}' file
[PR:Cook]
[PR:Russell]
[PR:Arnold]
If you can have 0 to N such fields per line, e.g.:
$ cat file
[Test][PR:Cook][Completed]
[Test][Finished][PR:Russell][Reviewer:SE]
[Test][Finished][Reviewer:SE]
[Test][Finished][PR:Jack][PR:Russell][Reviewer:SE]
[Dump][Reviewer:SE][Complete][PR:Arnold]
then here's some of the options depending on your requirements:
$ awk -v FPAT='[[]PR:[^][]+]' '{print $1}' file
[PR:Cook]
[PR:Russell]

[PR:Jack]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' 'NF{print $1}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' '{for (i=1; i<=NF; i++) print $i}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Russell]
[PR:Arnold]
$ awk -v FPAT='[[]PR:[^][]+]' '{$1=$1} 1' file
[PR:Cook]
[PR:Russell]

[PR:Jack] [PR:Russell]
[PR:Arnold]

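Use this perl command line, which removes every bracketed tag that does not start with PR: (the negative lookahead also handles tags such as [Passed] or [OK]):
perl -pe 's/\[(?!PR:)[^]]*\]//g' your_file
Test Below:
$ echo "[Test][Finished][PR:Russell][Reviewer:SE][PR:Rachel]"|perl -pe 's/\[(?!PR:)[^]]*\]//g'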
[PR:Russell][PR:Rachel]

Another perl:
perl -lne 'print join "", grep {/^\[PR:/} /\[.+?\]/g' file
This will accommodate multiple PR tags on one line.

Here is a shorter GNU awk solution (using the same input file as in Ed's answer):
awk -v RS='\\[PR:[^]]+]' 'RT {print RT}' file
[PR:Cook]
[PR:Russell]
[PR:Jack]
[PR:Russell]
[PR:Arnold]

Here is a Ruby:
ruby -lne 'puts $_.scan(/\[PR.+?\]/).join("")' file
This accommodates multiple PR tags per line.

Related

finding lines that contain n occurrences of a certain pattern

I have a file containing lines that look like
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
My goal is to find lines containing n occurrences of the pattern "$long".
Does anyone know the grep regex for this match?
You don't need a regex for this. With awk you can use $long as field separator and check how many fields each line has:
awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' file
Test
$ awk -v count=3 'BEGIN {FS="\\$long"} NF==(count+1)' a
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v count=4 'BEGIN {FS="\\$long"} NF==(count+1)' a
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
$ awk -v count=5 'BEGIN {FS="\\$long"} NF==(count+1)' a
$
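The trick relies on the fact that n separators split a line into n+1 fields. A small sketch that makes this visible by printing the number of occurrences per line; the function name count_long is hypothetical, and the separator is written as the bracket expression [$]long, which is just another way to match a literal dollar sign:

```sh
# Print, for each input line, how many times the literal text $long occurs.
count_long() {
    # An empty line has NF == 0, so guard against printing -1.
    awk 'BEGIN { FS = "[$]long" } { print (NF ? NF - 1 : 0) }' "$@"
}

printf '%s\n' 'A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,' | count_long
# => 3
```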
The awk solution by Fedorqui should work fine. You can also use grep for this, but note that the following matches lines with at least 4 occurrences, not exactly 4:
grep -E '(.*\$long){4}' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
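If exactly n occurrences are needed with grep, one possible approach is to keep the lines with at least n occurrences and then drop those with at least n+1. This is only a sketch; the function name exactly_n_long and its argument order are assumptions made for the example:

```sh
# Print lines containing exactly N occurrences of the literal text $long.
# Usage: exactly_n_long N [file...]
exactly_n_long() {
    n=$1; shift
    pat='(.*\$long)'
    # first grep: at least n occurrences; second grep: not n+1 or more
    grep -E "${pat}{$n}" "$@" | grep -v -E "${pat}{$((n + 1))}"
}
```

For example, exactly_n_long 3 file prints only the first sample line, even though the second line also contains at least 3 occurrences.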
$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
but if $long can occur in contexts other than as a field of its own, e.g.:
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
and you only want to count it when it's in a field of its own then you'll need something more like:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
e.g.:
$ cat file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Wrong:
$ awk -v n=3 'gsub(/\$long/,"&")==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
$ awk -v n=4 'gsub(/\$long/,"&")==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
Right:
$ awk -F, -v n=3 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC",,$long,,,,
A,B,1,2,3,$long,6,"A","",$long,,,,"ABC$longDEF",,$long,,,,
$ awk -F, -v n=4 '{c=0; for (i=1;i<=NF;i++) if ($i=="$long") c++} c==n' file
E,F,2,3,4,$long,$long,$long,$long,,,"A","STRING";123456,,,1,2

Pipe awk's results to sed (deletion)

I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like that:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?
Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
Some things that might be non-obvious to newcomers but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of its default of one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of its default newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
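Note that sub() treats the removed text as a regular expression, and that a multi-character RS needs GNU awk. A possible portable variant is sketched below: it slurps both files by hand and removes the block as a literal string with index(). The function name remove_block is hypothetical, and the sketch assumes the first file is not empty:

```sh
# Remove every literal occurrence of the contents of file $1 from file $2.
remove_block() {
    awk '
        NR == FNR { block = block $0 "\n"; next }   # first file: the text to remove
        { text = text $0 "\n" }                     # second file: the text to filter
        END {
            if (block != "")
                while ((pos = index(text, block)) > 0)
                    text = substr(text, 1, pos - 1) substr(text, pos + length(block))
            printf "%s", text
        }
    ' "$1" "$2"
}
```

Called as remove_block file1 file2, it should print the same result as the gawk command above for the sample files.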
Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
First sed command turns output from someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
Second sed command reads it as input expression (-f - reads sed commands from file -, i.e. gets it from pipe) and applies to file file1.txt.
Remark for other readers:
OP wants to use sed, but as noted in the comments, it may not be the easiest way to solve this problem. Deleting lines with awk could be simpler. Another (easy) solution could be to use grep with the -v (invert match) and -f (read patterns from file) options, like this:
someawkcommand | grep -v -f - file1.txt
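With plain -f, grep treats each line of the awk output as a separate regular expression and drops every line that contains a match, so a short pattern such as CAA would also remove CAATAA. A stricter sketch, assuming "exact pattern" means whole-line equality, adds the standard -F (fixed strings) and -x (match the whole line) options. The function name drop_exact_lines is made up for the example:

```sh
# Drop the lines of file $1 that are exactly equal to one of the lines on stdin.
drop_exact_lines() {
    grep -v -x -F -f - "$1"
}
```

Usage would be someawkcommand | drop_exact_lines file1.txt. Like the command above, this removes matching lines wherever they appear, not only when they form one contiguous block.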
Edit: Following @rici's comments, here is a new command that takes the output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't do this at home. Grown-ups are strongly encouraged to consider avoiding sed for that.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it will add an empty line instead of removed pattern. Can't fix it easily (problems if pattern is at beginning/end of file). Add a substitution to remove it if you really feel like it.
This can be done more easily in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0; next}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt
This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1 |
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does global substitution(s) using the contents of file1. Finally it removes any blank lines at the end of the file caused by the deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2

bash regex multiple match in one line

I'm trying to process my text.
For example, I have:
asdf asdf get.this random random get.that
get.it this.no also.this.no
My desired output is:
get.this get.that
get.it
So the regexp should catch only this pattern (get.\w), but it has to be applied repeatedly because there can be multiple occurrences in one line, so the easiest way with sed
sed 's/.*(REGEX).*/\1/'
does not work (it shows only a single occurrence).
Probably a good way is to use grep -o, but I have an old version of grep and the -o flag is not available.
This grep may give what you need:
grep -o "get[^ ]*" file
Try awk:
awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
You might need to tweak the regex between the slashes for your specific issue. Sample output:
$ awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
get.this
get.that
get.it
With awk:
awk -v patt="^get" '{
    for (i=1; i<=NF; i++)
        if ($i ~ patt)
            printf "%s%s", $i, OFS;
    print ""
}' <<< "$text"
bash
while read -a words; do
    for word in "${words[@]}"; do
        if [[ $word == get* ]]; then
            echo -n "$word "
        fi
    done
    echo
done <<< "$text"
perl
perl -lane 'print join " ", grep {$_ =~ /^get/} @F' <<< "$text"
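If the output should look exactly like the desired output in the question (matches from the same line joined by single spaces, no trailing space, no empty lines), a portable awk sketch could look like this. The function name extract_get is hypothetical, and the character class [A-Za-z0-9_] is used as a stand-in for \w, which not every awk supports:

```sh
# For each input line, print the whitespace-separated words of the form
# get.<word>, joined by single spaces; lines without a match print nothing.
extract_get() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^get\.[A-Za-z0-9_]+$/)
                out = (out == "" ? $i : out " " $i)
        if (out != "") print out
    }' "$@"
}

printf '%s\n' 'asdf asdf get.this random random get.that' 'get.it this.no also.this.no' | extract_get
# => get.this get.that
# => get.it
```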
This might work for you (GNU sed):
sed -r '/\bget\.\S+/{s//\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1 /g;s/ $//}' file
or if you want one per line:
sed -r '/\n/!s/\bget\.\S+/\n&\n/g;/^get/P;D' file

How can i display the second matched regex in sed

Suppose I have this text
The code for 233-CO is the main reason for 45-DFG and this 45-GH
Now I have this regexp \s[0-9]+-\w+ which matches 233-CO, 45-DFG and 45-GH.
How can I display just the third match 45-GH?
sed -re 's/\s[0-9]+-\w+/\3/g' file.txt
where \3 should be the third regexp match.
Is it mandatory to use sed? You could do it with grep, using arrays:
text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
matches=( $(echo "$text" | grep -o -m 3 '\s[0-9]\+-\w\+') ) # store the matches in an array (note: -m 3 limits matching lines, not matches)
echo "${matches[0]} ${matches[2]}" # print the first and third match
To find the last occurrence of your pattern, you can use this:
$ sed -re 's/.*\s([0-9]+-\w+).*/\1/g' file
45-GH
If awk is acceptable, here is an awk one-liner (GNU awk, because of the three-argument match): you give it the number of the match you want to grab, and it gives you the matched string.
awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' file
test
kent$ echo $STR #so we have 7 matches in str
The code for 233-CO is the main reason for 45-DFG and this 45-GH,foo 004-AB, bar 005-CC baz 006-DDD and 007-AWK
kent$ n=6 #now I want the 6th match
#here you go:
kent$ awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' <<< $STR
006-DDD
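For awks without the three-argument match or \s and \w, the same idea can be sketched with RSTART and RLENGTH only. The function name nth_match is made up, and the pattern is simplified to [0-9]+-[A-Za-z0-9_]+ without the leading whitespace, which is an assumption about what should count as a match:

```sh
# Print the Nth match of <digits>-<word> on each line; lines with fewer
# than N matches print nothing.
# Usage: nth_match N [file...]
nth_match() {
    n=$1; shift
    awk -v n="$n" '{
        rest = $0
        for (i = 1; i <= n; i++) {
            if (!match(rest, /[0-9]+-[A-Za-z0-9_]+/)) next
            hit = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print hit
    }' "$@"
}

echo 'The code for 233-CO is the main reason for 45-DFG and this 45-GH' | nth_match 3
# => 45-GH
```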
This might work for you (GNU sed):
sed -r 's/\b[0-9]+-[A-Z]+\b/\n&\n/3;s/.*\n(.*)\n.*/\1/' file
s/\b[0-9]+-[A-Z]+\b/\n&\n/3 prepend and append \n (newlines) to the third (n) pattern in question.
s/.*\n(.*)\n.*/\1/ delete the text before and after the pattern
With grep for matching and sed for printing the occurrence:
$ egrep -o '\b[0-9]+-\w+' file | sed -n '1p'
233-CO
$ egrep -o '\b[0-9]+-\w+' file | sed -n '2p'
45-DFG
$ egrep -o '\b[0-9]+-\w+' file | sed -n '3p'
45-GH
Or with a little awk passing the occurrence to print using the variable o:
$ awk -v o=1 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
233-CO
$ awk -v o=2 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-DFG
$ awk -v o=3 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-GH

find lines containing "^" and replace entire line with ""

I have a file with a string on each line... ie.
test.434
test.4343
test.4343t34
test^tests.344
test^34534/test
I want to find any line containing a "^" and replace entire line with a blank.
I was trying to use sed:
sed -e '/\^/s/*//g' test.file
This does not seem to work, any suggestions?
sed -e 's/^.*\^.*$//' test.file
For example:
$ cat test.file
test.434
test.4343
test.4343t34
test^tests.344
test^34534/test
$ sed -e 's/^.*\^.*$//' test.file
test.434
test.4343
test.4343t34
$
To delete the offending lines entirely, use
$ sed -e '/\^/d' test.file
test.434
test.4343
test.4343t34
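The attempt in the question fails because in s/*//g the leading * has nothing to repeat, so sed treats it as a literal asterisk and finds nothing to replace. Keeping the address from the question and changing the pattern to .* is probably the smallest fix; a sketch (the function name blank_caret_lines is only for illustration):

```sh
# Blank out every line containing a literal caret, keeping the now-empty line.
blank_caret_lines() {
    sed -e '/\^/s/.*//' "$@"
}
```

Running blank_caret_lines test.file should give the same output as the s/^.*\^.*$// command above.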
other ways
awk
awk '!/\^/' file
bash
while read -r line
do
    case "$line" in
        *"^"* ) continue;;
        *) echo "$line"
    esac
done <"file"
and probably the fastest
grep -v "\^" file