Perl regex to extract a specific word

I have the following example of a text file:
AFUA_2G08360|pyrG
AFUA_2G12630
gel1|bgt2|AFUA_2G01170
and I wish to use a regex to extract AFUA_2G08360, AFUA_2G12630, AFUA_2G01170 using perl -l -ne on the Unix command line.
How would you suggest to do that?

Why not use sed, with something like this (sed has no \d, so use [0-9], and capture the ID instead of deleting it):
sed -E 's/.*(AFUA_2G[0-9]{5}).*/\1/'

Try this expression:
/(AFUA_2G\d+)/g

Here is a doable one-liner for your example input.
perl -lne 's/.*(AFUA_[^|]*).*/$1/; print' data

AFUA_[0-9A-Za-z]{7}
See here : http://regexr.com?328gj
Command line :
user@mch:/tmp$ cat input.txt
AFUA_2G08360|pyrG
AFUA_2G12630
gel1|bgt2|AFUA_2G01170
user@mch:/tmp$ perl -lne '@matches = /AFUA_[0-9A-Za-z]{7}/g; print join("\n", @matches)' input.txt
AFUA_2G08360
AFUA_2G12630
AFUA_2G01170

Use:
perl -pe 's/.*(AFUA_[0-9a-zA-Z]*).*$/$1/' your_file
Tested:
> cat temp
AFUA_2G08360|pyrG
AFUA_2G12630
gel1|bgt2|AFUA_2G01170
> perl -pe 's/.*(AFUA_[0-9a-zA-Z]*).*$/$1/' temp
AFUA_2G08360
AFUA_2G12630
AFUA_2G01170

SED - Regex fails

Given the following files:
input_file:
if_line1
if_line2
template_file_1:
temp_file_line1
temp_file_line2
##regex_match## <= must be replaced by input_file
temp_file_line3
template_file_2:
temp_file_line1
temp_file_line2
{my_file.global} <= must be replaced by input_file
temp_file_line3
output_file:
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
For template_file_1 the following sed command works:
sed -n -e '/##regex_match##/{r input_file' -e 'b' -e '}; p' template_file_1 > output_file
However, for template_file_2 the analog sed command fails:
sed -r -n -e '/(?<={).+\.global(?=})/{r input_file' -e 'b' -e '}; p' template_file_2 > output_file
sed complains the regular expression was invalid
The given regex is at least PCRE valid, for example grep -oP '(?<={).+\.global(?=})' template_file_2 works. Any idea how to deal with that?
perl one-liners:
perl -pe 'do {local $/; open $f, "<input_file"; $_ = <$f>; close $f} if /\{.+?\.global\}/' template_file_2
or perhaps this one, not "pure" perl
perl -ne 'if (/\{.+?\.global\}/) {system("cat","input_file")} else {print}' template_file_2
Using CPAN modules can make this really tidy:
perl -MPath::Tiny -pe '$_ = path("input_file")->slurp if /\{.+?\.global\}/' template_file_2
idk exactly what that PCRE is intended to do but taking a guess at it, this will work using any awk in any shell on every UNIX box:
$ awk 'NR==FNR{new=new s $0; s=ORS; next} /##regex_match##/{$0=new} 1' input_file template_file_1
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
$ awk 'NR==FNR{new=new s $0; s=ORS; next} /\{[^.{}]+\.global}/{$0=new} 1' input_file template_file_2
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
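As for why the original command fails: sed only understands POSIX basic and extended regular expressions, and neither has lookbehind or lookahead, so (?<={) and (?=}) are rejected. Since the whole line is being replaced anyway, the braces can simply be matched as part of the pattern. A minimal sketch, assuming the intent is "any line containing {something.global}"; the function name fill_template is hypothetical:

```shell
# Hypothetical helper: replace each template line that contains
# {name.global} with the contents of another file.
#   $1 = file to insert, $2 = template file
fill_template() {
  # r queues the file for output at the end of the cycle;
  # d then drops the placeholder line itself.
  sed -E -e '/\{[^{}]+\.global\}/{' -e "r $1" -e 'd' -e '}' "$2"
}
```

Usage: fill_template input_file template_file_2 > output_file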

Can we do multiple substitutions with a single Perl command?

Is there a way to make the following into one perl -pe instead of piping it in sequence?
cat text.txt | perl -pe "s/PATTERN1/$PATTERN1/g" | perl -pe "s/PATTERN2/$PATTERN2/g"
The answer in the comments is perfect, but here's a goofy way to do it just for fun:
perl -pe '$_ = s/PATTERN1/$PATTERN1/gr =~ s/PATTERN2/$PATTERN2/gr' text.txt
Either way, you don't need cat or pipes at all. Just add the file name as the last argument.
Just for reference, here is the best answer, which was given above in the comments:
perl -pe 's/PATTERN1/$PATTERN1/g; s/PATTERN2/$PATTERN2/g' text.txt

I have a file and I need to extract the string that follows 'LN:' on the second line

Please refer to the file contents below.
@HD VN:1.0 SO:unsorted
@SQ SN:Chr1 LN:30427680
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?
Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up until LN:, and print what's left, and only if a substitution did take place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
Using grep:
grep -oP 'LN:\K.*' filename
Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.
Using perl:
perl -lne 'print $& if /LN:\K.*/' filename
or
perl -lne 'print $1 if /LN:(.*)/' filename
Another awk
awk -F"LN:" 'NF>1 {print $2}' file
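If the file may contain other tags that merely end in LN:, or more fields after the length, a slightly stricter variant is to look at each whitespace-separated field and only accept one that starts with LN:. This is just a sketch under that assumption; the function name ln_value is made up:

```shell
# Hypothetical helper: print the value of every field that starts with "LN:".
# Reads the named files, or stdin if none are given.
ln_value() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^LN:/)
        print substr($i, 4)   # drop the 3-character "LN:" prefix
  }' "$@"
}
```

Usage: ln_value input.txt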

What is the Unix command to display all lines of a file with two certain strings

Basically, I have a file that I want to search and display only the lines that have the strings 'abc' and 'vhg'. What is the Unix command for this?
You can use grep for it:
grep abc file.txt | grep vhg
OR
you can use awk:
awk '/abc/ && /vhg/' file.txt
One more way with grep (quote the pattern so the shell does not glob it, and cover both orders):
grep -E 'abc.*vhg|vhg.*abc' file.txt
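Be careful with grep alternation, which matches lines containing any of the words:
grep 'word1\|word2\|word3' /path/to/file
Example:
grep 'abc\|vhg' filename
This prints lines that contain abc or vhg, not only the lines that contain both.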
Since a sed solution has not yet been given:
sed -n '/abc/{ /vhg/p; }' file.txt
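To make the AND behaviour explicit, and to treat the two strings as literal text rather than regexes, the chained grep can be wrapped up like this. It is only a sketch; the function name lines_with_both is hypothetical:

```shell
# Hypothetical helper: print lines that contain both fixed strings,
# in either order.
#   $1, $2 = strings to look for; remaining arguments = files (stdin if none)
lines_with_both() {
  w1=$1
  w2=$2
  shift 2
  # -F disables regex interpretation; -- protects strings starting with "-".
  grep -F -- "$w1" "$@" | grep -F -- "$w2"
}
```

Usage: lines_with_both abc vhg file.txt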

Regular expression to replace a word with another word on the same line unix

Let A, B, C, D be words.
Input File :
..
A/B/C/D
W/B/C/Z
L/B/C/O
..
Output file:
..
A/B/C/A
W/B/C/W
L/B/C/L
..
Replace the word D with the word A on the same line, but only if the /B/C/ delimiter is present in the line, and likewise for the other lines.
Is there any sed/awk/perl one-liner to accomplish that?
This is an awk solution:
awk -F/ -v OFS=/ '$2=="B" && $3=="C" {$4=$1}1' input.txt
You can do:
sed -re 's/^([^/]*)(\/B\/C\/)([^/]*)$/\1\2\1/' file
Demo:
$ cat file
A/B/C/D
W/B/C/Z
L/B/C/O
$ sed -re 's/^([^/]*)(\/B\/C\/)([^/]*)$/\1\2\1/' file
A/B/C/A
W/B/C/W
L/B/C/L
pearl.306> echo "A/B/C/D"|awk '{split($0,a,"/");print a[1]"/"a[2]"/"a[3]"/"a[1]}'
A/B/C/A
pearl.307>
another way is:
pearl.309> echo "A/B/C/D" | awk -F"/" '{OFS="/"}{$NF=$1;print}'
A/B/C/A
pearl.310>
pearl.318> cat file1
A/B/C/D
W/B/C/Z
L/B/C/O
pearl.319> awk -F"/" '{OFS="/"}{$NF=$1;print}' file1
A/B/C/A
W/B/C/W
L/B/C/L
pearl.320>
This might work for you:
sed 's|^\(\(.\)/B/C/\).|\1\2|' file
if A/B/C/D are real words e.g. wordA/wordB/wordC/wordD, then:
sed 's|^\(\([^/]*\)/wordB/wordC/\).*|\1\2|' file
For the literal D to A case only (it does not handle the W and L lines), this should do the trick:
perl -p -e 's/D/A/g'
In sed:
sed -e 's/D/A/'
perl -pe 's#(/B/C/)(.*)#$1$`#' file
This should work.