How to extract a part of string in Perl? - regex

I have an input file with data like below:
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
I need the result to contain only:
D-X-X-A
D-X-A
D-X-X-X-X-A
Please help me!!

You can try,
open my $fh, "<", "file" or die $!;
while (my $line = <$fh>) {
$line =~ s/^[-X]+ | [-X]+(?=\s*$)//xg;
print $line;
}
close $fh;
or from cmd line,
perl -pe 's/^[-X]+ | [-X]+(?=\s*$)//xg' file

Tested code:
my @a = qw[
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
];
for my $a (@a) {
if ($a =~ /(D[-X]+A)/) {
print $1,"\n";
}
}

perl -lne 'print $1 if(/[^D]*(D.*A).*/)'
test
Inside a script, while reading a file:
while (my $line = <$fh>) {
print "$1\n" if $line =~ /[^D]*(D.*A).*/;
}
Regex:
[^D]*(D.*A).*
[^D]* skips any leading characters that are not D (the line need not start with D).
(D.*A) captures (round brackets) everything from the first D to the last A in the line. The captured part is stored in $1.
.* matches whatever is left of the line.
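A minimal sketch of the capture described above, run against the three sample lines from the question. The helper name extract_d_to_a is made up for illustration; it is not part of any answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of a line from the first D
# to the last A, or undef when the line contains no such span.
sub extract_d_to_a {
    my ($line) = @_;
    return $line =~ /[^D]*(D.*A).*/ ? $1 : undef;
}

for my $line (qw(X-X-D-X-X-A X-D-X-A-X D-X-X-X-X-A-X-X)) {
    my $part = extract_d_to_a($line);
    print "$part\n" if defined $part;
}
```

Checking the match before using $1 matters: after a failed match, $1 still holds whatever the last successful match captured.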

Related

Search and replace multiple lines from a file

I'm trying to remove a part of a.txt file and replace with contents of b.txt file while also doing modification to other lines in a.txt using a Perl program.
file a.txt
line1
line2
replace from below line
replace from this line
bla bla...
bla bla...
to this line
line3
line4
file b.txt
replacement1
replacement2
replacementn
Below is my code which is not working.
#!/apps/perl/5.8.3/bin/perl -w
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
while (<INPUT>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
if($_ =~ /replace from below line/){
while(<INPUT>){
{
local undef $/;
s/replace from this line.*to this line/$replace_text/smg;
}
s/line3/modified_line3/;
s/line4/modified_line4/;
print OUTPUT;
}
}
}
close(INPUT);
close(OUTPUT);
Expected output file c.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
Can someone help me understand where I'm going wrong?
I don't think you need nested while loops to read your input file.
One way is to use a variable to control when you print to the output file:
use warnings;
use strict;
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
my $print = 1;
while (<INPUT>) {
s/line(\d)/modified_line$1/;
$print = 0 if /replace from below line/;
if (/to this line/) {
$print = 1;
$_ = $replace_text;
}
print OUTPUT if $print;
}
close(INPUT);
close(OUTPUT);
Output:
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
I also consolidated your 4 line substitutions into 1 using \d.
As much as I like perl, it's really not necessary here:
sed -e 's/line1/modified_line1/' \
-e 's/line2/modified_line2/' \
-e 's/line3/modified_line3/' \
-e 's/line4/modified_line4/' \
-e '/replace from below/rb.txt' \
-e '/replace from below/,/to this line/d' a.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
If you did want to use perl, I'd just do:
#!/usr/bin/env perl
use strict;
use warnings;
$| = 1; # unbuffer STDOUT so the output of cat below appears in the right place
open my $ah, '<', "a.txt" or die "a.txt: $!\n";
while(<$ah>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
s/line3/modified_line3/;
s/line4/modified_line4/;
if( /replace from below/ ){
system "cat b.txt" and exit 1;
}
next if( /replace from below/ .. /to this line/);
print;
}
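The .. in the next if line above is Perl's range operator in scalar context (often called the flip-flop): it turns true on the line that matches the first pattern and stays true up to and including the line that matches the second. A minimal sketch on in-memory lines, with a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: drop every line from the start marker through
# the end marker, inclusive, using the scalar-context range operator.
sub drop_block {
    my @kept;
    for (@_) {
        next if /replace from below/ .. /to this line/;
        push @kept, $_;
    }
    return @kept;
}

print "$_\n" for drop_block('line1', 'replace from below line',
                            'bla bla...', 'to this line', 'line3');
```

Because the operator keeps its own state between evaluations, this only behaves as expected when every start marker is eventually followed by an end marker.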
The problem description does not say how big a.txt can be. The posted code uses a regular expression with the /smg modifiers, which suggests that the OP is trying to work on multiline text.
Let's assume that the input file is small enough to be read and processed in memory.
For manageability, the substitutions are placed in the __DATA__ block, which is read into the %substitute hash.
A regular expression $re is built from the keys of %substitute and used in the substitution pattern.
The multiline substitution is based on the OP's original code (it is not applicable when the input is read line by line).
Two subroutines are defined, one to read the content of a file into a variable and one to store a variable's data in a file, just to make the code easier to read and understand.
use strict;
use warnings;
use feature 'say';
my($fname_in,$fname_repl,$fname_out) = qw/a.txt b.txt c.txt/;
my %substitute = split(/[,\s]/, do{ local $/; <DATA>} );
my $re = '\b(' . join('|',keys %substitute) . ')\b';
my $data = read_file($fname_in);
my $replace_with = read_file($fname_repl);
$data =~ s/$re/$substitute{$1}/g;
# consume the newline after the end marker too; b.txt is assumed to end with its own newline
$data =~ s/replace from below line.*?to this line\n/$replace_with/gsm;
save_file($fname_out,$data);
say $data;
exit 0;
sub read_file {
my $fname = shift;
my $data;
open my $fh, '<', $fname
or die "Couldn't open $fname";
$data = do { local $/; <$fh> };
close $fh;
return $data;
}
sub save_file {
my $fname = shift;
my $data = shift;
open my $fh, '>', $fname
or die "Couldn't open $fname";
say $fh $data;
close $fh;
}
__DATA__
line1,modified_line1
line2,modified_line2
line3,modified_line3
line4,modified_line4
Output
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
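The hash-driven substitution used above can be tried in isolation. This is a sketch under my own assumptions: the helper name substitute_all is hypothetical, and the quotemeta call and the longest-key-first ordering are additions that the answer's code does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every key of %$map that appears as a
# whole word in $text with the corresponding value.
sub substitute_all {
    my ($text, $map) = @_;
    return $text unless %$map;             # nothing to substitute
    my $alternation = join '|',
        map  { quotemeta($_) }             # guard against regex metacharacters in keys
        sort { length($b) <=> length($a) } # try longer keys first
        keys %$map;
    $text =~ s/\b($alternation)\b/$map->{$1}/g;
    return $text;
}

my %substitute = (line1 => 'modified_line1', line2 => 'modified_line2');
print substitute_all("line1\nline2\nline3\n", \%substitute);
```

Keeping the pairs in a hash means a new substitution is one more entry of data rather than one more s/// statement.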

Assigning regex search to variable: Uninitialized variable error

I am opening files in a directory that contain two lines of sequences in each file. The top sequence is longer than the bottom, but includes the bottom sequence. I would like to extend the bottom sequence by the two flanking letters in each direction once it is found in the top sequence. I am trying to do this with a regex match, but I am getting an uninitialized-value error for the $newsequence variable.
Here is what a typical file looks like:
>CCCCNNNNNCCCC
NNNNN
I would like to print to one file all the sequences in the following format:
>CCCCNNNNNCCCC
CCNNNNNCC
Here is my code so far:
use strict;
use warnings;
my ($directory) = @ARGV;
my @array = glob "$directory/*";
my $header;
my $sequence;
my $newsequence;
open(OUT, ">", "/path/to/out.txt") or die $!;
foreach my $file (@array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
}
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
}
print OUT $header, "\n", $newsequence, "\n";
}
How can I improve my regex assignment to $newsequence to get adequate output? Thanks.
This line is wrong:
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
The my keyword is creating a new variable $newsequence local to the while loop, not assigning the variable in the main script. So when you try to write $newsequence after the loop is done, the variable is still uninitialized.
Either put the print statement inside the while loop, or remove the my keyword in this assignment.
Also, you should put that assignment statement inside the elsif block. Otherwise, you'll try to use $sequence before you've assigned it. So the whole thing should look like:
foreach my $file (@array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
print OUT $header, "\n", $newsequence, "\n";
}
}
}
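The scoping problem described above can be reproduced in isolation. A minimal sketch; the two helper names are made up for the comparison:

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two forms of the assignment.
sub with_my {
    my $result;
    for my $word (@_) {
        my ($result) = $word =~ /(N+)/;   # declares a NEW $result, visible only in the loop body
    }
    return $result;                       # the outer $result was never assigned
}

sub without_my {
    my $result;
    for my $word (@_) {
        ($result) = $word =~ /(N+)/;      # assigns the outer $result
    }
    return $result;
}

print defined with_my('CCNNNCC') ? "defined\n" : "undef\n";
print without_my('CCNNNCC'), "\n";
```

Perl does not warn about the inner declaration here, because it sits in a different scope from the outer one.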
If your conditions are accurate (each file contains only 2 lines, and the sequence is always found in the header), then you can make your code a lot simpler, including the regex:
for my $file (@array) {
open (my $fh, $file) or die $!;
chomp ((my $header, my $sequence) = <$fh>);
$header =~ /(..)$sequence(..)/;
print OUT "$header\n$1$sequence$2\n"; # trailing newline keeps the records of different files apart
}

Perl - regexp not being replaced

I have an array containing words that I want to remove from each line of a file. The code I am using is as follows:
my $INFILE;
my $OUTFILE;
my $STOPLIST;
open($INFILE, '<', $ARGV[0]);
open($STOPLIST, '<', "stop.txt");
open($OUTFILE, '>', $ARGV[1]);
my @stoplist = <$STOPLIST>;
my $line;
my $stopword;
while (<$INFILE>) {
$line = $_;
$line =~ s/\[[0-9]*\] //g;
$line =~ s/i\/.*\/; //g;
foreach (@stoplist) {
$stopword = $_;
$line =~ s/${stopword}//g;
}
print $OUTFILE lc($line);
}
However, the words in the stoplist still appear in the text in the output file, which would indicate that the $line =~ s/${stopword}//g; line wasn't doing its job as I expected.
How can I make sure that all words in the stop list that appear in the input text are replaced with 0 characters in the output?
You need to remove newlines from your stoplist using chomp:
my @stoplist = <$STOPLIST>;
chomp @stoplist;
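A small sketch of why the chomp matters: each entry read from the file ends in a newline, so the pattern only matches where the word is immediately followed by a line break. The helper name strip_stopwords is hypothetical, and the \Q...\E quoting and \b word boundaries are my additions, not part of the fix above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every stop word from a line.
# \Q...\E quotes regex metacharacters in the stop word; \b restricts
# the match to whole words.
sub strip_stopwords {
    my ($line, @stoplist) = @_;
    chomp @stoplist;                      # drop the newline each entry carries from the file
    $line =~ s/\b\Q$_\E\b//g for @stoplist;
    return $line;
}

print strip_stopwords("the cat and the hat\n", "the\n", "and\n");
```

Without the word boundaries, a stop word such as "the" would also be cut out of longer words such as "other".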

Perl Print if regex match A and not regex match B

I want to print a line of a file if it contains string A and doesn't contain string B (also splitting it into separate lines at each colon). What is the proper syntax? Here is what I tried (I want it to print lines containing "bash" but not lines containing numbers):
my $file = passwdtest;
open(FH, "$file");
foreach (<FH>) {
print join("\n", split(/:/, "$_")) if ($_ =~ /bash/ and $_ != /\d+/);
};
close FH;
$_ != /\d+/
is short for
$_ != ($_ =~ /\d+/)
Instead of != you need !~
if ($_ =~ /bash/ and $_ !~ /\d+/);
Isn't that just:
if (/bash/ && ! /\d+/)
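Putting the corrected condition into a runnable form, as a sketch with made-up sample lines and a hypothetical helper named wanted:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line mentions bash and has no digits.
sub wanted {
    my ($line) = @_;
    return $line =~ /bash/ && $line !~ /\d/;
}

for my $line ("alice:x:bash\n", "bob:x:1001:bash\n", "carol:x:sh\n") {
    print join("\n", split /:/, $line) if wanted($line);
}
```

The !~ operator is simply the negation of =~, so $line !~ /\d/ is the same test as !($line =~ /\d/).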

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters, and then matching these terms against keys of hashes. I am getting matches only if the terms are 100% similar, and I am struggling to find a perfect regex that could match several occurrences of the same word. Eg. Let us consider I have a term 'antagon' now it perfectly matches with the term 'antagon' but fails to match with antagonists, antagonistic or pre-antagonistic, hydro-antagonist etc. Also I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7 silencing the effect of special characters and so on.
This is the code that I have so far; the commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my @array = split / /, $string;
foreach my $word (@array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( @full, $word );
}
if ( $Stop_words{$word} ) {
push( @full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( @full, $term );
}
else {
push( @full, $word );
}
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
#         push(@full,$term);
# }
}
}
my $mod_str = join( " ", @full );
print OUT $mod_str, "\n";
}
I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7
The most straightforward approach is just to strip out the hyphens, i.e.
my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
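A minimal sketch of how the stripping can be used for lookups: apply the same normalization to the stored term and to each candidate word, then compare the results. The helper name normalize is hypothetical, and the lowercasing is my addition.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a term to a comparison key by lowercasing
# it and removing hyphens, underscores and apostrophes.
sub normalize {
    my ($word) = @_;
    (my $key = lc $word) =~ s/[-_']//g;
    return $key;
}

my %seen = (normalize('MCF-7') => 1);    # hash keys stored in normalized form
for my $word (qw(MCF7 MC-F7 mcf-7 MCF-8)) {
    printf "%s %s\n", $word, $seen{ normalize($word) } ? 'matches' : 'does not match';
}
```

Normalizing both sides keeps the lookup a plain hash access, so no regex is needed at match time.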
I am not sure what is stored in your %Values hash, so it's hard to tell what you expect to happen in
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagine you want is this (I have simplified your code somewhat):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my @array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (@array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(@full,$term); # push
}
}
my $mod_str = join( " ", @full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;
You may want to take another look at the assumptions behind your approach. It sounds to me like you are looking for words that are within a certain distance of the words in a list. Take a look at the Levenshtein distance to see if this is what you want. Be aware, however, that a naive recursive implementation takes exponential time; the standard dynamic-programming algorithm takes time proportional to the product of the two word lengths.
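For reference, a minimal sketch of the Levenshtein distance using the usual dynamic-programming table, keeping only one previous row. The function name is my own. This form takes time proportional to the product of the two string lengths.

```perl
use strict;
use warnings;

# Levenshtein distance between two strings: the minimum number of
# single-character insertions, deletions and substitutions.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);          # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @cur = ($i);                   # distance from a prefix of $s to the empty string
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                       # substitute (or keep)
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;   # delete from $s
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;   # insert into $s
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein('antagon', 'antagonist'), "\n";
```

A threshold on this distance would have to grow with word length to accept long suffixes, which is one reason the prefix and substring patterns in the answers above may fit stemmed terms better.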