The perfect regex to extract a particular string in perl


I have a text file abc.txt that looks like this:
dQdC(sA1B2C3,sC5) = A lot of stuff
a = b = c
Baseball
dQdC(sC2V3X1,sD5) = A lot of stuff again
Now I want to create two arrays in Perl: one containing A1B2C3 and C2V3X1, the other containing C5 and D5. I don't care about the other, intermediate lines. To achieve this, I am trying this Perl script:
for (my $in=0;$in<=$#lines;$in++){
if ($lines[$in]=~/dQdC\(s([A-Z0-9]+?),s([A-Z0-9]+?)\)/) {
print "1111"; #this line is just to check if it is at all going inside the loop
@A = $1;
@B = $2;
}
However, it is not even going inside the loop. So I guess I did something wrong with the regex. Will someone please tell me what I am doing wrong here?

my (@a, @b);
while ($file =~ /^dQdC\(s(\w+),s(\w+)\)/mg) {
push @a, $1;
push @b, $2;
}
or
my (@a, @b);
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push @a, $1;
push @b, $2;
}
}
Working with parallel arrays isn't nice.
Alternative 1: Hash
my %hash = $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg;
or
my %hash;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
$hash{$1} = $2;
}
}
Alternative 2: AoA
use List::Util qw( pairs ); # 1.29+
my @pairs = pairs( $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg );
or
my @pairs;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push @pairs, [ $1, $2 ];
}
}
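To try the first variant (two parallel arrays) without a file, here is a minimal sketch. It assumes the whole input is already in a string; the names $file, @first and @second are illustrative, and the sample text is copied from the question.

```perl
use strict;
use warnings;

# Sample input copied from the question; $file holds the whole text.
my $file = <<'END';
dQdC(sA1B2C3,sC5) = A lot of stuff
a = b = c
Baseball
dQdC(sC2V3X1,sD5) = A lot of stuff again
END

my (@first, @second);
# /m lets ^ match at the start of every line, /g walks through all matches.
while ($file =~ /^dQdC\(s(\w+),s(\w+)\)/mg) {
    push @first,  $1;
    push @second, $2;
}

print "first:  @first\n";   # first:  A1B2C3 C2V3X1
print "second: @second\n";  # second: C5 D5
```

The hash and AoA alternatives use the same pattern; only the container that receives $1 and $2 changes.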

If the format of your target lines is always as shown
use warnings;
use strict;
my $file = ...
my (@ary_1, @ary_2);
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
my ($v1, $v2) = /dQdC\(s([^,]+),s([^\)]+)/ or next;
push @ary_1, $v1;
push @ary_2, $v2;
}
This captures what lies between (s and the following , and then what lies between ,s and ). The first pattern might as well be s(.*?), since the negated character class brings no benefit when the following , still needs to be matched (but I left it as [^...] for consistency with the second one).
Comments
In general, it is better to process a file line by line, unless there are specific reasons to read it all first.
A C-style loop is rarely needed. To iterate over array indices, use for my $i (0..$#ary).
Please always use warnings; and use strict;.
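As a small illustration of the index loop mentioned in the comments, the sketch below iterates over an array by index and records which elements match. The array contents and the name @matched_indices are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample lines, borrowed from the question's data.
my @lines = (
    'dQdC(sA1B2C3,sC5) = A lot of stuff',
    'Baseball',
    'dQdC(sC2V3X1,sD5) = A lot of stuff again',
);

my @matched_indices;
for my $i (0 .. $#lines) {
    # $i is only needed here because we want to report the position.
    push @matched_indices, $i if $lines[$i] =~ /dQdC\(s([^,]+),s([^\)]+)/;
}

print "matching indices: @matched_indices\n";   # matching indices: 0 2
```

When the index itself is not needed, iterating over the elements directly (for my $line (@lines)) is usually simpler still.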

Try this:
(?<=\(s)([A-Z0-9]+)(?=,)
It matches substrings that come between (s and , using a lookbehind and a lookahead.
Similarly, use (?<=,s)([A-Z0-9]+)(?=\)) to capture the substrings between ,s and ).
Putting them together, you get two sets of captures, one for each kind of substring: (A1B2C3, C2V3X1) and (C5, D5).
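A minimal sketch of how the two lookaround patterns could be used from Perl. It assumes the input is held in a string ($text, @first and @second are illustrative names) and relies on m//g returning all captures in list context.

```perl
use strict;
use warnings;

# Two of the question's lines, joined into one string for the example.
my $text = "dQdC(sA1B2C3,sC5) = A lot of stuff\n"
         . "dQdC(sC2V3X1,sD5) = A lot of stuff again\n";

# In list context, m//g returns the captures of every match.
my @first  = $text =~ /(?<=\(s)([A-Z0-9]+)(?=,)/g;
my @second = $text =~ /(?<=,s)([A-Z0-9]+)(?=\))/g;

print "first:  @first\n";   # first:  A1B2C3 C2V3X1
print "second: @second\n";  # second: C5 D5
```

Note that these patterns do not check for the leading dQdC, so they would also pick up other text of the form (sX,sY) if the file contained any.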

Related

How do I store cumulative length of strings into an array with Perl

I have a file made up of a series of lines consisting of A's, C's, G's and T's. I want to find the lengths of those lines, make a list of the cumulative lengths (adding the lengths together sequentially), and put that into an array. So far I have:
#! /usr/bin/perl -w
use strict;
my $input = $ARGV[0];
my %idSeq;
my (@ID, @Seq);
open (my $INPUT, "<$input") or die "unable to open $input";
while (<$INPUT>) {
if (my $culm_length = /^([AGCT]\w+)\n$/) {
length($culm_length) = $_;
push (@Seq, $1);
}
}
bla bla bla....
So far I think what I have written gives me an array of the lengths of the individual lines. I want the cumulative lengths. Any ideas?

With reference to your previous question, How do I read strings into a hash in Perl, which was put on hold, I think perhaps you want a running total of the lengths of the lines.
I would write it this way. It keeps the running total in $total and pushes its value onto the @lengths array every time it changes.
use strict;
use warnings 'all';
my ( $input ) = @ARGV;
open my $fh, '<', $input or die qq{unable to open "$input" for input: $!};
my @lengths;
my $total = 0;
while ( <$fh> ) {
push @lengths, $total += length $1 if /^([ACGT]+)/;
}
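To see what the running total looks like, here is a sketch of the same loop run over a few in-memory lines instead of a file. The sample lines in @lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines; the third one is not a sequence and is skipped.
my @lines = ("ACGT\n", "AC\n", "# comment\n", "GGGTTT\n");

my @lengths;
my $total = 0;
for (@lines) {
    # Add this line's sequence length to the total, then record the total.
    push @lengths, ($total += length $1) if /^([ACGT]+)/;
}

print "@lengths\n";   # 4 6 12
```

Each element of @lengths is the sum of all sequence lengths seen so far, so the last element is the overall total.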

#!/usr/bin/perl -w
use strict;
my $length = 0;
while (<>) {
$length += length($1) if /^([AGCT]\w+)$/;
}
my @length = $length; # But why would you want to do this?!
...

how to extract string with any operator between?

I have an array @arr = ( "a=b", "a>b", "a<b", "a!=b", "a-b" ). What is the best way to get a and b, whatever operator is between them? I can extract them with
for($i=0; $i<=$#arr; $i++){
$str = $arr[$i];
if($str =~ m/^(.*?)(\s*=\s*)(.*)(;)/g){
my $d = $1;
my $e = $3;
}
Followed by an if statement for each possible operator such as "!=", "<", etc. But this will make my code look messy. Is there a better solution?

You could try something like this one-liner:
perl -e '@a = ("a=b","a>b","a<b","a!=b","a-b"); for $l (@a) { $l =~ s/(.).*(.)/$1/; print "$1$2\n"};'
The key thing is the greedy match, i.e. .*, between the two single-character captures, i.e. (.). To make sure that you really match from the start to the end of the string, you could use this:
perl -e '@a = ("a=b","a>b","a<b","a!=b","a-b"); for $l (@a) { $l =~ s/^(.).*(.)$/$1/; print "$1$2\n"};'
A complete working example that demonstrates the whole thing would be
#!/usr/bin/perl
use strict;
use warnings;
my @expressions = ("a=b","a>b","a<b","a!=b","a-b");
for my $exp (@expressions) {
$exp =~ s/^(.).*(.)$/$1$2/;
print "$1$2 is the same as $exp\n";
};

A very simple regex might be
/^(\w+)\s*(\W+)\s*(\w+)$/
Or you enumerate possible operators
/^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==)\s*(\w+)$/
It depends on whether the input can be trusted. If not, you might have to be more meticulous about the identifiers, too. Here's a simpler loop; there is no need for m//g (global). I wasn't sure about the semicolon, so I omitted it.
my @arr = ( "a=b", "a>b", "a<b", "a!=b", "a-b" );
for my $str (@arr){
if($str =~ /^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==)\s*(\w+)$/ ){
my $d = $1;
my $e = $3;
print "d=$d e=$e\n";
}
}
Later: if you enumerate the operators, you can also add word operators:
if($str =~ /^(\w+)\s*(=|!=|<|>|<=|>=|\+|-|\*|\/|==|x?or|and)\s*(\w+)$/ ){
...
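The enumerated-operator pattern can be wrapped in a small helper so the caller gets all three parts back at once. This is only a sketch: split_expr is a hypothetical name, and listing the two-character operators first is a stylistic choice made here for readability (with the trailing $ anchor, the original order should match the same strings).

```perl
use strict;
use warnings;

# Hypothetical helper: returns (left, operator, right), or an empty list
# when the string does not look like "word operator word".
sub split_expr {
    my ($str) = @_;
    return $str =~ /^(\w+)\s*(==|!=|<=|>=|=|<|>|\+|-|\*|\/)\s*(\w+)$/;
}

for my $str ("a=b", "a != b", "a<=b", "a-b", "a??b") {
    if (my ($left, $op, $right) = split_expr($str)) {
        print "$left [$op] $right\n";
    } else {
        print "no match: $str\n";
    }
}
```

Returning the match result directly works because a match in list context yields its captures, and an empty list when it fails.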

If there are always 'a' and 'b' at the beginning and the end, you could try:
my $str = 'a<b';
my( $op ) = $str =~ /^a(.*)b$/;

Not a well thought out answer. Will reconsider the problem.

Perl extract between start and end

I am aiming to extract a string from a start word to an end word (dIonly is the start and workset should be the end, including its parentheses); furthermore, I would like to print the output into a file named report.
I have had problems with lookbehind, as variable-length lookbehind is not implemented.
So I reversed the string in order to use lookahead instead. However, something is still not working.
I need to start from dIonly, which means I have to reverse the string to circumvent the problem described above: there are many workset(( in the whole string, so I can't start from there...
Thank you! I have edited the script now. What I need to do is reverse the string. I did that by splitting the string on spaces into a list, reversing the list, and joining it into a string again. I then split that string into a list again at the delimiter 'solution', as my output will contain several strings from which I want to extract dIonly to workset. This only works once the string is reversed, as otherwise I would encounter worksets that I do not want and extract a different string; dIonly is a distinct part of the pattern of the solution, from which I can work forward to the second workset (which itself is the first workset with 2 parentheses). Then I want to print it to a new output file. Any suggestions welcome!
This is a sample of the data:
... denotes that it continues after
..... maxRiskC(cA, 3)) c workset((RiskCA(cA, 3), RiskCB(cB, 2), maxRiskC(cA, 3))) c RiskCA(cA, 3) c RiskCB(cB, 2)) ***********
equation (built-in equation for symbol <=) 6 <= 40 ---> true
Solution 4 (state 31) states: 40 rewrites: 8421 in 5357394502ms cpu
(1464ms real) (0) rewrites/second) G:Game --> workset(empty) c playA
c dIonly c
.....
#!/usr/bin/perl
# perl -d ./perl_debugger.pl
use strict;
use Data::Dumper qw(Dumper);
use File::Slurp;
my @a_linesorig;
my @a_out;
my @a_str;
my $line;
my $reversedline;
my @a_linesrev;
my @reversedarray;
my $reversedline;
my $str;
open(my $fh, "<", "data.txt")
or die "cannot open < data.txt: $!";
my $line = read_file('data.txt');
@a_linesorig = split(' ', $line);
@a_linesrev = reverse(@a_linesorig);
$reversedline = join(' ', @a_linesrev); # joins the reversed list to a single string again
@reversedarray = split( /solution/, $reversedline ); # should split huge string into a list from one solution to next
foreach $str (@reversedarray) {
if ($str =~ /\bdIonly:\b(.*?)\bworkset\b/g);
print Dumper \$str;
print (@a_out, "$str");
}
close $fh
or die "can't close file: $!";
open(my $fh, ">", "output.txt")
or die "cannot open > output.txt: $!";
foreach $str (@a_out)
{
print ($fh "$str\n");
}
close $fh
or die "can't close file: $!";

Take out the reverse: applied to that scalar, it reverses the letters as well, not just the individual words.
You can try a greedy match instead, since you are only interested in the last workset:
# assumes "use feature 'say';" (or "use 5.010;") and that $input and $fh are open filehandles
while (my $line = <$input>) {
chomp $line;
if ($line =~ /.*workset(.*dIonly)/) {
# do something with results
say $fh "'$1'";
}
}
And if you need to reverse before writing to the file, you can do:
while (my $line = <$input>) {
chomp $line;
if ($line =~ /.*workset(.*dIonly)/) {
say $fh join " ",reverse (split / /,$1);
}
}
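To illustrate why the leading greedy .* selects the last workset, here is a sketch on a single in-memory line. The sample string is a shortened, hypothetical version of the question's data joined onto one line.

```perl
use strict;
use warnings;

# Hypothetical one-line sample loosely modelled on the question's data.
my $line = 'c workset((RiskCA(cA, 3))) c RiskCA(cA, 3) '
         . 'G:Game --> workset(empty) c playA c dIonly c';

# Without the leading .*, the match starts at the FIRST workset.
my ($from_first) = $line =~ /workset(.*dIonly)/;

# The greedy .* consumes as much as it can, so the workset that is
# finally matched is the LAST one that is still followed by dIonly.
my ($from_last) = $line =~ /.*workset(.*dIonly)/;

print "from first: '$from_first'\n";
print "from last:  '$from_last'\n";   # from last:  '(empty) c playA c dIonly'
```

If the real data spreads one solution over several lines, the lines would have to be joined (or the /s modifier used on a slurped string) before this approach applies.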

Perl Regex match works, but replace does not

I have put together a Perl script to go through a directory and match various keys in the source and output the results to a text file. The match operation works well, however the end goal is to perform a replace operation. The Perl script is as follows:
#!/usr/bin/perl
#use strict;
use warnings;
#use File::Slurp;
#declare variables
my $file = '';
my $verbose = 0;
my $logfile;
my @files = grep {/[.](pas|cmm|ptd|pro)$/i} glob 'C:\users\perry_m\desktop\epic_test\pascal_code\*.*';
#iterate through the files in input directory
foreach $file (@files) {
print "$file\n";
#read the file into a single string
open FILEHANDLE, $file or die $!;
my $string = do { local $/; <FILEHANDLE> };
#perform REGEX on this string
########################################################
#fix the include formats to conform to normal PASCAL
$count = 0;
while ($string =~ m/%INCLUDE/g)
{
#%include
$count++;
}
if ($count > 0)
{
print " $count %INCLUDE\n";
}
$count = 0;
while ($string =~ m/INCLUDE/g)
{
#%INCLUDE;
$count++;
}
if ($count > 0)
{
print " $count INCLUDE\n";
}
$count = 0;
while ($string =~ m/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/g)
{
#$1$2;
$count++;
}
if ($count > 0)
{
print " $count XXXX:include \n";
}
}
This produces output as desired, an example is below:
C:\users\perry_m\desktop\epic_test\pascal_code\BRTINIT.PAS
1 INCLUDE
2 XXXX:include
39 external and readonly
However, if I change the regex operations to try to implement a replace, using the replacement operation shown in the commented lines above, the script hangs and never returns. I imagine it is somehow related to memory, but I am new to Perl. I was also trying to avoid parsing the file line by line if possible.
Example:
while ($string =~ s/%INCLUDE/%include/g)
{
#%include
$count++;
}
and
while ($string =~ s/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/$1$2;/g)
{
#$1$2;
$count++;
}
Edit: simplified the examples

The problem is with your while loops. A loop like
while ($string =~ m/INCLUDE/g) { ... }
will execute once for each occurrence of INCLUDE in the target string, but a substitution like
$string =~ s/INCLUDE/%INCLUDE;/g
will make all of the replacements in one go and return the number of replacements made. So a loop
while ($string =~ s/INCLUDE/%INCLUDE;/g) { ... }
will endlessly add more and more percent signs before and semicolons after every INCLUDE.
To find the number of replacements made, change all your loops like this to just
$count = $string =~ s/INCLUDE/%INCLUDE;/g;

The pattern in s/INCLUDE/%INCLUDE/g also matches its own replacement, so if you run it in a while loop it will run forever (or until you run out of memory).
s///g replaces all matches in a single shot, so you will very rarely need to put it in a loop. The same goes for m//g: it does the counting in a single step if you put it in list context.
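Both counting idioms can be sketched without any loop. The input string below is invented for the example; the `() =` assignment is the usual way to force list context and then take the number of elements.

```perl
use strict;
use warnings;

my $string = "INCLUDE a\nINCLUDE b\nother\nINCLUDE c\n";  # hypothetical input

# m//g in list context returns all matches; assigning that list to ()
# and reading the result in scalar context yields how many there were.
my $matches = () = $string =~ /INCLUDE/g;

# s///g returns the number of substitutions made (false if there were none).
my $replaced = $string =~ s/INCLUDE/%INCLUDE;/g;

print "$matches matches, $replaced replacements\n";  # 3 matches, 3 replacements
```

Because the substitution runs exactly once, the fact that the replacement text still contains INCLUDE does no harm here.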

How to match exactly two empty lines

I have a question about regular expressions. I have a file and I need to parse it in such a way that I can distinguish some specific blocks of text in it. These blocks of text are separated by two empty lines (there are blocks which are separated by 3 or 1 empty lines, but I need exactly 2). So I have a piece of code with the regular expression /^\s*$^\s*$/, which I think should match, but it does not.
What is wrong?
$filename="yu";
open($in,$filename);
open(OUT,">>out.text");
while($str=<$in>)
{
unless($str = /^\s*$^\s*$/){
print "yes";
print OUT $str;
}
}
close($in);
close(OUT);
Cheers,
Yuliya

By default, Perl reads files a line at a time, so your regex never sees more than one newline. The following code reads chunks of text terminated by a double newline.
local $/ = "\n\n" ;
while (<> ) {
print "-- found $_" ;
}

New Answer
After having problems excluding >2 empty lines, and a good night's sleep, here is a better method that doesn't even need to slurp.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my @blocks; #each element will be an arrayref, one per block
#that referenced array will hold lines in that block
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 0;
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
$block_num++; # move on to next block
$empty = 0;
} else {
$empty = 0;
}
push @{ $blocks[$block_num] }, $line;
}
#write out each block to a new file
my $file_num = 1;
foreach my $block (@blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out join("\n", @$block);
}
In fact rather than store and write later, you could simply write to one file per block as you go:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 1;
open(OUT, '>', $block_num . '.txt');
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
close(OUT); #just learned this line isn't necessary, perldoc -f close
open(OUT, '>', ++$block_num . '.txt');
$empty = 0;
} else {
$empty = 0;
}
print OUT "$line\n";
}
close(OUT);

use 5.012;
open my $fh,'<','1.txt';
#slurping file
local $/;
my $content = <$fh>;
close $fh;
for my $block ( split /(?<!\n)\n\n\n(?!\n)/,$content ) {
say 'found:';
say $block;
}
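The behaviour of that split pattern can be checked on a small in-memory string. This is a sketch with invented content; it assumes the empty lines are truly empty (no spaces or tabs), as the pattern itself does.

```perl
use strict;
use warnings;

# Hypothetical content: "a" and "b" are separated by exactly two empty
# lines, "c" and "d" by three, and "d" and "e" by one.
my $content = "a\n\n\nb\nc\n\n\n\nd\n\ne\n";

# Three newlines in a row = two empty lines; the lookarounds reject
# runs that are longer than that.
my @blocks = split /(?<!\n)\n\n\n(?!\n)/, $content;

print scalar(@blocks), " blocks\n";   # 2 blocks
```

Only the gap between "a" and "b" is treated as a separator; the one-line and three-line gaps stay inside the second block.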

Deprecated in favor of new answer
justintime's answer works by telling Perl that you want to call the end of a line "\n\n", which is clever and will work well. One caveat is that this must match exactly. The regex you are using suggests that there might be whitespace on the "empty" lines, in which case this will not work. Also, his method will split on more than 2 line breaks as well, which the OP ruled out.
For completeness, to do it the way you were asking, you need to slurp the whole file into a variable (if the file is not so large as to use all your memory, probably fine in most cases).
I would then probably say to use the split function to split the block of text into an array of chunks. Your code would then look something like:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my $text;
open(my $fh, '<', $file);
{
local $/; # enables slurp mode inside this block
$text = <$fh>;
}
close($fh);
my @blocks = split(
/
(?<!\n)\n #check to make sure there isn't another \n behind this one
\s*\n #first whitespace only line
\s*\n #second "
(?!\n) #check to make sure there isn't another \n after this one
/x, # x flag allows comments and whitespace in regex
$text
);
You can then do operations on the array. If I understand your comment to justintime's answer, you want to write each block out to a different file. That would look something like
my $file_num = 1;
foreach my $block (@blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}
Notice that since you open $out lexically (with my) when it reaches the end of the foreach block, the $out variable dies (i.e. "goes out of scope"). When this happens to a lexical filehandle, the file is automatically closed. And you can do a similar thing to that with justintime's method as well:
local $/ = "\n\n" ;
my $file_num = 1;
while (<>) {
open(my $out, '>', $file_num++ . ".txt");
print $out $_; # each record read with the custom $/ is in $_
}
