Perl Validate Regex and Loop Captures - regex

I need to validate a string like "[one.two.three]", where titles are separated by ".", there has to be a minimum of at least one title. Each title needs to be extracted. Is there any way to do this in a loop or does it have to be two separate steps?
use strict;
use warnings;
my #tests = ("[one]", "[two.three.four]");
foreach (#tests) {
while ($_ =~ /^\[(\w+)(?:\.\w+)*\]$/) {
print "$1\n";
}
print "\n\n\n";
}

Yes, it makes sense to separate the validation and extraction:
my ($titles) = $input =~ /^\[((?:\w+|\b\.\b)+)\]\z/
or die "invalid input $input.\n";
my #title = split /\./, $titles;
Though you could do it all at once, I think the readability suffers:
my #title = split /\./, ( $input =~ /^\[((?:\w+|\b\.\b)+)\]\z/ )[0] // ''
or die "invalid input $input.\n";

Related

The perfect regex to extract a particular string in perl

I have a text file abc.txt that looks like this:
dQdC(sA1B2C3,sC5) = A lot of stuff
a = b = c
Baseball
dQdC(sC2V3X1,sD5) = A lot of stuff again
Now I want create two arrays in perl, one of which will contain A1B2C3 and C2V3X1, the other array will contain C5 and D5. I don't care about the other intermediate lines. To achieve this goal, I am trying this perl script:
for (my $in=0;$in<=$#lines;$in++){
if ($lines[$in]=~/dQdC\(s([A-Z0-9]+?),s([A-Z0-9]+?)\)/) {
print "1111"; #this line is just to check if it is at all going inside the loop
#A = $1;
#B = $2;
}
However, it is not even going inside the loop. So I guess I did something wrong with the regex. Will someone please tell me what I am doing wrong here?
my (#a, #b);
while ($file =~ /^dQdC\(s(\w+),s(\w+)\)/mg) {
push #a, $1;
push #b, $2;
}
or
my (#a, #b);
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push #a, $1;
push #b, $2;
}
}
Working with parallel arrays isn't nice.
Alternative 1: Hash
my %hash = $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg;
or
my %hash;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
$hash{$1} = $2;
}
}
Alternative 2: AoA
use List::Util qw( pairs ); # 1.29+
my #pairs = pairs( $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg );
or
my #pairs;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push #pairs, [ $1, $2 ];
}
}
If the format of your target lines is always as shown
use warnings;
use strict;
my $file = ...
my (#ary_1, #ary_2);
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
my ($v1, $v2) = /dQdC\(s([^,]+),s([^\)]+)/ or next;
push #ary_1, $v1;
push #ary_2, $v2;
}
which captures between ( and a , and then between a , and ). The first pattern might as well be s(.*?), as there is no benefit of the negated character class since the following , still need be matched (but I left it with [^...] for consistency with the other one).
Comments
In general better process a file line-by-line, unless there are specific reasons to read it first
C-style loop is rarely needed. To iterate over array index use for my $i (0..$#ary)
Please use warnings; and use strict; always
Try this:
(?<=\(s)([A-Z0-9]+)(?=,)
It matches substrings that come between (s and , using lookbehind and lookahead.
Similarily, use (?<=,s)([A-Z0-9]+)(?=\)) to capture the substrings between ,s and ).
Putting them together, you can create two capturing groups, each containing the different kind of substrings: (A1B2C3, C2V3X1), (C5, D5)

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

perl match single occurence pattern in string

I have a list of names and I want to look for names containing two given letters asigned using variables.
$one = "A";
$two = "O";
Please note that I want those letters to be present anywhere in the checked names, so that I can get outputs like this:
Jason
Damon
Amo
Noma
Boam
...
But each letter must only be present once per name, meaning that this wouldn't work.
Alamo
I've tried this bit of code but it doesn't work.
foreach my $name (#list) {
if ($name =~ /$one/) {
if ($name =~ /$two/) {
print $name;
}}
else {next}; }
How about this?
for my $name (#list) {
my $ones = () = $name =~ /$one/gi;
my $twos = () = $name =~ /$two/gi;
if ($ones == 1 && $twos == 1) {
print $name;
}
}
#!/usr/bin/env perl
#
# test.pl is the name of this script
use warnings;
use strict;
my %char = map {$_ => 1} grep {/[a-z]/} map {lc($_)} split //, join '', #ARGV;
my #chars = sort keys %char; # the different characters appearing in the command line arguments
while (my $line = <STDIN>)
{
grep {$_ <=> 1} map {scalar(() = $line =~ /$_/ig )} #chars
or print $line;
}
Now:
echo hello world | test.pl fw will print nothing (w occurs exactly once in hello world, but f does not)
echo hello world | test.pl hw will print a line consisting of hello world (both h and w occur exactly once).
One way to get it all into a single regex is to use an expression within the regex pattern to search for the other letter (a or o) based on which one was found first:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;
while(<DATA>){
chomp;
say if m/^
[^ao]* # anything but a or o
([ao]) # an 'a' or 'o'
[^ao]* # anything but a or o
(??{($1 and lc($1) eq 'a') ? 'o' : 'a'}) # the other 'a' or 'o'
[^ao]* $/xi; # anything but a or o
}
__DATA__
Abdi
Chelal
Jucal
Husham
Gallio
Pileser
Tekoa
Kenaz
Raphah
See the perlre section on Extended Expressions for more info.
This is my solution. You don't make it clear whether there will always be just two single-character strings to match but I have assumed that there may be more
Unfortunately the classical way of counting occurrences of a character -- tr/// -- doesn't interpolate variables into its searchlist and doesn't have a case-independent modifier /i. But the pattern-match operator m// does, so that is what I have used
I thoroughly dislike the so-called goatse operator, but there isn't a neater way that I know of that allows you to count the number of times a global regex pattern matches
I could have used a grep for the inner loop, but I went for a regular for loop and a next with a label as I believe it's more readable this way
use strict;
use warnings;
use v5.10.1;
use autodie;
my #list = do {
open my $fh, '<', 'names.txt';
<$fh>;
};
chomp #list;
my ($one, $two) = qw/ A O /;
NAME:
for my $name ( #list ) {
for ( $one, $two) {
my $count = () = $name =~ /$_/gi;
next NAME unless $count == 1;
}
say $name;
}
output
Gallio
Tekoa
Achbor
Clopas
This is the input that I used
Abdi
Chelal
Jucal
Husham
Gallio
Pileser
Tekoa
Kenaz
Raphah
Tiras
Jehudi
Bildad
Shemidah
Meshillemoth
Tabeel
Achbor
Jesus
Osee
Elnaam
Rephah
Asaiah
Er
Clopas
Penuel
Shema
Marsena
Jaare
Joseph
Shamariah
Levi
Aphses

Perl Regex match works, but replace does not

I have put together a Perl script to go through a directory and match various keys in the source and output the results to a text file. The match operation works well, however the end goal is to perform a replace operation. The Perl script is as follows:
#!/usr/bin/perl
#use strict;
use warnings;
#use File::Slurp;
#declare variables
my $file = '';
my $verbose = 0;
my $logfile;
my #files = grep {/[.](pas|cmm|ptd|pro)$/i} glob 'C:\users\perry_m\desktop\epic_test\pascal_code\*.*';
#iterate through the files in input directory
foreach $file (#files) {
print "$file\n";
#read the file into a single string
open FILEHANDLE, $file or die $!;
my $string = do { local $/; <FILEHANDLE> };
#perfrom REGEX on this string
########################################################
#fix the include formats to conform to normal PASCAL
$count = 0;
while ($string =~ m/%INCLUDE/g)
{
#%include
$count++;
}
if ($count > 0)
{
print " $count %INCLUDE\n";
}
$count = 0;
while ($string =~ m/INCLUDE/g)
{
#%INCLUDE;
$count++;
}
if ($count > 0)
{
print " $count INCLUDE\n";
}
$count = 0;
while ($string =~ m/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/g)
{
#$1$2;
$count++;
}
if ($count > 0)
{
print " $count XXXX:include \n";
}
}
This produces output as desired, an example is below:
C:\users\perry_m\desktop\epic_test\pascal_code\BRTINIT.PAS
1 INCLUDE
2 XXXX:include
39 external and readonly
However if I change the regex operations to try and implement a replace, using the replacement operation shown in the commented lines above, the scripts hangs and never returns. I imagine it is somehow related to memory, but I am new to Perl. I was also trying to avoid parsing the file by line if possible.
Example:
while ($string =~ s/%INCLUDE/%include/g)
{
#%include
$count++;
}
and
while ($string =~ s/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/$1$2;/g)
{
#$1$2;
$count++;
}
Edit: simplified the examples
The problem is with your while loops. A loop like
while ($string =~ m/INCLUDE/g) { ... }
will execute once for each ocurrence of INCLUDE in the target string, but a subtitution like
$string =~ s/INCLUDE/%INCLUDE;/
will make all of the replacement in one go and retuen the number of replacements made. So a loop
while ($string =~ s/INCLUDE/%INCLUDE;/g) { ... }
will endlessly add more and more percentage signs before and semicolons after every INCLUDE.
To find the number of replacements made, change all your loops like this to just
$count = $string =~ s/INCLUDE/%INCLUDE;/g
the pattern in s/INCLUDE/%INCLUDE/g will match the replacement also, so if you're running it in a while loop it will run forever (until you run out of memory).
s///g will replace all matches in a single shot so you very rarely will need to put it in a loop. Same goes for m//g, it will do the counting in a single step if you put it in list context.

In Perl, how can I remove all spaces that are not inside double quotes " "?

I'm tying to come up with some regex that will remove all space chars from a string as long as it's not inside of double quotes (").
Example string:
some string with "text in quotes"
Result:
somestringwith"text in quotes"
So far I've come up with something like this:
$str =~ /"[^"]+"|/g;
But it doesn't seem to be giving the intended result.
I'm honestly very new at perl and haven't had too much regexp experience. So if anyone willing to answer would also be willing to provide some insight into the why and how that would be great!
Thanks!
EDIT
String will not contain escaped "'s
It should actually always be formatted like this:
Some.String = "Some Value"
Result would be
Some.String="Some Value"
Here is a technique using split to separate the quoted strings. It relies on your data being consistent and will not work with loose quotes.
use strict;
use warnings;
my #line = split /("[^"]*")/;
for (#line) {
unless (/^"/) {
s/[ \t]+//g;
}
}
print #line; # line is altered
Basically, you split up the string in order to isolate the quoted strings. Once that is done, perform the substitution on all other strings. Since the array elements are aliased in the loop, substitutions are performed on the actual array.
You can run this script like so:
perl -n script.pl inputfile
To see the output. Or
perl -n -i.bak script.pl inputfile
To do in-place edit on inputfile, while saving backup in inputfile.bak.
With that said, I'm not sure what your edit means. Do you want to change
Some.String = "Some Value"
to
Some.String="Some Value"
Text::ParseWords is tailor-made for this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
my #strings = (
q{This.string = "Hello World"},
q{That " string " and "another shoutout to my bytes"},
);
for my $s ( #strings ) {
my #words = quotewords '\s+', 1, $s;
print join('', #words), "\n";
}
Output:
This.string="Hello World"
That" string "and"another shoutout to my bytes"
Using Text::ParseWords means if you ever had to deal with quoted strings with escaped quotation marks in them, you'd be ready ;-)
Also, this sounds like you have a configuration file of some sort and you're trying to parse it. If that is the case, there are probably better solutions.
I suggest removing the quoted substrings using split and then recombining them with join after removing whitespace from the intermediate text.
Note that if the regex used for split contains captures then the captured values will also be included in the list returned.
Here's some sample code.
use strict;
use warnings;
my $source = <<END;
Some.String = "Some Value";
Other.String = "Other Value";
Last.String = "Last Value";
END
print join '', map {s/\s+// unless /"/; $_; } split /("[^"]*")/, $source;
output
Some.String= "Some Value";Other.String = "Other Value";Last.String = "Last Value";
I would simply loop through the string char by char. This way you can handle escaped strings too (just add an isEscaped variable).
my $text='lala "some thing with quotes " lala ... ';
my $quoteOpen = 0;
my $out;
foreach $char(split//,$text) {
if ($char eq "\"" && $quoteOpen==0) {
$quoteOpen = 1;
$out .= $char;
} elsif ($char eq "\"" && $quoteOpen==1) {
$quoteOpen = 0;
$out .= $char;
} elsif ($char =~ /\s/ && $quoteOpen==1) {
$out .= $char;
} elsif ($char !~ /\s/) {
$out .= $char;
}
}
print "$out\n";
Splitting on double quotes, removing spaces only from even fields (i.e. those in quotes):
sub remove_spaces {
my $string = shift;
my #fields = split /"/, $string . ' '; # trailing space needed to keep final " in output
my $flag = 1;
return join '"', map { s/ +//g if $flag; $flag = ! $flag; $_} #fields;
}
It can be done with regex:
s/([^ ]*|\"[^\"]*\") */$1/g
Note that this won't handle any kind of escapes inside the quotes.