Regex: confirm if an optional portion was matched

I have a string that can be of two forms, and it is unknown which form it will be each time:
hello world[0:10]; or hello world;
There may or may not be the brackets with numbers. The two words (hello and world) can vary. If the brackets and numbers are there, the first number is always 0 and the second number (10) varies.
I need to capture the first word (hello) and, if it exists, the second number (10). I also need to know which form of the string it was.
hello world[0:10]; I would capture {hello, 10, form1}, and hello world; I would capture {hello, form2}. I don't really care how the "form" is formatted, I just need to be able to differentiate. It can be a bit (1=form1, 0=form2), structural (form1 puts me in one scope and form2 another), etc.
I currently have the following (now working) regex:
/(\w*) \s \w* (?:\[0:(\d*)\])?;/x
This gives me $1 = hello and potentially $2 = 10. I now just need to know if the bracketed numbers were there or not. This will be repeated many times, so I can't assume $2 = undef going into the regex. $2 could also be the same thing a few times in a row so I can't just look for a change in $2 before and after the regex.
My best solution so far is to run the regex twice, the first time with the brackets and the second time without:
if( /(\w*) \s \w* \[0:(\d*)\];/x ) {...}
elsif( /(\w*) \s \w*;/x ) {...}
This seems very inefficient and inelegant though so I was wondering if there is a better way?

You can use ? to optionally match portions of your regex. Then you can capture the output directly as a return value from the regex.
my $re = qr{ (\w+) \s+ \w+ (?:\[0:(\d+)\])?; }x;
if( my($word, $num) = $line =~ $re ) {
say "Word: $word";
say "Num: $num" if defined $num;
}
else {
say "No match";
}
(?:\[0:(\d+)\])? says there may be a [0:\d+]. (?:) makes the grouping non-capturing so only \d+ is captured.
$1 and $2 are also safe to use: they are reset on each successful match, so $2 is undef whenever the optional group did not take part in the match. Using lexical variables just makes things more explicit.
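As a minimal sketch of the point above, the definedness of the second capture is enough to tell the two forms apart. The sub name classify and the hash keys are made up for illustration; the pattern assumes the "word word[0:N];" layout described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref describing the line,
# or undef when the line does not match at all.
sub classify {
    my ($line) = @_;
    my $re = qr{ (\w+) \s+ \w+ (?: \[0:(\d+)\] )? ; }x;
    my ($word, $num) = $line =~ $re
        or return;
    return {
        word => $word,
        num  => $num,                  # undef when the brackets were absent
        form => defined $num ? 1 : 2,  # 1 = with brackets, 2 = without
    };
}

for my $line ('hello world[0:10];', 'hello world;') {
    my $r = classify($line);
    printf "%s -> word=%s form=%d\n", $line, $r->{word}, $r->{form};
}
```

The list assignment returns the number of elements on its right-hand side, so the "or return" only fires when the match itself fails, not when the optional capture is undef.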

Related

Perl string manipulation and find

I am currently working on a phonebook program for a class and I am having a little bit of trouble with the regex part, both for formatting my text and for finding what I'm looking for. Firstly, I am having trouble editing my phone number text into the form I want. I am able to find the text that has 7 digits in a row (7777777) but I am unable to substitute it with (1-701-777-7777).
if($splitIndex[1] =~ m/^(\d{3}\d{4})/) {
$splitIndex[1] =~ s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/;
print "Updated: $splitIndex[1]";
}
When I run this code the output is not what I expect (I can't embed the image here; the output is at https://imgur.com/a/8HtW7xm).
Secondly, I am having trouble doing the actual regex part for the searching. I save all the possible letter combinations in $letOfSearch and the number order combination in $numOfSearch. Through playing around with regexes I have figured out that if I do [$numOfSearch]+[$numOfSearch[-1]...[$numOfSearch[1] it gives me the correct find for the numbers, but I am unable to write it properly in my code.
#If user input is only numbers
if($searchValue =~ m/(\D)/) {
#print "Not a number\n";
if($splitIndex[1] =~ m/([$numOfSearch]+)/) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
if($splitIndex[0] =~ m/([$letOfSearch])/i) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
$found = 0;
} else {
#If it is a number, search for that number combo immediately
if($splitIndex[1] =~ m/([$numOfSearch]+)/) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
if($splitIndex[0] =~ m/([$letOfSearch])/i) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
$found = 0;
}
}
}
Instead of:
if($splitIndex[1] =~ m/^(\d{3}\d{4})/) {
$splitIndex[1] =~ s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/;
print "Updated: $splitIndex[1]";
}
try this:
if ($splitIndex[1] =~ s/(\d{3})(\d{4})/1-701-$1-$2/)
{
print "Updated: $splitIndex[1]";
}
In regular expressions, a set of square brackets ([ and ]) will match one and only one character, regardless of what's between the brackets. So when you write [\d{3}][\d{4}], that will match exactly two characters, because you are using two sets of []. And each of those two characters will be one of \d (any digit), {, 3, 4, or }, because that's what you wrote inside the brackets.
The order doesn't matter inside of the square brackets of a regular expression, so [\d{3}] is the same as [}1527349806{3]. As you can see, that's probably not what you wanted.
What you meant to do was capture the \d{3} and \d{4} strings, and you do that with a regular set of capturing parentheses, like this: (\d{3})(\d{4})
Since you had only one set of parentheses (that is, you had ([\d{3}][\d{4}])) and it contained exactly two []s, it was putting exactly two characters into $1, and nothing at all into $2. That's why, when you attempted to use $2 in the second half of your s///, it was complaining about an uninitialized value in $2. You were attempting to use a value ($2) that simply wasn't set.
(Also, you were doing two sets of matches: One for the m//, and one for the s///. I simply removed the m// match and kept the s/// match, using its return value to determine if we need to print() anything.)
The second part of the s/// does not use regular expressions, so any [, ], {, }, (, or ) will show up literally as that character. So if you don't want square brackets in the final phone number, don't use them. That's why I used s/.../1-701-$1-$2/; instead of s/.../1-701-[$1]-[$2]/;.
So when you wrote s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/, the ([\d{3}][\d{4}]) part was putting two characters into $1, and nothing into $2. That's why you got a result that contained [77] (which was $1 surrounded by brackets) and [] (which was $2 (an uninitialized value) surrounded by brackets).
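A small sketch of the corrected substitution, wrapped in a hypothetical helper named format_phone. The ^ and $ anchors are an added assumption here (that the field holds nothing but the seven digits); the 1-701 prefix comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: reformat a 7-digit number, leave anything else untouched.
sub format_phone {
    my ($number) = @_;
    # Two capture groups: $1 gets the first three digits, $2 the last four.
    $number =~ s/^(\d{3})(\d{4})$/1-701-$1-$2/;
    return $number;
}

print format_phone('7777777'), "\n";   # 1-701-777-7777
print format_phone('77'), "\n";        # unchanged: 77
```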
As for the second part of your post, I notice that you use a lot of capturing parentheses in your regular expressions, but you never actually use what you capture. That is, you never use $1 (or $2). For example, you write:
if($searchValue =~ m/(\D)/) {
which has m/(\D)/, yet you never use $1 anywhere in that code. I wonder: What's the point of capturing that non-digit character if you don't use it anywhere in your code?
I've seen programmers get confused and mix up the purpose of parentheses and square brackets. When using regular expressions, square brackets ([ and ]) match (not capture) exactly one character. What they match is not put in $1, $2, or any other $n.
Parentheses, on the other hand, capture whatever they match, by setting $1 (or $2, $3, etc.) to what was matched. In general, you shouldn't use parentheses unless you plan on capturing and using that match later. (The main exception to this rule is if you need to group a set of matches, like this: m/I have a (cat|dog|bird)/.)
Many programmers confuse square brackets and parentheses in regular expressions, and try to use them interchangeably. They'll write something like m/I have a [cat|dog|bird]/ and not realize that it's the same as m/I have a [abcdgiort|]/ (which doesn't capture anything, since there are no parentheses), and wonder why their program complains that $1 is an uninitialized value.
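The difference can be seen in a short sketch; the two sub names are hypothetical and exist only to put the patterns side by side.

```perl
use strict;
use warnings;

# Parentheses: alternation between whole words, captured in $1.
sub pet_via_group {
    my ($text) = @_;
    return $text =~ /I have a (cat|dog|bird)/ ? $1 : undef;
}

# Brackets: exactly one character from the set, nothing captured.
sub matches_via_class {
    my ($text) = @_;
    return $text =~ /I have a [cat|dog|bird]/ ? 1 : 0;
}

print pet_via_group('I have a dog'), "\n";        # dog
print matches_via_class('I have a truck'), "\n";  # 1, because "t" is in the set
```

The bracketed version accepts "truck" because it only checks that the next single character is one of a, b, c, d, g, i, o, r, t or |.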
This is a common mistake, so don't feel bad if you didn't know the difference. Now you know, and hopefully you can figure out what needs to be corrected in the second part of your code.
I hope this helps.

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: in list context, =~ returns the list of captured items, so the assignment target needs to be a list. In scalar context it only reports whether the match succeeded.
Note 2: allow one or more blanks after anywhere
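The two notes can be checked with a short sketch that runs the same match in scalar and in list context. The variable name $matched is introduced here for illustration; $words and $w_after are from the question.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar assignment: the match is evaluated in scalar context (true/false).
my $matched = ( $words =~ /anywhere\s+(\S+)/ );

# List assignment: the match is evaluated in list context (the captures).
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $matched\n";   # 1
print "list context:   $w_after\n";   # except
```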
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character classes \w and \W:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex will match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between the two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order. A 5 always has to come right after a 4, so a number such as 322195458900023 is a problem because of the 545 in it.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
    (\d)            # match a digit, save it in $1
    (\d)            # match another digit, save it in $2
    (??{            # start postponed regex
        $1 < $2     # if $1 is smaller than $2
        ? ""        # then return the empty string (i.e. succeed)
        : "(*FAIL)" # else return the *FAIL verb
    })              # close postponed regex
}x;                 # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: the string is composed of any number (zero or more) of pieces, each of which is either a run of characters that are not a five, or a five that is followed by a character that is not a four or by the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to look for any 5 that is not preceded by a 4; if one is found, the check fails:
if( $str =~ /(?<!4)5/ ) {
#Fail
}
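A minimal sketch of the negative look-behind check, wrapped in a hypothetical helper named fives_follow_fours. Note the assumption: it only verifies that every 5 directly follows a 4, not that every 4 is followed by a 5.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every 5 in the string directly follows a 4.
sub fives_follow_fours {
    my ($str) = @_;
    # (?<!4)5 finds a 5 whose preceding character is not a 4.
    return $str =~ /(?<!4)5/ ? 0 : 1;
}

for my $n ('1452345', '322195458900023', '12367') {
    printf "%s: %s\n", $n, fives_follow_fours($n) ? 'ok' : 'bad';
}
```

The sample number from the question is rejected because its first 5 follows a 9.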

Negative lookahead assertion with the * modifier in Perl

I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (including zero) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my @strings = ('something <#> QQQ',
               'something <#> RRR',
               'something <#>QQQ' ,
               'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);
sub rep {
my $string = shift;
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
    $string =~ s,<#> *,at w/o ,;
}
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces and a negative look-ahead would need to be anchored, much like Mark Reed has shown. This one might be the simplest.
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
                         +--------------- Matches "<#>".
                         |   +----------- Matches "" (empty string).
                         |   |      +--- Doesn't match " QQQ".
                         |   |      |
                        --- ----   ---
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
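The following sketch applies the "spaces inside the look-ahead" variant to the strings from the question. The sub name rep_fixed is hypothetical, and consuming the trailing spaces after the look-ahead is an added choice to keep the output tidy.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite of the question's rep() using /<#>(?![ ]*QQQ)/.
sub rep_fixed {
    my ($string) = @_;
    # First try the "no QQQ after optional spaces" case; the look-ahead now
    # covers the spaces too, so backtracking cannot sneak past them.
    $string =~ s{<#>(?![ ]*QQQ)[ ]*}{at w/o }
        or $string =~ s{<#>[ ]*QQQ}{at w/ QQQ};
    return $string;
}

print rep_fixed($_), "\n"
    for 'something <#> QQQ', 'something <#> RRR',
        'something <#>QQQ',  'something <#>RRR';
```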

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even if the "fail1:19" part does exist. If I delete the "?", it matches only if the "fail1:19" part exists, and the $2 variable will be "fail1:19". But if the "fail1:19" part doesn't exist, neither $1 nor $2 matches. This is incorrect.
How can I rewrite this pattern to capture the two numbers correctly? That means when the "fail1:19" part exists, two numbers will be recorded, and when it doesn't exist, only the number after "ok:" will be recorded.
First, the number in the fail field would end up in $3, as those variables are numbered according to their opening parentheses. Second, as codaddict shows, the .* construct in a regex is greedy, so it will eat even the fail... part. Third, you can avoid numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Changes made:
.* in the middle has been made non-greedy, and
$ (end anchor) has been added.
As a result of above changes .*? will try to consume as little as possible and the end anchor forces the regex to match till the end of the string, matching fail1:number if present.
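A minimal sketch of the non-greedy, end-anchored pattern follows. The sub name parse_checksession is hypothetical, and as a small variation the outer group is made non-capturing so that the fail count lands directly in the second capture.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (ok_count, fail_count); fail_count is undef
# when the line has no fail1 field, and the list is empty on no match.
sub parse_checksession {
    my ($line) = @_;
    my ($ok, $fail) =
        $line =~ /^checksession\s+ok:(\d+).*?(?:fail1:(\d+))?$/
        or return;
    return ($ok, $fail);
}

my ($ok, $fail) = parse_checksession(
    'checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19');
print "ok=$ok fail=", $fail // 'none', "\n";   # ok=6178 fail=19
```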
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (@bar){
(@fields) = split/ /,$line;
$reg_suc = $fields[1];
$reg_fail = $fields[5];
print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";