Can't split a string by a parenthesis on Perl - regex

For example the following code:
$test_str = 'foo(bar';
@arr = split('(', $test_str);
causes the 500 error
Why?

As ikegami says, split expects a pattern as its first argument. A string will just be converted into a pattern. Because an open parenthesis ( has a special meaning, this will error. You need to escape it.
my @arr = split /\(/, $str;
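A minimal sketch of that fix, assuming the delimiter might also arrive in a variable. The names $delim, @by_escape and @by_quote are made up for illustration; \Q...\E is the standard way to have Perl quote metacharacters for you:

```perl
use strict;
use warnings;

my $test_str = 'foo(bar';

# Escape the parenthesis by hand so it is matched literally.
my @by_escape = split /\(/, $test_str;

# Hypothetical case: the delimiter is held in a variable.
# \Q...\E backslashes any regex metacharacters it contains.
my $delim    = '(';
my @by_quote = split /\Q$delim\E/, $test_str;

print join('|', @by_escape), "\n";   # prints "foo|bar"
print join('|', @by_quote),  "\n";   # prints "foo|bar"
```

Both forms give the same two fields; the \Q...\E form is mainly useful when you do not control the delimiter.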

According to perldoc -f split, the first argument to the split() function is a regular expression /PATTERN/. So if you were to write this:
split('some text', $string)
it would be equivalent to:
split( m/some text/, $string )
And if some text contains characters that are special in regular expressions, then they will be treated as such. So your line:
@arr = split('(', $test_str);
will be treated as:
@arr = split( m/(/, $test_str );
This is likely not what you want, as m/(/ is an invalid (you could say incomplete) regular expression. To match on a literal (, you need to escape it with a backslash, so use this instead:
@arr = split( m/\(/, $test_str );
By now you've noticed that Perl tries to be helpful by converting your first argument from the string '(' to the regular expression pattern m/(/. Although you can pass in a string as the first argument, I don't recommend it; use m/PATTERN/ instead.
The reason for my recommendation is:
A pattern makes it clear that the first argument is a regular expression pattern, and not just any old string.
According to perldoc -f split, there is a special case where you can pass in a string as the first argument:
As a special case, specifying a PATTERN of space (' ') will
split on white space just as "split" with no arguments does.
Thus, "split(' ')" can be used to emulate awk's default
behavior, whereas "split(/ /)" will give you as many null
initial fields as there are leading spaces. A "split" on
"/\s+/" is like a "split(' ')" except that any leading
whitespace produces a null first field. A "split" with no
arguments really does a "split(' ', $_)" internally.
It's good not to confuse the two. So use ' ' as the first argument to split() when you want to use the special case of splitting on whitespace, and use m/PATTERN/ as the first argument for every other case.
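To make the difference concrete, here is a small sketch comparing the three forms on one input. The sample string and variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

my $line = '  a b  c';   # two leading spaces, two spaces before "c"

my @awk_style  = split ' ',   $line;   # ('a', 'b', 'c')
my @per_space  = split / /,   $line;   # ('', '', 'a', 'b', '', 'c')
my @whitespace = split /\s+/, $line;   # ('', 'a', 'b', 'c')

# Field counts for the three variants.
printf "%d %d %d\n",
    scalar @awk_style, scalar @per_space, scalar @whitespace;   # prints "3 6 4"
```

Only the ' ' special case drops the leading empty fields, which matches the perldoc description quoted above.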

Perl is returning hash when I am trying to find the characters after a searched-for character

I want to search for a given character in a string and return the character after it.
Based on a post here, I tried writing
my $string = 'v' . '2';
my $char = $string =~ 'v'.{0,1};
print $char;
but this returns 1 and a hash (last time I ran it, the exact output was 1HASH(0x11823a498)). Does anyone know why it returns a hash instead of the character?
Return a character after a specific pattern (a character here)
my $string = 'example';
my $pattern = qr(e);
my ($ret) = $string =~ /$pattern(.)/; #--> 'x'
This matches the first occurrence of $pattern in the $string, and captures and returns the next character, x. (The example doesn't handle the case when there may not be a character following, like for the other e; it would simply fail to match so $ret would stay undef.)
I use qr operator to form a pattern but a normal string would do just as well here.
The regex match operator returns different things in scalar and list contexts: in the scalar context it is true/false for whether it matched, while in the list context it returns matches. See perlretut
So you need that match to be in list context, and a common way to provide that is to put the variable being assigned to in parentheses.
The first problem with the example in the question is that the =~ operator binds more tightly than the . operator, so the example is effectively
my $char = ( ($string =~ 'v') . {0,1} );
So there's first the regex match, which succeeds and returns 1 (since it is in the scalar context, imposed by the . operator) and then there is a hash-reference {0,1} which is concatenated to that 1. So $char gets assigned the 1 concatenated with a stringification for a hashref, which is a string HASH(0x...) (in the parens is a hex stringification of an address).
Next, the needed . in the pattern isn't there. Got confused with the concatenation . operator?
Then, the capturing parentheses are absent, although they are needed for the intended subpattern.
Finally, the match is in scalar context, as mentioned, which would only yield true/false.
Altogether, that would need to be
my ($char) = $string =~ ( q{v} . q{(.)} );
But I'd like to add: while Perl has very fluid semantics I'd recommend to not build regex patterns on the fly like that. I'd also recommend to actually use delimiters in the match operator, for clarity (even though you indeed mostly don't have to).
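The points above can be seen side by side in a short sketch. The variable names are illustrative, and the pattern is written with explicit delimiters as recommended:

```perl
use strict;
use warnings;

my $string = 'v2';

# Scalar context: the match reports success (1), not the capture.
my $matched = $string =~ /v(.)/;

# List context (note the parentheses): the match returns its captures.
my ($char) = $string =~ /v(.)/;

# No character follows the "v", so the match fails and $none stays undef.
my ($none) = 'v' =~ /v(.)/;

# What the question's code effectively built: 1 followed by a stringified hashref.
my $oops = 1 . {0, 1};

print "$matched $char\n";                             # prints "1 2"
print "none is ", (defined $none ? "defined" : "undef"), "\n";   # prints "none is undef"
print "$oops\n";                                      # something like 1HASH(0x...)
```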

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                        Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to put parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also put parentheses around the =~ binding expression to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the \w (word) character class:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using the x flag so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex will match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ is the word between two spaces.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
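The answers above can be combined into a small helper. This is only a sketch: word_after is a hypothetical name, and it assumes that "word" means a run of \w characters and that anything else (blanks, punctuation) may separate the two words:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word following $word in $text,
# or undef when no word follows it.
sub word_after {
    my ($text, $word) = @_;
    # \Q...\E quotes $word; \W+ skips blanks and punctuation;
    # (\w+) captures the next word. List context returns the capture.
    my ($next) = $text =~ /\b\Q$word\E\W+(\w+)/;
    return $next;
}

my $plain   = word_after("Today i'm not going anywhere except to office.",  'anywhere');
my $punct   = word_after("Today i'm not going anywhere; except to office.", 'anywhere');
my $nothing = word_after("Today i'm not going anywhere.",                   'anywhere');

print "$plain $punct\n";                                        # prints "except except"
print "nothing: ", (defined $nothing ? $nothing : 'undef'), "\n";   # prints "nothing: undef"
```

Using \w+ rather than \S+ for the capture keeps trailing punctuation such as a final period out of the result.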

Why `stoutest` is not a valid regular expression?

From perlop:
If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.
So I can pick up any letter as a delimiter. Eventually this regex should be fine:
stoutest
That can be rewritten
s/ou/es/
However, it does not seem to work in Perl. Why?
$ perl -e '$_ = qw/ou/; stoutest; print'
ou
Because Perl can't pick out the operator s
perldoc perlop says this
Any non-whitespace delimiter may replace the slashes. Add space after the s when using a character allowed in identifiers.
This program works fine
use feature 'say';   # needed for say
my $s = 'bout';
$s =~ s toutest;
say $s;
output
best
Because stoutest, or any other string of alphanumeric characters, is a single token in the eyes of the Perl parser. Otherwise we couldn't use any barewords that begin with s (or m, or q, or y).
This works, though
$_ = "ou";
s toutest;
print
The substitution operator starts with an s identifier, and your code doesn't have one. Gotta use
s toutest
If it worked the way you think, we couldn't have any operators or subroutines that start with m, s, tr, q or y since all of them can be followed by any non-whitespace delimiter.
Ironically, your very own code demonstrates why it can't be the way you think. If it worked the way you think
$_ = qw/ou/; stoutest; print
wouldn't be equivalent to
$_ = qw/ou/; s/ou/es/; print
It would be equivalent to
$_ = q'/ou/; stoutest; print
aka
$_ = '/ou/; stoutest; print
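A short sketch of the tokenizing point made in these answers. The subroutine here is hypothetical and exists only to show that stoutest is an ordinary identifier, while s followed by whitespace starts a substitution:

```perl
use strict;
use warnings;

# Because "stoutest" is a single identifier token, it can name a subroutine.
sub stoutest { return 'called as a subroutine' }

my $as_call = stoutest;   # a plain sub call, not a substitution

# With a space after the s, "t" becomes the delimiter: same as s/ou/es/
my $s = 'bout';
$s =~ s toutest;

print "$as_call\n";   # prints "called as a subroutine"
print "$s\n";         # prints "best"
```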

Is it possible to make conditional regex of following?

Hello I am wondering whether it is possible to do this type of regex:
I have certain characters representing objects, i.e. @, #, $, and operations that may be used on them like +, -, %, and so on. Every object has a different set of operations and I want my regex to find valid pairs.
So for example I want pairs @+, #-, $+ to be matched, but pair $- not to be matched as it is invalid.
So is there any way to do this with regexes only, without doing some gymnastics inside language using regex engine?
Put every object with its own rules in []:
/(#[+-]|\$[+]|@[+-])/
You need to properly escape special characters.
Gymnastics is hard. Try something like /@\+|#-|\$\+/ or something like that.
Just remember, +, $, and ^ are reserved, so they'll need to be escaped.
Another approach, mix not allowed with raw combinations, but this might be slower.
/(?!\$-|\$\%)([\#\$\@][+\-\%])/, though not if there are many alternations of the first character.
my $str = '
@+, #-, $+ to be matched,
but pair $- not to be matched as it is invalid.
$% $- #% $%
';
my $regex =
    qr/
        (?!\$-|\$\%)          # Specific combinations not allowed
        (
            [\#\$\@][+\-\%]   # Raw combinations allowed
        )
    /x;
while ( $str =~ /$regex/g ) {
    print "found: '$1'\n";
}
__END__
Output:
found: '@+'
found: '#-'
found: '$+'
found: '#%'
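If the rules change often, the alternation can be generated from a table instead of written by hand. This is only a sketch: the %ops_for table is an assumption (objects @, #, $ with the operations implied by the first answer's pattern), and quotemeta is used so that no character needs manual escaping:

```perl
use strict;
use warnings;

# Assumed rule table: object => list of operations allowed on it.
my %ops_for = (
    '@' => [ '+', '-' ],
    '#' => [ '+', '-' ],
    '$' => [ '+' ],
);

# Build one alternative per object, e.g. \$[\+], quoting every character.
my $alternation = join '|', map {
    my $object = $_;
    my $ops    = join '', map { quotemeta($_) } @{ $ops_for{$object} };
    quotemeta($object) . "[$ops]";
} sort keys %ops_for;

my $pair_re = qr/($alternation)/;

# In list context with /g, the match returns every captured pair.
my @found = '@+, #-, $+ are valid, $- is not' =~ /$pair_re/g;
print "found: '$_'\n" for @found;
```

The invalid pair $- is skipped because no alternative in the generated pattern allows - after $.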

How can I convert a string into a regular expression that matches itself in Perl?

How can I convert a string to a regular expression that matches itself in Perl?
I have a set of strings like these:
Enter your selection:
Enter Code (Navigate, Abandon, Copy, Exit, ?):
and I want to convert them to regular expressions so I can match something else against them. In most cases the string is the same as the regular expression, but not in the second example above because the ( and ? have meaning in regular expressions. So that second string needs to become an expression like:
Enter Code \(Navigate, Abandon, Copy, Exit, \?\):
I don't need the matching to be too strict, so something like this would be fine:
Enter Code .Navigate, Abandon, Copy, Exit, ..:
My current thinking is that I could use something like:
s/[\?\(\)]/./g;
but I don't really know what characters will be in the list of strings and if I miss a special char then I might never notice the program is not behaving as expected. And I feel that there should exist a general solution.
Thanks.
As Brad Gilbert commented use quotemeta:
my $regex = qr/^\Q$string\E$/;
or
my $quoted = quotemeta $string;
my $regex2 = qr/^$quoted$/;
There is a function for that: quotemeta.
quotemeta EXPR
Returns the value of EXPR
with all non-"word" characters
backslashed. (That is, all characters
not matching /[A-Za-z_0-9]/ will be
preceded by a backslash in the
returned string, regardless of any
locale settings.) This is the internal
function implementing the \Q escape in
double-quoted strings.
If EXPR is omitted, uses $_.
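A minimal sketch of why the quoting matters, using the second prompt from the question. The variable names are illustrative:

```perl
use strict;
use warnings;

my $prompt = 'Enter Code (Navigate, Abandon, Copy, Exit, ?):';

# Unquoted, the parentheses form a group and "?" makes the space optional,
# so the pattern no longer matches the literal prompt.
my $naive  = qr/^$prompt$/;

# \Q...\E applies quotemeta, backslashing every non-word character.
my $quoted = qr/^\Q$prompt\E$/;

my $naive_ok  = $prompt =~ $naive  ? 1 : 0;
my $quoted_ok = $prompt =~ $quoted ? 1 : 0;

print "naive matches: $naive_ok, quoted matches: $quoted_ok\n";   # prints "naive matches: 0, quoted matches: 1"
print quotemeta('(?)'), "\n";                                     # prints "\(\?\)"
```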
From http://www.regular-expressions.info/characters.html :
there are 11 characters with special meanings: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket )
In Perl (and PHP) there is a special function quotemeta that will escape all these for you.
To put Brad Gilbert's suggestion into an answer instead of a comment: you can use the quotemeta function. All credit to him.
Why use a regular expression at all? Since you aren't doing any capturing and it seems you will not be going to allow for any variations, why not simply use the index builtin?
$s1 = 'hello, (world)?!';
$s2 = 'he said "hello, (world)?!" and nothing else.';
if ( -1 != index $s2, $s1 ) {
print "we've got a match\n";
}
else {
print "sorry, no match.\n";
}