Recently I have become friendly with regular expressions and used them to overcome a number of tasks very efficiently. As with most of Perl, TIMTOWTDI has clouded my judgement. There are times I can use either the equality operator or the binding operator. However, are there times where it is more appropriate to use one over the other?
Firstly the simplified case
my $name = 'Chris';
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
This is the most simplified case, where using the equality operator is less typing. However, in this simplified example, is there any benefit to one or the other?
In a slightly more complex example
my $name = 'Christopher';
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
Here the binding operator is less typing. However, I am not sure of the benefit either may hold over the other.
So is the general rule to use the equality operator if you are matching an entire string against a fixed value, and the binding operator if you are matching a string against a pattern, for example any string of 5 digits, /\d{5}/?
Is it inappropriate to use the binding operator in the above examples? I appreciate that these examples are made up and may not reflect a real-life problem. However, they were the ones I thought of to try to explain my question.
however in this simplified example is there any benefit to one or the other.
Well, they're not equivalent. /^Chris$/ matches Chris and Chris followed by a newline.
If you had used an equivalent pattern (/^Chris\z/), the difference would have been performance. A single string comparison will be faster than a regex match. It's also clearer.
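The difference between $ and \z can be checked directly. The following is a minimal sketch; the helper name compare_matches is made up for illustration and is not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: reports which of the three tests accept the string.
sub compare_matches {
    my ($name) = @_;
    return {
        eq     => ($name eq 'Chris')    ? 1 : 0,
        dollar => ($name =~ /^Chris$/)  ? 1 : 0,   # $ also matches before a trailing newline
        backz  => ($name =~ /^Chris\z/) ? 1 : 0,   # \z matches only at the very end
    };
}

for my $candidate ("Chris", "Chris\n") {
    my $r = compare_matches($candidate);
    (my $shown = $candidate) =~ s/\n/\\n/;          # make the newline visible
    printf "%-8s eq=%d  \$=%d  \\z=%d\n", $shown, @{$r}{qw(eq dollar backz)};
}
```

Only the /^Chris$/ test accepts the string with the trailing newline; eq and /^Chris\z/ agree with each other on both inputs.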
For more complicated comparisons, you generally want to go with what's simpler, clearer, and more maintainable. Address performance (by profiling and running benchmarks) when it becomes an issue.
I would expect slightly (if at all) better performance from the eq operator because the regular expression might require a compilation phase as well as analysis before coming up with its determination.
So in the case:
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
... I would expect the first statement to be fastest.
In the second example, however, you have to consider the summed times of the failed cases where you've provided a logical OR:
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
... here things are less cut and dried. Sure, eq may be faster, but are two eqs faster than a regular expression which doesn't have to backtrack (in this example)? I can't be so sure.
Usually you won't have to consider the performance benefits. So you can't argue one is "better" than the other - I'd usually encourage code clarity in this situation. But it's important to realise that eq is very unforgiving while regular expressions are very flexible - allowing for case-insensitive searches, anchoring to just the beginning, etc. When you do hit some code in which comparison speed is critical then ultimately you'll want to benchmark.
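If it does come to benchmarking, something along these lines could be a starting point. This is only a sketch: the sub names match_eq and match_re and the sample names are assumptions, and the numbers will vary by machine and Perl version, so none are shown here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two hypothetical implementations of the same test.
sub match_eq { my ($n) = @_; return ($n eq 'Chris' || $n eq 'Christopher') ? 1 : 0 }
sub match_re { my ($n) = @_; return ($n =~ /^Chris(?:topher)?\z/)          ? 1 : 0 }

# Mix of matching and failing inputs, since failed cases cost time too.
my @names = ('Chris', 'Christopher', 'Christine', 'Bob');

# A negative count asks Benchmark to run each case for at least that many CPU seconds.
cmpthese(-1, {
    eq    => sub { match_eq($_) for @names },
    regex => sub { match_re($_) for @names },
});
```

The pattern uses \z rather than $ so that both subs accept exactly the same strings; otherwise the comparison would not be like for like.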
The power of regular expressions is realized in their variability.
When you give a regex engine a template, you "suggest" match outcomes to the engine.
Internally, it's the same C "strncmp()" and such as you would get in Perl with, e.g., $str eq "asdf"; both are templates.
However, you cannot describe variability very well with just a language; that's why regular expression engines exist.
There is an overhead to "entering" the engine, i.e. resetting variables, state tracking, etc.
But after that, the engine will outperform any combination of language constructs you can conceive of. Not by a little, but by a huge, huge percentage.
(sorry for the title, but this "feature" really confuses me)
Learning Perl, I learned that the o modifier for a regular expression using variables would be evaluated only once, even if the variable changes after initial evaluation.
Initially that reads like having no issues, this being clearly specified.
Obviously that initial evaluation cannot happen before the variable being used has got its value.
Now qr made life a bit more interesting.
Consider this code (executed in a loop defining other variables, too):
{
my $n = $name;
$n =~ s/[^\w\.-]/_/g;
$n = qr:^${n}\#${another_variable}$:o;
@a = grep { !/$n/ } @a;
}
When using the regex for qr directly, one could argue that the regex is compiled only once, even if the scope with the variable goes out of scope (is going out of scope being considered as a change of the variable?)
But when using qr to build a regex, assigning it to a lexical variable, the compiled regex would go out of scope, so I was expecting that the regex cannot be reused and would be re-built (The basic idea was that the regex inside grep shouldn't be rebuilt for every iteration).
As life is cruel, it seems the whole regex referenced by $n isn't ever rebuilt, so the first value is used until the program stops.
Interestingly, perlre(1) in Perl 5.18.2 (the version being used) does not mention the o modifier any more, and Perl 5.26.1 says in the corresponding page:
o - pretend to optimize your code, but actually introduce bugs
So can anybody explain the rules for "once" evaluation (and whether the semantics had changed over the lifespan of Perl)?
Related:
Does the 'o' modifier for Perl regular expressions still provide any benefit?
How do you force Perl to re-compile a regex compiled with "/o" on demand?
Perl has a couple of constructs that don't just store state in variables, but also some states in the opcodes themselves. Aside from /o regex patterns, this also includes the .. flip-flop operator (in scalar context) or state variables.
Perhaps state variables are clearest, since they correspond to static local variables in many other languages (e.g. C). A state variable is initialized at most once during the lifetime of the program. An expression state $var = initialize() can be understood as
my $var;
if (previously_initialized) {
$var = cached_value;
} else {
$var = initialize();
}
This does not track dependencies in the initialize() expression, but only evaluates it once.
Similarly, it can make sense to consider a regex pattern /.../o as a kind of hidden state variable state $compiled_pattern = qr/.../.
The /o feature was a good idea a very long time ago when regexes were compiled on the fly, similarly to how it works in other languages where regex patterns are provided to a search function as a string.
It hasn't been necessary for performance purposes for a long time, and only has effects when doing variable interpolation. But if you actually want that behaviour, using a state variable would communicate that intent more clearly. Thus, I'd argue that there is no appropriate use for the /o modifier.
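As a rough illustration of the state-variable alternative, the sketch below freezes a pattern on the first call. The sub name replace_first_pattern is invented for this example.

```perl
use strict;
use warnings;
use feature qw(say state);

# Hypothetical helper: the pattern is compiled from the first $v it ever sees.
sub replace_first_pattern {
    my ($str, $v) = @_;
    state $re = qr/$v/;              # initialized on the first call only
    (my $copy = $str) =~ s/$re/X/;   # later calls reuse the first pattern
    return $copy;
}

# The pattern stays qr/a/, so 'a1' is rewritten on every call,
# even when 'b' and 'c' are passed in.
say replace_first_pattern('a1', $_) for qw(a b c);
```

Unlike /o, the "once" behaviour is visible at the declaration, so a reader does not have to know about a largely obsolete modifier to see what is going on.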
The "largely obsolete /o" (perlop) flag still has the "once" meaning and operation. While it is barely mentioned in perlre, and only in passing, it is addressed in perlop:
/PATTERN/msixpodualngc
...
... Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a /o (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed. But now, the only reasons to use /o are one of:
The variables are thousands of characters long and you know that they don't change, and you need to wring out the last little bit of speed by having Perl skip testing for that. (There is a maintenance penalty for doing this, as mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.)
you want the pattern to use the initial values of the variables regardless of whether they change or not. (But there are saner ways of accomplishing this than using /o.)
If the pattern contains embedded code, such as
use re 'eval';
$code = 'foo(?{ $x })';
/$code/
then perl will recompile each time, even though the pattern string hasn't changed, to ensure that the current value of $x is seen each time. Use /o if you want to avoid this.
The bottom line is that using /o is almost never a good idea.
So, indeed, apparently it won't even test whether variables to interpolate changed, and this may have a legitimate use. But, indeed, all told it probably shouldn't be used.
An example to demonstrate the "once" operation
perl -Mstrict -wE'
sub tt {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say tt( q(a1), $_ ) for qw(a b c)'
With the /o, either on the qr-ed pattern or on the regex, this matches (changes) that a1 string every time even though a is passed for the pattern only in the first iteration. Clearly the pattern isn't recompiled since the variable later has b and then c and shouldn't match.
Without /o only the first iteration has the regex matching.
In every iteration the lexical $re clearly goes away, with the whole function, but its originally compiled value keeps getting used. This means that some of the operation of /o is "very" global. Such stubbornly global behavior is not unheard of with some other ancient features (glob comes to mind, for one).
It goes the same way if the sub is a code reference in a lexical variable, remade all the time in a dynamic scope.
perl -Mstrict -wE'
for my $c (qw(a b c)) {
my $tt = sub {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say $tt->( q(a1), $c )
}'
Prints X1 all three times.
A Perl 6 Regex is a more specific type of Method, so I had the idea that maybe I could do something black-magicky in a regular method that produces the same thing. I particularly am curious about doing this without changing any grammars.
However, looking at Perl6/Grammar.nqp (which I barely understand), it seems that this is really not an inheritance thing. I think, based on my reading, that the Perl 6 grammar switches slangs (sub languages) when it sees one of the regex declarators. That is, a different grammar parses the guts of regex { ... } and method {...}.
So, first, is that right?
Then, just for giggles, I thought that maybe I could be inside a method block but tell it to use a different slang (see for instance, "Slangs" from the 2013 Perl 6 Advent Calendar or "Slangs Today").
However, everything I've found looks like it wants to change the grammar. Is there a way to do it without that and return a string that is treated as if it had come out of regex { ... }?
method actually-returns-a-regex {
...
}
I don't have any practical use for this. I just keep wondering about it.
First of all, the Perl 6 design documents mandate an API where regexes return a lazy list of possible matches. If Rakudo adhered to that API, you could easily write a method that acted as a regex, but parsing would be very slow (because lazy lists tend to perform much worse than a compact list of string positions (integers) that act as a backtracking stack).
Instead, Perl 6 regexes return matches. And you can do the same. Here is an example of a method that is called like a regex inside of a grammar:
grammar Foo {
token TOP { a <rest> }
method rest() {
if self.target.substr(self.pos, 1) eq 'b' {
return Match.new(
orig => self.orig,
target => self.target,
from => self.pos,
to => self.target.chars,
);
}
else {
return Match.new();
}
}
}
say Foo.parse('abc');
say Foo.parse('axc');
Method rest implements the equivalent of the regex b.*. I hope this answers your question.
Update: I might have misunderstood the question. If the question is "How can I create a regex object" (and not "how can I write code that acts like a regex", as I understood it), the answer is that you have to go through the rx// quoting construct:
my $str = 'ab.*';
my $re = rx/ <$str> /;
say 'fooabc' ~~ $re; # Output: 「abc」
I'm wondering about using constants in perl regex's. I want to do something similar to:
use constant FOO => "foo";
use constant BAR => "bar";
$somvar =~ s/prefix1_FOO/prefix2_BAR/g;
of course, in there, FOO resolves to the three letters F O O instead of expanding to the constant.
I looked online, and someone was suggesting using either ${\FOO} or @{[FOO]}. Someone else mentioned (?{FOO}). I was wondering if anyone could shed some light on which of these is correct, and if there's any advantage to any of them. Alternatively, is it better to just use a non-constant variable? (Performance is a factor in my case.)
There's not much in the way of reasons to use a constant over a variable. It doesn't make a great deal of difference - perl will compile a regex anyway.
For example:
#!/usr/bin/perl
use warnings;
use strict;
use Benchmark qw(:all);
use constant FOO => "foo";
use constant BAR => "bar";
my $FOO_VAR = 'foo';
my $BAR_VAR = 'bar';
sub pattern_replace_const {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
}
sub pattern_replace_var {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_$FOO_VAR/prefix2_$BAR_VAR/g;
}
cmpthese(
1_000_000,
{ 'const' => \&pattern_replace_const,
'var' => \&pattern_replace_var
}
);
Gives:
          Rate const   var
const 917095/s    --   -1%
var   923702/s    1%    --
Really not enough in it to worry about.
However, it may be worth noting that you can compile a regex with qr// and do it that way, which, provided the RE is static, might improve performance (but it might not, because perl can detect static regexes, and does that itself).
             Rate   var const compiled
var      910498/s    --   -2%      -9%
const    933097/s    2%    --      -7%
compiled 998502/s   10%    7%       --
With code like:
my $compiled_regex = qr/prefix1_$FOO_VAR/;
sub compiled_regex {
my $somvar = "prefix1_foo test";
$somvar =~ s/$compiled_regex/prefix2_$BAR_VAR/g;
}
Honestly though - this is a micro optimisation. The regex engine is fast compared to your code, so don't worry about it. If performance is critical to your code, then the correct way of dealing with it is first write the code, and then profile it to look for hotspots to optimise.
The shown problem is due to those constants being barewords (built at compile time)
Constants defined using this module cannot be interpolated into strings like variables.
In the current implementation (of constant pragma) they are "inlinable subroutines" (see † ).
This problem can be solved nicely by using a module like Const::Fast
use Const::Fast;
const my $foo => 'FOO';
const my $bar => 'BAR';
my $var = 'prefix1_FOO_more';
$var =~ s/prefix1_$foo/prefix2_$bar/g;
Now they do get interpolated. Note that more complex replacements may need /e.
These are built at runtime so you can assign results of expressions to them. In particular, you can use the qr operator, for example
const my $patt => qr/$foo/i; # case-insensitive
The qr is the recommended way to build regex patterns. (It interpolates unless the delimiter is '.) The performance gain is most often tiny, but you get a proper regular expression, which can be built and used as such (and in this case a constant as well).
I recommend Const::Fast module over the other one readily, and in fact over all others at this time. See a recent article with a detailed discussion of both. Here is a review of many other options.
I strongly recommend to use a constant (of your chosen sort) for things meant to be read-only. That is good for the health of the code, and of developers who come into contact with it (yourself in the proverbial six months included).
† These being subroutines, we need to be able to run code in order to have them evaluated and replaced by given values. Can't just "interpolate" (evaluate) a variable -- it's not a variable.
A way to run code inside a string (which need be interpolated, so effectively double quoted) is to de-reference, where there's an expression in a block under a reference; then the expression is evaluated. So we need to first make that reference. So either
say "@{ [FOO] }"; # make array reference, then dereference
or
say "${ \FOO }"; # make scalar reference then dereference
prints foo. See the docs for why this works and for examples. Thus one can do the same inside a regex, and both in matching and replacement parts
s/prefix1_${\FOO}/prefix2_${\BAR}/g;
(or with @{[...]}), since they are evaluated as interpolated strings.
Which is "better"? These are tricks. There is rarely, if ever, a need for doing this. It has a very good chance to confuse the reader. So I just wouldn't recommend resorting to these at all.
As for (?{ code }), that is a regex feature, whereby code is evaluated inside a pattern (matching side only). It is complex and tricky and very rarely needed. See about it in perlretut and in perlre.
Discussing speed of these things isn't really relevant. They are certainly outside the realm of clean and idiomatic code, while you'd be hard pressed to detect runtime differences.
But if you must use one of these, I'd much rather interpolate inside a scalar reference via a trick than reach for a complex regex feature.
According to PerlMonks, you had better create an already-interpolated string if you are concerned about performance:
use constant PATTERN => 'def';
my $regex = qr/${\(PATTERN)}/; #options such as /m can go here.
if ($string =~ $regex) { ... }
Here is the link to the whole discussion.
use constant FOO => "foo";
use constant BAR => "bar";
$var =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
Credit: https://www.perlmonks.org/?node_id=293323
I'm writing a simple Perl script that translates assembly instruction strings to 32-bit binary code.
I decided to handle translation grouping instruction by type (ADD and SUB are R-Type instructions and so on...) so in my code I'm doing something like this:
my $bin = &r_type($instruction) if $instruction =~ /^(?:add|s(?:ub|lt|gt))\s/;
because I want to handle add, sub, slt and sgt in the same way.
I realized however that maybe using that regular expression could be an 'overkill' for the task I'm supposed to do... could the pattern
/^(?:add|sub|slt|sgt)\s/
represent a better use of regular expressions in this case?
Thanks a lot.
Unless you are using a perl older than 5.10, the simple alternation will perform better anyway (see here), so there is no reason to try to optimize it.
Instead of burying the mnemonics inside regular expressions, build a dispatch table using a hash. It will be at least as fast, and your code will be far easier to follow:
my %emitter = (add => \&r_type,
sub => \&r_type,
slt => \&r_type,
sgt => \&r_type,
...);
if ($instruction =~ /^(\S+)/) {
my $emitter = $emitter{$1} // die "bad instruction $instruction";
$emitter->($1, $instruction);
}
else {
# error?...
}
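A self-contained version of the same idea might look like the sketch below. The body of r_type and the wrapper sub translate are placeholders invented for illustration; the real encoder would emit the 32-bit binary code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real R-type encoder.
sub r_type {
    my ($mnemonic, $instruction) = @_;
    return "R-type($mnemonic)";
}

# All four mnemonics share one handler.
my %emitter = map { $_ => \&r_type } qw(add sub slt sgt);

# Hypothetical wrapper: look up the handler by the first word of the line.
sub translate {
    my ($instruction) = @_;
    my ($mnemonic) = $instruction =~ /^(\S+)/
        or die "empty instruction\n";
    my $emit = $emitter{$mnemonic}
        or die "bad instruction: $instruction\n";
    return $emit->($mnemonic, $instruction);
}

print translate('add $t0, $t1, $t2'), "\n";
```

Adding a new instruction type then means adding hash entries rather than editing a regular expression.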
I like salva's dispatch table (I show a lot of that in Mastering Perl), but I'll answer another aspect of the question in case you need that answer for a different problem someday.
When you want to build some alternations, some of which might be nested, you can use something like Regexp::Trie to build the alternation for you so you don't look at the ugly regex syntax:
use Regexp::Trie;
my $rt = Regexp::Trie->new;
foreach ( qw/add sub slt sgt/ ) {
$rt->add($_);
}
print $rt->regexp, "\n";
That gives you:
(?-xism:(?:add|s(?:gt|lt|ub)))
This way, you list the opcodes like Jonathan suggested, but also get the alternation. As ysth noted, you might get this for free with Perl now anyway.
Your second version is simpler, more readable, and more maintainable. The performance difference will depend on the regex implementation, but I suspect the nested version will run slower due to its increased complexity.
Yes it's overkill.
I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would be too long to explain, so I'll just outline the way. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical computer scientist would use a finite state automaton. That's an abstract machine that has a finite number of states; on each tick it reads a character from the input and jumps to another state. The pattern of where to jump from a particular state when a particular character is read is fixed. Some states are marked as "OK" and some as "FAIL", so that by examining the state of the machine you can check whether your text is "valid" (e.g. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
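To make the machine concrete, here is a small sketch of an automaton that accepts only the word "nice", written as a transition table. The state names and the sub name accepts are made up for this example.

```perl
use strict;
use warnings;

# Transition table: state => { input character => next state }.
# Any missing entry means the machine rejects.
my %transition = (
    start => { n => 'n'    },
    n     => { i => 'ni'   },
    ni    => { c => 'nic'  },
    nic   => { e => 'nice' },
);
my %accepting = (nice => 1);

# Hypothetical driver: feed the word one character at a time.
sub accepts {
    my ($word) = @_;
    my $state = 'start';
    for my $char (split //, $word) {
        $state = $transition{$state}{$char};
        return 0 unless defined $state;   # no transition: reject
    }
    return $accepting{$state} ? 1 : 0;
}

printf "%-6s %s\n", $_, accepts($_) ? 'valid' : 'invalid' for qw(nice nic nicer mice);
```

The only memory the machine has is the current state, which is one of a fixed, finite set. That is the limitation the rest of this answer relies on.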
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it should contain a memory cell that is capable of holding an arbitrary number (the repetition count in your case). Finite automata and classical regular expressions cannot do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question, you can't solve your problem using anything "regular" only. However, regexps could aid you in solving this problem, as in tster's answer.
Perhaps I should look closer at tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is: it just contains a print statement (not essential) and a loop, and the loop concept is compatible with the expressive power of a finite state automaton. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing repeatedly that string given number of times, where the number is an arbitrary integer, is undoable on a finite state machine without additional memory. You use a memory cell to keep that number and decrease it, and check for it to be greater than zero. But this number may be arbitrarily big, and it contradicts with a finite memory available to the finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from @tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
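For what it's worth, Perl also accepts the (?P<name>...) syntax from the question, so the named groups can be kept. The sketch below assumes the task is to repeat each string by the number that follows it, and the sub name expand is invented here. It also rejects inputs such as "a1b" that do not consist entirely of string/number pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: expands e.g. "a1b2c3" pair by pair.
sub expand {
    my ($line) = @_;
    my $out = '';
    # \G anchors each match where the previous one ended;
    # /c keeps pos() in place when the final match attempt fails.
    while ($line =~ /\G(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)/gc) {
        $out .= $+{str} x $+{n};
    }
    # If pos() is not at the end, there is leftover text that is not a pair.
    die "unparsed input at offset ", pos($line) // 0, "\n"
        if (pos($line) // 0) != length $line;
    return $out;
}

print expand('a1b2c3'), "\n";
```

The regex still only does the matching; the repetition comes from the x operator, which is the part that needs memory beyond what a finite automaton has.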
Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
Probably depends on what build of what brand of regex the interviewer is using, which probably depends on the flight dynamics of golf balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. @Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = @"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
// can be achieved using StringBuilder or String.Join/Enumerable.Repeat
// don't use both
//new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
+ Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that don't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.