A Perl 6 Regex is a more specific type of Method, so I had the idea that maybe I could do something black-magicky in a regular method that produces the same thing. I particularly am curious about doing this without changing any grammars.
However, looking at Perl6/Grammar.nqp (which I barely understand), I see that this is really not an inheritance thing. I think, based on my reading, that the Perl 6 grammar switches slangs (sublanguages) when it sees one of the regex declarators. That is, a different grammar parses the guts of regex { ... } and method { ... }.
So, first, is that right?
Then, just for giggles, I thought that maybe I could be inside a method block but tell it to use a different slang (see for instance, "Slangs" from the 2013 Perl 6 Advent Calendar or "Slangs Today").
However, everything I've found looks like it wants to change the grammar. Is there a way to do it without that and return a string that is treated as if it had come out of regex { ... }?
method actually-returns-a-regex {
...
}
I don't have any practical use for this. I just keep wondering about it.
First of all, the Perl 6 design documents mandate an API where regexes return a lazy list of possible matches. If Rakudo adhered to that API, you could easily write a method that acted as a regex, but parsing would be very slow (because lazy lists tend to perform much worse than a compact list of string positions (integers) that act as a backtracking stack).
Instead, Perl 6 regexes return matches. And you can do the same. Here is an example of a method that is called like a regex inside of a grammar:
grammar Foo {
    token TOP { a <rest> }
    method rest() {
        if self.target.substr(self.pos, 1) eq 'b' {
            return Match.new(
                orig => self.orig,
                target => self.target,
                from => self.pos,
                to => self.target.chars,
            );
        }
        else {
            return Match.new();
        }
    }
}
say Foo.parse('abc');
say Foo.parse('axc');
Method rest implements the equivalent of the regex b.*. I hope this answers your question.
Update: I might have misunderstood the question. If the question is "How can I create a regex object" (and not "how can I write code that acts like a regex", as I understood it), the answer is that you have to go through the rx// quoting construct:
my $str = 'ab.*';
my $re = rx/ <$str> /;
say 'fooabc' ~~ $re; # Output: 「abc」
Related
I am building a lex program that will analyze something like the following...
function myFunc {
if a = b {
print "Cool"
}
}
Is it possible, specifically using flex, to create a regex that will single out everything in the first { }?
So I will get
{ if a = b { print "Cool" } }
instead of
{ if a = b { print "Cool" }
Currently in my flex file I have this regex
{[^\0]*}
One problem with what you are trying to do is that regular expressions are greedy by default (you could use some tricks to change that, but you'll still have problems), so you will match more than intended if you run this on a file with multiple functions in it. The deeper reason is that regular expressions describe Type 3 (regular) languages in the Chomsky hierarchy, while arbitrarily nested braces require at least a Type 2 (context-free) grammar, and full programming languages often have context-sensitive parts on top of that. It is fundamentally impossible to directly parse the latter using the former without a LOT of work. The full explanation for that is ... long. But it boils down to this: a regular expression has no way to remember how deeply nested it currently is, while a context-free parser can keep track of that. In your case, you don't want to match any ole' }, you want to match the } that corresponds to a given open {, which involves counting the number of { and } you have seen so far.
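The counting idea can be sketched outside of flex. The following is a minimal JavaScript illustration, not a flex rule; the helper name firstBalancedBlock and the sample input are made up for this example, and it assumes braces never appear inside strings or comments.

```javascript
// Hypothetical helper: returns the first balanced { ... } block in `text`,
// or null if there is no opening brace or the braces never balance.
function firstBalancedBlock(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;

  let depth = 0; // number of currently open braces
  for (let i = start; i < text.length; i++) {
    if (text[i] === '{') {
      depth++;
    } else if (text[i] === '}') {
      depth--;
      // Depth back at zero: this } closes the first {.
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // ran out of input with braces still open
}

// Sample input modelled on the question, followed by a second function.
const source =
  'function myFunc {\n  if a = b {\n    print "Cool"\n  }\n}\nfunction other { }';

console.log(firstBalancedBlock(source));
```

Because the loop stops at the brace that brings the depth back to zero, the second function in the sample is not swallowed, which is the failure a greedy pattern would show.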
If you really want to do code parsing without having to re-invent the wheel, the plow, fire, steel, and all the way up to electricity, I would recommend that you go check out ANTLR over on GitHub. ANTLR will allow you to create a grammar (if one does not already exist) for the language you are trying to parse and provide the parsed source code to you in the form of a parse tree. Parse trees are very, very easy to use, and ANTLR has grammars already for almost every language imaginable, and plugins for several languages.
Other than that, both the online regex tester I used and Notepad++ with your sample code matched everything. You could try the RegEx {.*} which also matches everything.
I had a programming interview a few days ago and was asked to write a piece of code in Perl with the functionality described in the title. After a while, I came up with the following solution:
sub startWithUppercaseLetter {
    return @_[0] =~ m/^[A-Z]/;
}
The interviewer seemed unhappy with this solution. Can anybody give a better one? Thanks.
I would write
sub starts_with_capital {
shift =~ /^[A-Z]/;
}
Your own solution doesn't survive use warnings, giving
Scalar value @_[0] better written as $_[0]
and it is bad practice to use upper case letters in local identifiers.
I don't think this is a good fit for the task in the title, since your regular expression returns either an empty result or a match; which of those do you want as the definition of the problem to solve? The interviewer may also have imagined having to type this function name over and over again just to check whether something starts with a capital.
So many ways to do it in Perl.
return $_[0] if $_[0] =~ /^[A-Z]/;
return;
The m is not really needed: with / as the delimiter, m/.../ and /.../ mean the same thing, and you only care whether the first character is a capital, not about newlines and so on. Your way can, yes, return an empty match and works the same; for an interview, make it readable, or provide two examples: longhand as above and then shorthand.
Recently I have become friendly with regular expressions and used them to overcome a number of tasks very efficiently. As with most things in Perl, TIMTOWTDI has clouded my judgement. There are times when I can use either the equality operator or the binding operator. However, are there times when it is more appropriate to use one over the other?
Firstly the simplified case
my $name = 'Chris';
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
This is the most simplified case, where using the equality operator is less typing. However, in this simplified example, is there any benefit to one or the other?
In a slightly more complex example
my $name = 'Christopher';
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
Here the binding operator is less typing. However, I am not sure of the benefit either may hold over the other.
So is the general rule to use the equality operator if you are matching an entire string against a fixed value, and to use the binding operator if you're matching a string against a pattern, for example any five-digit string /\d{5}/?
Is it inappropriate to use the binding operator in the above examples? I appreciate that these examples are just made up and may not reflect a real-life problem. However, they were the ones I thought of to try to explain my question.
however in this simplified example is there any benefit to one or the other.
Well, they're not equivalent. /^Chris$/ matches Chris and Chris followed by a newline.
If you had used an equivalent pattern (/^Chris\z/), the difference would have been performance. A single string comparison will be faster than a regex match. It's also clearer.
For more complicated comparisons, you generally want to go with what's simpler, clearer, and more maintainable. Address performance (by profiling and running benchmarks) when it becomes an issue.
I would expect slightly (if at all) better performance from the eq operator because the regular expression might require a compilation phase as well as analysis before coming up with its determination.
So in the case:
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
... I would expect the first statement to be fastest.
In the second example, however, you have to consider the summed times of the failed cases where you've provided a logical OR:
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
... here things are less cut and dried. Sure, eq may be faster, but are two eqs faster than a regular expression which doesn't have to backtrack (in this example)? I can't be so sure.
Usually you won't have to consider the performance benefits. So you can't argue one is "better" than the other - I'd usually encourage code clarity in this situation. But it's important to realise that eq is very unforgiving while regular expressions are very flexible - allowing for case-insensitive searches, anchoring to just the beginning, etc. When you do hit some code in which comparison speed is critical then ultimately you'll want to benchmark.
The power of regular expressions lies in their variability.
When you give a regex engine a template, you "suggest" match outcomes to the engine.
Internally, it's the same C strncmp() and the like that Perl uses for $str eq "asdf"; both are templates.
However, you cannot describe variability very well with just the language, and that's why regular expression engines exist.
There is an overhead to "entering" the engine, i.e. resetting variables, state tracking, etc.
But after that, the engine will outperform any combination of language constructs you can conceive of. Not by a little, but by a huge, huge percentage.
Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript which doesn't support any kind of lookbehind)
result = subject.match(/(?:\\.|[^|])*/g);
This has led me to wondering: Is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i. e., using everything that modern regexes have at their disposal like backreferences and lookaround.)
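The match()-based approach from the question can be tried directly. The sketch below uses a made-up input string; note that because of the * quantifier, the pattern also matches the empty string at each separator and at the end of the input, so the raw result needs a filtering step. Dropping empty strings is an assumption of this example and would also drop fields that are legitimately empty.

```javascript
// Made-up sample input with one escaped pipe.
// The JavaScript literal 'a|b\\|c|d' is the text a|b\|c|d.
const subject = 'a|b\\|c|d';

// Pattern from the question: runs of escaped characters or non-pipe characters.
const raw = subject.match(/(?:\\.|[^|])*/g);

// The raw result contains empty matches at the separators and at the end.
// Assumption: no field is legitimately empty, so they can simply be dropped.
const fields = raw.filter((s) => s !== '');

console.log(raw);
console.log(fields);
```

The empty matches are one place where split() and match() differ in practice: split() reports an empty field between two adjacent separators, while the filtered match() result above cannot distinguish that case.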
The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you’re looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: perform a search for a match. Everything else is just an extension.
A good example of this is JavaScript, where RegExp.prototype.exec actually performs the match search. Any other method that accepts a regular expression (e.g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
// pseudo-implementations
RegExp.prototype.test = function(str) {
    // test() only reports whether exec() found a match
    return this.exec(str) !== null;
};
String.prototype.match = function(pattern) {
    return RegExp(pattern).exec(this);
};
String.prototype.search = function(pattern) {
    // search() reports the match position, or -1 if there is none
    var m = RegExp(pattern).exec(this);
    return m ? m.index : -1;
};
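The claim that splitting is just repeated matching can be sketched the same way. This is a minimal illustration, not how any engine actually implements String.prototype.split; the helper name splitWithExec is made up for this example, and unlike the built-in it simply skips empty matches.

```javascript
// Hypothetical helper: split `input` at every non-empty match of `pattern`,
// using nothing but repeated calls to exec().
function splitWithExec(input, pattern) {
  // Copy the regex with the global flag so exec() advances through the string.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const re = new RegExp(pattern.source, flags);

  const parts = [];
  let last = 0; // end of the previous separator
  let m;
  while ((m = re.exec(input)) !== null) {
    if (m[0] === '') {
      // Skip empty matches; advancing lastIndex avoids an infinite loop.
      re.lastIndex++;
      continue;
    }
    parts.push(input.slice(last, m.index)); // text before this separator
    last = m.index + m[0].length;
  }
  parts.push(input.slice(last)); // text after the final separator
  return parts;
}

console.log(splitWithExec('a|b|c', /[|]/));
console.log(splitWithExec('a1b22c', /\d+/));
```

Both calls produce the same pieces as the built-in split() for these inputs, which is the point of the argument: the only primitive used is the search for the next match.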
I'm writing a simple Perl script that translates assembly instruction strings to 32-bit binary code.
I decided to handle translation grouping instruction by type (ADD and SUB are R-Type instructions and so on...) so in my code I'm doing something like this:
my $bin = &r_type($instruction) if $instruction =~ /^(?:add|s(?:ub|lt|gt))\s/;
because I want to handle add, sub, slt and sgt in the same way.
I realized however that maybe using that regular expression could be an 'overkill' for the task I'm supposed to do... could the pattern
/^(?:add|sub|slt|sgt)\s/
represent a better use of regular expressions in this case?
Thanks a lot.
Unless you are using a perl older than 5.10, the simple alternation will perform better anyway (see here), so there is no reason to try to optimize it.
Instead of burying the mnemonics inside regular expressions, build a dispatch table using a hash. It will be at least as fast, and your code will be far easier to follow:
my %emitter = (add => \&r_type,
               sub => \&r_type,
               slt => \&r_type,
               sgt => \&r_type,
               ...);
if ($instruction =~ /^(\S+)/) {
    my $emitter = $emitter{$1} // die "bad instruction $instruction";
    $emitter->($1, $instruction);
}
else {
    # error?...
}
I like salva's dispatch table (I show a lot of that in Mastering Perl), but I'll answer another aspect of the question in case you need that answer for a different problem someday.
When you want to build some alternations, some of which might be nested, you can use something like Regexp::Trie to build the alternation for you so you don't look at the ugly regex syntax:
use Regexp::Trie;
my $rt = Regexp::Trie->new;
foreach ( qw/add sub slt sgt/ ) {
$rt->add($_);
}
print $rt->regexp, "\n";
That gives you:
(?-xism:(?:add|s(?:gt|lt|ub)))
This way, you list the opcodes like Jonathan suggested, but also get the alternation. As ysth noted, you might get this for free with Perl now anyway.
Your second version is simpler, more readable, and more maintainable. The performance difference will depend on the regex implementation, but I suspect the nested version will run slower due to its increased complexity.
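That the two versions accept the same instructions can be checked quickly. The sketch below is in JavaScript rather than Perl (the syntax of these two patterns is the same in both), and the sample instruction strings are made up for the check; it says nothing about relative speed.

```javascript
// The two patterns from the question, written as JavaScript regex literals.
const nested = /^(?:add|s(?:ub|lt|gt))\s/;
const flat = /^(?:add|sub|slt|sgt)\s/;

// Hypothetical sample instructions: four that should match, three that should not.
const samples = [
  'add $1, $2, $3',
  'sub $1, $2, $3',
  'slt $1, $2, $3',
  'sgt $1, $2, $3',
  'sll $1, $2, 4',
  'addi $1, $2, 5',
  'su',
];

// Collect every sample on which the two patterns disagree.
const disagreements = samples.filter((s) => nested.test(s) !== flat.test(s));

console.log(disagreements.length === 0 ? 'patterns agree' : disagreements);
```

Since the patterns agree on what they match, the choice between them comes down to readability, as the answers above suggest.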
Yes it's overkill.