Perl regex hash to match string - regex

Right now I have the following code...
%strings = ( 'a' => 'x',
'b0' => 'y',
'b1' => 'y',
'b2' => 'y',
...
'bN' => 'y',
'c' => 'z');
....
if(grep { $_ eq $line[0] } keys %strings){
....
}
Overall, I set up this hash. $line is created by reading a file. I then check whether the first string in the line is contained in my hash. This code works perfectly. However, my problem is that the set of b keys in the hash keeps growing. For instance, right now I have to explicitly list b0 through b63. That is 64 different definitions that all need the same value. Is there a way to use a regex as the hash key, like b\/d\?

If you want to use a regular expression, nothing prevents you from doing so:
%strings = (
'a' => 'x',
'b\d+' => 'y',
'c' => 'z'
);
...
if( grep { $line[0] =~ /^$_$/ } keys %strings ) {
...
}
The ^ and $ are necessary to make sure the full string $line[0] matches and not only a part of it.
Bear in mind that this will be much slower than the eq comparison. On the other hand, the number of expressions to evaluate by grep will be much lower, so you may want to profile different options if the speed of execution is an issue.
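Also, keep in mind that you may want to refine the regular expression. For instance, ^b\d{1,2}$ will match a b followed by one or two digits. Or even ^b[1-6]?\d$, which matches b0 through b69. To accept exactly b0 through b63, something like ^b(?:[1-5]?\d|6[0-3])$ is needed.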
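For comparison, here is a minimal Python sketch of the same idea: a table whose keys are patterns, where a token is accepted only when a pattern matches it in full. The names PATTERNS and lookup are made up for this illustration, and the b pattern is the tightened b0 to b63 variant rather than anything from the original code.

```python
import re

# Hypothetical lookup table: keys are regex patterns, values are the mapped strings.
PATTERNS = {
    r"a": "x",
    r"b(?:[1-5]?\d|6[0-3])": "y",  # b0 through b63 only
    r"c": "z",
}

def lookup(token):
    """Return the value of the first pattern that matches the whole token, else None."""
    for pattern, value in PATTERNS.items():
        # fullmatch plays the role of the ^...$ anchors in the Perl version
        if re.fullmatch(pattern, token):
            return value
    return None

print(lookup("b63"))  # y
print(lookup("b64"))  # None
print(lookup("ab"))   # None
```

Matching in full is what stops "ab" from being accepted just because it contains an a.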

If I understood you correctly,
b\d+
This will match "b" followed by one or more digits.

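my %strings = ('a' => 'x',
(map { ("b$_" , 'y') } 0..63),
'c' => 'z');
This should do the trick, if it is what you want. The parentheses around the map call matter: without them, map takes everything after its block as its input list, so 'c' and 'z' would be turned into the keys bc and bz instead of the pair 'c' => 'z'.
If you need to add a 'b value' later in the code, you can still do $strings{"b$value"} = 'y'; to add the new key to the hash.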


Raku: effect of capture markers is lost "higher up"

The following Raku script:
#!/usr/bin/env raku
use v6.d;
grammar MyGrammar
{
rule TOP { <keyword> '=' <value> }
token keyword { \w+ }
token value { <strvalue> | <numvalue> }
token strvalue { '"' <( <-["]>* )> '"' }
token numvalue { '-'? \d+ [ '.' \d* ]? }
}
say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');
has the following output:
「foo = 42」
keyword => 「foo」
value => 「42」
numvalue => 「42」
「bar = "Hello, World!"」
keyword => 「bar」
value => 「"Hello, World!"」
strvalue => 「Hello, World!」
For the second item, note that strvalue contains the string value without quotes, as intended with the capture markers <( ... )>.
However, to my surprise, the quotes are included in value.
Is there a way around this?
TL;DR Use "multiple dispatch".[1,2] See @user0721090601's answer for a thorough explanation of why things are as they are. See @p6steve's for a really smart change to your grammar if you want your number syntax to match Raku's.
A multiple dispatch solution
Is there a way around this?
One way is to switch to explicit multiple dispatch.
You currently have a value token which calls specifically named value variants:
token value { <strvalue> | <numvalue> }
Replace that with:
proto token value {*}
and then rename the called tokens according to grammar multiple dispatch targeting rules, so the grammar becomes:
grammar MyGrammar
{
rule TOP { <keyword> '=' <value> }
token keyword { \w+ }
proto token value {*}
token value:str { '"' <( <-["]>* )> '"' }
token value:num { '-'? \d+ [ '.' \d* ]? }
}
say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');
This displays:
「foo = 42」
keyword => 「foo」
value => 「42」
「bar = "Hello, World!"」
keyword => 「bar」
value => 「Hello, World!」
This doesn't capture the individual alternations by default. We can stick with "multiple dispatch" but reintroduce naming of the sub-captures:
grammar MyGrammar
{
rule TOP { <keyword> '=' <value> }
token keyword { \w+ }
proto token value { * }
token value:str { '"' <( $<strvalue>=(<-["]>*) )> '"' }
token value:num { $<numvalue>=('-'? \d+ [ '.' \d* ]?) }
}
say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');
displays:
「foo = 42」
keyword => 「foo」
value => 「42」
numvalue => 「42」
「bar = "Hello, World!"」
keyword => 「bar」
value => 「Hello, World!」
strvalue => 「Hello, World!」
Surprises
to my surprise, the quotes are included in value.
I too was initially surprised.[3]
But the current behaviour also makes sense to me in at least the following senses:
The existing behaviour has merit in some circumstances;
It wouldn't be surprising if I was expecting it, which I think I might well have done in some other circumstances;
It's not easy to see how one would get the current behaviour if it was wanted but instead worked as you (and I) initially expected;
There's a solution, as covered above.
Footnotes
[1] Use of multiple dispatch[2] is a solution, but seems overly complex imo given the original problem. Perhaps there's a simpler solution. Perhaps someone will provide it in another answer to your question. If not, I would hope that we one day have at least one much simpler solution. However, I wouldn't be surprised if we don't get one for many years. We have the above solution, and there's plenty else to do.
[2] While you can declare, say, method value:foo { ... } and write a method (provided each such method returns a match object), I don't think Rakudo uses the usual multiple method dispatch mechanism to dispatch to non-method rule alternations but instead uses an NFA.
[3] Some might argue that it "should", "could", or "would" "be for the best" if Raku did as we expected. I find I think my best thoughts if I generally avoid [sh|c|w]oulding about bugs/features unless I'm willing to take any and all downsides that others raise into consideration and am willing to help do the work needed to get things done. So I'll just say that I'm currently seeing it as 10% bug, 90% feature, but "could" swing to 100% bug or 100% feature depending on whether I'd want that behaviour or not in a given scenario, and depending on what others think.
The <( and )> capture markers only work within a given token. Basically, each token returns a Match object that says "I matched the original string from index X (.from) to index Y (.to)", which is taken into account when stringifying Match objects. That's what's happening with your strvalue token:
my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;
my $start = $m<value><strvalue>.from; # 7
my $end = $m<value><strvalue>.to; # 20
say $text.substr: $start, $end - $start; # Hello, World!
You'll notice that there are only two numbers: a start and a finish value. This means that the value token you have can't create a discontiguous match. So its .from is set to 6, and its .to to 21.
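As a loose analogy only (Python's re module is not Raku's grammar engine), the same "one start, one end" behaviour can be seen with an ordinary regex match: the whole match spans the quotes, while the inner group has its own narrower span.

```python
import re

text = 'bar = "Hello, World!"'
m = re.search(r'"([^"]*)"', text)

# The overall match is one contiguous span that includes both quotes.
print(m.span(0))                   # (6, 21)
# The inner group has its own contiguous span without the quotes.
print(m.span(1))                   # (7, 20)
print(text[m.start(1):m.end(1)])   # Hello, World!
```

The outer span cannot "skip over" the quotes, because it has to cover everything that was consumed between its start and end.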
There are two ways around this: by using (a) an actions object or (b) a multitoken. Both have their advantages, and depending on how you want to use this in a larger project, you might want to opt for one or the other.
While you can technically define actions directly within a grammar, it's much easier to do them via a separate class. So we might have for you:
class MyActions {
method TOP ($/) { make $<keyword>.made => $<value>.made }
method keyword ($/) { make ~$/ }
method value ($/) { make ($<numvalue> // $<strvalue>).made }
method numvalue ($/) { make +$/ }
method strvalue ($/) { make ~$/ }
}
Each level uses make to pass values up to whatever token includes it, and the enclosing token has access to those values via the .made method. This is really nice when, instead of working with pure string values, you want to process them first in some way and create an object or similar.
To parse, you just do:
my $m = MyGrammar.parse: $text, :actions(MyActions);
say $m.made; # bar => Hello, World!
Which is actually a Pair object. You could change the exact result by modifying the TOP method.
The second way you can work around things is to use a multi token. It's fairly common in developing grammars to use something akin to
token foo { <option-A> | <option-B> }
But as you can see from the actions class, it requires us to check and see which one was actually matched. Instead, if the alternation can acceptably be done with |, you can use a multitoken:
proto token foo { * }
multi token foo:sym<A> { ... }
multi token foo:sym<B> { ... }
When you use <foo> in your grammar, it will match either of the two multi versions as if it had been in the baseline <foo>. Even better, if you're using an actions class, you can similarly just use $<foo> and know it's there without any conditionals or other checks.
In your case, it would look like this:
grammar MyGrammar
{
rule TOP { <keyword> '=' <value> }
token keyword { \w+ }
proto token value { * }
multi token value:sym<str> { '"' <( <-["]>* )> '"' }
multi token value:sym<num> { '-'? \d+ [ '.' \d* ]? }
}
Now we can access things as you were originally expecting, without using an actions object:
my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;
say $m; # 「bar = "Hello, World!"」
# keyword => 「bar」
# value => 「Hello, World!」
say $m<value>; # 「Hello, World!」
For reference, you can combine both techniques. Here's how I would now write the actions object given the multi token:
class MyActions {
method TOP ($/) { make $<keyword>.made => $<value>.made }
method keyword ($/) { make ~$/ }
method value:sym<str> ($/) { make ~$/ }
method value:sym<num> ($/) { make +$/ }
}
Which is a bit more grokkable at first look.
Rather than rolling your own token value:str and token value:num, you may want to use a regex Boolean condition check for Num (+) and Str (~) matching:
token number { \S+ <?{ defined +"$/" }> }
token string { \S+ <?{ defined ~"$/" }> }

Split string by . but ignore what's inside []

I have a string:
s='articles[zone.id=1].comments[user.status=active].user'
Looking to split (via split(some_regex_here)). The split needs to occur on every period other than those inside the bracketed substring.
Expected output:
["articles[zone.id=1]", "comments[user.status=active]", "user"]
How would I go about this? Or is there something else besides split() I should be looking at?
Try this,
s.split(/\.(?![^\[]*\])/)
I got this result,
2.3.2 :061 > s.split(/\.(?![^\[]*\])/)
=> ["articles[zone.id=1]", "comments[user.status=active]", "user"]
You can also test it here:
https://rubular.com/r/LaxEFQZJ0ygA3j
I assume the problem is to split on periods that are not within matching brackets.
Here is a non-regex solution that works with any number of nested brackets. I've assumed the brackets are all matched, but it would not be difficult to check that.
def split_it(s)
left_brackets = 0
s.each_char.with_object(['']) do |c,a|
if c == '.' && left_brackets.zero?
a << '' unless a.last.empty?
else
case c
when ']' then left_brackets -= 1
when '[' then left_brackets += 1
end
a.last << c
end
end.tap { |a| a.pop if a.last.empty? }
end
split_it '.articles[zone.id=[user.loc=1]].comments[user.status=active].user'
#=> ["articles[zone.id=[user.loc=1]]", "comments[user.status=active]", "user"]
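The same depth-counting idea can be sketched in Python. This is a rough port rather than a drop-in replacement; the function name split_outside_brackets is made up here, and like the Ruby version it assumes the brackets are balanced.

```python
def split_outside_brackets(s):
    """Split s on periods that are not inside square brackets (nesting allowed)."""
    parts = [""]
    depth = 0  # number of currently open '[' brackets
    for ch in s:
        if ch == "." and depth == 0:
            # Start a new piece, but never create empty pieces.
            if parts[-1]:
                parts.append("")
        else:
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
            parts[-1] += ch
    # Drop a trailing empty piece (e.g. when the input ends with a period).
    if not parts[-1]:
        parts.pop()
    return parts

print(split_outside_brackets(
    ".articles[zone.id=[user.loc=1]].comments[user.status=active].user"))
# ['articles[zone.id=[user.loc=1]]', 'comments[user.status=active]', 'user']
```

Counting depth instead of using a lookahead is what lets this handle nested brackets, which the regex-based split does not attempt.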

Why condition returns True using regular expressions for finding special characters in the string?

I need to validate the variable names:
name = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]
Only the names "variable0", "var_1__Int" and "a" are correct.
I could identify most of the "wrong" variable names using a regex:
import re
if re.match("^\d|\W|.*-|[()[]{}]", name):
print(False)
else:
print(True)
However, I still get a True result for va[riable0. Why is that the case?
I check for all types of parentheses.
.match() checks for a match only at the beginning of the string, while .search() checks for a match anywhere in the string.
You can also simplify your regex to this and call search() method:
^\d|\W
That checks whether the first character is a digit or whether a non-word character appears anywhere in the input.
Code:
>>> import re
>>> name = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]
>>> pattern = re.compile(r'^\d|\W')
>>> for str in name:
...     if pattern.search(str):
...         print(str + ' => False')
...     else:
...         print(str + ' => True')
...
2w2 => False
variable => False
variable0 => True
va[riable0 => False
var_1__Int => True
a => True
qq-q => False
Your expression is:
"^\d|\W|.*-|[()[]{}]"
But re.match() always matches from the beginning of the string, so your ^ is unnecessary. What you do need is a $ at the end, to make sure the entire input string matches and not just a prefix.
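One way to apply the "match the entire string" idea is to describe what a valid name looks like and require a full match, instead of listing everything that is invalid. The sketch below assumes a valid name starts with a letter or underscore followed by word characters; that rule is an assumption inferred from the examples in the question, and the name IDENTIFIER is made up here.

```python
import re

# Assumed rule: a letter or underscore, then any number of word characters.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")

names = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]

# fullmatch succeeds only if the pattern covers the whole string.
valid = [n for n in names if IDENTIFIER.fullmatch(n)]
print(valid)  # ['variable0', 'var_1__Int', 'a']
```

With fullmatch there is no need for explicit ^ or $ anchors in the pattern.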

Perl hash substitution with special characters in keys

My current script will take an expression, ex:
my $expression = '( a || b || c )';
and go through each boolean combination of inputs using sub/replace, like so:
my $keys = join '|', keys %stimhash;
$expression =~ s/($keys)\b/$stimhash{$1}/g;
So for example expression may hold,
( 0 || 1 || 0 )
This works great.
However, I would like to allow the variables (also in %stimhash) to contain a tag, *.
my $expression = '( a* || b* || c* )';
Also, printing the keys of the stimhash returns:
a*|b*|c*
It is not properly substituting/replacing with the extra special character, *.
It gives this warning:
Use of uninitialized value within %stimhash in substitution iterator
I tried using quotemeta() but did not have good results so far.
It will drop the values. An example after the substitution looks like:
( * || * || * )
Any suggestions are appreciated,
John
Problem 1
You use the pattern a* thinking it will match only a*, but a* means "0 or more a". You can use quotemeta to convert text into a regex pattern that matches that text.
Replace
my $keys = join '|', keys %stimhash;
with
my $keys = join '|', map quotemeta, keys %stimhash;
Problem 2
\b
is basically
(?<!\w)(?=\w)|(?<=\w)(?!\w)
But * (like the space) isn't a word character. The solution might be to replace
s/($keys)\b/$stimhash{$1}/g
with
s/($keys)(?![\w*])/$stimhash{$1}/g
though the following makes more sense to me
s/(?<![\w*])($keys)(?![\w*])/$stimhash{$1}/g
Personally, I'd use
s{([\w*]+)}{ $stimhash{$1} // $1 }eg
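The two fixes (escaping the keys, and replacing \b with explicit lookarounds) carry over to other regex engines. Here is a small Python sketch of the same approach; the values in stimhash are made up for the example, and sorting the keys longest-first is an extra precaution that is not part of the Perl answer.

```python
import re

# Hypothetical stimulus values, keyed by variable names that contain '*'.
stimhash = {"a*": "0", "b*": "1", "c*": "0"}
expression = "( a* || b* || c* )"

# re.escape is the counterpart of Perl's quotemeta: '*' becomes a literal.
keys = "|".join(re.escape(k) for k in sorted(stimhash, key=len, reverse=True))

# Lookarounds replace \b, which cannot see a boundary after the non-word '*'.
pattern = re.compile(rf"(?<![\w*])({keys})(?![\w*])")

result = pattern.sub(lambda m: stimhash[m.group(1)], expression)
print(result)  # ( 0 || 1 || 0 )
```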

Scala - how to filter list with two chars

I have a char List in Scala from which I want to remove all chars that are not parentheses. The problem is I only seem to be able to do this for one character, e.g.:
var parens = chars.filter(_ == '(')
If I try this:
var parens = chars.filter(_ == '(').filter(_ == ')')
..I get nothing since I am filtering it once, then a second time which removes everything. How can I filter a character List (not a string list) for multiple chars?
If you need/want a functional solution then try this:
val givenList = List('(', '[', ']', '}', ')')
val acceptedChars = List('(', ')')
givenList filter acceptedChars.contains // or givenList.filter(acceptedChars.contains)
Now you can add whatever characters you like to the second list, the one you filter against, without changing the filter call. If you want to keep the characters that are not in acceptedChars instead, just change filter to filterNot. Another advantage of this approach is that you do not need to write a big lambda combining all the characters you want to filter on, like x => x == '(' || x == ')' || etc.
Update
As senia proposed in the comments, you can also use a shorter version with a Set: just replace acceptedChars.contains with a Set of the given chars:
givenList.filter(Set('(', ')'))
This will remove all characters that are not parentheses:
val parens = chars.filter(c=>c=='('||c==')')
The following is what I tested in scala console:
scala> val chars = List('a', 'b', '(', 'c', ')', ')')
chars: List[Char] = List(a, b, (, c, ), ))
scala> val parens = chars.filter(c=>c=='('||c==')')
parens: List[Char] = List((, ), ))
The reason that your code removes everything is that... the first filter (chars.filter(_ == '(')) removes all the characters that are not (, which means only ( remains. Applying filter(_ == ')') to this result returns empty list.