In Ruby, how can the Regexp#~ unary operator be aliased? - regex

Playing with the freedom that Ruby offers in its base features, I found it rather easy to alias most operators used in the language, but the Regexp#~ unary prefix operator is trickier.
A first naïve approach would be to alias it in the Regexp class itself:
class Regexp
  alias hit ~@ # remember that @ stands for "prefix version"
  # Note that a simple `alias_method :hit, :~@` will give the same result
end
As was pointed out in an answer below, this approach does work with the dot-notation calling form, like /needle/.hit. However, trying to execute hit /needle/ will raise undefined method `hit' for main:Object (NoMethodError).
So another naïve approach would be to define this very method in Object, something like
class Object
  def ~@(pattern)
    pattern =~ $_
  end
end
However, this won’t work, as the $_ global variable is in fact locally bound and won’t keep the value it has in the calling context; that is, $_ is always nil in the previous snippet.
So the question is: is it possible to have the expression hit /needle/ return the same result as ~ /needle/?

Works just fine for me:
class Regexp
alias_method :hit, :~ # both of them work
# alias hit ~ # both of them work
end
$_ = "input data"
/at/.hit #=> 7
~/at/ #=> 7
/at/.hit #=> 7
~/at/ #=> 7

So, as the completed question now makes clear, the main hindrance is the narrow scope of $_. That’s where trace_var can come to the rescue:
trace_var :$_, proc { |nub|
  $last_explicitly_read_line = nub
  #puts "$_ is now '#{nub}'"
}
def reach(pattern)
  $last_explicitly_read_line =~ pattern
end
def first
  $_ = "It’s needless to despair."
end
def second
  first
  p reach /needle/
  $_ = 'What a needlework!'
  p reach /needle/
end
p reach /needle/
second
p reach /needle/
$_ = nil
p reach /needle/
So the basic idea is to stash the value of $_, each time it is changed, in another variable that will be accessible in subsequent calling contexts. Here it was implemented with another global variable (not locally bound, unlike $_ of course), but the same result could be obtained with other implementations, like defining a class variable on Object.
One could also try to use something like binding_of_caller or binding_ninja, but my own attempts at doing so failed, and of course it comes with additional dependencies which have their own limitations.
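For comparison, here is a minimal sketch of the scoping issue itself, together with a dependency-free way around it: handing the line over explicitly instead of relying on $_. The helper names hit_line, callee_last_line and demo are made up for this illustration; they are not part of Ruby and this is not a drop-in replacement for ~.

```ruby
# Hypothetical helper: takes the subject line explicitly instead of reading $_.
def hit_line(pattern, line)
  pattern =~ line
end

# Hypothetical helper: reads $_ inside its own method frame,
# where nothing has assigned it.
def callee_last_line
  $_
end

def demo
  $_ = "It's needless to despair."
  {
    callee_sees: callee_last_line,    # nil: $_ is local to each method frame
    tilde: ~/needle/,                 # 5: Regexp#~ matches against this frame's $_
    explicit: hit_line(/needle/, $_)  # 5: same match, $_ passed by the caller
  }
end

p demo
```

The cost of this design is that every call site has to mention $_ itself, which is exactly what the original hit /needle/ form was trying to avoid.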


How to fix 'Bareword found' issue in perl eval()

The following code returns "Bareword found where operator expected at (eval 1) line 1, near "*,out" (Missing operator before out?)"
$val = 0;
$name = "abc";
$myStr = '$val = ($name =~ in.*,out [)';
eval($myStr);
As per my understanding, I can resolve this issue by wrapping the "in.*,out [" block in '//'s.
But that "in.*,out [" can vary (e.g. user input), and users may forget to supply the '//'s. Is there any other way to handle this issue (e.g. return 0 if eval() hits that 'Bareword found where ...' error)?
The magic of (string) eval -- and the danger -- is that it turns a heap of dummy characters into code, compiles and runs it. So can one then use '$x = ,hi'? Well, no, of course, when that string is considered code then that's a loose comma operator there, a syntax error; and a "bareword" hi.† The string must yield valid code:
In a string eval, the value of the expression (which is itself determined within scalar context) is first parsed, and if there were no errors, executed as a block within the lexical context of the current Perl program.
So that string in the question as it stands would be just (badly) invalid code, which won't compile, period. If the in.*,out [ part of the string is in quotes of some sort, then that is legitimate and the =~ operator will take it as a pattern and you have a regex. But then of course why not use regex's normal pattern delimiters, like // (or m{}, etc).
And whichever way that string gets acquired it'll be in a variable, no? So you can have /$input/ in the eval and populate that $input beforehand.
But, above all, are you certain that there is no other way? There always is. The string-eval is complex and tricky and hard to use right and nigh impossible to justify -- and dangerous. It runs arbitrary code! That can break things badly even without any bad intent.
I'd strongly suggest to consider other solutions. Also, it is unclear why there'd be need for eval in the first place -- as you only need the regex pattern as user input (not code) you can have that very regex in normal code with a pattern in a variable, which is populated earlier when the user input is supplied. (Note that taking a pattern from the user may lead to trouble as well.)
† A problem if you're into warnings, and we all are.
The following isn't valid Perl code:
$val = ($name =~ in.*,out [)
You want the following:
$val = $name =~ /in.*,out \[/
(The parens weren't harmful, but didn't help either.)
If the pattern is user-supplied, you can use the following:
$val = $name =~ /$pattern/
(No eval EXPR needed!)
Note from the correction that the pattern in the question isn't correct. You can catch such errors using eval BLOCK:
eval { $val = $name =~ /$pattern/ };
die("Bad pattern \"$pattern\" provided: $@") if $@;
A note about user-provided patterns: The above won't let the user execute arbitrary code, but it won't protect you from patterns that would take longer than the lifespan of the universe to complete.

Why is division parsed as regular expression?

This is part of my code:
my $suma = U::round $item->{ suma }; # line 36
$ts += $suma;
$tnds += U::round $suma /6;
}
return( $ts, $tnds );
}
sub create { #line 46
my( $c ) = shift;
my $info = $c->req->json;
my $header = @$info[0];
my $details = @$info[1];
my $agre = D::T Agreement => $header->{ agreement_id };
my( $total_suma, $total_nds ) = total( $details );
my $saldo = 0;
my $iid = @$details[0]->{ period };
my $interval = D::T Period => $iid //7; # line 58
# This is first Invoice if operator do not provide activation date
my $is_first = !$details->[0]{valid_from} && $iid && $interval;
When this module is loaded I get an error:
Can't load application from file "lib/MaitreD/Controller/ManualDocument.pm line 38, near "my $interval = D::T Period => $iid /"
Unknown regexp modifier "/6" at lib/MaitreD/Controller/ManualDocument.pm line 38, at end of line
Global symbol "$pkg" requires explicit package name (did you forget to declare "my $pkg"?) at lib/MaitreD/Controller/ManualDocument.pm line 41.
...
Is this indirect object call guilty?
Because when I put parentheses, as in U::round( $suma /6 ), there are no errors.
Here are some thoughts on this, and a plausible explanation. A simple reproduction
perl -wE'sub tt { say "@_" }; $v = 7; tt $v /3'
gives me
Search pattern not terminated at -e line 1.
So it tries to parse a regex in that subroutine call, as stated, and the question is: why?
With parentheses around the argument(s) it works as expected. With more arguments following it, it fails the same way, but with arguments preceding it, it works:
perl -wE'sub tt { say "@_" }; $v = 7; tt $v /3, 3' # fails the same way
perl -wE'sub tt { say "@_" }; $v = 7; tt 3, $v /3' # works
Equipping the tt sub with a prototype doesn't change any of this.
By the error it appears that the / triggers the search for the closing delimiter and once it's not found the whole thing fails. So why is this interpreted as a regex and not division?
It seems that tt $v are grouped in parsing, and interpreted as a sub and its arguments, since they're followed by a space; then /3 is taken separately and then that does look like a regex.† That would still fail as a syntax error but perhaps the regex parsing failure comes first.
Then the difference between other comma-separated terms coming before or after is clear -- with tt 3, ... the following $v /3 is a term for the next argument, and is parsed as division.
This still leaves another issue. None of the builtins that I tried have this problem, be they list or unary operators, with a variety of prototypes (push, chr, splice, etc) -- except for print, which has what looks like the same problem, and which fails both with and without parens.
perl -wE'$v=110; say for unpack "A1A1", $v /2' #--> 5 5
perl -wE'$v=200; say chr $v /2' #--> d
perl -wE'$v=3; push @ary, $v /2; say "@ary"' #--> 1.5
perl -wE'$v = 7; say $v /3' # fails, the same way
perl -wE'$v = 7; say( $v /3 )' # fails as well, same way
A difference is that print obeys "special" parsing rules, which allow the first argument to be a filehandle. (Also, it has no prototype but that doesn't appear to matter.)
Then the expression print $v /3... can indeed be parsed as print filehandle EXPR, and the EXPR starting with / is parsed as a regex. The same holds with parentheses.‡
All this involves some guesswork as I don't know how the parser does it. But it is clearly a matter of details of how a subroutine call is parsed, which (accidentally?) includes print as well.
An obvious remedy of using parens on (user-defined) subroutines is reasonable in my view. The other fix is to be consistent with spaces around math operators, to either not have them on either side or to use them on both sides -- that is fine as well, even as it's itchy (spaces? really?).
I don't know what to say about there being a problem with say( $v /3 ) though.
A couple more comments on the question.
By the text of the error message in the question, Unknown regexp modifier "/6", it appears that there the / is taken as the closing delimiter, unlike in the example above. And there is more in that message, which is unclear. In the end, we do have a very similar parsing question.
As for
Is this indirect object call guilty?
I don't see an indirect object call there, only a normal subroutine call. Also, the example from this answer displays very similar behavior and rules out the indirect object syntax.
† Another possibility may be that $v /3 is parsed as a term, since it follows the (identifiable!) subroutine name tt. Then, the regex binding operator =~ binds more tightly than the division, and here it is implied by clearly attempting to bind to $_ by default.
I find this less likely, and it also can't explain the behavior of builtins, print in particular.
‡ Then one can infer that other builtins with an optional comma-less first argument (and so without a prototype) go the same way, but I can't readily think of any.
Perl thinks that the symbol / is a start of a regular expression and not a division operator. https://perldoc.perl.org/perlre - You can check the perldoc for regular expressions.
You can try adding a whitespace character before 6 like so: $tnds += U::round $suma / 6;

Why does the match operator's "match-only-once" optimization only apply with the "?" delimiter?

From the docs (perldoc -f m)
If ? is the delimiter, then a match-only-once rule applies, described in m?*PATTERN*? below.
The "match-only-once rule" doesn't seem to be defined anywhere, but it seems to be a real optimization:
use Benchmark qw(:all) ;
use constant HAYSTACK => "this is a test string";
my $needle = "test";
cmpthese(-1, {
'questionmark' => sub { if ( HAYSTACK =~ m?$needle?n ) { 1 } },
'backslash' => sub { if ( HAYSTACK =~ m/$needle/n ) { 1 } },
});
With the results,
                   Rate    backslash questionmark
backslash     9267717/s           --         -57%
questionmark 21588328/s         133%           --
This makes me wonder why is the behavior in m// in scalar context such that it even needs this behavior? Let's take for example the output
perl -E'say "FOOOOOO" =~ m/O/' # returns 1
If it's not even counting the O what does it do after the first match such that it's twice as slow?
The "match-only-once rule" doesn't seem to be defined anywhere, […]
"A match-only-once rule" is a description of the rule — it's a rule saying that m?PATTERN? matches only once — not an official name that you can use to search. The text that you quote is pulled from the perlop manpage, so when it says "described in m?*PATTERN*? below", it's referring to this part of that manpage:
m?PATTERN?msixpodualngc
This is just like the m/PATTERN/ search, except that it matches only once between calls to the reset() operator. This is a useful optimization when you want to see only the first occurrence of something in each file of a set of files, for instance. Only m?? patterns local to the current package are reset.
while (<>) {
if (m?^$?) {
# blank line between header and body
}
} continue {
reset if eof; # clear m?? status for next file
}
Another example switched the first "latin1" encoding it finds to "utf8" in a pod file:
s//utf8/ if m? ^ =encoding \h+ \K latin1 ?x;
This makes me wonder why is the behavior in m// in scalar context such that it even needs this behavior?
Even in scalar context, m// or m?? may be called many times between resets, and if so then the two behave differently. (You can see this in the first snippet above. It's also the reason that your benchmarks give different performance results: the version with m?$needle?n only does a regex match the first time the function is called — it just returns 'no match' on all subsequent calls — whereas the version with m/$needle/n does a regex match every time.)
The confusion here is that "once" in "match-only-once" is in reference to the calling context of the m?? not in reference to matching once the needle inside the haystack, and ignoring subsequent matches of the needle inside the haystack. So if m?? is called many times without reset, only the first one that matches will return the match.
sub foo { return "foo" =~ m?o? };
say foo(); # 1
say foo(); # undef
reset();
say foo(); # 1

Why is a Regexp object considered to be "falsy" in Ruby?

Ruby has a universal idea of "truthiness" and "falsiness".
Ruby does have two specific classes for Boolean objects, TrueClass and FalseClass, with singleton instances denoted by the special variables true and false, respectively.
However, truthiness and falsiness are not limited to instances of those two classes, the concept is universal and applies to every single object in Ruby. Every object is either truthy or falsy. The rules are very simple. In particular, only two objects are falsy:
nil, the singleton instance of NilClass and
false, the singleton instance of FalseClass
Every single other object is truthy. This includes even objects that are considered falsy in other programming languages, such as
the Integer 0,
the Float 0.0,
the empty String '',
the empty Array [],
the empty Hash {},
These rules are built into the language and are not user-definable. There is no to_bool implicit conversion or anything similar.
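As a quick illustration of these rules, here is a minimal sketch. The helper name truthy? is made up for this example and is not a core Ruby method; it just runs an object through an ordinary conditional.

```ruby
# Hypothetical helper (not part of Ruby): reports how a conditional
# treats an object that reaches it through a variable.
def truthy?(obj)
  obj ? true : false
end

# Values that are falsy in some other languages, plus Ruby's two falsy objects.
candidates = [0, 0.0, '', [], {}, nil, false]

candidates.each do |obj|
  puts "#{obj.inspect} is #{truthy?(obj) ? 'truthy' : 'falsy'}"
end
```

Only the last two entries, nil and false, are reported as falsy.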
Here is a quote from the ISO Ruby Language Specification:
6.6 Boolean values
An object is classified into either a trueish object or a falseish object.
Only false and nil are falseish objects. false is the only instance of the class FalseClass (see 15.2.6), to which a false-expression evaluates (see 11.5.4.8.3). nil is the only instance of the class NilClass (see 15.2.4), to which a nil-expression evaluates (see 11.5.4.8.2).
Objects other than false and nil are classified into trueish objects. true is the only instance of the class TrueClass (see 15.2.5), to which a true-expression evaluates (see 11.5.4.8.3).
The executable Ruby/Spec seems to agree:
it "considers a non-nil and non-boolean object in expression result as true" do
if mock('x')
123
else
456
end.should == 123
end
According to those two sources, I would assume that Regexps are also truthy, but according to my tests, they aren't:
if // then 'Regexps are truthy' else 'Regexps are falsy' end
#=> 'Regexps are falsy'
I tested this on YARV 2.7.0-preview1, TruffleRuby 19.2.0.1, and JRuby 9.2.8.0. All three implementations agree with each other and disagree with the ISO Ruby Language Specification and my interpretation of the Ruby/Spec.
More precisely, Regexp objects that are the result of evaluating Regexp literals are falsy, whereas Regexp objects that are the result of some other expression are truthy:
r = //
if r then 'Regexps are truthy' else 'Regexps are falsy' end
#=> 'Regexps are truthy'
Is this a bug, or desired behavior?
This isn’t a bug. What is happening is Ruby is rewriting the code so that
if /foo/
whatever
end
effectively becomes
if /foo/ =~ $_
whatever
end
If you are running this code in a normal script (and not using the -e option) then you should see a warning:
warning: regex literal in condition
This is probably somewhat confusing most of the time, which is why the warning is given, but it can be useful for one-liners using the -e option. For example, you can print all lines matching a given regexp from a file with
$ ruby -ne 'print if /foo/' filename
(The default argument for print is $_ as well.)
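The rewrite described above can be observed in a small sketch. The two method names are hypothetical, chosen only for this illustration, and when the file is loaded as a normal script Ruby is expected to print the "regex literal in condition" warning for the first method.

```ruby
# Hypothetical helper: a regexp *literal* is used directly as the condition,
# so it is matched against this method's $_.
def literal_in_condition(line)
  $_ = line
  if /foo/ then :matched else :not_matched end
end

# Hypothetical helper: the same regexp reaches the condition through a
# local variable, so it is tested like any other object.
def variable_in_condition(line)
  $_ = line
  r = /foo/
  if r then :truthy else :falsy end
end

p literal_in_condition("some foo here")  # :matched
p literal_in_condition("nothing")        # :not_matched
p literal_in_condition(nil)              # :not_matched
p variable_in_condition(nil)             # :truthy
```

Assigning $_ inside each method keeps the example self-contained, since $_ set elsewhere would not be visible there.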
This is the result of (as far as I can tell) an undocumented feature of the ruby language, which is best explained by this spec:
it "matches against $_ (last input) in a conditional if no explicit matchee provided" do
-> {
eval <<-EOR
$_ = nil
(true if /foo/).should_not == true
$_ = "foo"
(true if /foo/).should == true
EOR
}.should complain(/regex literal in condition/)
end
You can generally think of $_ as the "last string read by gets".
To make matters even more confusing, $_ (along with $~) is not a global variable; it has local scope.
When a ruby script starts, $_ == nil.
So, the code:
// ? 'Regexps are truthy' : 'Regexps are falsey'
Is being interpreted like:
(// =~ nil) ? 'Regexps are truthy' : 'Regexps are falsey'
...Which returns falsey.
On the other hand, for a non-literal regexp (e.g. r = // or Regexp.new('')), this special interpretation does not apply.
// is truthy, just like all other objects in ruby besides nil and false.
Unless running a ruby script directly on the command line (i.e. with the -e flag), the ruby parser will display a warning against such usage:
warning: regex literal in condition
You could make use of this behaviour in a script, with something like:
puts "Do you want to play again?"
gets
# (user enters e.g. 'Yes' or 'No')
/y/i ? play_again : back_to_menu
...But it would be more normal to assign a local variable to the result of gets and perform the regex check against this value explicitly.
I'm not aware of any use case for performing this check with an empty regex, especially when defined as a literal value. The result you've highlighted would indeed catch most ruby developers off-guard.
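A minimal sketch of that more conventional style follows. The method name next_action and the two symbols it returns are assumptions made for this example, standing in for the play_again and back_to_menu calls above.

```ruby
# Hypothetical helper: decides what to do next from the user's answer,
# matching the regexp against an explicit string rather than $_.
def next_action(answer)
  answer.to_s.match?(/y/i) ? :play_again : :back_to_menu
end

# In a real script the answer would come from the terminal, e.g.:
#   puts "Do you want to play again?"
#   answer = gets
#   next_action(answer)
p next_action("Yes\n")  # :play_again
p next_action("No\n")   # :back_to_menu
p next_action(nil)      # :back_to_menu (gets returns nil at end of input)
```

Because the subject of the match is named explicitly, no warning is emitted and the code does not depend on which method last called gets.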

Making a dynamic hash of arrays in foreach in perl based on regex

So I'm trying to make a hash of arrays based on a regex inside a foreach.
I'm getting some file paths, and they are of the format:
longfilepath/name.action.gz
So basically there will be files with the same name but different actions, so I want to make a hash keyed by name whose values are arrays of actions. I'm apparently doing something wrong, as I keep getting this error when I run the code:
Not an ARRAY reference at ....the file I'm writing in
Which I don't get, since I'm checking to see if it's set and, if not, declaring it as an array. I'm still getting used to perl, so I'm guessing my problem is something simple.
I should also say, that I've verified my regex is generating both the 'name' and 'action' strings properly so the problem is definitely in my foreach;
Thanks for your help. :)
My code is thus.
my %my_hash;
my $file_paths = glom("/this/is/mypath/*.*\.gz");
foreach my $path (@$bdr_paths){
$path =~ m"\/([^\/\.]+)\.([^\.]+)\.gz";
print STDERR "=>".Dumper($1)."\n\r";
print STDERR "=>".Dumper($2)."\n\r";
#add the entity type to a hash with the recipe as the key
if($my_hash{$1})
{
push($my_hash{$1}, $2);
}
else
{
$my_hash{$1} = ($2);
}
}
It’s glob, not glom. In glob expressions, the period is not a metacharacter. → glob '/this/is/mypath/*.gz'.
The whole reason for using alternate regex delimiters is to avoid unnecessary escapes. The forward slash is not a regex metacharacter, but a delimiter. Inside character classes, many operators lose their specialness; there is no need to escape the period. Ergo m!/([^/.]+)\.([^.]+)\.gz!.
Don't append \n\r to your output. ① The Dumper function already appends a newline. ② If you are on an OS that expects a CRLF, then use the :crlf PerlIO layer, which transforms all \ns to a CRLF. You can add layers via binmode STDOUT, ':crlf'. ③ If you are doing networking, it might be better to specify the exact bytes you want to emit, e.g. \x0A\x0D or \012\015. (But in this case, also remove all PerlIO layers).
Using references as first arg to push doesn't work on perls older than v5.14, and the feature was removed again in v5.24.
Don't manually check whether you populated a slot in your hash or not; if it is undef and used as an arrayref, an array reference is automatically created there. This is known as autovivification. Of course, this requires you to perform this dereference (and skip the short form for push).
In Perl, parens only sort out precedence, and create list context when used on the LHS of an assignment. They do not create arrays. To create an anonymous array reference, use brackets: [$var]. Using parens like you do is useless; $x = $y and $x = ($y) are absolutely identical.
So you either want
push @{ $my_hash{$1} }, $2;
or
if ($my_hash{$1}) {
push $my_hash{$1}, $2;
} else {
$my_hash{$1} = [$2];
}
Edit: Three things I overlooked.
If glob is used in scalar context, it turns into an iterator. This is usually unwanted, unless when used in a while(my $path = glob(...)) { ... } like fashion. Otherwise it is more difficult to make sure the iterator is exhausted. Rather, use glob in list context to get all matches at once: my @paths = glob(...).
Where does $bdr_paths come from? What is inside?
Always check that a regex actually matched. This can avoid subtle bugs, as the captures $1 etc. keep their value until the next successful match.
When you say $my_hash{$1} = ($2); the right-hand side is evaluated in scalar context, where the comma operator returns the last element of the list, and that is what gets stored in the hash.
my %h;
$h{a} = ('foo');
$h{b} = ['bar'];
$h{c} = ('foo', 'bar', 'bat'); # Will cause warning if 'use warnings;'
print Dumper(\%h);
Gives
$VAR1 = {
'c' => 'bat',
'b' => [
'bar'
],
'a' => 'foo'
};
You can see that a plain value is stored, not an array reference. So you can store an anonymous array ref with $my_hash{$1} = [$2]; and then push onto it with push( @{ $my_hash{$1} }, $2);