How can I create a qr// in Perl 5.12 from C? - regex

This has been working for me in 5.8 and 5.10, but in 5.12 my code creates this weird non-qr object:
# running "print Dumper($regex)"
$VAR1 = bless( do{\(my $o = '')}, 'Regexp' );
Whereas printing a qr// not created by my code looks like this:
# running "print Dumper(qr/foo/i)"
$VAR1 = qr/(?i-xsm:foo)/;
My code is basically:
REGEXP *rx = re_compile(pattern, flags);
SV *regex = sv_2mortal(newSVpv("",0));
sv_magic(regex, (SV*)rx, PERL_MAGIC_qr, 0, 0);
stash = gv_stashpv("Regexp", 0);
sv_bless(newRV((SV*)regex), stash);
Anyone know how to correctly create a regex from a string in 5.12?

Thanks for putting me on the right track, guys, it turns out I was seriously overthinking this. They just cut out the magic line and don't create the PV.
This is all you need to do in Perl 5.12:
REGEXP *rx = re_compile(pattern, flags);
SV *regex = newRV((SV*)rx);
stash = gv_stashpv("Regexp", 0);
sv_bless(regex, stash);

Take a look at the comments in this answer by hobbs. I've copied it below for ease of reading:
Regex objects actually get slightly more "core" in 5.12.0, as they're now references to scalars of type REGEXP rather than references to scalars with magic. This is, however, completely invisible to user code, unless you manage to bypass overloaded stringification, in which case you'll notice that regexes now print as Regexp=REGEXP(0x1234567) instead of Regexp=SCALAR(0x1234567)
I'm not especially familiar with XS, but I suspect you can't use a scalar value any more to create your regex.

Perl 5.12 changed regexps to be first class objects, which you find as part of the tangential discussion in How do I check if a scalar has a compiled regex it in?.
I'm not an XS person, so I don't know what you need to change in your code to make it work out. Searching for 'REGEXP' in the perl sources shows the fixes they made to the core modules to use the new stuff.

Related

What Regex do I use in Find/Replace to find all instances of a declaration string?

I was handed some very badly written vb.Net code today and asked to migrate it to use ODP.Net. To shortcut this a little, I used Find/Replace to set all of the command variables to BindByName = true. Based on the first few code files though, I though all of these were named "cmd". Unfortunately, they aren't; the original author of the code actually named all of their commands after their purpose, even though they only used one OracleCommand per function. They also decided that using was apparently not worth doing, either.
Dim cmGetStatus As New OracleCommand
cmGetStatus.CommandType = CommandType.StoredProcedure
cmd.BindByName = True `<--this was added by my previous replace with a regex
What regex could I use to grab all instances of "Dim ____ as New OracleCommand" and replace the variable name with "cmd"? What about the same sort of replacement on all instances of "_____.CommandType"? This would save me at least 8 hours of manual edits.
Search: (Dim ).*( As New OracleCommand)
Replace: $1cmd$2
Search: .*( = CommandType.StoredProcedure)
Replace: cmd$1
Group replacements are done with $1, $2, etc.

Conditional RegExp Replace - if reference is found, then write something else

Two cases
1. Key<A, M> desc = newKey();
2. Property<B, N> type = newKey("type", B.bar);
The RegExp and replace
find: (?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);
rep.: Foo<C$1, $2> $3 = pl.nP("$3", $2.class); // ($4)
The Result
1. Foo<CA, M> desc = pl.nP("desc", M.class); //
2. Foo<CB, N> type = pl.nP("type", N.class); // ("type", B.bar)
The Problem:
Now I want to avoid the empty comment at the line 1.
Is there a way to write the $4 and the stuff around it only if $4
isn't empty?
You could remove empty comments afterwards with another regular expression.
EDIT
Another solution would be to deal with the special case separately (... = newKey\(\)).
Perhaps you could automate this process with a simple script, if the tedium of repetitive typing becomes too great(eg. when dealing with multiple conditionals).
As far as I know, there isn't any 'intelligence' built into the replace field in Sublime Text; all you can do is to assemble the captured pieces to your liking.
While skimming through a few Google search results yesterday, I found an article about conditional patterns in Perl, but nothing pertaining to the problem at hand.
For the sake of full disclosure, I should say that I am in no sense an expert in the field, so I could be wrong. I do however have some experience with the Python API for
Sublime Text. It might be possible to implement this functionality yourself, if it doesn't already exist within the plethora of extensions available.
I'm sorry if this sounds like a very long-winded 'uh uh', but I'll be on the lookout for a general solution.

How do I check if a scalar has a compiled regex in it with Perl?

Let's say I have a subroutine/method that a user can call to test some data that (as an example) might look like this:
sub test_output {
my ($self, $test) = #_;
my $output = $self->long_process_to_get_data();
if ($output =~ /\Q$test/) {
$self->assert_something();
}
else {
$self->do_something_else();
}
}
Normally, $test is a string, which we're looking for anywhere in the output. This was an interface put together to make calling it very easy. However, we've found that sometimes, a straight string is problematic - for example, a large, possibly varying number of spaces...a pattern, if you will. Thus, I'd like to let them pass in a regex as an option. I could just do:
$output =~ $test
if I could assume that it's always a regex, but ah, but the backwards compatibility! If they pass in a string, it still needs to test it like a raw string.
So in that case, I'll need to test to see if $test is a regex. Is there any good facility for detecting whether or not a scalar has a compiled regex in it?
As hobbs points out, if you're sure that you'll be on 5.10 or later, you can use the built-in check:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
However, I don't always have that option. In general, I do this by checking against a prototype value with ref:
if( ref $scalar eq ref qr// ) { ... }
One of the reasons I started doing it this way was that I could never remember the type name for a regex reference. I can't even remember it now. It's not uppercase like the rest of them, either, because it's really one of the packages implemented in the perl source code (in regcomp.c if you care to see it).
If you have to do that a lot, you can make that prototype value a constant using your favorite constant creator:
use constant REGEX_TYPE => ref qr//;
I talk about this at length in Effective Perl Programming as "Item 59: Compare values to prototypes".
If you want to try it both ways, you can use a version check on perl:
if( $] < 5.010 ) { warn "upgrade now!\n"; ... do it my way ... }
else { ... use is_regex ... }
As of perl 5.10.0 there's a direct, non-tricky way to do this:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
is_regexp uses the same internal test that perl uses, which means that unlike ref, it won't be fooled if, for some strange reason, you decide to bless a regex object into a class other than Regexp (yes, that's possible).
In the future (or right now, if you can ship code with a 5.10.0 requirement) this should be considered the standard answer to the problem. Not only because it avoids a tricky edge-case, but also because it has the advantage of saying exactly what it means. Expressive code is a good thing.
See the ref built-in.

grep replacement with extensive regular expression implementation

I have been using grepWin for general searching of files, and wingrep when I want to do replacements or what-have-you.
GrepWin has an extensive implementation of regular expressions, however doesn't do replacements (as mentioned above).
Wingrep does replacements, however has a severely limited range of regular expression implementation.
Does anyone know of any (preferably free) grep tools for windows that does replacement AND has a reasonable implementation of regular expressions?
Thanks in advance.
I think perl at the command line is the answer you are looking for. Widely portable, powerful regex support.
Let's say that you have the following file:
foo
bar
baz
quux
you can use
perl -pne 's/quux/splat!/' -i /tmp/foo
to produce
foo
bar
baz
splat!
The magic is in Perl's command line switches:
-e: execute the next argument as a perl command.
-n: execute the command on every line
-p: print the results of the command, without issuing an explicit
'print' statement.
-i: make substitutions in place. overwrite the document with the
output of your command... use with caution.
I use Cygwin quite a lot for this sort of task.
Unfortunately it has the world's most unintuitive installer, but once it's installed correctly it's very usable... well apart from a few minor issues with copy and paste and the odd issue with line-endings.
The good thing is that all the tools work like on a real GNU system, so if you're already familiar with Linux or similar, you don't have to learn anything new (apart from how to use that crazy installer).
Overall I think the advantages make up for the few usability issues.
If you are on Windows, you can use vbscript (requires no downloads). It comes with regex. eg change "one" to "ONE"
Set objFS=CreateObject("Scripting.FileSystemObject")
Set WshShell = WScript.CreateObject("WScript.Shell")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
strFileContents = objFile.ReadAll
Set objRE = New RegExp
objRE.Global = True
objRE.IgnoreCase = False
objRE.Pattern = "one"
strFileContents = objRE.Replace(strFileContents,"ONE") 'simple replacement
WScript.Echo strFileContents
output
C:\test>type file
one
two one two
three
C:\test>cscript //nologo test.vbs file
ONE
two ONE two
three
You can read up vbscript doc to learn more on using regex

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
if (rTest.IsMatch(formattedString))
return rTest.Replace(formattedString, string.Empty);
else
return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rText = new Regex(#"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
new Regex(#"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster to and reduced the impact on my text rendering. Can someone explain to a regexp newbie, why? :)
Do you really want to do run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? is going to do lazy quantifing. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)