Using Ruby gsub with regex as replacement

Ruby's gsub accepts a regex as the pattern,
and it also allows match group numbers in the replacement.
For example, take a regex that detects a lowercase letter at the beginning of any word, with a replacement that puts an x before it and a y after it.
This gives the expected result:
"testing gsub".gsub(/(?<=\b)[a-z]/,'x\0y')
#=> "xtyesting xgysub"
But now I want to use a regex replacement to convert the match to uppercase.
In other regex flavors, one can normally do this with \U$0, as explained here.
Unfortunately, when I try it like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,'\U\0')
#=> "\\Utesting \\Ugsub"
Also, if I try using a raw regex in the replacement field like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,/\U\0/)
I get a type error:
TypeError (no implicit conversion of Regexp into String)
I'm totally aware of the option to do it by passing a block like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,&:upcase)
But unfortunately, the rules (pattern, replacement) are loaded from a .yaml file and applied to the string this way:
input.gsub(rule['pattern'], rule['replacement'])
and I am not able to store &:upcase in the YAML file, because it gets loaded as a plain string.
A workaround would be to detect whether the replacement is the string "upcase"
and handle it this way:
"testing gsub".gsub(/(?<=\b)[a-z]/) {|l| l.send("upcase")}
But I don't want to modify this logic:
input.gsub(rule['pattern'], rule['replacement'])
If there is a workaround to either use regex in gsub replacement, or to store methods like &:upcase in YAML without being loaded as a string, it'd be perfect.
Thanks!

TL;DR
You can't do what you want the way you want. This is documented in the Onigmo source. You'll have to use a different approach, or refactor other areas of your code to simulate the behavior you want.
Escapes Like \U Not Available in Ruby
Special escapes like \U are extensions to GNU sed or ported from the PCRE library. They are not part of Ruby's current regular expression engine. The Onigmo source clearly mentions that these escapes are missing:
A-3. Missing features compared with perl 5.18.0
+ \N{name}, \N{U+xxxx}, \N
+ \l,\u,\L,\U, \C
+ \v, \V, \h, \H
+ (?{code})
+ (??{code})
+ (?|...)
+ (?[])
+ (*VERB:ARG)
Other Approaches
You can do what you want in a number of different ways, such as using the block form of String#gsub to call String#upcase on each match. For example:
"testing gsub".gsub(/\b\p{Lower}+/) { |m| m.upcase }
#=> "TESTING GSUB"
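You will also have to use the block form if you want to reliably reference match variables like $& or $1. In the non-block form, the replacement string is interpolated once, before gsub runs, so those variables refer to whatever a previous match left behind (or nil if nothing has matched yet). For illustration, consider what this returns when an earlier match has left "bar" in $&: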
"foo bar".gsub /\b\p{Lower}+/, "#{$&.upcase}"
#=> "BAR BAR"
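The timing issue can be made explicit with a small sketch. The method name `compare_replacements` and the priming match against "first" are made up for illustration; the point is only that the interpolated string is built before gsub starts, while the block is evaluated once per match.

```ruby
# Hypothetical helper contrasting an eagerly interpolated replacement
# string with the block form of gsub.
def compare_replacements(input)
  "first" =~ /\w+/ # an unrelated earlier match; sets $& to "first"

  # "#{$&.upcase}" is evaluated here, before gsub runs, so it is "FIRST".
  eager = input.gsub(/\b\p{Lower}+/, "#{$&.upcase}")

  # The block runs once per match, with $& set to the current match.
  per_match = input.gsub(/\b\p{Lower}+/) { $&.upcase }

  [eager, per_match]
end

eager, per_match = compare_replacements("foo bar")
puts eager     # prints "FIRST FIRST"
puts per_match # prints "FOO BAR"
```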
As this is primarily an X/Y problem, you may be happier with the answers you receive if you post a related question with an example of your YAML source and your current code for parsing your regular expression matches/substitutions. Perhaps there's a way to wrap or refactor your code that you haven't considered, but you aren't going to be able to solve this the way you want.
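One possible refactoring, sketched under assumptions: the rule format below (an optional `transform` key next to `pattern` and `replacement`), the constant `ALLOWED_TRANSFORMS`, and the helpers `apply_rule` and `apply_rules` are all hypothetical, since the asker's real YAML and loader are not shown. It does change the single gsub call, which the question hoped to avoid, but it keeps plain replacement rules working unchanged.

```ruby
require "yaml"

# Hypothetical rule format (an assumption, not the asker's real file):
# each rule has a 'pattern' plus either a literal 'replacement' string
# or a 'transform' naming a whitelisted String method.
RULES_YAML = <<~'YAML'
  - pattern: '(?<=\b)[a-z]'
    transform: upcase
  - pattern: '(\w+) (\w+)'
    replacement: '\2 \1'
YAML

# Only these method names may be called on a match.
ALLOWED_TRANSFORMS = %w[upcase downcase capitalize swapcase].freeze

def apply_rule(input, rule)
  pattern = Regexp.new(rule.fetch("pattern"))
  if rule.key?("transform")
    name = rule["transform"]
    unless ALLOWED_TRANSFORMS.include?(name)
      raise ArgumentError, "unsupported transform: #{name}"
    end
    # Block form: the method is applied to each match individually.
    input.gsub(pattern) { |match| match.public_send(name) }
  else
    # Plain form: backreferences such as \1 still work as usual.
    input.gsub(pattern, rule.fetch("replacement"))
  end
end

def apply_rules(input, rules)
  rules.reduce(input) { |text, rule| apply_rule(text, rule) }
end

rules = YAML.safe_load(RULES_YAML)
puts apply_rules("testing gsub", rules) # prints "Gsub Testing"
```

Checking the name against a whitelist before calling `public_send` keeps a rule file from invoking arbitrary methods on the matched string.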

Related

The regular expression used in JavaScript does not work in Java

replace(/[.?+^$[\]\\(){}|-]/g, '\\$&');
But it doesn't work in Java.
So I changed the code as follows.
replace(/[.?+^$[\\]\\\\(){}|-]/g, '\\\\$&');
It doesn't work when I change it. Please help me :(
In Java, String.replace does not take a regex as its argument; it replaces literal text. For a regex you need replaceFirst or replaceAll.
Since you are using the /g flag in JavaScript to replace all occurrences, replaceAll is the equivalent.
In JavaScript, $& in the replacement refers to the full match.
So you want to replace the full match (which is one of the characters in [.?+^$[\]\\(){}|-]) with itself, prepended by a \
In Java you can use $0 instead to refer to the full match.
You also need to escape the opening square bracket inside the character class (\\[ in a Java string), because in Java an unescaped [ inside a class starts a nested character class.
For example
System.out.println("{test?test^}".replaceAll("[.?+^$\\[\\]\\\\()\\{}|-]", "\\\\$0"));
Output
\{test\?test\^\}
The same output in Javascript
console.log("{test?test^}".replace(/[.?+^$[\]\\(){}|-]/g, '\\$&'));

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to use something else, like %r{…}, instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...
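For comparison, here is a minimal sketch of the alternatives that avoid the problem entirely. The constant names are made up for illustration, and "bar" is appended to the lookbehind only so there is something to match against.

```ruby
# The same negative lookbehind written without %r<...> delimiters.
SLASHES     = /(?<!foo)bar/
BRACES      = %r{(?<!foo)bar}
FROM_STRING = Regexp.new('(?<!foo)bar')

# All three compile to the same pattern source.
puts [SLASHES, BRACES, FROM_STRING].map(&:source).uniq.length # prints 1

puts "foobar".match?(BRACES) # prints false: "bar" is preceded by "foo"
puts "bazbar".match?(BRACES) # prints true
```

`Regexp.new` with a single-quoted string is handy when the pattern contains every delimiter you might otherwise pick.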

Lua pattern parentheses and 0 or 1 occurrence

I'm trying to match a string against a pattern, but there's one thing I haven't managed to figure out. In a regex I'd do this:
Strings:
en
eng
engl
engli
englis
english
Pattern:
^en(g(l(i(s(h?)?)?)?)?)?$
I want all strings to be a match.
In Lua pattern matching I can't get this to work.
Even a simpler example like this won't work:
Strings:
fly
flying
Pattern:
^fly(ing)?$
Does anybody know how to do this?
You can't make match-groups optional (or repeat them) using Lua's quantifiers ?, *, + and -.
In the pattern (%d+)?, the question mark loses its special meaning and will simply match the literal ? as you can see by executing the following lines of code:
text = "a?"
first_match = text:match("((%w+)?)")
print(first_match)
which will print:
a?
AFAIK, the closest you can come in Lua would be to use the pattern:
^eng?l?i?s?h?$
which (of course) matches strings like "enh", "enls", ... as well.
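The difference between the nested optional groups and the flattened pattern can be checked in an engine that supports both. The sketch below uses Ruby regular expressions as a stand-in (Lua patterns cannot express the nested form), with \A and \z as the string anchors; the constant names are made up for illustration.

```ruby
# Nested optional groups: each letter is allowed only if the previous one is present.
NESTED = /\Aen(g(l(i(s(h?)?)?)?)?)?\z/
# Flattened form: every letter after "en" is independently optional.
FLAT = /\Aeng?l?i?s?h?\z/

WORDS = %w[en eng engl engli englis english].freeze

puts WORDS.all? { |w| NESTED.match?(w) } # prints true
puts WORDS.all? { |w| FLAT.match?(w) }   # prints true

# The flattened pattern also accepts strings with letters skipped.
puts FLAT.match?("enh")   # prints true
puts NESTED.match?("enh") # prints false
```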
In Lua, the parentheses are only used for capturing. They don't create atoms.
The closest you can get to the patterns you want is:
'^flyi?n?g?$'
'^eng?l?i?s?h?$'
If you need the full power of a regular expression engine, there are bindings to common engines available for Lua. There's also LPeg, a library for creating PEGs, which comes with a regular expression engine as an example (not sure how powerful it is).

do we ever use regex to find regex expressions?

Let's say I have a very long string. The string has regular expressions at random locations. Can I use a regex to find the regexes?
(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between /, but that might not always be a regexp. For example, such a search would return /2 + 3/ of the string var myNumber = 1/2 + 3/4. This means that you will have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/:
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?<=^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript did not support lookbehind before ES2018, so it may not be available in older engines. I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using a code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier is necessary!
var match = searchValue.exec(codeString); // "['typeof /regex/','/regex/']"
match = match[1]; // "/regex/"
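The same capturing-group idea can be sketched in Ruby, where `String#scan` returns the captured group for every match. This is an adaptation rather than the exact regexp above: the names `REGEX_LITERAL` and `find_regex_literals` are hypothetical, `^` stands in for `^|\n` because Ruby's `^` already matches at the start of each line, and the optional flags (`\w*`) are included.

```ruby
# Context that may precede a regexp literal, followed by the literal
# itself in capture group 1 (including optional flags such as "gi").
REGEX_LITERAL = %r{
  (?:^|[^\s\w/]|\b(?:return|typeof|case|throw|void|in))  # preceding context
  \s*
  (/(?:\\/|[^/*\n])(?:\\/|[^/\n])*/\w*)                  # the literal
}x

# Returns every JavaScript-style regexp literal found in the source text.
def find_regex_literals(code)
  code.scan(REGEX_LITERAL).flatten
end

js = 'var a = 1/2 + 3/4; var re = /ab+c/gi; return /x\/y/;'
puts find_regex_literals(js).inspect
# prints ["/ab+c/gi", "/x\\/y/"]
```

The divisions in `1/2 + 3/4` are skipped because a digit before the slash is not accepted as preceding context. As with the original, comments and string contents can still produce false positives.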
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.
What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
It's not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break the text up into substrings and finding ones that look like regular expressions, perhaps delimited by quotation marks). However, don't forget that a very long string CAN itself be a regex, so you will never have complete confidence using this approach.
Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario); what you have to do is scan the string for the expected delimiter, extract whatever is between the delimiters (paying attention to escape chars), and you should be set.
This works if your delimiter is a known character and if you are sure that it appears an even number of times, or if you are willing to discard the rest (for example, which set of delimiters are you considering in the following string: aaa/b/aaa/c/aaa/d)
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexp-- try man re_format if you're under *BSD). If that is the case you could try every possible substring of every length of the given string and feed it to a regexp parser for syntax correctness. Still, you have proven nothing of the validity of the regexp, i.e. on what they actually match.
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.

How to remove a small part of the string in the big string using RegExp

Hey guys, I don't know RegExp yet. I know a little about it, but I'm not an experienced user.
Supposed that I run a RegExp match on a website, the matches are:
Data: Informations
Data: Liberty
Then I want to extract only Informations and Liberty, I don't want the Data: part.
Does Data: always appear at the beginning of a line?
Can there be multiple spaces between the : and the next word?
Do you know about groups?
What do you want: lazy matching vs greedy matching?
If so, you can use (with lazy matching):
^Data:\s+(.*?)$
With character classes:
^Data:\s+(\w+)$
if you know that it'll always be a word. Try this website.
Can't be absolutely sure without knowing more about the potential matches, but this should be at least a good starting point:
Data: (.*)$
That will return everything after "Data: " to the end of the line.
Search for a regular expression like
Data: (.*)
Then use the "first submatch", which is often referred to by "$1" or "\1", depending on the language you are using.
Regular expression engines support what are commonly called "capturing groups". If you surround a pattern or part of a pattern with (), the part of the string matched by that part of the regular expression will be captured.
The command(s) you use to do the matching will determine how to get these captured values. They may be stored in special variables (eg: $1, $2) or you may be able to specify the names of the variables either embedded within the regular expression or as arguments to the regular expression command. Exactly how depends on what language you are using.
So, read up on the regexp commands for the language of your choice and look for the term "capturing groups" or maybe just "groups".
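As a concrete illustration in one language, here is a minimal Ruby sketch of both styles of capturing group. The helper name `extract_values` and the group name `value` are made up for illustration; other languages expose the captures differently, as described above.

```ruby
# Collects the text after "Data:" on every line, using a numbered group.
def extract_values(text)
  text.scan(/^Data:\s+(.*?)$/).flatten
end

puts extract_values("Data: Informations\nData: Liberty\n").inspect
# prints ["Informations", "Liberty"]

# A single match, read back by group number and by group name.
m = "Data: Liberty".match(/Data:\s+(?<value>\w+)/)
puts m[1]      # prints "Liberty"
puts m[:value] # prints "Liberty"
```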