re.escape() equivalent in Julia? - regex

I have a bunch of abbreviations I'd like to use in RegEx matches, but they contain lots of regex reserved characters (like . ? $).
In Python you're able to return an escaped (regex safe) string using re.escape. For example:
re.escape("Are U.S. Pythons worth any $$$?") will return 'Are\\ U\\.S\\.\\ Pythons\\ worth\\ any\\ \\$\\$\\$\\?'
From my (little) experience with Julia so far, I can tell there's probably a much more straightforward way of doing this in Julia, but I couldn't find any previous answers on the topic.

Julia uses the PCRE2 library underneath, and uses its regex-quoting syntax to automatically escape special characters when you join a Regex with a normal String. For example:
julia> r"\w+\s*" * raw"Are U.S. Pythons worth any $$$?"
r"(?:\w+\s*)\QAre U.S. Pythons worth any $$$?\E"
Here we've used a raw string to make sure that none of the characters are interpreted as special, including the $s.
If we need interpolation, we can use a normal String literal instead. In this case, the interpolation is done first, and the result is then quoted with \Q ... \E.
julia> snake = "Python"
"Python"
julia> r"\w+\s*" * "Are U.S. $snake worth any money?"
r"(?:\w+\s*)\QAre U.S. Python worth any money?\E"
So you can place the part of the regex you wish to be quoted in a normal String, and it'll be quoted automatically when you join it up with a Regex.
You can even do it directly within the regex yourself - \Q starts a region where none of the regex-special characters are interpreted as special, and \E ends that region. Everything within such a region is treated literally by the regex engine.
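For comparison with the Python call in the question, here is a minimal sketch of the same idea in Ruby, which is not the language asked about. Regexp.escape is Ruby's counterpart to re.escape; the sample strings and variable names are made up for illustration.

```ruby
# Hypothetical sample text; single quotes keep Ruby from treating anything as special.
abbreviation = 'Are U.S. Pythons worth any $$$?'

# Regexp.escape backslash-escapes the regex metacharacters in the string.
escaped = Regexp.escape(abbreviation)

# Interpolate the escaped text next to ordinary regex syntax.
pattern = /\w+\s*#{escaped}/

# The dots and dollar signs now only match themselves.
matches_literal   = pattern.match?('Well Are U.S. Pythons worth any $$$?')
matches_lookalike = pattern.match?('Well Are UXSX Pythons worth any $$$?')

puts matches_literal
puts matches_lookalike
```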

Related

Using Ruby gsub with regex as replacement

Ruby's gsub accepts a regex as the pattern to match against the input,
and it also allows referring to a match group by number in the replacement.
For example, take a regex that detects a lowercase letter at the beginning of any word, with a replacement that puts an x before it and a y after it.
This gives exactly the expected result:
"testing gsub".gsub(/(?<=\b)[a-z]/,'x\0y')
#=> "xtyesting xgysub"
But suppose I want the replacement to convert the match to uppercase.
In other regex flavors, one can normally do this with \U\$0, as explained here.
Unfortunately, when I try it like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,'\U\0')
#=> "\\Utesting \\Ugsub"
Also, if I try using a regex literal in the replacement field like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,/\U\0/)
I get a type error:
TypeError (no implicit conversion of Regexp into String)
I'm totally aware of the option to do it with a block like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,&:upcase)
But unfortunately, the rules (pattern, replacement) are loaded from a .yaml file and applied to the string this way:
input.gsub(rule['pattern'], rule['replacement'])
and I can't store &:upcase in the .yaml file, because it would be loaded as a plain string.
A workaround would be to detect when the replacement is the string "upcase"
and handle it this way:
"testing gsub".gsub(/(?<=\b)[a-z]/) {|l| l.send("upcase")}
But I don't want to modify this logic:
input.gsub(rule['pattern'], rule['replacement'])
If there is a workaround to either use regex in gsub replacement, or to store methods like &:upcase in YAML without being loaded as a string, it'd be perfect.
Thanks!
TL;DR
You can't do what you want the way you want. This is documented in the Onigmo source. You'll have to use a different approach, or refactor other areas of your code to simulate the behavior you want.
Escapes Like \U Not Available in Ruby
Special escapes like \U are extensions to GNU sed or ported from the PCRE library. They are not part of Ruby's current regular expression engine. The Onigmo source clearly mentions that these escapes are missing:
A-3. Missing features compared with perl 5.18.0
+ \N{name}, \N{U+xxxx}, \N
+ \l,\u,\L,\U, \C
+ \v, \V, \h, \H
+ (?{code})
+ (??{code})
+ (?|...)
+ (?[])
+ (*VERB:ARG)
Other Approaches
You can do what you want in a number of different ways, such as using the block form of String#gsub to call String#upcase on each match. For example:
"testing gsub".gsub(/\b\p{Lower}+/) { |m| m.upcase }
#=> "TESTING GSUB"
You will also have to use the block form if you want to reliably reference certain match variables like $& or $1, as the variables might otherwise refer to text from previous matches. For illustration, consider:
"foo bar".gsub /\b\p{Lower}+/, "#{$&.upcase}"
#=> "BAR BAR"
As this is primarily an X/Y problem, you may be happier with the answers you receive if you post a related question with an example of your YAML source and your current code for parsing your regular expression matches/substitutions. Perhaps there's a way to wrap or refactor your code that you haven't considered, but you aren't going to be able to solve this the way you want.
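One possible shape for such a refactor is sketched below. It is only an illustration under stated assumptions: the `transform` key, the `apply_rule` helper, the allow-list, and the inline YAML are all hypothetical, and it assumes patterns are stored as strings and compiled with Regexp.new. Rules without a `transform` key still go through the plain gsub(pattern, replacement) call.

```ruby
require "yaml"

# Hypothetical rule file. The quoted heredoc and YAML single quotes both leave
# backslashes untouched, so \b and \0 reach Ruby exactly as written.
RULES_YAML = <<~'YAML'
  - pattern: '(?<=\b)[a-z]'
    replacement: 'x\0y'
  - pattern: '(?<=\b)[a-z]'
    transform: upcase
YAML

# Only these String methods may be named in a rule (an assumption of this sketch).
ALLOWED_TRANSFORMS = %w[upcase downcase capitalize swapcase].freeze

# Applies one rule: block form when a transform is named, string form otherwise.
def apply_rule(input, rule)
  pattern = Regexp.new(rule.fetch("pattern"))
  transform = rule["transform"]
  return input.gsub(pattern, rule.fetch("replacement")) unless transform

  unless ALLOWED_TRANSFORMS.include?(transform)
    raise ArgumentError, "unsupported transform: #{transform}"
  end
  input.gsub(pattern) { |match| match.public_send(transform) }
end

rules = YAML.safe_load(RULES_YAML)
results = rules.map { |rule| apply_rule("testing gsub", rule) }
puts results
```

The allow-list keeps a YAML file from naming arbitrary methods, which a bare send on the loaded string would permit.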

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because they make it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to use something else, like %r{…}, instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, this seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...
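A small sketch of that interpolation trick in use, next to Regexp.new, which avoids literal delimiters altogether. The trailing `bar` and the test strings are made up for illustration.

```ruby
# Interpolating the "<" keeps it out of the delimiter matching done for %r<...>.
not_after_foo = %r<(?#{'<'}!foo)bar>

# The same pattern built from a single-quoted string, with no regex delimiters involved.
from_string = Regexp.new('(?<!foo)bar')

# Both end up with the same pattern source.
same_source = not_after_foo.source == from_string.source

# The lookbehind behaves as usual: "bar" matches unless "foo" comes right before it.
after_baz = not_after_foo.match?("bazbar")
after_foo = not_after_foo.match?("foobar")

puts same_source, after_baz, after_foo
```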

white space in Regular expression

I am using this software, dk-brics-automaton, to get the number of states
of regular expressions. For example, I have this type of RE:
^SEARCH\s+[^\n]{10}
When I pass it as a string literal, as below, the compiler reports an invalid escape sequence:
RegExp r = new RegExp("^SEARCH\s+[^\n]{10}", ALL);
where ALL is a certain flag.
When I use a double backslash before the small s, the compiler accepts it
as a string. But \s is supposed to mean whitespace, and I am confused: with a
double backslash, won't it be read as just a backslash and an "s" instead of the whitespace I meant?
I have thousands of such regular expressions for which I want to compute finite automaton
states. So does that mean I have to add the backslashes manually in all the REs?
Here is a link where they explain something related to this, but I don't understand it:
http://www.brics.dk/automaton/doc/index.html
Please help if you have experience with this software or any idea how to solve this issue.
I had another look at that documentation. "automaton" is a Java package, therefore I think you have to treat its patterns like Java regexes. So just double every backslash inside a regex.
The thing here is, Java does not know "raw" strings. So you have to escape for two levels. The first level that evaluates escape sequences is the string level.
The string does not know an escape sequence \s, and that is the error. \n is fine: the string evaluates it and stores, instead of the two characters \ (0x5C) and n (0x6E), the single character 0x0A.
Then the string is stored and handed over to the regex constructor. This is where the next round of escape sequence evaluation happens.
So if you want to escape for the regex level, then you have to double the backslashes. The string level will evaluate the \\ to \ and so the regex level gets the correct escape sequences.
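The two levels can be observed in a short sketch. It is written in Ruby rather than Java, purely as an analogy: Ruby's double-quoted literals process backslash escapes at the string level much as Java's do, while its single-quoted literals leave \s and \n alone, which makes the comparison easy to see. The sample input is made up.

```ruby
# Double-quoted literal: the string level turns each \\ into one backslash,
# so the regex level later receives \s and \n as escape sequences.
doubled = "^SEARCH\\s+[^\\n]{10}"

# Single-quoted literal for comparison: \s and \n are stored as written.
as_written = '^SEARCH\s+[^\n]{10}'

# After the string level, both hold the same characters.
same_text = doubled == as_written

# The regex level then interprets \s as whitespace and \n as a newline.
matched = Regexp.new(doubled).match?("SEARCH   0123456789")

puts doubled
puts same_text, matched
```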

Syntax highlighting for regular expressions in Vim

Whenever I look at regular expressions of any complexity, my eyes start to water. Is there any existing solution for giving different colors to the different kinds of symbols in a regex expression?
Ideally I'd like different highlighting for literal characters, escape sequences, class codes, anchors, modifiers, lookaheads, etc. Obviously the syntax changes slightly across languages, but that is a wrinkle to be dealt with later.
Bonus points if this can somehow coexist with the syntax highlighting Vim does for whatever language is using the regex.
Does this exist somewhere, or should I be out there making it?
Regular expressions might not be syntax-highlighted, but you can look into making them more readable by other means.
Some languages allow you to break regular expressions across multiple lines (perl, C#, Javascript). Once you do this, you can format it so it's more readable to ordinary eyes. Here's an example of what I mean.
You can also use the advanced (?x) syntax explained here in some languages. Here's an example:
(?x: # Find word-looking things without vowels, if they contain an "s"
\b # word boundary
[b-df-hj-np-tv-z]* # nonvowels only (zero or more)
s # there must be an "s"
[b-df-hj-np-tv-z]* # nonvowels only (zero or more)
\b # word boundary
)
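As a runnable illustration of this layout, here is a sketch in Ruby, whose /x flag enables the same extended syntax. The constant name and the sample words are made up, and the character class is written as a plain (non-negated) consonant class so that it matches the "without vowels" description.

```ruby
# Word-looking things without vowels, provided they contain an "s".
VOWELLESS_WORD_WITH_S = /
  \b                  # word boundary
  [b-df-hj-np-tv-z]*  # consonants only (zero or more)
  s                   # there must be an "s"
  [b-df-hj-np-tv-z]*  # consonants only (zero or more)
  \b                  # word boundary
/x

found = "hymns psst banana".scan(VOWELLESS_WORD_WITH_S)
puts found.inspect
```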
EDIT:
As Al pointed out, you can also use string concatenation if all else fails. Here's an example:
regex = "" # Find word-looking things without vowels, if they contain an "s"
+ "\\b" # word boundary
+ "[b-df-hj-np-tv-z]*" # nonvowels only (zero or more)
+ "s" # there must be an "s"
+ "[b-df-hj-np-tv-z]*" # nonvowels only (zero or more)
+ "\\b"; # word boundary
This Vim plugin claims to do syntax highlighting:
http://www.vim.org/scripts/script.php?script_id=1091
I don't think it's exactly what you want, but I guess it's adaptable for your own use.
Vim already has syntax highlighting for perl regular expressions. Even if you don't know perl itself, you can still write your regex in perl (open a new buffer, set the filetype to perl and insert '/regex/') and the regex will work in many other languages such as PHP, Javascript or Python where they have used the PCRE library or copied Perl's syntax.
In a vimscript file, you can insert the following line of code to get syntax highlighting for regex:
let testvar =~ "\(foo\|bar\)"
You can play around with the regex in double-quotes until you have it working.
It is very difficult to write syntax highlighting for regex in some languages because the regex are written inside quoted strings (unlike Perl and Javascript where they are part of the syntax). To give you an idea, this syntax script for PHP does highlight regex inside double- and single-quoted strings, but the code to highlight just the regex is longer than most languages' entire syntax scripts.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L. For example, with GNU sed (where \L is an extension) in a POSIX shell:
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as marking the beginning of a special character (as with \s, \w, \b and so forth), instead of escaping a special character to make it a literal. So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/A-Z/a-z/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.
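As one illustration of the callback approach, here is a minimal sketch in Ruby, where the block passed to String#sub plays the role of the callback. The file name and pattern are adapted from the rename example above; the variable names are made up.

```ruby
name = "artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a"

# The block is called for the match; its return value becomes the replacement.
renamed = name.sub(/\A(.*_-_)(.*)(_-_.*\.m4a)\z/) do
  m = Regexp.last_match
  # Keep groups 1 and 3 as they are and lowercase only group 2.
  "#{m[1]}#{m[2].downcase}#{m[3]}"
end

puts renamed
```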