do we ever use regex to find regex expressions? - regex

Let's say I have a very long string that contains regular expressions at random locations. Can I use a regex to find those regexes?

(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between two / characters, but that might not always be a regexp. For example, such a search would return /2 + 3/ for the string var myNumber = 1/2 + 3/4. This means that you have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/:
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?<=^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript did not support lookbehind before ES2018. To stay compatible with older engines, I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier lets repeated exec() calls walk through every match
var match = searchValue.exec(codeString); // ['typeof /regex/', '/regex/'], or null if nothing matched
var literal = match ? match[1] : null; // "/regex/"
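To collect every candidate rather than just the first, the same pattern can be run in a loop. This is only a sketch: findRegexLiterals is a made-up helper name, it uses the variant that also grabs modifiers, and it inherits the false positives in comments mentioned above.

```javascript
// Hypothetical helper: returns every regexp-literal candidate found in a
// string of JavaScript source, using the capture-group pattern from above.
function findRegexLiterals(source) {
  const pattern = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)/g;
  const found = [];
  let match;
  // With the g flag, each exec() call resumes where the previous one ended.
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]); // group 1 is the literal without its leading context
  }
  return found;
}

console.log(findRegexLiterals("function(){typeof /regex/;}")); // [ '/regex/' ]
console.log(findRegexLiterals("var myNumber = 1/2 + 3/4;"));   // []
```

Because the leading context is consumed by the non-capturing group, only match[1] is of interest; match[0] also contains the keyword or punctuation in front of the literal.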
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.

What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
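A quick way to see the risk is to run that literal-matching pattern over ordinary code. The following is just an illustration in JavaScript; findSlashDelimited is a name invented for the example.

```javascript
// The slash-delimited pattern from above, with the g flag to find all matches.
const naiveLiteral = /\/(?:[^\\\/]|\\.)*\//g;

// Hypothetical helper: returns everything that looks like /.../ in the text.
function findSlashDelimited(text) {
  return text.match(naiveLiteral) || [];
}

// Escaped slashes inside a literal are handled correctly...
console.log(findSlashDelimited("x = /a\\/b/;")); // [ '/a\\/b/' ]
// ...but two divisions on one line look exactly like a literal.
console.log(findSlashDelimited("var myNumber = 1/2 + 3/4;")); // [ '/2 + 3/' ]
```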

It's not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break the text up into substrings and finding ones that look like regular expressions, perhaps delimited by quotation marks). However, don't forget that a very long string CAN be a regex, so you will never have complete confidence using this approach.

Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario). What you have to do is scan the string for the expected delimiter, extract whatever is in between the delimiters (paying attention to escape chars), and you should be set.
This works if your delimiter is a known character and if you are sure that it appears an even number of times, or if you are willing to discard the rest (for example, which pairs of delimiters would you consider in the following string: aaa/b/aaa/c/aaa/d?).
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
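The scan described above can be sketched in a few lines of JavaScript. This assumes a single-character delimiter and backslash as the escape character; extractDelimited is a name made up for the example, and a trailing unpaired delimiter is simply discarded.

```javascript
// Hypothetical helper: returns the text found between successive pairs of
// the given delimiter, skipping over backslash-escaped characters.
function extractDelimited(text, delimiter = "/") {
  const segments = [];
  let start = -1; // index just after an opening delimiter, or -1 when outside
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "\\") {
      i++; // skip the escaped character, even if it is the delimiter
      continue;
    }
    if (ch !== delimiter) continue;
    if (start === -1) {
      start = i + 1; // opening delimiter
    } else {
      segments.push(text.slice(start, i)); // closing delimiter
      start = -1;
    }
  }
  return segments; // an unpaired final delimiter contributes nothing
}

console.log(extractDelimited("aaaaa/b/aaaaa"));     // [ 'b' ]
console.log(extractDelimited("aaa/b/aaa/c/aaa/d")); // [ 'b', 'c' ]
```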
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexps; try man re_format if you're under *BSD). If that is the case, you could try every possible substring of every length of the given string and feed it to a regexp parser to check its syntax. Even then, you have proven nothing about what the regexps actually match, only that they parse.
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.
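To show why the brute-force route is not very informative, here is a rough sketch using the JavaScript RegExp constructor as the syntax checker instead of a POSIX parser (that swap is an assumption made for the example, and both function names are invented). Nearly every substring of ordinary text turns out to be a valid pattern.

```javascript
// Returns true if the text compiles as a JavaScript regular expression.
function isValidRegexSource(text) {
  try {
    new RegExp(text);
    return true;
  } catch (error) {
    return false; // SyntaxError for things like an unclosed group
  }
}

// Hypothetical helper: every substring that parses as a regex.
// Quadratic number of candidates, so only sensible for short inputs.
function validSubstrings(text) {
  const valid = [];
  for (let start = 0; start < text.length; start++) {
    for (let end = start + 1; end <= text.length; end++) {
      const candidate = text.slice(start, end);
      if (isValidRegexSource(candidate)) valid.push(candidate);
    }
  }
  return valid;
}

console.log(validSubstrings("ab"));  // [ 'a', 'ab', 'b' ]
console.log(validSubstrings("a(b")); // [ 'a', 'b' ]
```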

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to just use something else instead like %r{…} instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, this seems like an oversight based on how this character differs from other paired boundaries.
As for "Is there really no way to escape the < here?", there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

Regex for a syntax colouring scheme

I'm working on a syntax coloring scheme for my favourite programming language, OOREXX. The language isn't important, as my question is purely about a REGEX.
Simple description: A regex to match any of a bunch of words, but they must have a "~" prefix or a "(" suffix or both
Full description:
I want to match any of a bunch of words. They are the names of functions. This is easy, something like:
(stream|Strip|Substr) etc.
But the word "strip" (for example) might occur in my code when not a function name:
Strip = 1 -- Set variable "Strip" to 1
So, I need to be more precise. The function names must have either a leading "~" or a trailing "(" or both
This is where my REGEX skill fails. I could get around this in my colouring scheme by using two elements, one to catch "~strip" and one to catch "strip(" but that means duplicating, and maintaining, the list of function names. That goes against the grain...
Simply use alternation. In case lookbehinds are supported, you can use
(?<=~)strip|strip(?=\()
If you want something fancy and your regex engine supports lookbehind and if clauses, you can avoid alternation - though it won't be any more performant, e.g.
((?<=~))?strip(?(1)|(?=\())
And if you don't have lookbehinds, you can still use grouping and extract from the captured groups, e.g.
~(strip)|(strip)\(
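To avoid maintaining the list of names twice, the alternation can be generated from a single list. Below is a small JavaScript sketch of that idea; it assumes an engine with lookbehind support (ES2018 or later), the helper names are invented, and the i flag reflects an assumption that the names should match case-insensitively.

```javascript
// Hypothetical helper: builds the "~name or name(" pattern from one list.
function buildFunctionNamePattern(names) {
  const alt = names.join("|");
  // Either preceded by "~", or followed by "(" (both may hold at once).
  return new RegExp(`(?<=~)(?:${alt})\\b|\\b(?:${alt})(?=\\()`, "gi");
}

// Hypothetical helper: returns the function names found in the code.
function findFunctionNames(code, names) {
  return code.match(buildFunctionNamePattern(names)) || [];
}

const names = ["stream", "strip", "substr"];
const code = 'Strip = 1\nsay name~strip\nsay Substr("abc", 2)';
console.log(findFunctionNames(code, names)); // [ 'strip', 'Substr' ]
```

The variable assignment on the first line is not matched, because Strip there has neither a leading ~ nor a trailing (.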
I recommend (over and over) using http://regexr.com to test regular expressions. I am not affiliated with them, but I program regular expressions 8 hours a day (sometimes), and it's a nice tool for practicing them. But to answer your question (in Java):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// If there is a matching function name within this string, this will
// return that name (including its "~" and/or "("), otherwise null.
public static String functionName(String functionNameStr)
{
    // Matches a name with a leading "~", a trailing "(", or both.
    // The "both" alternative is listed first so that ~name( is caught whole;
    // with ~\\w+ first, the trailing "(" would be left out of the match.
    // NOTE: In a Java string literal, each regex backslash is written twice.
    String RE = "(~\\w+\\(|~\\w+|\\w+\\()";
    Pattern P1 = Pattern.compile(RE);
    Matcher m = P1.matcher(functionNameStr);
    if (m.find()) return m.group();
    else return null;
}

List of allowed characters from regular expression

Does someone know about some way how to extract allowed characters from regular expression and construct user friendly message?
For example, by providing regular expression
^[a-zA-Z0-9&\-\+_\.\s]{1,10}$
to get something like
a-z A-Z 0-9 & - + _ . with spaces
I am using java. I can imagine that it could be too complicated or even impossible to cover all types of regular expressions, but maybe you know about some library, tool or algorithm that could help.
Thanks
Yes. It can be done.
What you need is:
Turn your regexp body into a string.
Parse that string (with a regex for instance) that will output the desired list.
Apply possible regexp options (such as ignore case to the result).
This is tedious work if you're not VERY familiar with Regexp. I actually have code in production doing just that, but it's proprietary so I can't post it here and it's not in Java.
I guess you should first ask yourself whether there is no simpler solution for your problem. If for instance your regexp is a constant, you could associate it with a by-hand list of accepted characters.
If your input is a character-class like the one you provided, you could match it with the expression
([^\\]-[^\\]|\\.|[^^$[\]])
that will give you a list of elements like "a-z", "\+", "_" that you could then tidy up a little further, e.g., removing the "\", and then print it nicely formatted.
And you could extract the length information using
{([0-9]+)(,([0-9]+))?}
that accepts {1,10} as well as {10} with the "from" and "to" values being captured each in their own group.
That should get you started.
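Putting those two expressions together might look like the following. It is written in JavaScript rather than Java, only handles the first [...] class in the pattern, and describeCharacterClass is an invented name, so treat it as a starting point rather than a general solution.

```javascript
// Hypothetical helper: describes the first character class and the {m,n}
// quantifier in a regex source string. Returns null if there is no class.
function describeCharacterClass(regexSource) {
  // Grab the body of the first [...] class, allowing escaped characters.
  const classMatch = /\[((?:\\.|[^\]\\])*)\]/.exec(regexSource);
  if (classMatch === null) return null;

  // Split the body into ranges (a-z), escapes (\+) and single characters.
  const tokens = classMatch[1].match(/[^\\]-[^\\]|\\.|[^^$[\]]/g) || [];
  const allowed = tokens.map((token) => {
    if (token === "\\s") return "whitespace";
    return token.startsWith("\\") ? token.slice(1) : token; // drop the "\"
  });

  // {1,10} gives min 1 and max 10; {10} gives min and max 10.
  const lengthMatch = /\{([0-9]+)(?:,([0-9]+))?\}/.exec(regexSource);
  const min = lengthMatch ? Number(lengthMatch[1]) : null;
  const max = lengthMatch ? Number(lengthMatch[2] ?? lengthMatch[1]) : null;
  return { allowed, min, max };
}

const source = String.raw`^[a-zA-Z0-9&\-\+_\.\s]{1,10}$`;
console.log(describeCharacterClass(source).allowed.join(" "));
// a-z A-Z 0-9 & - + _ . whitespace
```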

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
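For illustration, here is how those two expressions behave in JavaScript (the question concerns another language, so this is only a sketch, and splitAtHash is a made-up name). Note that the first expression still matches when there is no # at all, while the second does not.

```javascript
// Hypothetical helper: applies the two expressions above to one string.
function splitAtHash(text) {
  const before = /^([^#]*)/.exec(text)[1]; // always matches, possibly empty
  const afterMatch = /#(.*)$/.exec(text);  // null when there is no "#"
  return { before, after: afterMatch ? afterMatch[1] : null };
}

console.log(splitAtHash("topics/install.xml#id_install"));
// { before: 'topics/install.xml', after: 'id_install' }
console.log(splitAtHash("topics/install.xml"));
// { before: 'topics/install.xml', after: null }
```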
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
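The difference between the greedy version and the negated character class shows up as soon as there is more than one '#'. A small JavaScript illustration follows (exec here behaves like the 'search' call mentioned above, and the indexes 0 and 1 play the role of group(0) and group(1)):

```javascript
const text = "a#b#c";

// Greedy: .* grabs as much as it can, so the group runs up to the LAST "#".
const greedy = /(.*)#.*/.exec(text);
console.log(greedy[0]); // "a#b#c"  (the entire match)
console.log(greedy[1]); // "a#b"    (the first group)

// Negated class: [^#]* cannot cross a "#", so the group stops at the FIRST one.
const negated = /([^#]*)#.*/.exec(text);
console.log(negated[1]); // "a"
```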
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];

Regular Expression to exclude set of Keywords

I want an expression that will fail when it encounters words such as "boon.ini" and "http". The goal would be to take this expression and be able to construct for any set of keywords.
^(?:(?!boon\.ini|http).)*$\r?\n?
(taken from RegexBuddy's library) will match any line that does not contain boon.ini and/or http. Is that what you wanted?
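Since the goal is to construct this for any set of keywords, the pattern can be assembled from a list. The following JavaScript sketch assumes the keywords are literal strings that need their regex metacharacters escaped; both function names are invented for the example, and the optional \r?\n? tail is left out because it only matters when consuming whole lines.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds ^(?:(?!kw1|kw2|...).)*$ for the given keywords.
function buildExclusionPattern(keywords) {
  const alternatives = keywords.map(escapeRegex).join("|");
  return new RegExp("^(?:(?!" + alternatives + ").)*$");
}

const excludes = buildExclusionPattern(["boon.ini", "http"]);
console.log(excludes.test("see boon.ini here")); // false
console.log(excludes.test("http://example"));    // false
console.log(excludes.test("a harmless line"));   // true
```

Escaping matters here: without it the dot in boon.ini would match any character, so a line containing boonXini would be rejected too.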
An alternative expression that could be used:
^(?!.*IgnoreMe).*$
^ = indicates start of line
$ = indicates the end of the line
(?! Expression) = indicates zero width look ahead negative match on the expression
The ^ at the front is needed; otherwise, the negative lookahead could be evaluated starting from somewhere within or beyond the 'IgnoreMe' text and produce a match where you don't want it to.
e.g. If you use the regex:
(?!.*IgnoreMe).*$
With the input "Hello IgnoreMe Please", this will result in something like "gnoreMe Please", as the negative lookahead finds that there is no complete string 'IgnoreMe' after the 'I'.
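The effect of the missing anchor is easy to reproduce. A short JavaScript check, assuming the same sample input:

```javascript
const anchored = /^(?!.*IgnoreMe).*$/;
const unanchored = /(?!.*IgnoreMe).*$/;
const input = "Hello IgnoreMe Please";

// Anchored: the lookahead is only tried at the start of the line, where
// "IgnoreMe" is still ahead, so the whole line is rejected.
console.log(anchored.test(input)); // false

// Unanchored: the engine keeps moving the start position forward until
// "IgnoreMe" no longer lies ahead, then matches the rest of the line.
console.log(unanchored.exec(input)[0]); // "gnoreMe Please"

// A line without the keyword passes the anchored version unchanged.
console.log(anchored.test("Hello Please")); // true
```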
Rather than negating the result within the expression, you should do it in your code. That way, the expression becomes pretty simple.
\b(boon\.ini|http)\b
Would return true if boon.ini or http was anywhere in your string. It won't match words like httpd or httpxyzzy because of the \b, or word boundaries. If you want, you could just remove them and it will match those too. To add more keywords, just add more pipes.
\b(boon\.ini|http|foo|bar)\b
You might be well served by writing a regex that will succeed when it encounters the words you're looking for, and then inverting the condition.
For instance, in perl you'd use:
if (!/boon\.ini|http/) {
# the string passed!
}
^[^£]*$
The above expression excludes only the pound symbol (£) from the string. It allows all characters except that symbol.
Which language/regexp library? I thought your question was about ASP.NET, in which case you can see the "negative lookahead" section of this article:
http://msdn.microsoft.com/en-us/library/ms972966.aspx
Strictly speaking, the negation of a regular expression still defines a regular language, but very few libraries, languages, or tools let you express it directly.
Negative lookahead may serve the same purpose, but the actual syntax depends on what you are using. Tim's answer is an example with (?!...)
I used this (based on Tim Pietzcker answer) to exclude non-production subdomain URLs for Google Analytics profile filters:
^\w+-*\w*\.(?!(?:alpha(123)*\.|beta(123)*\.|preprod\.)domain\.com).*$
You can see the context here: Regex to Exclude Multiple Words