Regex for a syntax colouring scheme - regex

I'm working on a syntax coloring scheme for my favourite programming language, OOREXX. The language isn't important, as my question is purely about a REGEX.
Simple description: A regex to match any of a bunch of words, but they must have a "~" prefix or a "(" suffix or both
Full description:
I want to match any of a bunch of words. They are the names of functions. This is easy, something like:
(stream|Strip|Substr) etc.
But the word "strip" (for example) might occur in my code when not a function name:
Strip = 1 -- Set variable "Strip" to 1
So, I need to be more precise. The function names must have either a leading "~" or a trailing "(" or both
This is where my REGEX skill fails. I could get around this in my colouring scheme by using two elements, one to catch "~strip" and one to catch "strip(" but that means duplicating, and maintaining, the list of function names. That goes against the grain...

Simply use alternation. In case lookbehinds are supported, you can use
(?<=~)strip|strip(?=\()
If you want something fancy and your regex engine supports lookbehind and if clauses, you can avoid alternation - though it won't be any more performant, e.g.
((?<=~))?strip(?(1)|(?=\())
And if you don't have lookbehinds, you can still use grouping and extract from the captured groups, e.g.
~(strip)|(strip)\(
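To address the worry about maintaining the list of names twice, the alternation can be generated from a single list. The following is only a sketch in Python (not OOREXX or any particular editor's colouring syntax); the names `FUNCTION_NAMES`, `FUNCTION_RE` and `find_function_names` are invented for illustration, and it assumes an engine with lookbehind support:

```python
import re

# Hypothetical list of function names, kept in one place only.
FUNCTION_NAMES = ["stream", "strip", "substr"]

names = "|".join(re.escape(n) for n in FUNCTION_NAMES)

# A name preceded by "~", or a name followed by "(" (or both).
# The \b anchors keep "mystrip(" and "~stripper" from matching.
FUNCTION_RE = re.compile(
    rf"(?<=~)(?:{names})\b|\b(?:{names})(?=\()",
    re.IGNORECASE,
)

def find_function_names(line):
    """Return every function name in line that has a ~ prefix or ( suffix."""
    return [m.group(0) for m in FUNCTION_RE.finditer(line)]

print(find_function_names("Strip = 1"))              # []
print(find_function_names("x = Substr(s, 1)"))       # ['Substr']
print(find_function_names("y = s~strip()~stream"))   # ['strip', 'stream']
```

If the colouring scheme only accepts a literal pattern, the generated string in `FUNCTION_RE.pattern` could be pasted in, so the list still lives in one place.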

I recommend (over & over) using http://regexr.com to test Regular Expressions. I am not affiliated with them, but I program regular expressions 8 hours a day (sometimes)... It's a nice tool for practicing them.... But to answer your question (in Java)...
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// If the string contains a function name with a "~" prefix and/or a "("
// suffix, return the match (including those symbols); otherwise return null.
public static String functionName(String functionNameStr)
{
    // This Regular Expression matches ~name, or name(, or ~name(.
    // Because ~\w+ is tried first, ~name( is matched without its trailing "(".
    String RE = "(~\\w+|\\w+\\(|~\\w+\\()";
    // NOTE: In a Java string literal, every regex backslash must be doubled!
    // ALSO NOTE: This version puts a "precedence" on catching both symbols!
    // RE = "(~\\w+\\(|~\\w+|\\w+\\()"
    // Since ~func-name( is listed first, if both symbols are included,
    // the match will include both. Maybe this is relevant to your code/question.
    Pattern P1 = Pattern.compile(RE);
    Matcher m = P1.matcher(functionNameStr);
    if (m.find()) return m.group();
    else return null;
}

Related

Is there way to repeat specific regex previous used [duplicate]

Say I have a regex matching a hexadecimal 32 bit number:
([0-9a-fA-F]{1,8})
When I construct a regex where I need to match this multiple times, e.g.
(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})
Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?
I'd imagine something like (warning, invented syntax!)
(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})
where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.
Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.
RegEx Subroutines
When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.
Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET Framework is lacking, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).
Syntax
Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.
Standard RegEx
Any: [abc][abc][abc]
Subroutine, by Name
Perl:     (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby:   (?<name>[abc])\g<name>\g<name>
Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby:          ([abc])\g<1>\g<1>
Subroutine, by Relative Position
Perl:     ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby:   ([abc])\g<-1>\g<-1>
Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)
Examples
Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)
Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))
And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
More Info
http://regular-expressions.info/subroutine.html
http://regex101.com/
Why not do something like this? It's not really shorter, but it is a bit more maintainable.
String.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-fA-F]{1,8}");
If you want more self-documenting code, I would assign the number regex string to a properly named const variable.
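The same "named constant" idea carries over to any engine without subroutine support. As a rough sketch in Python rather than .NET (the names `HEXNUM`, `RANGE_RE` and `parse_range` are made up for this example):

```python
import re

# The reusable sub-expression, defined once under a descriptive name.
HEXNUM = r"[0-9a-fA-F]{1,8}"

# Python spells named groups (?P<name>...); the fragment is interpolated twice.
RANGE_RE = re.compile(rf"(?P<from>{HEXNUM})\s*:\s*(?P<to>{HEXNUM})")

def parse_range(text):
    """Return (from, to) as strings, or None if text is not 'hex : hex'."""
    m = RANGE_RE.fullmatch(text)
    if m is None:
        return None
    return m.group("from"), m.group("to")

print(parse_range("12345678 :87654321"))  # ('12345678', '87654321')
print(parse_range("xyz : 12"))            # None
```

Because the interpolated value is inserted as-is, the braces in `{1,8}` inside `HEXNUM` need no doubling; only braces written directly in the f-string literal would.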
.NET regex does not support subroutine calls (pattern recursion). In Ruby and PHP/PCRE you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)), where hex is a "technical" named capturing group whose name should not occur elsewhere in the main pattern. In .NET, you can instead define the block(s) as separate variables and then use them to build a dynamic pattern.
Starting with C#6, you may use an interpolated string literal that looks very much like a PCRE/Onigmo subpattern recursion, but is actually cleaner and has no potential bottleneck when the group is named identically to the "technical" capturing group:
C# demo:
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var block = "[0-9a-fA-F]{1,8}";
var pattern = $@"(?<from>{block})\s*:\s*(?<to>{block})";
Console.WriteLine(Regex.IsMatch("12345678 :87654321", pattern));
}
}
The $@"..." is a verbatim interpolated string literal, where escape sequences are treated as combinations of a literal backslash and a char after it. Make sure to define literal { with {{ and } with }} (e.g. $@"(?:{block}){{5}}" to repeat a block 5 times).
For older C# versions, use string.Format:
var pattern = string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", block);
as is suggested in Mattias's answer.
If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?
string f = @"fc\d+/";
string e = @"\d+";
Regex regexObj = new Regex(f+e);
Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.
e.g.
/\b([a-z])\w+\1\b/
This will match only the words "text" and "spaces" (each starts and ends with the same lowercase letter) in the following text:
This is a sample text which is not the title since it does not end with 2 spaces.
There is no such predefined class. I think you can simplify it using ignore-case option, e.g.:
(?i)(?<from>[0-9a-z]{1,8})\s*:\s*(?<to>[0-9a-z]{1,8})
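To refer back to a named capture group, use this syntax: \k<name> or \k'name'
For example:
(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>
Note that \k<from> is a backreference: it matches the exact text that the group captured, not the group's pattern again, so this only matches when both numbers are written identically.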
More info: http://www.regular-expressions.info/named.html
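A quick way to see what a named backreference does is to try it on two inputs. This sketch uses Python, where the backreference is spelled `(?P=name)` instead of `\k<name>`; the helper name `same_number_twice` is invented here:

```python
import re

# (?P=from) must repeat the exact text that group "from" captured;
# it does not re-apply the [0-9a-fA-F]{1,8} pattern.
BACKREF_RE = re.compile(r"(?P<from>[0-9a-fA-F]{1,8})\s*:\s*(?P=from)")

def same_number_twice(text):
    """True only if text is 'X : X' with the same hex string on both sides."""
    return BACKREF_RE.fullmatch(text) is not None

print(same_number_twice("1f : 1f"))  # True
print(same_number_twice("1f : 2a"))  # False
```

So a backreference is the right tool for "the same value again", while reusing the pattern itself needs subroutines or string building.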

Regex global search without using the global flag

I'm using software that only allows a single line regular expression for filtering and it doesn't allow the global modifier to capture all patterns in the string. Currently, my expression is only returning the first instance.
Is there another way to capture all instances of the pattern in the string?
Expression: (captures hi-res jpg urls)
\{\"hiRes\"\:\"([A-Za-z0-9%\/_:.-]+)\"\,\"thumb
String:
'colorImages': { 'initial': [{"hiRes":"http://sub.website.com/images/I/81OJ6qwKxyL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/41NQRigTUdL._SS40_.jpg","large":"http://sub.website.com/images/I/41NQRigTUdL.jpg","main":{"http://sub.website.com/images/I/81OJ6qwKxyL._SY355_.jpg":[272,355],"http://sub.website.com/images/I/81OJ6qwKxyL._SY450_.jpg":[345,450],"http://sub.website.com/images/I/81OJ6qwKxyL._SY550_.jpg":[422,550],"http://sub.website.com/images/I/81OJ6qwKxyL._SY606_.jpg":[465,606],"http://sub.website.com/images/I/81OJ6qwKxyL._SY679_.jpg":[521,679]},"variant":"MAIN"},{"hiRes":"http://sub.website.com/images/I/71oHZNvsLbL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/31lHNGD-ZDL._SS40_.jpg","large":"http://sub.website.com/images/I/31lHNGD-ZDL.jpg","main":{"http://sub.website.com/images/I/71oHZNvsLbL._SY355_.jpg":[197,355],"http://sub.website.com/images/I/71oHZNvsLbL._SY450_.jpg":[249,450],"http://sub.website.com/images/I/71oHZNvsLbL._SY550_.jpg":[305,550],"http://sub.website.com/images/I/71oHZNvsLbL._SY606_.jpg":[336,606],"http://sub.website.com/images/I/71oHZNvsLbL._SY679_.jpg":[376,679]},"variant":"PT01"},{"hiRes":"http://sub.website.com/images/I/91VCJAcIPEL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51G1gCkOFzL._SS40_.jpg","large":"http://sub.website.com/images/I/51G1gCkOFzL.jpg","main":{"http://sub.website.com/images/I/91VCJAcIPEL._SX355_.jpg":[355,341],"http://sub.website.com/images/I/91VCJAcIPEL._SX450_.jpg":[450,433],"http://sub.website.com/images/I/91VCJAcIPEL._SX425_.jpg":[425,409],"http://sub.website.com/images/I/91VCJAcIPEL._SX466_.jpg":[466,448],"http://sub.website.com/images/I/91VCJAcIPEL._SX522_.jpg":[522,502]},"variant":"PT02"},{"hiRes":"http://sub.website.com/images/I/912B68GN4aL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51elravQx6L._SS40_.jpg","large":"http://sub.websi
An interesting question. In my understanding, the global flag cannot be "emulated" with other Regex syntax features.
One could try to emulate the global flag by a Regex repetition. You could expand your Regex so that it would match all appearances of "hiRes":... in a repetition loop. But then, you would see that although several URLs would be matched because of the loop, only the last appearance would be captured.
Switching on the global flag does more than just "continue looking". It switches on collecting more than one capture in an array. Having just a Regex loop does not do the same.
I'd like to show two examples what this means. To test the examples, use e.g. https://regex101.com/.
Here is a simple example, first with the global flag:
Given text: a i b i c i
Regex: /(i)/g
Result: array of three strings, [0]="i" Pos.2, [1]="i" Pos.6, [2]="i" Pos.10
Now without the global flag. To match more, we must add a repetition to the Regex that embraces several "i", and a condition that ignores text between two "i". Like this:
Given text: a i b i c i
Regex: /(?:(i)[^i]*)+/
Result: array of one string, [0]="i" Pos.10
This seems puzzling at first, but it is correct. The Regex matches from position 2 until 10, and from that match it captures only the last "i", at position 10. So the repetition in the Regex does not produce several captures but one longer match. This is very different from what the global flag does.
To be precise, this behavior is called "greedy". It tries to match as much as possible. With the "U" flag or with certain quantifiers, you can make the Regex "ungreedy". In that case in the example above, your "ungreedily" captured "i" will be that of position 2.
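The three cases described above can be reproduced in a few lines. This is just an illustrative sketch in Python, whose `re.finditer` plays the role of the global flag; positions are 0-based, as in the example:

```python
import re

text = "a i b i c i"

# "Global" behaviour: every match is reported separately.
all_positions = [m.start(1) for m in re.finditer(r"(i)", text)]

# One match with a repeated group: only the last repetition is kept.
greedy = re.search(r"(?:(i)[^i]*)+", text)

# A lazy repetition stops after the first iteration.
lazy = re.search(r"(?:(i)[^i]*)+?", text)

print(all_positions)     # [2, 6, 10]
print(greedy.start(1))   # 10
print(lazy.start(1))     # 2
```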
As a more complex example, just enhance your initial Regex. It must ignore the text from the end of the URL until the next "hiRes", and a repetition must be put around the whole thing. Here it is:
/\{(?:"hiRes":"([A-Za-z0-9%\/_:.-]+)"(?:[^"]|"(?!hiRes))*)+/
The second part means: match as many characters as possible that are not a quote, or that are a quote not followed by hiRes. This way, the pattern digs forward until the beginning of the next "hiRes". Then the repetition comes in and it starts over with "hiRes".
Try it out. It will capture only the last URL in your text.
Finally, this tutorial is very comprehensive: http://www.regular-expressions.info/

Lua pattern parentheses and 0 or 1 occurrence

I'm trying to match a string against a pattern, but there's one thing I haven't managed to figure out. In a regex I'd do this:
Strings:
en
eng
engl
engli
englis
english
Pattern:
^en(g(l(i(s(h?)?)?)?)?)?$
I want all strings to be a match.
In Lua pattern matching I can't get this to work.
Even a simpler example like this won't work:
Strings:
fly
flying
Pattern:
^fly(ing)?$
Does anybody know how to do this?
You can't make match-groups optional (or repeat them) using Lua's quantifiers ?, *, + and -.
In the pattern (%d+)?, the question mark loses its special meaning and will simply match a literal ?, as you can see by executing the following lines of code:
text = "a?"
first_match = text:match("((%w+)?)")
print(first_match)
which will print:
a?
AFAIK, the closest you can come in Lua would be to use the pattern:
^eng?l?i?s?h?$
which (of course) matches strings like "enh", "enls", ... as well.
In Lua, the parentheses are only used for capturing. They don't create atoms.
The closest you can get to the patterns you want is:
'^flyi?n?g?$'
'^eng?l?i?s?h?$'
If you need the full power of a regular expression engine, there are bindings to common engines available for Lua. There's also LPeg, a library for creating PEGs, which comes with a regular expression engine as an example (not sure how powerful it is).
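To make the trade-off concrete, here is a small comparison run in Python's `re` module, which stands in for "a full regular expression engine" (it is not Lua, and the variable names are only for illustration). The nested optional groups accept exactly the prefixes of "english", while the flattened pattern that Lua can express also lets some unwanted strings through:

```python
import re

# Nested optional groups: only possible in an engine where (...) is an atom.
strict = re.compile(r"^en(g(l(i(s(h?)?)?)?)?)?$")

# The flattened approximation that also works as a Lua pattern.
loose = re.compile(r"^eng?l?i?s?h?$")

wanted = ["en", "eng", "engl", "engli", "englis", "english"]
unwanted = ["enh", "enls"]

print(all(strict.match(s) for s in wanted))      # True
print(any(strict.match(s) for s in unwanted))    # False
print(all(loose.match(s) for s in wanted))       # True
print(all(loose.match(s) for s in unwanted))     # True
```

If the false positives matter and only Lua patterns are available, one option is to skip patterns for this check and test whether the input is a prefix of "english" with plain string functions.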

do we ever use regex to find regex expressions?

Let's say I have a very long string. The string has regular expressions at random locations. Can I use regex to find the regexes?
(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between /, but that might not always be a regexp. For example, such a search would return /2 + 3/ of the string var myNumber = 1/2 + 3/4. This means that you will have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/:
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?<=^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript does not support that. I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using a code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier is necessary!
var match = searchValue.exec(codeString); // "['typeof /regex/','/regex/']"
match = match[1]; // "/regex/"
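The capturing-group technique is not specific to JavaScript. As a rough sketch, here is the same pattern applied from Python to collect every literal in a piece of source text; `REGEX_LITERAL` and `find_regex_literals` are names invented for this example, and the "/" characters are left unescaped because Python does not delimit patterns with slashes:

```python
import re

# Prefix that may legally precede a regexp literal, then the literal
# itself in capturing group 1.
REGEX_LITERAL = re.compile(
    r"(?:^|\n|[^\s\w/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)"
    r"\s*(/(?:\\/|[^/*\n])(?:\\/|[^/\n])*/)"
)

def find_regex_literals(source):
    """Return the text of every probable regexp literal in source."""
    return [m.group(1) for m in REGEX_LITERAL.finditer(source)]

code = "var n = 1/2 + 3/4;\nvar re = /ab+c/;\nreturn /\\/hello/;"
print(find_regex_literals(code))  # ['/ab+c/', '/\\/hello/']
```

The divisions in the first line are skipped because each "/" there directly follows a digit, which none of the allowed prefixes can end with.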
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.
What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
It's not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break the text up into substrings and finding ones that look like regular expressions, perhaps delimited by quotation marks). However, don't forget that a very long string CAN be a regex, so you will never have complete confidence using this approach.
Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario); what you have to do is scan the string for the expected delimiter, extract whatever is in between the delimiters (paying attention to escape chars) and you should be set.
This works if your delimiter is a known character and if you are sure that it appears an even number of times, or if you are willing to discard the rest (for example, which set of delimiters are you considering in the following string: aaa/b/aaa/c/aaa/d)
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexp-- try man re_format if you're under *BSD). If that is the case you could try every possible substring of every length of the given string and feed it to a regexp parser for syntax correctness. Still, you have proven nothing of the validity of the regexp, i.e. on what they actually match.
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.

What are good regular expressions?

I have worked for 5 years mainly on Java desktop applications accessing Oracle databases and I have never used regular expressions. Now I enter Stack Overflow and I see a lot of questions about them; I feel like I missed something.
For what do you use regular expressions?
P.S. sorry for my bad english
Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^@]*)@(.*)/.match("myaddress@domain.com")
name = match[1]
domain = match[2]
Here, the parentheses in the regular expression mark capturing groups, so you can see exactly WHAT the data is that you matched and do further processing on it.
This is just the tip of the iceberg... there are many many different things you can do in a regular expression that makes processing text REALLY easy.
Regular Expressions (or Regex) are used to pattern match in strings. You can thus pull out all email addresses from a piece of text because it follows a specific pattern.
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken it can read "2 be or not 2 be".
The first part are the (brackets), they are split by the pipe | character which equates to an or statement so (a|b) matches "a" or "b". The first half of the piped area matches "bb". The second half's name I don't know but it's the square brackets, they match anything that is not "b", that's why there is a roof symbol thingie (technical term) there. The squiggly brackets match a count of the things before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case insensitive. Use of the start and end slashes is environment specific, sometimes you do and sometimes you do not.
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression
Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
str = "1" * n
return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailed and very approachable explanation over at Avinash Meetoo’s blog.
If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes all the way from the very basic concepts up to how different engines work underneath. The last 4 chapters each cover one of PHP, .Net, Perl, and Java. I learned a lot from it, and still use it as a reference.
If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
I've also heard good things about RegexBuddy:
http://www.regexbuddy.com/
As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are stuffed in log file. It isn't possible to know where the messages are--only what they look like. So a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
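As a small illustration of that equivalence, the sketch below filters a made-up list of file names once with a wildcard (via Python's standard `fnmatch` module) and once with the regex. The file names are invented for the example:

```python
import re
from fnmatch import fnmatchcase

# Hypothetical file names, just for the comparison.
names = ["notes.txt", "notes.txt.bak", "readme.md"]

# Wildcard notation, as used by a file manager or shell.
by_wildcard = [n for n in names if fnmatchcase(n, "*.txt")]

# The regex equivalent: anything, then a literal ".txt" at the end.
by_regex = [n for n in names if re.search(r".*\.txt$", n)]

print(by_wildcard)  # ['notes.txt']
print(by_regex)     # ['notes.txt']
```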
A great resource for regular expressions: http://www.regular-expressions.info
These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if.*_DEBUG
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined
Validating strong passwords:
This one will validate a password with a length of 5 to 10 alphanumerical characters, with at least one upper case, one lower case and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$
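To see the three lookaheads at work, the expression can be tried against a few candidate passwords. This is a minimal sketch in Python; the helper name `is_strong` and the sample passwords are made up for illustration:

```python
import re

# Each (?=...) lookahead checks one requirement without consuming input;
# the final class enforces 5 to 10 alphanumeric characters overall.
STRONG_PASSWORD = re.compile(
    r"^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$"
)

def is_strong(password):
    """True if password satisfies every rule in the pattern above."""
    return STRONG_PASSWORD.match(password) is not None

print(is_strong("Abc12"))        # True
print(is_strong("abc12"))        # False: no upper-case letter
print(is_strong("Abcde"))        # False: no digit
print(is_strong("Abcdefghij1"))  # False: 11 characters
```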