Say I have a regex matching a hexadecimal 32 bit number:
([0-9a-fA-F]{1,8})
When I construct a regex where I need to match this multiple times, e.g.
(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})
Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?
I'd imagine something like (warning, invented syntax!)
(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})
where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.
Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.
RegEx Subroutines
When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.
Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET Framework is lacking, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).
Syntax
Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.
Standard RegEx
Any: [abc][abc][abc]
Subroutine, by Name
Perl: (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby: (?<name>[abc])\g<name>\g<name>
Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby: ([abc])\g<1>\g<1>
Subroutine, by Relative Position
Perl: ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby: ([abc])\g<-1>\g<-1>
Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)
Examples
Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)
Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))
And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
More Info
http://regular-expressions.info/subroutine.html
http://regex101.com/
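For engines without subroutine calls, the IPv4 example above can be approximated by building the pattern from one string constant. This is a minimal sketch in Python, whose built-in re module has no (?1) syntax; the names OCTET, IPV4 and is_ipv4 are made up for illustration.

```python
import re

# One definition of an octet (0-255), reused four times via string formatting.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])"
IPV4 = re.compile(rf"{OCTET}\.{OCTET}\.{OCTET}\.{OCTET}")

def is_ipv4(text):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False
```

Unlike a real subroutine call, this only saves typing in the source code; the compiled pattern still contains the octet expression four times.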
Why not do something like this? It's not really shorter, but it is a bit more maintainable.
String.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-fA-F]{1,8}");
If you want more self-documenting code, I would assign the number regex string to a properly named const variable.
.NET regex does not support pattern recursion. In Ruby and PHP/PCRE you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)), where hex is a "technical" named capturing group whose name should not occur in the main pattern. In .NET, you may just define the block(s) as separate variables, and then use them to build a dynamic pattern.
Starting with C#6, you may use an interpolated string literal that looks very much like a PCRE/Onigmo subpattern recursion, but is actually cleaner and has no potential bottleneck when the group is named identically to the "technical" capturing group:
C# demo:
using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var block = "[0-9a-fA-F]{1,8}";
        var pattern = $@"(?<from>{block})\s*:\s*(?<to>{block})";
        Console.WriteLine(Regex.IsMatch("12345678 :87654321", pattern));
    }
}
The $@"..." is a verbatim interpolated string literal, where escape sequences are treated as combinations of a literal backslash and a char after it. Make sure to define literal { with {{ and } with }} (e.g. $@"(?:{block}){{5}}" to repeat a block 5 times).
For older C# versions, use string.Format:
var pattern = string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", block);
as is suggested in Mattias's answer.
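The same brace-doubling rule applies in other languages with string interpolation. As a rough Python analogue (an assumption for illustration, not part of the C# answer), an f-string needs {{ and }} for braces that belong to the regex itself, while braces inside the interpolated value need no escaping. The names HEXNUM, pattern and repeated are hypothetical.

```python
import re

HEXNUM = r"[0-9a-fA-F]{1,8}"  # braces inside the interpolated value are left alone

# {HEXNUM} is substituted; the backslashes stay literal because of the r prefix.
pattern = rf"(?P<from>{HEXNUM})\s*:\s*(?P<to>{HEXNUM})"

# Braces that belong to the regex itself must be doubled inside an f-string.
repeated = rf"(?:{HEXNUM},){{2}}{HEXNUM}"  # three comma-separated hex numbers

m = re.fullmatch(pattern, "12345678 :87654321")
print(m.group("from"), m.group("to"))
print(repeated)
```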
If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?
string f = @"fc\d+/";
string e = @"\d+";
Regex regexObj = new Regex(f+e);
Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.
e.g.
/\b([a-z])\w+\1\b/
Will only match "text" and "spaces" in the text below:
This is a sample text which is not the title since it does not end with 2 spaces.
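To check the backreference behaviour described above, here is a small sketch in Python (chosen only for illustration; the \1 syntax is the same). It lists the words that start and end with the same lowercase letter.

```python
import re

text = ("This is a sample text which is not the title "
        "since it does not end with 2 spaces.")

# \1 must repeat the exact character captured by ([a-z]), not just any letter.
words = [m.group(0) for m in re.finditer(r"\b([a-z])\w+\1\b", text)]
print(words)  # ['text', 'spaces']
```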
There is no such predefined class. I think you can simplify it using ignore-case option, e.g.:
(?i)(?<from>[0-9a-f]{1,8})\s*:\s*(?<to>[0-9a-f]{1,8})
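To refer back to a named capture group, use this syntax: \k<name> or \k'name'. Note that this is a backreference: it matches the same text the group captured, not the group's pattern.
So the following matches only when both numbers are identical:
(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>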
More info: http://www.regular-expressions.info/named.html
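A quick way to see what a named backreference does is to try it on two inputs. The sketch below uses Python, where the equivalent of \k<from> is spelled (?P=from); the name SAME_TEXT is made up for this example.

```python
import re

# (?P=from) is Python's spelling of the \k<from> backreference.
SAME_TEXT = re.compile(r"(?P<from>[0-9a-fA-F]{1,8})\s*:\s*(?P=from)")

print(bool(SAME_TEXT.fullmatch("1A2B : 1A2B")))  # True: the captured text repeats
print(bool(SAME_TEXT.fullmatch("1A2B : FFFF")))  # False: valid hex, but different text
```

So a backreference is useful for "the same value twice", but it does not re-apply the subexpression to new text.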
Ruby's gsub accepts a regex as the pattern to search for,
and it also allows match group numbers in the replacement.
For example, take a regex that detects a lowercase letter at the beginning of any word, with a replacement that puts an x before it and a y after it.
This gives the expected result:
"testing gsub".gsub(/(?<=\b)[a-z]/,'x\0y')
#=> "xtyesting xgysub"
But what if I want the replacement to convert the match group to uppercase?
In other regex flavors, one can normally do this with \U\$0, as explained here.
Unfortunately, when I try it like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,'\U\0')
#=> "\\Utesting \\Ugsub"
Also, if I try using a raw regex in the replacement field like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,/\U\0/)
I get type error:
TypeError (no implicit conversion of Regexp into String)
I'm totally aware of the option to do it using maps like this:
"testing gsub".gsub(/(?<=\b)[a-z]/,&:upcase)
But unfortunately, the rules (pattern, replacement) are being loaded from a .yaml file and they are applied to string this way:
input.gsub(rule['pattern'], rule['replacement'])
and I am not able to store &:upcase in the .yaml file without it being loaded as a plain string.
A workaround would be to detect when the replacement is the string "upcase"
and handle it this way:
"testing gsub".gsub(/(?<=\b)[a-z]/) {|l| l.send("upcase")}
But I don't want to modify this logic:
input.gsub(rule['pattern'], rule['replacement'])
If there is a workaround to either use regex in gsub replacement, or to store methods like &:upcase in YAML without being loaded as a string, it'd be perfect.
Thanks!
TL;DR
You can't do what you want the way you want. This is documented in the Onigmo source. You'll have to use a different approach, or refactor other areas of your code to simulate the behavior you want.
Escapes Like \U Not Available in Ruby
Special escapes like \U are extensions to GNU sed or ported from the PCRE library. They are not part of Ruby's current regular expression engine. The Onigmo source clearly mentions that these escapes are missing:
A-3. Missing features compared with perl 5.18.0
+ \N{name}, \N{U+xxxx}, \N
+ \l,\u,\L,\U, \C
+ \v, \V, \h, \H
+ (?{code})
+ (??{code})
+ (?|...)
+ (?[])
+ (*VERB:ARG)
Other Approaches
You can do what you want in a number of different ways, such as using the block form of String#gsub to call String#upcase on each match. For example:
"testing gsub".gsub(/\b\p{Lower}+/) { |m| m.upcase }
#=> "TESTING GSUB"
You will also have to use the block form if you want to reliably reference certain match variables like $& or $1, as the variables might otherwise refer to text from previous matches. For illustration, consider:
"foo bar".gsub /\b\p{Lower}+/, "#{$&.upcase}"
#=> "BAR BAR"
As this is primarily an X/Y problem, you may be happier with the answers you receive if you post a related question with an example of your YAML source and your current code for parsing your regular expression matches/substitutions. Perhaps there's a way to wrap or refactor your code that you haven't considered, but you aren't going to be able to solve this the way you want.
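One possible shape for such a refactoring is a small lookup table that maps names stored in the rule file to functions. The sketch below is written in Python rather than Ruby, purely to illustrate the idea; TRANSFORMS, apply_rule and the "upcase"/"downcase" keys are hypothetical and not part of the asker's code.

```python
import re

# Hypothetical registry: names a rule file may use instead of a literal
# replacement string.
TRANSFORMS = {
    "upcase": str.upper,
    "downcase": str.lower,
}

def apply_rule(text, rule):
    """Apply one {'pattern': ..., 'replacement': ...} rule to text."""
    replacement = rule["replacement"]
    transform = TRANSFORMS.get(replacement)
    if transform is not None:
        # Named transform: call it on each matched substring.
        return re.sub(rule["pattern"], lambda m: transform(m.group(0)), text)
    # Otherwise treat the replacement as an ordinary template string.
    return re.sub(rule["pattern"], replacement, text)

print(apply_rule("testing gsub", {"pattern": r"\b[a-z]", "replacement": "upcase"}))
print(apply_rule("testing gsub", {"pattern": r"\b[a-z]", "replacement": r"x\g<0>y"}))
```

The trade-off is that a literal replacement equal to one of the registered names can no longer be expressed, so the names would need to be chosen with care.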
I'm working on a syntax coloring scheme for my favourite programming language, OOREXX. The language isn't important, as my question is purely about a REGEX.
Simple description: A regex to match any of a bunch of words, but they must have a "~" prefix or a "(" suffix or both
Full description:
I want to match any of a bunch of words. They are the names of functions. This is easy, something like:
(stream|Strip|Substr) etc.
But the word "strip" (for example) might occur in my code when not a function name:
Strip = 1 -- Set variable "Strip" to 1
So, I need to be more precise. The function names must have either a leading "~" or a trailing "(" or both
This is where my REGEX skill fails. I could get around this in my colouring scheme by using two elements, one to catch "~strip" and one to catch "strip(" but that means duplicating, and maintaining, the list of function names. That goes against the grain...
Simply use alternation. In case lookbehinds are supported, you can use
(?<=~)strip|strip(?=\()
If you want something fancy and your regex engine supports lookbehind and if clauses, you can avoid alternation - though it won't be any more performant, e.g.
((?<=~))?strip(?(1)|(?=\())
And if you don't have lookbehinds, you can still use grouping and extract from the captured groups, e.g.
~(strip)|(strip)\(
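If the pattern is generated by a script, the list of function names only has to be maintained once, even though it appears twice in the alternation. The following is a minimal sketch in Python, assuming a lookbehind-capable engine; FUNCTION_NAMES, FUNCTION_CALL and find_functions are invented names, and the \b anchors are an addition to avoid matching inside longer identifiers.

```python
import re

FUNCTION_NAMES = ["stream", "strip", "substr"]  # the single list to maintain

names = "|".join(re.escape(n) for n in FUNCTION_NAMES)

# Either preceded by "~" or followed by "("; \b is an added safeguard.
FUNCTION_CALL = re.compile(
    rf"(?<=~)(?:{names})\b|\b(?:{names})(?=\()",
    re.IGNORECASE,
)

def find_functions(source):
    """Return the function names found in a line of source code."""
    return [m.group(0) for m in FUNCTION_CALL.finditer(source)]

print(find_functions("x = y~strip; z = Substr(a, 1); Strip = 1"))
# ['strip', 'Substr']
```

The generated pattern string (FUNCTION_CALL.pattern) could then be pasted into the colouring scheme if the editor accepts this syntax.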
I recommend (over & over) using http://regexr.com to test Regular Expressions. I am not affiliated with them, but I program regular expressions 8 hours a day (sometimes)... It's a nice tool for practicing them.... But to answer your question (in Java)...
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// If there is a matching function name within this string, this will
// return that name, otherwise, it will return null.
public static String functionName(String functionNameStr)
{
    // This Regular Expression Groups the symbols before, or after, or both!
    // No, really, that's what it says...
    String RE = "(~\\w+|\\w+\\(|~\\w+\\()";
    // NOTE: In Java, escape characters need to be Escaped Twice!
    // ALSO NOTE: The alternative below puts a "precedence" on catching both symbols!
    // RE = "(~\\w+\\(|~\\w+|\\w+\\()"
    // Since the ~func-name( is listed first, if both symbols are included,
    // it will catch that too. Maybe this is relevant to your code/question.
    Pattern P1 = Pattern.compile(RE);
    Matcher m = P1.matcher(functionNameStr);
    if (m.find()) return m.group();
    else return null;
}
How come for something that simple I can't find an answer after looking one hour in the internet?
I have this sentence:
HeLLo woRLd HOw are YoU
I want to capture all groups that consist of two consecutive capital letters
[A-Z]{2}
The regex above works but captures only LL (the first two capital letters), while I want LL in one group and RL and HO in the other groups
Most regular expression engines expose some way to make your expression global. This means that your expression will be applied multiple times. This global flag is usually denoted with the /g marker at the end of your expression. Without the /g flag, only the first match is returned; with the flag, every match is returned.
Different languages expose such functionality differently. In C#, for instance, this is done through the Regex.Matches method. In Java, you use while(matcher.find()), which keeps providing substrings which match the pattern provided.
EDIT: I am not a Python person, but judging from the example available here, you could do something like so:
import re

it = re.finditer(r"[A-Z]{2}", "HeLLo woRLd HOw are YoU")
for match in it:
    print("'{g}' was found between the indices {s}".format(g=match.group(), s=match.span()))
You can not have multiple groups in this case, but you can have multiple matches. Add the global flag to your regex and use a method to match the regex.
For JavaScript, it would be /[A-Z]{2}/g.
The method most probably returns an Array of matches, and you can use an index to access them.
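As an illustration of "multiple matches returned as an array", here is a short Python sketch; re.findall plays the role of a global match and returns a plain list that can be indexed.

```python
import re

# findall returns every non-overlapping match, left to right.
pairs = re.findall(r"[A-Z]{2}", "HeLLo woRLd HOw are YoU")
print(pairs)     # ['LL', 'RL', 'HO']
print(pairs[1])  # RL
```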
I'm trying to match a string against a pattern, but there's one thing I haven't managed to figure out. In a regex I'd do this:
Strings:
en
eng
engl
engli
englis
english
Pattern:
^en(g(l(i(s(h?)?)?)?)?)?$
I want all strings to be a match.
In Lua pattern matching I can't get this to work.
Even a simpler example like this won't work:
Strings:
fly
flying
Pattern:
^fly(ing)?$
Does anybody know how to do this?
You can't make match-groups optional (or repeat them) using Lua's quantifiers ?, *, + and -.
In the pattern (%d+)?, the question mark "loses" its special meaning and will simply match the literal ? as you can see by executing the following lines of code:
text = "a?"
first_match = text:match("((%w+)?)")
print(first_match)
which will print:
a?
AFAIK, the closest you can come in Lua would be to use the pattern:
^eng?l?i?s?h?$
which (of course) matches strings like "enh", "enls", ... as well.
In Lua, the parentheses are only used for capturing. They don't create atoms.
The closest you can get to the patterns you want is:
'^flyi?n?g?$'
'^eng?l?i?s?h?$'
If you need the full power of a regular expression engine, there are bindings to common engines available for Lua. There's also LPeg, a library for creating PEGs, which comes with a regular expression engine as an example (not sure how powerful it is).
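With a full regular expression engine, the nested optional groups from the question can be generated rather than typed by hand. This is a sketch in Python (not Lua), assuming an engine that allows quantifiers on groups; prefix_pattern and ENGLISH are hypothetical names.

```python
import re

def prefix_pattern(word, min_len):
    """Build a regex matching any prefix of `word` at least `min_len` long."""
    tail = ""
    # Wrap the optional letters from the inside out: (?:s(?:h)?)? and so on.
    for ch in reversed(word[min_len:]):
        tail = f"(?:{re.escape(ch)}{tail})?"
    return re.compile(f"^{re.escape(word[:min_len])}{tail}$")

ENGLISH = prefix_pattern("english", 2)

for s in ["en", "eng", "engl", "engli", "englis", "english", "enh"]:
    print(s, bool(ENGLISH.match(s)))
```

Unlike the flat '?' chain, this rejects strings such as "enh" because each letter is optional only if the previous one was present.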
Let's say I have a very long string. The string has regular expressions at random locations. Can I use regex to find the regexes?
(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between /, but that might not always be a regexp. For example, such a search would return /2 + 3/ of the string var myNumber = 1/2 + 3/4. This means that you will have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/:
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?<=^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript engines before ES2018 do not support that. I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier is necessary!
var match = searchValue.exec(codeString); // "['typeof /regex/','/regex/']"
match = match[1]; // "/regex/"
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.
What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
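To see both the strength and the weakness of the simple slash-delimited pattern from this answer, it can be tried on two inputs. The sketch below uses Python, where / needs no escaping inside the pattern; REGEX_LITERAL is an invented name.

```python
import re

# Same idea as /\/(?:[^\\\/]|\\.)*\//: a slash, then escaped pairs or
# ordinary characters, then a closing slash.
REGEX_LITERAL = re.compile(r"/(?:[^\\/]|\\.)*/")

print(REGEX_LITERAL.findall(r"x = /\/hello/;"))            # the escaped slash is handled
print(REGEX_LITERAL.findall("var myNumber = 1/2 + 3/4"))   # false positive: '/2 + 3/'
```

The second call shows why context before the opening slash matters: plain division is indistinguishable from a literal for this pattern.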
It's not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break the input up into substrings and finding ones that look like regular expressions, perhaps delimited by quotation marks). However, don't forget that a very long string CAN be a regex, so you will never have complete confidence using this approach.
Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario); what you have to do is scan the string for the expected delimiter, extract whatever is in between the delimiters (paying attention to escape chars), and you should be set.
This works if your delimiter is a known character and if you are sure that it appears an even number of times, or you want to discard the rest (for example, which set of delimiters are you considering in the following string: aaa/b/aaa/c/aaa/d?).
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
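The scanning procedure described above could look roughly like the following Python sketch. It assumes a single known delimiter and backslash as the escape character, and it discards an unterminated trailing segment; extract_delimited is a hypothetical helper name.

```python
def extract_delimited(text, delim="/", escape="\\"):
    """Return the segments found between pairs of `delim` characters."""
    segments = []
    current = None  # None while we are outside a delimited segment
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Keep an escaped pair together so an escaped delimiter
            # does not end the segment.
            if current is not None:
                current.append(text[i:i + 2])
            i += 2
            continue
        if ch == delim:
            if current is None:
                current = []                       # opening delimiter
            else:
                segments.append("".join(current))  # closing delimiter
                current = None
        elif current is not None:
            current.append(ch)
        i += 1
    return segments  # an unterminated last segment is dropped

print(extract_delimited("aaa/b/aaa/c/aaa/d"))  # ['b', 'c']
```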
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexp-- try man re_format if you're under *BSD). If that is the case you could try every possible substring of every length of the given string and feed it to a regexp parser for syntax correctness. Still, you have proven nothing of the validity of the regexp, i.e. on what they actually match.
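The brute-force idea from the previous paragraph can be sketched as follows. Note the assumptions: this uses Python's re syntax as the "regexp parser" rather than POSIX, and compilable_substrings is an invented name. It mainly demonstrates how little a successful parse tells you.

```python
import re

def compilable_substrings(text):
    """Return every substring of `text` that Python's re module accepts."""
    valid = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            try:
                re.compile(candidate)
            except re.error:
                continue  # syntactically invalid, e.g. an unclosed "("
            valid.append(candidate)
    return valid

print(compilable_substrings("a(b"))       # ['a', 'b']
print(len(compilable_substrings("abc")))  # 6: every substring is a valid regex
```

For ordinary text almost every substring compiles, so syntax checking alone cannot single out the "real" regular expressions.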
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.