regex: using surrounding brackets as delimiters while ignoring any inside brackets - regex

I've build a complex (for me) regex to parse some file names, and it broadly works, except for a case where there are additional inside brackets.
(?'field'F[0-9]{1,4})(?'term'\(.*?\))(?'operator'_(OR|NOT|AND)_)?
In the following examples, I need to get the groups after the comment, but in the 3rd example, I am getting ((brackets) instead of ((brackets)are valid).
For the life of me I can't work out how to extend it to search for the final bracket.
C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR
Thanks
UPDATE: I'm using RegExr to experiment, then implementing in C# like this:
Regex r = new Regex(pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
foreach(Match m in r.Matches(foo))
{
//etc
}
UPDATE 2: I don't need to match up the brackets. Inside the one set of brackets can be any data, I just need it to terminate with the outside bracket.
UPDATE 3:
Another attempt, this works with extra brackets (example 3 and 4), but still fails to split out the extra terms (example 5), but unfortunatly includes the terminating ] in the group. How can I get it to search for (but not include) either )_ or )] as the delimiter, but just include the bracket?
(?'field'F[0-9]{1,4})(?'term'\(.*?\)[\]])(?'operator'_(OR|NOT|AND)_)?
Final update: I've decided it's not worth the effort in trying to parse this stupid format, so I'm going to ditch support for it and do something more productive with my time. Thank you all for your help, I have now seen the light!

Matching nested parenthesis with regex is a) not possible*, or b) results in a regex that is unmaintainable.
If you're simply trying to match the first ( until the last ) (not checking if the opening- and closing-parenthesis properly match), then just remove the ? after .*?.
* depending what regex flavour you're using.

Hmm, this usually isn't possible with most regex engines. Although it is possible in perl:
PerlMonks
By using a recursive regexp:
use strict;
use warnings;
my $textInner =
'(outer(inner(most "this (shouldn\'t match)" inner)))';
my $innerRe;
my $idx=0;
my(#match);
$innerRe = qr/
\(
(
(?:
[^()"]+
|
"[^"]*"
|
(??{$innerRe})
)*
)
\)(?{$match[$idx++]=$1;})
/sx;
$textInner =~ /^$innerRe/g;
print "inner: $match[0]\n";
It's also possible to do it in most regex engines provided that you want to do it to a fixed depth of bracket nesting. I wrote something in java a while ago that would construct a regex that would match brackets up to 6 deep.
Here's my java function for producing the regex:
public static String generateParensMatchStr(int depth, char openParen, char closeParen)
{
if (depth == 0)
return ".*?";
else
return "(?:\\" + openParen + generateParensMatchStr(depth - 1, openParen, closeParen) + "\\" +closeParen + "|.*?)+?";
}

here is my another test results in python
x="""C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR"""
x=re.sub("//.*","",x)
x=re.sub("(_(OR|NOT|AND)_).*?]"," \\1 \\2]",x)
x=re.findall("(?:F[0-9]{1,4}\(.*\).*(?=]))",x)
for x in x:print x
this gives
F30(green)
F21(red) _OR_ OR
F21((brackets)are valid)
F21(any old brackets)))))are valid)
F21(brackets)))))) _OR_ OR
Thats will meet your expected result?

re.findall("((?:F[0-9]{1,4}\(.*\))(?:_(?:OR|NOT|AND)_)?)+?",YOURTEXT)
gots
['F30(green)', 'F21(red)_OR_F21(blue)_NOT_F21(pink)', 'F21((brackets)are valid)', 'F21(any old brackets)))))are valid)', 'F21(brackets))))))_OR_F21(blue)']
in python, what do you think?

Try this
/(F[0-9]{1,4})(\([^_\]]+\))(?:_(OR|NOT|AND)_)?/
tested with PHP, seems to give the expected results (as long as the strings inside round brackets don't contain _ or ]).

Related

Regex Multiple rows [duplicate]

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.
In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as  from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("&regex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatches(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)

Regex Pattern to Match, Excluding when... / Except between

--Edit-- The current answers have some useful ideas but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also ideas that work everywhere are better for me than not standard syntax like \K
This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.
Example
I want to match five digits using \b\d{5}\b but not in three situations s1 s2 s3:
s1: Not on a line that ends with a period like this sentence.
s2: Not anywhere inside parens.
s3: Not inside a block that starts with if( and ends with //endif
I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially in C# lookbehind or \K in PHP.
For instance
s1 (?m)(?!\d+.*?\.$)\d+
s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+
But the mix of conditions together makes my head explode. Even more bad news is that I may need to add other conditions s4 s5 at another time.
The good news is, I don't care if I process the files using most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if it has a solution.
So I came here to see if someone think of a flexible recipe.
Hints are okay: you don't need to give me full code. :)
Thank you.
Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.
First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.
Surprise
Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...
Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.
A Better-Known Variation
There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).
Thanks for all the background, zx81... But what's the recipe?
Key Fact
The method returns the match in Group 1 capture. It does not care at
all about the overall match.
In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.
The general recipe is
Not_this_context|Not_this_either|StayAway|(WhatYouWant)
This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.
In your case, with your digits and your three contexts to ignore, we can do:
s1|s2|s3|(\b\d+\b)
Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a | )
The whole expression can be written like this:
(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)
See this demo (but focus on the capture groups in the lower right pane.)
If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.
For flavors that support free-spacing, this reads particularly well.
(?mx)
### s1: Match line that ends with a period ###
^.*\.$
| ### OR s2: Match anything between parentheses ###
\([^\)]*\)
| ### OR s3: Match any if(...//endif block ###
if\(.*?//endif
| ### OR capture digits to Group 1 ###
(\b\d+\b)
This is exceptionally easy to read and maintain.
Extending the regex
When you want to ignore more situations s4 and s5, you add them in more alternations on the left:
s4|s5|s1|s2|s3|(\b\d+\b)
How does this work?
The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".
The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.
I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.
Debuggex Demo
Perl/PCRE Variation
In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as #CasimiretHippolyte and #HamZa. It is:
(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant
In your case:
(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b
This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant
Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)
demo for this version
Applications
Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.
How can I match foo except anywhere in a tag like <a stuff...>...</a>?
How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
How can I match all words that are not on this black list?
How can I ignore anything inside a SUB... END SUB block?
How can I match everything except... s1 s2 s3?
How to Program the Group 1 Captures
You didn't as for code, but, for completion... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.
If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
Alternatives
Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.
1. Replace then Match.
A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of ###. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive ### strings.
2. Lookarounds.
Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.
Recycling the regex you had for s3 in C#, the whole pattern would look like this.
(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
But by now you know I'm not recommending this, right?
Deletions
#HamZa and #Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instance of WhatYouWant, you change the regex to
(s1|s2|s3)|WhatYouWant
For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else — and therefore deleted. See this demo, thank you #HamZa and #Jerry for suggesting this wonderful addition.
Replacements
This brings us to replacements, on which I'll touch briefly.
When replacing with nothing, see the "Deletions" trick above.
When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
Have fun!
No, wait, there's more!
Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.
Do three different matches and handle the combination of the three situations using in-program conditional logic. You don't need to handle everything in one giant regex.
EDIT: let me expand a bit because the question just became more interesting :-)
The general idea you are trying to capture here is to match against a certain regex pattern, but not when there are certain other (could be any number) patterns present in the test string. Fortunately, you can take advantage of your programming language: keep the regexes simple and just use a compound conditional. A best practice would be to capture this idea in a reusable component, so let's create a class and a method that implement it:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class MatcherWithExceptions {
private string m_searchStr;
private Regex m_searchRegex;
private IEnumerable<Regex> m_exceptionRegexes;
public string SearchString {
get { return m_searchStr; }
set {
m_searchStr = value;
m_searchRegex = new Regex(value);
}
}
public string[] ExceptionStrings {
set { m_exceptionRegexes = from es in value select new Regex(es); }
}
public bool IsMatch(string testStr) {
return (
m_searchRegex.IsMatch(testStr)
&& !m_exceptionRegexes.Any(er => er.IsMatch(testStr))
);
}
}
public class App {
public static void Main() {
var mwe = new MatcherWithExceptions();
// Set up the matcher object.
mwe.SearchString = #"\b\d{5}\b";
mwe.ExceptionStrings = new string[] {
#"\.$"
, #"\(.*" + mwe.SearchString + #".*\)"
, #"if\(.*" + mwe.SearchString + #".*//endif"
};
var testStrs = new string[] {
"1." // False
, "11111." // False
, "(11111)" // False
, "if(11111//endif" // False
, "if(11111" // True
, "11111" // True
};
// Perform the tests.
foreach (var ts in testStrs) {
System.Console.WriteLine(mwe.IsMatch(ts));
}
}
}
So above, we set up the search string (the five digits), multiple exception strings (your s1, s2 and s3), and then try to match against several test strings. The printed results should be as shown in the comments next to each test string.
Your requirement that it's not inside parens in impossible to satify for all cases.
Namely, if you can somehow find a ( to the left and ) to the right, it doesn't always mean you are inside parens. Eg.
(....) + 55555 + (.....) - not inside parens yet there are ( and ) to left and right
Now you might think yourself clever and look for ( to the left only if you don't encounter ) before and vice versa to the right. This won't work for this case:
((.....) + 55555 + (.....)) - inside parens even though there are closing ) and ( to left and to right.
It is impossible to find out if you are inside parens using regex, as regex can't count how many parens have been opened and how many closed.
Consider this easier task: using regex, find out if all (possibly nested) parens in a string are closed, that is for every ( you need to find ). You will find out that it's impossible to solve and if you can't solve that with regex then you can't figure out if a word is inside parens for all cases, since you can't figure out at a some position in string if all preceeding ( have a corresponding ).
Hans if you don't mind I used your neighbor's washing machine called perl :)
Edited:
Below a pseudo code:
loop through input
if line contains 'if(' set skip=true
if skip= true do nothing
else
if line match '\b\d{5}\b' set s0=true
if line does not match s1 condition set s1=true
if line does not match s2 condition set s2=true
if s0,s1,s2 are true print line
if line contains '//endif' set skip=false
Given the file input.txt:
tiago#dell:~$ cat input.txt
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
And the script validator.pl:
tiago#dell:~$ cat validator.pl
#! /usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
sub validate_s0 {
my $line = $_[0];
if ( $line =~ \d{5/ ){
return "true";
}
return "false";
}
sub validate_s1 {
my $line = $_[0];
if ( $line =~ /\.$/ ){
return "false";
}
return "true";
}
sub validate_s2 {
my $line = $_[0];
if ( $line =~ /.*?\(.*\d{5.*?\).*/ ){
return "false";
}
return "true";
}
my $skip = "false";
while (<>){
my $line = $_;
if( $line =~ /if\(/ ){
$skip = "true";
}
if ( $skip eq "false" ) {
my $s0_status = validate_s0 "$line";
my $s1_status = validate_s1 "$line";
my $s2_status = validate_s2 "$line";
if ( $s0_status eq "true"){
if ( $s1_status eq "true"){
if ( $s2_status eq "true"){
print "$line";
}
}
}
}
if ( $line =~ /\/\/endif/) {
$skip="false";
}
}
Execution:
tiago#dell:~$ cat input.txt | perl validator.pl
it should match 12345
it should match 12345
it should match 12345
Not sure if this would help you or not, but I am providing a solution considering the following assumptions -
You need an elegant solution to check all the conditions
Conditions can change in future and anytime.
One condition should not depend on others.
However I considered also the following -
The file given has minimal errors in it. If it doe then my code might need some modifications to cope with that.
I used Stack to keep track of if( blocks.
Ok here is the solution -
I used C# and with it MEF (Microsoft Extensibility Framework) to implement the configurable parsers. The idea is, use a single parser to parse and a list of configurable validator classes to validate the line and return true or false based on the validation. Then you can add or remove any validator anytime or add new ones if you like. So far I have already implemented for S1, S2 and S3 you mentioned, check classes at point 3. You have to add classes for s4, s5 if you need in future.
First, Create the Interfaces -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo.Contracts
{
public interface IParser
{
String[] GetMatchedLines(String filename);
}
public interface IPatternMatcher
{
Boolean IsMatched(String line, Stack<string> stack);
}
}
Then comes the file reader and checker -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using FileParserDemo.Contracts;
using System.ComponentModel.Composition.Hosting;
using System.ComponentModel.Composition;
using System.IO;
using System.Collections;
namespace FileParserDemo.Parsers
{
public class Parser : IParser
{
[ImportMany]
IEnumerable<Lazy<IPatternMatcher>> parsers;
private CompositionContainer _container;
public void ComposeParts()
{
var catalog = new AggregateCatalog();
catalog.Catalogs.Add(new AssemblyCatalog(typeof(IParser).Assembly));
_container = new CompositionContainer(catalog);
try
{
this._container.ComposeParts(this);
}
catch
{
}
}
public String[] GetMatchedLines(String filename)
{
var matched = new List<String>();
var stack = new Stack<string>();
using (StreamReader sr = File.OpenText(filename))
{
String line = "";
while (!sr.EndOfStream)
{
line = sr.ReadLine();
var m = true;
foreach(var matcher in this.parsers){
m = m && matcher.Value.IsMatched(line, stack);
}
if (m)
{
matched.Add(line);
}
}
}
return matched.ToArray();
}
}
}
Then comes the implementation of individual checkers, the class names are self explanatory, so I don't think they need more descriptions.
using FileParserDemo.Contracts;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace FileParserDemo.PatternMatchers
{
[Export(typeof(IPatternMatcher))]
public class MatchAllNumbers : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveIfBlock : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("if\\(");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
//push the if
stack.Push(m.ToString());
}
//ignore current line, and will validate on next line with stack
return true;
}
regex = new Regex("//endif");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
stack.Pop();
}
}
return stack.Count == 0; //if stack has an item then ignoring this block
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithEndPeriod : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("(?m)(?!\\d+.*?\\.$)\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithInParenthesis : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\(.*\\d+.*\\)");
return !regex.IsMatch(line);
}
}
}
The program -
using FileParserDemo.Contracts;
using FileParserDemo.Parsers;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo
{
class Program
{
static void Main(string[] args)
{
var parser = new Parser();
parser.ComposeParts();
var matches = parser.GetMatchedLines(Path.GetFullPath("test.txt"));
foreach (var s in matches)
{
Console.WriteLine(s);
}
Console.ReadLine();
}
}
}
For testing I took #Tiago's sample file as Test.txt which had the following lines -
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
Gives the output -
it should match 12345
it should match 12345
it should match 12345
Don't know if this would help you or not, I do had a fun time playing with it.... :)
The best part with it is that, for adding a new condition all you have to do is provide an implementation of IPatternMatcher, it will automatically get called and thus will validate.
Same as #zx81's (*SKIP)(*F) but with using a negative lookahead assertion.
(?m)(?:if\(.*?\/\/endif|\([^()]*\))(*SKIP)(*F)|\b\d+\b(?!.*\.$)
DEMO
In python, i would do easily like this,
import re
string = """cat 123 sat.
I like 000 not (456) though 111 is fine
222 if( //endif if(cat==789 stuff //endif 333"""
for line in string.split('\n'): # Split the input according to the `\n` character and then iterate over the parts.
if not line.endswith('.'): # Don't consider the part which ends with a dot.
for i in re.split(r'\([^()]*\)|if\(.*?//endif', line): # Again split the part by brackets or if condition which endswith `//endif` and then iterate over the inner parts.
for j in re.findall(r'\b\d+\b', i): # Then find all the numbers which are present inside the inner parts and then loop through the fetched numbers.
print(j) # Prints the number one ny one.
Output:
000
111
222
333

Why would regex to separate filename from extension not work in ColdFusion?

I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function:
REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );
I would like this to return an array like: ["Doe, John 8.15.2012","docx"]
but instead I always get an array with one element - the entire filename:["Doe, John 8.15.2012.docx"]
I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?
Does ColdFusion use a different syntax? Or am I doing something wrong?
Thanks.
Why you're not getting expected results...
The reason you are getting a one-item array with the whole filename is because your pattern matches the entire filename, and matches once.
It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.
How to solve the problem...
If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...
ListLast( filename , '.' )
....to get only the file extension and to get the name without extension you can do...
rematch( '.+(?=\.[^.]+$)' , filename )
This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
To deal with non-extensioned files (e.g. .htaccess or README) you can modify the above regex to .+(?=(?:\.[^.]+)?$) which basically does the same thing except making the extension optional. However, there isn't a trivial way to get update the ListLast method for these (guess you'd need to check len(extension) LT len(filename)-1 or similar).
(optional) Accessing captured groups...
If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).
If you wanted to use cfRegex, you can do so with your original pattern like so:
RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )
Or with named arguments:
RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )
And you get returned an array of matches, within each element being an array of the captured groups for that match.
If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.
If all you care about is getting the extension and/or the filename with extension excluded then the previous examples above are sufficient.
#Peter's response is great, however the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.
<cfscript>
param name="URL.filename";
sRegex = "^.+?(?=(?:\.[^.]+?)?$)";
aMatch = reMatch(sRegex, URL.filename);
writeDump(aMatch);
</cfscript>
This works on the following filename patterns:
foo.bar
foo
.htaccess
John 8.15.2012.docx
Explanation of the regex:
^ From the beginning of the string
.+? One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.
(?=) Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.
(?: Group this stuff together, but don't remember it for a back reference.
. A dot. This is the separator between file name and file extension.
[^.]+? One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.
? (This is the one after the (?:) group). Zero or one of those groups: ie: zero or one file extensions.
$ To the end of the string
I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to finetune it.
A few more ways of achieving the same result. They all execute in roughly the same amount of time.
<cfscript>
str = 'Doe, John 8.15.2012.docx';
// sans regex
arr1 = [
reverse( listRest( reverse( str ), '.' ) ),
listLast( str, '.' )
];
// using Java String lastIndexOf()
arr2 = [
str.substring( 0, str.lastIndexOf( '.' ) ),
str.substring( str.lastIndexOf( '.' ) + 1 )
];
// using listToArray with non-filename safe character replace
arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>

Notepad++ RegeEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace Dialog, it returns with 0 occurrences. Not sure what's misesing in the syntax. And yes I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parenthesis, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)
Ed's response has shown a working pattern since alternation isn't supported in Notepad++, however the rest of your problem can't be handled by regex alone. What you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is re-arrange items and refer to the captured items, but you can't tell it to use "A" for values 1-3, and "B" for 4-6. Furthermore, you can't assign placeholders like that. They are really capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = #"\bMyLabel_(\d+)\b";
string filePath = #"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
contents[i] = Regex.Replace(contents[i], pattern,
m =>
{
int num = int.Parse(m.Groups[1].Value);
if (numToWordMap.ContainsKey(num))
{
return "Label_" + numToWordMap[num];
}
// key not found, use original value
return m.Value;
});
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, #Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a version 5., auto-update won't update to 6. - you have to explicitly download it.
A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad 6.3.2
The outermost pair of brackets is for grouping, they limit the scope of the first alternation; not sure whether they could be omitted but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.
The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
FS = "_0";
a[1]="A_One";
a[2]="A_Two";
a[3]="A_Three";
a[4]="B_One";
a[5]="B_Two";
a[6]="B_Three";
}
{
printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three

A regex for version number parsing

I have a version number of the following form:
version.release.modification
where version, release and modification are either a set of digits or the '*' wildcard character. Additionally, any of these numbers (and any preceding .) may be missing.
So the following are valid and parse as:
1.23.456 = version 1, release 23, modification 456
1.23 = version 1, release 23, any modification
1.23.* = version 1, release 23, any modification
1.* = version 1, any release, any modification
1 = version 1, any release, any modification
* = any version, any release, any modification
But these are not valid:
*.12
*123.1
12*
12.*.34
Can anyone provide me a not-too-complex regex to validate and retrieve the release, version and modification numbers?
I'd express the format as:
"1-3 dot-separated components, each numeric except that the last one may be *"
As a regexp, that's:
^(\d+\.)?(\d+\.)?(\*|\d+)$
[Edit to add: this solution is a concise way to validate, but it has been pointed out that extracting the values requires extra work. It's a matter of taste whether to deal with this by complicating the regexp, or by processing the matched groups.
In my solution, the groups capture the "." characters. This can be dealt with using non-capturing groups as in ajborley's answer.
Also, the rightmost group will capture the last component, even if there are fewer than three components, and so for example a two-component input results in the first and last groups capturing and the middle one undefined. I think this can be dealt with by non-greedy groups where supported.
Perl code to deal with both issues after the regexp could be something like this:
#version = ();
#groups = ($1, $2, $3);
foreach (#groups) {
next if !defined;
s/\.//;
push #version, $_;
}
($major, $minor, $mod) = (#version, "*", "*");
Which isn't really any shorter than splitting on "."
]
Use regex and now you have two problems. I would split the thing on dots ("."), then make sure that each part is either a wildcard or set of digits (regex is perfect now). If the thing is valid, you just return correct chunk of the split.
Thanks for all the responses! This is ace :)
Based on OneByOne's answer (which looked the simplest to me), I added some non-capturing groups (the '(?:' parts - thanks to VonC for introducing me to non-capturing groups!), so the groups that do capture only contain the digits or * character.
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Many thanks everyone!
This might work:
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
At the top level, "*" is a special case of a valid version number. Otherwise, it starts with a number. Then there are zero, one, or two ".nn" sequences, followed by an optional ".*". This regex would accept 1.2.3.* which may or may not be permitted in your application.
The code for retrieving the matched sequences, especially the (\.\d+){0,2} part, will depend on your particular regex library.
My 2 cents: I had this scenario: I had to parse version numbers out of a string literal.
(I know this is very different from the original question, but googling to find a regex for parsing version number showed this thread at the top, so adding this answer here)
So the string literal would be something like: "Service version 1.2.35.564 is running!"
I had to parse the 1.2.35.564 out of this literal. Taking a cue from #ajborley, my regex is as follows:
(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)
A small C# snippet to test this looks like below:
void Main()
{
Regex regEx = new Regex(#"(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)", RegexOptions.Compiled);
Match version = regEx.Match("The Service SuperService 2.1.309.0) is Running!");
version.Value.Dump("Version using RegEx"); // Prints 2.1.309.0
}
I had a requirement to search/match for version numbers, that follows maven convention or even just single digit. But no qualifier in any case. It was peculiar, it took me time then I came up with this:
'^[0-9][0-9.]*$'
This makes sure the version,
Starts with a digit
Can have any number of digit
Only digits and '.' are allowed
One drawback is that version can even end with '.' But it can handle indefinite length of version (crazy versioning if you want to call it that)
Matches:
1.2.3
1.09.5
3.4.4.5.7.8.8.
23.6.209.234.3
If you are not unhappy with '.' ending, may be you can combine with endswith logic
Don't know what platform you're on but in .NET there's the System.Version class that will parse "n.n.n.n" version numbers for you.
I've seen a lot of answers, but... i have a new one. It works for me at least. I've added a new restriction. Version numbers can't start (major, minor or patch) with any zeros followed by others.
01.0.0 is not valid
1.0.0 is valid
10.0.10 is valid
1.0.0000 is not valid
^(?:(0\\.|([1-9]+\\d*)\\.))+(?:(0\\.|([1-9]+\\d*)\\.))+((0|([1-9]+\\d*)))$
It's based in a previous one. But i see this solution better... for me ;)
Enjoy!!!
I tend to agree with split suggestion.
Ive created a "tester" for your problem in perl
#!/usr/bin/perl -w
#strings = ( "1.2.3", "1.2.*", "1.*","*" );
%regexp = ( svrist => qr/(?:(\d+)\.(\d+)\.(\d+)|(\d+)\.(\d+)|(\d+))?(?:\.\*)?/,
onebyone => qr/^(\d+\.)?(\d+\.)?(\*|\d+)$/,
greg => qr/^(\*|\d+(\.\d+){0,2}(\.\*)?)$/,
vonc => qr/^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$/,
ajb => qr/^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$/,
jrudolph => qr/^(((\d+)\.)?(\d+)\.)?(\d+|\*)$/
);
foreach my $r (keys %regexp){
my $reg = $regexp{$r};
print "Using $r regexp\n";
foreach my $s (#strings){
print "$s : ";
if ($s =~m/$reg/){
my ($main, $maj, $min,$rev,$ex1,$ex2,$ex3) = ("any","any","any","any","any","any","any");
$main = $1 if ($1 && $1 ne "*") ;
$maj = $2 if ($2 && $2 ne "*") ;
$min = $3 if ($3 && $3 ne "*") ;
$rev = $4 if ($4 && $4 ne "*") ;
$ex1 = $5 if ($5 && $5 ne "*") ;
$ex2 = $6 if ($6 && $6 ne "*") ;
$ex3 = $7 if ($7 && $7 ne "*") ;
print "$main $maj $min $rev $ex1 $ex2 $ex3\n";
}else{
print " nomatch\n";
}
}
print "------------------------\n";
}
Current output:
> perl regex.pl
Using onebyone regexp
1.2.3 : 1. 2. 3 any any any any
1.2.* : 1. 2. any any any any any
1.* : 1. any any any any any any
* : any any any any any any any
------------------------
Using svrist regexp
1.2.3 : 1 2 3 any any any any
1.2.* : any any any 1 2 any any
1.* : any any any any any 1 any
* : any any any any any any any
------------------------
Using vonc regexp
1.2.3 : 1.2. 3 any any any any any
1.2.* : 1. 2 .* any any any any
1.* : any any any 1 any any any
* : any any any any any any any
------------------------
Using ajb regexp
1.2.3 : 1 2 3 any any any any
1.2.* : 1 2 any any any any any
1.* : 1 any any any any any any
* : any any any any any any any
------------------------
Using jrudolph regexp
1.2.3 : 1.2. 1. 1 2 3 any any
1.2.* : 1.2. 1. 1 2 any any any
1.* : 1. any any 1 any any any
* : any any any any any any any
------------------------
Using greg regexp
1.2.3 : 1.2.3 .3 any any any any any
1.2.* : 1.2.* .2 .* any any any any
1.* : 1.* any .* any any any any
* : any any any any any any any
------------------------
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Perhaps a more concise one could be :
^(?:(\d+)\.){0,2}(\*|\d+)$
This can then be enhanced to 1.2.3.4.5.* or restricted exactly to X.Y.Z using * or {2} instead of {0,2}
This should work for what you stipulated. It hinges on the wild card position and is a nested regex:
^((\*)|([0-9]+(\.((\*)|([0-9]+(\.((\*)|([0-9]+)))?)))?))$
For parsing version numbers that follow these rules:
- Are only digits and dots
- Cannot start or end with a dot
- Cannot be two dots together
This one did the trick to me.
^(\d+)((\.{1}\d+)*)(\.{0})$
Valid cases are:
1, 0.1, 1.2.1
Another try:
^(((\d+)\.)?(\d+)\.)?(\d+|\*)$
This gives the three parts in groups 4,5,6 BUT:
They are aligned to the right. So the first non-null one of 4,5 or 6 gives the version field.
1.2.3 gives 1,2,3
1.2.* gives 1,2,*
1.2 gives null,1,2
*** gives null,null,*
1.* gives null,1,*
My take on this, as a good exercise - vparse, which has a tiny source, with a simple function:
function parseVersion(v) {
var m = v.match(/\d*\.|\d+/g) || [];
v = {
major: +m[0] || 0,
minor: +m[1] || 0,
patch: +m[2] || 0,
build: +m[3] || 0
};
v.isEmpty = !v.major && !v.minor && !v.patch && !v.build;
v.parsed = [v.major, v.minor, v.patch, v.build];
v.text = v.parsed.join('.');
return v;
}
Sometimes version numbers might contain alphanumeric minor information (e.g. 1.2.0b or 1.2.0-beta). In this case I am using this regex:
([0-9]{1,4}(\.[0-9a-z]{1,6}){1,5})
(?ms)^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$
Does exactly match your 6 first examples, and rejects the 4 others
group 1: major or major.minor or '*'
group 2 if exists: minor or *
group 3 if exists: *
You can remove '(?ms)'
I used it to indicate to this regexp to be applied on multi-lines through QuickRex
This matches 1.2.3.* too
^(*|\d+(.\d+){0,2}(.*)?)$
I would propose the less elegant:
(*|\d+(.\d+)?(.*)?)|\d+.\d+.\d+)
Keep in mind regexp are greedy, so if you are just searching within the version number string and not within a bigger text, use ^ and $ to mark start and end of your string.
The regexp from Greg seems to work fine (just gave it a quick try in my editor), but depending on your library/language the first part can still match the "*" within the wrong version numbers. Maybe I am missing something, as I haven't used Regexp for a year or so.
This should make sure you can only find correct version numbers:
^(\*|\d+(\.\d+)*(\.\*)?)$
edit: actually greg added them already and even improved his solution, I am too slow :)
It seems pretty hard to have a regex that does exactly what you want (i.e. accept only the cases that you need and reject all others and return some groups for the three components). I've give it a try and come up with this:
^(\*|(\d+(\.(\d+(\.(\d+|\*))?|\*))?))$
IMO (I've not tested extensively) this should work fine as a validator for the input, but the problem is that this regex doesn't offer a way of retrieving the components. For that you still have to do a split on period.
This solution is not all-in-one, but most times in programming it doesn't need to. Of course this depends on other restrictions that you might have in your code.
Specifying XSD elements:
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\..*)?"/>
</xs:restriction>
</xs:simpleType>
One more solution:
^[1-9][\d]*(.[1-9][\d]*)*(.\*)?|\*$
I found this, and it works for me:
/(\^|\~?)(\d|x|\*)+\.(\d|x|\*)+\.(\d|x|\*)+
/^([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})(\-(alpha|beta|rc|HP|CP|SP|hp|cp|sp)[1-9]\d*)?(\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?(\.[0-9a-zA-Z]+)?$/
A normal version: ([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})
A Pre-release or patched version: (\-(alpha|beta|rc|EP|HP|CP|SP|ep|hp|cp|sp)[1-9]\d*)? (Extension Pack, Hotfix Pack, Coolfix Pack, Service Pack)
Customized version: (\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?
Internal version: (\.[0-9a-zA-Z]+)?