A regex for version number parsing - regex

I have a version number of the following form:
version.release.modification
where version, release and modification are either a set of digits or the '*' wildcard character. Additionally, any of these numbers (and any preceding .) may be missing.
So the following are valid and parse as:
1.23.456 = version 1, release 23, modification 456
1.23 = version 1, release 23, any modification
1.23.* = version 1, release 23, any modification
1.* = version 1, any release, any modification
1 = version 1, any release, any modification
* = any version, any release, any modification
But these are not valid:
*.12
*123.1
12*
12.*.34
Can anyone provide me a not-too-complex regex to validate and retrieve the release, version and modification numbers?

I'd express the format as:
"1-3 dot-separated components, each numeric except that the last one may be *"
As a regexp, that's:
^(\d+\.)?(\d+\.)?(\*|\d+)$
[Edit to add: this solution is a concise way to validate, but it has been pointed out that extracting the values requires extra work. It's a matter of taste whether to deal with this by complicating the regexp, or by processing the matched groups.
In my solution, the groups capture the "." characters. This can be dealt with using non-capturing groups as in ajborley's answer.
Also, the rightmost group will capture the last component, even if there are fewer than three components, and so for example a two-component input results in the first and last groups capturing and the middle one undefined. I think this can be dealt with by non-greedy groups where supported.
Perl code to deal with both issues after the regexp could be something like this:
@version = ();
@groups = ($1, $2, $3);
foreach (@groups) {
    next if !defined;
    s/\.//;
    push @version, $_;
}
($major, $minor, $mod) = (@version, "*", "*");
Which isn't really any shorter than splitting on "."
]

Use a regex and now you have two problems. I would split the string on dots ("."), then make sure that each part is either a wildcard or a set of digits (a regex is perfect for that). If the input is valid, you just return the corresponding chunk of the split.
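A minimal Python sketch of the split-then-check approach might look like the following. The function name `parse_version` and the padding with "*" are assumptions made for illustration; the check is slightly stricter than "each part is a wildcard or digits" so that it follows the question's rule that only the last part may be the wildcard.

```python
import re

def parse_version(text):
    """Return (version, release, modification), or None if text is invalid.

    Missing trailing parts are reported as "*" (an illustrative choice).
    """
    parts = text.split(".")
    if not 1 <= len(parts) <= 3:
        return None
    # Every part except the last must be digits; the last may also be "*".
    *head, last = parts
    if not all(re.fullmatch(r"\d+", p) for p in head):
        return None
    if not re.fullmatch(r"\d+|\*", last):
        return None
    # Pad to three fields with wildcards.
    return tuple(parts + ["*"] * (3 - len(parts)))

for s in ["1.23.456", "1.23", "1.*", "*", "*.12", "12*", "12.*.34"]:
    print(s, "->", parse_version(s))
```

The regex is only used on single components here, which keeps each check trivial to read.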

Thanks for all the responses! This is ace :)
Based on OneByOne's answer (which looked the simplest to me), I added some non-capturing groups (the '(?:' parts - thanks to VonC for introducing me to non-capturing groups!), so the groups that do capture only contain the digits or * character.
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Many thanks everyone!
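For anyone wanting to try this outside Perl, here is a hedged Python sketch using the same pattern. The helper name `parse_version` and the right-padding with "*" are my own assumptions, not part of the answers above. Because the last group always captures the final component, the groups that did not participate are dropped before padding.

```python
import re

VERSION_RE = re.compile(r"^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$")

def parse_version(text):
    """Return a 3-tuple of strings, or None if text is not a valid version."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    m = VERSION_RE.fullmatch(text)
    if m is None:
        return None
    # Drop groups that did not participate, then pad on the right with "*".
    found = [g for g in m.groups() if g is not None]
    return tuple(found + ["*"] * (3 - len(found)))

print(parse_version("1.23"))     # ('1', '23', '*')
print(parse_version("12.*.34"))  # None
```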

This might work:
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
At the top level, "*" is a special case of a valid version number. Otherwise, it starts with a number. Then there are zero, one, or two ".nn" sequences, followed by an optional ".*". This regex would accept 1.2.3.* which may or may not be permitted in your application.
The code for retrieving the matched sequences, especially the (\.\d+){0,2} part, will depend on your particular regex library.
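As one data point on the library dependence, here is a small sketch with Python's `re` module (the name `is_valid` is just for illustration). It confirms that 1.2.3.* is accepted, and shows that in this library the repeated group only remembers its last repetition.

```python
import re

PATTERN = re.compile(r"^(\*|\d+(\.\d+){0,2}(\.\*)?)$")

def is_valid(text):
    return PATTERN.fullmatch(text) is not None

print(is_valid("1.2.3.*"))  # True: four components are accepted

m = PATTERN.fullmatch("1.2.3")
# Group 2 is the repeated (\.\d+); Python keeps only the final repetition.
print(m.group(2))  # .3
```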

My 2 cents: I had this scenario: I had to parse version numbers out of a string literal.
(I know this is very different from the original question, but googling to find a regex for parsing version number showed this thread at the top, so adding this answer here)
So the string literal would be something like: "Service version 1.2.35.564 is running!"
I had to parse the 1.2.35.564 out of this literal. Taking a cue from @ajborley, my regex is as follows:
(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)
A small C# snippet to test this looks like below:
void Main()
{
Regex regEx = new Regex(@"(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)", RegexOptions.Compiled);
Match version = regEx.Match("The Service SuperService 2.1.309.0) is Running!");
version.Value.Dump("Version using RegEx"); // Prints 2.1.309.0
}

I had a requirement to match version numbers that follow the Maven convention, or are even just a single digit, but with no qualifier in any case. It was peculiar and took me some time, but I came up with this:
'^[0-9][0-9.]*$'
This makes sure the version:
Starts with a digit
Can have any number of digits
Contains only digits and '.'
One drawback is that the version can end with '.', but the pattern handles versions of arbitrary length (crazy versioning, if you want to call it that).
Matches:
1.2.3
1.09.5
3.4.4.5.7.8.8.
23.6.209.234.3
If you are unhappy with a trailing '.', maybe you can combine this with endswith logic.
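A rough Python sketch of that combination, assuming a hypothetical helper named `is_plain_version`. Note that the pattern on its own also lets two dots appear together, which the endswith check does not catch.

```python
import re

def is_plain_version(text):
    """Digits and dots only, starting with a digit and not ending with a dot."""
    return (
        re.fullmatch(r"[0-9][0-9.]*", text) is not None
        and not text.endswith(".")
    )

print(is_plain_version("1.09.5"))          # True
print(is_plain_version("3.4.4.5.7.8.8."))  # False: trailing dot rejected
print(is_plain_version("1..2"))            # True: consecutive dots slip through
```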

Don't know what platform you're on but in .NET there's the System.Version class that will parse "n.n.n.n" version numbers for you.

I've seen a lot of answers, but... I have a new one. It works for me at least. I've added a new restriction: no component (major, minor or patch) may start with a zero followed by other digits.
01.0.0 is not valid
1.0.0 is valid
10.0.10 is valid
1.0.0000 is not valid
^(?:(0\\.|([1-9]+\\d*)\\.))+(?:(0\\.|([1-9]+\\d*)\\.))+((0|([1-9]+\\d*)))$
It's based on a previous one, but I like this solution better... for me ;)
Enjoy!!!
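A quick way to check the four examples, sketched in Python. This assumes the doubled backslashes in the pattern above come from a string literal, so the raw string below uses single ones; the names are illustrative. One thing the check also shows is that the + quantifiers allow more than three components.

```python
import re

# Same pattern as above, with the string-literal escaping undone (assumption).
NO_LEADING_ZERO_RE = re.compile(
    r"^(?:(0\.|([1-9]+\d*)\.))+(?:(0\.|([1-9]+\d*)\.))+((0|([1-9]+\d*)))$"
)

def is_valid(text):
    return NO_LEADING_ZERO_RE.fullmatch(text) is not None

for s in ["01.0.0", "1.0.0", "10.0.10", "1.0.0000", "1.2.3.4"]:
    print(s, is_valid(s))
```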

I tend to agree with split suggestion.
I've created a "tester" for your problem in Perl:
#!/usr/bin/perl -w
@strings = ( "1.2.3", "1.2.*", "1.*","*" );
%regexp = ( svrist => qr/(?:(\d+)\.(\d+)\.(\d+)|(\d+)\.(\d+)|(\d+))?(?:\.\*)?/,
onebyone => qr/^(\d+\.)?(\d+\.)?(\*|\d+)$/,
greg => qr/^(\*|\d+(\.\d+){0,2}(\.\*)?)$/,
vonc => qr/^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$/,
ajb => qr/^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$/,
jrudolph => qr/^(((\d+)\.)?(\d+)\.)?(\d+|\*)$/
);
foreach my $r (keys %regexp){
my $reg = $regexp{$r};
print "Using $r regexp\n";
foreach my $s (@strings){
print "$s : ";
if ($s =~m/$reg/){
my ($main, $maj, $min,$rev,$ex1,$ex2,$ex3) = ("any","any","any","any","any","any","any");
$main = $1 if ($1 && $1 ne "*") ;
$maj = $2 if ($2 && $2 ne "*") ;
$min = $3 if ($3 && $3 ne "*") ;
$rev = $4 if ($4 && $4 ne "*") ;
$ex1 = $5 if ($5 && $5 ne "*") ;
$ex2 = $6 if ($6 && $6 ne "*") ;
$ex3 = $7 if ($7 && $7 ne "*") ;
print "$main $maj $min $rev $ex1 $ex2 $ex3\n";
}else{
print " nomatch\n";
}
}
print "------------------------\n";
}
Current output:
> perl regex.pl
Using onebyone regexp
1.2.3 : 1. 2. 3 any any any any
1.2.* : 1. 2. any any any any any
1.* : 1. any any any any any any
* : any any any any any any any
------------------------
Using svrist regexp
1.2.3 : 1 2 3 any any any any
1.2.* : any any any 1 2 any any
1.* : any any any any any 1 any
* : any any any any any any any
------------------------
Using vonc regexp
1.2.3 : 1.2. 3 any any any any any
1.2.* : 1. 2 .* any any any any
1.* : any any any 1 any any any
* : any any any any any any any
------------------------
Using ajb regexp
1.2.3 : 1 2 3 any any any any
1.2.* : 1 2 any any any any any
1.* : 1 any any any any any any
* : any any any any any any any
------------------------
Using jrudolph regexp
1.2.3 : 1.2. 1. 1 2 3 any any
1.2.* : 1.2. 1. 1 2 any any any
1.* : 1. any any 1 any any any
* : any any any any any any any
------------------------
Using greg regexp
1.2.3 : 1.2.3 .3 any any any any any
1.2.* : 1.2.* .2 .* any any any any
1.* : 1.* any .* any any any any
* : any any any any any any any
------------------------

^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Perhaps a more concise one could be :
^(?:(\d+)\.){0,2}(\*|\d+)$
This can then be enhanced to 1.2.3.4.5.* or restricted exactly to X.Y.Z using * or {2} instead of {0,2}
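One caveat worth checking before relying on the {0,2} form for extraction: in engines such as Python's `re`, a capturing group inside a quantifier keeps only its last repetition, so the first component is lost. A hedged sketch (the `parse` helper is hypothetical) that validates with the concise pattern and then splits to recover the parts:

```python
import re

CONCISE_RE = re.compile(r"^(?:(\d+)\.){0,2}(\*|\d+)$")

# The repeated group remembers only its final repetition.
print(CONCISE_RE.fullmatch("1.2.3").groups())  # ('2', '3')

def parse(text):
    """Validate with the concise regex, then split to get every component."""
    if CONCISE_RE.fullmatch(text) is None:
        return None
    return text.split(".")

print(parse("1.2.3"))    # ['1', '2', '3']
print(parse("12.*.34"))  # None
```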

This should work for what you stipulated. It hinges on the wild card position and is a nested regex:
^((\*)|([0-9]+(\.((\*)|([0-9]+(\.((\*)|([0-9]+)))?)))?))$

For parsing version numbers that follow these rules:
- Are only digits and dots
- Cannot start or end with a dot
- Cannot be two dots together
This one did the trick to me.
^(\d+)((\.{1}\d+)*)(\.{0})$
Valid cases are:
1, 0.1, 1.2.1

Another try:
^(((\d+)\.)?(\d+)\.)?(\d+|\*)$
This gives the three parts in groups 3, 4 and 5, BUT:
They are aligned to the right, so the first non-null one of groups 3, 4 or 5 gives the version field.
1.2.3 gives 1,2,3
1.2.* gives 1,2,*
1.2 gives null,1,2
* gives null,null,*
1.* gives null,1,*
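To turn the right-aligned groups into left-aligned fields, something like this Python sketch could be used. Counting opening parentheses from 1, Python's `re` numbers the three field groups 3, 4 and 5; the function name and the "*" padding are assumptions for illustration.

```python
import re

RIGHT_ALIGNED_RE = re.compile(r"^(((\d+)\.)?(\d+)\.)?(\d+|\*)$")

def parse_version(text):
    m = RIGHT_ALIGNED_RE.fullmatch(text)
    if m is None:
        return None
    # Keep the groups that matched, in order, so the first one is the version.
    fields = [g for g in m.group(3, 4, 5) if g is not None]
    # Pad missing trailing fields with the wildcard.
    return tuple(fields + ["*"] * (3 - len(fields)))

print(RIGHT_ALIGNED_RE.fullmatch("1.2").group(3, 4, 5))  # (None, '1', '2')
print(parse_version("1.2"))                              # ('1', '2', '*')
```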

My take on this, as a good exercise - vparse, which has a tiny source, with a simple function:
function parseVersion(v) {
var m = v.match(/\d*\.|\d+/g) || [];
v = {
major: +m[0] || 0,
minor: +m[1] || 0,
patch: +m[2] || 0,
build: +m[3] || 0
};
v.isEmpty = !v.major && !v.minor && !v.patch && !v.build;
v.parsed = [v.major, v.minor, v.patch, v.build];
v.text = v.parsed.join('.');
return v;
}

Sometimes version numbers might contain alphanumeric minor information (e.g. 1.2.0b or 1.2.0-beta). In this case I am using this regex:
([0-9]{1,4}(\.[0-9a-z]{1,6}){1,5})

(?ms)^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$
Does exactly match your 6 first examples, and rejects the 4 others
group 1: major or major.minor or '*'
group 2 if exists: minor or *
group 3 if exists: *
You can remove '(?ms)'
I used it to indicate to this regexp to be applied on multi-lines through QuickRex

This matches 1.2.3.* too:
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
I would propose the less elegant:
(\*|\d+(\.\d+)?(\.\*)?|\d+\.\d+\.\d+)

Keep in mind regexp are greedy, so if you are just searching within the version number string and not within a bigger text, use ^ and $ to mark start and end of your string.
The regexp from Greg seems to work fine (just gave it a quick try in my editor), but depending on your library/language the first part can still match the "*" within the wrong version numbers. Maybe I am missing something, as I haven't used Regexp for a year or so.
This should make sure you can only find correct version numbers:
^(\*|\d+(\.\d+)*(\.\*)?)$
edit: actually greg added them already and even improved his solution, I am too slow :)

It seems pretty hard to have a regex that does exactly what you want (i.e. accept only the cases that you need, reject all others, and return some groups for the three components). I've given it a try and come up with this:
^(\*|(\d+(\.(\d+(\.(\d+|\*))?|\*))?))$
IMO (I've not tested extensively) this should work fine as a validator for the input, but the problem is that this regex doesn't offer a way of retrieving the components. For that you still have to do a split on period.
This solution is not all-in-one, but most times in programming it doesn't need to be. Of course this depends on other restrictions that you might have in your code.

Specifying XSD elements:
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\..*)?"/>
</xs:restriction>
</xs:simpleType>

One more solution:
^[1-9][\d]*(.[1-9][\d]*)*(.\*)?|\*$

I found this, and it works for me:
/(\^|\~?)(\d|x|\*)+\.(\d|x|\*)+\.(\d|x|\*)+

/^([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})(\-(alpha|beta|rc|HP|CP|SP|hp|cp|sp)[1-9]\d*)?(\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?(\.[0-9a-zA-Z]+)?$/
A normal version: ([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})
A Pre-release or patched version: (\-(alpha|beta|rc|EP|HP|CP|SP|ep|hp|cp|sp)[1-9]\d*)? (Extension Pack, Hotfix Pack, Coolfix Pack, Service Pack)
Customized version: (\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?
Internal version: (\.[0-9a-zA-Z]+)?

Related

How to split a string in db2?

I have some URLs in my cas_fnd_dwd_det table:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
How do I copy the letters between the last '/' and '.pdf' to another column?
Expected outcome:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
The URL prefixes below are static:
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Please advise: how do I select the codes between the last '/' and '.pdf'?
I would recommend to take a look at REGEXP_SUBSTR. It allows to apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following would return the last slash, filename and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REPLACE and the pattern is from this SO question with the pdf file extension added. It splits the string in three groups: everything up to the last slash, then the file name, then the ".pdf". The '$1' returns the group 1 (groups start with 0). Group 2 would be the ".pdf".
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
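To see what each group of the replace pattern holds, here is a small sketch in Python rather than Db2 (an illustration of the regex only, not of Db2's functions). The name `file_code` is hypothetical, and the dot before "pdf" is escaped here so that it matches a literal dot.

```python
import re

PATTERN = re.compile(r"(?:.+/)(.+)(\.pdf)")

def file_code(url):
    """Replace the whole URL with capture group 1, the bare file name."""
    return PATTERN.sub(r"\1", url)

m = PATTERN.fullmatch("http://fobar.com/one/two/abc.pdf")
print(m.groups())                                      # ('abc', '.pdf')
print(file_code("www.casiac.net/fnds/CASI/qnxp.pdf"))  # qnxp
```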
For Db2 versions older than 11.1: I am not sure whether this works on 9.5, but it should definitely work from 9.7 onward.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det

How to use the regular expression group In Lugaru Epsilon (Editor) when the total match exceeds 9 groups?

This is about regular expression replacement in the Epsilon editor. I have a csv file that I wanted to replace the texts with a certain pattern.
The pattern replacement works perfectly when I use #1, #2 etc., in the replacement group.
But when I enter #10, it is the first group that gets placed there. How do I use a matching group greater than 9?
(Well, a very late answer, I realize now; anyway...)
I'm not able to find the documentation, but I think that only 1 (0 for the whole pattern) to 9 (at least in interactive commands) are supported.
I found this code, in src/searc.e:
...
char *with;
...
if (*with != '#')
insert(*with);
else if (isdigit(*++with)) {
bufnum = orig;
group = *with - '0';
buf_xfer(tmp, find_group(group, 1),
find_group(group, 0));
bufnum = tmp;
} else {
...
It seems to me that only the first character after # is considered.
You may try mailing support@lugaru.com for further clarification; I have always found Steven Doerfler very helpful. (Epsilon 14 is now in beta, so it could be an opportunity to improve the documentation.)

How do I group regular expressions past the 9th backreference?

Ok so I am trying to group past the 9th backreference in notepad++. The wiki says that I can use group naming to go past the 9th reference. However, I can't seem to get the syntax right to do the match. I am starting off with just two groups to make it simple.
Sample Data
1000,1000
Regex.
(?'a'[0-9]*),([0-9]*)
According to the docs I need to do the following.
(?<some name>...), (?'some name'...),(?(some name)...)
Names this group some name.
However, the result is that it can't find my text. Any suggestions?
You can simply reference groups > 9 in the same way as those < 10
i.e $10 is the tenth group.
For (naive) example:
String:
abcdefghijklmnopqrstuvwxyz
Regex find:
(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)
Replace:
$10
Result:
kqrstuvwxyz
My test was performed in Notepad++ v6.1.2 and gave the result I expected.
Update: This still works as of v7.5.6
SarcasticSully resurrected this to ask the question:
"What if you want to replace with the 1st group followed by the character '0'?"
To do this change the replace to:
$1\x30
Which is replacing with group 1 and the hex character 30 - which is a 0 in ascii.
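The same ambiguity exists in other regex libraries, each with its own way out. As a comparison only (this is Python's `re` module, not Notepad++), a sketch of the two cases: referencing group 10, and referencing group 1 followed by a literal 0.

```python
import re

pattern = r"(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)"
text = "abcdefghijklmnopqrstuvwxyz"

# \g<10> is an unambiguous reference to the tenth group.
print(re.sub(pattern, r"\g<10>", text))  # kqrstuvwxyz

# \g<1>0 is group 1 followed by a literal "0".
print(re.sub(pattern, r"\g<1>0", text))  # b0qrstuvwxyz
```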
A very belated answer to help others who land here from Google (as I did). Named backreferences in notepad++ substitutions look like this: $+{name}. For whatever reason.
There's a deviation from standard regex gotcha here, though... named backreferences are also given numbers. In standard regex, if you have (.*)(?<name> & )(.*), you'd replace with $1${name}$2 to get the exact same line you started with. In notepad++, you would have to use $1$+{name}$3.
Example: I needed to clean up a Visual Studio .sln file for mismatched configurations. The text I needed to replace looked like this:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = Release|Any CPU
My search RegEx:
^(\s*\{[^}]*\}\.)(?<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)*(\..+=\s*)(.*)$
My replacement RegEx:
$1$+{config}$3$+{config}
The result:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = QA|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = QA|x86
Hope this helps someone.
The usual syntax of referencing groups with \x will interpret \10 as a reference to group 1 followed by a 0.
You need to use instead the alternative syntax of $x with $10.
Note : Some people seem to doubt there's ever any reason to have 10 groups.
I have a simple one, I wanted to rename a group of files named <name_start>DDMMYYYY_TIME_DDMMYYYY_TIME<name_end> as <name_start>YYYYMMDD_TIME_YYYYMMDD_TIME<name_end>, and ended with replacing my input matches with : rename "\1" "\2\5\4\3_\6_\9\8\7_$10" since name_start and name_end were not always constant.
OK, matching is no problem, your example matches for me in the current Notepad++. This is an important point. To use PCRE regex in Notepad++, you need a Version >= 6.0.
The other point is, where do you want to use the backreference? I can use named backreferences without problems within the regex, but not in the replacement string.
This means that
(?'a'[0-9]*),([0-9]*),\g{a}
will match
1000,1001,1000
But I don't know a way to use named groups or groups > 9 in the replacement string.
Do you really need more than 9 backreferences in the replacement string? If you just need more than 9 groups, but not all of them in the replacement, then make the groups you don't need to reuse non-capturing groups, by adding a ?: at the start of the group.
(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)
Here the second [0-9]* is group 1 and the fourth is group 2.
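The renumbering is easy to verify in any engine that supports non-capturing groups. A short Python sketch with made-up sample data:

```python
import re

# Only the second and fourth fields capture, so they become groups 1 and 2.
m = re.fullmatch(r"(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)", "10,20,30,40")
print(m.groups())  # ('20', '40')
```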

Parse labeled param strings with Regex

Can anyone help me with this one?
My objective here is to grab some info from a text file, present the user with it and ask for values to replace that info so to generate a new output. So I thought of using regular expressions.
My variables would be of the format: {#<num>[|<value>]}.
Here are some examples:
{#1}
{#2|label}
{#3|label|help}
{#4|label|help|something else}
So after some research and experimenting, I came up with this expression: \{\#(\d{1,})(?:\|{1}(.+))*\}
which works pretty well on most occasions, except on something like this:
{#1} some text {#2|label} some more text {#3|label|help}
In this case variables 2 & 3 are matched on a single occurrence rather than on 2 separate matches...
I've already tried to use lookahead commands for the trailing } of the expression, but I didn't manage to get it.
I'm targeting this expression for using into C#, should that further help anyone...
I like the results from this one:
\{\#(\d+)(?:|\|(.+?))\}
This returns 3 groups. The second group is the number (1, 2, 3) and the third group is the arguments ('label', 'label|help').
I prefer to remove the * in favor of | in order to capture all the arguments after the first pipe in the last grouping.
A regular expression which can be used would be something like
\{\#(\d+)(?:\|([^|}]+))*\}
This will prevent reading over any closing }.
Another possible solution (with slightly different behaviour) would be to use a non-greedy matcher (.+?) instead of the greedy version (.+).
Note: I also removed the {1} and replaced {1,} with + which are equivalent in your case.
Try this:
\{\#(\d+)(?:\|[^|}]+)*\}
In C#:
MatchCollection matches = Regex.Matches(mystring,
    @"\{\#(\d+)(?:\|[^|}]+)*\}");
It prevents the label and help from eating the | or }.
matches[0].Value => {#1}
matches[0].Groups[0].Value => {#1}
matches[0].Groups[1].Value => 1
matches[1].Value => {#2|label}
matches[1].Groups[0].Value => {#2|label}
matches[1].Groups[1].Value => 2
matches[2].Value => {#3|label|help}
matches[2].Groups[0].Value => {#3|label|help}
matches[2].Groups[1].Value => 3
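If the label and help values are needed as well, one option is to wrap the repeated part in its own capturing group and split it afterwards. The following is a Python sketch of that variation, not the C# from the answer; the extra group and the name `parse_params` are my own additions.

```python
import re

# Group 1: the number. Group 2: the whole "|a|b|c" tail (possibly empty).
PARAM_RE = re.compile(r"\{#(\d+)((?:\|[^|}]+)*)\}")

def parse_params(text):
    """Return a list of (number, [arguments]) for every {#n|...} in text."""
    result = []
    for m in PARAM_RE.finditer(text):
        # The tail starts with "|", so the first split item is always empty.
        args = m.group(2).split("|")[1:]
        result.append((int(m.group(1)), args))
    return result

sample = "{#1} some text {#2|label} some more text {#3|label|help}"
print(parse_params(sample))
# [(1, []), (2, ['label']), (3, ['label', 'help'])]
```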

regex: using surrounding brackets as delimiters while ignoring any inside brackets

I've built a complex (for me) regex to parse some file names, and it broadly works, except for a case where there are additional inside brackets.
(?'field'F[0-9]{1,4})(?'term'\(.*?\))(?'operator'_(OR|NOT|AND)_)?
In the following examples, I need to get the groups after the comment, but in the 3rd example, I am getting ((brackets) instead of ((brackets)are valid).
For the life of me I can't work out how to extend it to search for the final bracket.
C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR
Thanks
UPDATE: I'm using RegExr to experiment, then implementing in C# like this:
Regex r = new Regex(pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
foreach(Match m in r.Matches(foo))
{
//etc
}
UPDATE 2: I don't need to match up the brackets. Inside the one set of brackets can be any data, I just need it to terminate with the outside bracket.
UPDATE 3:
Another attempt: this works with extra brackets (examples 3 and 4), but it still fails to split out the extra terms (example 5), and unfortunately it includes the terminating ] in the group. How can I get it to search for (but not include) either )_ or )] as the delimiter, and include just the bracket?
(?'field'F[0-9]{1,4})(?'term'\(.*?\)[\]])(?'operator'_(OR|NOT|AND)_)?
Final update: I've decided it's not worth the effort in trying to parse this stupid format, so I'm going to ditch support for it and do something more productive with my time. Thank you all for your help, I have now seen the light!
Matching nested parenthesis with regex is a) not possible*, or b) results in a regex that is unmaintainable.
If you're simply trying to match the first ( until the last ) (not checking if the opening- and closing-parenthesis properly match), then just remove the ? after .*?.
* depending what regex flavour you're using.
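The difference between the lazy and greedy forms can be seen with a short Python sketch (the variable names are illustrative). The last line shows the cost of going greedy: with several terms in one name, the match runs to the final bracket.

```python
import re

LAZY = re.compile(r"F[0-9]{1,4}(\(.*?\))")
GREEDY = re.compile(r"F[0-9]{1,4}(\(.*\))")

name = "F21((brackets)are valid)"
print(LAZY.search(name).group(1))    # ((brackets)
print(GREEDY.search(name).group(1))  # ((brackets)are valid)

# Greedy matching swallows later terms too.
print(GREEDY.search("F21(red)_OR_F21(blue)").group(1))  # (red)_OR_F21(blue)
```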
Hmm, this usually isn't possible with most regex engines. Although it is possible in perl:
PerlMonks
By using a recursive regexp:
use strict;
use warnings;
my $textInner =
'(outer(inner(most "this (shouldn\'t match)" inner)))';
my $innerRe;
my $idx=0;
my(@match);
$innerRe = qr/
\(
(
(?:
[^()"]+
|
"[^"]*"
|
(??{$innerRe})
)*
)
\)(?{$match[$idx++]=$1;})
/sx;
$textInner =~ /^$innerRe/g;
print "inner: $match[0]\n";
It's also possible to do it in most regex engines provided that you want to do it to a fixed depth of bracket nesting. I wrote something in java a while ago that would construct a regex that would match brackets up to 6 deep.
Here's my java function for producing the regex:
public static String generateParensMatchStr(int depth, char openParen, char closeParen)
{
if (depth == 0)
return ".*?";
else
return "(?:\\" + openParen + generateParensMatchStr(depth - 1, openParen, closeParen) + "\\" +closeParen + "|.*?)+?";
}
Here are my test results in Python:
import re
x = r"""C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR"""
# Strip the trailing "// ..." comments.
x = re.sub(r"//.*", "", x)
# Collapse everything after the first operator into " _OP_ OP]".
x = re.sub(r"(_(OR|NOT|AND)_).*?]", r" \1 \2]", x)
# Grab each field with its bracketed term, up to the closing "]".
terms = re.findall(r"(?:F[0-9]{1,4}\(.*\).*(?=]))", x)
for term in terms:
    print(term)
this gives
F30(green)
F21(red) _OR_ OR
F21((brackets)are valid)
F21(any old brackets)))))are valid)
F21(brackets)))))) _OR_ OR
Will this meet your expected result?
re.findall("((?:F[0-9]{1,4}\(.*\))(?:_(?:OR|NOT|AND)_)?)+?",YOURTEXT)
gives
['F30(green)', 'F21(red)_OR_F21(blue)_NOT_F21(pink)', 'F21((brackets)are valid)', 'F21(any old brackets)))))are valid)', 'F21(brackets))))))_OR_F21(blue)']
in Python. What do you think?
Try this
/(F[0-9]{1,4})(\([^_\]]+\))(?:_(OR|NOT|AND)_)?/
tested with PHP, seems to give the expected results (as long as the strings inside round brackets don't contain _ or ]).
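For comparison, the same pattern can be tried in Python (a sketch only; the answer above was tested with PHP, and the name `terms` is hypothetical). `findall` returns one (field, term, operator) tuple per term, with an empty operator for the last one.

```python
import re

TERM_RE = re.compile(r"(F[0-9]{1,4})(\([^_\]]+\))(?:_(OR|NOT|AND)_)?")

def terms(name):
    """Return (field, term, operator) tuples; operator is '' when absent."""
    return TERM_RE.findall(name)

print(terms("[F21(red)_OR_F21(blue)_NOT_F21(pink)]"))
# [('F21', '(red)', 'OR'), ('F21', '(blue)', 'NOT'), ('F21', '(pink)', '')]
print(terms("[F21(brackets))))))_OR_F21(blue)]"))
# [('F21', '(brackets))))))', 'OR'), ('F21', '(blue)', '')]
```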