Regex Help, How do I make order of expressions not matter? - regex

I can't figure out how to make the order of the incoming string parameters (price, merchant, category) not matter to the regex. My regex matches the parts of the string but not the string as a whole. I need to be able to add \A and \Z to it.
Pattern:
(,?price:(;?(((\d+(\.\d+)?)|min)-((\d+(\.\d+)?)|max))|\d+)+){0,1}(,?merchant:\d+){0,1}(,?category:\d+){0,1}
Sample Strings:
price:1.00-max;3-12;23.34-12.19,category:3
merchant:25,price:1.00-max;3-12;23.34-12.19,category:3
price:1.00-max;3-12;23.34-12.19,category:3,merchant:25
category:3,price:1.00-max;3-12;23.34-12.19,merchant:25
Note: I'm going to add ?: to all my groups after I get it working.

You should probably just parse this string through normal parsing. Split it at the commas, then split each of those pieces into two by the colons. You can store validation regexes if you'd like to check each of those inputs individually.
If you do it through regex, you'll probably have to end up saying "this combination OR this combination OR this combination", which will hurt real bad.
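A minimal sketch of that split-then-validate approach, in Python. The function name parse_filters and the per-field patterns are assumptions for illustration (the patterns are loosely derived from the question's regex), and duplicate keys are rejected here, which the question does not explicitly ask for:

```python
import re

# Assumed per-field validators, adapted from the pattern in the question.
_RANGE = r"(?:\d+(?:\.\d+)?|min)-(?:\d+(?:\.\d+)?|max)"
VALIDATORS = {
    "price": re.compile(rf"(?:{_RANGE}|\d+)(?:;(?:{_RANGE}|\d+))*"),
    "merchant": re.compile(r"\d+"),
    "category": re.compile(r"\d+"),
}


def parse_filters(text):
    """Split on commas, then on the first colon, and validate each value."""
    result = {}
    for part in text.split(","):
        key, sep, value = part.partition(":")
        validator = VALIDATORS.get(key)
        if not sep or validator is None:
            raise ValueError(f"unknown or malformed field: {part!r}")
        if key in result:
            raise ValueError(f"duplicate field: {key!r}")
        if not validator.fullmatch(value):
            raise ValueError(f"bad value for {key!r}: {value!r}")
        result[key] = value
    return result
```

Because each field is looked up by name, the order in which the fields arrive no longer matters.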

You have three options:
(1) You can enumerate all the possible orders. For 3 variables there are 6 possibilities. Obviously this doesn't scale;
(2) You can accept possible duplicates; or
(3) You can break the string up and then parse it.
(2) means something like:
/(\b(price|category|merchant):(...).*?)*/
The real problem you're facing here is that order-independence is something regular expressions express very poorly. A regular expression describes a DFSM (deterministic finite state machine), also called a DFA (deterministic finite automaton). Such a machine has no memory beyond its current state, so the expression can't "remember" which fields it has already seen unless every combination is spelled out as a separate branch.
To get to that you have to add a "memory" usually in the form of a stack, which yields a PDA (pushdown automaton).
It's exactly the same problem people face when they try and parse HTML with regexes and get stuck on tag nesting issues and similar.
Basically you accept some edge conditions (like repeated values), split the string by comma and then parse or you're just using the wrong tool for the job.
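As a rough illustration of accepting repeated values in the regex and catching them afterwards, here is a Python sketch. The names FIELD, PATTERN and is_valid are made up for the example, and the price value is deliberately matched loosely with [^,]+ rather than with the full range grammar:

```python
import re

# Any of the three fields, in any order, separated by commas.
FIELD = r"(?:price:[^,]+|merchant:\d+|category:\d+)"
PATTERN = re.compile(rf"\A{FIELD}(?:,{FIELD})*\Z")


def is_valid(text):
    """Regex checks the overall shape; plain code rejects repeated fields."""
    if not PATTERN.match(text):
        return False
    keys = [part.split(":", 1)[0] for part in text.split(",")]
    return len(keys) == len(set(keys))
```

The regex on its own would happily accept price:1,price:2; the duplicate check is the part the regex cannot do cheaply.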

How about don't try and do it all with one Cthulhugex?
/price:([^,]*)/
/merchant:([^,]*)/
/category:([^,]*)/
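Something along these lines, sketched in Python. The helper name extract_fields is hypothetical, and the (?:^|,) prefix is an addition that is not in the three patterns above; it is there so that a key is only recognised at the start of a field:

```python
import re

# One small pattern per field instead of a single combined expression.
FIELD_PATTERNS = {
    name: re.compile(rf"(?:^|,){name}:([^,]*)")
    for name in ("price", "merchant", "category")
}


def extract_fields(text):
    """Return the fields that are present, regardless of their order."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found
```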

$string=<<<EOF
price:1.00-max;3-12;23.34-12.19,category:3
merchant:25,price:1.00-max;3-12;23.34-12.19,category:3
price:1.00-max;3-12;23.34-12.19,category:3,merchant:25
category:3,price:1.00-max;3-12;23.34-12.19,merchant:25
EOF;
$s = preg_replace("/\n+/",",",$string);
$s = explode(",",$s);
print_r($s);
output
$ php test.php
Array
(
[0] => price:1.00-max;3-12;23.34-12.19
[1] => category:3
[2] => merchant:25
[3] => price:1.00-max;3-12;23.34-12.19
[4] => category:3
[5] => price:1.00-max;3-12;23.34-12.19
[6] => category:3
[7] => merchant:25
[8] => category:3
[9] => price:1.00-max;3-12;23.34-12.19
[10] => merchant:25
)

Related

mvc phone number regex "or"

I'm trying to write validation for a phone number or cellphone number, with or without a country prefix.
For example:
1. 55-123-1234 (home number) or 055-123-1234 (cell phone) => [2,3][3][4]
2. or +999-55-123-1234 => +[3][2][3][4]
For now I'm using the following regex: [RegularExpression(@"^([0-9]{2,3})[-. ]?[0-9]{3}[-. ]?([0-9]{4,6})$")] but it covers only case 1.
The last [3][4] will always be present, so my question is whether there is a way to write => ([2,3]) or (+[3] [2]).
The validation needs to accept (+[3] [2]) [3] [4] or ([2,3]) [3] [4].
Is there any way to add an "or" between (+[3][2]) and ([2,3])?
Or maybe there is another way to make it valid?
Thanks in advance
OR in regex is done with the | character.
+[3][2] or [2,3] is then written : \+\d{3}\-\d{2}|\d{2,3}
So for your complete regex, you can try the following :
(?:\+\d{3}[-. ]?\d{2}|\d{2,3})[-. ]?\d{3}[-. ]?\d{4}
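A quick way to sanity-check that alternation outside MVC is a small Python sketch. The name is_phone is made up for the example, fullmatch stands in for the whole-string matching a validation attribute would do, and re.ASCII keeps \d limited to 0-9:

```python
import re

# Either "+ccc" plus a 2-digit prefix, or a bare 2-3 digit prefix,
# followed by the fixed 3-digit and 4-digit groups.
PHONE = re.compile(
    r"(?:\+\d{3}[-. ]?\d{2}|\d{2,3})[-. ]?\d{3}[-. ]?\d{4}",
    re.ASCII,
)


def is_phone(text):
    """True when the whole string is a phone number in one of the two shapes."""
    return PHONE.fullmatch(text) is not None
```

With this, 55-123-1234, 055-123-1234 and +999-55-123-1234 are all accepted, while +999-123-1234 (country prefix but no 2-digit area prefix) is rejected.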

Preg_match for items in a list

EDIT: The answer and comment below make me think that I didn't explain this clearly... I am looking for a regular expression that matches multiple occurrences of an item in a list. For example, I might want to take ABCBCBCBCBCD and get the array [BC, BC, BC, BC, BC] from it. I don't know how many items will be in the list. If it is ABCD, I want the list [BC]. If it is ABCBCD, I want [BC, BC]. I thought I could use /A(BC)+D/ to match all occurrences of BC, but that is not working.
The original question...
I have a set of very large data files. Per file, I only want a list of items out of it. The information I'm looking for has the format:
...<RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>...
The ... means that there is tons of text before and after this set. I can easily get the first item listed using the regex
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)~', $data, $matches);
Then, $matches[1] has "Amoxicillin, ". I tried to get all matches in the list using:
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)+~', $data, $matches);
That doesn't work. I get no matches. What is the syntax for "Multiple matches for the preceding sequence between ( and )"?
Of note, this is what is in $matches:
Array (
[0] => <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN>
[1] => <LN ID=531123>Amoxicillin</LN>
)
So, it looked at both items in the list, but only returned the first one. What I want is:
Array (
[0] => <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN>
[1] => <LN ID=531123>Amoxicillin</LN>
[2] => <LN ID=441656>Amikacin</LN>
)
Is this what you are looking for?
preg_match_all("/(\<RXCUI\>.*\<\/LN\>)/", $input_lines, $output_array);
http://www.phpliveregex.com/p/fpc
After a lot of research, it appears that this cannot be done with a single preg_match function. It requires two passes. The first will pull the entire match from the beginning to the end of the list. The second will break the list into the matches that are desired.
The first pass (assume $s = ...<RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>...)
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)+</ERS>~', $s, $match1);
Now, $match1[0] = <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>
I can use preg_match_all to get just what I want between the RXCUI and ERS elements
preg_match_all('~<LN[^>]*>[^<]*</LN>~', $match1[0], $match2);
Now, $match2[0] will contain an array:
[0] => <LN ID=531123>Amoxicillin</LN>
[1] => <LN ID=441656>Amikacin</LN>
It doesn't matter how many LN lines there are, the second preg_match_all will return them all.
This could be simplified a great deal if you could ensure that there are no LN elements anywhere else in the original document. I know that there are LN elements that are not part of the RXCUI section, so I can't just look for those.
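For comparison, the same two-pass idea sketched in Python with the standard re module. The function name list_items is hypothetical and the patterns are the ones from the PHP above, with the repeated group made non-capturing:

```python
import re


def list_items(data):
    """Pass 1 isolates the <RXCUI>...</ERS> block, pass 2 pulls out each <LN>."""
    block = re.search(
        r"<RXCUI>[^<]*(?:<LN[^>]*>[^<]*</LN>[^<]*)+</ERS>", data
    )
    if block is None:
        return []
    return re.findall(r"<LN[^>]*>[^<]*</LN>", block.group(0))
```

A repeated capture group keeps only one of its iterations, which is why the second pass is needed. LN elements outside the RXCUI block are ignored because the second pass only sees the text matched by the first.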

How to split array of strings from two sides?

I have an array of strings (n=1000) in this format:
strings<-c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
"GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz",
"GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
I'm wondering what may be a easy way to get this:
strings2<-c("2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL",
"2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL",
"2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL")
which means to trim off "GSM1234567" from the front and ".gz" from the end.
Here is a gsub solution. The first alternative matches a run of zero or more (*) alphanumeric characters at the start of the string (^), up to and including the first _; the second alternative (after the |) matches .gz at the end of the string ($). Both are replaced with the empty string.
gsub("^([[:alnum:]]*_)|(\\.gz)$", "", strings)
[1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
[2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
[3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"
Edit
I forgot to escape the second dot.
strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz", "GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz", "GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
strings2 <- lapply(strings, function (x) substr(x, 12, 58))
You can do this using sub:
sub('[^_]+_(.*)\\.gz', '\\1', strings)
# [1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
# [2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
# [3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"
Try:
gsub('^[^_]+_|\\.[^.]*$','',strings)
I strongly suggest doing this in two steps. The other solutions work but are completely unreadable: they don’t express the intent of your code. Here it is, clearly expressed:
trimmed_prefix = sub('^GSM\\d+_', '', strings)
strings2 = sub('\\.gz$', '', trimmed_prefix)
But admittedly this can be expressed in one step, and wouldn’t look too bad, as follows:
strings2 = sub('^GSM\\d+_(.*)\\.gz$', '\\1', strings)
In general, think carefully about the patterns you actually want to match: your question says to match the prefix “GSM1234567” but your example contradicts that. I’d generally choose a pattern that’s as specific as possible to avoid accidentally matching faulty input.
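The same two-step idea carries over to other languages. As a sketch in Python (strip_gsm_and_gz is a made-up name, and the patterns are the specific ones suggested above):

```python
import re


def strip_gsm_and_gz(names):
    """Drop a leading 'GSM<digits>_' and a trailing '.gz' from each name."""
    trimmed = [re.sub(r"^GSM\d+_", "", name) for name in names]
    return [re.sub(r"\.gz$", "", name) for name in trimmed]
```

A name that does not start with GSM followed by digits is left untouched at the front, which is the point of choosing a specific pattern.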

Trying to build a regular expression to check pattern

a) Start and end with a number
b) Hyphen should start and end with a number
c) Comma should start and end with a number
d) Range of number should be from 1-31
[Edit: Need this rule in the regex, thanks Ed-Heal!]
e) If a number starts with a hyphen (-), it cannot end with any other character other than a comma AND follow all rules listed above.
E.g. 2-2,1 OR 2,2-1 is valid while 1-1-1-1 is not valid
E.g.
a) 1-5,5,15-29
b) 1,28,1-31,15
c) 15,25,3 [Edit: Replaced 56 with 3, thanks for pointing it out Brian!]
d) 1-24,5-6,2-9
Tried this but it passes even if the string starts with a comma:
/^[0-9]*(?:-[0-9]+)*(?:,[0-9]+)*$/
How about this? This will check rules a, b and c, at least, but does not check rule d.
/^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$/
If you need to ensure that all the numbers are in the range 1-31, then the expression will get a whole lot uglier:
/^([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?(,([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?)*$/
Note that your example c contains a number, 56, that does not fall within the range 1-31, so it will not pass the second expression.
try this
^\d+(-\d+)?(,\d+(-\d+)?)*$
Here are my workings
Numbers:
0|([1-9][0-9]*) call this expression A. Note that this expression treats zero as a special case and prevents numbers starting with a zero, e.g. 0000001234
Number or a range:
A|(A-A) call this expression B (i.e. (0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*))))
Comma operator
B(,B)*
Putting this together should do the trick and we get
((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*))))(,((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*)))))*
You can abbreviate this with \d for [0-9]
The other approaches have not restricted the allowed range of numbers. This allows 1 through 31 only, and seems simpler than some of the monstrosities people have come up with ...
^([12][0-9]?|3[01]?|[4-9])([-,]([12][0-9]?|3[01]?|[4-9]))*$
There is no check for sensible ranges; adding that would make the expression significantly more complex. In the end you might be better off with a simpler regex and implementing sanity checks in code.
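A minimal sketch of that split between a simple regex and checks in code, in Python. The name is_valid_days and the low/high parameters are assumptions for illustration; per the question, a descending range such as 2-1 is still accepted:

```python
import re

# Shape only: numbers or number-number ranges, separated by commas.
SHAPE = re.compile(r"[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*")


def is_valid_days(text, low=1, high=31):
    """Check the shape with a regex, then the numeric bounds in plain code."""
    if not SHAPE.fullmatch(text):
        return False
    for item in text.split(","):
        for number in item.split("-"):
            if not low <= int(number) <= high:
                return False
    return True
```

The regex stays short enough to read, and changing the allowed range is a matter of passing different bounds rather than rewriting character classes.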
I propose the following regex:
(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$
It looks awful but it isn't :) In fact the construction (?<name>...){0} allows us to define a named regex and to say that it doesn't match where it is defined. Thus I defined a pattern for numbers called number and a pattern for what I called a thing i.e. a range or number called thing. Next I know that your expression is a sequence of those things, so I use the named regex thing to build it with the construct \g<thing>. It gives (\g<thing>,)*\g<thing>. That's easy to read and understand. If you allow whitespaces to be non significant in your regex, you could even indent it like this:
(?<number>[1-9]|[12]\d|3[01]){0}
(?<thing>\g<number>-\g<number>|\g<number>){0}
^(\g<thing>,)*\g<thing>$
I tested it with Ruby 1.9.2. Your regex engine should support named groups to allow that kind of clarity.
irb(main):001:0> s1 = '1-5,5,15-29'
=> "1-5,5,15-29"
irb(main):002:0> s2 = '1,28,1-31,15'
=> "1,28,1-31,15"
irb(main):003:0> s3 = '15,25,3'
=> "15,25,3"
irb(main):004:0> s4 = '1-24,5-6,2-9'
=> "1-24,5-6,2-9"
irb(main):005:0> r = /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
=> /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
irb(main):006:0> s1.match(r)
=> #<MatchData "1-5,5,15-29" number:"29" thing:"15-29">
irb(main):007:0> s2.match(r)
=> #<MatchData "1,28,1-31,15" number:"15" thing:"15">
irb(main):008:0> s3.match(r)
=> #<MatchData "15,25,3" number:"3" thing:"3">
irb(main):009:0> s4.match(r)
=> #<MatchData "1-24,5-6,2-9" number:"9" thing:"2-9">
irb(main):010:0> '1-1-1-1'.match(r)
=> nil
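If your engine has no \g<name> subexpression calls (Python's built-in re module, for instance, does not), much the same readability can be had by composing the pattern from named string pieces. A sketch, where NUMBER, THING, DAY_LIST and matches_day_list are names chosen for this example:

```python
import re

# Build the expression from small named pieces instead of \g<name> calls.
NUMBER = r"(?:[12][0-9]|3[01]|[1-9])"   # 1 through 31
THING = rf"{NUMBER}(?:-{NUMBER})?"       # a number or a number-number range
DAY_LIST = re.compile(rf"(?:{THING},)*{THING}")


def matches_day_list(text):
    """True when the whole string is a comma-separated list of things."""
    return DAY_LIST.fullmatch(text) is not None
```

The two-digit alternatives are listed before [1-9] so that 15 is read as one number rather than as 1 followed by a stray 5.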
Using the same logic as in my previous answer but limiting the range
A becomes [1-9]|[12]\d|3[01]
B becomes ([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01]))
Overall expression
(([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01])))(,(([1-9]|[12]\d|3[01])|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01]))))*
An optimal Regex for this topic could be:
^(?'int'[12]\d|3[01]|[1-9])((,\g'int')|(-\g'int'(?=$|,)))*$

Parse labeled param strings with Regex

Can anyone help me with this one?
My objective here is to grab some info from a text file, present the user with it and ask for values to replace that info so to generate a new output. So I thought of using regular expressions.
My variables would be of the format: {#<num>[|<value>]}.
Here are some examples:
{#1}
{#2|label}
{#3|label|help}
{#4|label|help|something else}
So after some research and experimenting, I came up with this expression: \{\#(\d{1,})(?:\|{1}(.+))*\}
which works pretty well on most occasions, except on something like this:
{#1} some text {#2|label} some more text {#3|label|help}
In this case variables 2 & 3 are matched on a single occurrence rather than on 2 separate matches...
I've already tried to use lookahead commands for the trailing } of the expression, but I didn't manage to get it.
I'm targeting this expression for using into C#, should that further help anyone...
I like the results from this one:
\{\#(\d+)(?:|\|(.+?))\}
This returns 3 groups. The second group is the number (1, 2, 3) and the third group is the arguments ('label', 'label|help').
I prefer to remove the * in favor of | in order to capture all the arguments after the first pipe in the last grouping.
A regular expression which can be used would be something like
\{\#(\d+)(?:\|([^|}]+))*\}
This will prevent reading over any closing }.
Another possible solution (with slightly different behaviour) would be to use a non-greedy matcher (.+?) instead of the greedy version (.+).
Note: I also removed the {1} and replaced {1,} with + which are equivalent in your case.
Try this:
\{\#(\d+)(?:\|[^|}]+)*\}
In C#:
MatchCollection matches = Regex.Matches(mystring,
@"\{\#(\d+)(?:\|[^|}]+)*\}");
It prevents the label and help from eating the | or }.
matches[0].Value => {#1}
matches[0].Groups[0].Value => {#1}
matches[0].Groups[1].Value => 1
matches[1].Value => {#2|label}
matches[1].Groups[0].Value => {#2|label}
matches[1].Groups[1].Value => 2
matches[2].Value => {#3|label|help}
matches[2].Groups[0].Value => {#3|label|help}
matches[2].Groups[1].Value => 3
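If you also need the label and help values, one option is to capture the whole argument tail and split it on the pipes afterwards. A sketch in Python rather than C#; the name find_placeholders is hypothetical, and the outer capture group around the repeated part is an addition to the pattern above:

```python
import re

# Group 1 is the number, group 2 the raw "|arg|arg..." tail (possibly empty).
PLACEHOLDER = re.compile(r"\{#(\d+)((?:\|[^|}]+)*)\}")


def find_placeholders(text):
    """Return (number, [args...]) for every {#n|...} placeholder in text."""
    found = []
    for match in PLACEHOLDER.finditer(text):
        # The tail starts with a pipe, so the first split piece is empty.
        args = match.group(2).split("|")[1:]
        found.append((int(match.group(1)), args))
    return found
```

Because [^|}]+ can never run past a closing brace, each placeholder is matched separately even when several appear on one line.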