Trying to build a regular expression to check pattern - regex

a) Start and end with a number
b) Hyphen should start and end with a number
c) Comma should start and end with a number
d) Range of number should be from 1-31
[Edit: Need this rule in the regex, thanks Ed-Heal!]
e) If a number starts with a hyphen (-), it cannot end with any other character other than a comma AND follow all rules listed above.
E.g. 2-2,1 OR 2,2-1 is valid while 1-1-1-1 is not valid
E.g.
a) 1-5,5,15-29
b) 1,28,1-31,15
c) 15,25,3 [Edit: Replaced 56 with 3, thanks for pointing it out Brian!]
d) 1-24,5-6,2-9
Tried this but it passes even if the string starts with a comma:
/^[0-9]*(?:-[0-9]+)*(?:,[0-9]+)*$/

How about this? This will check rules a, b and c, at least, but does not check rule d.
/^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$/
If you need to ensure that all the numbers are in the range 1-31, then the expression will get a whole lot uglier:
/^([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?(,([1-9]|[12][0-9]|3[01])(-([1-9]|[12][0-9]|3[01]))?)*$/
Note that your example c contains a number, 56, that does not fall within the range 1-31, so it will not pass the second expression.

try this
^\d+(-\d+)?(,\d+(-\d+)?)*$
DEMO

Here is my workings
Numbers:
0|([1-9][0-9]*) call this expression A Note this expression treats zero as a special case and prevents numbers starting with a zero eg 0000001234
Number or a range:
A|(A-A) call this expression B (i.e (0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*)))
Comma operator
B(,B)*
Putting this togher should do the trick and we get
((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*))))(,((0|([1-9][0-9]*))|((0|([1-9][0-9]*))-(0|([1-9][0-9]*)))))*
You can abbreviatge this with \d for [0-9]

The other approaches have not restricted the allowed range of numbers. This allows 1 through 31 only, and seems simpler than some of the monstrosities people have come up with ...
^([12][0-9]?|3[01]?|[4-9])([-,]([12][0-9]?|3[01]?|[4-9]))*$
There is no check for sensible ranges; adding that would make the expression significantly more complex. In the end you might be better off with a simpler regex and implementing sanity checks in code.

I propose the following regex:
(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$
It looks awful but it isn't :) In fact the construction (?<name>...){0} allows us to define a named regex and to say that it doesn't match where it is defined. Thus I defined a pattern for numbers called number and a pattern for what I called a thing i.e. a range or number called thing. Next I know that your expression is a sequence of those things, so I use the named regex thing to build it with the construct \g<thing>. It gives (\g<thing>,)*\g<thing>. That's easy to read and understand. If you allow whitespaces to be non significant in your regex, you could even indent it like this:
(?<number>[1-9]|[12]\d|3[01]){0}
(?<thing>\g<number>-\g<number>|\g<number>){0}
^(\g<thing>,)*\g<thing>$/
I tested it with Ruby 1.9.2. Your regex engine should support named groups to allow that kind of clarity.
irb(main):001:0> s1 = '1-5,5,15-29'
=> "1-5,5,15-29"
irb(main):002:0> s2 = '1,28,1-31,15'
=> "1,28,1-31,15"
irb(main):003:0> s3 = '15,25,3'
=> "15,25,3"
irb(main):004:0> s4 = '1-24,5-6,2-9'
=> "1-24,5-6,2-9"
irb(main):005:0> r = /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
=> /(?<number>[1-9]|[12]\d|3[01]){0}(?<thing>\g<number>-\g<number>|\g<number>){0}^(\g<thing>,)*\g<thing>$/
irb(main):006:0> s1.match(r)
=> #<MatchData "1-5,5,15-29" number:"29" thing:"15-29">
irb(main):007:0> s2.match(r)
=> #<MatchData "1,28,1-31,15" number:"15" thing:"15">
irb(main):008:0> s3.match(r)
=> #<MatchData "15,25,3" number:"3" thing:"3">
irb(main):009:0> s4.match(r)
=> #<MatchData "1-24,5-6,2-9" number:"9" thing:"2-9">
irb(main):010:0> '1-1-1-1'.match(r)
=> nil

Using the same logic in my previous answer but limiting the range
A becomes [1-9]\d|3[01]
B becomes ([1-9]\d|3[01])|(([1-9]\d|3[01])-([1-9]\d|3[01]))
Overall expression
(([12]\d|3[01])|(([12]\d|3[01])-([12]\d|3[01])))(,(([12]\d|3[01])|(([12]\d|3[01])-([12]\d|3[01]))))*

An optimal Regex for this topic could be:
^(?'int'[1-2]?[1-9]|3[01])((,\g'int')|(-\g'int'(?=$|,)))*$
demo

Related

Regular expression for non-consecutive characters

I'm trying to create a regular expression that validates the following requirements:
Simultaneous use of Cyrillic and numbers is possible (without spaces and special characters)
Simultaneous use of Latin and numbers is possible (without spaces and special characters)
Simultaneous use of Cyrillic and Latin characters is not possible
The first letter must be capitalized, cannot be a number
Sequence length - from 2 to 16 digits inclusive
It is impossible to use 3 or more identical symbols in a row
I am using the following solution:
(?:([A-Z][A-Za-z0-9]{1,15}|[А-Я][А-ЯЁа-яё0-9]{1,15}))$
How do I change the regex to match the last requirement?
I use Google Sheets, in which it is impossible to use negative lookahead.
Sorry for my English.
I don't you can do this with a single regex without lookbehinds.
But there are workarounds for the "don't repeat same character 3 times" functionality.
The workarounds could be simpler if RE2 supported backreferences, but it does not. So the resulting rule will be longer.
You may define a column ValidNoThreeRepeats like this:
=
NOT(
OR(
AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1));
AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1));
AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1));
AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1));
AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1));
AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1));
AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1));
AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1));
AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1));
AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1));
AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1));
AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1));
AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1))
)
)
Or in a compacted way like this:
=NOT(OR(AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1));AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1));AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1));AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1));AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1));AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1));AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1));AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1));AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1));AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1));AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1));AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1));AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1))))
The idea is to have a rule that compares 1st, 2nd and 3rd character, then another rule that compares 2nd, 3rd, 4th, then another rule for 3rd, 4th, 5th, and so on and so forth. You join this rules with an OR, since if any of those match, it means that at some place some repetition exists. Finally, you negate the whole expresion with a NOT
Than you can check that both your regex and that column are valid.
Donno with which script language you're using
If's in PHP code form,I'd be using `Filter_var($param1, FILTER_VALIDATE..., FILTER_FLAG..)` if i were in your shoes .
It makes your way into both **validating** n **sanitizing** your snippet.
**PEACE**.

+ is supposed to be greedy, so why am I getting a lazy result?

Why does the following regex return 101 instead of 1001?
console.log(new RegExp(/1(0+)1/).exec('101001')[0]);
I thought that + was greedy, so the longer of the two matches should be returned.
IMO this is different from Using javascript regexp to find the first AND longest match because I don't care about the first, just the longest. Can someone correct my definition of greedy? For example, what is the difference between the above snippet and the classic "oops, too greedy" example of new RegExp(/<(.+)>/).exec('<b>a</b>')[0] giving b>a</b?
(Note: This seems to be language-agnostic (it also happens in Perl), but just for ease of running it in-browser I've used JavaScript here.)
Regex always reads from left to right! It will not look for something longer. In the case of multiple matches, you have to re-execute the regex to get them, and compare their lengths yourself.
Greedy means up to the rightmost occurrence, it never means the longest in the input string.
Regex itself is not the correct tool to extract the longest match. You might get all the substrings that match your pattern, and get the longest one using the language specific means.
Since the string is parsed from left to right, 101 will get matched in 101001 first, and the rest (001) will not match (as the 101 and 1001 matches are overlapping). You might use /(?=(10+1))./g and then check the length of each Group 1 value to get the longest one.
var regex = /(?=(10+1))./g;
var str = "101001";
var m, res=[];
while ((m = regex.exec(str)) !== null) {
res.push(m[1]);
}
console.log(res); // => ["101", "1001"]
if (res.length>0) {
console.log("The longest match:", res.sort(function (a, b) { return b.length - a.length; })[0]);
} // => 1001

A regex for maximal periodic substrings

This is a follow up to A regex to detect periodic strings .
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
#horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally a substring w is a maximal periodic substring if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run"
with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - #horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
As pointed out by #ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
\
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Regular expression to validate sum of numerics

I want to validate the user input,
it should accept,
1+2
1.2+56+3.5
it should not accept any alphabets, special characters other than . and +
and mainly it should not accept ++ and ..
please help me with regular expression.
This should work:
var regex = /^[0-9]+(\.[0-9]+)?(\+[0-9]+(\.[0-9]+)?)*$/;
"1+2".match(regex); // not null
"1.2+56+3.5".match(regex); // not null
"1++2".match(regex); // null
"1..2".match(regex); // null
online: http://regex101.com/r/zJ6tP7/1
Something like this should suffice
^([+]?\d+(\.\d+)?)*$
http://regex101.com/r/qE2kW1/2
Source: Validate mathematical expressions using regular expression?
Note that I'm assuming it should not accept .+ or +. as well. I'm not assuming that you require checking for multiple decimals prior to an addition, meaning this will accept 3.4.5 I'm also assuming you want it to start and end with numbers, so .5 and 4+ will fail.
(\d*[\.\+])*(\d)*
This takes any amount of number values, followed by a . or a +, any number of times, followed by any other number value.
To avoid things like 3.4.5 you'll likely need to use some sort of lookaround.
EDIT: Forgot to format regular expression.

Parse labeled param strings with Regex

Can anyone help me with this one?
My objective here is to grab some info from a text file, present the user with it and ask for values to replace that info so to generate a new output. So I thought of using regular expressions.
My variables would be of the format: {#<num>[|<value>]}.
Here are some examples:
{#1}<br>
{#2|label}<br>
{#3|label|help}<br>
{#4|label|help|something else}<br><br>
So after some research and experimenting, I came up with this expression: \{\#(\d{1,})(?:\|{1}(.+))*\}
which works pretty well on most of the ocasions, except when on something like this:
{#1} some text {#2|label} some more text {#3|label|help}
In this case variables 2 & 3 are matched on a single occurrence rather than on 2 separate matches...
I've already tried to use lookahead commands for the trailing } of the expression, but I didn't manage to get it.
I'm targeting this expression for using into C#, should that further help anyone...
I like the results from this one:
\{\#(\d+)(?:|\|(.+?))\}
This returns 3 groups. The second group is the number (1, 2, 3) and the third group is the arguments ('label', 'label|help').
I prefer to remove the * in favor of | in order to capture all the arguments after the first pipe in the last grouping.
A regular expression which can be used would be something like
\{\#(\d+)(?:\|([^|}]+))*\}
This will prevent reading over any closing }.
Another possible solution (with slightly different behaviour) would be to use a non-greedy matcher (.+?) instead of the greedy version (.+).
Note: I also removed the {1} and replaced {1,} with + which are equivalent in your case.
Try this:
\{\#(\d+)(?:\|[^|}]+)*\}
In C#:
MatchCollection matches = Regex.Matches(mystring,
#"\{\#(\d+)(?:\|[^|}]+)*\}");
It prevents the label and help from eating the | or }.
match[0].Value => {#1}
match[0].Groups[0].Value => {#1}
match[0].Groups[1].Value => 1
match[1].Value => {#2|label}
match[1].Groups[0].Value => {#2|label}
match[1].Groups[1].Value => 2
match[2].Value => {#3|label|help}
match[2].Groups[0].Value => {#3|label|help}
match[2].Groups[1].Value => 3