Regex to match [integer][colon][alphanum][colon][integer]

I am attempting to match a string formatted as [integer][colon][alphanum][colon][integer]. For example, 42100:ZBA01:20. I need to split these by colon...
I'd like to learn regex, so if you could, tell me what I'm doing wrong:
This is what I've been able to come up with...
^(\d):([A-Za-z0-9_]):(\d)+$
^(\d+)$
^[a-zA-Z0-9_](:)+$
^(:)(\d+)$
At first I tried matching parts of the string, then the entire string. As you can tell, I'm not very familiar with regular expressions.
EDIT: The regex is for input into a desktop application. I was not certain what 'language' or 'type' of regex to use, so I assumed .NET.
I need to be able to identify each of those grouped characters, split by colon. So Group #1 should be the first integer, Group #2 should be the alphanumeric group, Group #3 should be an integer (ranging 1-4).
Thank you in advance,
Darius

I assume the semicolons (;) are meant to be colons (:)? All right, a bit of the basics.
^ matches the beginning of the input. That is, the regular expression will only match if it finds a match at the start of the input.
Similarly, $ matches the end of the input.
^(\d+)$ will match a string consisting only of one or more digits. This is because the match needs to start at the beginning of the input and stop at the end of the input. In other words, the whole input needs to match (not just a part of it). The + denotes one or more matches.
With this knowledge, you'll notice that ^(\d):([A-Za-z0-9_]):(\d)+$ was actually very close to being right. This expression indicates that the whole input needs to match:
1. one digit;
2. a colon;
3. one word character (or an alphanumeric character as you call it);
4. a colon;
5. one or more digits.
The problem is clearly in 1 and 3. You need to add a + quantifier there to match one or more times instead of just once. Also, you want to place these quantifiers inside the capturing groups in order to get the multiple matches inside one capturing group as opposed to receiving multiple capturing groups containing single matches.
^(\d+):([A-Za-z0-9_]+):(\d+)$
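As a quick check of the explanation above, here is a minimal sketch in Python. Python's re module is an assumption here (the question mentions .NET), but this particular pattern uses only syntax that both flavors share. The variable names are made up for illustration.

```python
import re

# Original attempt: groups 1 and 2 each hold exactly one character.
broken = re.compile(r"^(\d):([A-Za-z0-9_]):(\d)+$")
# Corrected version: the + quantifiers sit inside the capturing groups.
fixed = re.compile(r"^(\d+):([A-Za-z0-9_]+):(\d+)$")

sample = "42100:ZBA01:20"

broken_match = broken.match(sample)  # None: "42100" is more than one digit
fixed_match = fixed.match(sample)

if fixed_match:
    first, code, last = fixed_match.groups()
    print(first, code, last)  # 42100 ZBA01 20
```

Each parenthesized part comes back as its own group, so no separate split on the colon is needed.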

You need to use quantifiers
^(\d+):([A-Za-z0-9_]+):(\d+)$
    ^               ^     ^
+ is a quantifier that matches the preceding pattern one or more times
Now you can access the values by accessing the particular groups

Related

Catastrophic backtracking issue with regular expression on long names

I am currently trying to validate a regex pattern for a names list among other things.
It works so far, except when I test the limits. If the name is quite long (a maximum of 128 characters is allowed) and ends with a character that is defined in the inner group, such as a separator (e.g. a space or a punctuation mark), catastrophic backtracking occurs. I don't quite understand why, because I would assume that the first group (?:[\p{L}\p{Nd}\p{Ps}])+ must occur at least once, the group (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])? is optional, and when it is present it must be followed by (?:[\p{L}\p{Nd}\p{Pe}.]). The last two groups can repeat.
Full pattern
^(?!.{129})(?!.["])(?:[\p{L}\p{Nd}\p{Ps}])+(?:(?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])?(?:[\p{L}\p{Nd}\p{Pe}.]))*$
Tests & Samples
https://regex101.com/r/6E0Khd/1
You need to rephrase the pattern so that consecutive regex parts cannot match at the same location in the string.
You can use
^(?!.{129})(?!.")[\p{L}\p{Nd}\p{Ps}][\p{L}\p{Nd}\p{Pe}.]*(?:(?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)[\p{L}\p{Nd}\p{Pe}.]+)*$
See the regex demo.
Your regex was ^<Lookahead_1><Lookahead_2><P_I>+(?:<OPT_SEP>?<P_II>)*$. You need to make sure your string only starts with a char that matches <P_I> pattern, the rest of the chars can match <P_II> pattern. So, it should look like ^<Lookahead_1><Lookahead_2><P_I><P_II>*(?:<SEP><P_II>+)*$. Note the P_I pattern is used to match the first char only, P_II pattern is added right after P_I to match zero or more chars matching that pattern, SEP pattern is now obligatory and P_II pattern is quantified with +.
I also shrunk the (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}]) pattern into (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?). It matches either a horizontal whitespace followed by an optional punctuation symbol, or a punctuation symbol followed by an optional horizontal whitespace.
Note that \p{Zs} does not match a TAB char, you may want to use [\p{Zs}\t] instead.
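The restructuring can be tried out with a rough sketch in Python. Note the assumptions: Python's built-in re module does not support \p{...} classes, so the three building blocks below are ASCII-only stand-ins rather than the original Unicode classes, and the names P_I, P_II, SEP, NAME and is_valid_name are invented for this illustration. Only the shape of the pattern, first character, then P_II*, then (SEP P_II+)*, follows the answer.

```python
import re

# ASCII-only stand-ins (assumptions, not the original Unicode classes):
P_I = r"[A-Za-z0-9(\[{]"          # stands in for [\p{L}\p{Nd}\p{Ps}]
P_II = r"[A-Za-z0-9)\]}.]"        # stands in for [\p{L}\p{Nd}\p{Pe}.]
SEP = r"(?: [^\w\s]?|[^\w\s] ?)"  # stands in for (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)

# Same shape as the rewritten pattern: one P_I char, any number of P_II chars,
# then zero or more "mandatory separator + one or more P_II chars" blocks.
NAME = re.compile(
    r'^(?!.{129})(?!.")' + P_I + P_II + "*(?:" + SEP + P_II + "+)*$"
)


def is_valid_name(text: str) -> bool:
    """Return True if text fits the simplified name pattern."""
    return NAME.match(text) is not None


print(is_valid_name("Anna Maria"))     # True
print(is_valid_name("a" * 127 + " "))  # False: ends with a separator
```

Because the separator is now mandatory inside the repeated group, a long run of letters can only be consumed by P_II*, so the failing case with a trailing space is rejected without trying every way of splitting the run.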

Regex to match Zero and Comma

I'm looking for a regex string that will capture the following text:
0, ,0,
I've tried a few variations of this, but to no avail:
^[0,]+$
^[0,]
Any advice would be greatly appreciated.
Edited:
This will be used within another program that does regex pattern matching using Perl. The program reads a file with a list of entries within it. Using different profiles within the program I need to pick out entries that look like the following:
0, ,0,
These entries could also read like this:
1, ,0,
So the ideal regex I'm looking for would check: "Does it start with a 1 or 0, immediately followed by a comma, then a space, then a comma, then a digit (0-9), and end with a comma?"
Further examples:
0, ,8,
1, ,5,
I hope that helps to clarify the request.
Thanks,
(?:[0\s]+,)+
There is a space in your string, so you need \s to match it.
Your question doesn't mention a particular regex implementation, so the answers you have received might not work for you. (Lesson: always specify the environment in which you plan to use this.)
In any reasonably modern regex variant,
[0,]+
matches a sequence of one or more characters. The character class [abc] matches a single character which is one of the enumerated characters inside the square brackets, and the quantifier + says to match the previous expression as many times as possible, but at least once.
Matching and capturing are separate concepts in some implementations. Perhaps you want to add parentheses around this regex to specify that you want to capture, not just match, the strings in the input which this regular expression describes (and in some implementations, you want to add a flag, commonly g, to say that you want all matches, not just the first).
Regex: ^(?:[0 ],)+$ or ^(?:[0\s],)+$
Details:
^ asserts position at start of the string
(?:) Non-capturing group
[] Match a single character present in the list
+ Matches between one and unlimited times
$ asserts position at the end of the string
\s matches any whitespace character
Regex demo
You need to capture spaces too with, for instance, \s:
^[0,\s]+$
\s matches any whitespace character and, for ASCII input, is roughly equivalent to [\r\n\t\f\v ].
See result in action here: https://regex101.com/r/g3faWA/1
You can also remove the anchors (^ and $) if you want to match the parts of the line that contain 0 and commas even if the line contains other characters. That would give:
[0,\s]+
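To compare the suggestions against the sample entries, here is a small sketch in Python (an assumption; the question says the host program uses Perl-style matching, and these patterns use only syntax common to both). The pattern named strict is not from any of the answers: it is a hypothetical pattern written directly from the edited requirement (1 or 0, comma, space, comma, one digit, comma).

```python
import re

# Pattern from one of the answers: only zeros/whitespace, each followed by a comma.
loose = re.compile(r"^(?:[0\s],)+$")
# Hypothetical stricter pattern derived from the edited question.
strict = re.compile(r"^[01], ,\d,$")

samples = ["0, ,0,", "1, ,0,", "0, ,8,", "1, ,5,", "2, ,0,"]

# Map each sample to (matches loose, matches strict).
results = {s: (bool(loose.match(s)), bool(strict.match(s))) for s in samples}

for entry, (is_loose, is_strict) in results.items():
    print(f"{entry!r}: loose={is_loose} strict={is_strict}")
```

The loose pattern only accepts the first sample, while the stricter one accepts all four entries from the question and rejects a line that starts with 2.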

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs, but how about:
[A-E]{2,}
This is shorter than yours, but not equivalent: it also matches runs of mixed letters such as AB.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and stores it in group 1,
\1 is a backreference that matches the same text as group 1,
+ means one or more repetitions.
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there are no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you want to match case-insensitively:
^(?i)(?:([A-Z])\1+)+$
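The backreference patterns above can be exercised with a short Python sketch (Python is an assumption; the question does not name a flavor). One deliberate difference: recent Python versions reject an inline (?i) that is not at the very start of the pattern, so the sketch passes re.IGNORECASE as a flag instead of writing ^(?i).

```python
import re

text = "AABBBBBBCCCCAAAAAADD"

# One run: a capital letter followed by one or more copies of itself.
run = re.compile(r"([A-Z])\1+")
runs = [m.group(0) for m in run.finditer(text)]

# Whole string must consist of such runs only.
whole = re.compile(r"^(?:([A-Z])\1+)+$")
# Case-insensitive variant, using a flag rather than an inline (?i).
whole_ci = re.compile(r"^(?:([A-Z])\1+)+$", re.IGNORECASE)

print(runs)
print(bool(whole.match(text)))    # True: every letter belongs to a run
print(bool(whole.match("AAB")))   # False: the trailing B is not repeated
print(bool(whole_ci.match("aaBB")))  # True

# For contrast, a plain character class does not require identical letters.
print(bool(re.fullmatch(r"[A-E]{2,}", "AB")))  # True
```

With the flag set, the backreference is also compared case-insensitively, so a mixed-case run such as aA counts as a repetition.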

Regular Expression needed for a Specific ID

I need to create a regular expression that matches an ID that has a specific format. The ID always begins with "OR" followed by 4 digits, then a dash, then another number that can be of any length. Examples of valid matches are:
OR1581-2
OR0057-101
OR0000-5312
OR3450-17371
Thanks!
Try ^OR\d{4}-\d+$.
The ^ matches the beginning of the string or line.
OR is not a special sequence and will match only those two characters in order.
\d matches any digit, and {4} requires the preceding token (the digit) to occur exactly four times.
- is not a special character here and will match only the hyphen.
\d matches any digit again, and the + requires the preceding token (the digit) to occur one or more times.
$ matches the end of the string or line.
If you need to find such an ID inside a string that also contains other text, use
\bOR\d{4}-\d+\b
However, if you need to verify that the input is in this format, with no other text around it, use
^OR\d{4}-\d+$
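The difference between the two patterns can be seen in a minimal Python sketch (the language is an assumption, since the question does not name one; the sentence being searched and the variable names are made up for illustration).

```python
import re

# Validation: the whole input must be an ID.
validate = re.compile(r"^OR\d{4}-\d+$")
# Search: find IDs embedded in other text, bounded by word boundaries.
find = re.compile(r"\bOR\d{4}-\d+\b")

ids = ["OR1581-2", "OR0057-101", "OR0000-5312", "OR3450-17371"]
all_valid = all(validate.match(i) for i in ids)

found = find.findall("Orders OR1581-2 and OR0057-101 shipped.")

print(all_valid)                    # True
print(validate.match("OR158-2"))    # None: only three digits before the dash
print(found)                        # ['OR1581-2', 'OR0057-101']
print(find.findall("XOR1581-2"))    # []: no word boundary before "OR"
```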

match the same unknown character multiple times

I have a regex problem I can't seem to solve. I actually don't know if regex can do this, but I need to match a range of characters n times at the end of a pattern.
eg. blahblah[A-Z]{n}
The problem is that whatever character matches the ending range needs to be the same throughout.
For example, I want to match
blahblahAAAAA
blahblahEEEEE
blahblahQQQQQ
but not
blahblahADFES
blahblahZYYYY
Is there some regex pattern that can do this?
You can use this pattern: blahblah([A-Z])\1+
The \1 is a back-reference to the first capture group, in this case ([A-Z]). The + matches that same character one or more additional times. To limit it, replace the + with a specific number of repetitions using {n}, such as \1{3}, which matches three more copies (four identical characters in total).
If you need the entire string to match then be sure to prefix with ^ and end with $, respectively, so that the pattern becomes ^blahblah([A-Z])\1+$
You can read more about back-references here.
In most regex implementations, you can accomplish this by referencing a capture group in your regex. For your example, you can use the following to match the same uppercase character five times:
blahblah([A-Z])\1{4}
Note that to match the same character n times in total, you need \1{n-1}, since the first occurrence is consumed by the capture group.
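The n-1 rule can be wrapped in a small helper. This is only a sketch in Python (an assumed language), and same_char_suffix is a hypothetical name introduced for this example; it anchors the pattern with ^ and $ so the whole string has to match.

```python
import re


def same_char_suffix(prefix: str, n: int) -> re.Pattern:
    """Hypothetical helper: prefix followed by one uppercase letter, n times in total."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The capture group consumes the first occurrence, \1{n-1} the remaining ones.
    return re.compile("^" + re.escape(prefix) + r"([A-Z])\1{" + str(n - 1) + "}$")


pattern = same_char_suffix("blahblah", 5)

for candidate in ["blahblahAAAAA", "blahblahEEEEE", "blahblahADFES", "blahblahZYYYY"]:
    print(candidate, bool(pattern.match(candidate)))
```

The first two candidates match; the last two do not, because the four characters after the captured letter are not all copies of it.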
blahblah(.)\1*\b should work in nearly all language flavors. (.) captures one of anything, then \1* matches that (the first match) any number of times.
blahblah([A-Z]|[a-z])\1+
This should help.