Regex - Alternate between letters and numbers - regex

I am wondering how to build a regex that would match strings of any length such as "D1B2C4Q3" but not "DDA1Q3" or "D$1A2B".
That is, a number must always follow a letter and vice versa. I've been working on this for a while and my current expression ^([A-Z0-9])(?!)+$ clearly does not work.

^([A-Z][0-9])+$
By combining the letters and digits into a single character class, your expression matches either in any order. You need to separate the classes sequentially within a group.

I might actually use a simple regex pattern with a negative lookahead to prevent duplicate letters/numbers from occurring:
^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$
The reason I chose this approach, rather than a non lookaround one, is that we don't know a priori whether the input would start or end with a number or letter. There are actually four possible combinations of start/end, and this could make for a messy pattern.
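As a quick check of the lookahead pattern above, here is a minimal Python sketch. The helper name is_alternating and the extra sample "1A2" are my own additions for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: reject any run of two letters or two digits,
# then require the whole string to consist of uppercase letters and digits.
ALTERNATING = re.compile(r'^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$')

def is_alternating(text):
    """Return True if the pattern accepts text (hypothetical helper)."""
    return ALTERNATING.search(text) is not None

for sample in ("D1B2C4Q3", "DDA1Q3", "D$1A2B", "1A2"):
    print(sample, is_alternating(sample))
```

Because the lookahead only forbids doubled letters or digits, the string may start and end with either kind of character.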

I'm guessing maybe,
^(?!.*\d{2}|.*[A-Z]{2})[A-Z0-9]+$
might work OK, or maybe not.
A better approach would be:
^(?:[A-Z]\d|\d[A-Z])+$
Or
^(?:[A-Z]\d|\d[A-Z])*$
Or
^(?:[A-Z]\d|\d[A-Z]){1,}$
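depending on whether you'd like an empty string to be valid or not. Note that these pair-based patterns only accept strings of even length, and because the two alternatives can follow each other they also accept inputs such as A11A.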
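It may be worth testing the pair-based pattern against a few edge cases before relying on it. A small sketch in Python follows; the function name matches_pairs and the inputs "A11A" and "A1B" are chosen by me for illustration.

```python
import re

# The "+" variant from above: one or more letter-digit or digit-letter pairs.
PAIRS = re.compile(r'^(?:[A-Z]\d|\d[A-Z])+$')

def matches_pairs(text):
    """Return True if the pair-based pattern accepts text (hypothetical helper)."""
    return PAIRS.search(text) is not None

print(matches_pairs("D1B2C4Q3"))  # alternating, even length
print(matches_pairs("A11A"))      # "A1" followed by "1A": two digits in a row
print(matches_pairs("A1B"))       # alternating, but odd length
```

In this sketch "A11A" is accepted and "A1B" is rejected, so the pattern is only suitable if inputs always come in pairs that start the same way.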

Another idea that will match A, A1, 1A, A1A, ...
^\b\d?(?:[A-Z]\d)*[A-Z]?$
See this demo at regex101
\b the word boundary at ^ start requires at least one char (remove, if empty string valid)
\d? followed by an optional digit
(?:[A-Z]\d)* followed by any amount of (?: upper alpha followed by digit )
[A-Z]?$ ending in an optional upper alpha
If you want to accept lowercase letters as well, use the i flag.
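Here is a small Python sketch of this pattern, including the case-insensitive variant. The names STRICT and RELAXED are mine, and re.IGNORECASE stands in for the i flag.

```python
import re

# Optional leading digit, any number of letter-digit pairs, optional trailing letter.
STRICT = re.compile(r'^\b\d?(?:[A-Z]\d)*[A-Z]?$')
# Same pattern with the i flag, so lowercase letters are accepted too.
RELAXED = re.compile(r'^\b\d?(?:[A-Z]\d)*[A-Z]?$', re.IGNORECASE)

for sample in ("A", "A1", "1A", "A1A", "D1B2C4Q3", "DDA1Q3", "D$1A2B", ""):
    print(repr(sample), STRICT.search(sample) is not None)

print(RELAXED.search("a1b") is not None)
```

The empty string is rejected here because \b cannot match when there is no word character at all.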

Related

Why put a lookahead at the beginning, or a lookbehind at the end? [duplicate]

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.
When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:
^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$
The character class [A-Za-z0-9]{6,} could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.
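To see the two leading lookaheads at work, here is a minimal Python sketch. The function name is_valid_password and the sample passwords are made up for illustration.

```python
import re

# Both lookaheads run at the start of the string, so each can scan all of it
# before the character class consumes anything.
PASSWORD = re.compile(r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$')

def is_valid_password(candidate):
    """Return True if candidate passes the example rules (hypothetical helper)."""
    return PASSWORD.search(candidate) is not None

for candidate in ("abc123", "abcdef", "123456", "ab1"):
    print(candidate, is_valid_password(candidate))
```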
For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.
Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:
[Tt]here\b(?<!"[Tt]here)
So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
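The two placements of the lookbehind should find the same matches; only the order of the work differs. A quick Python sketch to confirm that on a made-up sentence (it does not measure speed, which will depend on the engine and the input):

```python
import re

# Lookbehind first: the assertion is tried at every position.
LOOKBEHIND_FIRST = re.compile(r'(?<!")[Tt]here\b')
# Lookbehind last: the literal word is found first, then rechecked.
LOOKBEHIND_LAST = re.compile(r'[Tt]here\b(?<!"[Tt]here)')

text = 'There he said "there" and there.'
first_hits = LOOKBEHIND_FIRST.findall(text)
last_hits = LOOKBEHIND_LAST.findall(text)
print(first_hits, last_hits)
```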
It's easier to show in an example than explain, I think. Let's take this regex:
(?<=\d)(?=(.)\1)(?!p)\w(?<!q)
What this means is:
(?<=\d) - make sure what comes before the match position is a digit.
(?=(.)\1) - make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).
(?!p) - make sure what follows is not a p.
\w - match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.
(?<!q) - make sure what we matched so far doesn't end with a q.
All this will match strings like abc5ddx or 9xx but not 5d or 6qq or asd6pp or add. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.
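The example strings above can be checked with a short Python sketch. The dictionary name results is my own; the regex and the strings are the ones from the answer.

```python
import re

# Three assertions before \w and one after it, exactly as described above.
PATTERN = re.compile(r'(?<=\d)(?=(.)\1)(?!p)\w(?<!q)')

samples = ("abc5ddx", "9xx", "5d", "6qq", "asd6pp", "add")
# Map each sample to the matched text, or None when nothing matches.
results = {}
for sample in samples:
    match = PATTERN.search(sample)
    results[sample] = match.group() if match else None

print(results)
```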
Note also that in many implementations, lookbehinds are limited to fixed-length (or at least bounded-length) patterns, so repetition operators like * and + may not be allowed in them. Some engines, .NET for example, accept arbitrary patterns in a lookbehind. The restriction exists because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string.
A sample run of this regex on the string a3b5ddx is as follows:
Text cursor position: 0.
Try to match the first lookbehind at position -1 (since \d always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
Text cursor position: 1.
Try to match the first lookbehind at position 0. a does not match \d so fail and advance the cursor again.
Text cursor position: 2.
Try to match the first lookbehind at position 1. 3 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 2. b matches (.) and is captured. 5 does not match \1 (which is the captured b). Therefore, fail and advance the cursor.
Text cursor position: 3.
Try to match the first lookbehind at position 2. b does not match \d so fail and advance the cursor again.
Text cursor position: 4.
Try to match the first lookbehind at position 3. 5 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 4. d matches (.) and is captured. The second d does match \1 (which is the first captured d). Allow the matching to continue from where we left off.
Try to match the second lookahead. d at position 4 does not match p, and since this is a negative lookahead, that's what we want; allow the matching to continue.
Try to match \w at position 4. d matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
Text cursor position: 5.
Try to match the second lookbehind at position 4 (since q always matches 1 character). d does not match q which is what we want from a negative lookbehind.
Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is d.
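The outcome of the walkthrough can be confirmed in Python by asking for the match span on the same input. This sketch only shows the final result, not the engine's intermediate steps.

```python
import re

PATTERN = re.compile(r'(?<=\d)(?=(.)\1)(?!p)\w(?<!q)')

match = PATTERN.search("a3b5ddx")
# start() and end() are the cursor positions where the match begins and stops.
start, end, matched = match.start(), match.end(), match.group()
print(start, end, matched)
```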
1(?=ABC) means - look for 1, and check for (but don't consume) ABC after it.
(?<=ABC)1 means - check for (but don't consume) ABC before the current location, and continue to match 1.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.
When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think of it as the AND of regexes). For example, take a look at this recent answer by Daniel Brückner:
.&.(?<! & )
First, you match an ampersand between two characters. Next, you check that they were not both spaces (\S&\S would not work here, since the OP wanted to match 1&_).
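A short Python sketch of the difference between the trailing lookbehind and \S&\S. The test string is invented, and I'm assuming the _ in 1&_ stands for a space.

```python
import re

# Match any "x&y" triple, then reject it only if it was exactly " & ".
TRAILING_LOOKBEHIND = re.compile(r'.&.(?<! & )')
# The stricter alternative: neither neighbour may be whitespace.
NON_SPACE = re.compile(r'\S&\S')

text = "a&b x & y 1& z"
lookbehind_hits = TRAILING_LOOKBEHIND.findall(text)
non_space_hits = NON_SPACE.findall(text)
print(lookbehind_hits, non_space_hits)
```

Only the lookbehind version keeps the case where a single neighbour is a space.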

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
A character class can contain a range, like the [a-z] in your example, which matches any single letter between a and z. To match more than one character you can add a "+" to it, or, if you want to fix the number of characters matched, you can use {n} where n is the number of characters you want. So, [a-z]+ is one or more and [a-z]{4} matches exactly four consecutive characters between a and z.
You can use partial intervals. For example, [a-j] will match all characters from a to j. So, [a-j]{2} for string a6b7cd will match only cd. You can also use several intervals within the same character class, like this: [a-j4-6]{4}. This regex will match ab44 but not ab47.
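The interval examples above can be tried directly in Python; the variable names are mine.

```python
import re

# Two consecutive characters from a to j: only "cd" qualifies in this string.
pairs = re.findall(r'[a-j]{2}', "a6b7cd")

# Two intervals in one class: letters a-j and digits 4-6, four in a row.
MIXED = re.compile(r'[a-j4-6]{4}')
accepts_ab44 = MIXED.search("ab44") is not None
accepts_ab47 = MIXED.search("ab47") is not None

print(pairs, accepts_ab44, accepts_ab47)
```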
I overlooked a pretty small character. The term I was looking for was apparently "alternation".
[\r\t\n]|[a-z] with the missing element being the | character. This matches a character from either the first class or the second class.
At least that's my conclusion when testing this specific example.
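For what it's worth, the ranges can also sit directly inside one character class, with no nesting and no alternation. A minimal Python sketch using the example input from the question (the class below is my own combination of the two expressions, and the variable names are made up):

```python
import re

# Letters, whitespace, parentheses and a literal hyphen in a single class.
NON_DIGITS = re.compile(r'[a-zA-Z\n\r\t\s()-]')

text = "(123) 241()-127()()() abc (((((((("
# Removing every matched character should leave only the digits.
remaining = NON_DIGITS.sub("", text)
print(remaining)

# A range must run from a lower to a higher character, so a-Z is rejected.
try:
    re.compile(r'[a-Z]')
    range_rejected = False
except re.error:
    range_rejected = True
print(range_rejected)
```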

Here a word is a string of letters, preceded and followed by nonletters

I asked this question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line. Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but I get syntax errors when using them.
Does anyone know how I can solve this?
Thanks
If your regex flavor supports it, you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non-letter character (any character except a letter)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
In order to be unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
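One thing to watch for with the group-based variants: the delimiter is consumed by the match, so when scanning for all words, a word that shares its delimiter with the previous match can be skipped. A small Python sketch with a made-up input compares the two ASCII patterns. The \pL form is left out because Python's built-in re module does not support \p classes.

```python
import re

# Lookarounds check the neighbours without consuming them.
LOOKAROUND = re.compile(r'(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])')
# Groups consume the neighbouring non-letter as part of the match.
GROUPED = re.compile(r'(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])')

text = "foo bar baz"
lookaround_words = LOOKAROUND.findall(text)
grouped_words = GROUPED.findall(text)
print(lookaround_words, grouped_words)
```

Here "bar" is missed by the grouped pattern because the space before it was already consumed by the match for "foo".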
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, such as whitespace, punctuation, or the ends of the string. Your requirement suggests you also want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look at what precedes or follows the letters. This already captures all letters and only letters. The greedy quantifier (+) ensures that nothing is left out of a match.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
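A quick Python comparison of the two patterns on the "1example2" case shows the difference; the variable names are mine.

```python
import re

text = "1example2"
# \b sees no boundary between a digit and a letter, since both are word characters.
with_boundaries = re.findall(r'\b[A-Za-z]+\b', text)
# The plain class simply grabs every maximal run of letters.
letters_only = re.findall(r'[A-Za-z]+', text)
print(with_boundaries, letters_only)
```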
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at the ends of the string (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Note that this uses positive lookarounds with negated character classes, not negative lookarounds, as the latter would also match at the ends of the string (which are, to the regex engine as much as to me, non-letters).
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
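A short Python sketch of both edge-excluding variants on an invented input, where only the word with a non-letter on each side should be reported:

```python
import re

# Positive lookarounds: a real non-letter character must exist on both sides.
LOOKAROUND = re.compile(r'(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])')
# Capturing-group variant: the word is in group 1.
GROUPED = re.compile(r'[^A-Za-z]([A-Za-z]+)[^A-Za-z]')

text = "foo 1bar2 baz"
lookaround_words = LOOKAROUND.findall(text)
grouped_words = GROUPED.findall(text)
print(lookaround_words, grouped_words)
```

Both "foo" and "baz" touch an end of the string, so neither pattern reports them.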
When talking about regex, you need to be extremely specific and accurate in your requirements.

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
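To double-check the list above, here is a Python sketch that runs the pattern over the sample strings from the question plus the two extra ones. The names SAMPLES and accepted are mine.

```python
import re

# The lookahead caps the overall length; the rest enforces digits-dash-digits-dash-digit.
PATTERN = re.compile(r'(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b')

SAMPLES = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1", "88888888-88888888-1",
    "1-1-1", "88888888-88888888-22", "22-333-", "333-22",
    "22-7777777-1", "1-88888888-1",
]

accepted = [s for s in SAMPLES if PATTERN.search(s)]
print(accepted)
```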
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
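A minimal Python check of the explicit alternation, using the two strings mentioned above and a few of the samples from the question (the list names are mine):

```python
import re

# Each alternative pairs a first-group length with the longest second group
# that keeps the total at ten digits or fewer.
PATTERN = re.compile(
    r'\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}'
    r'|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b'
)

should_match = ["1-12345678-1", "123456-1-1", "22-22-1", "4444-55555-1", "1-1-1"]
should_fail = ["666666-7777777-1", "88888888-88888888-1", "22-22-22", "22-333-", "333-22"]

matched = [s for s in should_match + should_fail if PATTERN.search(s)]
print(matched)
```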
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so there is no need to add a {5,n} quantifier anywhere).
The patterns are also built to fail fast:
I chose to start them with a digit \d, so that every line start or word boundary not followed by a digit is discarded immediately. Also, by consuming exactly one digit first, I know the maximum length of the remaining string.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum allowed (if there are 12 characters after this position, the string has at least 13 characters). There is no need for anything more descriptive than the dot metacharacter here; the goal is to test the length quickly.
Finally, I describe the rest of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
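Here is a sketch of the multiline version in Python, run against the search text from the question. re.MULTILINE plays the role of the m modifier; the variable names are mine.

```python
import re

# One digit, then fail at once if 12 or more characters follow on the line.
PATTERN = re.compile(r'^\d(?!.{12})\d*-\d+-\d$', re.MULTILINE)

text = "\n".join([
    "----------", "22-22-1", "22-22-22", "333-333-1", "333-4444-1",
    "4444-4444-1", "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22", "22-333-",
    "333-22", "----------",
])

matches = PATTERN.findall(text)
print(matches)
```

The dot does not cross line breaks by default, so the length test stays within the current line.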
