regex match limited list of words of max length - regex

I'm trying to match a list of 1-5 words separated by whitespace.
Each word can include alphanumeric characters and "." and should have a max length of 32.
I am using the following pattern
^(\b[\w\.]{1,32}\b\s?){1,5}$
I see the following string matches even though its length is 35
111111111111111111...11111111111111
When I remove the quantifier as below then it does not match, as expected
^\b[\w\.]{1,32}\b\s?$
Why is 111111111111111111...11111111111111 matching and how do I fix the pattern so that it doesn't?

There are two things in your expression that I can see are tripping you up. The first is the word boundaries, which aren't behaving as you expect. The second is that you made the whitespace optional. That's why it matched 35 characters: the group can match the first part of the string up to a word boundary (a \b also sits between a digit and a "."), skip the optional whitespace, and then match the rest as a second "word".
I think this should match what you need: ^([\w\.]{1,32})((\s[\w\.]{1,32}){1,4})?$.
The first group matches the first word with 1-32 characters, as you'd expect. The second group is very similar, but notice the required whitespace at the beginning. This forces whitespace between consecutive words without requiring it at the end. The whole second group is optional and repeats up to four times, so the pattern matches 1-5 words in total. It adds a bit of repetition and there may be ways to make it shorter, but it does the trick!
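As a quick check, here is a minimal Python sketch comparing the pattern from the question with the one suggested above. The helper name is_word_list and the sample strings are illustrative assumptions, not part of the original post.

```python
import re

# Pattern from the question: optional whitespace plus word boundaries.
ORIGINAL = re.compile(r'^(\b[\w\.]{1,32}\b\s?){1,5}$')
# Suggested pattern: whitespace is required between consecutive words.
FIXED = re.compile(r'^([\w\.]{1,32})((\s[\w\.]{1,32}){1,4})?$')

# The 35-character string from the question: 18 ones, 3 dots, 14 ones.
LONG = '1' * 18 + '...' + '1' * 14


def is_word_list(pattern, text):
    """Return True if the whole text matches the given compiled pattern."""
    return pattern.fullmatch(text) is not None


print(is_word_list(ORIGINAL, LONG))         # True: split at a word boundary
print(is_word_list(FIXED, LONG))            # False
print(is_word_list(FIXED, 'a b c d e'))     # True: five words
print(is_word_list(FIXED, 'a b c d e f'))   # False: six words
```

Note that \s matches exactly one whitespace character here, so two consecutive spaces between words would be rejected.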
Hope that helps.

Related

how to manage duplicated characters in regex

I'm using this regex to find ALL of the following occurrences in an array:
/^.*(?=.*T)(?=.*O)(?=.*T)(?=.*A).*$/
it matches
pOTATO
mATTO
cATeTO
but also
lATO
minAreTO
AnTicO
although these last three words have just one T
how can I extract only words containing at least two Ts, one A and one O, in any order?
Lookarounds stand their ground: they do not consume any text, so once the first lookahead has been tried, all subsequent lookaheads are checked from exactly the same position. That is why the second (?=.*T) is satisfied by the same T as the first.
You need to use either of the following:
/^(?=.*T.*T)(?=.*O)(?=.*A).*/
/^(?=.*T[^T]*T)(?=.*O)(?=.*A).*/
Note the missing .* after ^: it is not necessary, as it is enough to fire the lookaheads once at the start of the string. Now, (?=.*T.*T) makes sure that, after zero or more chars other than line break chars (as many as possible), there is a T, then again zero or more such chars, and then another T. (?=.*T[^T]*T) makes sure there are zero or more chars other than line break chars as many as possible and then T, zero or more chars other than T and then another T.
See regex demo #1 and regex demo #2. Note that (?=.*T[^T]*T) can match more than (?=.*T.*T) since [^T] can match line break chars. To avoid that in the demo, I added \n into the negated character class.
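A minimal Python sketch of both variants, run against the six words from the question. The variable names are illustrative assumptions.

```python
import re

WORDS = ['pOTATO', 'mATTO', 'cATeTO', 'lATO', 'minAreTO', 'AnTicO']

# All lookaheads are checked from the start of the string. The first one
# requires two Ts; the other two require one O and one A, in any order.
TWO_T = re.compile(r'^(?=.*T.*T)(?=.*O)(?=.*A).*')
TWO_T_NEGATED = re.compile(r'^(?=.*T[^T]*T)(?=.*O)(?=.*A).*')

matches = [w for w in WORDS if TWO_T.match(w)]
matches_negated = [w for w in WORDS if TWO_T_NEGATED.match(w)]

print(matches)          # ['pOTATO', 'mATTO', 'cATeTO']
print(matches_negated)  # ['pOTATO', 'mATTO', 'cATeTO']
```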

Regular expression to match a word that contains ONLY one colon

I am new to regex. Basically, I'd like to check whether a word contains exactly one colon.
If it has two or more colons, it should return nothing.
If it has one colon, it should be returned as is (the colon must be in the middle of the string, not at the beginning or end).
(1)
a:bc:de #return nothing or error.
a:bc #return a:bc
a.b_c-12/:a.b_c-12/ #return a.b_c-12/:a.b_c-12/
(2)
My thinking is below, but it seems too complicated.
^[^:]*(\:[^:]*){1}$
^[-\w.\/]*:[-\w\/.]* #this will not throw error when there are 2 colons.
Any directions would be helpful, thank you!
This will find such "words" within a larger sentence:
(?<= |^)[^ :]+:[^ :]+(?= |$)
If you just want to test the whole input:
^[^ :]+:[^ :]+$
To restrict to only alphanumeric, underscore, dashes, dots, and slashes:
^[\w./-]+:[\w./-]+$
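A minimal Python sketch of the two whole-input patterns above, tried on the examples from the question plus two edge cases (a leading and a trailing colon) that I added as an assumption about what should be rejected.

```python
import re

# Whole-input test: exactly one colon, with at least one char on each side.
WHOLE = re.compile(r'^[^ :]+:[^ :]+$')
# Same idea, restricted to word chars, dots, slashes and dashes.
STRICT = re.compile(r'^[\w./-]+:[\w./-]+$')

samples = ['a:bc:de', 'a:bc', 'a.b_c-12/:a.b_c-12/', ':abc', 'abc:']

whole_ok = [s for s in samples if WHOLE.match(s)]
strict_ok = [s for s in samples if STRICT.match(s)]

print(whole_ok)   # ['a:bc', 'a.b_c-12/:a.b_c-12/']
print(strict_ok)  # ['a:bc', 'a.b_c-12/:a.b_c-12/']
```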
I saw this as a good opportunity to brush up on my regex skills - so might not be optimal but it is shorter than your last solution.
This is the regex pattern: /^[^:]*:[^:]*$/gm and these are the strings I am testing against: 'oneco:on' (match) and 'one:co:on', 'oneco:on:', ':oneco:on' (these should all not match)
To explain what is going on, the ^ matches the beginning of the string, the $ matches the end of the string.
The [^:] bit says that any character that is not a colon will be matched.
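In summary, ^[^:]* means that the string starts with any number (*) of characters that are not a colon, possibly none, and the : after it requires a single colon. Lastly, [^:]*$ means that any number of characters can follow the colon up to the end of the string, as long as they are not a colon. Because * also allows zero characters, a string such as ':on' or 'on:' would match as well; use + in place of * if the colon must not be first or last.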
To elaborate, it is because we specify the pattern to look for at the beginning and end of the string, surrounding the single colon we are looking for that only the first string 'oneco:on' is a match.
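To see how the choice of quantifier affects a colon at the start or end, here is a small Python sketch. The PLUS variant and the two extra test strings ':on' and 'on:' are my own assumptions, added to cover the "colon must be in the middle" requirement from the question.

```python
import re

STAR = re.compile(r'^[^:]*:[^:]*$')  # pattern from the answer above
PLUS = re.compile(r'^[^:]+:[^:]+$')  # assumed variant: + instead of *

samples = ['oneco:on', 'one:co:on', 'oneco:on:', ':oneco:on', ':on', 'on:']

star_ok = [s for s in samples if STAR.match(s)]
plus_ok = [s for s in samples if PLUS.match(s)]

print(star_ok)  # ['oneco:on', ':on', 'on:']
print(plus_ok)  # ['oneco:on']
```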

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the lookahead fails, so the engine steps back one letter at a time and re-checks the lookahead. It finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
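The backtracking behaviour and the fix can be reproduced with a minimal Python sketch on the string from the question. The variable names are illustrative assumptions.

```python
import re

text = 'lame:joker'

# Original pattern: the engine gives back the final 'e' so that the
# lookahead passes, leaving a partial match.
original = re.findall(r'[a-zA-Z]+(?!:)', text)

# The lookahead also rejects a following letter, so backtracking into
# the word can never produce a match.
fixed = re.findall(r'[A-Za-z]+(?![A-Za-z:])', text)

print(original)  # ['lam', 'joker']
print(fixed)     # ['joker']
```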
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_"; only "12:" is rejected.
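A short Python sketch of the two number patterns, applied to the four sample strings mentioned above. The variable names are illustrative assumptions.

```python
import re

samples = ['12,', '12:', '12c', '12_']

# With \b, the digits must end at a word boundary, so '12c' and '12_' fail.
with_boundary = [s for s in samples if re.search(r'\d+\b(?!:)', s)]

# Without \b, only a following digit or colon blocks the match.
without_boundary = [s for s in samples if re.search(r'\d+(?![\d:])', s)]

print(with_boundary)     # ['12,']
print(without_boundary)  # ['12,', '12c', '12_']
```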
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)

What would be the Regex expression to get the first letter after a group of character and some integers?

I have a string that has the following structure:
ABCD123456EFGHIJ78 but sometimes it's missing a number or a character like:
ABC123456EFGHIJ78 or
ABCD123456E or
ABCD12345EFGHIJ78
etc.
That's why I need regular expressions.
What I want to extract is the first letter of the third group, in this case 'E'.
I have the following regex:
(\D+)+(\d+)+(\D{1})\3
but I don't get the letter E.
This seems to work for the example cases you provided.
^(?:[A-Za-z]+)(?:\d+)(.)
It assumes that the first group is only letters and that the second group is only digits.
There's already a nice answer.
But for the record, your initial proposal was very close to working. You just needed to say that the character captured by the 3rd group may be repeated any number of times, including none, by adding a star to the backreference:
^(\D+)(\d+)(\D{1})\3*
The main weakness is that \D matches any char except digits, including spaces. Making it more robust means spelling out the accepted range of chars explicitly:
^([A-Za-z]+)(\d+)([A-Za-z]{1})\3*
It's much better, but my favourite uses \w at the end of the pattern to match any remaining word characters (letters, digits and underscore):
([A-Za-z]+)(\d+)([A-Za-z]{1})\w*
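For a quick check, here is a minimal Python sketch that applies two of the patterns above to the sample strings from the question. The helper name extract and its arguments are illustrative assumptions.

```python
import re

SAMPLES = [
    'ABCD123456EFGHIJ78',
    'ABC123456EFGHIJ78',
    'ABCD123456E',
    'ABCD12345EFGHIJ78',
]

# Non-capturing groups for letters and digits; group 1 is the wanted char.
SHORT = re.compile(r'^(?:[A-Za-z]+)(?:\d+)(.)')
# Three capturing groups; group 3 is the wanted letter.
EXPLICIT = re.compile(r'^([A-Za-z]+)(\d+)([A-Za-z]{1})\3*')


def extract(pattern, group, text):
    """Return the requested group, or None if the text does not match."""
    m = pattern.match(text)
    return m.group(group) if m else None


short_results = [extract(SHORT, 1, s) for s in SAMPLES]
explicit_results = [extract(EXPLICIT, 3, s) for s in SAMPLES]

print(short_results)     # ['E', 'E', 'E', 'E']
print(explicit_results)  # ['E', 'E', 'E', 'E']
```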

Condition for max character limit and on minimum character putting condition

I am trying to do the following match using regex.
The input should consist of 2 to 10 capital letters.
If it is exactly 2 characters long, then neither the first nor the second character may be A, E, I, O or U.
I tried:
[B-DF-HJ-NP-TV-XZ]{2,10}
It works well, but I am not too sure if this is the right and most efficient way to do regex here.
All credit to Jerry, for his answer:
^(?:(?![AEIOU])[A-Z]{2}|[A-Z]{3,10})$
Explanation:
^ = "start of string", and $ = "end of string". This is useful for preventing false matches (e.g. a 10-character match from an 11-character input, or "MR" matching in "AMRXYZ").
(?![AEIOU]) is a negative look-ahead for the characters A, E, I, O and U. A look-ahead only inspects the position where it sits, so as written it rejects a vowel in the first place only; to reject a vowel in the second place as well, wrap it together with the character class and repeat the pair: (?:(?![AEIOU])[A-Z]){2}. The look-ahead sits in the first branch of the alternation (|) only, so vowels are still allowed in longer matches.
The rest is fairly obvious, given the understanding of regex you've already demonstrated in your question above.
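A minimal Python sketch for checking the rule against a few inputs. It uses a form of the pattern in which the negative lookahead is applied to each of the two characters, (?:(?![AEIOU])[A-Z]){2}, because a lookahead only inspects the position where it is placed. The function name is_valid and the sample inputs are illustrative assumptions.

```python
import re

# Two-letter branch: each letter is checked by its own lookahead.
# Longer branch: 3 to 10 capital letters, vowels allowed.
PATTERN = re.compile(r'^(?:(?:(?![AEIOU])[A-Z]){2}|[A-Z]{3,10})$')


def is_valid(code):
    """Return True if code satisfies the 2-10 capital letter rule."""
    return PATTERN.fullmatch(code) is not None


samples = ['MR', 'BA', 'AB', 'ABC', 'ABCDEFGHIJ', 'ABCDEFGHIJK', 'M']
valid = [s for s in samples if is_valid(s)]

print(valid)  # ['MR', 'ABC', 'ABCDEFGHIJ']
```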