Regular Expression Positive Lookahead substring - regex

I am fairly new to regular expressions and the more and more I use them, the more I like them. I am working on a regular expression that must meet the following conditions:
Must start with an Alpha character
Out of the next three characters, at least one must be an Alpha character.
Anything after the first four characters is an automatic match.
I currently have the following regex: ^[a-zA-Z](?=.*[a-zA-Z]).{1}.*$
The issue I am running into is that my positive lookahead (?=.*[a-zA-Z]).{1} is not constrained to the next three characters following the alpha character.
I feel as if I am missing a concept here. What am I missing from this expression?
Thanks all.

The .* in your lookahead is doing that. You should limit the range here like
^[a-zA-Z](?=.{0,2}[a-zA-Z]).{1}.*$
Edit: If you want to make sure that there are at least 4 characters in the string, you could use another lookahead like this:
^[a-zA-Z](?=.{3})(?=.{0,2}[a-zA-Z]).{1}.*$
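A quick way to check the bounded lookahead is to run it against a few inputs. This is only a sketch using Python's re module; the name PATTERN and the sample strings are made up for illustration and are not from the question.

```python
import re

# Pattern from the answer above: the first character is a letter, at least
# three more characters follow, and at least one of those three is a letter.
PATTERN = re.compile(r"^[a-zA-Z](?=.{3})(?=.{0,2}[a-zA-Z]).{1}.*$")

# Hypothetical sample inputs, chosen to exercise each rule.
samples = ["a2a3", "a22a", "Aaaa", "a333", "a333b", "2aaa", "ab"]

# Map each sample to True (match) or False (no match).
results = {s: PATTERN.match(s) is not None for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

Here "a333b" is rejected because the only other letter sits outside the three-character window, and "ab" is rejected by the extra (?=.{3}) length check.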

What do you want lookahead for? Why not just use
^[a-zA-Z](..[a-zA-Z]|.[a-zA-Z].|[a-zA-Z]..)
and be happy?
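For comparison, the lookahead-free alternation can be tried the same way. This is a minimal sketch in Python; the helper name is_valid and the test inputs are assumptions for illustration.

```python
import re

# Alternation from the answer above: a leading letter, then three characters
# of which the third, the second, or the first is a letter.
ALTERNATION = re.compile(r"^[a-zA-Z](..[a-zA-Z]|.[a-zA-Z].|[a-zA-Z]..)")


def is_valid(text: str) -> bool:
    """Return True if text satisfies the three conditions from the question."""
    # match() anchors at the start only; anything after the fourth character
    # is accepted automatically, so no trailing .*$ is needed.
    return ALTERNATION.match(text) is not None


for sample in ["a2a3", "a22a", "a333", "ab1", "2aaa"]:
    print(sample, is_valid(sample))
```

Because each alternative consumes exactly three characters, this version also enforces the minimum length of four without a separate check.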

You'll probably have to do a workaround. Something like:
^[a-z](?=([a-z]..|.[a-z].|..[a-z])).{3}.*
First char [a-z]
Positive lookahead, either first, or second, or third char is a-z ([a-z]..|.[a-z].|..[a-z])
Other stuff

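Change the * in your lookahead to ? to get m/^[a-zA-Z](?=.?[a-zA-Z]).{1}.*$/
If I am understanding your criteria, that mostly fixes it. The reason is not greediness: .? simply limits how far the lookahead can reach. Note that .? only covers the next two characters, so an input like a22a is still rejected; use .{0,2} instead if the third character should count as well.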
These are correctly matched:
a2a3-match
2aaa-no match
Aaaa-match
a333-no match

Related

Regex - Alternate between letters and numbers

I am wondering how to build a regex that would match strings of any length like "D1B2C4Q3" but not "DDA1Q3" nor "D$1A2B".
That is, a number must always follow a letter and vice versa. I've been working on this for a while and my current expression ^([A-Z0-9])(?!)+$ clearly does not work.
^([A-Z][0-9])+$
By combining the letters and digits into a single character class, the expression matches either in any order. You need to separate the classes sequentially within a group.
I might actually use a simple regex pattern with a negative lookahead to prevent duplicate letters/numbers from occurring:
^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$
The reason I chose this approach, rather than a non lookaround one, is that we don't know a priori whether the input would start or end with a number or letter. There are actually four possible combinations of start/end, and this could make for a messy pattern.
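The negative-lookahead idea can be checked against the strings from the question. This is a sketch in Python; the names NO_REPEATS and alternates are hypothetical.

```python
import re

# Pattern from the answer above: only A-Z and 0-9 are allowed, and nowhere in
# the string may two letters or two digits stand next to each other.
NO_REPEATS = re.compile(r"^(?!.*(?:[A-Z]{2,}|[0-9]{2,}))[A-Z0-9]+$")


def alternates(text: str) -> bool:
    """Return True if letters and digits strictly alternate in text."""
    return NO_REPEATS.match(text) is not None


for sample in ["D1B2C4Q3", "DDA1Q3", "D$1A2B", "1A2", "A"]:
    print(sample, alternates(sample))
```

Since the lookahead only forbids adjacent characters of the same kind, the pattern does not care whether the string starts or ends with a letter or a digit.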
I'm guessing maybe,
^(?!.*\d{2}|.*[A-Z]{2})[A-Z0-9]+$
might work OK, or maybe not.
A better approach would be:
^(?:[A-Z]\d|\d[A-Z])+$
Or
^(?:[A-Z]\d|\d[A-Z])*$
Or
^(?:[A-Z]\d|\d[A-Z]){1,}$
which would depend if you'd like to have an empty string valid or not.
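One caveat that may be worth checking before relying on the pair-based patterns: each pair is matched independently, so two digits (or two letters) can end up adjacent where one pair ends and the next begins, and odd-length strings never match. A small Python sketch, with the name PAIRS and the sample strings chosen only for illustration:

```python
import re

# Pair-based pattern from the answer above (the + variant).
PAIRS = re.compile(r"^(?:[A-Z]\d|\d[A-Z])+$")


def pairs_match(text: str) -> bool:
    """Return True if text consists entirely of letter-digit or digit-letter pairs."""
    return PAIRS.match(text) is not None


print(pairs_match("D1B2C4Q3"))  # the string from the question
print(pairs_match("A11A"))      # "A1" + "1A": two digits end up adjacent
print(pairs_match("A1A"))       # odd length cannot be split into pairs
```

If strict alternation matters, the lookahead-based patterns are the safer choice.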
Another idea that will match A, A1, 1A, A1A, ...
^\b\d?(?:[A-Z]\d)*[A-Z]?$
See this demo at regex101
\b the word boundary at ^ start requires at least one char (remove, if empty string valid)
\d? followed by an optional digit
(?:[A-Z]\d)* followed by any amount of (?: upper alpha followed by digit )
[A-Z]?$ ending in an optional upper alpha
If you want to accept lower alphas as well, use i flag.
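The optional-digit / optional-letter pattern can be exercised like this. It is a sketch in Python, where re.IGNORECASE plays the role of the i flag; the function name and inputs are assumptions.

```python
import re

# Pattern from the answer above: optional leading digit, any number of
# letter-digit pairs, optional trailing letter. \b rejects the empty string.
ALTERNATING = re.compile(r"^\b\d?(?:[A-Z]\d)*[A-Z]?$")
ALTERNATING_I = re.compile(r"^\b\d?(?:[A-Z]\d)*[A-Z]?$", re.IGNORECASE)


def strictly_alternates(text: str, ignore_case: bool = False) -> bool:
    """Return True if letters and digits alternate, starting with either."""
    pattern = ALTERNATING_I if ignore_case else ALTERNATING
    return pattern.match(text) is not None


for sample in ["A", "A1", "1A", "A1A", "D1B2C4Q3", "AA", "11", "A11A", ""]:
    print(repr(sample), strictly_alternates(sample))
```

Unlike the pair-based patterns, this one accepts odd-length input and rejects "A11A".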

matching in between a long sentence with keywords

target sentence:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system;$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;$(SolDir)..\..\ABC\ccc\1234\components\fds\ab_cdef_1.0\host; $(SolDir)..\..\ABC\ccc\1234\somethingelse;
How should I construct my regex to extract the items that contain "..\..\ABC\ccc\1234\ccc_am_system"?
Basically, I want to extract all of these folders, and maybe more; they are all under \ABC\ccc\1234\ccc_am_system:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\abc;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\123\123\123\123;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;
my current regex doesn't work and I can't figure out why
\$.*ccc\\1234\.*;
Your problem is most likely that * is a greedy operator. It's greedily matching more than you intend it to. In many regex dialects, *? is the reluctant operator. I would first try using it like this:
\$.*?ccc\\1234.*?;
You can read up a bit more on greedy vs reluctant operators in this question.
If that doesn't work, you can try to be more specific about the characters you match than the dot (.). For example, you can match every non-semicolon character with an expression like this: [^;]*. You could use that idea this way:
\$[^;]*ccc\\1234[^;]*;
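To see the difference between the greedy pattern and the negated character class, both can be run over the target sentence from the question. This is a sketch in Python; the variable names are made up, and the last pattern, which narrows the match to ccc_am_system, is my own reading of what the question asks for rather than part of the answer.

```python
import re

# Target sentence from the question, written as raw strings so the
# backslashes are kept literally.
TARGET = (
    r"$(SolDir)..\..\ABC\ccc\1234\ccc_am_system;"
    r"$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;"
    r"$(SolDir)..\..\ABC\ccc\1234\components\fds\ab_cdef_1.0\host; "
    r"$(SolDir)..\..\ABC\ccc\1234\somethingelse;"
)

# Greedy .* runs to the last semicolon, so everything becomes one match.
greedy = re.findall(r"\$.*ccc\\1234.*;", TARGET)

# [^;]* cannot cross a semicolon, so each item is matched on its own.
bounded = re.findall(r"\$[^;]*ccc\\1234[^;]*;", TARGET)

# Assumption: only items under ccc_am_system are wanted.
under_system = re.findall(r"\$[^;]*ccc\\1234\\ccc_am_system[^;]*;", TARGET)

print(len(greedy), len(bounded), len(under_system))
for item in under_system:
    print(item)
```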
The below regex would store the captured strings inside group 1.
(\$.*?ccc\\1234\\.*?;)
You need to make the * quantifier do a shortest match by adding ? right after it. Also, the \.* in your pattern matches a literal dot zero or more times, which is not what you want.
I found this to be the best:
\$(.[^\$;])*ccc\\1234(.[^\$;])*;
It doesn't allow any overmatching. If I use ?, it still matches across more than one $ or ; for some reason, but with the above expression that never happens. Still, thanks to all those who took the time to answer my question.

Regular Expressions, getting digit after second occurrence of dot

I want to get the number after the second dot in a string like this:
4.5.3. Some kind of question ? but the input string might look like this as well: 41.53.32. Some kind of question ? so I'm aiming for 3 in the first example and 32 in the second example.
I'm trying to do it with
(?<=(\.\d\.))[0-9]+
and it works on the first example, but when I try (?<=(\.\d+\.))[0-9]+
it doesn't work at all.
If there is always a dot after the final number then you can use the following expression:
\d+(?=\.(?:[^\d]|$))
This will match one or more digits \d+ which are followed by a dot . then something that is either not a number [^\d] or the end-of-string $, i.e. (?=\.(?:[^\d]|$)).
If you use Perl or PHP, you can try this pattern:
(?:\d+\.){2}\K\d+
The simplest complete answer is probably something like this:
(?<=^(?:[^.]*\.){2})\d+
If you're at all worried about performance, this one will be slightly faster:
^(?:[^.]*\.){2}(\d+)
This one will capture the desired value in capturing group 1.
If you are using an engine that doesn't support variable-length lookbehind, you'll need to use the second version.
If you wish, you can replace [^.] with \d, to only match digits.
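Python's built-in re module supports neither \K nor variable-length lookbehind, so the capturing-group version is the one that carries over directly. A minimal sketch, with the helper name number_after_second_dot made up for illustration:

```python
import re
from typing import Optional

# Capturing-group version from the answer above: skip two dot-terminated
# segments from the start of the string, then capture the digits that follow.
SECOND_DOT = re.compile(r"^(?:[^.]*\.){2}(\d+)")


def number_after_second_dot(text: str) -> Optional[str]:
    """Return the digits right after the second dot, or None if absent."""
    match = SECOND_DOT.match(text)
    return match.group(1) if match else None


print(number_after_second_dot("4.5.3. Some kind of question ?"))
print(number_after_second_dot("41.53.32. Some kind of question ?"))
print(number_after_second_dot("no dots here"))
```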
(\d+\.\d+\.)\K\d+
Match digits, dot, digits, dot, digits; \K drops the first section from the reported match. (The dots need to be escaped, otherwise . matches any character.)
The following regex should work for your use case:
(?:(?:.*\.)?){2}(\d+)

regex negative look-ahead for exactly 3 capital letters around a char

I'm trying to write a regex that finds all the characters that have
exactly 3 capital letters on both their sides.
The following regex finds all the characters that have exactly 3 capital letters on the left side of the char, and 3 (or more) on the right:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})'
When trying to limit the right side to no more than 3 capitals using the regex:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])'
I get no results; adding (?![A-Z]) to the first regex seems to make it fail.
Can someone explain the problem and suggest a way to solve it?
Thanks.
You need to put the negative lookahead inside the positive one:
(?<![A-Z])[A-Z]{3}.(?=[A-Z]{3}(?![A-Z]))
You can do that with the lookbehind, too:
(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))
It doesn't violate the "fixed-length lookbehind" rule because lookarounds themselves don't consume any characters.
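The reason the original attempt finds nothing is that (?=[A-Z]{3}) and (?![A-Z]) are both tested at the same position, right after the captured character, so they contradict each other. Nesting the negative lookahead moves it to after the three capitals. A Python sketch follows; the word list and the helper find_all are assumptions, and each word is searched separately so that separators between words cannot match.

```python
import re

# Hypothetical test words, based on the examples quoted in this thread.
WORDS = ["ABCdDEF", "ABCfDEF", "HHHhhhHHHH", "AAAaAAAbAAA", "jjJJJJjJJJ", "JJJjJJJ"]

# Original attempt: both lookaheads apply at the same position.
SEPARATE = r"(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])"
# Fix: the negative lookahead is nested inside the positive one.
NESTED = r"(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}(?![A-Z]))"
# Same idea with the left-hand side moved into a lookbehind.
LOOKBEHIND = r"(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))"


def find_all(pattern: str, words: list) -> list:
    """Collect every match of pattern across the individual words."""
    return [hit for word in words for hit in re.findall(pattern, word)]


separate_hits = find_all(SEPARATE, WORDS)
nested_hits = find_all(NESTED, WORDS)
lookbehind_hits = find_all(LOOKBEHIND, WORDS)

print(separate_hits)
print(nested_hits)
print(lookbehind_hits)
```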
EDIT (about fixed-length lookbehind): Of all the flavors that support lookbehind, Python is the most inflexible. In most flavors (e.g. Perl, PHP, Ruby 1.9+) you could use:
(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).
...to match a character preceded by exactly three uppercase ASCII letters. The first alternative - ^[A-Z]{3} - starts looking three positions back, while the second - [^A-Z][A-Z]{3} - goes back exactly four positions. In Java, you can reduce that to:
(?<=(^|[^A-Z])[A-Z]{3}).
...because it does a little extra work at compile time to figure out that the maximum lookbehind length will be four positions. And in .NET and JGSoft, anything goes; if it's legal anywhere, it's legal in a lookbehind.
But in Python, a lookbehind subexpression has to match a single, fixed number of characters. If you've butted your head against that limitation a few times, you might not expect something like this to work:
(?<=(?<![A-Z])[A-Z]{3}).
At least I didn't. It's even more concise than the Java version; how can it work in Python? But it does work, in Python and in every other flavor that supports lookbehind.
And no, there are no similar restrictions on lookaheads, in any flavor.
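The fixed-width restriction in Python can be observed directly by trying to compile the patterns. This is a sketch; the helper name compiles is made up, and the behaviour shown is that of the standard re module.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Alternatives of different widths (3 and 4) inside a lookbehind.
alternation_ok = compiles(r"(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).")
# The Java-style variant has the same variable width.
java_style_ok = compiles(r"(?<=(^|[^A-Z])[A-Z]{3}).")
# Nested lookbehind: the inner assertion has zero width, so the total is 3.
nested_ok = compiles(r"(?<=(?<![A-Z])[A-Z]{3}).")

print(alternation_ok, java_style_ok, nested_ok)

# The nested form matches a character preceded by exactly three capitals.
NESTED = re.compile(r"(?<=(?<![A-Z])[A-Z]{3}).")
three_before = NESTED.findall("ABCd")
four_before = NESTED.findall("ABCDe")
print(three_before, four_before)
```

In "ABCDe" the match is D (preceded by exactly ABC), while e is skipped because four capitals precede it.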
Taking out the positive lookahead worked for me.
(?<![A-Z])[A-Z]{3}(.)([A-Z]{3})(?![A-Z])
'ABCdDEF' 'ABCfDEF' 'HHHhhhHHHH' 'jjJJjjJJJ' JJJjJJJ
matches
ABCdDEF
ABCfDEF
JJJjJJJ
I'm not sure how the regexp engines should work with multiple lookahead assertions, but the one you're using may have its own opinion on that.
You could as well use a single assertion as follows:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}[^A-Z])'
The same with lookbehind:
'(?<=[^A-Z][A-Z]{3})(.)(?=[A-Z]{3}[^A-Z])'
This will have a problem matching the pattern in the beginning and in the end of the line.
I can't think of a proper solution, but there can be a dirty trick: for instance, add a space (or something else) in the beginning and the end of the whole line, then perform the matching.
$ echo 'ABCdDEF ABCfDEF HHHhhhHHHH AAAaAAAbAAA jjJJJJjJJJ JJJjJJJ' | sed 's/.*/ & /' | grep -oP '(?<=[^A-Z][A-Z]{3})(\S)(?=[A-Z]{3}[^A-Z])'
d
f
a
b
j
Note that I changed (.) to (\S) in the middle, change it back if you want the space to match.
P.S. Are you solving The Python Challenge? :)
Since the look ahead pattern is the same as the look behind pattern, you could also use the continue anchor \G:
/(?:[A-Z]{3}|\G[A-Z]*)(.)[A-Z]{3}/
A match is returned if three capitals precede a single character or where the last match left off (optionally followed by other capitals).

Regex to check that a character in range doesn't repeat

I want to match against Strings such as AhKs & AdKs (i.e. two cards, where Ah = Ace of Hearts). I want to match two off-suit cards with a regex. What I currently have is "^([AKQJT2-9][hscd]){2}$", but this could match hands such as AhKh (suited) and AhAh. Is there a way to use backreferences to say the second [hscd] cannot be the same as the first (and similarly for [AKQJT2-9])?
Not perfectly elegant, but works:
^[AKQJT2-9]([hscd])[AKQJT2-9](?!\1)[hscd]$
Try this regular expression:
^[AKQJT2-9]([hscd])[AKQJT2-9](?!\1)[hscd]$
Here a negative look-ahead assertion (?!…) is used to disallow the fourth character to be the same as the second (match of first grouping).
But if the regular expression implementation does not support look-around assertions, you will probably need to expand it to this:
^[AKQJT2-9](h[AKQJT2-9][scd]|s[AKQJT2-9][hcd]|c[AKQJT2-9][hsd]|d[AKQJT2-9][hsc])$
a negative lookahead comes to the rescue
/^[AKQJT2-9]([hscd])[AKQJT2-9](?!\1)[hscd]$/
:( too late.
Yes. Use back-reference together with a negative look-ahead.
^([AKQJT2-9])([hscd])(?!\1)(?!.\2)[AKQJT2-9][hscd]$
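The two backreference variants in this thread differ in whether a repeated rank is allowed. A small Python sketch comparing them; the names OFFSUIT and SUIT_ONLY and the sample hands are chosen for illustration.

```python
import re

# Last answer: neither the rank (\1) nor the suit (\2) may repeat.
OFFSUIT = re.compile(r"^([AKQJT2-9])([hscd])(?!\1)(?!.\2)[AKQJT2-9][hscd]$")
# Earlier answers: only the suit (\1) is checked, so a pair like AhAs passes.
SUIT_ONLY = re.compile(r"^[AKQJT2-9]([hscd])[AKQJT2-9](?!\1)[hscd]$")

hands = ["AhKs", "AhKh", "AhAs", "AhAh", "KsQd"]

# For each hand, record (matches OFFSUIT, matches SUIT_ONLY).
results = {
    hand: (OFFSUIT.match(hand) is not None, SUIT_ONLY.match(hand) is not None)
    for hand in hands
}

for hand, (strict, suit_only) in results.items():
    print(hand, strict, suit_only)
```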