Multiple selections between characters using lookarounds with regex? - regex

I need to match any run of A's and Z's that sits between the strings AAA and ZZZ. For example, in the string AAZZAZAAAZAZAZZZAZAZ the match would be ZAZA.
My regex for that is (?<=[A]{3})[AZ]+(?=[Z]{3}), which works fine until I get a string that has two or more correct matches in it. AZAAA ZZAA ZZZAZAZAAA ZZAAZZAA ZZZAZAZ (spaces added for clarity) should match both ZZAA and ZZAAZZAA, but instead it passes right through the middle and returns a single string, ZZAAZZZAZAZAAAZZAAZZAA, which is not cool. How do I get the lookarounds to select multiple strings?

You have to make the quantifier lazy, i.e. make it match as few characters as possible. By default, the quantifier is greedy, i.e. it tries to get the longest match.
(?<=[A]{3})[AZ]+?(?=[Z]{3})
#               ^
For more information: http://www.regular-expressions.info/repeat.html
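To see the difference on the strings from the question, here is a small sketch in Python (an assumption on my part, since the question names no language; Python's re module supports the same fixed-width lookbehind). The variable names are only for illustration.

```python
import re

# Same patterns as above, written with A{3} instead of [A]{3} (equivalent).
GREEDY = re.compile(r"(?<=A{3})[AZ]+(?=Z{3})")
LAZY = re.compile(r"(?<=A{3})[AZ]+?(?=Z{3})")

# The question's second sample, with the clarifying spaces removed.
text = "AZAAA ZZAA ZZZAZAZAAA ZZAAZZAA ZZZAZAZ".replace(" ", "")

greedy_matches = GREEDY.findall(text)  # one long match through the middle
lazy_matches = LAZY.findall(text)      # stops at the first ZZZ each time

print(greedy_matches)  # ['ZZAAZZZAZAZAAAZZAAZZAA']
print(lazy_matches)    # ['ZZAA', 'ZZAAZZAA']
```

The lazy quantifier extends the match one character at a time and stops as soon as the lookahead for ZZZ succeeds, which is why each region is returned separately.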

Related

Regex to find if all the characters in a word are the same specific character

I have a set of words coming in one by one, like aa, ##, ???, ~~~, ?~, etc.
I need a regex to find out whether a word contains only ? or only ~.
Of the above input examples, ??? and ~~~ should match but not the others.
I tried ^[\s?]*$ and ^[\s~]*$ separately and they work; now I am trying to combine them.
^[\s?||~]*$ doesn't work, as it also recognizes ?~ as valid.
Any help?
You can use this regex, which looks for a string starting with a ~ or a ?, and then asserts that every other character in the string is the same as the first one using a backreference (\1):
^([~?])\1+$
You need to use a backreference to achieve your desired result.
If you want only ~ or ?, use
^([~?])\1+$
If you want any repeated character, use
^(.)\1+$
Explanation: (.) or ([~?]) captures the first character.
Then \1+ matches that same character again, one or more times (a backreference), so the word must be at least two characters long; use \1* to accept single-character words as well.
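As a quick check of the backreference pattern against the sample words from the question, here is a minimal Python sketch (Python is my assumption, and the helper name is hypothetical). It uses fullmatch in place of the ^ and $ anchors.

```python
import re

# ([~?]) captures the first character; \1+ requires that same character again.
SAME_CHAR = re.compile(r"([~?])\1+")

def is_all_tilde_or_all_question(word: str) -> bool:
    """True if word is two or more '~' or two or more '?' and nothing else."""
    return SAME_CHAR.fullmatch(word) is not None

words = ["aa", "##", "???", "~~~", "?~"]
accepted = [w for w in words if is_all_tilde_or_all_question(w)]
print(accepted)  # ['???', '~~~']
```

A character class such as [?~]* cannot do this on its own, because each position is checked independently; the backreference is what ties every later character to the first one.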
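You want to match lines that consist, from start to end, of any number of tildes or question marks. That would be ^\(~\|?\)*$. The parentheses that make a group and the vertical bar that does the 'or' need to be backslash-escaped (this is basic regex syntax as used by GNU grep and sed). Note that this pattern also accepts mixed words such as ?~.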

Positive and Negative Lookahead on matchings strings with two or more same consecutive characters [duplicate]

I can very easily write a regular expression to match a string that contains 2 consecutive repeated characters:
/(\w)\1/
How do I do the complement of that? I want to match strings that don't have 2 consecutive repeated characters. I've tried variations of the following without success:
/(\w)[^\1]/ ;doesn't work as hoped
/(?!(\w)\1)/ ;looks ahead, but some portion of the string will match
/(\w)(?!\1)/ ;again, some portion of the string will match
I don't want any language/platform specific way to take the negation of a regular expression. I want the straightforward way to do this.
The below regex would match the strings which don't have any repeated characters.
^(?!.*(\w)\1).*
(?!.*(\w)\1) is a negative lookahead which asserts that the string about to be matched does not contain any doubled (consecutive repeated) character. .*(\w)\1 matches a string that has a doubled character at the start, in the middle, or at the end. ^(?!.*(\w)\1) therefore matches at the start of every line except those that contain a doubled character, and the following .* matches all the characters on that line. Note that this also matches empty strings. If you don't want to match empty lines, change the final .* to .+
Note that ^(?!(\w)\1) checks for the repeated characters only at the start of a string or line.
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line. They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.
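A short Python sketch of the negative-lookahead pattern, in case it helps to see it run (Python and the function name are my own choices here, not something the question specifies):

```python
import re

# ^(?!.*(\w)\1).*  : refuse to start a match if a doubled word
# character appears anywhere later on the line.
NO_DOUBLE = re.compile(r"^(?!.*(\w)\1).*")

def has_no_doubled_char(s: str) -> bool:
    """True if no word character is immediately followed by itself."""
    return NO_DOUBLE.match(s) is not None

for sample in ["abcabc", "abbc", "aab", "baa", ""]:
    print(repr(sample), has_no_doubled_char(sample))
```

Only "abcabc" and the empty string pass; the other three each contain a doubled character at the middle, start, or end.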

Regular exp to match string from beginning until certain char is met

I have a long string from which I'm trying to capture a substring up to a certain character.
Let's suppose I have the following string, and I would like to get the text up to the first ampersand.
abc.8965.aghtj&hgjkiyu5.8jfhsdj
I would like to extract what is present before the ampersand so: abc.8965.aghtj
I thought this would work:
grep '^.*&{1}'
I would translate it as
^ start of string
.* match whatever chars
&{1} until the first ampersand is matched
Any advice?
I'm afraid this will take me weeks
{1} does not match the first occurrence; instead it means "match exactly one of the preceding pattern/character", which is identical to just matching the character (&{3} would match &&&).
In order to match the first occurrence of &, you need to use .*?:
grep '^.*?&'
Normally, .* is greedy, meaning it matches as much as possible. This means your pattern would match the last ampersand rather than the first one. .*? is the non-greedy version, matching as little as possible while fulfilling the pattern.
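Update: That syntax may not be supported by grep, since its basic and extended modes have no lazy quantifiers (with GNU grep you would need the Perl-compatible mode, -P). Here is another option: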
'^[^&]*&'
It matches anything that is not an ampersand, up to the first ampersand.
You also may have to enable extended regular expression in grep (-E).
Try this one:
^.*?(?=&)
It won't include the ampersand sign, just the text before it.
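The three approaches above can be compared side by side. This is only a sketch in Python rather than grep (my assumption, since Python's re supports both lazy quantifiers and lookahead); the second sample string with two ampersands is made up to show where greedy and lazy differ.

```python
import re

text = "abc.8965.aghtj&hgjkiyu5.8jfhsdj"  # string from the question
multi = "a&b&c"                           # hypothetical: two ampersands

# Greedy .* runs to the LAST ampersand; lazy .*? stops at the first.
greedy_hit = re.match(r".*&", multi).group()    # 'a&b&'
lazy_hit = re.match(r".*?&", multi).group()     # 'a&'

# The negated class never crosses an ampersand, so it is also "first".
negated_hit = re.match(r"[^&]*&", multi).group()  # 'a&'

# Dropping the ampersand from the result:
before = re.match(r"[^&]*", text).group()         # negated class alone
lookahead = re.match(r".*?(?=&)", text).group()   # lazy + lookahead

print(greedy_hit, lazy_hit, negated_hit)
print(before, lookahead)
```

The negated character class is the most portable of the three, since it needs neither lazy quantifiers nor lookaround.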

Regex to match [integer][colon][alphanum][colon][integer]

I am attempting to match a string formatted as [integer][colon][alphanum][colon][integer]. For example, 42100:ZBA01:20. I need to split these by colon...
I'd like to learn regex, so if you could, tell me what I'm doing wrong:
This is what I've been able to come up with...
^(\d):([A-Za-z0-9_]):(\d)+$
^(\d+)$
^[a-zA-Z0-9_](:)+$
^(:)(\d+)$
At first I tried matching parts of the string, then the entire string. As you can tell, I'm not very familiar with regular expressions.
EDIT: The regex is for input into a desktop application. I was not certain what 'language' or 'type' of regex to use, so I assumed .NET.
I need to be able to identify each of those grouped characters, split by colon. So Group #1 should be the first integer, Group #2 should be the alphanumeric group, Group #3 should be an integer (ranging 1-4).
Thank you in advance,
Darius
I assume the semicolons (;) are meant to be colons (:)? All right, a bit of the basics.
^ matches the beginning of the input. That is, the regular expression will only match if it finds a match at the start of the input.
Similarly, $ matches the end of the input.
^(\d+)$ will match a string consisting only of one or more digits. This is because the match needs to start at the beginning of the input and stop at the end of the input. In other words, the whole input needs to match (not just a part of it). The + denotes one or more matches.
With this knowledge, you'll notice that ^(\d):([A-Za-z0-9_]):(\d)+$ was actually very close to being right. This expression indicates that the whole input needs to match:
1. one digit;
2. a colon;
3. one word character (or an alphanumeric character as you call it);
4. a colon;
5. one or more digits.
The problem is clearly in 1 and 3. You need to add a + quantifier there to match one or more times instead of just once. Also, you want to place these quantifiers inside the capturing groups in order to get the multiple matches inside one capturing group as opposed to receiving multiple capturing groups containing single matches.
^(\d+):([A-Za-z0-9_]+):(\d+)$
You need to use quantifiers:
^(\d+):([A-Za-z0-9_]+):(\d+)$
    ^               ^     ^
+ is a quantifier that matches the preceding pattern one or more times.
Now you can access the values through the individual groups.
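Here is a minimal sketch of reading the three groups. The question assumed .NET; I am using Python here purely for illustration, and split_record is a hypothetical helper name. The group syntax in the pattern is the same in both.

```python
import re

# Quantifiers sit INSIDE the groups so each group captures the whole run.
RECORD = re.compile(r"(\d+):([A-Za-z0-9_]+):(\d+)")

def split_record(s: str):
    """Return (int, str, int) for 'digits:word:digits', or None if no match."""
    m = RECORD.fullmatch(s)
    if m is None:
        return None
    number, code, count = m.groups()
    return int(number), code, int(count)

print(split_record("42100:ZBA01:20"))

# The original attempt fails on the same input: each of the first two
# groups allows exactly one character.
original = re.fullmatch(r"(\d):([A-Za-z0-9_]):(\d)+", "42100:ZBA01:20")

# With the + outside the group, only the last repetition is kept.
last_only = re.fullmatch(r"(\d)+", "42100").group(1)
```

In this sketch original is None and last_only is "0", which is why the quantifier belongs inside the parentheses.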

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
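So if your regex was /f(.*)\1/ and the text was "fab", it would match just the "f", with the group capturing the empty string between the 'f' and the 'a'. (On "foo" the group can capture 'o' and \1 the second 'o', so the whole of "foo" matches.)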
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
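To make the empty match visible, here is a small sketch using Python's re module. That is an assumption on my part: the question uses POSIX regexec(), which is a different engine, but for this particular input both should report the same zero-length match at offset 0, because no doubled substring starts at the beginning of "blabl".

```python
import re

# (.*)\1 on "blabl": the group settles on the empty string, and \1
# then "repeats" that empty string, so the match has zero length.
m = re.search(r"(.*)\1", "blabl")
empty_span = m.span()        # (0, 0)
empty_group = m.group(1)     # ''

# (.+)\1 needs a non-empty repeat; "blabl" has none, "BLABLA" does.
no_match = re.search(r"(.+)\1", "blabl")            # None
immediate = re.search(r"(.+)\1", "BLABLA")          # group 1 is 'BLA'

# Allowing text between the capture and its repeat.
later = re.search(r"(.+).*\1", "BLAahemBLA")        # group 1 is 'BLA'

print(empty_span, repr(empty_group))
print(no_match, immediate.group(1), later.group(1))
```

The match is "successful" because an empty string followed by another empty string satisfies the pattern; it is empty because nothing longer works at the leftmost position.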