Regex recursion captured string - regex

I have a problem with a regex that has to capture a substring that it's already captured...
I have this regex:
(?<domain>\w+\.\w+)($|\/|\.)
And I want to capture every subdomain recursively. For example, in this string:
test1.test2.abc.def
This expression captures test1.test2 and abc.def but I need to capture:
test1.test2
test2.abc
abc.def
Do you know if there is any option to do this recursively?
Thanks!

Maybe the following:
(\.|^)(?=(\w+\.\w+))
Go with capturing group 2

You can use a positive look ahead to capture the next group.
/(\w+)\.(?=(\w+))/g
Demonstration.
Edit: JvdV's regex is more correct.
Note that \w+ is will fail to match domains like regex-tester.com and will match invalid regex_tester.com. [a-zA-Z0-9-]+ is closer to correct. See this answer for a complete regex.
It's simpler and more robust to do this by splitting on . and iterating through the pieces in pairs. For example, in Ruby...
"test1.test2.abc.def".split(".").each_cons(2) { |a|
puts a.join(".")
}
test1.test2
test2.abc
abc.def

You may use a well-known technique to extract overlapping matches, but you can't rely on \b boundaries as they can match between a non-word / word char and word / non-word char. You need unambiguous word boundaries for left and right hand contexts.
Use
(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))
See the regex demo. Details:
(?= - a positive lookahead that enables testing each location in the string and capture the part of string to the right of it
(?<!\w) - a left-hand side word boundary
(?<domain>\w+\.\w+) - Group "domain": 1+ word chars, . and 1+ word chars
(?!\w) - a right-hand side word boundary
) - end of the outer lookahead.
Another approach is to use dots as word delimiters. Then use
(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))
See this regex demo. Adjust as you see fit.

Related

How to regex - pattern with literal asterisk

I have 2 variants of strings:
some_prefix.needed part*some_suffix
some_prefix.needed part
I need only 'needed part' to be matched.
Left boundary is always dot.
Right boundary is asterisk (if exists) or end of line.
Already tried:
/.*[.](.*)[*].*/ - is working for first case
/.*[.](.*)/ - is working for second case
How to do the same with one regex?
You can use
/\.([^*]+)/
See the regex demo.
Details
\. - a dot
([^*]+) - Group 1: any one or more chars other than a *.
You can also make sure you get the rightmost match by using .* before the pattern (as in the original regex):
/.*\.([^*]+)/
If supported, you might also use a lookbehind to assert a . to the left.
(?<=\.)[^*]+
The pattern matches:
(?<=\.) Positive lookbehind, assert . directly to the left
[^*]+ Match 1+ times any char except * using a negated character class
Regex demo

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex such that the regex should not accept string that starts with "de","DE","dE" and "De" .I cannot use look behind or look ahead because my system does not support it?
There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, matching diZ for instance would match not match what is before the or and would therefore fall into the group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+
This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.
You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase chars a-z without e
) Close non capture grou
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo

RegEx: Excluding a pattern from the match

I know some basics of the RegEx but not a pro in it. And I am learning it. Currently, I am using the following very very simple regex to match any digit in the given sentence.
/d
Now, I want that, all the digits except some patterns like e074663 OR e123444 OR e7736 should be excluded from the match. So for the following input,
Edit 398e997979 the Expression 9798729889 & T900980980098ext to see e081815 matches. Roll over matches or e081815 the expression e081815 for details.e081815 PCRE & JavaScript flavors of RegEx are e081815 supported. Validate your expression with Tests mode e081815.
Only bold digits should be matched and not any e081815. I tried the following without the success.
(^[e\d])(\d)
Also, going forward, some more patterns needs to be added for exclusion. For e.g. cg636553 OR cg(any digits). Any help in this regards will be much appreciated. Thanks!
Try this:
(?<!\be)(?<!\d)\d+
Test it live on regex101.com.
Explanation:
(?<!\be) # make sure we're not right after a word boundary and "e"
(?<!\d) # make sure we're not right after a digit
\d+ # match one or more digits
If you want to match individual digits, you can achieve that using the \G anchor that matches at the position after a successful match:
(?:(?<!\be)(?<=\D)|\G)\d
Test it here
Another option is to use a capturing group with lookarounds
(?:\b(?!e|cg)|(?<=\d)\D)[A-Za-z]?(\d+)
(?: Non capture group
\b(?!e|cg) Word boundary, assert what is directly to the right is not e or cg
| Or
(?<=\d)\D Match any char except a digit, asserting what is directly on the left is a digit
) Close group
[A-Za-z]? Match an optional char a-zA-Z
(\d+) Capture 1 or more digits in group 1
Regex demo

Regex: don't match if the pattern start with /

My regex (PCRE):
\b([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
is a match (the actual match is surrounded by stars) :
**this.is.an.error**
**this.IsAnerror**
**this.is.an.error**.
**this.is.an.error**(
bla **this_is-an-error**
**this.is.an.error**:
this is an (**error**)
not a match:
this.is.an.error.but.dont.match
this.is.an.error-but.dont.match
this.is.an.error/but.dont.match
this.is.an.error/
/this.is.an.error
for this sample: /this.is.an.error
I can't manage to have a condition that will reject the whole match if it starts with the character /.
every combination I've tried resulted in some partial catch (which is not the desired).
Is there any simple or fancy way to do that?
You can try to add lookabehinds at the beginning instead of a word boundary:
(?<!\/)(?<=[^\w-.])([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
Explanation:
(?<!\/) - negative lookbehind assuring there is no / before the first character;
(?<=[^\w-.]) - word boundary implementation taking into account your extended definition of characters accepted for a word [\w-.];
Demo
Prepend your regex with \/.*|:
\/.*|\b([\w-.]*error)\b(?=[^-\/.]|(?:\.\W?)?$)
Now just like before the first capturing group holds the desired part.
See live demo here
Note: I made some modifications to your regex to remove unnecessary alternations.

Regex to match and replace a character in a pattern

I would like to replace a character "?" with "fi" in a string.
I could write a generic str replace for this. But I want to replace the "?" only if it appears in between two A-Za-z character and avoid the rest
Eg., "Okay?" should be "Okay?" and not "Okayfi"
but
Modi?es should be Modifies since it has ? in middle
What have I tried?
sentence = re.sub(r"(\?)\b", "fi", sentence)
Please see here.
https://regexr.com/3nvk3
Seems to work fine in regexr. but doesnt work well in code. Am I doing something wrong?
The best approach here is to find the original text with the fi ligature and read it in with proper encoding.
Otherwise, you will have to use some workarounds.
You may use (?<=[a-zA-Z]) / (?=[A-Za-z]) lookarounds:
sentence = re.sub(r"(?<=[a-zA-Z])\?(?=[a-zA-Z])", "fi", sentence)
See the regex demo. The (?<=[a-zA-Z]) positive lookbehind matches a position immediately after an ASCII letter, and (?!=[A-Za-z]) positive lookahead matches a position immediately before an ASCII letter.
Or, you may also use a capturing group with backreferences:
sentence = re.sub(r"([a-zA-Z])\?([a-zA-Z])", r"\1fi\2", sentence)
See another regex demo. Note that \1 references the value captured with the first ([a-zA-Z]) group and \2 references the value captured into Group 2 (([a-zA-Z])).