I have this regex pattern which I made myself (I'm a noob though, and made it through following tutorials):
^([a-z0-9\p{Greek}].*)\s(Ε[0-9\p{Greek}]+|Θ)\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+$)
And I'm trying to match the following sentences:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ 1 Θ (ΑΜΦ) - ΜΑΣΤΟΡΟΚΩΣΤΑΣ
ΕΙΣΑΓΩΓΗ ΣΤΗΝ ΠΛΗΡΟΦΟΡΙΚΗ Θ (ΑΜΦ) - ΒΟΛΟΓΙΑΝΝΙΔΗΣ
And so on.
This pattern splits the string into 4 parts.
For example, for the string:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
The first match is: ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ (Subject's Name)
Second match is: Ε2 (Class)
Third match is: Ε.Β.Δ. (Room)
And the forth match is: ΔΗΜΗΤΡΙΟΥ (Teacher)
Now in some entries E*/Θ is not defined, and I want to get the 3 matches without the E*/Θ. How should I modify my pattern so that (Ε[0-9\p{Greek}]+|Θ) is an optional match?
I tried ? so far, but because in my previous matches i'm defining \s and \s it requires 2 whitespaces to get 3 matches and i only have one in my string.
I think you need to do two things:
Make .* lazy (i.e. .*?)
Enclose (?:\s(Ε[0-9\p{Greek}]+|Θ))? with a non-capturing optional group.
The regex will look like
^([a-z0-9\p{Greek}].*?)(?:\s(Ε[0-9\p{Greek}]+|Θ))?\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+)$
^^ ^^ ^
See demo
If you do not make the first .* lazy, it will eat up the second group that is optional. Making it lazy will ensure that if there is some text that can be matched by the second capturing group, it will be "set".
Note you call capture groups matches, which is wrong. Matches are whole texts matched by the entire regular expression and captures are just substrings matched by parts of regexp enclosed in unescaped round brackets. See more on capture groups at regular-expressions.info.
You can use something like:
(E[0-9\p{Greek}]+|0)?
The whole group will be optional (?).
Related
I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ catch these ~match-this~, but when I tried the negation of it (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add asterisk wild card on any negation of tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wild card on both, it catches all match-this including those which enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or Does it need more??
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by *match-this~
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print (res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's a difference between telling the engine to "not match" a character and to tell the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (f.e. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as expected by me.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
I have strings which look like this:
/xxxxx/xxxxx-xxxx-xxxx-338200.html
With my regex:
(?<=-)(\d+)(?=\.html)
It matches just the numbers before .html.
Is it possible to write a regex that matches everything that surrounds the numbers (matches the .html part and the part before the numbers)?
In your current pattern you already use a capturing group. In that case you might also match what comes before and after instead of using the lookarounds
-(\d+)\.html
To get what comes before and after the digits, you could use 2 capturing groups:
^(.*-)\d+(\.html)$
Regex demo
In the replacement use the 2 groups.
This should do the job:
.*-\d+\.html
Explanation: .* will match anything until -\d+ say it should match a - followed by a sequence of digits before a \.html (where \. represents the character .).
To capture groups, just do (.*-)(\d+)(\.html). This will put everything before the number in a group, the number in another group and everything after the number in another group.
I have the following languages or language locale codes in a URL and i am trying to identify through REGEX. I was partially successful in identifying them but it is failing for some scenarios
Languages that i am testing with
en-us -- Passes
us -- Fails
Here is the REGEX that i have
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?
The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
Test here.
The last / fails the match in any case since your urls doesn't have it, in any way I would rewrite your regex as this: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?.
This way it always looks for the first part (en) and consider the second (-us) as optional.
Alternatively use (\w{2})(-\w{2})?\/c\/(deals-and-tips)?, if you don't mind risking to match underscores and similar simbols
The reason your pattern does not match us is because the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in the second part of the alternation.
Also it does not match the last group with deals-and-tips because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
Regex demo
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data you could match the leading \/ in front of the capturing group to get a more efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z
) Close group
\/c\/ Match /c/deals-and-tips`
(deals-and-tips)? Optional capture group 2 match deals-and-tips
Regex demo
Note that if you use another delimiter than / you don't have to escape the forward slash.
I wish to match the end of my text and for it I have to match all the characters and the line breaks.
But I must exclude the beginning of the next capture!
What I want is to delimit the end of the pattern where the next pattern begins.
I tried to replace
[^-]
by something like
(?!-{2}\\*{3})
It doesn't work !
So I want to capture the number and I want to capture the whole paragraph (some text) between (--*** x ***)
Using this regex seems to work:
--\*{3}([\d]*)\*{3}(((?!-).*\n)*)
1st Capturing group: The digit inside the stars.
2nd Capturing group: The text between the "headers"
3rd Capturing group: The last line of the paragraph.
A link with the regex tested:
https://regex101.com/r/xJ0gC6/1
I found exactly what I wanted! :)
--\*{3}([^!*]*)\*{3}((?:(?!-{2}\*{3})(?:\n|.))*)
I must group what I want and what I don't want.
For that I must use a 'non-capture group' and a 'negative lookahead':
(?!nowant)(?:want)
Then I must use a 'non-capture group' to agregate the matching:
(?:(?!nowant)(?:want))
After, I add the quantifier '*'
(?:(?!nowant)(?:want))*
And finally, I add a 'capture group':
((?:(?!nowant)(?:want))*)
So here is the regex:
((?:(?!-{2}\*{3})(?:\n|.))*)
You can see the complete Regex here :
https://regex101.com/r/xJ0gC6/2
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works in regexr.com but Python 3.5.1 can't find a match
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
Being relatively new to regular expressions, one should use the verbose mode. With this, your expression becomes much much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
Also, check if your (second and fourth) expression could be rewritten as [\w-]+ - it is not the same as yours and will match other substrings but try to avoid nested parenthesis in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.