Multiline PCRE, multiple conditions - regex

I'm just starting out with regex and have hit a stumbling block. I'm hoping someone can explain a workaround.
Trying to carry out a multi-line search. I wish to use "*" as the 'flag', so to speak: if a line contains an asterisk it should match. The digits at the start of the line should be output, so should the word "Match" in the linked example, excluding the asterisk itself.
I assume my use of "|" is dividing the regex into two conditions, when it actually needs to satisfy both to match.
https://regex101.com/r/Pu56bi/2
(?m)(^\d+)|(?<=\*).*$
Any help kindly appreciated.

You could use a positive lookahead, as in:
^(?=.*?\*)(\d+).+?(Match)$
See your modified example on regex101.com.
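To try the lookahead outside regex101, here is a minimal Python sketch. The sample lines are made up for illustration (the real input is on the linked regex101 page), and re.MULTILINE plays the role of the (?m) flag from the question.

```python
import re

# Hypothetical input; the original sample is only available on regex101.
text = """101 first entry * Match
102 second entry Match
103 * third entry Match"""

# The lookahead requires an asterisk somewhere on the line before
# the digits and the word are captured.
pattern = re.compile(r"^(?=.*?\*)(\d+).+?(Match)$", re.MULTILINE)

# Each result is a (digits, word) pair; line 102 has no asterisk and is skipped.
pairs = pattern.findall(text)
print(pairs)  # [('101', 'Match'), ('103', 'Match')]
```

Because the asterisk is only asserted and never captured, it does not appear in either group.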

If Match is always at the end of the string, you could match the digits at the start of the string, then match an * and Match at the end of the string.
Use a word boundary \b to prevent the digits from being part of a longer word.
^(\d+)\b.*\*.*\b(Match)$
Regex demo
If there can be text after the word Match, you can assert the presence of * using a positive lookahead.
^(?=.*\*)(\d+)\b.*\b(Match)\b.*$
Regex demo
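The difference between the two patterns can be checked with a short Python sketch. The input lines are hypothetical, and re.MULTILINE is assumed so that ^ and $ work per line.

```python
import re

# Pattern 1: Match must be the last word on the line.
at_end = re.compile(r"^(\d+)\b.*\*.*\b(Match)$", re.MULTILINE)
# Pattern 2: text may follow Match; the asterisk is only asserted.
anywhere = re.compile(r"^(?=.*\*)(\d+)\b.*\b(Match)\b.*$", re.MULTILINE)

# Hypothetical sample lines.
lines = """7 note * Match
8 note * Match with trailing text
9abc * Match
10 note Match"""

end_hits = at_end.findall(lines)
any_hits = anywhere.findall(lines)
print(end_hits)  # [('7', 'Match')]
print(any_hits)  # [('7', 'Match'), ('8', 'Match')]
```

The line starting with 9abc is rejected by both patterns because \b cannot match between a digit and a letter, and the last line is rejected because it has no asterisk.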

A regular expression to find a word and also exclude another word/string

I have a regular expression as follows:
te\b"[^Haste]"
I want to find all words ending with "te" in each segment but need to exclude the word "Haste" and possibly few other words as they are sometimes flooding the list of errors as false positives.
Any help would be gratefully appreciated :-)
I tried to look it up here and there with no success. Also, many tries on regex101 with no success.
Try this:
\b(?!(?:Haste|AAAte)\b)\w*te\b
\b word boundary.
(?!(?:Haste|AAAte)\b) asserts that the word starting here is not Haste or AAAte.
\w* zero or more word characters.
te the string te.
\b word boundary.
See regex demo
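A quick way to check the exclusion list is to run the pattern over a test sentence. This is a Python sketch with a made-up sentence; note that the match is case-sensitive, so a lowercase "haste" would still be reported unless it is added to the list.

```python
import re

# Words ending in "te", except the ones listed in the negative lookahead.
pattern = re.compile(r"\b(?!(?:Haste|AAAte)\b)\w*te\b")

# Hypothetical test sentence.
sample = "Haste makes waste; compute the rate"

words = pattern.findall(sample)
print(words)  # ['waste', 'compute', 'rate']
```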
One way is to match, but not capture, what you don't want and capture what you do want. Suppose we wanted to skip over "haste" and "paste". We could then use the following regular expression.
\b(?:haste|paste|(\w*te))\b
Suppose the string were as follows.
"In the surgeon's haste to amputate he removed the wrong leg."
The string pointer maintained by the regex engine would move from left to right one character at a time until it matched a word in the sentence ending in "te". The first would be "haste". That would be matched but not captured. We therefore pay no attention to that match.
Next, "amputate" is matched by
(\w*te)
As it is captured as well we find that "amputate" is a valid match.
Demo.
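The match-but-don't-capture idea can be sketched in Python as follows. The filtering step (keeping only matches where group 1 took part) is the "pay no attention" part described above; the variable names are just for illustration.

```python
import re

# "haste" and "paste" are matched without capturing; every other
# word ending in "te" lands in group 1.
pattern = re.compile(r"\b(?:haste|paste|(\w*te))\b")

sentence = "In the surgeon's haste to amputate he removed the wrong leg."

all_matches = [m.group(0) for m in pattern.finditer(sentence)]
# Keep only the matches where the capture group participated.
kept = [m.group(1) for m in pattern.finditer(sentence) if m.group(1) is not None]

print(all_matches)  # ['haste', 'amputate']
print(kept)         # ['amputate']
```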

Regex - returning a match without a period

I'm using the below regex string to match the word "kohls" which is located in a group of other words.
\W*((?i)kohls(?-i))\W*
It works great when the word is alone, but if the word is in a url, the match includes a period on both sides.
See the below examples:
Thank you for shopping at Kohls - returns a match for kohls.
https://www.kohls.com - returns a match for .kohls.
Edit. https://www.KohlsAndMichaels.com - doesn't return any match for kohls.
I want it to only extract the exact match for kohls without periods or any other symbols/text in front or behind it. Can you tell me what I'm doing wrong?
In cases like this you can use a site like regex101.com, which explains the regular expression and highlights the matches.
The problem with the dots is the \W*, which matches any run of non-word characters, including periods. To fix this, you can use the following regular expression:
\b((?i)kohls(?-i))\b
The \b (before and after the word you want to match) asserts the position at a word boundary.
If you still have questions, look at the explanation of the regular expression provided by that website. It is worth a look.
The \W metacharacter is used to find non-word characters. So adding a star operator will match 0 or more of these non-word characters (like periods). Did you mean to add a word boundary instead?
\b(?i)kohls(?-i)\b
Replace both \W* with [\W,\.\-]* etc.
Should be enough.
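The word-boundary suggestion can be checked against the three examples from the question. This is a Python sketch, and it makes one substitution: Python's re module handles mid-pattern flag switches such as (?i)...(?-i) differently from PCRE, so the sketch passes re.IGNORECASE instead of using inline flags.

```python
import re

# \b on both sides keeps the surrounding periods out of the match.
bounded = re.compile(r"\bkohls\b", re.IGNORECASE)
# Without boundaries the word is also found inside longer words.
unbounded = re.compile(r"kohls", re.IGNORECASE)

samples = [
    "Thank you for shopping at Kohls",
    "https://www.kohls.com",
    "https://www.KohlsAndMichaels.com",
]

bounded_hits = [bounded.findall(s) for s in samples]
unbounded_hits = [unbounded.findall(s) for s in samples]

print(bounded_hits)    # [['Kohls'], ['kohls'], []]
print(unbounded_hits)  # [['Kohls'], ['kohls'], ['Kohls']]
```

The third URL shows the trade-off: with \b, "KohlsAndMichaels" counts as a single word and produces no match, so the boundaries would have to be dropped if that case should match as well.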

check if there is a word repeated at least 2 or more times. (Regular Expression)

Using a regular expression,
I want to match any line of input that has at least one word repeated two or more times.
Here is how far I got.
/(\b\w+\b).*\1
But it is wrong, because the backreference also matches a single character inside a longer word rather than a whole word.
input: i might be ill
output: < i might be i>ll
<> marks the matched part.
So I tried (\b\w+\b)(\b\w+\b)*\1
but it does not work at all.
Can someone help?
Thanks.
this should work
(\b\w+\b).*\b\1\b
Greedy matching will ensure the longest match. If you want the second instance to be a separate word, you have to add the boundaries there as well. So it's the same as
\b(\w+)\b.*\b\1\b
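The effect of the extra boundaries shows up on the asker's own input. A minimal Python sketch, with the second test line made up for illustration:

```python
import re

# The question's pattern: the backreference may land inside a longer word.
loose = re.compile(r"(\b\w+\b).*\1")
# The fixed pattern: the repeated text must itself be a whole word.
strict = re.compile(r"\b(\w+)\b.*\b\1\b")

loose_hit = loose.search("i might be ill")
strict_hit = strict.search("i might be ill")
repeat_hit = strict.search("the cat saw the dog")

print(loose_hit.group(0))   # i might be i
print(strict_hit)           # None
print(repeat_hit.group(0))  # the cat saw the
```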
Positive lookahead is not a must here:
/\b([A-Za-z]+)\b[\s\S]*\b\1\b/g
EXPLANATION
\b([A-Za-z]+)\b # match any word
[\s\S]* # match any character (newline included) zero or more times
\b\1\b # word repeated
REGEX 101 DEMO
To check for repeated words you can use positive lookahead like this.
Regex: (\b[A-Za-z]+\b)(?=.*\b\1\b)
Explanation:
(\b[A-Za-z]+\b) will capture any word.
(?=.*\b\1\b) looks ahead to check whether the word captured by group 1 occurs again later in the line. If it does, a match is found.
Note: this will produce repeated results, because a word that occurs more than twice is matched again at every occurrence that still has a later repetition.
You will have to use programming to strip off the repeated results.
Regex101 Demo
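The duplicate results, and one way to strip them, can be sketched in Python. The input line is hypothetical, and dict.fromkeys is just one convenient order-preserving way to deduplicate.

```python
import re

# A word is reported whenever the same word occurs again later in the line.
pattern = re.compile(r"(\b[A-Za-z]+\b)(?=.*\b\1\b)")

line = "the cat and the dog and the bird"

raw = pattern.findall(line)
# "the" occurs three times, so it is reported at its first two occurrences.
unique = list(dict.fromkeys(raw))

print(raw)     # ['the', 'and', 'the']
print(unique)  # ['the', 'and']
```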

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing because a lookahead only checks the immediately following characters, and not across characters. Is there any way to match characters with a lookahead without the constraint that those characters have to be immediately behind the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.
Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
RegEx Demo
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookbehind and a lookahead to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is also a zero-width assertion like the anchors ^ and $.
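The lookaround version can be run against the valid and invalid words listed in the question. This Python sketch assumes re.IGNORECASE as the equivalent of the i flag; note that the match itself is only "berg", since the ge and te parts are asserted but not consumed.

```python
import re

pattern = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)

valid = ["Berg", "berg", "Gebergte", "gebergte"]
invalid = ["Geberg", "geberg", "Bergte", "bergte"]

# Keep the words in which the pattern finds a match.
matched = [w for w in valid + invalid if pattern.search(w)]
print(matched)  # ['Berg', 'berg', 'Gebergte', 'gebergte']

# The lookarounds are zero-width, so only "berg" is returned.
inner = pattern.search("Gebergte").group(0)
print(inner)  # berg
```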
To forbid a substring in general, put a negative lookahead at the beginning of the pattern and let it skip any number of other characters before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This will not match if the string contains 720, but will match everything else.
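A short Python sketch of the "must not contain 720" check, using made-up file names as input:

```python
import re

# The negative lookahead scans the whole string for "720" before
# anything is consumed; if it is found, the match fails at once.
no_720 = re.compile(r"^(?!.*720).*")

# Hypothetical inputs.
names = ["clip_1080p.mp4", "clip_720p.mp4", "720", "clip.mp4"]

accepted = [n for n in names if no_720.match(n)]
print(accepted)  # ['clip_1080p.mp4', 'clip.mp4']
```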

Using RegEx to match the beginning of string if end of string is not

I am trying to match lines in a configuration that start with the word "deny" but do not end with the word "log". This seems terribly elementary but I can not find my solution in any of the numerous forums I have looked. My beginners mindset led me to try "^deny.* (?!log$)" Why wouldn't this work? My understanding is that it would find any strings that begin with "deny" followed by any character for 0 or more digits where the end of line is something other than log.
When given a line like deny this log, your ^deny.*(?!log$) regex (I'm omitting the space that was in your sample question) is evaluated as follows:
^deny matches "deny".
.* means "match 0 or more of any character", so it can match " this log".
(?!log$) means "make sure that the next characters aren't 'log' then the end of the line." In this case, they're not - they're just the end of the line - so the regex matches.
Try this regex instead:
^deny.*$(?<!log)
"Match deny at the beginning of the string, then match to the end of the line, then use a zero-width negative look-behind assertion to check that whatever we just matched at the end of the line is not 'log'."
With all of that said...
Regexes aren't necessarily the best tool for the job. In this case, a simple Boolean operator like
if (/^deny/ and not /log$/)
is probably clearer than a more advanced regex like
if (/^deny.*$(?<!log)/)
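For comparison, here are both approaches in Python, as a sketch with made-up configuration lines. The snippets above look like Perl; str.startswith and str.endswith are used here as the plain Boolean equivalent.

```python
import re

# Match to the end of the line, then look back to reject a trailing "log".
lookbehind = re.compile(r"^deny.*$(?<!log)")

# Hypothetical configuration lines.
config = [
    "deny ip any any log",
    "deny ip any any",
    "permit ip any any",
    "deny tcp any host 10.0.0.1 eq 22",
]

by_regex = [line for line in config if lookbehind.search(line)]
by_boolean = [
    line for line in config
    if line.startswith("deny") and not line.endswith("log")
]

print(by_regex)
# ['deny ip any any', 'deny tcp any host 10.0.0.1 eq 22']
```

Both lists come out the same for this input, which supports the point that the two-condition test is the easier one to read.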
(?!log$) is a zero-width negative look-ahead assertion that means don't match if immediately ahead at this point in the string is log and the end of the string, but the .* in your regex has already greedily consumed all the characters right up to the end of the string so there is no way the log could then match.
If your regular expression implementation supports look-behinds, you could use a regex such as the one in Josh Kelley's answer. If you were using JavaScript, you could use
/^deny(?:.{0,2}|.*(?!log)...)$/m
The m flag means multiline mode, which makes ^ and $ match the start and end of every line rather than just the start and end of the string.
Note that three . are positioned after the negative look-ahead so that it has space to match log if it is there. Including these three dots meant it was also necessary to add the .{0,2} option so that strings with from zero to two characters after deny would also match. The (?:a|b) means a non-capturing group where a or b has to match.
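The lookbehind-free pattern also works unchanged in Python, which makes it easy to probe the edge cases described above. A sketch with hypothetical lines, using re.MULTILINE in place of the m flag:

```python
import re

# Either at most two characters follow "deny", or the last three
# characters of the line are anything other than "log".
no_lookbehind = re.compile(r"^deny(?:.{0,2}|.*(?!log)...)$", re.MULTILINE)

# Hypothetical multi-line configuration.
config = "\n".join([
    "deny",
    "deny x",
    "denylog",
    "deny this log",
    "deny this",
    "permit that",
])

kept = [m.group(0) for m in no_lookbehind.finditer(config)]
print(kept)  # ['deny', 'deny x', 'deny this']
```

The first two lines are accepted through the .{0,2} branch, and "deny this" through the branch with the three dots.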