How to exclude a word from regex subpattern? - regex

I am using Delphi 7 and TDIPerlRegEx. I am looking for verbs in parts of sentence which contain some specific part to identify the verb.
s1 := '(I|you|he|she|it|we|they|this|that|these|those)';
s2 := '(can|should|would|could|must|want to|have to|had to|might)';
RegEx_Seek_1.MatchPattern := '(*UCP)(?m) \b'+s1+'\b \b'+s2+'\b \K([^ß\W]\w{2,15})\b';
The word that is wrongly included in the result is "not"; it should be excluded:
Sample text:
... that you should not ßeat of every ...
A verb like this one should be included in the result:
Sample text:
lest he should put forth his hand ...
Now let me explain the ß sign. It marks that the original text had the word "not" immediately before the verb. I inserted it in a previous session, so the source text I am working with now looks like the samples above. The pattern ([^ß\W]\w{2,15}) should skip any word that is used in a negative sense, which is also why the "negative" verb must not be included.
So the point of the question is how to exclude the word "not" from the text captured by this pattern, which is either ([^ß\W]\w{2,15}) or (\w{3,15}).
I am using this pattern to replace substrings in text.
More sample text needed?
than I can bear. And
so I might have taken her
they might dwell together
they could not ßdwell together
lest you should say,
In group 3 I expect matches for:
bear, taken (or possibly have instead of taken), dwell and say.
I am trying to exclude the word "not", so any verb or word following "not" must be excluded from the 3rd group, or from the match completely. I am interested in group 3 only. Groups 1 and 2 just specify the alternatives preceding the verb.

You may use a branch reset group to match an empty string if there is "not" as a whole word after a modal verb, or a notional verb otherwise:
\b(I|you|he|she|it|we|they|this|that|these|those)\s+(can|should|would|could|must|want to|have to|had to|might)\s+\K(?|(?=not\b)()|([^ß\W]\w{2,15})\b)
See the regex demo
Details
\b - a word boundary
(I|you|he|she|it|we|they|this|that|these|those) - one of the pronouns in the group 1
\s+ - 1+ whitespaces (it is already acting as a word boundary on both sides of the adjacent groups)
(can|should|would|could|must|want to|have to|had to|might) - one of the modal verbs in Group 2
\s+ - 1+ whitespaces
\K - match reset operator
(?|(?=not\b)()|([^ß\W]\w{2,15})\b) - the branch reset group matching either
(?=not\b)() - if there is "not" as a whole word immediately to the right, capture an empty string into Group 3
| - or (here, else)
([^ß\W]\w{2,15})\b - match and capture into Group 3 any word char other than ß and then 2 to 15 word chars with a word boundary to follow.
Note that (?m) - PCRE_MULTILINE - is only necessary if you want ^ and $ outside of character classes to match at the start and end of each line rather than of the whole string. Since your pattern has no such anchors, (?m) is redundant.
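For readers without a PCRE engine at hand, here is a rough Python sketch of the same exclusion. Python's built-in re module supports neither \K nor branch reset groups, so this sketch swaps in a plain negative lookahead, (?!not\b), which drops the whole match when "not" follows the modal verb instead of capturing an empty string. The variable names are made up for illustration; the sample lines are the ones from the question.

```python
import re

pronouns = r"(I|you|he|she|it|we|they|this|that|these|those)"
modals = r"(can|should|would|could|must|want to|have to|had to|might)"

# (?!not\b) rejects the match when the next word is "not".
# [^ß\W] keeps the ß marker from being the first character of the verb.
pattern = re.compile(
    r"\b" + pronouns + r"\s+" + modals + r"\s+(?!not\b)([^ß\W]\w{2,15})\b"
)

samples = [
    "... that you should not ßeat of every ...",
    "lest he should put forth his hand ...",
    "than I can bear. And",
    "so I might have taken her",
    "they might dwell together",
    "they could not ßdwell together",
    "lest you should say,",
]

verbs = [m.group(3) for line in samples for m in pattern.finditer(line)]
print(verbs)  # ['put', 'bear', 'have', 'dwell', 'say']
```

Unlike the branch reset version, nothing is matched at all on the negated lines, which the question also accepts ("excluded from the 3rd group, or from the match completely").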

Related

Regex Help required for User-Agent Matching

I have used an online regex learning site (regexr) and created something that works, but with my very limited experience of writing regexes I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern. If you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so you don't have to escape it here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one digit). The text matched inside the parentheses is saved in the first capturing group.
(?:\w)+ A non-capturing group containing \w, which matches one character from [A-Za-z0-9_]. The plus sign + after the non-capturing group means match the whole group one or more times; since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here, and \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it. \d+ (one or more digits) first matches the 02, so the value so far is scan-02. Then comes (?:\w)+, which must match at least one word character. It tries to match the period ., but fails because the period is not a word character. Is it over at this point? No. The regex engine backtracks to the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ use the + quantifier, which requires at least one digit and at least one word character respectively, so the engine does literally what it is asked to do.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, then (?:\w)+ tries to match the period after scan-2 and fails. This is the important point: \d+ matched only one digit, so it has nothing to give back. The engine then retries the whole pattern starting from the character c in scan-2.shadowserver.org; the s in (scan\-\d+) fails to match c, the engine keeps trying from each later position, and in the end the match fails.
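The two examples above can be checked with a small Python sketch (Python is only an assumption here; the question targets IIS URL Rewrite, but the backtracking behaviour of this pattern is the same):

```python
import re

original = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

# Two digits: \d+ gives back the "2" so that (?:\w)+ has something to match.
two_digits = original.search("scan-02.shadowserver.org")
print(two_digits.group(0))  # scan-02.shadowserver.org
print(two_digits.group(1))  # scan-0

# One digit: \d+ has nothing to give back, so the whole match fails.
one_digit = original.search("scan-2.shadowserver.org")
print(one_digit)  # None
```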
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
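As a quick check, here is a Python sketch that runs the simple solution against the input text from the question. Python is used only for illustration, and the variable names are made up.

```python
import re

fixed = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

user_agents = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# fullmatch returns None when a string does not match, so this would
# raise if any of the sample user agents were rejected.
captured = [fixed.fullmatch(ua).group(1) for ua in user_agents]
print(captured)
# ['scan-2', 'scan-02', 'scan-2j', 'scan-02j', 'scan-17w', 'scan-101p']
```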

Pattern to match everything except a string of 5 digits

I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text')
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
This almost works, but it doesn't match the symbols. How can I change it to match, essentially, any word that includes a letter or symbol?
As you can see, the pattern I have is very close. I just need to include special characters as well as letters, whereas currently it only matches letters.
You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
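A minimal sketch of this approach in Python, assuming the ticket description is the multi-line text from the question with single spaces between the fields. Note that Python's re.sub writes the backreference as \1 rather than $1.

```python
import re

description = "\n".join([
    "CRITICAL - 192.111.6.4: rta nan, lost 100%",
    "Created Time Tue, 5 Jul 8:45",
    "Integration Name CheckMK Integration",
    "Node 192.111.6.4",
    "Metric Name POS1",
    "Metric Value DOWN",
    "Resource 54871",
    "Alert Tags 54871, POS1",
])

pattern = r"^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$"

# The pattern consumes the whole string, so replacing the match with
# group 1 leaves only the five digits.
result = re.sub(pattern, r"\1", description)
print(result)  # 54871
```

The number that ends up in the group is the one on the "Resource" line: the later 54871 is followed by a comma, not by whitespace or the end of the string, so the engine backtracks past it.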
The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However, the question suggests that you're not able to pass flags separately, in which case you could pass them as part of the regex itself.
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern, which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want to use the first 5-digit number instead, use a lazy quantifier in (.*\D)?, so that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where the inline flags are not available is JavaScript. In such scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. For other variants this might not work and you need to use [\s\S].
With all this out of the way, and assuming a language that can use "$2" as the replacement, where you do not need to escape backslashes, and a regex variant that supports the inline (?s) flag, the answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")
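To illustrate the difference between the greedy and the lazy variant, here is a small Python sketch. The input string is hypothetical (it is not the ticket text from the question), and Python writes the backreference as \2 instead of $2.

```python
import re

# Hypothetical input containing two different 5-digit numbers.
text = "first 11111\nsecond 22222\nend"

# Greedy (.*\D)? swallows as much as possible: the LAST number is kept.
last = re.sub(r"(?s)^(.*\D)?(\d{5})(\D.*)?$", r"\2", text)

# Lazy (.*?\D)? stops as early as possible: the FIRST number is kept.
first = re.sub(r"(?s)^(.*?\D)?(\d{5})(\D.*)?$", r"\2", text)

print(last)   # 22222
print(first)  # 11111
```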

Extract application name from user agent

I am using the following regex to extract application name from user agents:
^([^\s/\[]+)([\s/\[]|\z)
The character class that terminates the application name consists of whitespace, forward slash (/) and [.
The regex reads characters from the beginning of the string until it reaches whitespace, / or [.
link : https://regex101.com/r/7ndDEq/1
It fails on application names that contain whitespace: it extracts only the characters before the first whitespace.
eg:
Based on above regex on:
Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0
It extracts Pump
but the ground truth is Pump Log
Unless I'm misreading your requirements, your application name is anything up to but not including the first slash, which would just be
^([^/]+)
Or depending on your regex engine (which you should always specify when asking regex questions), you could do this with PCRE:
^(.+?)/
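A quick Python sketch comparing the question's character class with the two suggestions above, using the user agent from the question (Python is an assumption; the question does not name the engine):

```python
import re

ua = "Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0"

# The question's approach stops at the first whitespace.
original = re.match(r"^([^\s/\[]+)", ua).group(1)

# Both suggestions stop at the first slash instead.
up_to_slash = re.match(r"^([^/]+)", ua).group(1)
lazy = re.match(r"^(.+?)/", ua).group(1)

print(original)     # Pump
print(up_to_slash)  # Pump Log
print(lazy)         # Pump Log
```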
Try this:
^([^\s/[]+(?:\s[\w]+/)?)
It's almost there (the last slash should be removed in some matches).
The principle is simple: after capturing the required string, allow the regex to also catch the optional part (in our case, the second word after the first space) if it is present after the main match. The ? at the end makes this second part optional.
UPD: this one is more general
^([^\s/[]+(?: [^/\d]+)?)
But there are two interesting points here:
I had to put a literal space in the regex because \s did not work there; I don't know how it will behave in your code
It is required to have some rule what is possible after the whitespace, where we need to stop in the second optional part. If it's a slash or a bracket that will work fine but in strings like Apple iPhone10,4 iOS v13.3.1 Main/3.2.0 or POF 12.51.1859; (iPhone8,4; iOS 13.3.1; en_US; g=ON; p=ON; r=WWAN) 56BA8A93-3748-4C5E-9D00-D811FCC4EBCE; it's hard to find where to stop...
You might specify the allowed characters in a character class or use an alternation |
You can extend those to allow more characters or allowed strings.
^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)
^ Start of string
( Capture group 1
[^\s/\[]+ Match 1+ times any char except a whitespace char, / or [
(?: Non capture group, starting with a space (or use \s+ to match 1+ whitespace chars, which could also match a newline)
(?:& )?[A-Z][a-z]* Optionally match & and match an uppercase char A-Z followed by optional lowercase chars a-z
)* Close non capture group and optionally repeat
) Close group 1
(?:[\s/\[]|\Z) Match either a whitespace char, / or [, or assert the end of the string
Regex demo
Note that as you selected Python on regex101, you can use \Z to assert the position at the end of the string.
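Here is a short Python sketch of that pattern applied to the user agents mentioned in this thread. The "Tom & Jerry/1.0" string is a made-up example to exercise the optional "& " part.

```python
import re

pattern = re.compile(r"^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)")

user_agents = [
    "Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0",
    "Apple iPhone10,4 iOS v13.3.1 Main/3.2.0",
    "Tom & Jerry/1.0",  # hypothetical, not from the question
]

# Group 1 holds the first token plus any following capitalized words.
names = [pattern.match(ua).group(1) for ua in user_agents]
print(names)  # ['Pump Log', 'Apple', 'Tom & Jerry']
```

The second result shows the limitation: a following word is only accepted if it is an uppercase letter followed by lowercase letters, so "iPhone10,4" is not included. Whether that is the desired name depends on your rules for where the name ends.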

Regular Expression for checking subword between capture groups

I am facing a regex problem: I want to replace the hyphenations at the beginning of a composed word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word in a text, I want to replace words in which the main word (wonder) is preceded by partial or total repetitions of itself (wo-wo-wo, but also wonder-wonder-wonder).
At the same time, composed words like bi-linear or pre-trained MUST NOT be replaced, because in this case the hyphenated prefix (pre) is not part of the main word (train).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position ] and apparently it can be a good solution.
But my problem is quite different, because I don't want to impose constraints on the length of the hyphenated part, and at the same time I want to check that the hyphenated part is part of the main word.
This is the regex I am currently using but, as explained, it doesn't fully solve my problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)
Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a whitespace boundary (if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
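To check the behaviour on the examples from the question, the call can be wrapped in a small helper. This is only a sketch, and the function name collapse_stutter is made up for illustration.

```python
import re

def collapse_stutter(text):
    # Group 1 is the repeated fragment. The final -(\1) must match the
    # start of the last part, so "pre-trained" is left alone.
    return re.sub(r"(?<!\S)(\w+)(?:-\1)*-(\1)", r"\2", text)

print(collapse_stutter("wo-wo-wo-wonder"))       # wonder
print(collapse_stutter("hi-hi-hi-hi"))           # hi
print(collapse_stutter("wo-wo-wo"))              # wo
print(collapse_stutter("f-f-f-fight"))           # fight
print(collapse_stutter("wonder-wonder-wonder"))  # wonder
print(collapse_stutter("a bi-linear pre-trained model"))
# a bi-linear pre-trained model
```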

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know about how R will handle it.
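For comparison, here is a Python sketch (not R, so treat it only as a check of the regex logic) that runs both the pattern from the question and the lookahead version on the example string:

```python
import re

mystring = "F328AG560F33"

# Pattern from the question. finditer + group(0) is used because the
# capturing group inside the lookahead would change what findall returns.
question = r"\w\d{2,4}\w?(?!(\w\d{2,4}\w?))"
attempt = [m.group(0) for m in re.finditer(question, mystring)]

# Trailing letter is only consumed when it is not followed by a digit.
fixed = re.findall(r"\w\d{2,4}(?:\w(?!\d))?", mystring)

print(attempt)  # ['F328', 'G560F']
print(fixed)    # ['F328A', 'G560', 'F33']
```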
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
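A rough Python equivalent of the R demo, assuming the same pattern. re.findall returns the contents of Group 1 for every position where the lookahead succeeds.

```python
import re

pattern = r"(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))"

# The lookahead itself consumes nothing; only Group 1 is reported.
codes = re.findall(pattern, "F328AG560F33")
print(codes)  # ['F328A', 'G560', 'F33']

# (?i) makes the letter classes case insensitive.
lower_codes = re.findall(pattern, "f328ag560f33")
print(lower_codes)  # ['f328a', 'g560', 'f33']
```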