Basic Regular expression for matching all lines except given set of lines - regex

Can someone explain how to use a basic regular expression (no extensions such as lookahead, please) to match all content except a given set of lines?
For example if I want to match everything in content except first three lines, I can think of doing this in two steps:
(.*\n){3} matches first three lines
Match everything except lines matched in last step
I tried expression like:
[^(.*\n){3}].*\n
But this isn't working.
How do I do the second step?

This is basic to me:
^(?:.*\n){3}\K[\w\W]*
See proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?: group, but do not capture (3 times):
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
){3} end of grouping
--------------------------------------------------------------------------------
\K match reset operator (drops the text matched so far from the reported match)
--------------------------------------------------------------------------------
[\w\W]* any character of: word characters (a-z,
A-Z, 0-9, _), non-word characters (all but
a-z, A-Z, 0-9, _) (0 or more times (matching
the most amount possible))
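
As a rough sketch of the same idea in Python: the built-in re module does not support \K, so this version puts the "everything after the first three lines" part in a capturing group instead. The helper name skip_first_three_lines and the sample text are made up for illustration.

```python
import re

# Consume exactly three lines, then capture whatever follows.
# Group 1 plays the role of what \K would leave as the overall match.
PATTERN = re.compile(r"^(?:.*\n){3}([\w\W]*)")

def skip_first_three_lines(text):
    """Return the text after the first three lines, or None if there are fewer than three."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

sample = "line 1\nline 2\nline 3\nline 4\nline 5\n"
print(skip_first_three_lines(sample))
```

Because . does not match \n by default, each .*\n consumes exactly one line, so the group starts at line four.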

Related

Capture last occurrence from multiple occurrences in Regex pattern

How can I capture the desired text described below? I tried the regex ONE.*(ONE.), but it captures the whole string.
Notepad++:
1 ONE;TWO;THREE;ONE;FOUR;FIVE
2 TEST
3 TEST
4 TEST
5 TEST
Desired Capture: If ONE has 1 match then return ONE;TWO;THREE else if ONE has two matches then return ONE;FOUR;FIVE.
You can use
^.*\K\bONE\b.*
The pattern matches:
^ Start of string
.* Match any char 0+ times
\K\bONE\b Forget what is matched so far, and backtrack till the last occurrence of ONE to match it
.* Match the rest of the line
Regex demo
In Toad SQL, use
SELECT REGEXP_SUBSTR(Column, '.*(ONE.*)', 1, 1, NULL, 1)
EXPLANATION
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
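
The capture-group approach can be sketched in Python as well (plain re, not Oracle's REGEXP_SUBSTR; the function name last_one is hypothetical). The greedy .* first swallows the whole line and then backtracks until ONE can match, which is why the group lands on the last occurrence.

```python
import re

# Greedy .* backtracks from the end of the line, so group 1 starts at the LAST "ONE".
PATTERN = re.compile(r".*(ONE.*)")

def last_one(line):
    """Return the text from the last ONE to the end of the line, or None if ONE is absent."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

print(last_one("ONE;TWO;THREE;ONE;FOUR;FIVE"))
print(last_one("ONE;TWO;THREE"))
print(last_one("TEST"))
```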
In Notepad++, use
.*\KONE(?:(?!ONE).)*
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of grouping
You can also use (?:ONE.*)?(ONE.*) and retrieve your result from the first capturing group.
This regex will always try to match two ONEs in a line, but lets you access the part relevant to the second ONE. When there's only one that's the only part that matches.
You can try it here.
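
A small Python check of the optional-group variant, assuming the sample line from the question (the helper name one_segment is made up). The optional (?:ONE.*)? is tried first; when no second ONE exists, the engine backtracks out of it and the capturing group takes the only ONE.

```python
import re

# An optional leading "ONE..." part, then the captured "ONE..." part.
PATTERN = re.compile(r"(?:ONE.*)?(ONE.*)")

def one_segment(line):
    """Return group 1 of the optional-group pattern, or None when the line has no ONE."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

print(one_segment("ONE;TWO;THREE;ONE;FOUR;FIVE"))
print(one_segment("ONE;TWO;THREE"))
```

Note that with three or more occurrences the greedy .* inside the optional group still pushes the capture to the last ONE.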

regex if (text contain this text) match this

I have these two sentences
TAGGING ODP:-7.160792, 113.496069
TAGGING pel:-7.160792, 113.496069
I want to match the -7.160792 part only if the full sentence contains "odp".
I tried the following (?(?=odp)-\d+.\d+) but it doesn't work, and I don't know why.
Any help is appreciated.
(?(?=odp)-\d+\.\d+) won't work because (?=odp) is a positive lookahead that imposes a constraint on the pattern on the right, -\d+\.\d+. Namely, it requires odp string to occur exactly at the same location where - and a number are expected.
Use
(?<=ODP:)-\d+\.\d+
ODP:(-\d+\.\d+)
If lookbehinds are supported, the first variant is more viable.
Otherwise, another option with capturing groups is good to use.
And if odp can appear anywhere, even after the number:
(?i)^(?=.*odp).*(-\d+\.\d+)
This will capture the value into a group.
EXPLANATION
--------------------------------------------------------------------------------
(?i) set flags for this block (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
odp 'odp'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \1
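
To compare the three patterns from this answer, here is a minimal Python sketch run against the two sentences from the question. The helper first_group_or_match is hypothetical, and Python's re is only one of the engines these patterns might be used with.

```python
import re

sentences = [
    "TAGGING ODP:-7.160792, 113.496069",
    "TAGGING pel:-7.160792, 113.496069",
]

lookbehind = re.compile(r"(?<=ODP:)-\d+\.\d+")         # result is the whole match
capture = re.compile(r"ODP:(-\d+\.\d+)")               # result is group 1
anywhere = re.compile(r"(?i)^(?=.*odp).*(-\d+\.\d+)")  # result is group 1

def first_group_or_match(pattern, s):
    """Return group 1 if the pattern defines one, else the whole match; None if no match."""
    m = pattern.search(s)
    if m is None:
        return None
    return m.group(1) if pattern.groups else m.group(0)

for s in sentences:
    print([first_group_or_match(p, s) for p in (lookbehind, capture, anywhere)])
```

Only the third pattern still finds the number when odp comes after it, for example in "-7.160792, 113.496069 (odp)".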
You can use the regex, (?i)(?<=odp:)[^,]*.
Explanation:
(?i): Case-insensitive flag
(?<=odp:): Positive lookbehind for odp:
[^,]*: Anything but ,
👉 If you want the match to be restricted to numbers only, you can use the regex, (?i)(?<=odp:)(?:-\d+\.\d+)
Explanation:
(?i): Case-insensitive flag
(?<=odp:): Positive lookbehind for odp:
(?:: Start non capturing group
-: Literal, -
\d+: 1+ digit(s)
\.\d+: Literal . followed by 1+ digit(s)
): End non capturing group
👉 If the sign can be either + or -, you can use the regex, (?i)(?<=odp:)(?:[+-]\d+\.\d+)
The pattern (?(?=odp)\-\d+\.\d+) is using a conditional (? stating in the if clause:
If what is directly to the right from the current position is odp,
then match -\d+\.\d+
That cannot match, because the text at that position cannot be both odp and a - followed by digits.
What you also could do is match odp followed by any char other than a digit using \D* and capture the digit part in a group.
\bodp\b\D*(-\d+\.\d+)\b
The pattern matches:
\bodp\b match odp between word boundaries to prevent a partial match
\D* Optionally match any char other than a digit
(-\d+\.\d+) Capture - and 1+ digits with a decimal part in group 1
\b A word boundary
Regex demo
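
A quick Python sketch of this pattern, assuming the case-insensitive flag is wanted because the sample sentence spells ODP in upper case (the function name number_after_odp is made up):

```python
import re

# re.IGNORECASE lets "odp" in the pattern match "ODP" in the sample sentence.
PATTERN = re.compile(r"\bodp\b\D*(-\d+\.\d+)\b", re.IGNORECASE)

def number_after_odp(s):
    """Return the first negative decimal that follows the word odp, or None."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

print(number_after_odp("TAGGING ODP:-7.160792, 113.496069"))
print(number_after_odp("TAGGING pel:-7.160792, 113.496069"))
```

Because \D* allows any run of non-digits between the word and the number, an input such as "odp value: -7.16" also matches, while the word boundaries reject odp inside a longer word.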
(?<=ODP:)(-\d+\.\d+)
You can try using a positive lookbehind.
This should work for the input you've provided.

Regex with one specific number and a word

I would like to match a specific pattern with regex but I am running into catastrophic backtracking. I wonder if there's a way it would be possible to match what I would like and not get an error.
I start with a simple assumption; I want my string to contain only one specific number e.g. 7 and only that specific number:
^\D*7\D*$
Only if I find this pattern do I want to look for another word in the same text such as "Coffee"; I put my condition into a group (^\D*7\D*$) and reference the group in my conditional and the then part will contain "Coffee":
(?(1)Coffee|)
Is there another phrasing that would avoid the catastrophic backtracking?
You can use a positive lookahead to assert that the word Coffee occurs somewhere to the right.
^(?=.*\bCoffee\b)\D*7\D*$
The pattern matches:
^ Start of string
(?= Positive lookahead, assert that on the right is
.*\bCoffee\b Match Coffee between word boundaries \b to prevent a partial match
) Close lookahead
\D*7\D* Match number 7 between optional non digit characters.
$ End of string
Regex demo
Note that \D also matches a newline. If you don't want to cross newline boundaries, you can use [^\r\n\d] instead.
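
A minimal Python check of the lookahead pattern, together with the newline caveat just mentioned. The sample strings and the names crossing, same_line and matches are invented for illustration.

```python
import re

# Pattern from the answer: \D may cross line breaks.
crossing = re.compile(r"^(?=.*\bCoffee\b)\D*7\D*$")
# Variant with [^\r\n\d], which stays on a single line.
same_line = re.compile(r"^(?=.*\bCoffee\b)[^\r\n\d]*7[^\r\n\d]*$")

def matches(pattern, s):
    """True if the pattern matches anywhere in s."""
    return pattern.search(s) is not None

print(matches(crossing, "Coffee for table 7"))   # Coffee present, single 7
print(matches(crossing, "Tea for table 7"))      # no Coffee
print(matches(crossing, "Coffee for table 77"))  # a second digit
print(matches(crossing, "Coffee\n7"), matches(same_line, "Coffee\n7"))
```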
Left to right checking is more traditional:
^(?=.*Coffee)[^\d7]*7\D*$
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
Coffee 'Coffee'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^\d7]* any character except: digits (0-9), '7' (0
or more times (matching the most amount possible))
--------------------------------------------------------------------------------
7 '7'
--------------------------------------------------------------------------------
\D* non-digits (all but 0-9) (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the string
Right-to-left checking needs a variable-length lookbehind, which is only possible with engines like recent JavaScript, .NET, or the PyPI regex module in Python:
^[^\d7]*7\D*$(?<=Coffee.*)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
[^\d7]* any character except: digits (0-9), '7' (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
7 '7'
--------------------------------------------------------------------------------
\D* non-digits (all but 0-9) (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
Coffee 'Coffee'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of look-behind
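
As a hedged illustration of the engine restriction: Python's standard re module, as far as I know, only accepts fixed-width lookbehinds, so it should refuse to compile the lookbehind form while accepting the lookahead form. The helper compiles is hypothetical.

```python
import re

lookbehind_form = r"^[^\d7]*7\D*$(?<=Coffee.*)"
lookahead_form = r"^(?=.*Coffee)[^\d7]*7\D*$"

def compiles(pattern):
    """True if the standard re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for the variable-length lookbehind (?<=Coffee.*).
        return False

print(compiles(lookbehind_form))
print(compiles(lookahead_form))
```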

How to match strings not containing any word characters between a minus sign and numbers in PL/SQL regexp

I have some strings in Oracle where there is a minus sign (not at the beginning but inside the string), followed by a number (int or decimal with dot or comma).
I would like to find these in PLSQL. I have this already, and it's almost perfect:
REGEXP_LIKE(string, '-\d+(,|\.)*\d*')
I was hoping that it's finding strictly strings like somestring-11,1 but the problem is, it finds also strings like somestring-11a1,1 so where there is eventually a non numeric (or word) character between the minus and the numbers. I was trying to use negative lookahead, but unfortunately it's not working:
REGEXP_LIKE(string, '-\d+!(\w)(,|\.)*\d*')
because somestring-1s won't be found either anymore. Could you please point me in the right direction? Thank you.
Could you please try the following, written and tested against the samples you showed. In short: a lazy match runs up to the -, then the pattern matches one or more digits, a comma, and one or more digits.
.*?-\d+,\d+
Online regex demo for above regex
Use
(^|\D)-(\d+([,.]*\d+)?)($|\W)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \3 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[,.]* any character of: ',', '.' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of \3 (NOTE: because you are using a
quantifier on this capture, only the
LAST repetition of the captured pattern
will be stored in \3)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z,
0-9, _)
--------------------------------------------------------------------------------
) end of \4
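
A rough way to compare the question's pattern with this one, using Python's re rather than Oracle's REGEXP_LIKE (the two flavors may differ in details, so treat this only as a sketch; the names loose, bounded and number_after_minus are made up):

```python
import re

loose = re.compile(r"-\d+(,|\.)*\d*")                   # pattern from the question
bounded = re.compile(r"(^|\D)-(\d+([,.]*\d+)?)($|\W)")  # pattern from this answer

def number_after_minus(s):
    """Return the number captured in group 2 of the bounded pattern, or None."""
    m = bounded.search(s)
    return m.group(2) if m else None

for s in ("somestring-11,1", "somestring-11a1,1", "somestring-1s"):
    print(s, bool(loose.search(s)), number_after_minus(s))
```

The trailing ($|\W) is what rejects somestring-11a1,1: after the digits, the next character must be a non-word character or the end of the string.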

Fix for the catastrophic backtracking in the regex expression

I have encountered the following regex in the company's code base.
(?:.*)(\\bKEYWORD\\b(?:\\s+)+(?:.*)+;)(?:.*)
Breakdown of Regex:
1. Non-capturing group: (?:.*)
2. Capturing group: (\\bKEYWORD\\b(?:\\s+)+(?:.*)+;)
2.1 Non-capturing group: (?:\\s+)+
2.2 Non-capturing group: (?:.*)+
3. Non-capturing group: (?:.*)
The above regex goes into catastrophic backtracking when it fails to match ; or when the test sample becomes too long. See the two test samples below:
1. -- KEYWORD the procedure if data match between Type 1 and Type 2/3 views is not required.
2. KEYWORD SET MESSAGE_TEXT = '***FAILURE : '||V_G_ERR_MSG_01; /*STATEMENT TO THROW USER DEFINED ERROR CODE AND EXCEPTION TO THE CALLER*/
I went through the Runaway Regular Expressions article as well and tried to use atomic grouping, but still with no results. Can anybody help me fix this regex?
According to the link you provided, your expression contains patterns like (x+x+)+: (?:\\s+)+ and the subsequent (?:.*)+. The first matches one or more whitespace characters one or more times, and the second matches any number of characters one or more times. Nesting quantifiers like this adds nothing to what the pattern can match, but it multiplies the ways the engine can split the same text while backtracking.
Non-capturing groups are unnecessary here.
Use
.*\\b(KEYWORD\\s.*;).*
See proof
Explanation
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
KEYWORD 'KEYWORD'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
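
The simplified pattern can be tried against the two samples from the question with a short Python sketch. In a Python raw string the backslashes are single (\b, \s); the doubled \\b in the original comes from string escaping in the host language. The function name extract_statement is hypothetical.

```python
import re

# Simplified pattern: no nested quantifiers, one capturing group.
SIMPLIFIED = re.compile(r".*\b(KEYWORD\s.*;).*")

sample_comment = (
    "-- KEYWORD the procedure if data match between Type 1 and "
    "Type 2/3 views is not required."
)
sample_statement = (
    "KEYWORD SET MESSAGE_TEXT = '***FAILURE : '||V_G_ERR_MSG_01; "
    "/*STATEMENT TO THROW USER DEFINED ERROR CODE AND EXCEPTION TO THE CALLER*/"
)

def extract_statement(line):
    """Return the KEYWORD ... ; part of the line, or None when there is no ; after KEYWORD."""
    m = SIMPLIFIED.match(line)
    return m.group(1) if m else None

print(extract_statement(sample_comment))
print(extract_statement(sample_statement))
```

The first sample has no ; and fails quickly, since each .* can only be shortened one character at a time instead of being re-split by an outer + quantifier.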