Capture a specific line if it exists with regex - regex

If I have files that could either be:
numbers:
32
45
999
56
or
numbers:
23
45
56
The 999 is a constant, but the other numbers & number of lines change.
Is there a way of capturing:
above 999 followed by an empty line in the first case (i.e. excluding 999)
above the empty line in the second case, where 999 doesn't exist
So far I've tried:
(numbers:(?:\n.+)*)(\n999) — this works great in the first case; the first group captures everything above 999. It obv doesn't work where there's no 999, so...
(numbers:(?:\n.+)*)(\n999)? — I would have thought this would work for both cases. But in the first case, this captures the 999 in the first group, I guess because it's greedy and the ? makes the (\n999) optional, so the first group is free to capture it.
It's also possible I'm massively overcomplicating this and there's some easy solution.
Thanks a lot!

Here is a regex that does it without using lookahead:
^(numbers:(?:\n.+)*?)(?:\n(?:999)?$)
RegEx Details:
^: Start line
(: Start capture group #1
numbers:: Match numbers: text
(?:\n.+)*?: Match a line break followed by 1+ character in a line. Repeat this 0 or more times (non-greedy)
): End capture group #1
(?:\n(?:999)?$): Must be followed by a line break and 999 or empty line
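A minimal Python sketch of how this pattern behaves. It assumes the empty line mentioned in the question directly follows 999 in the first file and the last number in the second; the sample strings and variable names are illustrative only. The re.MULTILINE flag is needed so that ^ and $ match at line boundaries.

```python
import re

# Lazy repetition of lines, then a line that is either "999" or empty.
PATTERN = re.compile(r"^(numbers:(?:\n.+)*?)(?:\n(?:999)?$)", re.MULTILINE)

# Assumed inputs: the position of the blank line is a guess at the question's files.
with_999 = "numbers:\n32\n45\n999\n\n56\n"
without_999 = "numbers:\n23\n45\n\n56\n"

first = PATTERN.search(with_999).group(1)
second = PATTERN.search(without_999).group(1)

print(repr(first))   # 'numbers:\n32\n45'
print(repr(second))  # 'numbers:\n23\n45'
```

Because the repetition is lazy, the match stops at the first line that is either 999 or empty, so group 1 never includes the 999 line.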

Use
(numbers:(?:\n(?!999).+)*)(\n999)?
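Making the second group optional was the right idea. The (?!999) negative lookahead keeps the repeated (?:\n.+)* group from consuming a line that starts with 999, so that line is left for the optional (\n999)? group.
If you also do not want the match to run into another numbers: block, add that to the lookahead: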
(numbers:(?:\n(?!999|numbers:).+)*)(\n999)?
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
numbers: 'numbers:'
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
999 '999'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.+ any character except \n (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
( group and capture to \2 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
999 '999'
--------------------------------------------------------------------------------
)? end of \2 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \2)
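As a rough check of the lookahead version in Python (the sample strings are assumed, with a blank line after 999 or after the last number):

```python
import re

# The lookahead stops the repeated group before a line starting with 999.
PATTERN = re.compile(r"(numbers:(?:\n(?!999).+)*)(\n999)?")

# Assumed inputs, not taken verbatim from the question.
with_999 = "numbers:\n32\n45\n999\n\n56\n"
without_999 = "numbers:\n23\n45\n\n56\n"

m1 = PATTERN.search(with_999)
m2 = PATTERN.search(without_999)

print(repr(m1.group(1)), repr(m1.group(2)))  # 'numbers:\n32\n45' '\n999'
print(repr(m2.group(1)), repr(m2.group(2)))  # 'numbers:\n23\n45' None
```

Note that (?!999) also stops at any line that merely starts with 999, such as 9990. If such lines can occur, (?!999$) combined with re.MULTILINE is one possible way to tighten it.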
You prefer no lookarounds? Here:
^(numbers:[\s\S]*?)\n(?:999)?$
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
numbers: 'numbers:'
--------------------------------------------------------------------------------
[\s\S]*? any character of: whitespace (\n, \r,
\t, \f, and " "), non-whitespace (all
but \n, \r, \t, \f, and " ") (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
999 '999'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Related

Capture last occurrence from multiple occurrences in Regex pattern

How can I get the desired capture below? I tried the regex ONE.*(ONE.), but it captures the whole string.
Notepad++:
1 ONE;TWO;THREE;ONE;FOUR;FIVE
2 TEST
3 TEST
4 TEST
5 TEST
Desired Capture: If ONE has 1 match then return ONE;TWO;THREE else if ONE has two matches then return ONE;FOUR;FIVE.
You can use
^.*\K\bONE\b.*
The pattern matches:
^ Start of string
.* Match any char 0+ times
\K\bONE\b Forget what is matched so far, and backtrack till the last occurrence of ONE to match it
.* Match the rest of the line
In Toad SQL, use
SELECT REGEXP_SUBSTR(Column, '.*(ONE.*)', 1, 1, NULL, 1)
EXPLANATION
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
In Notepad++, use
.*\KONE(?:(?!ONE).)*
EXPLANATION
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
ONE 'ONE'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of grouping
You can also use (?:ONE.*)?(ONE.*) and retrieve your result from the first capturing group.
This regex will always try to match two ONEs in a line, but lets you access the part relevant to the second ONE. When there's only one that's the only part that matches.
You can try it here.
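For engines without \K (Python's standard re module is one), the capture-group variants can be sketched like this. The function names are made up for illustration:

```python
import re
from typing import Optional

def last_one_greedy(line: str) -> Optional[str]:
    """Greedy .* backtracks to the last ONE; group 1 holds ONE to end of line."""
    m = re.search(r".*(ONE.*)", line)
    return m.group(1) if m else None

def last_one_optional(line: str) -> Optional[str]:
    """Optional leading ONE...; group 1 is the last ONE section either way."""
    m = re.search(r"(?:ONE.*)?(ONE.*)", line)
    return m.group(1) if m else None

two = "ONE;TWO;THREE;ONE;FOUR;FIVE"
one = "ONE;TWO;THREE"

print(last_one_greedy(two), last_one_greedy(one))      # ONE;FOUR;FIVE ONE;TWO;THREE
print(last_one_optional(two), last_one_optional(one))  # ONE;FOUR;FIVE ONE;TWO;THREE
```

Both rely on the greedy .* giving characters back until the final ONE can match, so the result is read from group 1 rather than from the overall match.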

How to capture multiple sequence of numbers as repeated groups?

I have a URL that contains multiple sequences of numbers, and I want to capture them all in groups. Suppose I have the following:
https://www.example.com//first/part/54323?key=value
or
https://www.example.com/first/12345/second/part/part2/5432?key=value
I tried to use something like that but it only matches one sequence of numbers
(.*\/)([0-9]{4,})(\/.*|$|)
I want to have multiple groups represent different sections if numbers sequence is included
1st group will be "example.com/first"
2nd group "12345"
3rd group "second/part"
4th group "5432"
5th group "?key=value"
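The initial .* is greedy, meaning it tries to match as much as possible, so it consumes everything up to the last slash that is followed by four or more digits ("https://www.example.com/first/12345/second/part/part2/" in your second example). You can change this by replacing the initial .* with .*?, which matches as little as possible: it cannot stop at "https:/" because no digits follow those slashes, so it extends to the first slash that is followed by digits. Either way, a single match yields only one number sequence, which is not what you want.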
But really we need to back up and ask some questions about your pattern. Apparently, you have a preamble you are not interested in, an indefinite number of sequences of 'character string, followed by slash, followed by number string' and then there is the "everything after there are no more slash digit patterns".
The key question is whether the number of char/char/digits combos are indefinite or limited to a definite number like the two pairs in your example. To get the regex parser to return an unbounded number of string-number pairs, you are going to want to turn on the /g (Global) switch so regex will return all matches. That is a problem with the part of your URL at the beginning and end which does not fit your pattern.
I recommend first using a regular expression to divide your URL into three parts, preamble, path, remaining data. Then you can pass the path string to a second regular expression to parse the pairs - it will be much simpler.
If you do it that way your first expression could be:
^[a-z+.-]+?:\/\/(?:www\.)?([^?#]+)(.*)$
The first part skips over everything through the optional www. and does not capture it because you are not interested in that part. The second part captures everything up to any query or fragment (delimited by ? and #, respectively) and places it in the first capture group. The last part captures the rest of the URL into the second capture group. In your example that is ?key=value.
Now take your first capture group, which contains the host and the path, and pass it to a second regex with the global flag set (so it processes all pairs repeatedly). This second regex will be:
(.*?)\/([0-9]{4,})\/?
For each match of this string, the parsed values and numbers will be in capture groups 1 & 2.
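The two-step approach might look like this in Python, where findall plays the role of the global flag. This is a sketch: the names SPLIT_URL, PAIRS and parse_url are illustrative, and the first pattern is written with a non-capturing (?:www\.)? and a greedy ([^?#]+) so that group 1 holds the whole host and path.

```python
import re

# Step 1: drop the scheme and optional "www.", then split path from the rest.
SPLIT_URL = re.compile(r"^[a-z+.-]+://(?:www\.)?([^?#]+)(.*)$")
# Step 2: repeated (text, number) pairs; numbers have at least 4 digits.
PAIRS = re.compile(r"(.*?)/([0-9]{4,})/?")

def parse_url(url):
    """Return ([(text, number), ...], rest), or None if the URL does not match."""
    m = SPLIT_URL.match(url)
    if m is None:
        return None
    path, rest = m.groups()
    return PAIRS.findall(path), rest

pairs, rest = parse_url(
    "https://www.example.com/first/12345/second/part/part2/5432?key=value"
)
print(pairs)  # [('example.com/first', '12345'), ('second/part/part2', '5432')]
print(rest)   # ?key=value
```

Splitting first keeps the second pattern small, and it works for any number of text/number pairs.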
It sounds very straight-forward:
https?:\/\/(?:www\.)?(.*?)\/(\d+)\/(.*?)\/(\d+)(?:\?(.*))?
EXPLANATION
--------------------------------------------------------------------------------
http 'http'
--------------------------------------------------------------------------------
s? 's' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
www 'www'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
)? end of grouping
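A quick Python check of this single-pattern approach (variable names are illustrative). It shows what each group holds for the two-number URL, and that a URL with only one number sequence does not match, since the pattern hard-codes exactly two numeric segments:

```python
import re

URL_RE = re.compile(r"https?://(?:www\.)?(.*?)/(\d+)/(.*?)/(\d+)(?:\?(.*))?")

two_numbers = "https://www.example.com/first/12345/second/part/part2/5432?key=value"
one_number = "https://www.example.com//first/part/54323?key=value"

groups = URL_RE.search(two_numbers).groups()
print(groups)
# ('example.com/first', '12345', 'second/part/part2', '5432', 'key=value')

no_match = URL_RE.search(one_number)
print(no_match)  # None
```

Group 5 holds the query string without the leading ?, because the ? sits outside the capturing group.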

Regex with one specific number and a word

I would like to match a specific pattern with regex but I am running into catastrophic backtracking. I wonder if there's a way it would be possible to match what I would like and not get an error.
I start with a simple assumption; I want my string to contain only one specific number e.g. 7 and only that specific number:
^\D*7\D*$
Only if I find this pattern do I want to look for another word in the same text such as "Coffee"; I put my condition into a group (^\D*7\D*$) and reference the group in my conditional and the then part will contain "Coffee":
(?(1)Coffee|)
Is there another phrasing that would avoid the catastrophic backtracking?
You can use a positive lookahead to assert that the word Coffee is at the right.
^(?=.*\bCoffee\b)\D*7\D*$
The pattern matches:
^ Start of string
(?= Positive lookahead, assert that on the right is
.*\bCoffee\b Match Coffee between word boundaries \b to prevent a partial match
) Close lookahead
\D*7\D* Match number 7 between optional non digit characters.
$ End of string
Note that \D also matches a newline. If you don't want to cross newline boundaries, you can use [^\r\n\d] instead.
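A small Python sketch of the lookahead pattern against a few made-up sample strings:

```python
import re

# Lookahead requires the word Coffee; the rest allows exactly one digit, a 7.
PATTERN = re.compile(r"^(?=.*\bCoffee\b)\D*7\D*$")

samples = [
    "Coffee for 7 people",   # one 7 and Coffee
    "7 cups of Coffee",      # order does not matter
    "Coffee for 77 people",  # two digits
    "Tea for 7 people",      # no Coffee
    "Coffee 7 and 8",        # a second, different digit
]
results = {s: bool(PATTERN.search(s)) for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

Only the first two samples match. The lookahead is checked once at the start of the string, and \D*7\D* can only be satisfied in one way, so there is little room for backtracking.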
Left to right checking is more traditional:
^(?=.*Coffee)[^\d7]*7\D*$
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
Coffee 'Coffee'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^\d7]* any character except: digits (0-9), '7' (0
or more times (matching the most amount possible))
--------------------------------------------------------------------------------
7 '7'
--------------------------------------------------------------------------------
\D* non-digits (all but 0-9) (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the string
Right-to-left checking with a variable-length lookbehind is only possible in engines such as recent JavaScript, .NET, or the PyPI regex module in Python:
^[^\d7]*7\D*$(?<=Coffee.*)
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
[^\d7]* any character except: digits (0-9), '7' (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
7 '7'
--------------------------------------------------------------------------------
\D* non-digits (all but 0-9) (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
Coffee 'Coffee'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of look-behind
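To illustrate the engine restriction: Python's standard re module rejects a variable-length lookbehind at compile time, so with the standard library a lookahead form is the practical choice. A minimal sketch (variable names are illustrative):

```python
import re

LOOKBEHIND = r"^[^\d7]*7\D*$(?<=Coffee.*)"

try:
    re.compile(LOOKBEHIND)
    stdlib_accepts_lookbehind = True
except re.error:
    # The stdlib engine requires lookbehind patterns to have a fixed width.
    stdlib_accepts_lookbehind = False

# Lookahead form of the same check, which the stdlib engine accepts.
FALLBACK = re.compile(r"^(?=.*Coffee)\D*7\D*$")

print(stdlib_accepts_lookbehind)              # False
print(bool(FALLBACK.search("7 cups Coffee")))  # True
```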

How to match strings not containing any word characters between a minus sign and numbers in PL/SQL regexp

I have some strings in Oracle where there is a minus sign (not at the beginning but inside the string), followed by a number (int or decimal with dot or comma).
I would like to find these in PLSQL. I have this already, and it's almost perfect:
REGEXP_LIKE(string, '-\d+(,|\.)*\d*')
I was hoping it would match only strings like somestring-11,1, but it also matches strings like somestring-11a1,1, where a non-numeric (word) character appears between the minus sign and the numbers. I tried to use a negative lookahead, but unfortunately it is not working:
REGEXP_LIKE(string, '-\d+!(\w)(,|\.)*\d*')
because somestring-1s won't be found either anymore. Could you please point me to the right direction? Thank you.
Please try the following, written and tested against the samples you showed. In short: a lazy match runs up to the -, then matches one or more digits, followed by a comma, followed by one or more digits.
.*?-\d+,\d+
Use
(^|\D)-(\d+([,.]*\d+)?)($|\W)
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \3 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[,.]* any character of: ',', '.' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of \3 (NOTE: because you are using a
quantifier on this capture, only the
LAST repetition of the captured pattern
will be stored in \3)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \4
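The pattern can be tried out in Python before porting it to REGEXP_LIKE. This sketch uses Python's re module rather than Oracle's engine, so details may behave differently there, and find_number is a hypothetical helper name:

```python
import re

# (non-digit or start) - digits [separator digits] (end or non-word character)
PATTERN = re.compile(r"(^|\D)-(\d+([,.]*\d+)?)($|\W)")

def find_number(text):
    """Return the number after the minus sign (group 2), or None."""
    m = PATTERN.search(text)
    return m.group(2) if m else None

print(find_number("somestring-11,1"))    # 11,1
print(find_number("some-5.25 string"))   # 5.25
print(find_number("somestring-11a1,1"))  # None
print(find_number("somestring-1s"))      # None
```

The trailing ($|\W) is what rejects 11a1,1: after the digits, the next character must be the end of the string or a non-word character.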

How to remove the hash tag from this regex

My current regex:
(\d.{17})[^#]*(\D+)(\d+)gr(\d+)
Group 2 still contains the hashtag, and I want to remove it from there. What should I change in my current regex?
201223E0MWJPJD2230#AdeSaputra290gr99000
2101023CNV6TT1109J#Fefe430gr142000
2101183EDTFPSA0128#Jessica500gr112000
201221E2QKWRY11413#EssyYosita880gr233500
2101123G9XQ7R41705#Meily1120gr329000
201228ECEWTJT50859#WidyaNatali1720gr457230
201227EEBX1K9K1020#Excelio112gr58900
2101112N4YNFB12016#DebyNath520gr156220
2101072R8A0QB22347#AlycieHandoTan700gr85000
Output:
group 1: 201223E0MWJPJD2230
group 2: #AdeSaputra
group 3: 290
group 4: 99000
Replace the negated class [^#]* in your regex with #*:
(\d.{17})#*(\D+)(\d+)gr(\d+)
\D matches any non-digits and it matches # in your input.
Add # to the pattern instead of the opposite [^#] so as to match it.
Use
^(\d.{17})#(\D+)(\d+)gr(\d+)$
The anchors ^ and $ were added so that the pattern matches entire strings.
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
.{17} any character except \n (17 times)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
# '#'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\D+ non-digits (all but 0-9) (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
gr 'gr'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
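Applied to two of the sample rows in Python, the anchored pattern gives the following groups (a sketch; the variable names are illustrative):

```python
import re

# The literal # sits between groups 1 and 2, so neither group captures it.
PATTERN = re.compile(r"^(\d.{17})#(\D+)(\d+)gr(\d+)$")

rows = [
    "201223E0MWJPJD2230#AdeSaputra290gr99000",
    "2101023CNV6TT1109J#Fefe430gr142000",
]
parsed = [PATTERN.match(row).groups() for row in rows]

for groups in parsed:
    print(groups)
# ('201223E0MWJPJD2230', 'AdeSaputra', '290', '99000')
# ('2101023CNV6TT1109J', 'Fefe', '430', '142000')
```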