RegEx, get very first match or very last match? - regex

New to RegEx, PCRE(PHP), have a basic question:
Text String I'm working with is below, text is literal
us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Aia%2Cus%3Ala%2Cus%3Asc%2Cus%3Aut%2Cus%3Act%2Cus%3Aor%2Cus%3Atn%2Cus%3Amo%2Cus%3Aaz%2Cus%3Ain%2Cus%3Amd%2Cus%3Aco%2Cus%3Awi%2Cus%3Awa
The goal for the first match is to get everything up to and including the first %2C -> "us%3Aks%2C"
The goal for the last match is to get the last %2C and everything after it -> "%2Cus%3Awa"
What am I doing wrong with my attempts?
1. ^(.+%2C)
2. (%2C.+)$

You may use this regex with a lazy match and a greedy match:
^(.*?%2C).+(%2C.*)$
RegEx Details:
^: Start
(.*?%2C): Match 0 or more characters followed by %2C (lazy match) in group #1
.+: Match 1 or more of any characters (greedy match)
(%2C.*): Match %2C followed by 0 or more characters in group #2
$: End
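A minimal sketch of the combined pattern, run with Python's re module rather than PHP's PCRE functions (the two engines agree on this syntax). The sample string is a shortened version of the one in the question, and the variable names are only illustrative.

```python
import re

# Shortened version of the question's string (same first and last items).
text = "us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Awa"

# Lazy group 1, greedy middle, group 2 anchored at the end of the string.
pattern = re.compile(r"^(.*?%2C).+(%2C.*)$")
match = pattern.match(text)

first, last = match.group(1), match.group(2)
print(first)  # us%3Aks%2C
print(last)   # %2Cus%3Awa
```

Because of the .+ in the middle, this pattern needs at least two %2C separators with something between them; a string with a single %2C produces no match at all.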

It's a matter of greediness, which controls how many characters the expression will gobble before being satisfied. So, instead of using .+, you could use .*?.
For your case (1), the expression becomes:
1. ^(.*?%2C)
For your second case, unfortunately, lazy matching alone will not help. Instead, you have to skip most of the string up front with a greedy .+, so the second expression becomes something like:
2. .+(%2C.+)$
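To make the difference visible, here is a small sketch comparing the two original attempts with the two corrected expressions. It uses Python's re module for illustration and a shortened copy of the question's string; the variable names are hypothetical.

```python
import re

# Shortened version of the question's string.
text = "us%3Aks%2Cus%3Aal%2Cus%3Aok%2Cus%3Awa"

# Attempt 1: greedy .+ runs to the LAST %2C.
greedy_first = re.search(r"^(.+%2C)", text).group(1)
# Fix 1: lazy .*? stops at the FIRST %2C.
lazy_first = re.search(r"^(.*?%2C)", text).group(1)

# Attempt 2: the search starts at the leftmost %2C and runs to the end.
leftmost_last = re.search(r"(%2C.+)$", text).group(1)
# Fix 2: a leading greedy .+ swallows everything up to the last %2C.
skipped_last = re.search(r".+(%2C.+)$", text).group(1)

print(greedy_first)   # us%3Aks%2Cus%3Aal%2Cus%3Aok%2C
print(lazy_first)     # us%3Aks%2C
print(leftmost_last)  # %2Cus%3Aal%2Cus%3Aok%2Cus%3Awa
print(skipped_last)   # %2Cus%3Awa
```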

Related

Regular expression using non-greedy matching -- confusing result

I thought I understood how the non-greedy modifier works, but am confused by the following result:
Regular Expression: (,\S+?)_sys$
Test String: abc,def,ghi,jkl_sys
Desired result: ,jkl_sys <- last field including comma
Actual result: ,def,ghi,jkl_sys
Use case is that I have a comma separated string whose last field will end in "_sys" (e.g. ,sometext_sys). I want to match only the last field and only if it ends with _sys.
I am using the non-greedy (?) modifier to return the shortest possible match (only the last field including the comma), but it returns all but the first field (i.e. the longest match).
What am I missing?
I used https://regex101.com/ to test, in case you want to see a live example.
You can use
,[^,]+_sys$
The pattern matches:
, Match the last comma
[^,]+ Match 1+ occurrences of any char except ,
_sys Match literally
$ End of string
If you don't want to match newlines and whitespaces:
,[^\s,]+_sys$
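The reason the lazy version returns the long match is that a regex engine always reports the leftmost match; laziness only controls how far the match extends from that starting point. A short sketch in Python's re module (used here only for illustration) shows the original pattern next to the negated-class patterns:

```python
import re

text = "abc,def,ghi,jkl_sys"

# The match starts at the FIRST comma; \S+? then grows until _sys$ fits.
lazy = re.search(r"(,\S+?)_sys$", text).group(0)

# [^,]+ cannot cross a comma, so only the last field can reach _sys$.
fixed = re.search(r",[^,]+_sys$", text).group(0)
no_whitespace = re.search(r",[^\s,]+_sys$", text).group(0)

print(lazy)           # ,def,ghi,jkl_sys
print(fixed)          # ,jkl_sys
print(no_whitespace)  # ,jkl_sys
```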
It sounds like you're looking for a string that ends with "_sys", sits at the end of the source string, and is preceded by a comma.
,\s*(\w+_sys)$
I added the \s* to allow for optional whitespace after the comma.
No non-greedy modifiers necessary.
The parens are around \w+_sys so you can capture just that string, without the comma and optional whitespace.
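As a quick check of the capture-group variant, the following sketch (Python's re module, sample strings made up for illustration) shows that group 1 holds the field without the comma or the optional whitespace:

```python
import re

pattern = re.compile(r",\s*(\w+_sys)$")

# \w cannot match a comma, so only the last field can satisfy the pattern.
plain = pattern.search("abc,def,ghi,jkl_sys").group(1)
spaced = pattern.search("abc,def, jkl_sys").group(1)

# No match when the string does not end in _sys.
missing = pattern.search("abc,def,jkl_sysx")

print(plain)    # jkl_sys
print(spaced)   # jkl_sys
print(missing)  # None
```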

Match first of two conditions

My problem is simple, but I've been pulling my hair out trying to solve it. I have two types of strings: one has a semicolon and the other doesn't. Both have colons.
Reason: A chosen reason
Delete: Other: testing
Reason for action: Other; testing
Blah: Other; testing;testing
If the string has a semicolon, I want to match anything after the first one. If it has no semicolon, I want to match everything after the first colon. For lines above I should get:
A chosen reason
Other: testing
testing
testing;testing
I can get the semicolon to match by using ;(.*) and I can get the colon to match by using :(.*).
I tried using an alternative like this: ;(.*)|:(.*) thinking that maybe if I have the right order I can get it to match the semicolon first, and then the colon if there is no semicolon, but it always just matched the colon.
What am I doing wrong?
Edit
I added another test case above to match the requirements I had stated. For strings with no semicolon, it should match the first colon.
Also, "Reason" could be anything, so I am clarifying that as well in the test cases.
Second Edit
To clarify, I'm using POSIX regular expressions (in PostgreSQL).
My guess is that you might want to design an expression, maybe similar to:
:\s*(?:[^;\r\n]*;)?\s*(.*)$
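A rough way to check this expression against the four sample lines from the question is the sketch below. It uses Python's re module, not PostgreSQL, so it only illustrates the matching logic; the wanted text ends up in group 1.

```python
import re

# No lookarounds: an optional "anything up to the first ;" after the colon.
pattern = re.compile(r":\s*(?:[^;\r\n]*;)?\s*(.*)$")

lines = [
    "Reason: A chosen reason",
    "Delete: Other: testing",
    "Reason for action: Other; testing",
    "Blah: Other; testing;testing",
]

results = [pattern.search(line).group(1) for line in lines]
for line, result in zip(lines, results):
    print(f"{line!r} -> {result!r}")
```

The optional group is tried first, so when a semicolon exists the capture starts after it; when none exists the group is skipped and the capture starts right after the first colon.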
Here is a fast regex (233 steps) with no lookaheads.
.*?:\s*(?:([^\n;]+)|.*?;\s*(.*))$
Check out the regex https://regex101.com/r/9gbpjW/3
UPDATED: to match any placeholder instead of just Reason.
One option is to use an alternation that first checks whether the string has no ;. If there is none, match until the first : and capture the rest in group 1.
If there is a ;, match until the first semicolon and capture the rest in group 1.
For the logic stated in the question:
If the string has a semicolon, I want to match anything after the first one.
If it has no semicolon, I want to match everything after the first colon
You could use:
^(?:(?!.*;)[^\r\n:]*:|[^;\r\n]*;)[ \t]*(.*)$
Explanation
^ Start of string
(?: Non capturing group
(?!.*;) Negative lookahead (supported by Postgresql), assert string does not contain ;
[^\r\n:]*: If that is the case, match 0+ times not : or a newline, then match :
| Or
[^;\r\n]*; Match 0+ times not ; or newline, then match ;
) Close non capturing group
[ \t]* Match 0+ spaces or tabs
(.*) Capturing group 1, match any char 0+ times
$ End of string
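The lookahead-based pattern explained above can be exercised against the question's sample lines with a short sketch. Python's re module stands in for PostgreSQL here, so treat it as a check of the logic rather than of PostgreSQL's behavior.

```python
import re

# Branch 1: no ; anywhere, so consume up to the first colon.
# Branch 2: otherwise consume up to the first semicolon.
pattern = re.compile(r"^(?:(?!.*;)[^\r\n:]*:|[^;\r\n]*;)[ \t]*(.*)$")

lines = [
    "Reason: A chosen reason",
    "Delete: Other: testing",
    "Reason for action: Other; testing",
    "Blah: Other; testing;testing",
]

captured = [pattern.match(line).group(1) for line in lines]
for line, value in zip(lines, captured):
    print(f"{line!r} -> {value!r}")
```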
regex = .*?:(?(?!.*;)(.*)|.*?;(.*))

Regex in middle of text doesn't match

I have a regex to find url's in text:
^(?!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}?$
However it fails when it is surrounded by text:
https://regex101.com/r/0vZy6h/1
I can't seem to grasp why it's not working.
Possible reasons why the pattern does not work:
^ and $ make it match the entire string
(?!:\/\/) is a negative lookahead that fails the match if, immediately to the right of the current location, there is :// substring. But [a-zA-Z0-9-_]+ means there can't be any ://, so, you most probably wanted to fail the match if :// is present to the left of the current location, i.e. you want a negative lookbehind, (?<!:\/\/).
[a-zA-Z]{2,11}? - matches 2 chars only if $ is removed since the {2,11}? is a lazy quantifier and when such a pattern is at the end of the pattern it will always match the minimum char amount, here, 2.
Use
(?<!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}
See the regex demo. Add \b word boundaries if you need to match the substrings as whole words.
Note in Python regex there is no need to escape /, you may replace (?<!:\/\/) with (?<!://).
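The following sketch tries the corrected pattern in Python's re module on a made-up sentence. One assumption to flag: the hyphen is moved to the end of each character class so that it is unambiguously literal, which should not change what the classes match. It also shows why the \b suggestion matters and what the lazy {2,11}? does at the end of a pattern.

```python
import re

# Answer's pattern; hyphen moved to the end of each class (assumption, see above).
core = r"(?<!://)([a-zA-Z0-9_-]+\.)*[a-zA-Z0-9][a-zA-Z0-9_-]+\.[a-zA-Z]{2,11}"

text = "Visit example.com or https://regex101.com today"

# Without \b the lookbehind only blocks a match starting right after ://,
# so the engine simply starts one character later.
unbounded = [m.group(0) for m in re.finditer(core, text)]
# With \b on both sides, a match can only start at a word boundary.
bounded = [m.group(0) for m in re.finditer(r"\b" + core + r"\b", text)]

print(unbounded)  # ['example.com', 'egex101.com']
print(bounded)    # ['example.com']

# A lazy quantifier at the very end of a pattern stops at its minimum.
lazy_tld = re.search(r"[a-zA-Z0-9][a-zA-Z0-9_-]+\.[a-zA-Z]{2,11}?", "example.com")
print(lazy_tld.group(0))  # example.co
```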
The spaces are not being matched. Try adding space to the character sets checking for leading or trailing text.

Regex searching for number and letter combination optional brackets

I need to get a regex that will find a match of a single lower case a-z character followed by 5 numbers that is either:
at the start of a line
at the end of a line
surrounded by () or []
surrounded by whitespace
So the following results are expected:
a12345 MATCH
(a12345) MATCH
[a12345] MATCH
text a12345 MATCH
aa12345 NO MATCH
At the moment I have this (?<=[])]*)[a-z]{1}[0-9]{5}(?=[])]*) but it is not working for all scenarios, for example it sees aa12345 and a12345a as being matches when I don't want them to.
Can anyone help?
EDIT:
Apologies I should have mentioned this is for .NET c#
First of all, you should mention the programming language.
Following solution is for PCRE.
Regex: ((?<=[\[( ])|^)[a-z]\d{5}((?=[\]\) ])|$)
Explanation:
((?<=[\[( ])|^) checks for preceding brackets, whitespaces OR beginning.
[a-z]\d{5} checks for alphabet followed by 5 digits.
((?=[\]\) ])|$) checks for succeeding brackets, whitespaces OR end of line.
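To check the pattern against the cases listed in the question, here is a small sketch in Python's re module, which accepts the same lookaround syntax for this expression. It is not a .NET test, and the a12345a case is taken from the question's description of unwanted matches.

```python
import re

pattern = re.compile(r"((?<=[\[( ])|^)[a-z]\d{5}((?=[\]\) ])|$)")

# Sample -> whether a match is expected, per the question.
expected = {
    "a12345": True,
    "(a12345)": True,
    "[a12345]": True,
    "text a12345": True,
    "aa12345": False,
    "a12345a": False,
}

actual = {sample: pattern.search(sample) is not None for sample in expected}
for sample, hit in actual.items():
    print(f"{sample!r}: {'MATCH' if hit else 'NO MATCH'}")
```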
Does this work:
(\[[a-z]\d{5}\])|(\([a-z]\d{5}\))|(\b[a-z]\d{5}\b)

Greedy/non-greedy quantifiers in ABAP regular expressions

I would like to extract 2 things from this string: | 2013.10.10 FEL felsz
regex -> Date field -> the needed value will be only the 2013.10.10 (in this case)
regex -> String between 2013.10.10 and felsz string -> the needed value will be only the FEL string (in this case).
I tried the following regexes without much success:
(.*?<p/\s>.*?)(?=\s)
(.*?<p/("[0-9]+">.*?)(?=\s)
Do you have any suggestions?
As mentioned in comments, since ABAP doesn't allow non-greedy match with *?, if you can count on felsz occurring only immediately after the second portion you want to match you could use:
(\d{4}\.\d\d\.\d\d) (.*) felsz
(PS: Invalidated first answer: in non-ABAP systems where *? is supported, the following regex will get both values into submatches. The date will be in submatch 1 and the other value (FEL in this case) will be in submatch 2: (\d{4}\.\d\d\.\d\d) (.*?) felsz)
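The greedy expression can be tried outside ABAP with a short sketch. Python's re module is used purely to illustrate how the greedy (.*) behaves; the second input string is made up to show the caveat about felsz occurring more than once.

```python
import re

pattern = re.compile(r"(\d{4}\.\d\d\.\d\d) (.*) felsz")

date, word = pattern.search("| 2013.10.10 FEL felsz").groups()
print(date)  # 2013.10.10
print(word)  # FEL

# Caveat: a greedy (.*) runs to the LAST " felsz" in the string.
_, greedy_word = pattern.search("| 2013.10.10 FEL felsz more felsz").groups()
print(greedy_word)  # FEL felsz more
```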
Is "felsz" variable? Can the white space vary? Can your date format vary? If not:
\| (\d{4}\.\d{2}\.\d{2}) (.*?) felsz
Otherwise:
\|\s+?(\d{4}\.\d{2}\.\d{2})\s+?(.*?)\s+?[a-z]+
Then access capture groups 1/2.
The regex
\d+\.\d+\.\d+
matches 2013.10.10 in the given string. Explanation and demonstration: http://regex101.com/r/bL7eO0
(?<=\d ).*(?= felsz)
should work to match FEL. Explanation and demonstration: http://regex101.com/r/pV2mW5
If you want them in capturing groups, you could use the regex:
\| (\d+\.\d+\.\d+) (.+?) .*
Explanation and demonstration: http://regex101.com/r/rQ6uU4
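The three expressions from this answer can be checked together with the sketch below. It runs in Python's re module, which supports the lookbehind and the lazy .+? used here; ABAP may not, so this only illustrates the intended matches.

```python
import re

text = "| 2013.10.10 FEL felsz"

# Date on its own.
date = re.search(r"\d+\.\d+\.\d+", text).group(0)

# Text between "digit + space" and " felsz", without consuming either.
word = re.search(r"(?<=\d ).*(?= felsz)", text).group(0)

# Both values as capture groups.
groups = re.search(r"\| (\d+\.\d+\.\d+) (.+?) .*", text).groups()

print(date)    # 2013.10.10
print(word)    # FEL
print(groups)  # ('2013.10.10', 'FEL')
```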
How about:
(?:\d+\.\d+\.\d+\s)(.*)\s
This matches FEL
Some things I took for granted:
the date always comes first and is a mix of numbers and periods
the date is always followed by a space
the word to capture is always followed by a space
the word to capture never contains a space
Assuming that FEL is always a single word (that is, delimited by a space), you could use the following expression:
(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)
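Since this last expression uses no lazy quantifiers or lookarounds, it sidesteps the greedy/non-greedy question entirely. A minimal sketch in Python's re module (for illustration only, not ABAP):

```python
import re

text = "| 2013.10.10 FEL felsz"

# [^\s]+ stops at the first space, so no lazy quantifier is needed.
match = re.search(r"(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)", text)
date, word, rest = match.groups()

print(date)  # 2013.10.10
print(word)  # FEL
print(rest)  # felsz
```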