Parsing multiple groups from a regular expression


I am having a problem parsing some fields with the following regular expression, which I uploaded to rubular. The string that I am parsing is a special header from the banner of an FTP server. To process this banner, I need to parse the line:
special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^
I thought that (?i)^special(:[pmu](.*?)\^)?* would do the trick. Unfortunately, it only gives me the last match, and I am not sure why, since I am lazily trying to capture each group. Note that I should also be able to capture an empty string, e.g. if the string to match contains :u^
Match result:
special:pTXT1TOMATCH^:mTXT2TOMATCH^:uTXT3TOMATCH^
Match groups:
:uTXT3TOMATCH^
TXT3TOMATCH
The idea is that the line must start with the text 'special', followed by up to 3 capture groups, each introduced by p, m, or u and running lazily up to the next ^ symbol. I need to capture the text indicated above: basically, I need to find TXT1TOCAPTURE, TXT2TOCAPTURE, and TXT3TOCAPTURE. There should be at least one of these three capture groups.
Thanks in advance

You have two problems with your regex: one is syntactic and one is conceptual.
Syntactic:
There is no ?* quantifier in PCRE, but in Ruby it is equivalent to *, which denotes a greedy quantifier. When a quantifier is applied to a capturing group, the group keeps only the last match.
Conceptual:
Using a lazy quantifier .*? doesn't give you continuous matches. It stops as soon as the engine is satisfied. Even with the g modifier on, the next match will never occur, as there is no ^special at the position where the last match ended.
The solution is to use the \G token, which starts matching at the end of the previous match:
(?:special|(?!\A)\G):([pmu][^^]*\^)
Live demo

You might want to use the \G anchor:
(?:(?:^special:)|\G(?!\A)\^:)[pmu]([^^]+)
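See it working on rubular.com. Note that [^^]+ requires at least one character; use [^^]* instead if empty fields such as :u^ must also match.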
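Python's built-in re module has no \G anchor, so the sketch below emulates it rather than reusing either pattern above verbatim. It relies on Pattern.match(string, pos), which only succeeds when a match starts exactly at pos, giving the same "continue where the last match ended" behaviour. The function name parse_special and the HEADER and FIELD patterns are illustrative assumptions, not part of the original answers.

```python
import re

# Hypothetical helper patterns: the header keyword, then one ":x...^" field.
HEADER = re.compile(r"special(?=:)", re.IGNORECASE)
FIELD = re.compile(r":([pmu])([^^]*)\^", re.IGNORECASE)


def parse_special(line):
    """Return (letter, text) pairs for each contiguous field after 'special'."""
    head = HEADER.match(line)
    if head is None:
        return []
    fields = []
    pos = head.end()
    while True:
        # match() is anchored at pos, which plays the role of \G here.
        m = FIELD.match(line, pos)
        if m is None:
            break
        fields.append((m.group(1), m.group(2)))
        pos = m.end()
    return fields


banner = "special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^"
for letter, text in parse_special(banner):
    print(letter, repr(text))
```

Because the field text is matched with [^^]*, an empty field such as :u^ yields an empty string instead of ending the loop.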

Related

Regular expression - skip characters in jMeter testing

I have the regular expression below, which retrieves all characters beginning with state%3:
(state%3)((?:(?!#).)*)
I want to ignore the state%3 prefix. I have tried all kinds of lookbehinds, but nothing seems to work.
Here is the full text that I need to match against:
"state%3DnGl%252BlPm8CkHfYd2PpBq7W0H2z6xgUeICgB7KFmGmGG8cTSQTf%252B9cYCfFSsT5YSPTITdbaLAlJoQ22%252FCXRAu3ROqTQYzpPfGYxKmRZ7iIqwx3g0GLpVkaXq5FL3Js5FcTGpncQx7TA9w1A6HsSyxxcktfwX8QSzhqJQj5lntOolrPoIqpa4l2C%252BbhCWuAOY18BwVynMv8%252BuSl#login/"
One of the things I have already tried:
^.{5}\Kstate
But it does not seem to work. Any ideas? I need this for jMeter testing.

There is no need for a lookbehind, or for any lookarounds at all. Use a single capturing group and a negated character class:
state%3([^#]+)
and set the template value to $1$.
See the regex demo. Details:
state%3 - matches a literal text
([^#]+) - Capturing group #1 (that is why template should be $1$): one or more chars other than #.
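Outside jMeter, the same capturing-group idea can be checked in a few lines of Python. This is only a sketch: the text variable is a shortened, hypothetical stand-in for the response in the question, and group(1) plays the role of the $1$ template.

```python
import re

# Shortened stand-in for the response text in the question (hypothetical sample).
text = '"state%3DnGl%252BlPm8CkHf#login/"'

# Group 1 holds everything after the literal "state%3" up to the first "#".
match = re.search(r"state%3([^#]+)", text)
value = match.group(1) if match else None
print(value)
```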

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 OK using (?<=.art\:\s)([0-9A-Z]*). This gave me a one-group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way, with only one group appearing, because I am parsing it into a column value.
I am not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) gives me the entire line after S.
Helping me get it right, with an explanation, will go a long way.

Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we make a positive lookbehind that makes sure we're after "art: ", then we match two numbers, separated by non-digits.
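The difference between asserting and consuming can be seen in a short Python sketch. The lookahead pattern here is a simplified stand-in for the one in the question (an assumption made for illustration), and the lookbehind variant drops the inner capturing group so that only the two numbers are captured.

```python
import re

line = "Start: 738392E, 6726376N"

# A lookahead only asserts; it consumes nothing, so .* starts at "art: ...".
lookahead = re.search(r"(?=art:\s)(.*)", line)
print(lookahead.group(1))

# A lookbehind places the match start after "art: ", so the digits come first.
lookbehind = re.search(r"(?<=art:\s)(\d+)\D+(\d+)", line)
print(lookbehind.groups())
```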

There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
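As a rough check of the two simpler patterns suggested in this answer, here is a minimal Python sketch. The variable names easting and northing are assumptions chosen for readability.

```python
import re

line = "Start: 738392E, 6726376N"

# One match, two groups: easting and northing together.
both = re.search(r"Start: (\d+)E, (\d+)N", line)
easting, northing = both.groups()

# Separate matches: each run of digits that is followed by E or N.
parts = re.findall(r"\b\d+(?=[EN]\b)", line)

print(easting, northing, parts)
```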

You match the whole part after it because you use .* which will match until the end of the line.
Note that this part [0-9]* at the end of the pattern does not match because it is optional and the preceding .* already matches until the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports variable-width lookbehinds:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Extracting part of a string using regex

I am trying to extract part of each of the strings below.
I tried (.*)(?:table)?, but it fails in the last case. How can I make the expression capture the entire string when the text "table" is absent?
Text: "diningtable" Expected match: dining
Text: "cookingtable" Expected match: cooking
Text: "cooking" Expected match: cooking
Text: "table" Expected match: ""

Rather than try to match everything but table, you should do a replacement operation that removes the text table.
Depending on the language, this might not even need regex. For example, in Java you could use:
String output = input.replace("table", "");

If you want to use regex, you can use this one:
(^.*)(?=table)|(?!.*table.*)(^.+)
See demo here: regex101
The idea is: match everything from the beginning of the line ^ until the word table or, if table does not occur in the string, match at least one symbol (to avoid matching empty lines). Thus, when the string is just the word table, it returns an empty string (because it matches from the beginning of the line up to the word table).

The (.*)(?:table)? pattern fails with table (it matches it) because the first group (.*) is a greedy dot pattern that grabs the whole string into Group 1. The optional non-capturing group then matches an empty string at the end of the string, so the engine never has to backtrack to find table.
The regex trick is to match any text that does not start with table before the optional group:
^((?:(?!table).)+)(?:table)?$
See the regex demo
Now, Group 1 - ((?:(?!table).)+) - contains a tempered greedy token (?:(?!table).)+ that matches 1 or more chars other than a newline that do not start a table sequence. Thus, the first group will never match table.
The anchors make the regex match the whole line.
NOTE: Non-regex solutions might turn out more efficient though, as a tempered greedy token is rather resource consuming.
NOTE2: Unrolling the tempered greedy token usually improves performance noticeably:
^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$
See another demo
But usually it looks "cryptic", "unreadable", and "unmaintainable".
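Both patterns can be compared in Python with a small sketch; the helper name prefix is hypothetical. One difference is worth noting: because the tempered token uses +, a bare "table" does not match at all, whereas the unrolled version matches and leaves Group 1 empty.

```python
import re

TEMPERED = re.compile(r"^((?:(?!table).)+)(?:table)?$")
UNROLLED = re.compile(r"^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$")


def prefix(pattern, text):
    """Return Group 1 of the match, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else None


for sample in ("diningtable", "cookingtable", "cooking", "table"):
    print(sample, prefix(TEMPERED, sample), prefix(UNROLLED, sample))
```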

In addition to the other great answers, you could also use alternation:
^(?|(.*)table$|(.*))$
This makes use of a branch reset, so your desired content is always stored in group 1. If your language/tool of choice doesn't support it, you would have to check which of groups 1 and 2 contains the string.
See Demo
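Python's built-in re module does not support the branch reset group (?|...), so the sketch below uses the fallback described above: a plain alternation with two groups, followed by a check of which group participated. The function name strip_table is an assumption.

```python
import re

# Plain alternation: group 1 is set when "table" ends the string, else group 2.
PATTERN = re.compile(r"^(?:(.*)table|(.*))$")


def strip_table(text):
    m = PATTERN.match(text)
    if m is None:
        return None
    # A group that did not participate in the match is None in Python.
    return m.group(1) if m.group(1) is not None else m.group(2)


for sample in ("diningtable", "cookingtable", "cooking", "table"):
    print(sample, repr(strip_table(sample)))
```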

Capture filename parts: Why doesn't this regexp work?

I'm fairly new to regexp and I am missing something about capturing groups.
Let's suppose I have a filepath like this:
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures ed only.
What did I miss? I do not understand why n is not captured, nor the body...
I found answers on the web to capture the extension but I do not understand the problem there.
Thanks

The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now, .([a-z]*$) should allocate a substring for the . (any character) pattern. Backtracking goes back and finds n - now, Group 1 only contains ed.
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
See demo
It will match the whole string up to the end with the first (.*), then backtrack to find the last . symbol (so Group 1 will hold all text from the beginning up to the last .), and then capture the rest of the string into Group 2.
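If a dot does not have to be present (i.e. if a file name has no extension), add an optional group and make the first group lazy, since a greedy (.*) would consume the whole string and leave the optional group empty:
^(.*?)(?:\.([^.]*))?$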
See another demo
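A minimal Python sketch of the body/extension split follows; the function name split_name is hypothetical. For the optional-extension case it uses a lazy first group, because a greedy (.*) followed by an optional group consumes the whole name and the optional group never participates.

```python
import re


def split_name(filename):
    """Return (body, extension); extension is None when there is no dot."""
    m = re.match(r"^(.*?)(?:\.([^.]*))?$", filename)
    if m is None:
        return None
    return m.group(1), m.group(2)


# Mandatory dot: greedy (.*) backtracks to the last "." in the name.
strict = re.match(r"^(.*)\.(.*)$", "test.orange.john.edn").groups()

# Greedy first group with an optional extension: everything lands in group 1.
greedy = re.match(r"^(.*)(?:\.(.*))?$", "test.orange.john.edn").groups()

print(strict, greedy, split_name("test.orange.john.edn"), split_name("README"))
```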

You can try with:
^([a-z.]+)\.([a-z]+)$
online example

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
So for capturing the URL I formulated these two rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I end up with this Regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me translate my rules into a regex? Other ways of doing this (without lookahead or lookbehind) are also appreciated.

The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, lookbehinds are only available in engines that implement ES2018 or later; older engines do not support them.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier makes the regex match only the first occurrence. Access the value you need inside Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
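The capturing-group approach carries over to other engines unchanged. Here is a minimal Python sketch, where the html string is a made-up sample with two redirects; re.search stops at the first match, mirroring a JS regex without /g.

```python
import re

# Hypothetical sample input containing two redirects.
html = (
    '<script>window.location.redirect("http://stackoverflow.com")</script>'
    '<script>window.location.redirect("http://example.com")</script>'
)

REDIRECT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

# search() returns only the first occurrence; the URL is in Group 1.
m = REDIRECT.search(html)
first_url = m.group(1) if m else None
print(first_url)
```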
