Regex not matching the first occurrence of my string - regex

In this url:
http://example.com/SearchResult-Empty.html?caty[]=12345&caty[]=45678
I am trying to use the following regex to grab the first occurrence of caty, which should be "12345". Instead, the regex below gives me the final occurrence, 45678. I tried using the "?" modifier to make it non-greedy, per other Stack Overflow questions, but it isn't working. How can I do this?
^SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?$

As far as I can tell, two things are messing you up:
The anchors ^ and $ force the match to run from the start to the end of the input, so the digits captured have to be the ones at the very end of the string
You are using greedy .* instead of non-greedy .*?
SearchResult(?:.*?)(caty)(?:.*?)\=([0-9]+)\&?
Should do the job

^SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?$
^^
.* is greedy, meaning that it will go to the last occurrence of caty rather than the first. You can check that by providing three caty parameters in the input string: it will then skip the first two.
.*? makes it non-greedy (aka reluctant), which will consume as little as possible to make a match - stopping at the first occurrence of caty.
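The effect of both changes can be checked with a small sketch. It assumes Python's re module (the question does not say which engine is in use) and uses only the path portion of the URL, so that the ^ anchor has a chance to match. The variable names are illustrative.

```python
import re

# Path portion of the URL from the question (assumed input, so that ^ can match).
path = "SearchResult-Empty.html?caty[]=12345&caty[]=45678"

# Original pattern: greedy quantifiers plus ^ and $ anchors.
greedy_anchored = re.search(r"^SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?$", path)

# Lazy quantifiers alone are not enough: $ still forces the digits
# to be the ones at the end of the string.
lazy_anchored = re.search(r"^SearchResult(?:.*?)(caty)(?:.*?)\=([0-9]+)\&?$", path)

# Lazy quantifiers without the anchors stop at the first caty value.
lazy_unanchored = re.search(r"SearchResult(?:.*?)(caty)(?:.*?)\=([0-9]+)\&?", path)

print(greedy_anchored.group(2))  # 45678
print(lazy_anchored.group(2))    # 45678
print(lazy_unanchored.group(2))  # 12345
```

The middle case is the one the question ran into: adding ? helps only once the trailing $ is gone.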

Related

Regex: repeated matches using start of line

Say that I would like to replace every a that comes after 2 initial a's and that has only a's between it and those first 2 a's. I can do this in Vim using the (very magic \v) regex s:\v(^a{2}a{-})@<=a:X:g:
aaaaaaaaaaa
goes to
aaXXXXXXXXX
However, why does s:\v^a{2}a{-}\zsa:X:g only replace the first occurrence? I.e., giving
aaXaaaaaaaa
I presume this is because the first match "consumes" the start of the line and the first 2 a's, so that later matches only see what remains of the line, which can never match the ^ again. Is this true? Or rather, what is the most pedagogical explanation?
P.S. This is a minimal example of another problem.
Edit
Accepted answer corrected a typo in the original regex (a missing ^) and its comment answered the question: why can the ^ be "reused" in the lookbehind but not in the \zs case? (Ans: lookbehind doesn't consume the match whereas \zs does.)
The point here is that (a{2}a{-})@<=a matches any a (see the last a) that is preceded by two or more a chars. In NFA regex flavors that support variable-width lookbehind, it is equivalent to (?<=a{2,}?)a.
The ^a{2}a{-}\zsa regex matches the start of the string, then two or more a's, then discards this matched text and matches an a. So, it cannot match other a's, since the ^ anchors the match at the start of the string (and does not allow matching anywhere else).
You probably want to go on using a lookbehind construct and add ^ there (if you want to start matching only if the string starts with two a's):
:%s/\v(^a{2}a{-})@<=a/X/g
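The "an anchored pattern can match only once" point can be illustrated outside Vim. The sketch below assumes Python's re module, which has neither \zs nor variable-length lookbehind, so it is not a translation of the Vim commands. It only shows that a pattern starting with ^ finds a single match, and then uses a callback (the helper replace_tail is hypothetical) as one way to get the same result as the lookbehind version.

```python
import re

line = "aaaaaaaaaaa"

# Rough analogue of ^a{2}a{-}a: the pattern must start at ^, so after the
# first match no later position can satisfy it.
anchored_matches = re.findall(r"^a{2}a*?a", line)
print(len(anchored_matches))  # 1

# Workaround without lookbehind: match the whole run once and rewrite
# everything after the first two a's inside a replacement callback.
def replace_tail(match):
    head, tail = match.group(1), match.group(2)
    return head + "X" * len(tail)

result = re.sub(r"^(a{2})(a+)", replace_tail, line)
print(result)  # aaXXXXXXXXX
```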

Trying to combine two Regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3
The first expression matches the start of the string and the second matches the end, so you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
This expression works, but it takes 1660 steps to match the string. This is because each .* between | characters first captures the whole string up to the end, and then steps back character by character to find the match.
If you use the non-greedy modifier here (.*?), the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. That reduces the number of steps to 1183.
However, if you want to remove this backtracking (or forward-tracking) altogether, you can skip as many non-| characters as possible in one go with [^|]*. The other .* patterns in the regex can be replaced similarly. The resulting regex finds a match in just 47 steps, more than 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*
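Since the (?P<name>...) syntax is the one Python uses, both optimized patterns can be tried against the sample line with a short sketch. The use of Python's re module is an assumption, and the variable names are illustrative; step counts are not reproduced here, only the captured fields.

```python
import re

line = ("06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
        "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3")

# Negated character classes ([^|]*) instead of .* between the pipes.
without_slash = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*"
)

# Variant from the update that keeps the last slash in the path group.
with_slash = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*"
)

fields = without_slash.match(line).groupdict()
print(fields["date"])      # 06/Mar/2020:00:01:04
print(fields["timezone"])  # -0500
print(fields["ip"])        # 4.2.2.1
print(fields["path"])      # x-fr-US.OPEN.1-Turtle-2020.30.04

print(with_slash.match(line).group("path"))  # /x-fr-US.OPEN.1-Turtle-2020.30.04
```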

Regex Group Capture, how to stop before next word

I have the following regular expression:
Defaults(.*)Class=\"(?<class>.*)\"(.*)StorePath=\"(?<storePath>.*)\"
And the following string:
Defaults Class="Class name here" StorePath="Any store path here" SqlTable="SqlTableName"
I'm trying to achieve the following:
class Class name here
storePath Any store path here
But, what I'm getting as a result is:
class Class name here
storePath Any store path here SqlTable="SqlTableName"
How do I stop before the SqlTable text?
The language is C# and the regex engine is the one built into the .NET Framework.
Thanks a lot!
The solution proposed by @ahmed-abdelhameed solves the problem; I forgot the non-greedy modifier.
Defaults(.*)Class=\"(?<class>.*)\"(.*)StorePath=\"(?<storePath>.*?)\"
Thanks!
In the storePath group, you're matching any character zero or more times (a greedy match). A greedy match returns as many characters as possible, so it keeps matching characters until it reaches the last occurrence of ".
What you need to do is convert your greedy match into a lazy match by replacing .* with .*?. A lazy match returns as few characters as possible, so in your case it keeps matching characters only until it reaches the first occurrence of ".
Simply replace your regex with:
Defaults(.*)Class=\"(?<class>.*)\"(.*)StorePath=\"(?<storePath>.*?)\"
References:
Laziness Instead of Greediness.
What do 'lazy' and 'greedy' mean in the context of regular expressions?
A little easier to read:
Class="(.+?)".+?StorePath="(.+?)"
The .+? says to match non-greedily, that is, to match as little as possible.
That causes it to capture only up to the next "
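The difference between the greedy and lazy storePath group can be reproduced with a short sketch. The question is about C#, so this is a port to Python's re module made only for illustration: the named-group syntax becomes (?P<name>...) instead of .NET's (?<name>...), and the variable names are assumptions.

```python
import re

text = ('Defaults Class="Class name here" '
        'StorePath="Any store path here" SqlTable="SqlTableName"')

# Greedy storePath group: .* runs to the last double quote in the string.
greedy = re.search(
    r'Defaults(.*)Class="(?P<class>.*)"(.*)StorePath="(?P<storePath>.*)"', text)

# Lazy storePath group: .*? stops at the first double quote that follows.
lazy = re.search(
    r'Defaults(.*)Class="(?P<class>.*)"(.*)StorePath="(?P<storePath>.*?)"', text)

print(greedy.group("storePath"))  # Any store path here" SqlTable="SqlTableName
print(lazy.group("storePath"))    # Any store path here
print(lazy.group("class"))        # Class name here
```

The class group still uses a greedy .* and only gives the right answer here because StorePath=" appears exactly once after it; making it lazy as well would be the safer choice if the input can vary.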

Regex end of capture string

Can anyone tell me why this regex:
(<\s*script\s*>.*<\s*\/*script\s*>)
Matches this entire line:
< script > some more javascript</script> ggg <script>
You have two problems:
First, a simple mistake: you are making the closing slash (the termination switch) match 0 or more '/' characters by using the * quantifier. You can solve that by removing the quantifier so that exactly one slash is required, changing your regex to: (<\s*script\s*>.*<\s*\/script\s*>)
Second, .* is greedy. This means it grabs as much as it can while still matching the rest of the regex, in this case <\s*\/*script\s*>. So if you had multiple "<script>...</script>" blocks on a line, it would match the entire line rather than each block.
What you want is to match any character as few times as possible, which is called lazy matching. You can qualify any quantifier with ? to accomplish this; in your example:
.*?
Using that your regex would become:
(<\s*script\s*>.*?<\s*\/script\s*>)
If you're actually using the http://www.regexr.com "Reference" menu to build your regex, you can find this under "Quantifiers and Alternation">"Lazy"
Replace \/* by \/.
\/* matches 0 or more "/".
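Both problems can be seen side by side in a small sketch. Python's re module is assumed here, since the question does not name an engine; the second input line, with two script blocks, is a made-up example added to show where laziness matters.

```python
import re

line = "< script > some more javascript</script> ggg <script>"

original = r"(<\s*script\s*>.*<\s*\/*script\s*>)"
fixed = r"(<\s*script\s*>.*?<\s*\/script\s*>)"

# \/* also accepts zero slashes, so the trailing <script> counts as a
# closing tag and the greedy .* stretches to it.
original_match = re.search(original, line).group(1)
print(original_match == line)  # True

# Requiring the slash and making .* lazy stops at the first </script>.
fixed_match = re.search(fixed, line).group(1)
print(fixed_match)  # < script > some more javascript</script>

# Hypothetical line with two blocks: greedy spans both, lazy finds each.
two_blocks = "<script>a</script> x <script>b</script>"
greedy_blocks = re.findall(r"(<\s*script\s*>.*<\s*\/script\s*>)", two_blocks)
lazy_blocks = re.findall(fixed, two_blocks)
print(greedy_blocks)  # ['<script>a</script> x <script>b</script>']
print(lazy_blocks)    # ['<script>a</script>', '<script>b</script>']
```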

Why do I get successful but empty regex matches?

I'm searching for the pattern (.*)\\1 in the text blabl with regexec(). I get successful but empty matches in the regmatch_t structures. What exactly has been matched?
The regex .* can successfully match a string of zero characters, that is, the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
So if your regex was /f(.*)\1/ and your text was "fab", it would match just the "f", with the parens capturing the nothing between the 'f' and the 'a'. (On "foo" the group would capture the first 'o' instead, because the second 'o' can repeat it.)
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
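The cases discussed in this thread can be checked with a short sketch. It assumes Python's re module rather than POSIX regexec(), so it is an illustration of how the backreference behaves, not a reproduction of the C call; the variable names are illustrative.

```python
import re

# (.*)\1 on "blabl": nothing at position 0 is immediately repeated, so the
# only way to succeed there is an empty group followed by an empty \1.
empty = re.search(r"(.*)\1", "blabl")
print(empty.span(), repr(empty.group(1)))  # (0, 0) ''

# With .+ the group must be non-empty, and "blabl" has no adjacent repeat.
no_match = re.search(r"(.+)\1", "blabl")
print(no_match)  # None

# Re-match immediately.
immediate = re.search(r"(.+)\1", "BLABLA")
print(immediate.group(0), immediate.group(1))  # BLABLA BLA

# Re-match later, with anything in between.
later = re.search(r"(.+).*\1", "BLAahemBLA")
print(later.group(0), later.group(1))  # BLAahemBLA BLA
```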