Trying to combine two Regex - regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3

The first expression matches the start of the string and the second matches the end, so you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
This expression works, but it takes 1660 steps to match the string. This is because each .* between | characters first captures the whole string up to the end, and then steps back character by character in order to find the match.
If you use the non-greedy modifier here (.*?), the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. This reduces the number of steps to 1183.
However, if you want to remove this backtracking (or forward-tracking) altogether, you can skip as many non-| characters as possible in one go with [^|]*. Similarly, we can replace the other .* patterns in the regex. The resulting regex finds a match in just 47 steps, over 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
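As a rough check, the optimized pattern can be run against the sample line with Python's re module, whose (?P<name>...) syntax the patterns already use. This is only a sketch: the variable names are illustrative, and the step counts quoted above come from a regex debugger and are not reproduced here.

```python
import re

# Optimized pattern from the answer: negated character classes replace .*
# so the engine does not have to backtrack between the pipe separators.
LOG_PATTERN = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*"
)

sample = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

match = LOG_PATTERN.match(sample)
# Collect the named groups into a dict (empty if the line did not match).
fields = match.groupdict() if match else {}
print(fields)
```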
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*
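A small sketch of the slash-keeping variant in Python, run against the sample line from the question. The names SLASH_PATTERN and path_with_slash are made up for this example.

```python
import re

# Variant that keeps the slash in front of the last path segment.
SLASH_PATTERN = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*"
)

sample = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

slash_match = SLASH_PATTERN.match(sample)
# The path group now starts with the slash that precedes the file name.
path_with_slash = slash_match.group("path") if slash_match else None
print(path_with_slash)
```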

Related

Regex: Find multiple matching strings in all lines

I'm trying to match multiple strings in a single line using regex in Sublime Text 3.
I want to match all values and replace them with null.
Part of the string that I'm matching against:
"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false
List of strings that it should match:
"MyName"
50
192
200
false
Result after replacing:
"userName":null,"hiScore":null,"stuntPoints":null,"coins":null,"specialUser":null
Is there a way to do this without using sed or any other substitution method, but just by matching the wanted pattern in regex?
You can use this find pattern:
:(.*?)(,|$)
And this replace pattern:
:null\2
The first group matches any character (dot) zero or more times (asterisk), and the question mark makes that quantifier lazy, meaning it matches as little as possible. The second group matches either a comma or the end of the string. In the replace pattern, I substitute the first group with null (as desired) and leave the symbol matched by the second group unchanged.
Here is an alternative to amaurs' answer that doesn't put the comma in after the last substitution:
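Outside Sublime Text, the same find and replace pair can be tried with Python's re.sub. This is a sketch on the sample string from the question only; values that themselves contain commas would need a stricter pattern.

```python
import re

line = (
    '"userName":"MyName","hiScore":50,"stuntPoints":192,'
    '"coins":200,"specialUser":false'
)

# Group 1 is the value (matched lazily); group 2 is the trailing comma
# or the end of the string. \2 in the replacement puts group 2 back.
nulled = re.sub(r":(.*?)(,|$)", r":null\2", line)
print(nulled)
```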
:\K(.*?)(?=,|$)
And this replacement pattern:
null
This works like amaurs' pattern but starts matching after the colon is found (using \K to reset the match starting point) and matches until a comma or the end of the line (using a positive lookahead).
I have tested this, and it works in Sublime Text 2 (so it should work in Sublime Text 3).
Another slightly better alternative to this is:
(?<=:).+?(?=,|$)
which uses a positive lookbehind instead of resetting the match starting point.
Another good alternative (so far the most efficient here):
:\K[^,]*
This may help.
Find: (?<=:)[^,]*
Replace: null
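Python's built-in re module does not support \K, so a quick check outside the editor has to use the lookbehind form. A minimal sketch on the sample string from the question:

```python
import re

line = (
    '"userName":"MyName","hiScore":50,"stuntPoints":192,'
    '"coins":200,"specialUser":false'
)

# (?<=:) asserts that a colon precedes the match without consuming it,
# so only the value itself (everything up to the next comma) is replaced.
nulled_lookbehind = re.sub(r"(?<=:)[^,]*", "null", line)
print(nulled_lookbehind)
```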

Regex not matching the first occurrence of my string

In this url:
http://example.com/SearchResult-Empty.html?caty[]=12345&caty[]=45678
I am trying to use the following regex to grab the first occurrence of caty, which should be "12345". However, the regex below gives me the final occurrence, 45678. I tried using the "?" limiter to make it non-greedy per other Stack Overflow questions, but it isn't working. How can I do this?
^SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?$
As far as I can tell, two things are messing you up:
The anchors ^ and $ seem to be forcing the regex to produce bad matches
You are using greedy .* instead of non-greedy .*?
SearchResult(?:.*?)(caty)(?:.*?)\=([0-9]+)\&?
Should do the job
^SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?$
                ^^
.* is greedy, meaning that it will go to the last occurrence of caty rather than the first. You could check that by providing three caty's in the input string; it will then skip the first two.
.*? makes it non-greedy (aka reluctant), which will consume as little as possible to make a match - stopping at the first occurrence of caty.
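The difference is easy to see in Python with the URL from the question. This sketch drops the ^ and $ anchors and uses re.search, as the first answer suggests; the variable names are illustrative.

```python
import re

url = "http://example.com/SearchResult-Empty.html?caty[]=12345&caty[]=45678"

# Greedy .* runs to the end of the string and backtracks to the LAST caty.
greedy = re.search(r"SearchResult(?:.*)(caty)(?:.*)\=([0-9]+)\&?", url)

# Lazy .*? expands one character at a time and stops at the FIRST caty.
lazy = re.search(r"SearchResult(?:.*?)(caty)(?:.*?)\=([0-9]+)\&?", url)

greedy_id = greedy.group(2)
lazy_id = lazy.group(2)
print(greedy_id, lazy_id)
```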

How to match text when part of it has already been matched previously?

I have a string like aaa**b***c***ddd, and I want to get a sequence of matched text of pattern [^*]\*+[^*], which I think should be [a**b, b***c, c***d]. However, when I test this in a text editor like vim or emacs, the second (b***c) is not matched.
aaa**b***c***ddd
  |--|   |---|
  first  third
     |---|
     second, which I think should be matched but not
How should I modify the regular expression to match the second?
Yes you can, the trick consists to put all in a capturing group inside a lookahead to allow overlapping results:
(?=([^*]\*+[^*]))
But you can't use this to do replacements, since this pattern matches nothing (or perhaps you can, if you can get the capture group length and the current offset).
EDIT:
It seems to be possible to obtain the capture group length in Vim with strlen(submatch(1)).
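For illustration, here is how the lookahead trick behaves in Python's re module rather than in vim or emacs. The overall match is empty, so the text has to be read from capture group 1, and the group's start and end offsets are available if they are needed.

```python
import re

text = "aaa**b***c***ddd"

# A plain search consumes each match, so the overlapping b***c is skipped.
plain = re.findall(r"[^*]\*+[^*]", text)

# Wrapping the pattern in a lookahead consumes nothing, so the engine
# retries at every position and overlapping matches are captured.
overlapping = re.findall(r"(?=([^*]\*+[^*]))", text)

# Offsets of each captured group, useful when the match itself is empty.
spans = [(m.start(1), m.end(1)) for m in re.finditer(r"(?=([^*]\*+[^*]))", text)]
print(plain, overlapping, spans)
```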
@CommuSoft is correct. One way to approach this problem would be to match the whole string against this regex and then, the second time around, match the regex against the substring that starts at (index_of_previous_match + 1) and runs to the end of the string. Hope that is clear.
So if the index of your first match above (a**b) was 2. Then the new substring that you match against the regex the second time should start from index 3 till the end of the string. This will give you the two results.
However, Casimir's answer is much simpler.
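The restart-from-the-next-index approach can be sketched in Python as a loop. Instead of slicing out a substring, this version passes a start position to the compiled pattern's search method, which has the same effect for this pattern; the names found and pos are illustrative.

```python
import re

pattern = re.compile(r"[^*]\*+[^*]")
text = "aaa**b***c***ddd"

found = []
pos = 0
while True:
    m = pattern.search(text, pos)  # search from the current offset onward
    if m is None:
        break
    found.append(m.group(0))
    # Restart one character after the START of this match (not its end),
    # so a match that overlaps the previous one can still be found.
    pos = m.start() + 1

print(found)
```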

Regex: optimal syntax for optional combined expression?

I want to match a combination of expressions that is optional. In this specific example, I want to match on the word through. Also, if the words jump or swim precede through (with whitespace), then match on the whole phrase. So that combination of expressions preceding through must be optional.
I want all the following lines to be positive matches:
swim through <-- match entire phrase
jump through <-- match entire phrase
hike through <-- match only the word "through"
To do this, I can use the following expression:
(jump\W|swim\W)?through
However, is it possible to accomplish the same thing without having to add \W after jump and swim? I was trying something like this:
(jump|swim)?\W?through
But that wasn't working properly because it would include the space that precedes through on the 3rd example. I only want the word through, not the whitespace around it.
What about this one: (?:(jump|swim)\W)?through
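A quick way to check the suggested pattern against the three example lines, sketched in Python. The separator \W sits inside the optional non-capturing group, so it is only consumed when jump or swim is actually present.

```python
import re

phrase = re.compile(r"(?:(jump|swim)\W)?through")

lines = ("swim through", "jump through", "hike through")
# Map each input line to the text that the pattern actually matched.
results = {line: phrase.search(line).group(0) for line in lines}
print(results)
```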

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
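So if your regex was /f(.*)\1/, the group could match the empty string between the 'f' and the first 'o' of "foo" (although an engine that prefers longer matches would capture 'o' and match all of "foo").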
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
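The behaviour described in these answers can be reproduced with Python's re module. This is an illustration only: the question uses POSIX regexec() in C, whose matching rules differ from Python's in general, but for these inputs the empty match arises for the same reason.

```python
import re

# (.*) can capture the empty string, and \1 then re-matches that empty
# string, so the pattern succeeds at position 0 without consuming anything.
empty = re.search(r"(.*)\1", "blabl")

# With .+ the group needs at least one character, and "blabl" has no
# substring that is immediately repeated, so there is no match at all.
no_repeat = re.search(r"(.+)\1", "blabl")

# Immediate re-match versus re-match later in the string.
immediate = re.search(r"(.+)\1", "BLABLA")
later = re.search(r"(.+).*\1", "BLAahemBLA")

print(empty.span(), no_repeat, immediate.group(1), later.group(1))
```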