Regex: extract characters from two patterns - regex

I have the following string:
https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...
https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...
I'd like to extract jbkwe3.ho4eho.8495GETD or abcde2.hopeho.3345GETD. Anything between the {weekday}/ and the ?weatherType=.
I've tried (?<=sunday\/)$.*?(?=\?weatherType=) but it only works for the first line, and I want it to apply to all strings regardless of the value of {weekday}.
I tried (?<=\/.*\/)$.*?(?=\?weatherType=) but it didn't work. Could anyone familiar with regex lend some help? Thank you!
[Update]
I'm new to regex, but I was experimenting with it in the Sublime Text editor via the "find" functionality, which I think should be PCRE (according to this post).

Try this regex:
(?:sun|mon|tues|wednes|thurs|fri|satur)day\/\K[^?]+(?=\?weatherType)
Explanation:
(?:sun|mon|tues|wednes|thurs|fri|satur)day - matches the day of a week i.e, sunday,monday,tuesday,wednesday,thursday,friday,saturday
\/ - matches /
\K - discards whatever has been matched so far and makes the reported match start from the current position. \K is available in PCRE.
[^?]+ - matches 1 or more occurrences of any character that is not a ?
(?=\?weatherType) - the subpattern [^?]+ above will match all characters that are not ? until it reaches a position which is immediately followed by a ? followed by weatherType
To make the match case-insensitive, you can prepend (?i) to the regex.
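For anyone trying this outside PCRE: Python's built-in re module does not support \K, so a rough equivalent uses a capturing group instead. This is only a sketch and not part of the original answer; the function name extract_segment is made up for illustration.

```python
import re

# Same idea as the PCRE pattern above, but with a capturing group in place
# of \K, which Python's built-in re module does not support.
PATTERN = re.compile(
    r"(?:sun|mon|tues|wednes|thurs|fri|satur)day/([^?]+)(?=\?weatherType)",
    re.IGNORECASE,  # plays the role of the (?i) prefix
)


def extract_segment(url):
    """Return the text between '{weekday}/' and '?weatherType', or None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


urls = [
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...",
    "https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...",
]
for url in urls:
    print(extract_segment(url))
```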

In the examples given, you actually only need to grab the characters between the last forward slash ("/") and the first question mark ("?").
You didn't mention which regex flavor (i.e., PCRE, grep, Oracle, etc.) you're using, and the actual syntax will vary depending on this, but in general, something like the following (Perl) replacement regex would handle the examples given:
s/.*\/([^?]*)\?.*/$1/gm
There are other (and more efficient) ways, but this will do the job.
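If you happen to be in Python rather than Perl, the same replacement could be sketched roughly as below. The helper name last_path_segment is hypothetical, and this only mirrors the substitution above for the sample URLs; it is not a general URL parser.

```python
import re


def last_path_segment(url):
    """Keep only the text between the last '/' before the '?' and that '?'."""
    # .*/ is greedy, so it backtracks to the last slash after which
    # [^?]*\? can still match; ([^?]*) captures up to the next '?'.
    # If nothing matches, re.sub returns the input unchanged.
    return re.sub(r".*/([^?]*)\?.*", r"\1", url)


print(last_path_segment(
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&..."
))
```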

Related

Regex for "starts with," "does not contain," and "ends with"

I'm trying to search for code within a WordPress site, specifically for a Facebook pixel. I'm searching for strings using a regex, and I know what the string starts with, what it ends with, and what it should NOT contain. I have tried other solutions on SO, but with no luck.
The string should start with:
fbq('track'
End with:
);
and NOT contain:
PageView
The expression that I have been playing with to try and do this search is:
^(?=^fbq('track')(?=.*\);$)(?=^(?:(?!PageView).)*$).*$/
From this other StackOverflow question:
Combine Regexp?
However, I keep getting back that this is in an invalid format.
You may use:
^(?!.*PageView)fbq\('track.*\);$
Or:
^fbq\('track(?!.*PageView).*\);$
Demo.
Breakdown:
^ - Beginning of the string.
(?!.*PageView) - Negative Lookahead (does not contain "PageView" from this point forward).
fbq\('track - Match "fbq('track", literally (notice how "(" is escaped: \().
.* - Match zero or more characters (any characters).
\); - Match ");" literally.
$ - End of string.
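To see the breakdown in action, here is a small sketch in Python. The sample lines and the function name are invented for illustration; they are not taken from the asker's site.

```python
import re

# First pattern from the answer: anchored to the whole string, with a
# negative lookahead that rejects any string containing "PageView".
PATTERN = re.compile(r"^(?!.*PageView)fbq\('track.*\);$")


def is_non_pageview_track(line):
    """True if the line is an fbq('track ...); call without PageView."""
    return PATTERN.search(line) is not None


# Hypothetical sample lines
for line in [
    "fbq('track', 'Purchase');",
    "fbq('track', 'PageView');",
    "  fbq('track', 'Lead');",  # leading spaces defeat the ^ anchor
]:
    print(repr(line), is_non_pageview_track(line))
```

Note that the anchors make this an all-or-nothing test on the whole string, so indented or multi-line calls will not match.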
You can go with the first one!
I have already tested it in the regex tool I use to try out regexes when I need to. ;)
I'm going to add my little grain of sand :)
Here is a good source to read about lookahead and lookbehind (and negative lookbehind, etc.): https://www.regular-expressions.info/lookaround.html
*It contains information about their use and restrictions in the most common regex flavors (and their implementation in some programming languages).
First of all, if you are not able to locate the FB Pixel, check whether you have Google Tag Manager on the site; perhaps it is added via GTM.
If not, then on with the RegEx...
As this is a script in a template file where it can span multiple lines and have spaces before the text etc, a more flexible pattern would be appropriate.
So the main idea is that you don't use ^ and $ in your pattern.
Example
fbq\('track'(?!.*?PageView)[^)]*\);
The pattern above satisfies the requirements you outlined in the OP, where
fbq\('track' - Literally matches fbq('track' as the start of the string
(?!.*?PageView) - Negative lookahead that fails if PageView is found; .*? lazily matches 0 or more characters, since we expect to find PageView sooner rather than later and don't need to backtrack
As the lookahead above is zero-length, if it passes (PageView not found) the cursor will still be at the end of fbq('track' <- cursor here
[^)]* - Matches 0 or more characters up to, but not including, a closing parenthesis
\); - Match ); literally.
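Here is a rough Python sketch of the unanchored pattern run against a template-like snippet. The snippet contents (including the '0000' pixel id and the Lead event) are made up for the example.

```python
import re

# Unanchored pattern from above: no ^ or $, so it can match anywhere in
# the file and the call may span several lines via [^)]*.
PATTERN = re.compile(r"fbq\('track'(?!.*?PageView)[^)]*\);")

# Hypothetical template fragment
template = """<script>
  fbq('init', '0000');
  fbq('track', 'PageView');
  fbq('track',
      'Lead', {value: 1});
</script>"""

matches = PATTERN.findall(template)
for m in matches:
    print(m)
```

One thing to keep in mind: without a dot-matches-newline flag, the lookahead only inspects the rest of the current line, so a PageView argument placed on a following line would not be excluded.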
I am guessing you might be using VSCode, PhpStorm or similar, so I selected JS as the flavor in the example for compatibility.
If you are using grep, say on Linux or in a bash terminal on Windows (I'm not sure about Mac, due to grep parameter compatibility), running this from the theme directory should show you the files and matches.
grep -Pzro 'fbq\('\''track'\''(?!.*?PageView)[^)]*\);'

Regular expression to exclude tag groups or match only (.*) in between tags

I am struggling with this regex for a while now.
I need to match the text which is in between the <ns3:OutputData> data</ns3:OutputData>.
Note: after ns there could be 1 or 2 digits
Note: the data is in one line just as in the example
Note: the ... preceding and ending is just to mention there are more tags nested
My regex so far: (ns\d\d?:OutputData>)\b(.*)(\/\1)
Sample text:
...<ns3:OutputData>foo bar</ns3:OutputData>...
I have tried (?:(ns\d\d?:OutputData>)\b)(.*)(?:(\/\1)) in an attempt to exclude group 1 and 3.
I want to exclude the tags which are matched, as in the images:
start
end
Any help is much appreciated.
EDIT
There might be some regex interpretation issue with Grep Console for IntelliJ, which is where I intend to use the regex.
Here is the latest image with the best match so far...
Your regex is almost there. All you need to do is to make the inside-matcher non-greedy. I.e. instead of (.*) you can write (.*?).
Another, xml-specific alternative is the negated character-class: ([^<]*).
So, this is the regex: (ns\d\d?:OutputData>)\b(.*?)(\/\1) You can experiment with it here.
Update
To make sure that the only group is the one that matches the text, you have to make it work without backreferences: (?:ns\d\d?:OutputData>)\b(.*?)<
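As a quick check, the backreference-free pattern can be tried in Python like this. It is only a sketch; the helper name output_data is invented, and the sample text is the one from the question.

```python
import re

# Non-capturing group for the tag, lazy capture for the content.
PATTERN = re.compile(r"(?:ns\d\d?:OutputData>)\b(.*?)<")


def output_data(text):
    """Return the contents of every OutputData element found in text."""
    # With exactly one capturing group, findall returns just that group.
    return PATTERN.findall(text)


print(output_data("...<ns3:OutputData>foo bar</ns3:OutputData>..."))
```

The \b is doing real work here: it stops the pattern from starting a new match at the closing tag, because there is no word boundary between ">" and the "." or "<" that follows it.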
Update 2
It's possible to match only the required parts using lookbehind:
(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)
Explanation:
The two alternatives are almost identical. The only difference is the number of digits. This is important because some flavors support only fixed-length lookbehinds.
Checking alternative one, we put the starting tag into one lookbehind (?<=...) so it won't be included into the full match.
Then we match every character that is not a < greedily: [^<]*. This stops matching at the first closing tag.
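Python's re module is one of the flavors that only accepts fixed-width lookbehinds, so it makes a reasonable test bed for the two-alternative pattern. A minimal sketch (the function name tag_content is hypothetical):

```python
import re

# Each lookbehind has a fixed width (one digit vs. two digits), which is
# what fixed-width-only flavors such as Python's re require.
PATTERN = re.compile(
    r"(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)"
)


def tag_content(text):
    """Return the text inside the first OutputData element, or None."""
    match = PATTERN.search(text)
    # group(0) is the whole match; the tags sit in the lookbehind,
    # so they are not part of it.
    return match.group(0) if match else None


print(tag_content("...<ns3:OutputData>foo bar</ns3:OutputData>..."))
```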
Essentially, you need a look behind and a look ahead with a back reference to match just the content, but variable length look behinds are not allowed. Fortunately, you have only 2 variations, so an alternation deals with that:
(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)
The entire match is the target content between the tags, which may contain anything (including left angle brackets etc).
Note also the reluctant quantifier .*?, so the match stops at the next matching end tag, rather than greedy .* that would match all the way to the last matching end tag.
See live demo.
This was the answer in my case:
(?<=(ns\d:OutputData)>)(.*?)(?=<\/\1)
The answer is based on the three solutions given by @WiktorStribiżew (in the comments).
The last one worked, and I have made a slight modification to it.
Thanks all for the effort, and especially @WiktorStribiżew!
EDIT
OK, yes @Bohemian, it does not match 2 digits; I forgot to update:
(?<=(ns\d{0,2}:OutputData)>)(.*?)(?=<\/\1)

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest that you simply prepend ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and 5 digits).
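A short Python sketch of the anchored pattern, run on the two URLs from the question (the helper name first_tld is made up for the example):

```python
import re

# ^[^/]* keeps the match inside the part of the URL before the first "/",
# so only the first TLD can be followed by "/" and five digits.
PATTERN = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")


def first_tld(url):
    """Return the TLD if the host is directly followed by /ddddd, else None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(first_tld("domain.net/12345/stuff=www.google.com/12345"))
print(first_tld("domain.net/stuff/stuff=www.google.com/12345"))
```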
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ////, etc

I want to find all strings containing at least 1 Cyrillic character (basically /.*[А-я].*/) but with exception of comments.
Comment is a string or part of a string which starts with 2 or more / characters.
Currently I have this regex, which does part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like to get less bloated and faster expression.
And as additional question: could anyone explain why this one is working? I combined it by trial-and-error but I'm not sure I completely understood how it works, because when I try to change it in any part - it stops working.
The following regex will match any Cyrillic character that is not preceded by a double forward slash
(?<!/{2}.*)[А-я]
It specifies that it should not be preceded by a double slash by using a negative lookbehind.
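You haven't specified what flavour of regex you're using, but be aware that some flavours restrict lookarounds. For example, PCRE does not allow an unbounded quantifier such as .* inside a lookbehind, so the pattern above is rejected there; it does work in flavours that allow variable-length lookbehind, such as .NET and current JavaScript engines. You are using 3 lookaheads in your regex, so I presume lookarounds in general are OK.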
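For flavours without variable-length lookbehind, a different technique can give the same yes/no answer per line: walk forward from the start of the line, refusing to step over a "//", and require a Cyrillic letter along the way. The Python sketch below uses that swapped-in approach (it is not the lookbehind from this answer), and the function name is invented.

```python
import re

# Consume characters only while no "//" starts at the current position,
# then require one Cyrillic letter. Only lookahead is used, so this also
# works in flavours that lack variable-length lookbehind.
PATTERN = re.compile(r"^(?:(?!//).)*?[А-я]")


def has_cyrillic_outside_comment(line):
    """True if the line has a Cyrillic letter before any // comment."""
    return PATTERN.search(line) is not None


for line in [
    'name = "привет"; // greeting',
    "x = 1; // привет",
    "/// только комментарий",
]:
    print(has_cyrillic_outside_comment(line))
```

Like the question's definition of a comment, this treats any "//" as a comment start, including one inside a string literal such as a URL.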

Adding "/index.html" to paths in Vim

I'm trying to append "/index.html" to some folder paths in a list like this:
path/one/
/another/index.html
other/file/index.html
path/number/two
this/is/the/third/path/
path/five
sixth/path/goes/here/
Obviously the text only needs to be added where it does not exist yet. I could achieve some good results with (vim command):
:%s/^\([^.]*\)$/\1\/index.html/
The only problem is that after running this command, some lines like the 1st, 5th and 7th in the previous example end up with duplicated slashes. That's easy to solve too: all I have to do is search for duplicates and replace them with a single slash.
But the question is:
Isn't there a better way to achieve the correct result at once?
I'm a Vim beginner, and not a regex master either. Any tips are really appreciated!
Thanks!
So very close :)
Just add an optional slash to the end of the regex:
\/\?
Then you need to change the rest of the pattern to a non-greedy match so that it ignores a trailing slash. The syntax for a non-greedy match in vim (replacing the *) is:
\{-}
So we end up with:
:%s/^\([^\.]\{-}\)\/\?$/\1\/index.html/
(Doesn't hurt to be safe and escape the period.)
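If you want to sanity-check the logic outside Vim, the same substitution can be sketched in Python, where the lazy quantifier is written *? instead of \{-}. The function name add_index is hypothetical; the paths are the ones from the question.

```python
import re


def add_index(path):
    """Append /index.html to a path that contains no dot."""
    # ([^.]*?) lazily takes the path, /? swallows an optional trailing
    # slash, and the anchors force the whole line to be dot-free.
    return re.sub(r"^([^.]*?)/?$", r"\1/index.html", path)


paths = [
    "path/one/",
    "/another/index.html",
    "other/file/index.html",
    "path/number/two",
    "this/is/the/third/path/",
    "path/five",
    "sixth/path/goes/here/",
]
for p in paths:
    print(add_index(p))
```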
Vim's regex can match a bit of text foo only if it is (or isn't) preceded or followed by some other text bar, without including bar in the match, and this is exactly the sort of thing you're looking for. Here you want to match the end of the line with an optional /, but only if it isn't preceded by index.html, and then replace it with /index.html. A quick look at Vim's help tells me \@<! is exactly what to use. It tells Vim that the preceding atom must not match just before what follows, and that atom is not included in what's matched. With a little experimentation, I get
:%s;/\?\(index\.html\)\@<!$;/index.html;
I use ; to delimit the parts of the :s command so that I don't have to escape any / in the regex or replacement expression. In this particular situation, it's not a big deal though.
The / is optional, and we say so with \?.
We need to group index.html together because otherwise our special \@<! would only affect the l.
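For comparison, roughly the same idea can be written in Python, where the negative lookbehind is spelled (?<!...) and goes before the position it guards. This is a sketch under the assumption that each line is processed separately; the function name is made up.

```python
import re


def add_index_lookbehind(path):
    """Append /index.html unless the line already ends with index.html."""
    # /? eats an optional trailing slash; (?<!index\.html)$ only succeeds
    # at an end of line that is not directly preceded by index.html.
    # count=1 mirrors Vim's :s without the g flag (first match only);
    # without it, Python would also replace the empty match left at the
    # end of the line after a trailing slash has been consumed.
    return re.sub(r"/?(?<!index\.html)$", "/index.html", path, count=1)


for p in ["path/one/", "other/file/index.html", "path/five"]:
    print(add_index_lookbehind(p))
```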