Regex for "starts with," "does not contain," and "ends with" - regex

I'm trying to search for code within a WordPress site, specifically for a facebook pixel. I'm searching for strings using a regex and I know what the string starts with, ends with, and what the string should NOT contain. I have tried other solutions on SO but with no luck.
The string should start with:
fbq('track'
End with:
);
and NOT contain:
PageView
The expression that I have been playing with to try and do this search is:
^(?=^fbq('track')(?=.*\);$)(?=^(?:(?!PageView).)*$).*$/
From this other StackOverflow question:
Combine Regexp?
However, I keep getting back that this is in an invalid format.

You may use:
^(?!.*PageView)fbq\('track.*\);$
Or:
^fbq\('track(?!.*PageView).*\);$
Demo.
Breakdown:
^ - Beginning of the string.
(?!.*PageView) - Negative Lookahead (does not contain "PageView" from this point forward).
fbq\('track - Match "fbq('track", literally (notice how "(" is escabed: \().
.* - Match zero or more characters (any characters).
\); - Match ");" literally.
$ - End of string.

You can go with the first one!
I already already test it in the regex software what I use to try the "regexes" when I need to. ;)
I'm going to add my litle gain of sand :)
Here you have a good source to read the look-around and look-behind (and negative-look-behind, etc): https://www.regular-expressions.info/lookaround.html
*It contains iformation about the use and restrictions on the most used regex flavors (and it implementation in some programming languages).

First of all, if you are not able to locate the FB Pixel, check if you have Google Tag Manager on the site and perhaps it is added via GTM,
If not, then on with the RegEx...
As this is a script in a template file where it can span multiple lines and have spaces before the text etc, a more flexible pattern would be appropriate.
So the main idea is that you don't use ^ and $ in your pattern.
Example
fbq\('track'(?!.*?PageView)[^)]*\);
The pattern above satisfies the requirements you outlined in the OP, where
fbq\('track' - Literally matches fbq('track' as the start of the string
(?!.*?PageView) - Negative lookahead to fail if PageView is found, .*? is used to lazy match 0 or more characters as we would find PageView sooner than later and don't need to backtrack
As the lookahead above is 0 length, if it passed(PageView not found) the cursor will still be at the end of - fbq('track' <- Cursor here
[^)]* - Matched 0 or more characters until a closing parenthesis is found excluding it
\); - Match ); literally.
I am guessing you might be using VSCode, PhpStorm or similar so I selected JS as the flavor in the example for for compatibility.
If you are using grep say in Linux or a bash terminal on Windows(Not sure of Mac due to grep param compatibility) running this from the Theme directory should show you the files and matches.
grep -Pzro 'fbq\('\''track'\''(?!.*?PageView)[^)]*\);'

Related

Regex: extract characters from two patterns

I have the following string:
https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...
https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...
I'd like to extract jbkwe3.ho4eho.8495GETD or abcde2.hopeho.3345GETD. Anything between the {weekday}/ and the ?weatherType=.
I've tried (?<=sunday\/)$.*?(?=\?weatherType=) but it only works for the first line and I want to make it applicable to all strings regardless the value of {weekday}.
I tried (?<=\/.*\/)$.*?(?=\?weatherType=) but it didn't work. Could anyone familiar with Regex can lend some help? Thank you!
[Update]
I'm new to regex but I was experimenting it on sublime text editor via the "find" functionality which I think should be PCRE (according to this post)
Try this regex:
(?:sun|mon|tues|wednes|thurs|fri|satur)day\/\K[^?]+(?=\?weatherType)
Click for Demo
Link to Code
Explanation:
(?:sun|mon|tues|wednes|thurs|fri|satur)day - matches the day of a week i.e, sunday,monday,tuesday,wednesday,thursday,friday,saturday
\/ - matches /
\K - unmatches whatever has been matched so far and pretends that the match starts from the current position. This can be used for the PCRE.
[^?]+ - matches 1 or more occurences of any character that is not a ?
(?=\?weatherType) - the above subpattern[^?]+ will match all characters that are not ? until it reaches a position which is immediately followed by a ? followed by weatherType
To make the match case-insensitive, you can prepend the regex with (?i) as shown here
In the examples given, you actually only need to grab the characters between the last forward slash ("/") and the first question mark ("?").
You didn't mention what flavor regex (ie, PCRE, grep, Oracle, etc) you're using, and the actual syntax will vary depending on this, but in general, something like the following (Perl) replacement regex would handle the examples given:
s/.*\/([^?]*)\?.*/$1/gm
There are other (and more efficient) ways, but this will do the job.

Can I exclude Positive Lookaheads and Lookbehinds within a snippet in vscode?

I am having issues excluding parts of a string in a VSCode Snippet. Essentially, what I want is a specific piece of a path but I am unable to get the regex to exclude what I need excluded.
I have recently asked a question about something similar which you can find here: Is there a way to trim a TM_FILENAME beyond using TM_FILENAME_BASE?
As you can see, I am getting mainly tripped up by how the snippets work within vscode and not so much the regular expressions themselves
${TM_FILEPATH/(?<=area)(.+)(?=state)/${1:/pascalcase}/}
Given a file path that looks like abc/123/area/my-folder/state/...
Expected:
/MyFolder/
Actual:
abc/123/areaMyFolderstate/...
You need to match the whole string to achieve that:
"${TM_FILEPATH/.*area(\\/.*?\\/)state.*/${1:/pascalcase}/}"
See the regex demo
Details
.* - any 0+ chars other than line break chars, as many as possible
area - a word
-(\\/.*?\\/) - Group 1: /, any 0+ chars other than line break chars, as few as possible, and a /
-state.* - state substring and the rest of the line.
NOTE: If there must be no other subparts between area and state, replace .*? with [^\\/]* or even [^\\/]+.
The expected output seems to be different with part of a string in the input. If that'd be desired the expression might be pretty complicated, such as:
(?:[\s\S].*?)(?<=area\/)([^-])([^-]*)(-)([^\/])([^\/]*).*
and a replacement of something similar to /\U$1\E$2$3\U$4\E$5/, if available.
Demo 1
If there would be other operations, now I'm guessing maybe the pascalcase would do something, this simple expression might simply work here:
.*area(\\/.*?\\/).*
and the desired data is in this capturing group $1:
(\\/.*?\\/)
Demo 2
Building on my answer you linked to in your question, remember that lookarounds are "zero-length assertions" and "do not consume characters in the string". See lookarounds are zero-length assertions:
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
So in your snippet transform: /(?<=area)(.+)(?=state)/ the lookaround portions are not actually consumed and so are simply passed through. Vscode treats them, as it should, as not actually being within the "part to be transformed" segment at all.
That is why lookarounds are not excluded from your transform.

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions, what you want is the literal .. The way you had, it was a wildcard - could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.

Google Analytics Regex - Alternative to no negative lookahead

Google Analytics does not allow negative lookahead anymore within its filters. This is proving to be very difficult to create a custom report only including the links I would like it to include.
The regex that includes negative lookahead that would work if it was enabled is:
test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
This matches:
test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23
test.com/?ref=23&e=35
and does not match (as it should):
test.com/ambassadors
test.com/admin/?signup=true
test.com/randomtext/
I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.
Thank you!
Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.
That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have have had this problem; you should have been using $ all along.
However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:
test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)
...which I'm pretty sure you don't want. :P
Try this regex:
test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$
or more readably:
test\.com
(?:
/
(?:index_\w+\.php)?
(?:
\?ref=\d+
(?:
&e=\d+
)?
)?
)?
\s*$
For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:
^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
Firstly I think your regex needs some fixing. Let's look at what you have:
test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
The case where you use the optional ? at the start of index... is already taken care of by the second alternative:
test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:
test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:
test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now the first and second and third option can be collapsed into one, if we make the file name optional, too:
test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)
Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
Seeing your input examples, this seems to be closer to what you actually want to match.
Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
So, if you use singleline mode (which probably means you have only one URL per string), use this:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z
If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciĆ 
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve your just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, it there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.