Regex to target a page but not its children - regex

I'm trying to write a regular expression to target a URL but not any of its children. My regex is definitely pretty weak and could use some help.
Page I want to target (may include a trailing slash and/or UTM parameters): https://test.com/deals/
Example of a page I do not want to target: https://test.com/deals/Best-Sellers/c/901
My attempt:
.*Deals\/((?!Best).)*

You can use \/deals\/?(?:[?#]\S*)?$
Check on Regex101
This is a bit more permissive than what your question suggests but it might come in handy.
The main thing is that it tries to match /deals at the end of the line. This ensures that you won't match, say https://test.com/best-deals or similar but only the URL that ends with /deals. Also, the final / is optional - you might get https://test.com/deals.
In addition to that, the regex allows for the URL to end with # anchors or ? followed by parameters. The page might allow this right now or in the future - for example, if a link is used that leads to the same page (e.g. to a specific section), you'd get a # added to the URL. Or there might be something like a filter configuration embedded in the URL https://test.com/deals/?sort=price&productsPerPage=15&page=2&minPrice=100.
Finally, you should make your regex case insensitive to account for the fact the URL might also be https://test.com/Deals/. How you set this flag will depend on where you are using the regex, so I am just adding this as a reminder.
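To make the behaviour described above concrete, here is a minimal sketch in Python. The helper name is_deals_page and most of the sample URLs are made up for illustration; only the pattern itself and the two URLs from the question come from this thread. The case-insensitive flag is set the Python way, which may differ from wherever you actually use the regex.

```python
import re

# Pattern from the answer above, compiled case-insensitively as it recommends.
DEALS_PAGE = re.compile(r"\/deals\/?(?:[?#]\S*)?$", re.IGNORECASE)


def is_deals_page(url: str) -> bool:
    """True when the URL ends in /deals, optionally followed by /, ?query or #fragment."""
    return DEALS_PAGE.search(url) is not None


# Illustrative URLs (hypothetical apart from the two taken from the question).
samples = [
    "https://test.com/deals/",
    "https://test.com/deals",
    "https://test.com/Deals/?utm_source=newsletter",
    "https://test.com/deals/#top",
    "https://test.com/deals/Best-Sellers/c/901",
    "https://test.com/best-deals",
]
for url in samples:
    print(f"{url} -> {is_deals_page(url)}")
```

The first four samples are reported as matches and the last two are not, since neither ends in /deals.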

Related

Regex to match url path - Golang with Gorilla Mux

I'm setting up an API endpoint where, after parsing the URL, I get a path such as /profile/<username>, e.g. /profile/markzuck
The username is optional, though, as this endpoint returns the authenticated user's profile if the username is blank.
Rule set:
I'm not the best at regex, but I've created an expression that requires /profile. After that, if there is a following / (e.g. /profile/), then there must be a <username> that matches (\w){1,15}. I also want it to allow anything at all after a further /, e.g. /profile/<username>/<if preceding "/" then anything else>
Although I'm not 100% sure my expression is correct, this seems to work in JavaScript:
/^\/(profile)(\/(?=(\w){1,15}))?/
Gorilla Mux is different, though: it requires the route matching string to always start with a slash, plus some other things I don't understand, such as only accepting non-capturing groups
( found this out by getting this error: panic: route /{_dummy:profile/([a-zA-Z_])?} contains capture groups in its regexp. Only non-capturing groups are accepted: e.g. (?:pattern) instead of (pattern) )
I tried using the same expression I used for JavaScript, which didn't work here. I created a more forgiving expression, handlerFunc("/{_dummy:profile\/[a-zA-Z_].*}"), which does work; however, it doesn't really follow the same rule set as my JavaScript expression.
I was able to come up with my working expression from this SO post here
And Gorilla Mux's docs talk a little bit about how their regex works when explaining how to use the package in the intro section here
My question is what is a similar or equivalent expression to the rule set I described that will work in Gorilla Mux HandlerFunc()?
If you're doing this with mux, then I believe what you need is not regex, but multiple paths.
For the first case, use a path "/profile". For the one containing a user name, use another path "/profile/{userName}". If you really want to use a regex, you can do "/profile/{username:}" to validate the user name. If you need to process anything that comes after username, either register separate paths (/profile/{username}/otherstuff), or register a pathPrefix "/profile/{username}/" and process the remaining part of the URL manually.
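For comparison only, the rule set from the question can be spelled out as one anchored pattern, which makes the three cases (no user name, user name, user name plus trailing path) easy to check. This is a Python sketch, not gorilla/mux code: the name PROFILE_ROUTE and the sample paths are assumptions for illustration, and the named group used here is a capture group, which mux would reject inside a route.

```python
import re

# /profile, optionally followed by /<username> (1-15 word characters),
# optionally followed by / and anything else.
PROFILE_ROUTE = re.compile(r"^/profile(?:/(?P<username>\w{1,15})(?:/.*)?)?$")


def match_profile(path: str):
    """Return (matched, username) for a request path."""
    m = PROFILE_ROUTE.match(path)
    if m is None:
        return (False, None)
    return (True, m.group("username"))


for path in ["/profile", "/profile/markzuck", "/profile/markzuck/photos/1",
             "/profile/", "/profile/this_name_is_way_too_long", "/profiles"]:
    print(path, match_profile(path))
```

With separate paths, as suggested above, each of these cases gets its own handler and no such pattern is needed.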

How to use regex to insert dynamic urls in the Analytics conversion funnel?

I have a client who has an ecommerce website running on Wix. She asked me to set up the conversion funnel so she could identify the step at which visitors leave, I mean, those who don't get to the order placed page. On Wix, we have 3 steps/URLs, as below:
Cart: https://www.easyhomedesign.com.br/cart?appSectionParams=%7B%22origin%22%3A%22cart-popup%22%7D
Checkout: https://www.easyhomedesign.com.br/checkout?appSectionParams=%7B%22a11y%22%3Afalse%2C%22cartId%22%3A%2283476f86-4ac9-44ac-8779-4479dde12cc2%22%2C%22storeUrl%22%3A%22https%3A%2F%2Fwww.easyhomedesign.com.br%2F%22%2C%22isFastFlow%22%3Afalse%2C%22isPickupFlow%22%3Afalse%7D
and the Thank you page: https://www.easyhomedesign.com.br/thank-you-page/d28bc342-0afe-40cb-8d98-a5784b6b2f17
After each of those URLs, we have dynamic string values, so in the URL field of each funnel step I need to enter those same URLs with a regex that behaves like "starts with", since we can't know the values at the end of the URLs and the funnel setup section doesn't offer the "Starts with" option. At least, that's the only solution I could think of.
So my idea is to use something like https://www.easyhomedesign.com.br/thank-you-page/$. I don't know regex; that's only an example of what I had in mind, since that part of the URL is the fixed one.
Could someone help me? tks.
Although the final goal setup depends heavily on your Analytics implementation, it is generally possible to cover such a funnel with RegEx in goal funnels.
Google Analytics tracks page visits by path, by default. So https://www.easyhomedesign.com.br/thank-you-page/d28bc342-0afe-40cb-8d98-a5784b6b2f17 will become /thank-you-page/d28bc342-0afe-40cb-8d98-a5784b6b2f17 in your reports. It can be set up to include the domain, and the goal flow can still be created, but it's important to check how the RegEx should be written in that case.
It also matters whether the query parts mentioned above affect the step, that is, whether they are part of the flow or not.
Assuming that the path part is relevant, you can set up something like this.
Cart URL in reports: /cart?appSectionParams=%7B%22origin%22%3A%22cart-popup%22%7D
Cart step RegEx in goal flow: ^\/cart
^ stands for the beginning of the string, but you might need to adjust it if the host is present in your reports. You can also extend it to ^\/cart\? if a query string must be present for the visit to qualify as a cart visit.
Checkout URL in reports: /checkout?appSectionParams=%7B%22a11y%22%3Afalse%2C%22cartId%22%3A%2283476f86-4ac9-44ac-8779-4479dde12cc2%22%2C%22storeUrl%22%3A%22https%3A%2F%2Fwww.easyhomedesign.com.br%2F%22%2C%22isFastFlow%22%3Afalse%2C%22isPickupFlow%22%3Afalse%7D
Checkout step RegEx in goal flow: ^\/checkout
The same applies to the checkout step, both for the beginning of the string and for any required parameters.
Thank you URL in reports: /thank-you-page/d28bc342-0afe-40cb-8d98-a5784b6b2f17
Thank you RegEx, which is essentially the goal's RegEx: ^\/thank-you-page\/[\w-]+
Here, the [\w-]+ part expects alphanumerical characters, underscore, or hyphen to be present. More precisely, one or more must exist there.
The $ sign, mentioned in the OP, could not be used here, as it indicates the end of the string, and therefore the id at the end of the URL would make it non-matching.
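The three step expressions can be sanity-checked outside of Analytics before entering them in the goal setup. The following is a rough Python sketch; the STEPS table and the funnel_step helper are hypothetical names, and Google Analytics' own regex engine may differ in details from Python's.

```python
import re

# Step expressions from the answer above, keyed by an illustrative step name.
STEPS = {
    "cart": re.compile(r"^\/cart"),
    "checkout": re.compile(r"^\/checkout"),
    "thank_you": re.compile(r"^\/thank-you-page\/[\w-]+"),
}


def funnel_step(path: str):
    """Return the name of the first step whose regex matches the path, else None."""
    for name, pattern in STEPS.items():
        if pattern.search(path):
            return name
    return None


paths = [
    "/cart?appSectionParams=%7B%22origin%22%3A%22cart-popup%22%7D",
    "/checkout?appSectionParams=%7B%22a11y%22%3Afalse%7D",  # shortened for readability
    "/thank-you-page/d28bc342-0afe-40cb-8d98-a5784b6b2f17",
    "/thank-you-page/",
    "/products/cart",
]
for path in paths:
    print(path, "->", funnel_step(path))
```

Note that ^\/cart is a pure prefix test, so a path such as /cart-help would also count as the cart step; tighten it (for example with ^\/cart\?) if that matters for your site.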

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. I made some modifications and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting I think this one is about as simple as it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Example here.
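To see what the last expression captures, here is a small Python sketch. The function name split_url and the sample inputs are made up for illustration, and Python's re module is only assumed to behave like the flavor used on the linked tester for this pattern.

```python
import re

# The simplified expression from the edit above, unchanged.
URL_RE = re.compile(
    r"^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$"
)


def split_url(url: str):
    """Return (scheme, host, port, path, rest) or None when the URL is rejected."""
    m = URL_RE.match(url)
    return m.groups() if m else None


for candidate in ["https://foo.co.uk:8080/path?x=1#frag", "foo.com", "foo.co.u", "foo.co."]:
    print(candidate, "->", split_url(candidate))
```

The two inputs from the question that were wrongly accepted, foo.co.u and foo.co., are rejected here because the part after the last dot must be two to six letters.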

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match URLs?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
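The effect of route order can be illustrated with a tiny first-match-wins resolver. This is a Python sketch that only imitates the behaviour described above (top-to-bottom matching, whole-URI anchoring, $1 back-reference); the function resolve and the list names are hypothetical and this is not CodeIgniter's actual implementation.

```python
import re

# Specific routes first, wild card last, as recommended above.
ROUTES = [
    ("home", "pages"),
    ("activity", "user/activity"),
    ("([A-Za-z0-9][A-Za-z0-9_-]{2,254})", "view/slug/$1"),
]


def resolve(uri, routes):
    """Return the target of the first route whose pattern matches the whole URI."""
    for pattern, target in routes:
        m = re.fullmatch(pattern, uri)  # stands in for the framework adding ^ and $
        if m:
            # Fill in the $1 back-reference when the pattern has a capture group.
            return target.replace("$1", m.group(1)) if m.groups() else target
    return None


print(resolve("home", ROUTES))
print(resolve("user-name", ROUTES))

# Same routes with the wild card listed first: it swallows "home" as a slug.
WILDCARD_FIRST = [ROUTES[2], ROUTES[0], ROUTES[1]]
print(resolve("home", WILDCARD_FIRST))
```

With the wild card at the bottom, home resolves to pages; with it at the top, the same request is treated as a user slug.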
Now, that's all the complicated stuff. A little more FYI.
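Remember that dashes - have a special meaning in between [] brackets: they define ranges. A dash at the very start or end of the class (as in yours) is already treated as a literal, but escaping it makes the intent explicit and keeps it safe if the class is reordered later.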
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
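The length bounds are easy to confirm with a quick check. This Python sketch assumes whole-string matching (as the router does) and uses a hypothetical helper name, is_valid_slug.

```python
import re

# One leading alphanumeric plus 2 to 254 further characters: 3 to 255 in total.
SLUG = re.compile(r"[A-Za-z0-9][A-Za-z0-9_\-]{2,254}")


def is_valid_slug(slug: str) -> bool:
    """True when the whole string is a valid slug."""
    return SLUG.fullmatch(slug) is not None


for length in (2, 3, 255, 256):
    print(length, is_valid_slug("a" * length))
```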
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but most frameworks' routing operates based on precedence. In other words, the default controller, 404, etc. routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start-of-line anchor, ^, and a bunch of end-of-line anchors, $.
Hope this helps someone who may run into this problem later.

Regular expression to add base domain to directory

10 websites need to be cached. When viewing the cached pages, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. Examples below:
base domain: http://www.example.com
the problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
they would display correctly if it was img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
regex something like: if (src=") isn't followed by the base domain then add the base domain
Without knowing the language, you can use the (maybe most portable) substitute syntax:
s/^(src=")([^"]+")$/$1www\.example\.com\/$2/
This should do the following:
1. Match the string 'src="' (and capture it in variable $1).
2. Match one or more non-double-quote characters followed by " (and capture them in variable $2).
3. Substitute 'www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.
to check for domain: /www\.example\.com/i should do.
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
$regex_match = '/(src="|href=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/i';
$regex_substitute = '$1http://www.example.com/$2';
$html = preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it, it's just off the cuff. I would be concerned about 3 things. First, the / characters inside the pattern have to be escaped because / is also the delimiter. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http:\/\/www\.example\.com\/) bit, a negative lookahead. This should restrict the match to any src or href that doesn't already start with the base domain, but it depends on the type of regex being used (PCRE supports lookaheads, POSIX does not). The \/? after it swallows a leading slash so that /inc/123.js doesn't end up with a double slash.
The rest of the changes should be fine (I added href=" and also made it case-insensitive with the i flag; preg_replace already replaces every match, but in other languages you may need a global flag such as g, otherwise it will just match once).
I hope that helps.
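Since the question doesn't name a language, here is the same idea as a self-contained Python sketch. It makes a slightly different assumption from the PHP above: any value that already starts with http://, https:// or // is left alone, not just values on the base domain. The names BASE, RELATIVE_ATTR and absolutize are made up for illustration, and a real HTML parser would be more robust than a regex for anything beyond simple cached pages.

```python
import re

BASE = "http://www.example.com"  # base domain from the question

# src="..." or href="..." whose value does not already start with a scheme or "//".
# An optional leading slash is consumed so the result never contains "//" after the host.
RELATIVE_ATTR = re.compile(r'((?:src|href)=")(?!https?://|//)/?([^"]*")', re.IGNORECASE)


def absolutize(html: str, base: str = BASE) -> str:
    """Prefix every relative src/href value with the base domain."""
    return RELATIVE_ATTR.sub(lambda m: f"{m.group(1)}{base}/{m.group(2)}", html)


page = '<img src="thumb/123.jpg"><script src="/inc/123.js"></script>'
print(absolutize(page))
```

Values such as mailto: links or in-page # anchors would also be prefixed by this sketch, so exclude them in the lookahead if the cached pages contain any.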
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+