Regex for excluding URL - regex

I working with an email company that has a feature where they spider your site in order to provide custom content. I have the ability to have the spider ignore urls based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work since it does not have a trailing slash but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/

Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/

It depends on the regexp engine but you can probably either use $ (if the URL is tokenised beforehand) or a match for whitespace and delimiters

Related

What regex in Google Analytics to use for this case?

I'm trying to figure out what landing page regex to use to only show URLs that have only two sub-folders, e.g. see image below: just show green URLs but not the read ones as they have 3+ subfolders. Any advice on how to do this in GA with regex?
Cheers
If you want to match a path having only two components, e.g.
/component1/component2/
Then you may use the following regex:
/[^/]+/[^/]+/
Demo
If your regex tool requires anchors, then add them:
^/[^/]+/[^/]+/$
Is this what you are looking for?
^\/[!#$&-;=?-[]_a-z~]+\/[!#$&-;=?-[]_a-z~]+\/$
The two sections contain all the valid html characters. We're also forcing the regex to start with slash, end with slash and have only one slash in between.

URL pattern to exclude globally in Zap

I am having trouble with regex syntax in OWASP ZAP. I want to exclude from all scans all URLs that contain "web/lib". I've tried to add
^*web/lib*$
under Global Exclude URL option, but it didn't work. Please help - thanks a lot.
It's regex, if you're specifying wildcard you generally want period asterisk. You also probably need to escape the slash.
Eg: https://regex101.com/r/XLPF85/1

JMeter Proxy exclusion patterns still being recorded

I am using JMeter to record traffic in my browser. In my URL Patterns to Exclude are:
.*\.jpg,
.*\.js,
.*\.png
Which looks like they should block these patterns (I've even tested it with a regex tester here)
Yet, I still see plenty of these files get pulled up. In a related forum someone had a similar issue, but his was caused by having additional url parameters afterwards (eg www.website.com/image.jpg?asdf=thisdoesntmatch). However this doesn't seem to be the case here. Can anyone point me in the right direction?
As already mentioned in the question comments it is probably a problem with the trailing characters. The pattern matcher is executed against the complete url including parameters.
So an URL http://example.com/layout.css?id=123 is not matched against the pattern .*\.css The JMeter HTTP Request Sample seperates the Path and the Parameters so it might be not obvious when you look at the URL.
Solution:Change the pattern to support trailing characters .*\.css.*
Explained
.* Any character
\. Matching the . (dot) character
css The character sequence css
.* Any character
Maybe you can do the oposite: leave blank the URL Patterns to exclude and negate those patterns in the URL Patterns to Include box:
(?!..(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.)

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether they match you pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only alphanumeric characters are legal for name and value.

Regex for checking a body of text for a URL?

I have a regex pattern for URL's that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
Obviously this would make sense because my pattern doesn't strong check .com, .co.uk, .com.au etc
I want it to match stackoverflow.com and not the latter.
As I'm no Regex expert, does anyone know of any good Regex patterns for checking for all types of URL's in a body text, while not matching the sentences like above?
If I have to strong check the domain extension, I suppose I'll have to settle.
Here's my pattern, but i don't think it help.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strong check on the domain extension), but here is one possible way to just modify your existing regex.
It requires that you make the assumption that usually links will not mix case in the domain extension, for example you might see .COM or .com but probably not .Com, if you only match domain extensions that don't mix case then you would avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).