Regular expression to exclude URLs from web crawler - regex

I am using an online tool to crawl my client's website and provide a list of pages / URLs that exist on it.
There is an option to exclude pages, and it gives a regex example of \?.*page=.*$
I would like to ignore everything in the news section (apart from the News page itself)
So would I go with the following?
\?.*news/.*$

If I understand you correctly, you're looking for a regex that matches news/foo or news/foo/bar, but not news/.
You can use this regex for that: .*news/.+
.* matches zero or more characters before news/
news/ matches the literal text news/
.+ matches one or more characters after news/
http://regexr.com/3ffj1
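
To check the behaviour outside the crawler, here is a minimal Python sketch. It assumes the tool applies the pattern with "search" semantics (match anywhere in the URL), which the question does not confirm. The example.com URLs and the is_excluded helper are made up for illustration. It also shows that the pattern from the question requires a literal "?" in the URL, so it would not exclude ordinary news pages.

```python
import re

# Pattern from the answer: "news/" followed by at least one more character.
EXCLUDE_NEWS_CHILDREN = re.compile(r".*news/.+")

# Pattern from the question: the leading \? demands a literal "?" before "news/".
QUESTION_ATTEMPT = re.compile(r"\?.*news/.*$")


def is_excluded(url, pattern=EXCLUDE_NEWS_CHILDREN):
    """Return True if the pattern matches somewhere in the URL (assumed crawler behaviour)."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in (
        "https://example.com/news/",           # the News page itself: kept
        "https://example.com/news/article-1",  # inside the news section: excluded
        "https://example.com/about",           # unrelated page: kept
    ):
        print(url, "excluded" if is_excluded(url) else "kept")
```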

Related

REGEX Match a String (Google Analytics)

In Google Analytics, I need to pull out only the links that consist of plain text, excluding any URL that contains numbers or a query string.
So I need this URL:
www.site.com/en/rent/cairo/apartments-for-rent/
and exclude these
www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/
www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/?price=1000
Thank you
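If each URL is on its own line, and that's the only thing on the line (not even whitespace), this simple regex will do the trick: ^[^0-9? ]*$ (inside a character class, | is a literal pipe rather than alternation, so no separators are needed).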
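
A quick way to sanity-check the character-class idea is to run it over the three URLs from the question. This is a sketch in Python rather than in Google Analytics itself, so the exact regex dialect is an assumption. The class is written without | separators because a pipe inside [...] is matched literally. The is_plain_url name is hypothetical.

```python
import re

# A line made up only of characters that are not digits, "?" or spaces.
PLAIN_URL = re.compile(r"^[^0-9? ]*$")


def is_plain_url(url):
    """Return True when the URL has no digits, no query marker and no spaces."""
    return PLAIN_URL.match(url) is not None


if __name__ == "__main__":
    urls = [
        "www.site.com/en/rent/cairo/apartments-for-rent/",
        "www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/",
        "www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/?price=1000",
    ]
    for url in urls:
        print(is_plain_url(url), url)
```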

Regex to match all except URLs that contain specific directory?

I need a regular expression for IIS URL Rewrite that will process the rule only when the expression matches any bit of the URL EXCEPT a specific sub-root directory.
Example:
www.mysite.com/wordpress - process rule on any URL that starts with /wordpress after the domain name
www.mysite.com/inventory - do not process rule on any URL that starts with /inventory after the domain name
Tried .*(?<!^\/inventory\/.*) but it still matches the entire string.
You need a lookahead rather than lookbehind. Something like this I think:
^([^/]*/){1}(?!inventory\b)
Where you change 1 to 2 when the exclusion is needed at the next lower sublevel, etc.
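
The lookahead can be tried out in Python, whose regex syntax supports the same (?!...) construct. This sketch assumes the input string looks like the examples in the question (host name followed by the path); what IIS URL Rewrite actually passes to the pattern may differ, so treat it as an illustration of the lookahead only. The rule_applies helper is hypothetical.

```python
import re

# One "segment/" from the start, then refuse to continue if "inventory" comes next.
NOT_INVENTORY = re.compile(r"^([^/]*/){1}(?!inventory\b)")


def rule_applies(url):
    """Return True when the URL does not start with the excluded directory."""
    return NOT_INVENTORY.search(url) is not None


if __name__ == "__main__":
    for url in (
        "www.mysite.com/wordpress",
        "www.mysite.com/wordpress/some-post",
        "www.mysite.com/inventory",
        "www.mysite.com/inventory/item-1",
    ):
        print(url, "process" if rule_applies(url) else "skip")
```

Because of the \b, a directory such as /inventory2 is still processed; only the exact name is excluded.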

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.
This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
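
As a rough illustration of both points, the sketch below uses Python (the question does not name a language, so that choice is an assumption). The first function reads whichever of the two capture groups matched; the second uses the standard library's urllib.parse.urlsplit as an example of a built-in URL parser. Both function names are made up for this example.

```python
import re
from urllib.parse import urlsplit

# Group 1 holds the domain for http(s) URLs, group 2 holds the whole string otherwise.
DOMAIN_OR_WHOLE = re.compile(r"(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)")


def domain_or_whole(text):
    """Regex version: return the domain of a URL, or the input itself if it is not a URL."""
    match = DOMAIN_OR_WHOLE.search(text)
    if match is None:
        return None
    return match.group(1) if match.group(1) is not None else match.group(2)


def domain_or_whole_parsed(text):
    """Parser version: netloc is empty when the text has no scheme://host part."""
    return urlsplit(text).netloc or text


if __name__ == "__main__":
    for text in ("http://mozart.co.uk", "https://avocado.si/hmm",
                 "http://www.qwe123qwe.com", "Starbucks", "Benchmark 123"):
        print(repr(text), "->", domain_or_whole(text), "|", domain_or_whole_parsed(text))
```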
Maybe try ^(http[s]?:\/\/)?(.*)$. You can experiment with it here: https://regex101.com/r/iZ2vL4/1
This regex has several capture groups; the domain you want will be in the 4th group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
To check your URLs, use the Regex101.com workbench: just paste them into the "TEST STRING" text box.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!
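
For anyone who wants to try it outside Regex101, here is a sketch of the same pattern in Python (an assumed choice of language). The /.../mg wrapper is dropped: m becomes re.MULTILINE, and g corresponds to calling search or finditer as needed. The domain_from_url helper is hypothetical. Note that a string with no dot in it, such as a plain company name, does not match at all, so it has to be handled separately.

```python
import re

URL_PARTS = re.compile(
    r"^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?"
    r"([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)"
    r"($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$",
    re.MULTILINE,
)


def domain_from_url(text):
    """Return capture group 4 (the domain), or None when the pattern does not match."""
    match = URL_PARTS.search(text)
    return match.group(4) if match else None


if __name__ == "__main__":
    for text in ("http://mozart.co.uk", "https://avocado.si/hmm",
                 "http://www.qwe123qwe.com", "Starbucks"):
        print(repr(text), "->", domain_from_url(text))
```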

RegEx pattern to handle URL with dates

I moved to a new website and it mangled my URLs. Now blog posts are accessible from multiple URLs, and I would like to redirect one pattern to the other.
I am trying to redirect the first case to the second case:
~/blogs/johndoe/john-doe/2014/03/14/test-article1 =>
~/blogs/john-doe/2014/03/14/test-article1
~/blogs/jimjones/jim-jones/2014/03/14/test-articleb =>
~/blogs/jim-jones/2014/03/14/test-articleb
How do I create a pattern smart enough to slice out the first "johndoe" and "jimjones"? I am using this for IIS rewrite but I think any RegEx should work. Thanks for any help.
This works:
^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$
It just discards the non-dash name. It doesn't know whether it's equal to the dash name or not. It also assumes that the date numbers are valid, so 9899/45/33 would be matched.
Capture groups:
1. First name
2. Last name
3. Year
4. Month
5. Day
6. Article name
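
To show how the six groups could be stitched back together, here is a small Python sketch. It is not IIS rewrite configuration; it only illustrates the substitution, and the canonical_blog_url name is made up for the example.

```python
import re

OLD_BLOG_URL = re.compile(
    r"^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$"
)


def canonical_blog_url(url):
    """Drop the first (non-dash) name segment; URLs that do not match are returned unchanged."""
    return OLD_BLOG_URL.sub(r"~/blogs/\1-\2/\3/\4/\5/\6", url)


if __name__ == "__main__":
    for url in (
        "~/blogs/johndoe/john-doe/2014/03/14/test-article1",
        "~/blogs/jimjones/jim-jones/2014/03/14/test-articleb",
        "~/blogs/john-doe/2014/03/14/test-article1",  # already in the target form
    ):
        print(url, "=>", canonical_blog_url(url))
```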
I don't know about IIS rewrites, but this should work:
^~\/blogs\/[a-z]+\/ -> ~/blogs/
The regular expression matches the start of the string, followed by ~/blogs/, followed by a run of lowercase letters and a trailing slash.
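
The same prefix replacement can be sketched in Python to see what it does to different inputs (again an illustration only, not IIS syntax; strip_duplicate_name is a hypothetical name). A hyphenated name is left alone because [a-z]+ stops at the hyphen, but an author whose only segment is a single lowercase word would lose it, so the rule relies on every real author name containing a hyphen.

```python
import re

DUPLICATE_NAME_PREFIX = re.compile(r"^~/blogs/[a-z]+/")


def strip_duplicate_name(url):
    """Remove the first all-lowercase segment after ~/blogs/, if there is one."""
    return DUPLICATE_NAME_PREFIX.sub("~/blogs/", url)


if __name__ == "__main__":
    for url in (
        "~/blogs/johndoe/john-doe/2014/03/14/test-article1",
        "~/blogs/john-doe/2014/03/14/test-article1",
        "~/blogs/madonna/2014/03/14/post",  # caveat: the only name segment is removed
    ):
        print(url, "=>", strip_duplicate_name(url))
```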
I don't use IIS, but this should be at least close.
Pattern:
^blogs/\w+/([\w-]+/.*)$
Action:
blogs/{R:1}

How 'Exclude URLs With regex' In Live HTTP headers

I want to exclude some URLs from Live HTTP Headers (a Firefox add-on).
So in the Config area I checked "Exclude URLs with regex" and put the string below in it:
.gif$|.jpg$|.ico$|.css$|.js$|.png$|.bmp$|.jpeg$|google$|bing$|alexa$
I want to exclude all images from capturing, as well as any URL that contains:
css - js - google - bing - alexa
What is wrong with my regex, and could you please fix it for me?
Thanks in advance.
. means "any char"
$ means "the end of the string"
That said:
.gif$ will match "any string ending with gif that is at least 4-char long"
google$ will match "any string ending with google"
I guess you were looking for something like:
[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b
Maybe your regexps get autoanchored with ^ and $ by the tool you're using. In this case, use .* additionally:
.*[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|.*\b(google|bing|alexa)\b.*
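
The difference between the three patterns can be checked with a short Python sketch. How Live HTTP Headers applies the regex is not known here, so both possibilities are modelled: search for a tool that matches anywhere in the URL, and fullmatch for one that anchors the pattern automatically. The example.com URLs and function names are invented for the illustration.

```python
import re

# Pattern from the question: "." is a wildcard and "google$" only matches at the very end.
ORIGINAL = re.compile(
    r".gif$|.jpg$|.ico$|.css$|.js$|.png$|.bmp$|.jpeg$|google$|bing$|alexa$"
)
# Suggested fix for a tool that searches anywhere in the URL.
FIXED = re.compile(r"[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b")
# Variant for a tool that silently wraps the pattern in ^...$.
FIXED_ANCHORED = re.compile(
    r".*[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|.*\b(google|bing|alexa)\b.*"
)


def excluded_by_search(url):
    """Model a tool that looks for the pattern anywhere in the URL."""
    return FIXED.search(url) is not None


def excluded_by_fullmatch(url):
    """Model a tool that requires the pattern to cover the whole URL."""
    return FIXED_ANCHORED.fullmatch(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/logo.gif",
        "http://example.com/app.js",
        "http://www.google.com/search?q=x",
        "http://example.com/page.html",
        "http://example.com/biggif",
    ):
        print(url, ORIGINAL.search(url) is not None,
              excluded_by_search(url), excluded_by_fullmatch(url))
```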