Regular expression to exclude URLs from web crawler

I am using an online tool to crawl my client's website and provide a list of pages / URLs that exist on it.
There is an option to exclude pages, and it gives a regex example of \?.*page=.*$
I would like to ignore everything in the news section (apart from the News page itself).
Would the following work?
\?.*news/.*$

If I understand you correctly, you're looking for a regex that matches news/foo or news/foo/bar, but not news/.
You can use this regex for that: .*news/.+
.* matches zero or more characters before news/
news/ matches the literal text news/
.+ matches one or more characters after news/, so news/ on its own does not match
http://regexr.com/3ffj1
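To see how the suggested pattern behaves, here is a minimal Python sketch. The example.com URLs are made up for illustration, and it assumes the crawler applies the pattern as a search over the whole URL, which the tool may or may not do.

```python
import re

# Pattern from the answer above; the sample URLs are hypothetical.
EXCLUDE = re.compile(r".*news/.+")

def is_excluded(url: str) -> bool:
    # True when the URL has at least one character after "news/".
    return EXCLUDE.search(url) is not None

urls = [
    "http://example.com/news/",
    "http://example.com/news/foo",
    "http://example.com/news/foo/bar",
    "http://example.com/about/",
]

# Only pages below the news section end up in this list.
excluded = [u for u in urls if is_excluded(u)]
print(excluded)
```

The News page itself (news/ with nothing after it) is kept, while anything deeper is excluded.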

Related

REGEX Match a String (Google Analytics)

I need to pull out only the links whose URLs contain no numbers and no query strings in Google Analytics.
So, I need this URL:
www.site.com/en/rent/cairo/apartments-for-rent/
and exclude these
www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/
www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/?price=1000
Thank you

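If each URL is on its own line, and that's the only thing on the line (not even whitespace), this simple regex will do the trick: ^[^0-9? ]*$
Inside a character class | is a literal character rather than alternation, so the excluded characters are listed without separators.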
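A quick way to check this kind of pattern is to run it against the three URLs from the question. The sketch below is in Python rather than Google Analytics, so it only illustrates the regex itself. The pattern is written as ^[^0-9? ]*$, a character class that rejects digits, question marks, and spaces.

```python
import re

# Matches a line only if it contains no digit, no "?", and no space.
NO_DIGITS_OR_QUERY = re.compile(r"^[^0-9? ]*$")

urls = [
    "www.site.com/en/rent/cairo/apartments-for-rent/",
    "www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/",
    "www.site.com/en/buy/apartment-for-sale-in-acacia-compound-new-cairo-947145/?price=1000",
]

# Keep only the URLs that the pattern accepts.
kept = [u for u in urls if NO_DIGITS_OR_QUERY.search(u)]
print(kept)
```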

Regex to match all except URLs that contain specific directory?

I need a regular expression for IIS URL Rewrite that will process the rule only when the expression matches any bit of the URL EXCEPT a specific sub-root directory.
Example:
www.mysite.com/wordpress - process rule on any URL that starts with /wordpress after the domain name
www.mysite.com/inventory - do not process rule on any URL that starts with /inventory after the domain name
Tried .*(?<!^\/inventory\/.*) but it still matches the entire string.

You need a lookahead rather than lookbehind. Something like this I think:
^([^/]*/){1}(?!inventory\b)
Where you change 1 to 2 when the exclusion is needed at the next lower sublevel, etc.
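The lookahead can be tried outside IIS with a short Python sketch. It assumes the input has no http:// prefix, as in the question's examples; the deeper paths (post-1, item-7) are made up for illustration, and IIS URL Rewrite may pass a differently shaped string to the rule.

```python
import re

# Pattern from the answer above. {1} is the number of leading
# slash-terminated segments to skip before the lookahead is checked.
RULE = re.compile(r"^([^/]*/){1}(?!inventory\b)")

def rule_applies(url: str) -> bool:
    # True when the segment after the first slash is not "inventory".
    return RULE.search(url) is not None

samples = [
    "www.mysite.com/wordpress",
    "www.mysite.com/wordpress/post-1",   # hypothetical deeper path
    "www.mysite.com/inventory",
    "www.mysite.com/inventory/item-7",   # hypothetical deeper path
]

results = {url: rule_applies(url) for url in samples}
for url, applies in results.items():
    print(url, applies)
```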

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.

This regex will match them if the string starts with http:// or https://, up to the next slash. If it doesn't start with http:// or https://, it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
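As an illustration of the built-in URL parsing mentioned above, here is a minimal Python sketch using urllib.parse.urlsplit from the standard library. The helper name domain_or_whole is hypothetical, and the expected results are inferred from the question's description (domain for URLs, whole string otherwise).

```python
from urllib.parse import urlsplit

def domain_or_whole(text: str) -> str:
    # Return the host part for http(s) URLs, otherwise the input unchanged.
    parts = urlsplit(text)
    if parts.scheme in ("http", "https") and parts.netloc:
        return parts.netloc
    return text

samples = [
    "http://mozart.co.uk",
    "https://avocado.si/hmm",
    "http://www.qwe123qwe.com",
    "Starbucks",
    "Benchmark 123",
]

extracted = [domain_or_whole(s) for s in samples]
print(extracted)
```

Note that netloc also includes a port or user info if the URL has them, so this is only a starting point.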

Maybe try ^(http[s]?:\/\/)?(.*)$. Play here: https://regex101.com/r/iZ2vL4/1

This regex uses capture groups; the domain you want will be in the 4th group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
To check your URLs, paste them into the "TEST STRING" textbox on Regex101.com.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!

RegEx pattern to handle URL with dates

I moved to a new website and it mangled my URLs. Now blog posts are accessible from multiple URLs and I would like to redirect one pattern to the other.
I am trying to redirect the first case to the second case:
~/blogs/johndoe/john-doe/2014/03/14/test-article1 =>
~/blogs/john-doe/2014/03/14/test-article1
~/blogs/jimjones/jim-jones/2014/03/14/test-articleb =>
~/blogs/jim-jones/2014/03/14/test-articleb
How do I create a pattern smart enough to slice out the first "johndoe" and "jimjones"? I am using this for IIS rewrite but I think any RegEx should work. Thanks for any help.

This works:
^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$
It just discards the non-dash name. It doesn't know whether it's equal to the dash name or not. It also assumes that the date numbers are valid, so 9899/45/33 would be matched.
Capture groups:
First name
Last name
Year
Month
Day
Article name
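The capture groups can be reassembled into the target URL. The sketch below does this in Python with re.sub, not in IIS rewrite syntax; the replacement string is an assumption about what the redirect target should look like, based on the examples in the question.

```python
import re

# Pattern from the answer above; the segment right after "blogs/" is not captured.
PATTERN = re.compile(r"^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$")

def rewrite(path: str) -> str:
    # Rebuild the path from the six groups, dropping the non-dash name.
    # Paths that do not match are returned unchanged.
    return PATTERN.sub(r"~/blogs/\1-\2/\3/\4/\5/\6", path)

old_paths = [
    "~/blogs/johndoe/john-doe/2014/03/14/test-article1",
    "~/blogs/jimjones/jim-jones/2014/03/14/test-articleb",
]

new_paths = [rewrite(p) for p in old_paths]
print(new_paths)
```

A path that is already in the target form does not match the pattern, so it passes through untouched.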

I don't know about IIS rewrites, but this should work:
^~/blogs/[a-z]+/ -> ~/blogs/
The regular expression matches the start of the string, followed by ~/blogs/, followed by a run of lowercase letters and a slash. That whole prefix is replaced with ~/blogs/.

I don't use IIS, but this should be at least close.
Pattern:
^blogs/\w+/(\w+/)
Action:
blogs/{R:1}

How 'Exclude URLs With regex' In Live HTTP headers

I want to exclude some URLs from Live HTTP headers (a Firefox add-on).
In the Config area I checked "Exclude URLs With regex" and entered the string below:
.gif$|.jpg$|.ico$|.css$|.js$|.png$|.bmp$|.jpeg$|google$|bing$|alexa$
I want to stop it capturing all images and any URL that contains:
css - js - google - bing - alexa
What is wrong with my regex, and could you please fix it for me?
Thanks in advance.

. means "any char"
$ means "the end of the string"
That said:
.gif$ will match "any string ending with gif that is at least 4-char long"
google$ will match "any string ending with google"
I guess you were looking for something like:
[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b
Maybe your regexps get autoanchored with ^ and $ by the tool you're using. In this case, use .* additionally:
.*[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|.*\b(google|bing|alexa)\b.*
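The difference between the original and the suggested pattern shows up on a few sample URLs. This is a Python sketch with made-up example.com URLs; it assumes the add-on searches for the pattern anywhere in the URL, which is not confirmed here.

```python
import re

# Pattern from the question: "." matches any character, and the
# site names are only matched at the very end of the string.
ORIGINAL = re.compile(r".gif$|.jpg$|.ico$|.css$|.js$|.png$|.bmp$|.jpeg$|google$|bing$|alexa$")

# First pattern from the answer: a literal dot before the extension,
# and the site names matched as whole words anywhere in the URL.
FIXED = re.compile(r"[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b")

urls = [
    "http://example.com/logo.png",
    "http://example.com/gif",
    "http://www.google.com/search",
    "http://example.com/index.html",
]

# For each URL: (excluded by ORIGINAL, excluded by FIXED)
comparison = {
    u: (ORIGINAL.search(u) is not None, FIXED.search(u) is not None)
    for u in urls
}
for url, flags in comparison.items():
    print(url, flags)
```

The page named /gif is caught by the original only because "." matches the slash, and the google URL is missed by the original because google is not at the end of the string.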
