Match all YouTube urls except user accounts and channels urls - regex

Could you please help me find a regex that matches all YouTube URLs except user account and channel URLs?
I am using this regex:
https?:\/\/((www|m)\.)?(youtube\.com|youtu.be)\/((^channel\/|^user\/){0}(embed\/|(watch)?(\?|\/)?v(=|\/)?))(\S+)?
It works fine, but a YouTube URL in the format "https://youtu.be/abcdefgh" is not matched.
Thanks

Use
https?:\/\/((www|m)\.)?(youtube\.com|youtu\.be)\/(?!user\/|channel\/)(embed\/|(watch)?[?\/]?v[=\/]?)?(\S*)
Note that the (embed\/|(watch)?[?\/]?v[=\/]?)? part is now optional thanks to the trailing ? quantifier, so short links such as https://youtu.be/abcdefgh match as well.
The negative lookahead (?!user\/|channel\/) rejects any URL whose path starts with user/ or channel/.
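To check the pattern above against a few URLs, here is a minimal Python sketch. The name is_video_url and the sample URLs are only illustrative assumptions; the regex itself is the one from the answer.

```python
import re

# Regex from the answer, split over three lines for readability.
YOUTUBE_VIDEO_URL = re.compile(
    r"https?:\/\/((www|m)\.)?(youtube\.com|youtu\.be)\/"
    r"(?!user\/|channel\/)"                    # reject user/ and channel/ paths
    r"(embed\/|(watch)?[?\/]?v[=\/]?)?(\S*)"   # optional embed/ or watch?v= prefix
)

def is_video_url(url):
    """Return True if url matches the pattern from its first character."""
    return YOUTUBE_VIDEO_URL.match(url) is not None

print(is_video_url("https://youtu.be/abcdefgh"))                 # True
print(is_video_url("https://www.youtube.com/watch?v=abcdefgh"))  # True
print(is_video_url("https://www.youtube.com/user/someone"))      # False
print(is_video_url("https://www.youtube.com/channel/UCabc"))     # False
```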

Related

How to fix regex url pattern

I need to fix my url pattern:
/^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]/
I thought this regex was ok, but it is not working for urls like: https://xx.xx (without www). 'www' should be optional ((www.)?). Where is the bug?
The problem is not in the (www\.)? part but in the parts after it.
Take a look at the [^\\\/#?] and the [^\.,;:\?\!\#\^\$ -] parts.
Each of them must consume one character, so after https://xx.xx the pattern still requires one character that is none of \/#? and a final character that is none of .,;:?!#^$, space or -. The URL only matches once you add those, for example https://xx.xx11.
I do advise you not to create your own regex, because you are missing a lot!
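This can be checked with a short Python sketch. It assumes the pattern behaves the same in Python's re module as in the asker's engine, and the helper name matches is made up for the example.

```python
import re

# The asker's pattern, copied unchanged apart from dropping the / delimiters.
URL_PATTERN = re.compile(
    r"^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])"
    r"[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]"
)

def matches(url):
    """Return True if the pattern matches somewhere in url."""
    return URL_PATTERN.search(url) is not None

# Nothing is left for the two trailing character classes to consume.
print(matches("https://xx.xx"))      # False
# The two extra characters satisfy the trailing classes.
print(matches("https://xx.xx11"))    # True
# Matches only after backtracking: "www" is consumed as the domain and
# ".xx" as the TLD, which leaves ".xx" for the trailing classes.
print(matches("https://www.xx.xx"))  # True
```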
For example, tlds like .amsterdam are valid. And why are you capturing so many groups?
You can visualize your regex with https://www.debuggex.com/.

How to set up regex in nutch for filtering URL of techcrunch?

I want to crawl the Techcrunch pages published after 1 Jan 2013. The website follows the pattern
http://www.techcrunch.com/YYYY/MM/DD
So my question is how to set up the regex in the Nutch urlfilter so that I crawl only the pages I want. This is what I have tried:
+^http://www.techcrunch.com/2013/dd/dd/([a-z0-9\-A-Z]*\/)*
I don't know Nutch, but have you tried:
+^http://www.techcrunch.com/2013/[0-9]{2}/[0-9]{2}.*$
or
+^http://www.techcrunch.com/2013/[0-9]+/[0-9]+.*$
The following expressions will match the URLs you need:
Without groups
http:\/\/www.techcrunch.com\/\d{4}\/\d{2}\/\d{2}\/\w+
With groups
http:\/\/www.techcrunch.com\/(\d{4})\/(\d{2})\/(\d{2})\/(\w+)
I didn't put anchors (^$), but you can put them if you need them for the filtering.
Try them to see if any of them work.
I don't know how nutch works, but a couple of suggestions about your regex that may apply: the / in the regexp should be escaped; the dd parts should be \d\d so they match two digits.
About setting up the regex, check out this answer to see if it helps you.
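As a rough illustration of the "with groups" idea, the Python sketch below extracts the date from the URL and compares it with a cutoff. It does not model Nutch's filter file at all. The names ARTICLE_URL and published_on_or_after, the sample URLs, and the choice to treat 1 Jan 2013 itself as included are assumptions made for the example.

```python
import re
from datetime import date

# Grouped variant of the pattern, anchored at the start and with literal dots escaped.
ARTICLE_URL = re.compile(r"^http://www\.techcrunch\.com/(\d{4})/(\d{2})/(\d{2})/")

def published_on_or_after(url, cutoff=date(2013, 1, 1)):
    """Return True if url carries a valid YYYY/MM/DD date that is >= cutoff."""
    m = ARTICLE_URL.match(url)
    if m is None:
        return False
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day) >= cutoff
    except ValueError:
        # Digits that do not form a real calendar date, e.g. month 13.
        return False

print(published_on_or_after("http://www.techcrunch.com/2013/02/14/some-post/"))  # True
print(published_on_or_after("http://www.techcrunch.com/2012/12/31/old-post/"))   # False
print(published_on_or_after("http://www.techcrunch.com/about/"))                 # False
```

Comparing real dates rather than hard-coding 2013 in the pattern also covers later years.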

How 'Exclude URLs With regex' In Live HTTP headers

I want to exclude some URLs from Live HTTP headers (a Firefox add-on).
So in the Config area I checked "Exclude URLs With regex" and put the string below in it:
.gif$|.jpg$|.ico$|.css$|.js$|.png$|.bmp$|.jpeg$|google$|bing$|alexa$
I want to exclude all images from capturing, as well as any URL that contains css, js, google, bing or alexa.
What is wrong with my regex, and could you please fix it for me?
thanks in advance
. means "any char"
$ means "the end of the string"
That said:
.gif$ will match "any string ending with gif that is at least 4-char long"
google$ will match "any string ending with google"
I guess you were looking for something like:
[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b
Maybe your regexps get autoanchored with ^ and $ by the tool you're using. In this case, use .* additionally:
.*[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|.*\b(google|bing|alexa)\b.*
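To see what the first suggested pattern does and does not exclude, here is a small Python sketch. It assumes the add-on searches for the pattern anywhere in the URL, as re.search does; the name is_excluded and the sample URLs are made up.

```python
import re

# First suggested pattern: a listed file extension at the very end of the URL,
# or one of the three names as a whole word anywhere in it.
EXCLUDE = re.compile(
    r"[.](gif|jpg|ico|css|js|png|bmp|jpeg)$|\b(google|bing|alexa)\b"
)

def is_excluded(url):
    """Return True if the pattern is found anywhere in url."""
    return EXCLUDE.search(url) is not None

print(is_excluded("http://example.com/logo.png"))        # True
print(is_excluded("http://www.google.com/search?q=x"))   # True
print(is_excluded("http://example.com/index.html"))      # False
# The extension must end the URL, so a query string defeats it.
print(is_excluded("http://example.com/style.css?v=2"))   # False
```

The last case shows that this pattern excludes URLs ending in .css or .js, not every URL that merely contains css or js.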

regex for validate URL without http/https

All,
I am new to the regex world.
I know there are a lot of regexes available for validating common URLs that include http.
But I am looking for a regex to validate URLs in the following formats (without http/https):
www.example.com/user/login
www.example.com
www.example.co.xx
www.example.com/user?id=234&name=fname
In case the URL contains only:
www.example (without the domain: .com or .co.xx)
example.com (without "www")
I should show an error to the user.
any help would be highly appreciated...
Thanks
Raj
This regex will pass your first set, but not match the second set:
^www\.example\.(com|co\.xx)(/.*)?$
In English, this regex requires:
starts with www.example.
followed by either com or co.xx
optionally followed by / then anything
You could be more prescriptive about what can follow the optional slash by replacing (/.*) with (/(user|buy|sell)\?.*) etc
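The behaviour described above can be checked with a short Python sketch. The pattern is the one from the answer with the dot in co.xx escaped so that it only matches a literal dot; the helper name is_valid is an assumption for the example.

```python
import re

# Requires the www. prefix, the name "example", one of the two listed
# domain endings, and optionally a slash followed by anything.
URL_WITHOUT_SCHEME = re.compile(r"^www\.example\.(com|co\.xx)(/.*)?$")

def is_valid(url):
    """Return True if the whole string matches the pattern."""
    return URL_WITHOUT_SCHEME.match(url) is not None

for url in [
    "www.example.com/user/login",
    "www.example.com",
    "www.example.co.xx",
    "www.example.com/user?id=234&name=fname",
    "www.example",    # missing domain ending
    "example.com",    # missing www
]:
    print(url, is_valid(url))
```

The first four URLs are accepted and the last two are rejected.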

My Django URLs not picking up dashes

I'm trying to work out a URL pattern that will match domain.com/about-us/ and domain.com/home/
I have a url regex:
^(?P<page>\w+)/$
but it won't match the url with the - in it.
I've tried
^(?P<page>\.)/$
^(?P<page>\*)/$
but nothing seems to work.
Try:
^(?P<page>[-\w]+)/$
[-\w] accepts letters, digits (0-9), the underscore and the dash.
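The difference between the two patterns can be tried outside Django with plain Python. This sketch assumes the path is matched without the domain or a leading slash; the helper name page_slug is made up for the example.

```python
import re

ORIGINAL = re.compile(r"^(?P<page>\w+)/$")     # pattern from the question
FIXED = re.compile(r"^(?P<page>[-\w]+)/$")     # pattern from the answer

def page_slug(pattern, path):
    """Return the captured page name, or None if path does not match."""
    m = pattern.match(path)
    return m.group("page") if m else None

print(page_slug(ORIGINAL, "home/"))      # home
print(page_slug(ORIGINAL, "about-us/"))  # None
print(page_slug(FIXED, "home/"))         # home
print(page_slug(FIXED, "about-us/"))     # about-us
```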