Regex to match and capture a domain name


Hi, I will have URLs in the following formats:
http://www.youtube.com/v/0PsnoiwMrhA
https://www.youtube.com/v/0PsnoiwMrhA
www.youtube.com/v/0PsnoiwMrhA
http://youtube.com/v/0PsnoiwMrhA
youtube.com/v/0PsnoiwMrhA
All of them must be matched, and the captured domain name should be youtube.
I have tried using
(http://|https://)?(www.)(.?*)(.com|.org|.info|.org|.net|.mobi)
but it fails with a regex parsing error: nested quantifier.
Please help me out.

If you are using a field that you know is in one of these formats, you can retrieve the match from Group 1 using this regex:
^(?:https?://)?(?:www\.)?([^.]+)
In VB.NET:
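Imports System.Text.RegularExpressions

Dim ResultString As String
Try
    ' Group 1 holds everything after the optional scheme and www. up to the first dot
    ResultString = Regex.Match(SubjectString, "^(?:https?://)?(?:www\.)?([^.]+)", RegexOptions.Multiline).Groups(1).Value
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try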
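For readers who want to try the pattern outside .NET, here is a minimal Python sketch that runs the same regex against the five sample URLs from the question. The names DOMAIN_RE, extract_domain and SAMPLES are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer above; group 1 holds the text up to the first dot.
DOMAIN_RE = re.compile(r"^(?:https?://)?(?:www\.)?([^.]+)")


def extract_domain(url):
    """Return the first label after the optional scheme and www., or None."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


# The five URL shapes listed in the question.
SAMPLES = [
    "http://www.youtube.com/v/0PsnoiwMrhA",
    "https://www.youtube.com/v/0PsnoiwMrhA",
    "www.youtube.com/v/0PsnoiwMrhA",
    "http://youtube.com/v/0PsnoiwMrhA",
    "youtube.com/v/0PsnoiwMrhA",
]

for url in SAMPLES:
    print(url, "->", extract_domain(url))
```

Note that the pattern returns whatever precedes the first dot, so a URL with a subdomain other than www would yield that subdomain rather than the domain name.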

(.?*) should be (.*?) - that's the source of your error.
Also, remember to escape the dot unless you want it to match any character.
And since the www. part is optional, you need to add a ? quantifier to that group as well.
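The three fixes can be checked with a short Python sketch. It assumes Python's re module, which also rejects the (.?*) form, although its error message is worded differently from the .NET one. The names compiles, BROKEN and FIXED are hypothetical, and FIXED is only one possible way to apply the advice above.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The pattern from the question, with the misplaced quantifier (.?*).
BROKEN = r"(http://|https://)?(www.)(.?*)(.com|.org|.info|.org|.net|.mobi)"

# Lazy quantifier fixed, dots escaped, www. group made optional.
FIXED = r"(https?://)?(www\.)?(.*?)(\.com|\.org|\.info|\.net|\.mobi)"

print(compiles(BROKEN))
print(compiles(FIXED))

# Group 3 is the lazily matched text between the optional prefix and the TLD.
match = re.match(FIXED, "https://www.youtube.com/v/0PsnoiwMrhA")
domain = match.group(3)
print(domain)
```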

You could try the regex below to get the domain name youtube from the URLs mentioned above:
^(?:https?:\/\/)?(?:www\.)?([^.]*)(?=(?:\.com|\.org|\.info|\.net|\.mobi)).*$
The lookahead ensures that the domain name is followed by .com, .info, .org, .net, or .mobi.
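As a rough illustration of the lookahead, the following Python sketch applies the same pattern to one URL with a listed ending and one without. The helper name domain_before_known_tld and the .xyz test URL are assumptions made for this example.

```python
import re

# Same pattern as above; the lookahead only allows a fixed list of endings.
TLD_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?([^.]*)(?=(?:\.com|\.org|\.info|\.net|\.mobi)).*$"
)


def domain_before_known_tld(url):
    """Return group 1 when the domain is followed by a listed ending, else None."""
    match = TLD_RE.match(url)
    return match.group(1) if match else None


print(domain_before_known_tld("https://www.youtube.com/v/0PsnoiwMrhA"))
print(domain_before_known_tld("youtube.xyz/v/0PsnoiwMrhA"))
```

The second call returns None because .xyz is not in the list, which is the restriction this answer adds compared with a plain ([^.]+) capture.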

Related

regex extract username from 2 types of url

I'm currently using the regex (?<=\/movie\/)[^\/]+, but it only matches the username in the second URL. I know I could add a check in my code (if the URL contains /movie/, use this regex, otherwise use another one), but I'm trying to do this directly in the regex.
http://example.com:80/username/token/30000
http://example.com:80/movie/username/token/30000.mp4

To complement Tensibai's answer: if the URL has no port, you can anchor the regex on the last dot of the host name instead:
\.[^\/\.]+\/(?:movie\/)?([^\/]+)

You can use something like this to make movie/ optional and capture the username in a named group:
\d[/](?:movie\/)?(?<username>[^/]+)[/]
It uses \d/ to anchor the start of the match at the end of the host and port part of the URL.
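Both patterns can be tried with a small Python sketch. Python's re module spells named groups as (?P<name>...), so the named-group pattern is adapted accordingly. The constant names are made up, and the port-less URLs in the second half are assumed variants of the URLs from the question.

```python
import re

# Digit-anchored pattern, rewritten with Python's (?P<name>...) group syntax.
WITH_PORT_RE = re.compile(r"\d/(?:movie/)?(?P<username>[^/]+)/")

# Dot-anchored pattern for URLs without a port.
NO_PORT_RE = re.compile(r"\.[^/.]+/(?:movie/)?([^/]+)")

with_port = [
    "http://example.com:80/username/token/30000",
    "http://example.com:80/movie/username/token/30000.mp4",
]
no_port = [
    "http://example.com/username/token/30000",
    "http://example.com/movie/username/token/30000.mp4",
]

for url in with_port:
    print(WITH_PORT_RE.search(url).group("username"))

for url in no_port:
    print(NO_PORT_RE.search(url).group(1))
```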

Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried the pattern ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5, which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981.
This is not what I need, because the path segments are still separated by /. Can anyone help me with the correct regex?

You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
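The global substitution can be sketched in Python, where re.sub replaces every match by default (the role of the g flag) and group references are written \1 and \2 instead of $1 and $2. The names SEGMENT_RE and to_colon_path are hypothetical. The sketch relies on Python 3.5 or later, where a group that did not take part in a match is substituted as an empty string.

```python
import re

# Same pattern as above without the JavaScript delimiters and flag.
SEGMENT_RE = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")


def to_colon_path(url):
    """Replace each matched piece with '<domain>:<segment>' or ':<segment>'."""
    # Only the first match carries the domain in group 1; later matches
    # leave group 1 unmatched, so they contribute just ':<segment>'.
    return SEGMENT_RE.sub(r"\1:\2", url)


print(to_colon_path("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_path("http://stackoverflow.com/v2/summary/saas?test=123"))
```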
Update
If you can't use the g flag for substitution, there is no better way than to brute-force all the combinations.
For each segment of the URL path you need to add a \/([^\/?#\s]+) to the regex and a further :$2, :$3, etc. to the substitution:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
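Since the rules above differ only in the number of repeated segment groups, they can be generated rather than typed by hand. The following Python sketch is one way to do that; build_pattern and build_substitution are made-up helper names, and the substitution uses Python's \1 syntax instead of $1.

```python
import re


def build_pattern(segments):
    """Build the anchored pattern for a URL with exactly `segments` path parts."""
    host = r"^https?:\/\/(?:www\.)?([^\/?#\s]+)"
    part = r"\/([^\/?#\s]+)"
    tail = r"\/?(?:[#?].*)?$"
    return host + part * segments + tail


def build_substitution(segments):
    """Build the matching replacement: group 1, then one group per segment."""
    return ":".join("\\%d" % i for i in range(1, segments + 2))


pattern = build_pattern(2)
replacement = build_substitution(2)
print(pattern)
print(re.sub(pattern, replacement, "https://stackoverflow.com/path1/path2"))
```

Each generated pattern only matches URLs with exactly that many path segments, so a rule builder would still need one rule per path depth.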

How to extract file name from URL?

I have file names in URLs and want to strip out the preceding URL and file path, as well as the version string that appears after the ?.
I am trying to use a regex to pull out just the file name, for example CaptialForecasting_Datasheet.pdf.
REGEXP_EXTRACT in Google Data Studio seems to behave differently from other regex tools. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the formula below. Event Label is the field where I store the URL of the downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to drop everything from the ? onward, where the version data is, so that only Filename.pdf is extracted.

You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
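A quick way to check this claim is the Python sketch below, which runs the pattern on both example URLs and on the URL from the question. The names FILENAME_RE and file_name are hypothetical.

```python
import re

# Non-slash run that is followed by '?' and then no further '/' up to the end.
FILENAME_RE = re.compile(r"[^/]+(?=\?[^/]*$)")


def file_name(url):
    """Return the file name before the final query string, or None."""
    match = FILENAME_RE.search(url)
    return match.group(0) if match else None


print(file_name("https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver"))
print(file_name("https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver"))
print(file_name("https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"))
```

Because the lookahead requires a ?, a URL without any query string yields no match with this pattern.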

Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
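It basically says: take everything between the last / and the ? that follows it, and put it in group 1. Both .* parts are greedy, so if the URL contains several ? characters, the group runs up to the last one.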
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches a run of non-/ characters, [^\/]*, that is immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
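The two variants can be compared side by side in a short Python sketch, using the URL from the question. The constant names are made up; the lookbehind is a single character wide, so it also works in engines such as Python's re that only allow fixed-width lookbehind.

```python
import re

URL = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)

# Variant 1: the file name ends up in capture group 1.
GROUP_RE = re.compile(r".*/(.*)\?")

# Variant 2: the whole match is the file name, thanks to the lookarounds.
LOOKAROUND_RE = re.compile(r"(?<=/)[^/]*(?=\?)")

group_result = GROUP_RE.search(URL).group(1)
lookaround_result = LOOKAROUND_RE.search(URL).group(0)

print(group_result)
print(lookaround_result)
```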

This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")

Please try the following regex:
[A-Za-z_]*\.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files.

The following regex will extract a file name with a .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))

Backbone.js route using regex - Matching a URL that does not end with a given string

I have to create a route using a regex that matches a URL which does not end with a particular word, say 'submit'. For example:
/login/submit ==> does not match
/login/abcsubmit ==> does not match
/abc/xyx ==> matches

Use this regex:
((?!(.*?)/\w*submit).*)
and pass it to route as explained in http://backbonejs.org/#Router-route (the / has to be escaped inside a JavaScript regex literal):
this.route(/^((?!(.*?)\/\w*submit).*)$/, "functionName");

I tried the regex that Nestenius provided, and it still matched the first two example URLs from your question. The reason is that the regex was not anchored to the start of the string.
You can still use his regex if you add a ^ anchor to the beginning, like so:
^((?!(.*?)/\w*submit).*)
Or you can use this shorter version:
^(?!.*submit).*
Both reject the first two example URLs. Note that the shorter version rejects any string that contains "submit" anywhere, not only at the end.
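The difference between "contains" and "ends with" can be seen in the following Python sketch (Python is used here only to exercise the regex, not to show Backbone code). CONTAINS_RE is the shorter pattern from this answer. ENDS_WITH_RE is an assumed variant that is not part of the original answers: it adds $ inside the lookahead so that only a trailing "submit" is rejected.

```python
import re

# Shorter pattern from the answer: rejects "submit" anywhere in the string.
CONTAINS_RE = re.compile(r"^(?!.*submit).*")

# Assumed variant (not from the answers): rejects "submit" only at the end.
ENDS_WITH_RE = re.compile(r"^(?!.*submit$).*$")


def route_matches(pattern, path):
    """Return True if the pattern matches the path from its start."""
    return pattern.match(path) is not None


for path in ["/login/submit", "/login/abcsubmit", "/abc/xyx", "/submit/form"]:
    print(path, route_matches(CONTAINS_RE, path), route_matches(ENDS_WITH_RE, path))
```

The made-up path /submit/form is the case where the two patterns disagree.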

MFC: How do I construct a good regular expression that validates URLs?

Here's the regular expression I use, and I parse it using CAtlRegExp of MFC :
(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])
It works fine except with one flaw. When URL is preceded by characters, it still accepts it as a URL.
ex inputs:
this is a link www.google.com (where I can just tokenize the spaces and validate each word)
is...www.google.com (this string still matches the RegEx above :( )
Please help...
Thanks...

Use the IgnoreCase flag instead of catering for each case.
Stick a ^ at the beginning if you want the start of the string to be the start of the URL
You're missing a lot of characters from possible, valid URLs.

You need to tell the regex to only match at the start and end of the string. I'm not sure how you do that in VC++, but in most regex flavors you enclose the pattern in ^ and $. The ^ says "the start of the string" and the $ says "the end of the string."
^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\\.]+[a-zA-Z0-9]+[\\.]+[a-zA-Z0-9]+)$
The second is matching because the string still contains a valid URL.

How about using CUrl (that is, 'C-Url', in ATL, not curl as in libcurl) which can 'parse' urls with CUrl::CrackUrl . If that function returns FALSE you assume it's not a valid URL.
That said, decomposing a URL is complex enough to warrant a proper parser rather than a regex-based decomposition. See RFC 2396, among others, for an overview of the complexities.

Start the regex with ^ and end it with $ to have the regex match only if the entire string matches (if that's what you want):
^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+)$
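The effect of anchoring can be illustrated with a simplified Python adaptation of the pattern. This is a sketch in Python's re syntax, not CAtlRegExp syntax: it uses the IGNORECASE flag instead of spelling out both cases, fullmatch plays the role of ^...$, and the final character class carries a + so that multi-letter endings such as com still match once the end of the string is enforced. The names URL_RE, contains_url and is_url are made up.

```python
import re

# Simplified stand-in for the pattern in the question (assumption: Python syntax).
URL_RE = re.compile(r"(?:https?://)?[a-z0-9]+\.+[a-z0-9]+\.+[a-z0-9]+", re.IGNORECASE)


def contains_url(text):
    """Unanchored search: succeeds if a URL-like run appears anywhere."""
    return URL_RE.search(text) is not None


def is_url(text):
    """Anchored match: the whole string must be URL-like."""
    return URL_RE.fullmatch(text) is not None


for text in ["www.google.com", "HTTPS://www.google.com", "is...www.google.com"]:
    print(text, contains_url(text), is_url(text))
```

The string is...www.google.com passes the unanchored search but fails the anchored check, which is the behaviour the question asks for.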

What about this one: (((f|ht)tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+) ?

This Regular Expression has been tested to work for the following
http|https://host[:port]/[?][parameter=value]*
public static final String URL_PATTERN = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:#/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?#]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";
PS. It also validates on localhost link.
(Thoroughly written by me :-))
