Regex - find all hosts inside a url - regex

Given the following url:
http://clk.atdmt.com/FLO/go/364329512/direct/01/?href=http://www.****123****.com/refer.do?r=linkshare&lsid=vl0mfKZlvKU-I%2AKKCkbqWO7Zb9aqRSVLEw&lsurl=http%3A%2F%2F****123****%2Fcollection.do%3Fdataset%3D12905%26cm_mmc%3DIM_AFFILIATES-_-Linkshare-vl0mfKZlvKU-_-10003079-_-3
Is there any expression that will match all the hosts above? (e.g http://clk.atdmt.com, http://*123*.com....)
I need the same expression to work even in the matched string will be http://clk.atdmt.com
Thanks

You can use this (javascript flavour) to extract the host (capture group 1):
/http:\/\/(?:www\.)?([^\/]+)/g

Related

regex extract username from 2 types of url

I'm currently using this regex (?<=\/movie\/)[^\/]+, but it only matches the username from the second url, i know i could make a if (contains /movie/): use this regex, else: use another regex on my code, but i'm trying to do this directly on regex.
http://example.com:80/username/token/30000
http://example.com:80/movie/username/token/30000.mp4
To complete the Tensibai's answer, if you have not a port in url, you can use the last dot in url to start your regex :
\.[^\/\.]+\/(?:movie\/)?([^\/]+)
(demo)
You can use something like this to make the movie/ optional and have the username in a named capture group (Live exemple):
\d[/](?:movie\/)?(?<username>[^/]+)[/]
using \d/ to anchor the start of match at after the url.

Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
Check the test cases
Update
If you can't use g flag for substitution, there's no better way but bruteforce all the combinations:
You need to add a \/([^\/?#\s]+) and :$2 etc for each segment of the url path:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6

Using a wildcard in Regex at the end of a URL in GA

I'm a newbie at Regex. I'm trying to get a report in GA that returns all pages after a certain point in the URL.
For example:
http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/
I want to see all dates so: http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/*
Here's what I've got so far in my regex:
^https:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox(?=(?:\/.*)?$)
You can try this:
https?:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox[\w/_-]*
GA RE2 regex engine does not allow lookarounds (even lookaheads) in the pattern. You have defined one - (?=(?:\/.*)?$).
If you need all links having www.essentialibiza.com/ibiza-club-tickets/carl-cox/, you can use a simple regex:
www\.essentialibiza\.com/ibiza-club-tickets/carl-cox/
If you want to precise the protocol:
https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)
The ? will make s optional (1 or 0 occurrences) and (/|$) will allow matching the URL ending with cox (remove this group if you want to match URLs that only have / after cox).

Regex to match a capture a domain name

Hi i will have URL in following format:
http://www.youtube.com/v/0PsnoiwMrhA
https://www.youtube.com/v/0PsnoiwMrhA
www.youtube.com/v/0PsnoiwMrhA
http://youtube.com/v/0PsnoiwMrhA
youtube.com/v/0PsnoiwMrhA
It all must capture and return a domain name as youtube.
I have tried using
(http://|https://)?(www.)(.?*)(.com|.org|.info|.org|.net|.mobi)
but it showing error as regex parsing nested quantifier.
Please help me out
If you are using a field that you know is in one of these formats, you can retrieve the match from Group 1 using this regex:
^(?:https?://)?(?:www\.)?([^.]+)
In VB.NET:
Dim ResultString As String
Try
ResultString = Regex.Match(SubjectString, "^(?:https?://)?(?:www\.)?([^.]+)", RegexOptions.Multiline).Groups(1).Value
Catch ex As ArgumentException
'Syntax error in the regular expression
End Try
(.?*) should be (.*?) - that's the source of your error.
Also, remember to escape the dot unless you want it to match any character.
And since the www. part is optional, you need to add a ? quantifier to that group as well.
You could try the below regex to get the domain name youtube from the above mentioned URL's,
^(?:https?:\/\/)?(?:www\.)?([^.]*)(?=(?:\.com|\.org|\.info|\.net|\.mobi)).*$
DEMO
It ensures that the domain name must be followed by .com or .info or .org or .net or .mobi.

Need IP Address mask and DNS host name regular expressions?

I need to allow an IP/DNS name from a text box. I am looking for a IP regular expression which work for IP.
Now I am using one regular expression:
/\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/
which was working for 0-255 range. But allowing invalid IP such as : 121.21.05.234.01 which has 5 parts.
I need a regular expression which will work in all scenario's like below:
10.2.22.1 - true
123.123.123.123 - true
123.123.023.12 - true
12.23.12.0 - true
121.21.05.234.01 - false
Please provide me DNS expression also.
Try to anchor your regex with ^ and $, which will make it match the whole string.
Are you looking for a way to specify an occurrence count?
You may achieve this with curly brackets.
An exemple here.
In your case, it would lead to:
/\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}\b/
(I added a \ to escape the dot, too)