Dart extract host from URL string - regex

Suppose I have the following URL as a String:
String urlSource = 'https://www.wikipedia.org/';
I want to extract the main page name, 'wikipedia', from this URL string, removing the https://, www, and .com or .org parts.
What is the best way to extract this? If RegExp is the answer, what regular expression should I use?

You do not need to make use of RegExp in this case.
Dart has a premade class for parsing URLs:
Uri
What you want to achieve is quite simple using that API:
final urlSource = 'https://www.wikipedia.org/';
final uri = Uri.parse(urlSource);
uri.host; // www.wikipedia.org
The Uri.host property will give you www.wikipedia.org. From there, you should easily be able to extract wikipedia.
Uri.host will also remove the whole path, i.e. anything after the / after the host.
Extracting the second-level domain
If you want to get the second-level domain, i.e. wikipedia from the host, you could just do uri.host.split('.')[uri.host.split('.').length - 2].
However, note that this is not fail-safe because you might have subdomains or not (e.g. www) and the top-level domain might also be made up of multiple parts. For example, co.uk uses co as the second-level domain.
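To make the caveat concrete, here is the same second-to-last-label approach as a minimal sketch in JavaScript (chosen only for illustration; the built-in URL class plays the role of Dart's Uri, and naiveSecondLevel is a hypothetical helper name). It works for www.wikipedia.org but returns co for a co.uk host:

```javascript
// Naive approach: take the second-to-last dot-separated label of the host.
function naiveSecondLevel(urlString) {
  const labels = new URL(urlString).hostname.split('.');
  // Fall back to the only label for single-label hosts such as "localhost".
  return labels.length >= 2 ? labels[labels.length - 2] : labels[0];
}

console.log(naiveSecondLevel('https://www.wikipedia.org/')); // wikipedia
console.log(naiveSecondLevel('https://www.example.co.uk/')); // co
```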

Related

Regular expressions (RegEx) to filter string from URLs in Google Analytics

I want to filter a string from the URLs in Google Analytics. This can be done using the Views > Filter > Exclude using RegEx, but I have been unable to get it to work.
An outline of how these filters are set up can be found here; however, I cannot work out how to isolate the string using RegEx. I believe it will need one filter per URL type.
The URLs follow this format:
/software/11F372288FA/pagename
/software/13F412C5FA/pagename/summary
/software/XIL1P0BFXCKM81/pagename2
I need to exclude this part of the URL:
/11F372288FA/
So that the URL data (e.g. Session time) is recorded against:
/software/pagename
/software/pagename/summary
/software/pagename2
I have worked out that I can isolate the string using the following RegEx:
^\/validate\/(..........)\/accounts\/summary$
It is not very elegant and would require a filter for every URL type.
Thanks for the help!
I'm not certain this will work in your exact case, but instead of using regex it might be easier to build a new string from the leading "/software/" and append everything from "pagename" to the end. In Java this might look something like:
String newString = oldString.substring(0, 10) + oldString.substring(oldString.indexOf("pagename"));
Note that this only works if the "/software/" prefix is always the same length (10 characters, counting both slashes) and you are only excluding the part between "software" and "pagename".
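If a regex is still preferred, a single pattern may be able to cover all three URL types. The following is a minimal sketch in JavaScript, not a tested Google Analytics filter; stripId is a hypothetical name, and it assumes the ID is always the one path segment directly after /software/ (a path such as /software/pagename/summary with no ID would be shortened incorrectly):

```javascript
// Assumption: the ID is always the single path segment right after /software/.
function stripId(path) {
  // Match "/software/<one segment>/" at the start and keep only "/software/".
  return path.replace(/^\/software\/[^\/]+\//, '/software/');
}

console.log(stripId('/software/11F372288FA/pagename'));       // /software/pagename
console.log(stripId('/software/13F412C5FA/pagename/summary')); // /software/pagename/summary
console.log(stripId('/software/XIL1P0BFXCKM81/pagename2'));   // /software/pagename2
```

Whether the filter field in Google Analytics accepts this pattern unchanged would need to be checked there.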

How can I get the path from a URL with regex?

Maybe somebody can help me with this regex?
.*\:\/\/(?:www.)?([^\/]+)(\/.+")
I need to get all paths from the URLs. I tried, but I can't match only the path without the quotation mark.
https://regex101.com/r/J6nILD/6
You can get the path using JSR223 Sampler with Groovy code.
Declare or get the URL variable.
Parse that URL to get the protocol, host, port, path and query. Use a JSR223 Sampler and paste the following code into the Script area:
URL url1 = new URL(vars.get('url'));
vars.put('protocol', url1.getProtocol());
vars.put('host', url1.getHost());
vars.put('port', url1.getPort() as String);
vars.put('path', url1.getPath());
vars.put('query', url1.getQuery());
Use those variables anywhere in the script with the ${...} syntax, for example ${path}.
If you have to first scan for a URL:
I've attempted to provide a simple regex (overly simplified) that might work in your context, but you might have to modify it to provide some additional context. For example, x is a valid path and this regex will recognize it as such. But if you are trying to look for the path in a string such as <img src="x">, it will also recognize img as a valid url path. In that case, you would want perhaps:
/<img\s+src="((https?|ftp):\/\/[^\/]+)?(\/?[^?#\s"]*)/i
var regex = /\b((https?|ftp):\/\/[^\/]+)?(\/?[^?#\s]*)\b/i;
var s = 'http://example.com/a/b?x=1';
var result = regex.exec(s);
console.log(result[3]);
If the protocol and host portion of the URL are always present, then it becomes easier to distinguish URLs in just about any context by making the protocol and host not optional (note that the slashes must be escaped inside a JavaScript regex literal):
/\b((https?|ftp):\/\/[^\/]+)(\/?[^?#\s]*)\b/i;
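As a rough illustration of scanning a larger string with the mandatory protocol-and-host variant, the sketch below adds the g flag and collects capture group 3 from every match. The names urlPattern and extractPaths and the sample text are made up for this example:

```javascript
// Protocol and host are required here, so bare words are not mistaken for paths.
const urlPattern = /\b((https?|ftp):\/\/[^\/]+)(\/?[^?#\s]*)\b/gi;

function extractPaths(text) {
  // matchAll requires the g flag; capture group 3 holds the path.
  return Array.from(text.matchAll(urlPattern), (m) => m[3]);
}

const sample = 'see http://example.com/a/b?x=1 and https://test.org/c/d#top';
console.log(extractPaths(sample)); // [ '/a/b', '/c/d' ]
```

Because [^\/]+ also accepts whitespace, a URL without any path could swallow following text up to the next slash, so this remains a simplification.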
You could go for something like:
(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
More information:
JMeter: Regular Expressions
Using RegEx (Regular Expression Extractor) with JMeter
Perl 5 Regex Cheat sheet
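For reference, the component-splitting pattern from this answer can be tried out as follows. This is a sketch in JavaScript rather than JMeter, written with single backslashes as a regex literal needs and with ^ and $ anchors added; splitUrl and the sample URL are illustrative only:

```javascript
// Component-splitting pattern, anchored at both ends.
const parts = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

function splitUrl(url) {
  const match = parts.exec(url);
  if (match === null) return null; // e.g. a line break inside the fragment
  const [, scheme, authority, path, query, fragment] = match;
  return { scheme, authority, path, query, fragment };
}

console.log(splitUrl('https://example.com:8080/a/b?x=1#top'));
// scheme 'https', authority 'example.com:8080', path '/a/b', query 'x=1', fragment 'top'
```

Capture group 3 is the path that the question asks for.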

Regex to validate a url using a wildcard?

Let's say I have a list of valid domain roots,
example.com
test.com
And a variable
String url
How would I make use of a regex to validate that my variable url is on the list, including subdomains?
For example, perhaps my url is "subdomain.case.example.com"
That is, to say clearly:
How would I utilize a regex to verify that my url is *.example.com OR *.test.com OR example.com OR test.com?
Something like this?
^((\*|[\w\d]+(-[\w\d]+)*)\.)*(example|test)(\.com)$
To allow for such things as... subdomain.*.example.com, subdomain.example.com, example.com, *.example.com, etc.
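Use $ to mark the end of the string and ^ to mark the start.
Your regex would be
^(.*[.])?(example|test)[.]com$
The optional (.*[.])? group requires a dot before the root, so a host such as notexample.com is rejected, which a bare leading .* would let through.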
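When the list of roots is not fixed, the pattern can be generated instead of written by hand. Below is a minimal sketch in JavaScript; escapeRegex and buildHostPattern are hypothetical helper names, and it assumes the input is already a bare host name (no scheme, port or path) and that wildcard labels such as * do not need to be accepted:

```javascript
// Escape regex metacharacters so that "." in a domain matches only a literal dot.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one anchored pattern: optional subdomain labels, then an allowed root.
function buildHostPattern(roots) {
  const alternatives = roots.map(escapeRegex).join('|');
  return new RegExp(`^(?:[a-z0-9-]+\\.)*(?:${alternatives})$`, 'i');
}

const allowed = buildHostPattern(['example.com', 'test.com']);
console.log(allowed.test('subdomain.case.example.com')); // true
console.log(allowed.test('example.com'));                // true
console.log(allowed.test('notexample.com'));             // false
console.log(allowed.test('example.com.evil.org'));       // false
```

Anchoring with both ^ and $ is what keeps look-alike hosts from matching.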

Regex Base URL Grabbing

I'm trying to filter a bunch of URLs down to their base URL, which doesn't include the www or any other prefix. I'm having trouble writing an expression to capture it, and with a subset of TLDs (such as co.uk) it becomes a rather more complicated issue.
answers.yahoo.com => yahoo.com
www.google.com => google.com
uk.answers.yahoo.co.uk => yahoo.co.uk
www.g.se => g.se
Any suggestions?
I was using this expression, but it messes up when the domain name isn't more than 2 characters or when the domain tld is less than 2 characters.
(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$
How do you know that the base of uk.answers.yahoo.co.uk is yahoo.co.uk, but the base of, for example, foo.bar.maps.google.com isn't maps.google.com?
[^\.]*\.(?:co\.uk|\w{2,3})$
You'll need to add known multi-part domains such as co.uk to the regex.
http://regexr.com?30p4r
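The same idea can also be expressed without a regex by keeping the known multi-part suffixes in a list. The sketch below is in JavaScript; baseDomain is a hypothetical name and MULTI_PART_SUFFIXES is a deliberately tiny, made-up list that a real solution would have to extend:

```javascript
// Hypothetical, deliberately tiny list; a real one would need many more entries.
const MULTI_PART_SUFFIXES = ['co.uk', 'com.au', 'co.jp'];

function baseDomain(host) {
  const labels = host.toLowerCase().split('.');
  const lastTwo = labels.slice(-2).join('.');
  // Keep three labels when the host ends in a known multi-part suffix, else two.
  const keep = MULTI_PART_SUFFIXES.includes(lastTwo) ? 3 : 2;
  return labels.slice(-keep).join('.');
}

console.log(baseDomain('answers.yahoo.com'));      // yahoo.com
console.log(baseDomain('www.google.com'));         // google.com
console.log(baseDomain('uk.answers.yahoo.co.uk')); // yahoo.co.uk
console.log(baseDomain('www.g.se'));               // g.se
```

Like the regex, this is only as good as the suffix list it is given.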

rel-tag bookmarklet for last path component of a URL

Many web sites support folksonomy tags. You may have heard of rel-tag, where it says that "The last path component of the URL is the text of the tag".
I am looking for a bookmarklet or greasemonkey script (javascript) to get the "last path component" for the URL currently being viewed in the browser, add that tag into another URL, and then open that page in a new tab or window.
For example, if I am looking at a delicious.com page with the tag "foo", I may want to create a new URL with the tag "foo". This should also work for multiple tags in the last path component, such as, foo+bar.
Some regexp suggestions have been offered.
Since you're using JavaScript, there's no need to worry about hostnames, querystrings, etc - just use location.pathname to get at the important bit.
For example:
var NewUrl = 'http://technorati.com/tag/';
var LastPart = location.pathname.match( /[^\/]+\/?$/ );
window.open( NewUrl + LastPart );
That allows for a potential single trailing slash.
You can use /[^\/]+$/ to disallow trailing slashes, or /[^\/]+\/*$/ for any number of them.
If you can assume both your URLs to be valid, you can get the tag from the first URL with this regex:
^[a-z]+://[^/#?]+/[^#?]*?([^#?/]+)(?:[#?]|$)
The first (and only) capturing group will hold the tag. This regex won't match URLs that don't have any tags.
To append the tag to another URL, search for the regex:
^([^#?]*?)/?(?:[#?]|$)
and replace with:
$1/tag
This regex makes sure not to end up with two adjacent slashes in the URL if the path of the original URL ends with a slash.
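Putting both steps together, here is a minimal sketch in JavaScript that uses the built-in URL class instead of hand-written patterns. lastPathComponent and buildTagUrl are hypothetical names, and the delicious.com path in the example is assumed for illustration:

```javascript
// Return the last non-empty path segment, or null if the path has none.
function lastPathComponent(urlString) {
  const segments = new URL(urlString).pathname.split('/').filter(Boolean);
  return segments.length > 0 ? segments[segments.length - 1] : null;
}

// Append the tag to targetBase, making sure exactly one slash separates them.
function buildTagUrl(sourceUrl, targetBase) {
  const tag = lastPathComponent(sourceUrl);
  if (tag === null) return null;
  return targetBase.replace(/\/*$/, '/') + tag;
}

console.log(buildTagUrl('http://delicious.com/tag/foo+bar/', 'http://technorati.com/tag/'));
// http://technorati.com/tag/foo+bar
```

In a bookmarklet, location.href would be passed as sourceUrl and the result handed to window.open.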