Regex to match anything after / - regex

I'm basically clueless about regex, but I need a regex statement that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone, and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.

To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
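The same example can be checked quickly in JavaScript. This is just a minimal sketch; the variable names are mine and not part of the question.

```javascript
// Match a run of a's followed by a captured run of b's.
const input = "aaaaaaaabbbbcccc";
const match = input.match(/a*(b*)/);

console.log(match[0]); // group 0, the complete match: "aaaaaaaabbbb"
console.log(match[1]); // group 1, the parenthesised part: "bbbb"
```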
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last /, you can take another approach:
possible other solution: /([^/]*)$
The pattern above will find a / followed by a string of characters that are NOT another /, running to the end of the string (that is what the $ anchor enforces), and keep those characters in group 1. The issue here is that you could match things that do not have sweets in the URL.
Note: if you do not mind the / at the beginning, just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex. It will mark the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would make sure you do not hard-code the "localhost" part in the match. Also, I am assuming the (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match an optional http:// followed by a host (be it localhost, an IP address, or a host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
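As a rough sketch of that "match, then replace the whole URL" idea in JavaScript: the helper name redirectTarget, the added ^ and $ anchors, and the hard-coded target URL are my assumptions, and a real WordPress site would more likely do this in a rewrite rule or plugin.

```javascript
// The pattern from above, anchored so it must cover the whole URL (assumption).
const SWEETS_PATTERN = /^(http:\/\/)?[^\/]+\/sweettemptations\/sweets\/.*$/;

// Target taken from the question; adjust it for a real host.
const TARGET = "http://localhost/sweettemptations/available-sweets";

// Hypothetical helper: returns the redirect target, or null when no redirect applies.
function redirectTarget(url) {
  return SWEETS_PATTERN.test(url) ? TARGET : null;
}

console.log(redirectTarget("http://localhost/sweettemptations/sweets/somethingmore.html"));
console.log(redirectTarget("http://localhost/sweettemptations/available-sweets")); // null
```

Note that nothing captured from the original URL is reused, which is why no groups are needed here.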

Use this regular expression: (?<=://).+

Related

Fluentvalidation 6.4.1.0 support me with Incorrect regex

In my case, I want to validate image URLs, but some valid URLs are rejected.
E.g. the image link "https://fuvitech.online/wpcontent/uploads/2021/02/bta16600brg.jpg" or "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg" gets the response "The image link is not in the correct format".
My code here:
RuleFor(product => product.Images)
.Length(1, 3000).WithMessage(Labels.importProduct_ExceedDescription, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))))
.Matches(@"^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$").WithMessage(Labels.importProduct_UrlNotCorrect, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))));
Please help me find where the above regex is wrong. Thank you.
Try this:
NOTE the following regex pattern may trigger false positives and also may ignore valid image URLs, because it is very difficult to validate whether a given URL is valid.
^https?:\/\/(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+[A-Za-z]{2,}(?::\d+)?\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|)[\w-]+\.(?:jpg|jpeg|png)$
Explanation
^ the start of a line/string.
https?:\/\/ match http with an optional letter s, followed by ://.
(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+ This will match things like foo-foo.bar-bar., foo.bar-bar. and foo.
[A-Za-z]{2,} this will match the TLD part, e.g., com, org, this part with the previous part will match things like foo-foo.bar-bar.com, foo.bar-bar.com or foo.com.
(?::\d+)? optional group of (a colon : followed by one or more digits) for port part.
\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|) this checks for two things: the first is a path of one or more directories such as /uploads/public-images/ or /uploads/images/, the second is a single / with no directories after it.
[\w-]+ this part for the file name, e.g., bta16-600brg.
\.(?:jpg|jpeg|png) you can add here multiple extensions, you can allow uppercase letters by using for example, [Jj][Pp][Gg] for jpg.
$ the end of the line/string.
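To see how the pieces fit together, here is a small JavaScript check of the pattern against the two URLs from the question. It is only a sketch: the constant name IMAGE_URL and the negative test URL are my own, and the same caveat about false positives and negatives still applies.

```javascript
// The pattern explained above, written as a JavaScript regex literal.
const IMAGE_URL = /^https?:\/\/(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+[A-Za-z]{2,}(?::\d+)?\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|)[\w-]+\.(?:jpg|jpeg|png)$/;

// Both URLs from the question are accepted.
console.log(IMAGE_URL.test("https://fuvitech.online/wpcontent/uploads/2021/02/bta16600brg.jpg")); // true
console.log(IMAGE_URL.test("https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg")); // true

// A non-image extension is rejected (made-up URL for illustration).
console.log(IMAGE_URL.test("https://fuvitech.online/wp-content/uploads/readme.txt")); // false
```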
Thanks @SaSkY for answering my question.
I found my mistake.
The part \.[a-z]{2,5} only allows domain extensions of 2 to 5 characters. For example, .com is valid, but in my case .online (6 characters) was not.
I changed it to \.[a-z]{1,10}.
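The effect of that one change can be reproduced in JavaScript. This is a sketch only: the constant names are mine, and .NET and JavaScript regex engines differ in places, although not for this pattern as far as I can tell.

```javascript
// Original pattern: the TLD is limited to 2-5 letters.
const original = /^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;
// Same pattern with the TLD length widened to 1-10 letters.
const widened = /^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{1,10}(:[0-9]{1,5})?(\/.*)?$/;

const url = "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg";

console.log(original.test(url)); // false: ".online" has 6 letters
console.log(widened.test(url));  // true
```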

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!
Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string, for example "college-of-the-ozarks". Note the non-capturing inner parentheses; they should probably be used no matter which solution you go with, as you don't want to capture just the word "university" or "college".
Additionally, I don't know what your data may contain, but if it may include compound words that you want to exclude, using \b may be advisable. For instance, if you don't want to match "miskatonic-postcollege", you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
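Here is a quick way to sanity-check the \b version outside Analytics. It is a JavaScript sketch, so treat it only as a check of the logic: Analytics uses its own regex engine, and the helper name clientName is made up for the example.

```javascript
// First (and only) capture group is the whole client-name segment.
const CLIENT = /\/access-guide\/([^\/]*\b(?:university|universities|college|colleges)\b[^\/]*)\//;

// Hypothetical helper: returns the client name, or null when the segment does not qualify.
function clientName(path) {
  const m = path.match(CLIENT);
  return m ? m[1] : null;
}

console.log(clientName("/access-guide/the-university-of-manchester/176-waterloo-place-2")); // "the-university-of-manchester"
console.log(clientName("/access-guide/kings-college-london/franklin-wilkins-building-2")); // "kings-college-london"
console.log(clientName("/access-guide/miskatonic-postcollege/library")); // null
```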
If the client name section of the URL is after the access-guide/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
                                      |----------------------------|
you need to use a negated character class to only match university before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
               ^^^^^
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, or universities, or college, or colleges literally. Add more if needed.

Regex for URL to sites

I have two URLs with the patterns:
1. http://localhost:9001/f/
2. http://localhost:9001/flight/
I have a site filter which redirects to the respective sites if the regex matches. I tried the following regex patterns for the 2 URLs above:
http?://localhost[^/]/f[^flight]/.*
http?://localhost[^/]/flight/.*
Both URLS are getting redirected to the first site, as both URLs are matched by the first regex.
I have also tried http?://localhost[^/]/[f]/.* for the 1st URL. I am unable to see what I am missing. I feel that this regex should not accept anything other than "f", but it is allowing "flight" as well.
Please help me by pointing out the mistake I have made.
Keep things simple:
.*/f(/[^/]*)?$
vs
.*/flight(/[^/]*)?$
The ? before $ makes the whole group (the trailing slash plus an optional path segment) optional.
The first one will be matched by the following regex:
/^http:[\/]{2}localhost:9001\/f\/$/
The other one will not match it, and can be matched with the following regex:
/^http:[\/]{2}localhost:9001\/flight\/$/
Your regex has several issues: 1) p? means an optional p (htt:// will match), 2) [^/] matches only a single character, so it will only match the : in your URLs (and you have a port number after it), 3) [^light] is a negated character class that means any single character that is not l, i, g, h, or t.
So, if you want to only capture localhost URLs, you'd better use this regex for the 1st site:
http://localhost[^/]*/f/.*
And this one for the second
http://localhost[^/]*/flight/.*
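A small JavaScript sketch of how a filter could use the two patterns. The siteFor function and the leading ^ anchors are my assumptions; the real site filter may apply the patterns differently.

```javascript
// The two corrected patterns, anchored at the start (assumption).
const SITE_F = /^http:\/\/localhost[^\/]*\/f\/.*/;
const SITE_FLIGHT = /^http:\/\/localhost[^\/]*\/flight\/.*/;

// Hypothetical dispatcher: returns which site a URL belongs to, or null.
function siteFor(url) {
  if (SITE_F.test(url)) return "f";
  if (SITE_FLIGHT.test(url)) return "flight";
  return null;
}

console.log(siteFor("http://localhost:9001/f/"));      // "f"
console.log(siteFor("http://localhost:9001/flight/")); // "flight"

// Issue 1 from the list above: "http?" only makes the "p" optional.
console.log(/^http?:\/\//.test("htt://localhost")); // true
```

Because /f/ requires a slash right after the f, the first pattern can no longer swallow /flight/.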
Please also bear in mind that, depending on where you use the regexps, your actual input may or may not include the protocol.
These should work for you:
http[s]{0,1}:\/\/localhost:[0-9]{4}\/f\/
http[s]{0,1}:\/\/localhost:[0-9]{4}\/flight\/

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.
This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
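To illustrate why the two sets of parentheses matter, here is a JavaScript sketch that picks whichever group took part in the match. The function name domainOrWhole is made up for the example.

```javascript
// Group 1 is filled for URLs, group 2 for everything else.
const DOMAIN_OR_ALL = /(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)/;

// Hypothetical helper: returns the domain for a URL, otherwise the whole string.
function domainOrWhole(s) {
  const m = s.match(DOMAIN_OR_ALL);
  if (!m) return null;
  // In JavaScript a group that did not participate in the match is undefined.
  return m[1] !== undefined ? m[1] : m[2];
}

console.log(domainOrWhole("http://mozart.co.uk"));    // "mozart.co.uk"
console.log(domainOrWhole("https://avocado.si/hmm")); // "avocado.si"
console.log(domainOrWhole("Benchmark 123"));          // "Benchmark 123"
```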
Maybe this: ^(http[s]?:\/\/)?(.*)$. Play with it here: https://regex101.com/r/iZ2vL4/1
This will have matching groups; the domain you want will be in the 4th matching group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
Regex101.com is a workbench for checking your URLs: just paste them into the "TEST STRING" textbox to test it out.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!

Extracting top-level and second-level domain from a URL using regex

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?
Here's my idea,
Match anything that isn't a dot, three times, from the end of the line using the $ anchor.
The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.
Both the last and second last matches will only match 2-3 characters, so that it doesn't confuse it with a second-level domain name.
Regex:
[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
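A quick JavaScript sketch of how this behaves, assuming the input is a bare hostname with no protocol or path. The names LAST_LEVELS and lastLevels are mine.

```javascript
const LAST_LEVELS = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper: returns the matched tail of the hostname, or null.
function lastLevels(hostname) {
  const m = hostname.match(LAST_LEVELS);
  return m ? m[0] : null;
}

console.log(lastLevels("www.example.com"));   // "example.com"
console.log(lastLevels("www.example.co.uk")); // "example.co.uk"

// Limitation: a 3-letter second-level name looks like a ccTLD second level.
console.log(lastLevels("www.bbc.com"));       // "www.bbc.com"
```

The last case shows the kind of false positive this length-based heuristic cannot avoid.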
Updated 2019
This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.
The only way to reliably get the primary host is to call out to a service that knows about them, like the Public Suffix List.
There are several open-source libraries out there that you can use, like psl, or you can write your own.
Usage for psl is quite intuitive. From their docs:
var psl = require('psl');
// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null
// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'
// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
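If you do want to write your own, the core idea is a longest-suffix lookup. The sketch below is not how psl is implemented; it uses a tiny made-up sample of suffixes and ignores the wildcard and exception rules of the real Public Suffix List, so treat it as an illustration only.

```javascript
// Tiny, hypothetical sample of public suffixes; the real list is far longer.
const SAMPLE_SUFFIXES = new Set(["com", "org", "uk", "co.uk", "org.uk", "com.au"]);

// Hypothetical helper: returns the suffix plus one more label, or null.
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split(".");
  // Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (SAMPLE_SUFFIXES.has(suffix)) {
      // A hostname that is itself a suffix has no registrable domain.
      return i === 0 ? null : labels.slice(i - 1).join(".");
    }
  }
  return null;
}

console.log(registrableDomain("www.google.com"));    // "google.com"
console.log(registrableDomain("a.b.example.co.uk")); // "example.co.uk"
console.log(registrableDomain("co.uk"));             // null
```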
Old answer
You could use this:
(\w+\.\w+)$
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.
Example: http://regex101.com/r/wD8eP2
Also, you can likely do that with some expression similar to,
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
and add as many capturing groups as you want to capture the components of a URL.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
This matches anything with a period followed by two or three characters and then a word boundary.
Here's some example outputs:
'example.aus.com' // .aus.com
'example.austin.com' // .com
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy' // .co.uk
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.
Edit
I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
'example.aus.com'.match(/\.\w*\b/g).join('')
Since TLDs now include things with more than three-characters like .wang and .travel, here's a regex that satisfies these new TLDs:
([^.\s]+\.[^.\s]+)$
Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.
http://regexr.com/3bmb3
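A short JavaScript check of that strategy, assuming a bare hostname as input. The constant name LAST_TWO is mine.

```javascript
const LAST_TWO = /([^.\s]+\.[^.\s]+)$/;

// Works for long TLDs such as .travel.
console.log("shop.example.travel".match(LAST_TWO)[1]); // "example.travel"

// Limitation: two-part suffixes such as .co.uk lose the registered name.
console.log("www.example.co.uk".match(LAST_TWO)[1]);   // "co.uk"
```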
With capturing groups you can achieve some magic.
For example, consider the following JavaScript:
let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');
document.write(domain);
This will result in a string containing 'else.be'. This is because the regex itself will match the complete string and the capturing group will be mapped to $1. So it replaces the complete string 'test.something.else.be' with '$1', which is actually 'else.be'.
The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
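One possible way to make the depth configurable is to build the pattern with a {n} quantifier. This is only a sketch: lastLabels and its levels parameter are names I made up, and it assumes levels is a positive integer.

```javascript
// Hypothetical helper: keeps the last `levels` labels of a hostname.
function lastLabels(hostname, levels) {
  // For levels = 2 this builds /^.+\.((?:[^.]+\.){1}[^.]+)$/.
  const pattern = new RegExp("^.+\\.((?:[^.]+\\.){" + (levels - 1) + "}[^.]+)$");
  // When the hostname has too few labels, nothing matches and it is returned unchanged.
  return hostname.replace(pattern, "$1");
}

console.log(lastLabels("test.something.else.be", 2)); // "else.be"
console.log(lastLabels("test.something.else.be", 3)); // "something.else.be"
console.log(lastLabels("else.be", 2));                // "else.be"
```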
If you want to allow only specific top-level domain names, you can write a regular expression like this:
[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]
You can also add more domain names from this link:
https://www.icann.org/resources/pages/tlds-2012-02-25-en
The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:
(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|
It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.
If you need to be more specific:
/\.(?:nl|se|no|es|milru|fr|es|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/
Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/