Regex match string in url - regex

http://example.com/ru/regiony/
http://example.com/regiuni/
https://example.com/obshestvo/regiony-zastavili-ego-zhiti
http://example.com/obshestvo/regiony
I need to check whether the URL is one of these: http://example.com/regiony/ or http://example.com/regioni/ or http://example.com/regiony or http://example.com/regioni
My regex:
(regiony\/?|regiuni\/?)
But this is not working: it matches all of them.

\/regi(?:uni|ony)(?=$|\/)
http://example.com/ru/regiony/
http://example.com/regiuni/
https://example.com/obshestvo/regiony-zastavili-ego-zhiti
http://example.com/obshestvo/regiony
\/regi(?:uni|ony) matches /regiony or /regiuni
(?=$|\/) end of string/line or / must follow (positive lookahead)
Alternatively use
\/regi(?:uni|ony)(?:$|\/)
to include the trailing slash (/regiuni/)
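As a quick check, here is a minimal JavaScript sketch that runs the lookahead pattern against the four sample URLs from the question. The variable names are made up for illustration.

```javascript
// Pattern from the answer above: "/regiony" or "/regiuni",
// followed by "/" or the end of the string.
const segmentPattern = /\/regi(?:uni|ony)(?=$|\/)/;

const sampleUrls = [
  "http://example.com/ru/regiony/",
  "http://example.com/regiuni/",
  "https://example.com/obshestvo/regiony-zastavili-ego-zhiti",
  "http://example.com/obshestvo/regiony",
];

// true when the segment is found, false otherwise
const segmentResults = sampleUrls.map((url) => segmentPattern.test(url));
sampleUrls.forEach((url, i) => console.log(`${segmentResults[i]}\t${url}`));
```

Only the third URL is rejected, because regiony is followed by a hyphen there. Note that the pattern does not care how deep in the path the segment sits, so /ru/regiony/ is accepted as well.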

If you are only interested in finding the segment after http://example.com/, then you can use this:
(region[iy][-/]?)|(regiuni[/]?)
This should cover all (six) possibilities: regioni, regiony, regioni/, regiony/, regiuni, regiuni/. Added regiony-, regioni-.

For your example data you might use a pattern like:
^http://.*/regi(?:on[iy]|uni)/?$
Or with the example url in the match:
^https?://example\.com.*/regi(?:on[iy]|uni)/?$
This uses $ to assert that the pattern regi(?:on[iy]|uni)/? occurs at the end of the line. You could use a delimiter other than /, such as ~, so you don't have to escape the forward slashes.
For example:
$pattern = '~^http://.*/regi(?:on[iy]|uni)/?$~';
$string = "http://example.com/ru/regiony/";
if (preg_match($pattern, $string)) {
echo "Valid: $string" . PHP_EOL;
}
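For comparison, a rough JavaScript version of the same anchored check. This is only a sketch: isRegionUrl is a made-up helper name, and I have assumed that both http and https should be accepted.

```javascript
// The URL must end with regioni, regiony or regiuni,
// optionally followed by one trailing slash.
const anchoredPattern = /^https?:\/\/example\.com.*\/regi(?:on[iy]|uni)\/?$/;

// Hypothetical helper, named here only for illustration.
const isRegionUrl = (url) => anchoredPattern.test(url);

console.log(isRegionUrl("http://example.com/regiony/"));   // true
console.log(isRegionUrl("http://example.com/regioni"));    // true
console.log(isRegionUrl("http://example.com/ru/regiony/")); // true, because .* allows extra path segments
console.log(isRegionUrl("https://example.com/obshestvo/regiony-zastavili-ego-zhiti")); // false
```

If only the top-level segment should be accepted, the .* after example\.com would have to go.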

Related

Regex to match and optionally remove a / at the end of an url

I have a form that should accept URLs in the following formats:
http://website.com/test/1
http://website.com/test/1/
My regex is currently this:
url.match(/website.com(.*)/);
I want the capture group to match the content and automatically drop the last "/" at the end of the URL, so that, whether or not there is a trailing /, it always returns "/test/1". How?
Try the regex below:
url.match(/website.com(.*)\b/);
It should work for your case. Let me know if it doesn't.
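Another way to get the same result, sketched here as an alternative rather than as what the answer above does, is to make the capture lazy and let an optional trailing slash fall outside the group. pathWithoutTrailingSlash is a hypothetical helper name.

```javascript
// Lazy capture (.*?) followed by an optional "/" anchored at the end,
// so a trailing slash is matched but never captured.
const trailingSlashPattern = /website\.com(.*?)\/?$/;

// Hypothetical helper: returns the captured path or null when there is no match.
function pathWithoutTrailingSlash(url) {
  const m = url.match(trailingSlashPattern);
  return m ? m[1] : null;
}

console.log(pathWithoutTrailingSlash("http://website.com/test/1"));  // "/test/1"
console.log(pathWithoutTrailingSlash("http://website.com/test/1/")); // "/test/1"
```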
Hmm... A regex can't really "remove" characters by itself. Say, for example, you have a field where the user can enter a URL. You can stop the user from adding a trailing "/", either by showing an error message such as "Remove the / at the end of your URL" or by simply not allowing it (with JavaScript, so that when they try to add it, the "/" gets removed).
I'll give you a regex that checks every part of your URL:
(https?:\/\/)?(([A-Za-z0-9]+)((\.[a-z]{1,3})))(\/\w+)\/(\d)\/?
Here's what this regex does:
First it checks whether http:// or https:// is present in the URL: (https?:\/\/)?
In https?, the ? makes the preceding "s" optional, so both http and https match (the same works for any character: a?, b?, 1? and so on). The ? after the whole group, (https?:\/\/)?, makes the entire http:// or https:// prefix optional, so a URL like example.com (without a scheme at the beginning) gets matched too.
Then we have this entire section of the expression: (([A-Za-z0-9]+)((\.[a-z]{1,3}))). Let's break it down a bit:
([A-Za-z0-9]+)
Here it matches one or more letters or digits, uppercase or lowercase (example: "website"), until it reaches ((\.[a-z]{1,3})), which matches a dot followed by one to three lowercase letters (example: .com).
So (([A-Za-z0-9]+)((\.[a-z]{1,3}))) would match, to mention a few examples, stackoverflow.com, twitter.com and google.se, but not example.online, because {1,3} allows only between 1 and 3 letters.
Then we have the last part: (\/\w+)\/(\d)\/?. First we have (\/\w+), which matches a slash followed by one or more word characters, for example "/test". \w matches a single word character (a letter, digit or underscore).
After that it matches a "/" (\/) and then a single digit (\d), for example "/1". At the end of the expression we have \/?, which allows an optional trailing slash.
In PHP this regex could be used like this:
$pattern = "/(https?:\/\/)?(([A-Za-z0-9]+)((\.[a-z]{1,3})))(\/\w+)\/(\d)\/?/";
$url = "https://example.com/user/1/";
if (preg_match($pattern, $url, $matches)) {
    echo $matches[1]; // Will echo "https://"
    echo $matches[2]; // Will echo "example.com"
    echo $matches[3]; // Will echo "example"
    echo $matches[4]; // Will echo ".com"
    echo $matches[5]; // Will echo ".com"
    echo $matches[6]; // Will echo "/user"
    echo $matches[7]; // Will echo "1"
    var_dump($matches); // Will dump the array
}
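The same breakdown can be tried out in JavaScript. This is just a sketch with made-up variable names; it prints the capture groups for the sample URL and shows the effect of the {1,3} limit on a longer top-level domain.

```javascript
// Same pattern as above, written as a JavaScript regex literal.
const urlPartsPattern =
  /(https?:\/\/)?(([A-Za-z0-9]+)((\.[a-z]{1,3})))(\/\w+)\/(\d)\/?/;

const partsMatch = "https://example.com/user/1/".match(urlPartsPattern);
// Drop index 0 (the whole match) and keep the seven capture groups.
const urlGroups = partsMatch.slice(1);
console.log(urlGroups);
// [ "https://", "example.com", "example", ".com", ".com", "/user", "1" ]

// ".online" has more than three letters, so the pattern finds no match here.
const longTldMatches = urlPartsPattern.test("https://example.online/user/1/");
console.log(longTldMatches); // false
```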
Hope this helps.
Edit: Of course, the regex can be written in other ways, and the syntax differs between languages. This is just an example of how I usually build my regexes: I structure them and break them down into parts, so I can more easily see what is what and think through everything I want to check for.

Write a regex for url match

I'm trying to write a regex for WordPress pretty permalinks.
I have the following URLs and need two matches:
1st: the last word between / and / before get/
2nd: the string that starts with get/
The URLs may look like these:
http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/
Here I need "yacht-crew" and "get/gulets/for/sale/"
http://localhost/akasia/testimonials/get/motoryachts/for/sale/
here I need "testimonials" and get/motoryachts/for/sale/
http://localhost/akasia/may/be/lots/of/seperator/but/ineed/last/get/ships/for/rent/
here I need "last" and get/ships/for/rent/
I catch 2nd part with
(.(get/(.)?))
but for first part there is no luck.
I would appreciate it if someone could help.
Regards
Deniz
I suggest the following:
([^\/]+?)\/(get\/.+)
https://regex101.com/r/uN6yH3/1
The concept is that you match non-slash characters up to the first slash (non-greedy) that is followed by the word "get" grouping it, and then just grab the rest as the second group.
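To see the two groups in action, here is a small JavaScript sketch (the question is about WordPress, so treat this purely as a way to test the pattern; splitPermalink is a hypothetical helper name).

```javascript
// Group 1: the path segment right before "/get/".
// Group 2: everything from "get/" to the end.
const permalinkPattern = /([^\/]+?)\/(get\/.+)/;

// Hypothetical helper: returns [segment, rest] or null when there is no match.
function splitPermalink(url) {
  const m = url.match(permalinkPattern);
  return m ? [m[1], m[2]] : null;
}

const permalinks = [
  "http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/",
  "http://localhost/akasia/testimonials/get/motoryachts/for/sale/",
  "http://localhost/akasia/may/be/lots/of/seperator/but/ineed/last/get/ships/for/rent/",
];

permalinks.forEach((url) => console.log(splitPermalink(url)));
// [ "yacht-crew", "get/gulets/for/sale/" ]
// [ "testimonials", "get/motoryachts/for/sale/" ]
// [ "last", "get/ships/for/rent/" ]
```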
I am assuming PHP.
$path = parse_url($url,PHP_URL_PATH);
$s = strrpos($path,'/');
$matches[] = substr($path,$s+1);

Matching URLs with other characters around

I need a regex pattern to match URLs in a complicated environment.
A URL would appear in this position:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
(That's just a sample URL)
I need to match the URL up to the colon; the colon and the code after it should be ignored. There are so many kinds of URLs out there, and I'm not experienced enough to create a pattern that matches everything from http:// to :
As I said, everything else should be ignored, left away, except the URL which I need to store in a variable.
Could someone help me create such a pattern? My tries were matching the URL above, but when I put in more complicated URLs, they wouldn't match.
This is the pattern I've created. It works with simple URLs, but not with the complicated ones:
http(s)?://[A-Za-z0-9.,/_-]+
I'm not very good at regex, I'm still learning.
Thank you.
This regex should do it for you.
\[url=(.*?):[a-zA-Z0-9]*\]
Run against your test data:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
This will return the URL in capture group 1.
Assuming PHP (since your test URL is for the PHP manual), you'd use this with preg_match like this:
$value = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]";
$pattern = "/\[url=(.*?):[a-zA-Z0-9]*\]/";
preg_match($pattern, $value, $matches);
echo $matches[1];
Output:
http://www.php.net/manual/en/function.preg-replace.php
This will also work against URLs which contain colons in them, such as:
http://www.php.net:8080/manual/en/function.preg-replace.php
http://www.php.net/manual/us:en/function.preg-replace.php
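The colon handling can be checked with a short JavaScript sketch (extractBbcodeUrl is a made-up name). The lazy (.*?) keeps growing until it reaches a colon that is followed only by letters or digits and then the closing bracket, which is why the colons after the scheme, before a port, or inside the path do not end the capture early.

```javascript
// Same pattern as in the PHP example, as a JavaScript regex literal.
const bbcodeUrlPattern = /\[url=(.*?):[a-zA-Z0-9]*\]/;

// Hypothetical helper: returns the URL inside [url=...:code] or null.
function extractBbcodeUrl(text) {
  const m = text.match(bbcodeUrlPattern);
  return m ? m[1] : null;
}

console.log(extractBbcodeUrl(
  "[url=http://www.php.net:8080/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]"
));
// http://www.php.net:8080/manual/en/function.preg-replace.php

console.log(extractBbcodeUrl(
  "[url=http://www.php.net/manual/us:en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]"
));
// http://www.php.net/manual/us:en/function.preg-replace.php
```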
How about this:
^(http(s)?:\/\/)?[^]^(^)^ ]+
The regex below will give you the URL part before the colon:
\[url=((http|https)?://)?[^\:]+

Regex to get a filename from a url

I am trying to write a regex to get the filename from a url if it exists.
This is what I have so far:
(?:[^/][\d\w\.]+)+$
So from the url http://www.foo.com/bar/baz/filename.jpg, I should match filename.jpg
Unfortunately, I match anything after the last /.
How can I tighten it up so it only grabs it if it looks like a filename?
The examples above fail to get the file name "file-1.name.zip" from this URL:
"http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
So I created my REGEX version:
[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))
Explanation:
[^/\\&\?]+ # file name - group of chars without URL delimiters
\.\w{3,4} # file extension - 3 or 4 word chars
(?=([\?&].*$|$)) # positive lookahead to ensure that file name is at the end of string or there is some QueryString parameters, that needs to be ignored
This one works well for me.
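A quick way to try this pattern, sketched in JavaScript with a hypothetical helper name (fileNameFromUrl):

```javascript
// File name: a run of characters without URL delimiters, then a dot and a
// 3 or 4 character extension, followed by the end of the string or by
// remaining query string parameters.
const fileNamePattern = /[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))/;

// Hypothetical helper: returns the matched file name or null.
function fileNameFromUrl(url) {
  const m = url.match(fileNamePattern);
  return m ? m[0] : null;
}

console.log(fileNameFromUrl(
  "http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
)); // file-1.name.zip
console.log(fileNameFromUrl("http://www.foo.com/bar/baz/filename.jpg")); // filename.jpg
```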
(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)
(?:.+\/)(.+)
Select all up to the last forward slash (/), capture everything after this forward slash. Use subpattern $1.
Non Pcre
(?:[^/][\d\w\.]+)$(?<=\.\w{3,4})
Pcre
(?:[^/][\d\w\.]+)$(?<=(?:.jpg)|(?:.pdf)|(?:.gif)|(?:.jpeg)|(more_extension))
Demo
Since you are testing with regexpal.com, which is based on JavaScript (whose regex engine did not support lookbehind at the time), try this instead:
(?=\w+\.\w{3,4}$).+
I'm using this:
(?<=\/)[^\/\?#]+(?=[^\/]*$)
Explanation:
(?<=): positive look behind, asserting that a string has this expression, but not matching it.
(?<=/): positive look behind for the literal forward slash "/", meaning I'm looking for an expression that is preceded by a forward slash, without including the slash in the match.
[^/\?#]+: one or more characters which are not either "/", "?" or "#", stripping search params and hash.
(?=[^/]*$): positive look ahead for anything not matching a slash, then matching the line ending. This is to ensure that the last forward slash segment is selected.
Example usage:
const urlFileNameRegEx = /(?<=\/)[^\/\?#]+(?=[^\/]*$)/;
const testCases = [
"https://developer.mozilla.org/en-US/docs/Web/API/MutationObserverInit#yo",
"https://developer.mozilla.org/static/fonts/locales/ZillaSlab-Regular.subset.bbc33fb47cf6.woff2",
"https://developer.mozilla.org/static/build/styles/locale-en-US.520ecdcaef8c.css?is-nice=true"
];
testCases.forEach(testStr => console.log(`The file of ${testStr} is ${urlFileNameRegEx.exec(testStr)[0]}`))
It might work as well:
(\w+\.)+\w+$
You know what your delimiters look like, so you don't need a regex. Just split the string. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $url = "http://www.foo.com/bar/baz/filename.jpg";
# Split on "/" and take the last piece as the candidate filename.
my @url_parts = split /\//, $url;
my $filename = $url_parts[-1];
if (index($filename, ".") > 0) {
    print "It appears as though we have a filename of $filename.\n";
}
else {
    print "It seems as though the end of the URL ($filename) is not a filename.\n";
}
Of course, if you need to worry about specific filename extensions (png,jpg,html,etc), then adjust appropriately.
> echo "http://www.foo.com/bar/baz/filename.jpg" | sed 's/.*\/\([^\/]*\..*\)$/\1/g'
filename.jpg
Assuming that you will be using javascript:
var fn=window.location.href.match(/([^/])+/g);
fn = fn[fn.length-1]; // get the last element of the array
alert(fn.substring(0,fn.indexOf('.')));//alerts the filename without its extension
Here is the code you may use:
\/([\w.][\w.-]*)(?<!\/\.)(?<!\/\.\.)(?:\?.*)?$
The names "." and ".." are not treated as file names.
You can play with this regexp here: https://regex101.com/r/QaAK06/1/
In case you are using the JavaScript URL object, you can use the pathname combined with the following RegExp:
.*\/(.[^(\/)]+)
Benefit:
It matches anything at the end of the path, but excludes a possible trailing slash (as long as there aren't two trailing slashes)!
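A minimal sketch of that combination, assuming an environment where the URL constructor is available (browsers and Node.js); lastPathSegment is a made-up helper name.

```javascript
// Applied to the pathname only: greedy .* runs to the last "/" that is
// still followed by at least two non-slash characters, which are captured.
const lastSegmentPattern = /.*\/(.[^(\/)]+)/;

// Hypothetical helper: returns the last path segment or null.
function lastPathSegment(href) {
  const { pathname } = new URL(href);
  const m = pathname.match(lastSegmentPattern);
  return m ? m[1] : null;
}

console.log(lastPathSegment("http://www.foo.com/bar/baz/filename.jpg?x=1")); // filename.jpg
console.log(lastPathSegment("http://www.foo.com/bar/baz/dir/"));             // dir
```

Because pathname excludes the query string and the hash, they never reach the regex.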
Try this one instead:
(?:[^/]*+)$(?<=\..*)
This worked for me. Whether or not the name contains a '.', it takes the last part of the URL:
\/(\w+)[\.|\w]+$

Backbone.js route using regex - Matching a URL that does not end with a given string

I have to create a route using regex that matches a URL which does not end with a particular word say 'submit'. For example -
/login/submit ==> does not match
/login/abcsubmit ==> does not match
/abc/xyx => Matches
Use this regex:
((?!(.*?)/\w*submit).*)
as explained in http://backbonejs.org/#Router-route
this.route(/^((?!(.*?)\/\w*submit).*)$/, "functionName");
I tried the regex that @Nestenius provided, and it still matched the first two example URLs from your question. The reason is that the regex was not anchored to the start of the string.
You can still use his regex if you add a ^ anchor to the beginning, like so:
^((?!(.*?)/\w*submit).*)
Or you can use this shorter version:
^(?!.*submit).*
Both will match any string that does not contain "submit" in it.
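Since the question asks about URLs that do not end with "submit", it may be worth comparing the shorter pattern with a variant whose lookahead is anchored to the end of the string. The second pattern below is my own sketch, not something taken from the answers above.

```javascript
// Rejects "submit" anywhere in the string (the shorter version above).
const noSubmitAnywhere = /^(?!.*submit).*/;
// Assumed variant: rejects only strings that END with "submit".
const noSubmitAtEnd = /^(?!.*submit$).*$/;

const routeSamples = ["/login/submit", "/login/abcsubmit", "/abc/xyx", "/submit/form"];

const anywhereResults = routeSamples.map((path) => noSubmitAnywhere.test(path));
const atEndResults = routeSamples.map((path) => noSubmitAtEnd.test(path));

console.log(anywhereResults); // [ false, false, true, false ]
console.log(atEndResults);    // [ false, false, true, true ]
```

The two only differ on paths such as /submit/form, where "submit" appears but not at the end.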