Regex to get a filename from a url - regex

I am trying to write a regex to get the filename from a url if it exists.
This is what I have so far:
(?:[^/][\d\w\.]+)+$
So from the url http://www.foo.com/bar/baz/filename.jpg, I should match filename.jpg
Unfortunately, I match anything after the last /.
How can I tighten it up so it only grabs it if it looks like a filename?

The examples above fail to get the file name "file-1.name.zip" from this URL:
"http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
So I wrote my own version of the regex:
[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))
Explanation:
[^/\\&\?]+ # file name - group of chars without URL delimiters
\.\w{3,4} # file extension - 3 or 4 word chars
(?=([\?&].*$|$)) # positive lookahead ensuring that the file name is either at the end of the string or followed by query-string parameters, which are ignored
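
A minimal sketch of how this pattern could be applied, written in Python as an assumption since the question names no language. The helper name filename_from_url is hypothetical and only wraps the regex from the explanation above.

```python
import re

# Pattern copied from the answer above; the helper name is illustrative only.
FILENAME_RE = re.compile(r"[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))")


def filename_from_url(url):
    """Return the first file-like name found in url, or None if there is none."""
    match = FILENAME_RE.search(url)
    return match.group(0) if match else None


print(filename_from_url("http://www.foo.com/bar/baz/filename.jpg"))
print(filename_from_url(
    "http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"))
print(filename_from_url("http://www.foo.com/bar/baz/"))
```

Because the extension is limited to 3 or 4 word characters, shorter extensions such as .js would not be picked up by this pattern.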

This one works well for me.
(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)

(?:.+\/)(.+)
Select all up to the last forward slash (/), capture everything after this forward slash. Use subpattern $1.

Non-PCRE:
(?:[^/][\d\w\.]+)$(?<=\.\w{3,4})
PCRE:
(?:[^/][\d\w\.]+)$(?<=(?:.jpg)|(?:.pdf)|(?:.gif)|(?:.jpeg)|(more_extension))
Since you are testing on regexpal.com, which is based on JavaScript (and doesn't support lookbehind), try this instead:
(?=\w+\.\w{3,4}$).+

I'm using this:
(?<=\/)[^\/\?#]+(?=[^\/]*$)
Explanation:
(?<=): positive look behind, asserting that a string has this expression, but not matching it.
(?<=/): positive look behind for the literal forward slash "/", meaning I'm looking for an expression which is preceded by a forward slash but does not include it.
[^/\?#]+: one or more characters that are none of "/", "?" or "#", which strips search params and the hash.
(?=[^/]*$): positive look ahead for anything not matching a slash, then matching the line ending. This is to ensure that the last forward slash segment is selected.
Example usage:
const urlFileNameRegEx = /(?<=\/)[^\/\?#]+(?=[^\/]*$)/;
const testCases = [
"https://developer.mozilla.org/en-US/docs/Web/API/MutationObserverInit#yo",
"https://developer.mozilla.org/static/fonts/locales/ZillaSlab-Regular.subset.bbc33fb47cf6.woff2",
"https://developer.mozilla.org/static/build/styles/locale-en-US.520ecdcaef8c.css?is-nice=true"
];
testCases.forEach(testStr => console.log(`The file of ${testStr} is ${urlFileNameRegEx.exec(testStr)[0]}`))

It might work as well:
(\w+\.)+\w+$

You know what your delimiters look like, so you don't need a regex. Just split the string. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $url = "http://www.foo.com/bar/baz/filename.jpg";
my @url_parts = split /\//, $url;
my $filename = $url_parts[-1];
if(index($filename,".") > 0 )
{
print "It appears as though we have a filename of $filename.\n";
}
else
{
print "It seems as though the end of the URL ($filename) is not a filename.\n";
}
Of course, if you need to worry about specific filename extensions (png,jpg,html,etc), then adjust appropriately.
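
The same split-instead-of-regex idea can be sketched in Python. This is not a line-by-line port: as an assumption, it first uses urllib.parse.urlsplit to drop the query string and fragment, which the Perl version does not do. The function name last_segment_if_filename is hypothetical.

```python
from urllib.parse import urlsplit


def last_segment_if_filename(url):
    """Return the last path segment when it contains a dot, else None."""
    path = urlsplit(url).path          # drops the query string and fragment
    segment = path.rsplit("/", 1)[-1]  # text after the final "/"
    # Same heuristic as the Perl check: a dot that is not the first character.
    return segment if segment.find(".") > 0 else None


print(last_segment_if_filename("http://www.foo.com/bar/baz/filename.jpg"))
print(last_segment_if_filename("http://www.foo.com/bar/baz/"))
```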

> echo "http://www.foo.com/bar/baz/filename.jpg" | sed 's/.*\/\([^\/]*\..*\)$/\1/g'
filename.jpg

Assuming that you will be using javascript:
var fn=window.location.href.match(/([^/])+/g);
fn = fn[fn.length-1]; // get the last element of the array
alert(fn.substring(0,fn.indexOf('.')));//alerts the filename

Here is the code you may use:
\/([\w.][\w.-]*)(?<!\/\.)(?<!\/\.\.)(?:\?.*)?$
The names "." and ".." are not treated as file names.
You can play with this regexp here: https://regex101.com/r/QaAK06/1/

In case you are using the JavaScript URL object, you can use the pathname combined with the following RegExp:
.*\/(.[^(\/)]+)
Benefit:
It matches anything at the end of the path, but excludes a possible trailing slash (as long as there aren't two trailing slashes)!
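
For readers outside JavaScript, here is a rough Python analogue. It assumes urllib.parse.urlsplit(url).path as a stand-in for the URL object's pathname; the function name last_path_segment is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Pattern copied from the answer above; "(" and ")" inside the class are literals.
LAST_SEGMENT_RE = re.compile(r".*\/(.[^(\/)]+)")


def last_path_segment(url):
    """Return the last non-empty path segment, ignoring one trailing slash."""
    path = urlsplit(url).path  # plays the role of URL.pathname
    match = LAST_SEGMENT_RE.match(path)
    return match.group(1) if match else None


print(last_path_segment("http://www.foo.com/bar/baz/filename.jpg"))
print(last_path_segment("http://www.foo.com/bar/baz/"))
```

Note that the capture group needs at least two characters, so a one-character segment such as /a does not match.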

Try this one instead:
(?:[^/]*+)$(?<=\..*)

This worked for me. Whether or not the name contains a '.', it takes the suffix of the URL:
\/(\w+)[\.|\w]+$

Related

Regex match string in url

http://example.com/ru/regiony/
http://example.com/regiuni/
https://example.com/obshestvo/regiony-zastavili-ego-zhiti
http://example.com/obshestvo/regiony
I need to check whether the URL is one of these: http://example.com/regiony/ or http://example.com/regioni/ or http://example.com/regiony or http://example.com/regioni
My regex:
(regiony\/?|regiuni\/?)
But this is not working: it matches all of them.
\/regi(?:uni|ony)(?=$|\/)
http://example.com/ru/regiony/
http://example.com/regiuni/
https://example.com/obshestvo/regiony-zastavili-ego-zhiti
http://example.com/obshestvo/regiony
\/regi(?:uni|ony) matches /regiony or /regiuni
(?=$|\/) end of string/line or / must follow (positive lookahead)
Alternatively use
\/regi(?:uni|ony)(?:$|\/)
to include the trailing slash (/regiuni/)
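
A short sketch, in Python by assumption, that runs the lookahead version of the pattern against the four sample URLs. The variable names are illustrative.

```python
import re

# Lookahead version from the answer above: "/regiony" or "/regiuni"
# must be followed by "/" or the end of the string.
SEGMENT_RE = re.compile(r"\/regi(?:uni|ony)(?=$|\/)")

urls = [
    "http://example.com/ru/regiony/",
    "http://example.com/regiuni/",
    "https://example.com/obshestvo/regiony-zastavili-ego-zhiti",
    "http://example.com/obshestvo/regiony",
]

# True where the URL contains the segment as a whole path component.
results = [bool(SEGMENT_RE.search(url)) for url in urls]

for url, matched in zip(urls, results):
    print(matched, url)
```

Only the third URL is rejected, because there the segment continues with a hyphen rather than ending.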
If you are only interested in finding the segment after http://example.com/, then you can use this:
(region[iy][-/]?)|(regiuni[/]?)
This should cover all (six) possibilities: regioni, regiony, regioni/, regiony/, regiuni, regiuni/. Added regiony-, regioni-.
For your example data you might use a pattern like:
^http://.*/regi(?:on[iy]|uni)/?$
Or with the example URL in the match:
^https?://example\.com.*/regi(?:on[iy]|uni)/?$
This will use $ to assert that this pattern regi(?:on[iy]|uni)/? occurs at the end of the line. You could use another delimiter for / like ~ so you don't have to escape the forward slashes.
For example:
$pattern = '~^http://.*/regi(?:on[iy]|uni)/?$~';
$string = "http://example.com/ru/regiony/";
if (preg_match($pattern, $string)) {
echo "Valid: $string" . PHP_EOL;
}

Write a regex for url match

I'm trying to write a regex for WordPress pretty permalinks.
I have the following URLs and need two matches:
1st: the last segment between / and / before get/
2nd: the string that starts with get/
The URLs may look like these:
http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/
Here I need "yacht-crew" and "get/gulets/for/sale/"
http://localhost/akasia/testimonials/get/motoryachts/for/sale/
here I need "testimonials" and get/motoryachts/for/sale/
http://localhost/akasia/may/be/lots/of/seperator/but/ineed/last/get/ships/for/rent/
here I need "last" and get/ships/for/rent/
I catch 2nd part with
(.(get/(.)?))
but I have had no luck with the first part.
I would appreciate it if someone could help.
Regards
Deniz
I suggest the following:
([^\/]+?)\/(get\/.+)
https://regex101.com/r/uN6yH3/1
The concept is that you match non-slash characters up to the first slash (non-greedy) that is followed by the word "get" grouping it, and then just grab the rest as the second group.
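
To make the two groups concrete, here is a small Python sketch (the language is an assumption; the answer itself is engine-agnostic). The helper name split_permalink is hypothetical.

```python
import re

# Group 1: the segment right before "/get/"; group 2: "get/" and the rest.
PERMALINK_RE = re.compile(r"([^\/]+?)\/(get\/.+)")


def split_permalink(url):
    """Return (segment_before_get, get_part) or None when there is no match."""
    match = PERMALINK_RE.search(url)
    return match.groups() if match else None


print(split_permalink(
    "http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/"))
print(split_permalink(
    "http://localhost/akasia/testimonials/get/motoryachts/for/sale/"))
```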
I am assuming PHP.
$path = parse_url($url,PHP_URL_PATH);
$s = strrpos($path,'/');
$matches[] = substr($path,$s+1);

Regex URI portion: Remove hyphens

I have to split URIs on the second portion:
/directory/this-part/blah
The issue I'm facing is that I have 2 URIs which logically need to be one
/directory/house-&-home/blah
/directory/house-%26-home/blah
This comes back as:
house-&-home and house-%26-home
So logically I need a regex to retrieve the second portion but also remove everything between the hyphens.
I have this, so far:
/[^(/;\?)]*/([^(/;\?)]*).*
(?<=directory\/)(.+?)(?=\/)
Does this solve your issue? This returns:
house-&-home and house-%26-home
If you want to get the result:
house--home
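then you should use a replace method. Because I am not sure what language you are using, here is an example in Java:
// Requires: import java.util.regex.*;
String str = "/directory/house-&-home/blah";
Matcher m = Pattern.compile("(?<=directory/)(.+?)(?=/)").matcher(str);
// Extract the segment first, then strip "&" (or its encoded form "%26") from it.
String result = m.find() ? m.group(1).replaceAll("&|%26", "") : null; // "house--home"
The replaceAll call replaces a given pattern (here the & symbol or its encoded form %26) with the empty string "".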
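
The extract-then-replace idea could also be sketched in Python. As an assumption that goes beyond the answer, this version first percent-decodes the segment with urllib.parse.unquote so that & and %26 are treated the same way; the function name normalized_segment is hypothetical.

```python
import re
from urllib.parse import unquote

# Same lookaround pattern as above: the segment right after "directory/".
SEGMENT_RE = re.compile(r"(?<=directory\/)(.+?)(?=\/)")


def normalized_segment(uri):
    """Return the second URI portion with any '&' (raw or encoded) removed."""
    match = SEGMENT_RE.search(uri)
    if match is None:
        return None
    decoded = unquote(match.group(1))  # "%26" becomes "&"
    return decoded.replace("&", "")


print(normalized_segment("/directory/house-&-home/blah"))
print(normalized_segment("/directory/house-%26-home/blah"))
```

Both URIs then map to the same key, which is what makes them comparable as one logical entry.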

Matching URLs with other characters around

I need a regex pattern to match URLs in a complicated environment.
A URL would appear in this form:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
(That's just a sample URL)
I need to match the URL up to the colon; the colon and the code after it should be ignored. There are so many URLs out there, and I'm not experienced enough to create a pattern that matches everything from http:// to the colon.
As I said, everything else should be ignored, left away, except the URL which I need to store in a variable.
Could someone help me create such a pattern? My tries were matching the URL above, but when I put in more complicated URLs, they wouldn't match.
This is the pattern I've created. It works with simple URLs, but not with the complicated ones:
http(s)?://[A-Za-z0-9.,/_-]+
I'm not very good at regex; I'm still learning.
Thank you.
This regex should do it for you.
\[url=(.*?):[a-zA-Z0-9]*\]
Run against your test data:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
This will return the URL in capture group 1.
Assuming PHP (since your test URL is for the PHP manual), you'd use this with preg_match like this:
$value = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]";
$pattern = "/\[url=(.*?):[a-zA-Z0-9]*\]/";
preg_match($pattern, $value, $matches);
echo $matches[1];
Output:
http://www.php.net/manual/en/function.preg-replace.php
This will also work against URLs which contain colons in them, such as:
http://www.php.net:8080/manual/en/function.preg-replace.php
http://www.php.net/manual/us:en/function.preg-replace.php
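
As a cross-check outside PHP, the same pattern can be tried in Python (an assumption; the answer itself uses preg_match). The helper name extract_url is hypothetical, and the two longer inputs wrap the colon-containing URLs above in the same tag format as the sample.

```python
import re

# The lazy group stops at the last colon that is followed only by
# alphanumerics and the closing bracket.
BBCODE_URL_RE = re.compile(r"\[url=(.*?):[a-zA-Z0-9]*\]")


def extract_url(text):
    """Return the URL inside a [url=...:code] tag, or None."""
    match = BBCODE_URL_RE.search(text)
    return match.group(1) if match else None


print(extract_url(
    "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]"
    "TEST[/url:32p0eixu]"))
print(extract_url(
    "[url=http://www.php.net:8080/manual/en/function.preg-replace.php:32p0eixu]"
    "TEST[/url:32p0eixu]"))
```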
How about this:
^(http(s)?:\/\/)?[^]^(^)^ ]+
The regex below will give you the URL part before the colon:
\[url=((http|https)?://)?[^\:]+

Regular expressions: Matching text up to last index of character

For example:
http://foobar.com/foo/bar/foobar.php
From this address, I need to extract the following:
http://foobar.com/foo/bar
I have tried with the following regex:
(?<namespace>.*)/.*?
but the returned value is
http:
Can anyone help? Thanks.
Try this:
^(?<namespace>.*)/[^/]+$
A quick explanation:
^ # the start of input
(?<namespace>.*)/ # zero or more chars followed by a '/' (which is the last '/')
[^/]+ # one or more chars other than '/'
$ # the end of input
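
A quick way to try this pattern is the Python sketch below. Python is an assumption here, and its re module spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adjusted accordingly. The helper name parent_of is hypothetical.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
PARENT_RE = re.compile(r"^(?P<namespace>.*)/[^/]+$")


def parent_of(url):
    """Return everything before the last '/', or None if nothing follows it."""
    match = PARENT_RE.match(url)
    return match.group("namespace") if match else None


print(parent_of("http://foobar.com/foo/bar/foobar.php"))
```

Because [^/]+ needs at least one character after the last slash, an input that ends in a slash does not match at all.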
I think a regex is overkill here. What programming language are you using? This would be how it's done in JavaScript.
var url = 'http://foobar.com/foo/bar/foobar.php'
url.split('/').slice(0,-1).join('/')
You could even use substr for some performance!
var url = 'http://foobar.com/foo/bar/foobar.php'
url.substr(0, url.lastIndexOf('/'))
The only reason I offered the array way is because I'm not sure of cross browser compatibility on lastIndexOf.
Try with this expression:
^(?<namespace>.*)/.*$