Remove a directory from a URL with regex - regex

I have a site with the following URL structure in places:
www.sitename.com/folder/sub_folder/item
What I need to do is remove the sub_folder part so that it displays as:
www.sitename.com/folder/item
Is there a regex expression for that?

I'm not sure there's a regular expression for it, but you can parse the URL instead: split on '/', drop the segment you don't want (here the second-to-last one), then rejoin with '/'.
So if string s="www.sitename.com/folder/sub_folder/item";
string[] parts=s.split('/');
string newUrl=parts[0]+"/"+parts[1]+"/"+parts[parts.length-1];
or something along those lines
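The split-and-rejoin idea can also be sketched in Python. This is only an illustration: the helper name drop_segment and the default of removing the second-to-last segment are assumptions, not something stated in the question.

```python
def drop_segment(url: str, index: int = -2) -> str:
    """Remove one segment (by default the second-to-last) from a '/'-separated URL."""
    parts = url.split("/")   # ['www.sitename.com', 'folder', 'sub_folder', 'item']
    del parts[index]         # drop the unwanted segment
    return "/".join(parts)   # rejoin what is left


print(drop_segment("www.sitename.com/folder/sub_folder/item"))
# www.sitename.com/folder/item
```

Passing a different index removes a different segment, so the same helper covers URLs where the unwanted folder sits elsewhere in the path.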

In Python, you could do:
import re
str1 = "www.sitename.com/folder/sub_folder/item"
# Match the last two segments and keep only the final one.
str2 = re.sub(r'/\w*(/\w*)$', r'\1', str1)

Related

Regex ignore first 12 characters from string

I'm trying to create a custom filter in Google Analytics to remove the query parts of the URL which I don't want to see. The URL has the following structure
[domain]/?p=899:2000:15018702722302::NO:::
I would like to create a regex which skips the first 12 characters (that is, up to /?p=899:2000) and replaces whatever comes after that with nothing.
So I made this one: https://regex101.com/r/Xgbfqz/1 (which could be simplified to .{0,12}) , but I actually would like to skip those and only let the regex match whatever is going to be after that, so that I'll be able to tell in Google Analytics to replace it with "".
The part in the url that is always the same is
?p=[3numbers]:[0-4numbers]
Thank you
Your regular expression:
\/\?p=\d{3}\:\d{0,4}(.*)
Tested in Golang RegEx 2 and RegEx101
It searches for /?p=###: followed by up to four digits, and captures the rest of the string to the right.
(extra) JavaScript:
paragraf='[domain]/?p=899:2000:15018702722302::NO:::'
var regex= /\/\?p=\d{3}\:\d{0,4}(.*)/;
var match = regex.exec(paragraf);
alert('The rest of the right side of the string: ' + match[1]);
You can simply use "[domain]/?p=899:2000:15018702722302::NO:::".substr(12) to drop the first 12 characters.
You can try this:
/\?p\=\d{3}:\d{0,4}
Which matches just this: ?p=[3numbers]:[0-4numbers]
Not sure about replacing though.
https://regex101.com/r/Xgbfqz/1
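For the replacement part, one option is to capture the prefix that must stay and discard everything after it. Here is a minimal Python sketch, assuming the URL shape from the question; the name strip_after_prefix is hypothetical, and the filter fields in Google Analytics have their own syntax that this does not model.

```python
import re

# Keep "/?p=" + 3 digits + ":" + up to 4 digits; everything after it is dropped.
PREFIX = re.compile(r'(/\?p=\d{3}:\d{0,4}).*')


def strip_after_prefix(url: str) -> str:
    # \1 puts the captured prefix back, so only the tail is replaced with nothing.
    return PREFIX.sub(r'\1', url)


print(strip_after_prefix("[domain]/?p=899:2000:15018702722302::NO:::"))
# [domain]/?p=899:2000
```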

RegEx to cut out URL

I'm trying to get a URL from a string of the following format:
RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH
I already tried some things, especially lookbehind/lookahead, which I had used successfully on another URL format (starts with https..., ends with .html; that worked).
But seems I'm too stupid to figure out the regex for the kind of string mentioned above. I just want the URL part from https.... to the end of the random last name. Is this even possible?
Any Ideas?
If you can guarantee that randomfirstname_randomlastname is all lowercase and RANDOMRUBBISH is all uppercase, you can use character classes [a-z] and [A-Z]. The language the regex is for will determine how to use these.
This example works in JavaScript:
var str = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH";
// The underscore is part of the class so the match does not stop after "randomfirstname".
var match = /https:\/\/www\.my-url\.com\/[a-z_]*/.exec(str);
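The same character-class idea can be sketched in Python, again assuming the name part is lowercase letters plus an underscore and the surrounding rubbish is uppercase. The variable names are only for illustration.

```python
import re

s = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH"

# [a-z_]* stops at the first character that is not a lowercase letter or underscore.
m = re.search(r'https://www\.my-url\.com/[a-z_]*', s)
url = m.group(0) if m else None

print(url)
# https://www.my-url.com/randomfirstname_randomlastname
```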

Regex URI portion: Remove hyphens

I have to split URIs on the second portion:
/directory/this-part/blah
The issue I'm facing is that I have 2 URIs which logically need to be one
/directory/house-&-home/blah
/directory/house-%26-home/blah
This comes back as:
house-&-home and house-%26-home
So logically I need a regex to retrieve the second portion but also remove everything between the hyphens.
I have this, so far:
/[^(/;\?)]*/([^(/;\?)]*).*
(?<=directory\/)(.+?)(?=\/)
Does this solve your issue? This returns:
house-&-home and house-%26-home
Here is a demo
If you want to get the result:
house--home
then you should use a replace method. Because I am not sure what language you are using, I will give my example in Java:
import java.util.regex.*;
String regex = "(?<=directory/)(.+?)(?=/)";
String str = "/directory/house-&-home/blah";
Matcher m = Pattern.compile(regex).matcher(str);
String result = m.find() ? m.group(1).replaceAll("&|%26", "") : ""; // "house--home"
This extracts the segment and then replaces the & symbol (or its URL-encoded form %26) with nothing "".
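Since %26 is just the percent-encoded form of &, another way to make the two URIs collapse into one is to decode first and then strip what sits between the hyphens. This is a rough Python sketch under that assumption; second_segment_key is a hypothetical helper name, and it assumes the wanted part is always the second path segment.

```python
import re
from urllib.parse import unquote


def second_segment_key(uri: str) -> str:
    decoded = unquote(uri)                  # "%26" becomes "&"
    segment = decoded.split("/")[2]         # ['', 'directory', 'house-&-home', 'blah']
    return re.sub(r'-.*-', '--', segment)   # drop whatever sits between the hyphens


print(second_segment_key("/directory/house-&-home/blah"))
print(second_segment_key("/directory/house-%26-home/blah"))
# both print: house--home
```

Segments with fewer than two hyphens, such as this-part, are returned unchanged.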

Regex to get a filename from a url

I am trying to write a regex to get the filename from a url if it exists.
This is what I have so far:
(?:[^/][\d\w\.]+)+$
So from the url http://www.foo.com/bar/baz/filename.jpg, I should match filename.jpg
Unfortunately, I match anything after the last /.
How can I tighten it up so it only grabs it if it looks like a filename?
The examples above fail to get the file name "file-1.name.zip" from this URL:
"http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
So I created my REGEX version:
[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))
Explanation:
[^/\\&\?]+ # file name - group of chars without URL delimiters
\.\w{3,4} # file extension - 3 or 4 word chars
(?=([\?&].*$|$)) # positive lookahead to ensure that file name is at the end of string or there is some QueryString parameters, that needs to be ignored
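To see the three parts working together, here is a small Python check of the pattern above. The capture group inside the lookahead is dropped because it is not needed for the match itself; find_filename is a hypothetical helper name.

```python
import re

# name without URL delimiters + "." + 3-4 word chars, followed by "?", "&" or end of string
FILENAME = re.compile(r'[^/\\&?]+\.\w{3,4}(?=[?&].*$|$)')


def find_filename(url):
    m = FILENAME.search(url)
    return m.group(0) if m else None


print(find_filename("http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"))
# file-1.name.zip
print(find_filename("http://www.foo.com/bar/baz/filename.jpg"))
# filename.jpg
```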
This one works well for me.
(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)
(?:.+\/)(.+)
Select all up to the last forward slash (/), capture everything after this forward slash. Use subpattern $1.
Non Pcre
(?:[^/][\d\w\.]+)$(?<=\.\w{3,4})
Pcre
(?:[^/][\d\w\.]+)$(?<=(?:.jpg)|(?:.pdf)|(?:.gif)|(?:.jpeg)|(more_extension))
Demo
Since you tested using regexpal.com, which is based on JavaScript (and doesn't support lookbehind), try this instead:
(?=\w+\.\w{3,4}$).+
I'm using this:
(?<=\/)[^\/\?#]+(?=[^\/]*$)
Explanation:
(?<=): positive look behind, asserting that a string has this expression, but not matching it.
(?<=/): positive look behind for the literal forward slash "/", meaning I'm looking for an expression that is preceded by a forward slash, without the slash itself being part of the match.
[^/\?#]+: one or more characters which are not either "/", "?" or "#", stripping search params and hash.
(?=[^/]*$): positive look ahead for anything not matching a slash, then matching the line ending. This is to ensure that the last forward slash segment is selected.
Example usage:
const urlFileNameRegEx = /(?<=\/)[^\/\?#]+(?=[^\/]*$)/;
const testCases = [
"https://developer.mozilla.org/en-US/docs/Web/API/MutationObserverInit#yo",
"https://developer.mozilla.org/static/fonts/locales/ZillaSlab-Regular.subset.bbc33fb47cf6.woff2",
"https://developer.mozilla.org/static/build/styles/locale-en-US.520ecdcaef8c.css?is-nice=true"
];
testCases.forEach(testStr => console.log(`The file of ${testStr} is ${urlFileNameRegEx.exec(testStr)[0]}`))
It might work as well:
(\w+\.)+\w+$
You know what your delimiters look like, so you don't need a regex. Just split the string. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $url = "http://www.foo.com/bar/baz/filename.jpg";
my @url_parts = split /\//, $url;
my $filename = $url_parts[-1];
if(index($filename,".") > 0 )
{
print "It appears as though we have a filename of $filename.\n";
}
else
{
print "It seems as though the end of the URL ($filename) is not a filename.\n";
}
Of course, if you need to worry about specific filename extensions (png,jpg,html,etc), then adjust appropriately.
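The same no-regex approach can be sketched in Python with the standard library, which also takes care of query strings and fragments. The function name filename_from_url is hypothetical, and treating "contains a dot" as "looks like a filename" is the same rough assumption the Perl version makes.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    path = urlsplit(url).path          # path only: no query string, no fragment
    name = posixpath.basename(path)    # text after the last "/"
    return name if "." in name else None


print(filename_from_url("http://www.foo.com/bar/baz/filename.jpg"))
# filename.jpg
print(filename_from_url("http://www.foo.com/bar/baz/"))
# None
```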
> echo "http://www.foo.com/bar/baz/filename.jpg" | sed 's/.*\/\([^\/]*\..*\)$/\1/g'
filename.jpg
Assuming that you will be using javascript:
var fn=window.location.href.match(/([^/])+/g);
fn = fn[fn.length-1]; // get the last element of the array
alert(fn.substring(0,fn.indexOf('.')));//alerts the filename
Here is the code you may use:
\/([\w.][\w.-]*)(?<!\/\.)(?<!\/\.\.)(?:\?.*)?$
The names "." and ".." are not treated as file names.
You can play with this regexp here: https://regex101.com/r/QaAK06/1/
In case you are using the JavaScript URL object, you can use the pathname combined with the following RegExp:
.*\/(.[^(\/)]+)
Benefit:
It matches anything at the end of the path, but excludes a possible trailing slash (as long as there aren't two trailing slashes)!
Try this one instead:
(?:[^/]*+)$(?<=\..*)
This worked for me; whether or not the URL contains a '.', it takes the suffix of the URL:
\/(\w+)[\.|\w]+$

Regular Expression for Google Analytics to determine page

I'm looking specifically for a regular expression that will grab the last term of a URL. This is not always a file name, it may not end in .html or .php, so I'll need to make sure that the regular expression is grabbing the last term from the URL.
Example:
I need to grab www.mydomain.com/anything_can_be_here/thankyoupage
I need to extract "thankyoupage" even when there can be any term preceding it in the URL.
Also note, there is no file extension on the thankyoupage URL segment.
This should do it:
/^(?:http:\/\/)?(?:[^\/]+)\/.*?\/([^\/]+)(?:\?.*)?$/
For example, the result of this:
m = 'http://example.com/where/is?the=pancakes/house'.match(/^(?:http:\/\/)?(?:[^\/]+)\/.*?\/([^\/]+)(?:\?.*)?$/);
is this array:
["http://example.com/where/is?the=pancakes/house", "is"]
And this:
m = 'http://example.com/where/is'.match(/^(?:http:\/\/)?(?:[^\/]+)\/.*?\/([^\/]+)(?:\?.*)?$/)
Results in:
["http://example.com/where/is", "is"]
And this:
m = 'http://example.com/'.match(/^(?:http:\/\/)?(?:[^\/]+)\/.*?\/([^\/]+)(?:\?.*)?$/)
Results in null.
And your component is in m[1] and that comes from ([^\/]+). The (?:[^\/]+) will take care of the hostname (and the userinfo if it happens to be present), the (?:\?.*)?$ part will take care of any trailing CGI arguments.
Depending on your URLs, you could replace ^(?:http:\/\/)? with ^http:\/\/.
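The pieces described above are easier to follow when the pattern is written out with one comment per part. This is a Python rendering of the same expression using re.VERBOSE; the names LAST_TERM and last_term are only for illustration.

```python
import re

LAST_TERM = re.compile(r"""
    ^(?:http://)?     # optional scheme
    (?:[^/]+)         # hostname (and userinfo, if present)
    /.*?/             # intermediate path, matched lazily
    ([^/]+)           # the captured component
    (?:\?.*)?$        # optional trailing query string
""", re.VERBOSE)


def last_term(url):
    m = LAST_TERM.match(url)
    return m.group(1) if m else None


print(last_term("http://example.com/where/is?the=pancakes/house"))  # is
print(last_term("http://example.com/where/is"))                     # is
print(last_term("http://example.com/"))                             # None
```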
If you are only feeding it URLs, something as simple as .*/(.*) should work.
That's assuming there is a '/' after the .com/.org/whatever;
otherwise you'll get everything after the http://
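A quick Python sketch shows both the normal case and the caveat. The greedy .* runs to the last slash, so when the only slashes are the ones in http://, the host is what gets captured. The name after_last_slash is hypothetical.

```python
import re


def after_last_slash(url):
    # The greedy ".*" backtracks to the LAST "/", so group 1 is whatever follows it.
    m = re.match(r'.*/(.*)', url)
    return m.group(1) if m else None


print(after_last_slash("www.mydomain.com/anything_can_be_here/thankyoupage"))
# thankyoupage
print(after_last_slash("http://example.com"))
# example.com
```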
What you need is the path name, which can be accessed using:
window.location.pathname;
Try this regex:
^http:\/\/.*/(.+)$
It will look for a string starting with http://, then go all the way to the last /, and store everything after the last / in the $1 variable.
The regexp:
/(\/([^\/]+))+/g
Take the 3rd element of the resulting array:
var a='http://www.host.com/aaa/bbb/ccc/dd.pp';
var regexp=/(\/([^\/]+))+/g;
var result=regexp.exec(a);
if( result && result.length==3) {
document.write('<p>'+result[2]+'</p>');
} else {
document.write('<p>Fail</p>');
}
Try this:
var str = "www.mydomain.com/other/other/this";
var path = /(?:https?:\/\/)?(?:www\.)?.*\/([^\/]+)/.exec(str)[1]; //this
Hope this is what you want
console.log(window.location.pathname.split('/').reverse()[0]);
Alright, figured it out myself, thanks anyway guys
/\/*\/thanks/
will match /thanks