Splitting URL in three parts in htaccess - regex

I'm trying to split any URL that would show up on my website into three parts:
1. Language (optional)
2. Hierarchical structure of the page (parents)
3. Current page
Right now I work with parts 1 and 3 (language and current page), but I need a way to let pages share a name when they have different parents, so that only the full URL has to be unique.
Here are the types of URL I may have:
(nothing)
en
en/test
en/parent/test
test
parent/test
ggparent/gparent/parent/test
I thought about extending my current directive:
RewriteRule ^(?:([a-z]{2})(?=\/))?.*(?:\/([\w\-\,\+]+))$ /index.php?lang=$1&page=$2 [L,NC]
to the following:
(?:([a-z]{2})(?=\/))?(.*)\/([^\/]*)?$
Which then I could translate to index.php?lang=$1&tree=$2&page=$3 but the difficulty I have is that the second capturing group captures the slash from the beginning.
Based on my search so far, I believe I can't dynamically return all the strings between slashes, and make the last one always come first, without repeating the same regex. So I thought I would capture everything between the language and the current page and process the tree in PHP.
However my current regex has some problems and I can't figure them out:
If the language is on its own, it doesn't get captured
The second group captures the slash between the language and the tree
Link to Regex101: https://regex101.com/r/ecHBQT/1

This likely does it: Split the URL by slash into lang, tree, and page at the proper place, with all three parts possibly empty:
RewriteRule ^([a-z]{2}\b)?\/?(?:\/?(.+)\/)?(.*)$ /index.php?lang=$1&tree=$2&page=$3 [L,NC]
Testcase in JavaScript using this regex:
const regex = /^([a-z]{2}\b)?\/?(?:\/?(.+)\/)?(.*)$/;
[
'',
'en',
'en/test',
'en/parent/test',
'test',
'parent/test',
'ggparent/gparent/parent/test'
].forEach(str => {
let rewritten = str.replace(regex, '/index.php?lang=$1&tree=$2&page=$3');
console.log('"' + str + '" ==>', rewritten);
})
Output:
"" ==> /index.php?lang=&tree=&page=
"en" ==> /index.php?lang=en&tree=&page=
"en/test" ==> /index.php?lang=en&tree=&page=test
"en/parent/test" ==> /index.php?lang=en&tree=parent&page=test
"test" ==> /index.php?lang=&tree=&page=test
"parent/test" ==> /index.php?lang=&tree=parent&page=test
"ggparent/gparent/parent/test" ==> /index.php?lang=&tree=ggparent/gparent/parent&page=test
Notes:
This assumes that a page and parent must not be exactly two chars long (you could specify an explicit or-list of all languages you have)
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
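To illustrate the note about an explicit language list, here is a minimal Python sketch (not htaccess) of the same split. The LANGS tuple and the split_url helper are assumptions made up for this example. With an explicit list, a two-letter parent such as "it" is no longer mistaken for a language. The tree is also split into its parents, which the question planned to do in PHP.

```python
import re

# Hypothetical list of supported languages; adjust to your site.
LANGS = ("en", "de", "fr")

# Same shape as the rule above, but the language must be one of LANGS
# and must be followed by a slash or the end of the path.
PATTERN = re.compile(r"^(?:(%s)(?=/|$))?/?(?:(.+)/)?(.*)$" % "|".join(LANGS))


def split_url(path):
    """Return (lang, parents, page) for a path without a leading slash."""
    m = PATTERN.match(path)
    if m is None:  # e.g. a path containing a newline
        return None
    lang, tree, page = (g or "" for g in m.groups())
    parents = tree.split("/") if tree else []
    return lang, parents, page


print(split_url("en/parent/test"))  # ('en', ['parent'], 'test')
print(split_url("it/test"))         # ('', ['it'], 'test')
print(split_url("en"))              # ('en', [], '')
```

The same alternation can be placed in the RewriteRule pattern in place of [a-z]{2}.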

I hope I've understood your question right. You can try this regex:
^([a-z]{2}(?=\/|$))?(?:\/?(.+)\/)?(.*)
Regex demo.
This will match 3 groups: first the language (two characters), then the parents and the last group is last part of the URL (after /).

Related

Matching just the first and second block of an URL

I'm trying to do a regex to match just the second part of a URL and leave the rest behind
For example
https://example.com/first-part/second-part/third-part/?prop=2
result = https://example.com/alt/second-part/
How can I do this?
I'm able to match the first two parts, but when I match on "/" the pattern picks the last / instead of the one before it.
I can go the simple way like this:
RewriteRule ^(.*)first-part\/(.+)\/(.*)\/(.*)$ https://example.com/alt/$2 [R=301,L]
The problem is that if the URL is like this:
https://example.com/first-part/second-part/
Expected result: https://example.com/alt/second-part/
the rule won't match at all.
So I'm looking for a more generic alternative, that may match multiple scenarios giving the same result ultimately in the same format:
https://example.com/alt/second-part/
I only know exactly what the first-part is; I don't know how anything beyond the second-part will be formatted.
Taking into account @Eraklon's recommendation to avoid greedy matching, I found a solution:
RewriteRule ^first-part\/([^\/]+(\/)?)(.*) https://example.com/alt/$1 [R=301,L]
Can be checked here:
https://htaccess.madewithlove.be?share=8973fe68-f137-59a5-b27b-0cbbe3d842bc
It matches the first-part exactly with ^first-part/ and then enters the group:
([^\/]+(\/)?)
That checks for 1 or more chars that are not a slash /. When it finds the first slash it can be the next section of the URL or the end of the URL.
Not sure if this is the best but the idea is that it matches just 1 pattern for $1 that includes both the end slash and not-slash for the second-part block of the URL.
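As a rough check of that pattern outside Apache, here is a small Python sketch. It assumes the rule is tested against the path without the leading slash and without the query string, which is how RewriteRule sees the URL in a .htaccess file. The redirect function name is made up for the example.

```python
import re

# The pattern from the RewriteRule above, unchanged.
RULE = re.compile(r"^first-part\/([^\/]+(\/)?)(.*)")


def redirect(path):
    """Return the redirect target for a path, or None if the rule does not match."""
    m = RULE.match(path)
    if m is None:
        return None
    return "https://example.com/alt/" + m.group(1)


print(redirect("first-part/second-part/third-part/"))  # https://example.com/alt/second-part/
print(redirect("first-part/second-part/"))             # https://example.com/alt/second-part/
print(redirect("other/second-part/"))                  # None
```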
I haven't been able to remove the query string (the ?parameter=a part) from the URL.
So with this rule, a URL like:
https://example.com/first-part/second-part/third-part/?parameter=a
Will be
https://example.com/alt/second-part/?parameter=a
Fortunately, the parameters are not too bad, but I would have preferred the full solution.

Multi Taxonomy URL rewrite not working

I am trying to rewrite WP URL and here is the URL:
http://example.com/?job_listing_region=california&job_listing_category=wordpress
I want to change it as http://example.com/california/wordpress
I tried this:
add_rewrite_rule('([^/]*)/([^/]*)/?','job_listing_region=$matches[1]&job_listing_category=$matches[2]','top');
But it's not working. Sorry, I am not good at regex; it might be a small thing, but I am not able to find a solution. Thanks in advance.
Code
See regex in use here
Regex
\??\w+=([^&]+)&?
Replacement
$1/
Results
Input
http://example.com/?job_listing_region=california&job_listing_category=alcohol-abuse-programs
Output
http://example.com/california/alcohol-abuse-programs/
Explanation
Regex
\?? Match between zero and one of the ? character literally
\w+= Match any word character one or more times, followed by the = character literally (\w can be replaced with [a-zA-Z0-9_] if preferred/doesn't work in your regex flavour)
([^&]+) Capture into capture group 1 any character except the & character literally one or more times
&? Match between zero and one & character literally
Replacement
$1/ Matches the same text as most recently matched by the 1st capturing group, followed by a / literally
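The substitution explained above can be tried with a short Python sketch. Python writes the group reference as \1 rather than $1; the to_pretty function name is only an assumption for this example.

```python
import re

# Same regex as above: optional "?", a key, "=", the captured value, optional "&".
PAIR = re.compile(r"\??\w+=([^&]+)&?")


def to_pretty(url):
    """Replace each key=value pair in the query string with 'value/'."""
    return PAIR.sub(r"\1/", url)


url = "http://example.com/?job_listing_region=california&job_listing_category=alcohol-abuse-programs"
print(to_pretty(url))  # http://example.com/california/alcohol-abuse-programs/
```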
Using http://example.com/{job_listing_region}/{job_listing_category}/ is too broad - it would affect every single URL on your website, such as /wp-admin.
I'd recommend using http://example.com/jobs/{job_listing_region}/{job_listing_category}/ as your URL structure, in which case the rewrite rule would be set as follows:
add_rewrite_rule('^jobs/([^/]*)/([^/]*)/?','index.php?page_id=1234&job_listing_region=$matches[1]&job_listing_category=$matches[2]','top');
page_id should be set to the page ID of the page/post you'd like to route this to.
It's important to note that the rewrite might not be available until you view/save the Settings -> Permalinks page in the back end.
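To see which request paths the /jobs/ prefix lets through, here is a rough Python emulation of that rule. It is only a sketch of the matching step, not WordPress code. It assumes the second capture is meant to fill job_listing_category, as in the question, and the resolve function name is made up for the example.

```python
import re

# Same pattern as the rule above; page_id=1234 is the placeholder from the answer.
RULE = re.compile(r"^jobs/([^/]*)/([^/]*)/?")
TEMPLATE = "index.php?page_id=1234&job_listing_region={0}&job_listing_category={1}"


def resolve(request_path):
    """Return the internal query for a request path, or None if the rule does not apply."""
    m = RULE.match(request_path)
    if m is None:
        return None
    return TEMPLATE.format(*m.groups())


print(resolve("jobs/california/wordpress/"))
# index.php?page_id=1234&job_listing_region=california&job_listing_category=wordpress
print(resolve("wp-admin/"))  # None, so other URLs are left alone
```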
Thanks for the answers above; they helped me finally get to a solution.
When passing URL strings to WordPress, we need to register the variables in functions.php, and then read them with WordPress query vars (get_query_var) instead of PHP's $_GET.
As suggested by @athms above, I changed the URL structure.
Now "wordpress" is a WordPress page in which the query variables are captured.
So my URL is http://example.com/wordpress/?job_listing_region=california
In functions.php I registered these variables:
function custom_query_vars_filter($vars) {
    $vars[] = 'job_listing_region';
    return $vars;
}
add_filter( 'query_vars', 'custom_query_vars_filter' );
function custom_rewrite_tag1() {
    add_rewrite_tag('%job_listing_region%', '([^&]+)');
}
add_action('init', 'custom_rewrite_tag1', 10, 0);
Rewrite Rule in functions.php:
function custom_rewrite_rule3() {
    // The target variable must be the registered query var that the template reads.
    add_rewrite_rule('^wordpress/([^/]*)/?','index.php?page_id=35349&job_listing_region=$matches[1]','top');
}
add_action('init', 'custom_rewrite_rule3', 10, 0);
Here page_id is the ID of the page I created, i.e. "wordpress".
And in the page template for "wordpress" I captured the region using:
$region = get_query_var('job_listing_region');
Now you can pass this variable to your query.
So now you can start using this pretty URL:
http://example.com/wordpress/california
The last segment of the URL, california, is passed as a query variable and can be used in our template.

Regex get domain name from email

I am learning regex and am having trouble getting google from an email address.
String
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding, it ignores the #, then finds the string after it up to the . (dot) using (?=\.)
What did I do wrong?
[^#] means "match one symbol that is not a # sign". That is not what you are looking for: use a lookbehind (?<=#) for # and your (?=\.) lookahead for \. to extract the server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
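A quick Python sketch shows the difference between the two patterns. The addresses are written with the # separator exactly as they appear in this thread, and the first_label function name is an assumption for the example.

```python
import re

QUESTION = re.compile(r"[^#].+(?=\.)")      # pattern from the question
ANSWER = re.compile(r"(?<=#)[^.]+(?=\.)")   # lookbehind + lookahead version


def first_label(address):
    """Return the text between the separator and the next dot, or None."""
    m = ANSWER.search(address)
    return m.group(0) if m else None


address = "first.name#google.com"
print(QUESTION.search(address).group(0))  # first.name#google
print(first_label(address))               # google
```

The question's pattern starts at the first character that is not a #, and the greedy .+ then runs up to the last dot, which is why it returns nearly the whole address.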
Updated answer: Use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w
Regex explanation and demo on Regex101
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo#us.industries.com, foobar#tm.valves.net, and foo#ge.test.com
To anyone who came here wanting the sub domain as well (or is being cut off by it), here's the regex:
(?<=#)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.
I was working on getting the domain name from email addresses, and none of the answers did what I needed:
Not catch subdomains
Match country top-level domains (like .com.ar or .co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get the domain name alone (without the top-level domain, e.g. .com or .com.mx) by using lookahead assertions instead (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
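Here is a small Python check of the first pattern (the one that keeps the top-level domain). The addresses again use the # separator as written in this thread, the sample addresses other than test#ext.domain.com.mx are made up, and the registered_domain function name is an assumption for the example.

```python
import re

# First pattern from this answer: domain plus TLD, with .co/.com + country code handled.
DOMAIN = re.compile(r"[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$")


def registered_domain(address):
    """Return the domain with its top-level domain, without subdomains, or None."""
    m = DOMAIN.search(address)
    return m.group(0) if m else None


print(registered_domain("test#ext.domain.com.mx"))  # domain.com.mx
print(registered_domain("first.name#google.com"))   # google.com
print(registered_domain("user#shop.co.jp"))         # shop.co.jp
```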
Maybe not strictly a "full regex answer", but a more flexible option (in case the part before the # is not "first.last") would be using cut. The # must be quoted, otherwise the shell treats it as the start of a comment:
cut -d '#' -f 2 | cut -d . -f 1
The first cut isolates the part after the # and the second one gets what you want.
This also works for other kinds of email patterns: xxxx#server.com, xxx.yyy.zzz#server.com and so on...
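The same two-step split can be sketched in Python without any regex. The domain_label function and its sep parameter are assumptions for this example; sep defaults to the # separator used in this thread.

```python
def domain_label(address, sep="#"):
    """Return the first label after the separator, like the two cut calls above."""
    _, found, host = address.partition(sep)  # first cut: keep what follows the separator
    if not found:
        return None
    return host.split(".", 1)[0]             # second cut: keep what precedes the first dot


print(domain_label("first.name#google.com"))   # google
print(domain_label("xxx.yyy.zzz#server.com"))  # server
```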
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com,
foobar#tm.valves.net,
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side note: the lazy quantifier '?' on the subdomain group ({0,127}?) is currently needed for any of the cases with country codes (example: stackoverflow.co.uk).
It matches these, but does NOT capture each N-level subdomain in its own match group; it can only grab the 3rd level.
This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']
I used the regular expression '.*#+(.*)' to get the complete domain name: .*#+ skips every character up to and including the #, and the parenthesized group (.*) then captures the complete domain name, i.e. the rest of the string (everything except line-break characters).

Regex for URL to sites

I have two URLs with the patterns:
1. http://localhost:9001/f/
2. http://localhost:9001/flight/
I have a site filter which redirects to the respective sites if the regex matches. I tried the following regex patterns for the 2 URLs above:
http?://localhost[^/]/f[^flight]/.*
http?://localhost[^/]/flight/.*
Both URLS are getting redirected to the first site, as both URLs are matched by the first regex.
I have also tried http?://localhost[^/]/[f]/.* for the 1st URL. I can't see what I am missing. I feel that this regex should not accept anything other than "f", but it is allowing "flight" as well.
Please help me by pointing out the mistake I have made.
Keep things simple:
.*/f(/[^/]*)?$
vs
.*/flight(/[^/]*)?$
Adding ? before $ makes the trailing slash and the path term after it optional.
The first one will be matched by the following regex:
/^http:[\/]{2}localhost:9001\/f[^light]$/
The other one is rejected by that pattern and can be matched with the following regex:
/^http:[\/]{2}localhost:9001\/flight\/$/
Your regex has several issues: 1) p? means optional p (htt:// will match), 2) [^/] will only match : in your URLs since it will only capture 1 character (and you have a port number), 3) [^light] is a negated character class that means any character that is not l, i, g, h, or t.
So, if you want to only capture localhost URLs, you'd better use this regex for the 1st site:
http://localhost[^/]*/f/.*
And this one for the second
http://localhost[^/]*/flight/.*
Please also bear in mind that depending on where you use the regexps, your actual input may or may not include the protocol.
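The two corrected patterns can be checked with a short Python sketch. It assumes the filter matches the whole URL including the protocol; the site names and the pick_site function are made up for the example.

```python
import re

# The two patterns suggested above, tried in order.
SITES = [
    ("f-site", re.compile(r"http://localhost[^/]*/f/.*")),
    ("flight-site", re.compile(r"http://localhost[^/]*/flight/.*")),
]


def pick_site(url):
    """Return the name of the first site whose pattern matches the whole URL."""
    for name, pattern in SITES:
        if pattern.fullmatch(url):
            return name
    return None


print(pick_site("http://localhost:9001/f/"))       # f-site
print(pick_site("http://localhost:9001/flight/"))  # flight-site
```

Because /f/ requires a slash right after the f, the first pattern can no longer swallow /flight/.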
These should work for you:
http[s]{0,1}:\/\/localhost:[0-9]{4}\/f\/
http[s]{0,1}:\/\/localhost:[0-9]{4}\/flight\/
You can see it working here

Regex to Replace Spaces in Two Conditions with Isapi ReWrite

I am using Isapi Rewrite for IIS, and I want to make SEO friendly URLs for a dynamic product page.
I need to replace spaces differently in two query string parameters.
In the first param, \s should be replaced with +
In the second, all \s should be replaced with -
#seo. 2 conditions. split on _ delimiter.
RewriteRule ^Store/([^_]+)_([^/]+) Store/Details.aspx?id=$1&name=$2 [QSA,NC]
#replace spaces with + in first condition (doesn't work)
#RewriteRule ^Store/([^\w\s]+)\s(.+)$ Store/Details.aspx?id=$1+$2 [QSA, NC]
#replace spaces with dash in second condition ???
Examples
Store/NP SP_name name
//$1: NP+SP
//$2: name-name
// output: Store/NP+SP_name-name
Store/mn%2098%20765_name%20name%20name
//$1: mn+98+765
//$2: name-name-name
//output: Store/mn+98+765_name-name-name
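For reference, the target transformation in those examples can be written down as a small Python sketch. This is not ISAPI_Rewrite configuration, only an illustration of the expected result. It assumes the first underscore separates the two parts, and the seo_path function name is made up.

```python
from urllib.parse import unquote


def seo_path(path):
    """Replace spaces with '+' before the first underscore and with '-' after it."""
    prefix, _, rest = unquote(path).partition("/")  # decode %20, split off "Store"
    ident, _, name = rest.partition("_")            # split the two parts on the first "_"
    return f"{prefix}/{ident.replace(' ', '+')}_{name.replace(' ', '-')}"


print(seo_path("Store/NP SP_name name"))                   # Store/NP+SP_name-name
print(seo_path("Store/mn%2098%20765_name%20name%20name"))  # Store/mn+98+765_name-name-name
```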
I did something like that the other day, but it was a simpler task with only one type of replacement. Try using the following for a basic redirect (if it works, we'll think about a more complex, multiple-parameter scenario):
RewriteRule ^Store/(.+)\s([^_]+)_(.+)\s(.+) /Store/$1+$2_$3-$4 [NC,R=301,L]
Make sure you put it on top of the existing rewrite.