How can I handle virtual subdirectories with regex in app.yaml? - regex

I'd like to point all of my visitors to "single subdirectories" to one page, and all visitors to "double subdirectories" to another. E.g:
/foo/
/new/
/north/
/1-j4/
Would all point to 1.app, whereas
/foo/bar/
/new/york/
/north/west/
/1-j4/a_990/
Would all point to 2.app.
I figured I could do this with non-greedy regex matching, like so:
- url: /(.*?)/$
  script: 1.app
- url: /(.*?)/(.*?)/$
  script: 2.app
To my confusion, both /foo/ and /foo/bar/ resolve to script 1.app. Does the "lazy" regex force itself up to include the middle /, since that's the only way to get a match? How else can I do this? I have tried using (\w*?) but get the same result.

The .*? will still match through any amount of / because . matches any character but a line break char (by default). You need to base your regexps on a negated character class, [^/]*, that matches 0 or more chars other than /.
To match directories with one part, use ^([^/]*)/?$ and to match those with 2, use ^([^/]*)/([^/]*)/?$.
Note that if you plan to use the patterns in online Web testers, you will have to escape / in most of them as by default they use / symbol as a regex delimiter.
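A minimal Python sketch of the difference, assuming the leading slash stays in the path as it does in the app.yaml patterns from the question (so a / is prepended to the patterns above and the trailing slash is required). The names LAZY_ONE, ONE_PART, TWO_PARTS and classify are illustrative only.

```python
import re

# Lazy pattern from the question: "." can still match "/".
LAZY_ONE = re.compile(r"/(.*?)/$")

# Negated character classes cannot cross a "/".
ONE_PART = re.compile(r"^/([^/]*)/$")
TWO_PARTS = re.compile(r"^/([^/]*)/([^/]*)/$")


def classify(path):
    """Return 1 or 2 for one- or two-segment paths, otherwise None."""
    if ONE_PART.match(path):
        return 1
    if TWO_PARTS.match(path):
        return 2
    return None


for path in ["/foo/", "/foo/bar/", "/1-j4/a_990/", "/a/b/c/"]:
    print(path, classify(path))
```

The lazy pattern still matches /foo/bar/ because the engine extends .*? across the middle slash when that is the only way to reach the end of the string; the negated class rules that out.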

Yes, the (.*?) includes slashes, so will resolve to 1.app. If you put the 2.app handler first, it should do what you want:
- url: /(.*?)/(.*?)/$
  script: 2.app
- url: /(.*?)/$
  script: 1.app
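The effect of handler order can be sketched in Python. This is only a model, assuming handlers are tried top to bottom, the first matching pattern wins, and matching behaves like re.match; the route function and the two lists are hypothetical names, not part of App Engine.

```python
import re


def route(path, handlers):
    """Return the script of the first handler whose pattern matches."""
    for pattern, script in handlers:
        if re.match(pattern, path):
            return script
    return None


# Order from the question: the one-part pattern comes first.
original_order = [(r"/(.*?)/$", "1.app"), (r"/(.*?)/(.*?)/$", "2.app")]
# Order suggested in this answer: the two-part pattern comes first.
swapped_order = list(reversed(original_order))

for path in ["/foo/", "/foo/bar/"]:
    print(path, route(path, original_order), route(path, swapped_order))
```

With the swapped order, /foo/ falls through to 1.app and /foo/bar/ goes to 2.app. Deeper paths such as /a/b/c/ would still be routed to 2.app, because the lazy groups can absorb extra slashes.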

Related

URL regex validator HTML5

I am validating url on my form through regex.
^(?:http(s)?://)?[\w.-]+(?:.[\w.-]+)+[\w-._~:/?#[]#!\$&'()*+,;=.]+$
It validates all URL for example:
https://www.example.com
http://www.example.com
www.example.com
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
However it also validates URL like
www.google
www.example
www.example.
www.google.
which are not acceptable URLs.
I am not very proficient with regex. Please help me with what needs to be changed.
When using a regex in HTML5 pattern attribute you should escape characters very carefully, as those browsers that have ES6+ standard implemented might throw an exception when they "see" [\w\.-] (no need to escape dot, and once the pattern is compiled with u flag, it becomes an error).
Now, to fix the issue, you may add a (?!www\.[^.]+\.?$) lookahead after ^ to fail all inputs that start with www. and then have any 0 or more chars other than . and then an optional . at the end of the string.
You may use
^(?!www\.[^.]+\.?$)(?:https?:\/\/)?[\w.-]+(?:\.[\w.-]+)+[\w._~:/?#[\\\]#!$&'()*+,;=.-]+$
See the regex demo. Note I escaped both \ and ] in your pattern, I think you meant to match both (your original regex does not match \ with [\w\-\._~:/?#[\]#!\$&'\(\)\*\+,;=.]).
Note that the HTML5 pattern regex is anchored by default, you need no ^ and $ at the start/end:
pattern="(?!www\.[^.]+\.?$)(?:https?:\/\/)?[\w.-]+(?:\.[\w.-]+)+[\w._~:/?#[\\\]#!$&'()*+,;=.-]+"
But you may still keep them if you want.
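As a quick check outside the browser, the anchored pattern can be tried in Python. This is a sketch under two assumptions: Python's re module stands in for the browser's regex engine, and the literal [ inside the character class is escaped as \[ to avoid Python's nested-set warning. URL_RE and is_valid are illustrative names.

```python
import re

URL_RE = re.compile(
    r"^(?!www\.[^.]+\.?$)(?:https?://)?[\w.-]+(?:\.[\w.-]+)+"
    r"[\w._~:/?#\[\\\]#!$&'()*+,;=.-]+$"
)


def is_valid(url):
    """True if the whole string matches the URL pattern."""
    return URL_RE.match(url) is not None


accepted = [
    "https://www.example.com",
    "www.example.com",
    "example.com",
    "http://www.example.com/products?id=1&page=2",
    "http://www.example.com#up",
    "255.255.255.255",
]
rejected = ["www.google", "www.example", "www.example.", "www.google."]

for url in accepted + rejected:
    print(url, is_valid(url))
```

The lookahead only rejects inputs that begin with www. and have no further dot-separated label, so it leaves the rest of the original pattern's behaviour untouched.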

Regex: start with something and ends with whatever except something

I need a regex for Url rewrite module, to validate urls in such way:
1) spa/ - match
2) spa/some/url - match
3) spa/some-url - match
4) spa/some.js - no match
5) spa/some.css - no match
So, it should match, if url
a) starts with "spa"
b) ends with whatever except ".js" or ".css"
What I tried to test is ^(spa/)((?!.js)|(?!.css))$
but it's not working.
Thank you and sorry if it's duplicated.
Try this regex:
^spa\/((.+)\/)*.*(?<!\.js|\.css)$
with g and m flags set.
Please note that this regex allows several characters that urls are not supposed to have. I have tried to keep it simple. So, you might want to tune it a bit before using it.
You need negative-lookbehind for this.
Try this (you may need to modify it slightly)
^spa.*(?<!(\.js|\.css))$
^spa : string beginning with spa
.* : followed by any character(s)
(?<!(\.js|\.css))$ : not ending with .js or .css
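For anyone trying this in Python rather than in the rewrite module: Python's re requires fixed-width lookbehinds, so (?<!\.js|\.css) is rejected there because the two alternatives differ in length. A minimal sketch that expresses the same idea with two separate lookbehinds (SPA_RE and should_rewrite are hypothetical names):

```python
import re

# Two fixed-width negative lookbehinds instead of one alternation.
SPA_RE = re.compile(r"^spa/.*(?<!\.js)(?<!\.css)$")


def should_rewrite(url):
    """True for spa/ URLs that do not end in .js or .css."""
    return SPA_RE.match(url) is not None


for url in ["spa/", "spa/some/url", "spa/some-url", "spa/some.js", "spa/some.css"]:
    print(url, should_rewrite(url))
```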

Django url regex hit end of url: *./something

I am getting 404s on different URLs that end with the same string, and instead of creating multiple redirects I would like to catch them all on that last string. It always appears at the same position; the pattern goes like so:
/some-of-my-urls/the-same-string
No trailing slash there. I tried something like this:
url(r'^[a-zA-Z0-9_]+/the-same-string', redirect_func),
url(r'^./the-same-string', redirect_func),
But that doesn't work. It is probably obvious to somebody with more regex knowledge; I am not very advanced. Any ideas?
You may use a negated character class [^/] to match any char but / and quantify it with a + quantifier that matches 1 or more repetitions:
r'^[^/]+/the-same-string'
See the regex demo.
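The difference between the three patterns can be checked with plain re, without Django. This sketch assumes the path is matched without its leading slash, which is how Django's regex-based url() patterns see it; the variable names are illustrative.

```python
import re

path = "some-of-my-urls/the-same-string"

# First attempt from the question: the class has no "-", so it stops at "some".
first_try = re.match(r"^[a-zA-Z0-9_]+/the-same-string", path)
# Second attempt: "." matches exactly one character before the slash.
second_try = re.match(r"^./the-same-string", path)
# Negated character class: one or more characters that are not "/".
fixed = re.match(r"^[^/]+/the-same-string", path)

print(first_try, second_try, fixed)
```

Only the last pattern matches, because [^/]+ accepts the hyphens in the first path segment.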

Regex get domain name from email

I am learning regex and am having trouble getting google from email address
String
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding, it ignores # and finds the string after it, up to the . (dot), using (?=\.)
What did I do wrong?
[^#] means "match one symbol that is not a # sign". That is not what you are looking for. Use a lookbehind (?<=#) for # and your (?=\.) lookahead for \. to extract the server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
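A short Python comparison of the pattern from the question and the lookaround version, using the address from the question (the variable names are illustrative):

```python
import re

address = "first.name#google.com"

# Pattern from the question: [^#] consumes one character, then .+ runs
# greedily and backtracks to the last position followed by a dot.
asker = re.search(r"[^#].+(?=\.)", address).group()

# Lookbehind/lookahead version: start right after "#", stop before the dot.
fixed = re.search(r"(?<=#)[^.]+(?=\.)", address).group()

print(asker)
print(fixed)
```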
Updated answer: Use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w
Regex explanation and demo on Regex101
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo#us.industries.com, foobar#tm.valves.net, andfoo#ge.test.com
To anyone who came here wanting the sub domain as well (or is being cut off by it), here's the regex:
(?<=#)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.
As I was working to get the domain name of email addresses and none corresponded to what I needed:
To not catch subdomains
To match countries top domains (like .com.ar or co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
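The same pattern can be tried in Python as a sketch. DOMAIN_RE and registered_domain are hypothetical names, and the last two sample addresses are made up for illustration.

```python
import re

DOMAIN_RE = re.compile(r"[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$")


def registered_domain(address):
    """Return the domain plus its (possibly two-part) suffix, or None."""
    match = DOMAIN_RE.search(address)
    return match.group() if match else None


for address in ["test#ext.domain.com.mx", "first.name#google.com", "a#shop.co.jp"]:
    print(address, registered_domain(address))
```

The search scans from the left, so the first position where either alternative can reach the end of the string decides the result; that is why ext. is skipped in the first example.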
You can get the domain name alone (without the top-level domain, e.g. .com or .com.mx) by adding lookaround operators (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
Maybe not strictly a "full regex answer" but more flexible ( in case the part before the # is not "first.last") would be using cut:
cut -d '#' -f 2 | cut -d . -f 1
The first cut will isolate the part after # and the second one will get what you want.
This also works for other kinds of email patterns: xxxx#server.com, xxx.yyy.zzz#server.com, and so on.
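For comparison, the same split-based idea can be sketched in Python without any regex. The function name server_name is hypothetical, and the # separator is kept exactly as it appears in the addresses above.

```python
def server_name(address):
    """Mimic: cut -d '#' -f 2 | cut -d . -f 1"""
    after_separator = address.split("#")[1]  # second field, split on "#"
    return after_separator.split(".")[0]     # first field, split on "."


for address in ["first.name#google.com", "xxxx#server.com", "xxx.yyy.zzz#server.com"]:
    print(address, server_name(address))
```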
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
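The labelled-group idea can be sketched in Python on a much smaller scale. This is not the full pattern above: it is a deliberately reduced, hypothetical pattern (SIMPLE_URL_RE) that only separates protocol, host, port and path, and it uses Python's (?P<Name>...) spelling for named groups.

```python
import re

# Reduced illustration only; the group names mirror some of those above.
SIMPLE_URL_RE = re.compile(
    r"^(?P<Protocol>\w+://)?"   # optional scheme such as foo://
    r"(?P<Host>[\w.-]+)"        # host, including any subdomains
    r"(?::(?P<Port>\d+))?"      # optional :port
    r"(?P<Path>/[^?#]*)?"       # optional path, up to ? or #
)

url = "foo://subdomain.example.com:8042/over/there?name=ferret#nose"
parts = SIMPLE_URL_RE.match(url).groupdict()
print(parts)
```

groupdict() returns the labelled parts as a dictionary, which is what makes named groups convenient for extracting separate pieces.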
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com
foobar#tm.valves.net
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)
Matches these, but does NOT grab $NLevelSubdomains in a match group, can only grab 3rd level only.
This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> import re
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']
I used the regular expression '.*#+(.*)' to get the complete domain name: .* skips every character before the # (which is consumed by #+), and the parenthesized group (.*) then captures the complete domain name, i.e. the rest of the string (excluding line break characters).

Exclude regular expression match if it contains a string

I'm still learning regular expressions and I seem to be stuck.
I wanted to write a reg exp that matches URL paths like these that contain "bulk":
/bulk-category_one/product
/another-category/bulk-product
to only get the product pages, but not the category pages like:
/bulk-category_one/
/another-category/
So I came up with:
[/].*(bulk).*[/].+|[/].*[/].*(bulk).*
But there's pagination, so when I put the reg exp in Google Analytics, I'm finding stuff like:
/bulk-category/_/showAll/1/
All of them have
/_/
and I don't want any URL paths that contain
/_/
and I can't figure out how to exclude them.
I would go about it this way:
/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)
first part:
/ - match the slash
[^/\s]* - match everything that's not a slash and not a whitespace
bulk - match bulk literally
[^/]* - match everything that's not a slash
/ - match the slash
[^/\s]+ - match everything that's not a slash and not a whitespace
(?!/) - ensure there is not a slash afterwards (i.e. url has two parts)
The second part is more of the same, but this time 'bulk' is expected in the second part of the url not the first one.
If you need the word 'product' specifically in the second part of the url one more alternative would be required:
/[^/\s]*bulk[^/]*/[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*product[^/\s]*bulk[^/\s]*(?!/)
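A Python sketch of the first pattern, applied with re.fullmatch so that it has to cover the whole path. That anchoring is an assumption added here: with an unanchored search, backtracking can let the (?!/) lookahead succeed on a shorter prefix, as the last line shows. BULK_RE and is_product_page are hypothetical names.

```python
import re

BULK_RE = re.compile(
    r"/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)"
)


def is_product_page(path):
    """True if the whole path is a two-part URL containing 'bulk'."""
    return BULK_RE.fullmatch(path) is not None


paths = [
    "/bulk-category_one/product",
    "/another-category/bulk-product",
    "/bulk-category_one/",
    "/another-category/",
    "/bulk-category/_/showAll/1/",
]
for path in paths:
    print(path, is_product_page(path))

# Unanchored search on a made-up three-part path: a prefix still matches.
partial = BULK_RE.search("/bulk-cat/prod/x").group()
print(partial)
```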
If I apply that simple regex to a file FILE
egrep ".*bulk.*product" FILE
which contains your examples above, it only matches the 2 lines with bulk and product. We can, additionally, exclude '/_/':
egrep ".*bulk.*product" FILE | egrep -v "/_/"
Two invocations are often much easier to define and to understand than one big one-size-fits-all expression.
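The same two-step filtering can be sketched in Python: keep the paths that match, then drop the ones containing /_/. The function name product_pages is hypothetical, and the last sample path is made up to show the exclusion step.

```python
import re


def product_pages(paths):
    """Two passes: keep bulk...product paths, then drop paginated ones."""
    with_bulk = [p for p in paths if re.search(r"bulk.*product", p)]
    return [p for p in with_bulk if "/_/" not in p]


paths = [
    "/bulk-category_one/product",
    "/another-category/bulk-product",
    "/bulk-category_one/",
    "/another-category/",
    "/bulk-category/_/product/1/",
]
print(product_pages(paths))
```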