Regular Expressions - Parsing Domain Issues

I am trying to find the domain -- everything but the subdomain.
I have this regexp right now:
(?:[-a-zA-Z0-9]+\.)*([-a-zA-Z0-9]+(?:\.[a-zA-Z]{2,3})){1,2}
This works for things like:
domain.tld
subdomain.tld
But it runs into trouble with tld's like ".com.au" or ".co.uk":
domain.co.uk (finds co.uk, should find domain.co.uk)
subdomain.domain.co.uk (finds co.uk, should find domain.co.uk)
Any ideas?

I'm not sure this problem is "reasonably solvable". Mozilla maintains a list of 'public suffix' domains that is intended to help browser authors accept cookies only for domains under a single administrative control (e.g., to prevent someone from setting a cookie valid for *.co.uk. or *.union.aero.). It obviously isn't perfect: near the end, you'll find a long list of is-a-caterer.com-style domains, so foo.is-a-caterer.com couldn't set a cookie that would be used by bar.is-a-caterer.com, yet is-a-caterer.com is perfectly well a "domain" as you've defined it.
So, if you're prepared to use the list as provided, you could write a quick little parser that would know how to apply the general rules and exceptions to determine where in the given input string your "domain" comes, and return just the portion you're interested in.
I think simpler approaches are doomed to failure: some ccTLDs such as .ca don't use second-level domains, some such as .br use dozens, and some public suffixes, like lib.or.us, sit several levels deep, so that the "domain" is something like multnomah.lib.or.us. Unless you're using a curated list of which domains are public suffixes, you're doomed to being wrong for some non-trivial set of input strings.
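As a rough illustration of the "quick little parser" idea, here is a minimal Python sketch. The RULES set is only a hand-picked sample written in the list's rule syntax (plain rules, "*." wildcards, "!" exceptions), and the function name registrable_domain is made up for this example. A real program would load the full list that Mozilla publishes rather than hard-coding entries.

```python
# Hand-picked sample rules in public-suffix-list syntax (an assumption for
# this sketch); a real program would load the complete published list.
RULES = {"com", "uk", "co.uk", "au", "com.au", "ca",
         "us", "or.us", "lib.or.us", "*.ck", "!www.ck"}


def registrable_domain(hostname, rules=RULES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix_len = 1  # default: an unlisted TLD counts as a one-label suffix
    for i in range(len(labels)):  # longest candidate suffix first
        candidate = labels[i:]
        name = ".".join(candidate)
        if "!" + name in rules:
            # Exception rule wins: the suffix is the rule minus its first label.
            suffix_len = len(candidate) - 1
            break
        wildcard = ".".join(["*"] + candidate[1:])
        if name in rules or wildcard in rules:
            suffix_len = max(suffix_len, len(candidate))
    if len(labels) <= suffix_len:
        return None  # the input is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])


print(registrable_domain("subdomain.domain.co.uk"))  # domain.co.uk
```

With the sample rules, both domain.co.uk and subdomain.domain.co.uk come back as domain.co.uk, while co.uk on its own yields None.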

Related

Regex in @Path matches only 1 of 2 routes specified, resulting in 404

Here is what Dropwizard logs to the console in terms of configured resources and their paths:
INFO [07:07:13.741] i.d.jersey.DropwizardResourceConfig: The following paths were found for the configured resources:
DELETE /apps/affiliate/internal/v1/templates/ (aff.affiliate.http.internal.AffiliateURLTemplatesInternalAPIEndpoint)
GET /apps/affiliate/internal/v1/templates/ (aff.affiliate.http.internal.AffiliateURLTemplatesInternalAPIEndpoint)
POST /apps/affiliate/internal/v1/templates/ (aff.affiliate.http.internal.AffiliateURLTemplatesInternalAPIEndpoint)
GET /apps/affiliate/v1/generate-url (aff.affiliate.http.AffiliateEndpoint)
GET /apps/affiliate/v1/redirect-search-url (aff.affiliate.http.AffiliateEndpoint)
GET /openapi.{type:json|yaml} (io.swagger.v3.jaxrs2.integration.resources.OpenApiResource)
GET /{path: apps/affiliate/v1/redirect|api/affiliate/v1/redirect} (aff.affiliate.http.RedirectEndpoint)
The problem is with the last path, specified as a regular expression.
My expectation is that it should trigger for incoming requests to both /apps/affiliate/v1/redirect and /api/affiliate/v1/redirect.
However, visiting /apps/affiliate/v1/redirect results in a 404, but visiting /api/affiliate/v1/redirect results in a 200. How can I get my resource to respond to either of those paths?
The code is hard to provide but this is essentially the scaffolding (fwiw, all methods work/api works, I'm just having trouble having one of the methods respond to the regex (my actual problem)).
// AffiliateURLTemplatesInternalAPIEndpoint.kt
@Path("/apps/affiliate/internal/v1/templates")
@Produces(MediaType.APPLICATION_JSON)
public class AffiliateURLTemplatesInternalAPIEndpoint() : DropwizardResource() {
    @GET
    @Path("/")
    public fun methodA()

    @POST
    @Path("/")
    public fun methodB()

    @DELETE
    @Path("/")
    public fun methodC()
}

// AffiliateEndpoint.kt
@Path("/apps/affiliate/v1")
class AffiliateEndpoint() : DropwizardResource() {
    @GET
    @Path("generate-url")
    fun methodA()

    @GET
    @Path("redirect-search-url")
    fun methodB()
}

// RedirectEndpoint.kt
@Path("/{path: apps/affiliate/v1/redirect|api/affiliate/v1/redirect}")
@Produces(MediaType.APPLICATION_JSON)
class RedirectEndpoint() : DropwizardResource() {
    @GET
    fun methodA()
}

The 404 is indeed being correctly returned.
Why? JAX-RS’ URL Matching Algorithm.
It was only after @Paul Samsotha asked me to paste my code that I finally realized the reason for the 404. 🤦
The Dropwizard/Jersey output I was relying on shows all the routes it found, but leaves out critical context about how paths have been structured in the code. Due to the way JAX-RS has implemented route-matching sort and precedence rules, code structure is essential in determining which routes will be triggered. So in this case the helpful output ended up being mostly misleading.
Read Section 3.7 - Matching Requests to Resource Methods of JAX-RS Spec 3.0 if you dare, but the answers are there.
Also, Chapter 4 of Bill Burke's RESTful Java with JAX-RS 2.0 gives great insight into route matching behavior. Unfortunately it doesn't clarify an important distinction (the exact situation I got into), which is that you can't simply combine a resource's path and its methods' paths (like the output does) when applying JAX-RS URL matching rules. Actually, I went through a bunch of JAX-RS write-ups and none of them mentioned this distinction.
Instead, you first try to find a matching root resource class, then look at its resource methods. If you don't find a match at either the root or the method level, you must return a 404.
Still, I found the book to be a great resource for shining light on the spec, and it is much less intimidating than the spec.
Now to the actual explanation of the 404.
Jersey (which implements the JAX-RS spec), first collects all the paths associated with root resources:
/apps/affiliate/internal/v1/templates
/apps/affiliate/v1
/{path: apps/affiliate/v1/redirect|api/affiliate/v1/redirect}
It then applies its sorting and precedence logic according to the spec (paraphrased from Burke's book):
The primary key of the sort is the number of literal characters in the full URI matching pattern. The sort is in descending order.
If two or more patterns rank similar in the first, then the secondary key of the sort is the number of template expressions embedded within the pattern. This sort is in descending order.
Finally, if two or more patterns rank similar in the second, then the tertiary key of the sort is the number of non-default template expressions. A default template expression is one that does not define a regular expression.
When an incoming GET request to /apps/affiliate/v1/redirect arrives, both
/apps/affiliate/v1
/{path: apps/affiliate/v1/redirect|api/affiliate/v1/redirect}
match, but the first pattern takes precedence because it has the greatest number of literal characters that match (18 vs 1).
Now that a root resource is selected, Jersey looks at that root resource's methods and compiles a list of the available paths/HTTP methods that match the incoming request. A bit of an extra detail, but for pattern matching purposes at the method level, the root resource's path is concatenated with the resource method's path.
The following patterns are available to select from:
GET /apps/affiliate/v1/generate-url
GET /apps/affiliate/v1/redirect-search-url
Since the request was a GET to /apps/affiliate/v1/redirect, neither of the above routes matches. Hence my 404 :(.
It makes complete sense, but I got into this rabbit hole because my assumptions about routing rules and precedence from experience working with other routing libraries did not align with the actual JAX-RS specs. I expected the library to have a master list for each and every method available (much like the initial output from Dropwizard/Jersey) and for each request to run through sorting and precedence rules on that master list. Alas, that is not the case.
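To make the two-phase selection concrete, here is a small Python simulation. It is a heavily simplified sketch, not Jersey's implementation: the regexes are hand-translated from the @Path templates, only the first sort key (literal characters) is modelled, HTTP methods are ignored, and the names ROOTS, literal_chars and route are invented for this example.

```python
import re

# Root resources from the question: (path template, hand-translated regex,
# sub-paths of the resource methods). Assumptions made for this sketch.
ROOTS = [
    ("/apps/affiliate/internal/v1/templates",
     r"/apps/affiliate/internal/v1/templates", ["/"]),
    ("/apps/affiliate/v1",
     r"/apps/affiliate/v1", ["generate-url", "redirect-search-url"]),
    ("/{path: apps/affiliate/v1/redirect|api/affiliate/v1/redirect}",
     r"/(?:apps/affiliate/v1/redirect|api/affiliate/v1/redirect)", [""]),
]


def literal_chars(template):
    """Count the characters outside {...} template expressions."""
    return len(re.sub(r"\{[^}]*\}", "", template))


def route(request_path):
    """Return the winning root template, or None (i.e. a 404)."""
    # Phase 1: collect the root resources whose pattern matches the start of
    # the request, then keep only the one with the most literal characters.
    candidates = []
    for template, regex, methods in ROOTS:
        m = re.fullmatch(regex + r"(/.*)?", request_path)
        if m:
            rest = m.group(1) or ""
            candidates.append((literal_chars(template), template, rest, methods))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[0], reverse=True)
    _, template, rest, methods = candidates[0]
    # Phase 2: only the methods of the winning root are considered.
    for sub_path in methods:
        if rest.strip("/") == sub_path.strip("/"):
            return template
    return None


print(route("/apps/affiliate/v1/redirect"))  # None
print(route("/api/affiliate/v1/redirect"))
```

In this model /apps/affiliate/v1/redirect is claimed by /apps/affiliate/v1 in phase 1 (18 literal characters versus 1) and then finds no matching method, while /api/affiliate/v1/redirect only matches the regex root and succeeds.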

Case Insensitive Search parameters for API endpoint

I am working on a project that involves integrating the PUBG API. From my site, the player can look up stats using their player name, platform and season. One issue I am facing is that the player name has to be exact and is case sensitive. I assumed this to be the case from the beginning. However, after searching for the name on this site I found that it doesn't need the name to be in the exact case. Also, referring to this post from the PUBG Dev community here, I saw that it confirmed my initial assumption. So my question is: if the PUBG API requires names to be case sensitive, how can the linked site find the player even if the name provided is not in the exact, matching case? For example:
I looked up the player name MyCholula. From the PUBG API page for player lookup, it returns the proper value. When I tried mycholula, it doesn't and sends a 404. From the linked site above, both combinations seem to work. Now if spaces or other separators were involved in the name, it would be easy to convert it, assuming that separated words are all capitalized (a somewhat naive assumption, though). For this name, I don't see any way of converting mycholula to MyCholula. I also tried many other combinations on the linked site above (including different user names I got from my friends) to confirm that it actually returns the data as expected for any casing of the user name. I also tried it on other sites like this and it didn't work, just as it doesn't work from the PUBG DEV API page or from my page.
I am really confused as to how they are doing it. The only possible explanation I can come up with is that they have the player records stored in their database, from where they can perform an advanced regexp-based search to get the actual name. However, this sounds far-fetched since there are millions of players and it would require them to know all the player names and associated IDs. Also, as far as I know, it is not possible to use regex or other string manipulation to convert to the actual name because there can be many combinations (not an expert on regex so can't be definitive on this).
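For what it's worth, the "stored records" hypothesis would not need regex at all. The Python sketch below only illustrates that hypothesis and says nothing about how the linked site actually works; known_players, remember and resolve are made-up names. The idea is to key each player the site has already seen by a case-folded name and keep the exact-case name as the value.

```python
# Hypothetical local cache of players the site has already seen, keyed by a
# case-folded name and storing the exact-case name the API expects.
known_players = {}


def remember(exact_name):
    """Record a name that is known to be spelled with the correct case."""
    known_players[exact_name.casefold()] = exact_name


def resolve(typed_name):
    """Return the exact-case name for a typed name, or None if unseen."""
    return known_players.get(typed_name.casefold())


remember("MyCholula")
print(resolve("mycholula"))  # MyCholula
```

A lookup like this is a single dictionary (or indexed database column) access, so the size of the player base is less of an obstacle than it might seem. The hard part is getting the exact-case names into the store in the first place.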
Any help or suggestions will be greatly appreciated. Thanks.

Ignoring cookies list efficiently in NGINX reverse proxy setup

I am currently testing the microcache feature in an NGINX reverse proxy setup for dynamic content.
One big issue is sessions/cookies that need to be ignored; otherwise people will log on with random accounts on the site(s).
Currently I am ignoring popular CMS cookies like this:
if ($http_cookie ~* "(joomla_[a-zA-Z0-9_]+|userID|wordpress_(?!test_)[a-zA-Z0-9_]+|wp-postpass|wordpress_logged_in_[a-zA-Z0-9]+|comment_author_[a-zA-Z0-9_]+|woocommerce_cart_hash|woocommerce_items_in_cart|wp_woocommerce_session_[a-zA-Z0-9]+|sid_customer_|sid_admin_|PrestaShop-[a-zA-Z0-9]+)")
{
    # set ignore variable to 1
    # later used in:
    # proxy_no_cache $IGNORE_VARIABLE;
    # proxy_cache_bypass $IGNORE_VARIABLE;
    # makes sense ?
}
However, this becomes a problem if I want to add more cookies to the ignore list. Not to mention that using too many "if" statements in NGINX is not recommended as per the docs.
My question is whether this could be done using a map. I saw that regex in map is different (or maybe I am wrong).
Or is there another way to efficiently ignore/bypass cookies?
I have searched a lot on stackoverflow, and whilst there are many different examples, I could not find something specific for my needs.
Thank you
Update:
After a lot of reading and "digging" on the internet (we might as well just say Google), I found some quite interesting examples.
However, I am very confused by these, as I do not fully understand the regex usage and I am afraid to implement them without understanding it.
Example 1:
map $http_cookie $cache_uid {
default nil;
~SESS[[:alnum:]]+=(?<session_id>[[:alnum:]]+) $session_id;
}
In this example I notice that the regex is very different from the ones used in "if" blocks. I don't understand why the pattern is not quoted and starts directly with a ~ sign.
I don't understand what [[:alnum:]]+ means. I searched for this but was unable to find documentation (or maybe I missed it).
I can see that the author was setting "nil" as the default; this will not apply in my case.
Example 2:
map $http_cookie $cache_uid {
default '';
~SESS[[:alnum:]]+=(?<session_id>[[:graph:]]+) $session_id;
}
Same points as in Example 1, but this time I can see [[:graph:]]+.
What is that ?
My Example (not tested):
map $http_cookie $bypass_cache {
    "~*wordpress_(?!test_)[a-zA-Z0-9_]+" 1;
    "~*wp-postpass|wordpress_logged_in_[a-zA-Z0-9]+" 1;
    "~*comment_author_[a-zA-Z0-9_]+" 1;
    "~*[a-zA-Z0-9]+_session" 1;
    default 0;
}
In my pseudo example, the regex must be wrong since I did not find any map cookie examples with such regex.
So once again my goal is to have a map style list of cookies that I can bypass the cache for, with proper regex.
Any advice/examples much appreciated.

What exactly are you trying to do?
The way you're doing it, by trying to blacklist only certain cookies from being cached, through if ($http_cookie …, is a wrong approach — this means that one day, someone will find a cookie that is not blacklisted, and which your backend would nonetheless accept, and cause you cache poisoning or other security issues down the line.
There's also no reason to use the http://nginx.org/r/map approach to get the values of the individual cookies, either — all of this is already available through the http://nginx.org/r/$cookie_ paradigm, making the map code for parsing out $http_cookie rather redundant and unnecessary.
Are there any cookies which you actually want to cache? If not, why not just use proxy_no_cache $http_cookie; to disallow caching when any cookies are present?
What you'd probably want to do is first have a spec of what must be cached and under what circumstances, only then resorting to expressing such logic in a programming language like nginx.conf.
For example, a better approach would be to see which URLs should always be cached, clearing out the Cookie header to ensure that cache poisoning isn't possible (proxy_set_header Cookie "";). Else, if any cookies are present, it may either make sense to not cache anything at all (proxy_no_cache $http_cookie;), or to structure the cache such that certain combination of authentication credentials are used for http://nginx.org/r/proxy_cache_key; in this case, it might also make sense to reconstruct the Cookie request header manually through a whitelist-based approach to avoid cache-poisoning issues.

Your own example is what you actually need:
map $http_cookie $bypass_cache {
    "~*wordpress_(?!test_)[a-zA-Z0-9_]+" 1;
    "~*wp-postpass|wordpress_logged_in_[a-zA-Z0-9]+" 1;
    "~*comment_author_[a-zA-Z0-9_]+" 1;
    "~*[a-zA-Z0-9]+_session" 1;
    default 0;
}
Basically, what you are saying here is that the bypass_cache value will be 1 if any of the regexes matches, else 0.
So as long as you get the patterns right, it will work. Only you can compile that list, since only you know which cookies should bypass the cache.
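One way to gain confidence in the patterns before touching the server is to try them against sample Cookie headers. The Python sketch below only approximates the map: nginx uses PCRE while this uses Python's re module (these particular patterns behave the same in both, as far as I can tell), the unbalanced ")" in the last pattern is dropped, and bypass_cache plus the sample cookie strings are invented for the test.

```python
import re

# Patterns from the map block; the last one is written without the
# unbalanced ")" that appears in the original snippet.
BYPASS_PATTERNS = [
    r"wordpress_(?!test_)[a-zA-Z0-9_]+",
    r"wp-postpass|wordpress_logged_in_[a-zA-Z0-9]+",
    r"comment_author_[a-zA-Z0-9_]+",
    r"[a-zA-Z0-9]+_session",
]


def bypass_cache(cookie_header):
    """Mimic the map: 1 if any pattern matches case-insensitively, else 0."""
    for pattern in BYPASS_PATTERNS:
        if re.search(pattern, cookie_header, re.IGNORECASE):
            return 1
    return 0  # corresponds to "default 0;"


print(bypass_cache("wordpress_logged_in_abc123=admin"))       # 1
print(bypass_cache("wordpress_test_cookie=WP+Cookie+check"))  # 0
```

The second call shows the negative lookahead (?!test_) at work: the WordPress test cookie alone does not trigger a bypass.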

Correct non existent domain name to nearest match

I'm looking for a service that returns the nearest match for a non-existent domain that was misspelled by the user. For example, if a user writes 'hotmail.con', I would send a query with that and obtain 'hotmail.com' as the result.

You've picked a hard problem. A domain label can be 1-63 characters long, shall contain characters [a-z0-9-], and shall not start with a hyphen. Brute forcing it is not an option. If the user types in hotmail.con you could search misspellings of it, which would try homail.com and hotmale.com, which may or may not be real domain names. Who is to know which misspelling is the correct one? The computer would have to return a list of options to the user: "Did you mean this domain name, or maybe that domain name?".
You might be interested in Peter Norvig's spelling corrector that Google uses to spell check queries that come in. It's one of the best spelling correctors on the planet.
http://norvig.com/spell-correct.html
Peter Norvig's spell checker should work provided you have an up-to-date body of correct domain names. You could create your own list on the fly, by keeping a list of which sites the user has been to, and using those as the body of domain names to check against. That way, when the user types "hotmail.con" it finds hotmail.com in your list. However, this does not protect the user from accidentally visiting "hotmale.com", because that is a valid site.
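A minimal sketch of that idea in Python, in the style of Norvig's edits1 function: generate every string one edit away from the input and keep those found in a list of known-good domains. KNOWN_DOMAINS is a made-up stand-in for the user's history, and the frequency weighting from Norvig's write-up is left out, so treat it as an illustration rather than a complete corrector.

```python
import string

# Hypothetical corpus of known-good domains, e.g. sites the user has visited.
KNOWN_DOMAINS = {"hotmail.com", "gmail.com", "stackoverflow.com"}
ALPHABET = string.ascii_lowercase + string.digits + "-."


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(domain):
    """Return known domains equal to the input or one edit away from it."""
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return [domain]
    return sorted(edits1(domain) & KNOWN_DOMAINS)


print(suggest("hotmail.con"))  # ['hotmail.com']
```

Returning a list rather than a single answer matches the "Did you mean...?" approach: when several known domains are one edit away, the user picks.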
Here is a stackoverflow question about how to get all the domain names:
https://stackoverflow.com/questions/4539155/how-to-get-all-the-domain-names
The best idea is to think outside the box and do it like Firefox does it. When the user starts typing hotmail.com, what they usually do is click a textbox, type "h", then "o". Have a dropdown come out with recently visited domain names that start with those letters.

Match all characters in group except for first and last occurrence

Say I request
parent/child/child/page-name
in my browser. I want to extract the parent, children as well as page name. Here are the regular expressions I am currently using. There should be no limit as to how many children there are in the url request. For the time being, the page name will always be at the end and never be omitted.
^([\w-]{1,}){1} -> Match parent (returns 'parent')
(/(?:(?!/).)*[a-z]){1,}/ -> Match children (returns /child/child/)
[\w-]{1,}(?!.*[\w-]{1,}) -> Match page name (returns 'page-name')
The more I play with this, the more I feel how clunky this solution is. This is for a small CMS I am developing in ASP Classic (:(). It is sort of like MVC routing paths, but instead of calling controllers and functions based on the URL request, I would be travelling down the hierarchy and finding the appropriate page in the database. The database uses the nested set model and is linked by a unique page name for each child.
I have tried using the split function to split on a / delimiter; however, I found I was nesting so many split statements together that it became very unreadable.
All said, I need an efficient way to parse out the parent, children as well as page name from a string. Could someone please provide an alternative solution?
To be honest, I'm not even sure if a regular expression is the best solution to my problem.
Thank you.

You could try using:
^([\w-]+)(/.*/)([\w-]+)$
And then access the three matching groups created using Match.SubMatches. See here for more details.
EDIT
Actually, assuming that you know that [\w-] is all that is used in the names of the parts, you can use ^([\w-]+)(/(?:.*/)?)([\w-]+)$ instead, which handles the no-child case as well: the middle group is then just /.
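Since the question doubts whether a regular expression is the right tool at all, here is a sketch of the plain-split alternative, written in Python for brevity (the same idea carries over to Split() in ASP Classic). The function name parse_path is made up. A single split yields an array whose first element is the parent, whose last element is the page name, and whose middle elements are the children, so no nesting of split calls is needed.

```python
def parse_path(path):
    """Split 'parent/child/.../page-name' into (parent, children, page)."""
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError("expected at least a parent and a page name")
    # First element is the parent, last is the page, the rest are children.
    return parts[0], parts[1:-1], parts[-1]


print(parse_path("parent/child/child/page-name"))
# ('parent', ['child', 'child'], 'page-name')
```

The children come back as an ordered list, which maps naturally onto walking down a nested-set hierarchy one level at a time.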
