Jax-RS overloading methods/paths order of execution - regex

I am writing an API for my app, and I am confused about how JAX-RS deals with certain scenarios.
For example, I define two paths:
@Path("user/{name : [a-zA-Z]+}")
and
@Path("user/me")
The first path that I specified clearly encompasses the second path since the regular expression includes all letters a-z. However, the program doesn't seem to have an issue with this. Is it because it defaults to the most specific path (i.e. /me and then looks for the regular expression)?
Furthermore, what happens if I define two regular expressions as paths with some overlap? Is there a default method which will be called?
Say I want to create three paths for three different methods:
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
Is this best practice/appropriate? How will it know which method to call?
Thank you in advance for any clarification.

This is covered in the spec, in "Matching Requests to Resource Methods":
Sort E using (1) the number of literal characters in each member as the primary key (descending order), (2) the number of capturing groups as a secondary key (descending order), (3) the number of capturing groups with non-default regular expressions (i.e. not ‘([^/]+?)’) as the tertiary key (descending order), ...
What happens is that the candidate methods are sorted by these keys, in the order given.
The first sort key is the number of literal characters. So for these three
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
if the requested URI is ../user/me, the last one will always be chosen, as it has the most literal characters (7, counting the /). The others only have 5.
Aside from ../user/me, anything else under ../user/.. will depend on the regex. In your case one matches only numbers and one matches only letters. There is no way for these two regexes to overlap, so each request will match accordingly.
Now just for fun, let's say we have
@Path("user/{name : .*}")
@Path("user/{id : \\d+}")
@Path("user/me")
If you look at the top two, we now have overlapping regexes. The first will match all numbers, as will the second one. So which one will be used? We can't make any assumptions. This is a level of ambiguity not specified and I've seen different behavior from different implementations. AFAIK, there is no concept of a "best matching" regex. Either it matches or it doesn't.
But what if we wanted the {id : \\d+} to always be checked first. If it matches numbers then that should be selected. We can hack it based on the specification. The spec talks about "capturing groups" which are basically the {..}s. The second sorting key is the number of capturing groups. The way we could hack it is to add another "optional" group
@Path("user/{name : .*}")
@Path("user/{id : \\d+}{dummy: (/)?}")
Now the latter has more capturing groups, so it will always be ahead in the sort. All the extra group does is allow an optional /, which doesn't really affect the API, but it ensures that if the request URI is all numbers, this path will always be chosen.
You can see a discussion with some test cases in this answer
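The three sort keys quoted from the spec can be sketched in a few lines of Python. This is only an illustration of the ordering, not how any JAX-RS implementation is written: the `sort_key` helper and the `TEMPLATE_VAR` pattern are hypothetical names, and the sketch assumes template regexes contain no nested braces.

```python
import re

# Hypothetical helper pattern: a template variable such as {id} or {id : \d+}.
# Assumes the embedded regex contains no braces of its own.
TEMPLATE_VAR = re.compile(r"\{\s*(\w+)\s*(?::\s*([^{}]+?)\s*)?\}")

def sort_key(path):
    """Return (literal chars, capturing groups, groups with a custom regex)."""
    groups = TEMPLATE_VAR.findall(path)
    literal_chars = len(TEMPLATE_VAR.sub("", path))
    custom_regex_groups = sum(1 for _name, regex in groups if regex)
    return (literal_chars, len(groups), custom_regex_groups)

paths = [
    "user/{name : .*}",
    "user/{id : \\d+}{dummy: (/)?}",
    "user/me",
]

# All three keys are sorted in descending order.
ranked = sorted(paths, key=sort_key, reverse=True)
for path in ranked:
    print(sort_key(path), path)
```

Under these assumptions "user/me" sorts first (7 literal characters), and the path with the extra dummy group sorts ahead of the `.*` path because it has two capturing groups instead of one.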

Inconsistent behaviour of my Regular Expression

I have a regular expression which is matching correctly when parameters are in their reversed order but not when they are in the intended order:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\({1}\s*([0-9a-z_ ,.]+)\s* as (?:pin|bit|byte|word|dword|float|sbyte|sword|sdword|Long|slong|double|string\*[0-9]+,*)
matches this text just like I want to:
proc HMI_SendNumber(Value As Sword, Object As String*10)
But if I reverse the order of the parameters I am looking for...:
proc HMI_SendNumber(Object As String*10,Value As Sword)
...I only get a match on the first one, i.e. Object. It only occurs when String* is present, so I guess it has to do with the *10 element of it. Is there a way around this?
No, you don't get "two matches", you only get one of:
Value As Sword, Object As String
See how *10 is missing? That's because [0-9a-z_ ,.]+ does not allow * to match. Likewise, your other text only has one match of:
Object As String
What do you really want? One match containing all parameters, or multiple matches, one for each parameter? Spelling out all the types in as (1|2|3...) is redundant, because your initial character class already matches them. Your whole regex can be reduced to:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.]+)\s*\)
if there were no String*10 data type. It can be fixed by including * in the character class, as in:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)
Beware that this still is only one match, not multiple matches. The match itself may have your desired multiple parameters.
Also this has nothing to do with Delphi. It's slightly Visual Basic at best.
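As a minimal sketch of the "one match, several parameters" point, the corrected pattern can be tried in Python (chosen here only for illustration; the question does not say which regex engine is in use). The `parameters` helper is a hypothetical name, and case-insensitive matching is assumed because the sample text mixes "proc" and "Proc".

```python
import re

# The corrected pattern from above, with * allowed inside the parameter list.
PROC = re.compile(
    r"^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)",
    re.IGNORECASE,  # assumption: matching is meant to be case-insensitive
)

def parameters(line):
    """Return the individual parameters from the single captured group."""
    match = PROC.search(line)
    if match is None:
        return []
    # One regex match; the individual parameters come from splitting on commas.
    return [part.strip() for part in match.group(1).split(",")]

print(parameters("proc HMI_SendNumber(Value As Sword, Object As String*10)"))
print(parameters("proc HMI_SendNumber(Object As String*10,Value As Sword)"))
```

Both parameter orders now yield the full list, because the split happens after the match rather than inside the regex.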

Regex Elastic Query not identifying all instances where a literal apostrophe is used

My goal is to identify all 'deliberate' misspellings of the words "whatsapp" and "whats'app" (where a and s are replaced by 4 and 5, and where an apostrophe may or may not have been used after the s/5). My query must exclude correctly spelled instances of whatsapp and whats'app.
...this query almost works;
wh[a4]t[s5]’?[a4]p+@&~(whats’?ap+)
It excludes correct spellings and identifies many misspellings where a's and s's are substituted for 4's and 5's and it allows for instances where only one or two p are used.
However, it is not identifying many instances when an apostrophe is used!
It will identify the likes of Wh4tsapp What5app Whats4pp Wh4t5app Wh4ts4pp and even Wh4ts’app but will not identify the likes of What5’app Whats’4pp Wh4t5’app Wh4ts’4pp
Any suggestions?

Understanding SpamAssassin HK_RANDOM regex

SpamAssassin has several rules that attempt to detect "random looking" values. For example:
/^(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
I understand that the first part of the regex prevents certain cases from matching:
(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)
However, I am not able to understand how the second part detects "randomness". Any help would be greatly appreciated!
/[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
It will match strings containing 5 consecutive consonants (excluding h and s for some reason) :
[bcdfgjklmnpqrtvwxz]{5}
or 5 consecutive vowels :
[aeiouy]{5}
or the same letter or pair of letters followed by 3 repetitions of itself (so present 4 times in total):
([a-z]{1,2})(?:\1){3}
Here are a few examples of strings it will match :
somethingmkfkgkmsomething
aiaioe
totototo
aaaa
It obviously can't detect randomness; however, it can identify patterns that don't often occur in meaningful strings and flag them as random-looking.
It is also possible that these patterns were constructed "from experience", after analysis of a number of emails crafted by spammers, and actually reflect the algorithms behind the tools used by these spammers or the process they use to create these emails (e.g. some degree of keyboard mashing?).
The bottom line is that you can't detect randomness in a single piece of data. What you can do, however, is try to detect purpose, and if you don't find any, assume that to the best of your knowledge it is random. SpamAssassin assumes a few rules about human communication (which might fit different languages better or worse: as is, it will flag a few forms of the French imperfect tense such as "échouaient"), and if the content doesn't match them it reports it as "random".
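The three alternatives described above can be checked directly. The following is a small Python sketch, not SpamAssassin's own code: it keeps only the second part of the rule (the negative lookahead is left out), and `looks_random` is a hypothetical helper name.

```python
import re

# Only the "random looking" alternatives from the rule; the lookahead
# exclusions from the first part are deliberately omitted in this sketch.
RANDOM_LOOKING = re.compile(
    r"[bcdfgjklmnpqrtvwxz]{5}"     # 5 consecutive consonants (no h, s)
    r"|[aeiouy]{5}"                # 5 consecutive vowels
    r"|([a-z]{1,2})(?:\1){3}",     # 1-2 letters followed by 3 repeats
    re.IGNORECASE,
)

def looks_random(text):
    """True if any of the three patterns occurs anywhere in the text."""
    return RANDOM_LOOKING.search(text) is not None

for sample in ["somethingmkfkgkmsomething", "aiaioe", "totototo", "aaaa",
               "échouaient", "hello"]:
    print(sample, looks_random(sample))
```

Of these samples, only "hello" is not flagged; "échouaient" is flagged because "ouaie" is a run of five vowels.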

How do I find strings that only differ by their diacritics?

I'm comparing three lexical resources. I use entries from one of them to create queries — see first column — and see if the other two lexicons return the right answers. All wrong answers are written to a text file. Here's a sample out of 3000 lines:
réincarcérer<IND><FUT><REL><SG><1> réincarcèrerais réincarcérerais réincarcérerais
réinsérer<IND><FUT><ABS><PL><1> réinsèrerons réinsérerons réinsérerons
macérer<IND><FUT><ABS><PL><3> macèreront macéreront macéreront
répéter<IND><FUT><ABS><PL><1> répèterons répéterons répéterons
The first column is the query, the second is the reference. The third and fourth columns are the results returned by the lexicons. The values are tab-separated.
I'm trying to identify answers that only differ from the reference by their diacritics. That is, répèterons répéterons should match because the only difference between the two is that the second part has an acute accent on the e rather than a grave accent.
I'd like to match the entire line. I'd be grateful for a regex that would also identify answers that differ by their gemination — the following two lines should match because martellerait has two ls while martèlerait only has one.
modeler<IND><FUT><ABS><SG><2> modelleras modèleras modèleras
marteler<IND><FUT><REL><SG><3> martellerait martèlerait martèlerait
The last two values will always be identical. You can focus on values #2 and 3.
The first part can be achieved by doing a lossy conversion to ASCII and then doing a direct string comparison. Note, converting to ASCII effectively removes the diacritics.
To do the second part is not possible (as far as I know) with a regex pattern. You will need to do some research into things like the Levenshtein distance.
EDIT:
This regex will match duplicate consonants. It might be helpful for your gemination problem.
([b-df-hj-np-tv-xz])\\1+
Which means:
([b-df-hj-np-tv-xz]) # Match only consonants
\\1+ # Match one or more times again what was captured in the first capture group
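Both ideas can be combined in a short Python sketch. It is one possible approach under stated assumptions rather than a complete solution: diacritics are removed via Unicode NFD decomposition (standing in for the "lossy conversion to ASCII"), doubled consonants are collapsed with the regex above, and the names `strip_diacritics`, `skeleton` and `differs_only_superficially` are made up for illustration. It also assumes the first three whitespace-separated fields are the query, the reference and the answer.

```python
import re
import unicodedata

# The doubled-consonant regex from above, with a single backslash
# because a Python raw string is used.
DOUBLED_CONSONANT = re.compile(r"([b-df-hj-np-tv-xz])\1+")

def strip_diacritics(word):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def skeleton(word):
    """Word without diacritics and with doubled consonants collapsed."""
    return DOUBLED_CONSONANT.sub(r"\1", strip_diacritics(word).lower())

def differs_only_superficially(line):
    """True if reference and answer differ only by diacritics or gemination."""
    _query, reference, answer = line.split()[:3]
    return reference != answer and skeleton(reference) == skeleton(answer)

sample = "répéter<IND><FUT><ABS><PL><1>\trépèterons\trépéterons\trépéterons"
print(differs_only_superficially(sample))
```

Lines for which the function returns True can then be written out whole, which covers the "match the entire line" requirement without a single large regex.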

name splitting regex

I'm trying to split a string (a person's name) into components: prefix (Dr, Mr, Miss, etc), given, middle, family, and suffix (Jr, III, etc...).
Prefixes and suffixes can be a known list of options.
Edge cases for double barreled family names like 'da Vinci' or 'di Caprio' don't really bother me too much. The da's and di's will just be dropped in the middle name, or if a middle is given (i.e. 4 names are found that don't match a prefix or suffix) then everything after the second name is dropped in the family name.
I'm thinking about writing the regex myself... but before I go and reinvent the wheel, I wonder if anyone has something that works I can use?
Thanks.
Here is a proposal in perl (I did not find a language or regex flavor requirement).
Perl supports non-capturing groups, e.g. "(?:\w+)", which I consider needed to stay below 10 captured groups.
I am using "\w+" almost everywhere, for simplicity. Names can therefore contain "_" and digits. If you do not like that, use "[[:alpha:]]+" instead.
perl -pe"s/(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(?:(\w+) (Jr\.|I+))|(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(\w+)/pre\1\6 give\2\7 middle\3\8 fam\4\9 post\5/"
For demonstration purposes, the code replaces, while inserting field names.
Please extract the requested regex and fill in the missing pres and posts.
What I consider the trick is to have one big alternative "|", which prefers matches with a postfix.
The fields are filled by using two groups each, one from the first, one from the second alternative. Only one of each pair is non-empty.
I tested with a text file containing combinations of
prefix present
postfix present
given present
middle present (assuming that more middles work too)
second middle present
All test cases have a family name.
"Superman II" and "Madonna" would both only have a family name; I hope that is OK. The superhero movie also gets a suffix.
"Dr. Who" has a prefix and a family name.
I.e. I ignored the "Di"s, as you permitted.
I consider the output plausible.
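For readers who prefer to extract the fields rather than run a substitution, here is a rough Python port of the same pattern. It is a sketch under the same simplifications as the perl one-liner (only "Dr."/"Mr." prefixes and "Jr."/"I+" suffixes, names separated by single spaces); `split_name` is a hypothetical helper, and matching against the whole string with `fullmatch` is an assumption of this port.

```python
import re

# Same two alternatives as the perl version: the first requires a suffix,
# the second has none. Groups 1-5 belong to the first, 6-9 to the second.
NAME = re.compile(
    r"(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(?:(\w+) (Jr\.|I+))"
    r"|(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(\w+)"
)

FIELDS = ("prefix", "given", "middle", "family", "suffix")

def split_name(name):
    """Return a dict of name parts, or None if the name does not match."""
    match = NAME.fullmatch(name)
    if match is None:
        return None
    g = match.groups()
    if g[8] is not None:
        # Second alternative matched: no suffix present.
        values = (g[5], g[6], g[7], g[8], None)
    else:
        values = g[0:5]
    return dict(zip(FIELDS, values))

for name in ["Dr. John Smith Jr.", "John Paul Smith", "Superman II",
             "Madonna", "Dr. Who"]:
    print(name, "->", split_name(name))
```

As in the perl version, the lists of prefixes and suffixes still need to be filled in with the known options.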