Case insensitive option with Regex.compile/2 - regex

I'm attempting to build a caseless regex using Regex.compile/2 but can't seem to find an example of how the option should be set.
Regex.compile("^(foo):?", :caseless)
** (FunctionClauseError) no function clause matching in Regex.compile/3
The following arguments were given to Regex.compile/3:
# 1
"^(foo):?"
# 2
:caseless
# 3
"8.41 2017-07-05"
(elixir) lib/regex.ex:140: Regex.compile/3

In short
According to the link you provided, options given as atoms need to be passed as a list, since you can provide several options at once. The following should work:
Regex.compile("^(foo):?", [:caseless])
In more detail
The type specification is as follows:
compile(source, options \\ "")
compile(binary(), binary() | [term()]) :: {:ok, t()} | {:error, any()}
The second line is the type specification in dialyzer and basically states that the function compile accepts two arguments:
The first one is a binary, corresponding to your "^(foo):?"
The second one is either a binary or a list containing several terms.
The return value will be either {:ok, t()} in the case of success, where t() is a %Regex{} struct, or {:error, any()} in the case of an error.
Coming back to the discussion of the second parameter, in the case of a list, you will need to leverage the various options as mentioned here.
In the case of binary, you can provide the second argument as a one letter abbreviation. So whereas the following will fail:
Regex.compile("^(foo):?", "caseless")
The following on the other hand succeeds:
Regex.compile("^(foo):?", "i")
You can get the mapping from the table of module modifiers I linked to above.
The main difference between the two approaches stems from the fact that Erlang's regex support, as powered by :re, builds on top of PCRE. In PCRE-style notation, the modifiers are written as single letters, such as i, u, etc. So you can combine several options in binary form as follows:
Regex.compile("^(foo):?", "iu")
which technically speaking should give you the equivalent of:
Regex.compile("^(foo):?", [:caseless, :unicode])
This allows you to communicate about Regex in Erlang and in Elixir through either the language specifications or the PCRE specifications.
Highly Advanced Details
As the OP rightly pointed out in the comments, there is some confusion as to why the Regex produced in two different ways (i.e. through options as a list vs. options as a binary) looks different.
To explain this discrepancy in more detail, consider the following scenarios:
r0 = Regex.compile!("(foo):?") ---> ~r/(foo):?/
r1 = Regex.compile!("(foo):?", "i") ---> ~r/(foo):?/i
r2 = Regex.compile!("(foo):?", [:caseless]) ---> ~r/(foo):?/ # ?????? WHERE IS THE `i` ?????
When confronted with this, one might gain the impression that the Elixir Regex is broken: r0 and r2 look identical, and both look different from r1.
However, functionality-wise, r2 behaves like r1, not like r0. Consider the following examples, shamelessly inspired by the OP's comment:
Regex.replace(r0, "Foo: bar", "") ---> "Foo: bar"
Regex.replace(r1, "Foo: bar", "") ---> " bar"
Regex.replace(r2, "Foo: bar", "") ---> " bar"
So how is this possible?
If you recall from above, e.g. pertaining to the explanation of the type t(), a Regex in Elixir is nothing but a struct under the hood.
A Regex may be presented beautifully in the following way: ~r/(foo):?/, but in reality it is nothing but something like this:
%Regex{ opts: opts, re_pattern: re_pattern, re_version: re_version, source: source }
Now, from all those struct fields, the only thing that counts at the end of the day is what is under: re_pattern. That will contain the fully compiled Regex with all the options. So we find that accordingly:
r1.re_pattern == r2.re_pattern
But
r0.re_pattern != r2.re_pattern
As far as the opts field is concerned, that is a container solely reserved for the options in binary format. So you will find that:
- r0.opts == r2.opts == ""
Whereas:
- r1.opts == "i"
These same opts fields are utilized to beautifully display the options at the end of Regex accordingly, so you will see:
~r/(foo):?/ for both r0 as well as r2
But you will see:
~r/(foo):?/i for r1
on account of the opts fields differing from each other.
It is for this reason that you could manually update the Regex if you would like it to look more consistent by doing this for instance:
%{r2 | opts: "i"} ---> ~r/(foo):?/i
Except for the re_pattern field, none of the other fields has any functional influence on the actual Regex. Those other fields are there for documentation purposes only.
Next, on the basis of the source code, you can see that binary options get translated to the list version of options because that is how Erlang regex engine, :re expects them to be.
Although it would not be difficult in itself, the Elixir core team has opted not to provide the reverse translation, i.e. from the list of modifier atoms to the equivalent PCRE binary option. As a result, the opts field stays empty when options are given as a list, and the Regex is rendered without its modifiers, which is the discrepancy shown above.
Above I have only delved into the mechanics that explain the discrepancy, however, whether such a discrepancy is warranted or not is another question in itself. I would be immensely grateful if someone with more insight than me could be so kind as to clarify if there is any way to defend such a discrepancy.
Conclusion
r0 = Regex.compile!("(foo):?") ---> ~r/(foo):?/
r1 = Regex.compile!("(foo):?", "i") ---> ~r/(foo):?/i
r2 = Regex.compile!("(foo):?", [:caseless]) ---> ~r/(foo):?/
r1 and r2 may look different, but they behave exactly the same.

Related

OpenModelica SimulationOptions 'variableFilter' not working with '^' exceptions

To reduce the size of my simulation output files, I want to give variable name exceptions instead of a list of many specific variables to the simulationsOptions/outputFilter (cf. OpenModelica Users Guide / Output) of my model. I found the regexp operator "^" to fulfill my needs, but it didn't work as expected. So I think that something is wrong with the interpretation of connected character strings when negated.
Example:
When I have any derivatives der(...) in my model and use variableFilter=der.* the output file will contain all the filtered derivatives. Since there are no other variables beginning with the character d, the same happens with variableFilter=d.*. For testing I also tried variableFilter=rde.* to confirm that every variable is filtered.
When I now try to exclude them with variableFilter=^der.*, =^rde.* or =^d.*, I get exactly the same result as without using ^. So the operator seems to be ignored in this notation.
When I instead use variableFilter=[^der].*, =[^rde].* or even =[^d].*, all the derivative variables I wanted to exclude are filtered from the output, but there is no difference between those three expressions. To me it seems that every character is interpreted on its own and not as part of a connected string.
Did I understand and use the regexp usage right or could this be a code bug?
Side/follow-up question: Where can I officially report this for software revision?
OpenModelica v.1.19.2 (64-bit)
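The difference between the two notations in the question can be illustrated with a small sketch. It uses Python's re module, not OpenModelica's regex engine, and the variable names are hypothetical. It also assumes the filter must match the whole variable name. Whether OpenModelica accepts the lookahead form in the last line is an assumption this sketch cannot confirm.

```python
import re

# Hypothetical variable names, chosen only for illustration.
names = ["der(x)", "dx", "e", "speed", "x"]

def keep(pattern):
    # Assumption: a name is kept only if the whole name matches the pattern.
    return [n for n in names if re.fullmatch(pattern, n)]

# Outside brackets, ^ is a start-of-string anchor, not a negation.
anchored = keep(r"^der.*")

# Inside brackets, ^ negates a set of single characters:
# [^der] means "one character that is not d, e, or r".
not_d_e_r = keep(r"[^der].*")
not_d = keep(r"[^d].*")

# Excluding the connected string "der" needs something like a negative lookahead.
no_der_prefix = keep(r"^(?!der).*")

print(anchored, not_d_e_r, not_d, no_der_prefix)
```

In this sketch, [^der].* and [^d].* only differ for names that start with e or r, which would explain why the three bracketed filters behave the same on a model without such names.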

RegEx - Order of OR'd values in capture group changes results

Visual Studio / XPath / RegEx:
Given Expression:
(?<TheObject>(Car|Car Blue)) +(?<OldState>.+) +---> +(?<NewState>.+)
Given Searched String:
Car Blue Flying ---> Crashed
I expected:
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
What I get:
TheObject = "Car"
OldState = "Blue Flying"
NewState = "Crashed"
Given new RegEx:
(?<TheObject>(Car Blue|Car)) +(?<OldState>.+) +---> +(?<NewState>.+)
Result is (what I want):
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
I conceptually get what's happening under the hood; the RegEx is putting the first (left-to-right) match it finds in the OR'd list into the <TheObject> group and then goes on.
The OR'd list is built at run time and cannot guarantee the order that "Car" or "Car Blue" is added to the OR'd list in <TheObject> group. (This is dramatically simplified OR'd list)
I could brute force it, by sorting the OR'd list from longest to shortest, but, I was looking for something a little more elegant.
Is there a way to make <TheObject> group capture the largest it can find in the OR'd list instead of the first it finds? (Without me having to worry about the order)
Thank you,
I would normally automatically agree with an answer like ltux's, but not in this case.
You say the alternation group is generated dynamically. How frequently is it generated dynamically? If it's every user request, it's probably faster to do a quick sort (either by longest length first, or reverse-alphabetically) on the object the expression is built from than to write something that turns (Car|Car Red|Car Blue) into (Car( Red| Blue)?).
The regex may take a bit longer (you probably won't even notice a difference in the speed of the regex) but the assembly operation may be much faster (depending on the architecture of the source of your data for the alternation list).
In a simple test of an alternation with 702 options, using the three methods below, the results are comparable for an option set like this one. None of these results take into account the time needed to build the string, which grows as the complexity of the string grows.
The options are all the same, just in different formats
zap
  zap
  yes
  xerox
  ...
  apple
yes
  zap
  yes
  xerox
  ...
  apple
xerox
  zap
  yes
  xerox
  ...
  apple
...
apple
  zap
  yes
  xerox
  ...
  apple
Using Google Chrome and JavaScript, I tried three (edit: four) different formats and saw consistent results for all between 0-2ms.
'Optimized factoring': a(?:4|3|2|1)?
Reverse alphabetical sorting: (?:a4|a3|a2|a1|a)
Factoring: a(?:4)?|a(?:3)?|a(?:2)?|a(?:1)?
All are consistently coming in at 0 to 2ms (the difference being what else my machine might be doing at the moment, I suppose).
Update: I found a way that you may be able to do this without sorting, using a lookahead like this: (?=a|a1|a2|a3|a4|a5)(.{15}|.{14}|.{13}|...|.{2}|.) where 15 is the upper bound, counting all the way down to the lower bound.
Without some restraints on this method, I feel like it can lead to a lot of problems and false positives. It would be my least preferred result. If the lookahead matches, the capture group (.{15}|...) will capture more than you'll desire on any occasion where it can. In other words, it will reach ahead past the match.
Though I made up the term Optimized Factoring in comparison to my Factoring example, I can't recommend my Factoring example syntax for any reason. Sorted would be the most logical, coupled with easier to read/maintain than exploiting a lookahead.
You haven't given much insight into your data but you may still need to sort the sub groups or factor further if the sub-options can contain spaces and may overlap, further diminishing the value of "Optimized Factoring".
Edit: To be clear, I am providing a thorough examination as to why no form of factoring is a gain here. At least not in any way that I can see. A simple Array.Sort().Reverse().Join("|") gives exactly what anyone in this situation would need.
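The sort-then-join approach can be sketched as follows. This is a minimal illustration in Python rather than .NET, so named groups use Python's (?P<name>...) syntax, and build_pattern is a hypothetical helper name. It assumes the option list is a plain list of literal strings.

```python
import re

def build_pattern(objects):
    # Longest option first, so "Car Blue" is tried before its prefix "Car".
    ordered = sorted(objects, key=len, reverse=True)
    # Escape each option so it is matched literally.
    alternation = "|".join(re.escape(o) for o in ordered)
    return re.compile(
        r"(?P<TheObject>" + alternation + r") +(?P<OldState>.+) +---> +(?P<NewState>.+)"
    )

# The input order of the options no longer matters.
sorted_match = build_pattern(["Car", "Car Blue"]).match("Car Blue Flying ---> Crashed")
sorted_result = sorted_match.groupdict()

# For comparison: the unsorted alternation from the question.
naive = re.compile(r"(?P<TheObject>Car|Car Blue) +(?P<OldState>.+) +---> +(?P<NewState>.+)")
naive_result = naive.match("Car Blue Flying ---> Crashed").groupdict()

print(sorted_result)
print(naive_result)
```

Sorting by length is enough here because a shorter option can only shadow a longer one when it is a prefix of it.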
In a backtracking regex engine such as .NET's, the | operator tries its alternatives from left to right and stops at the first one that lets the overall match succeed. We can't change this behaviour of the | operator.
So the solution is to avoid using | operator. Instead of (Car Blue|Car) or (Car|Car Blue), use (Car( Blue)?).
(?<TheObject>(Car( Blue)?)) +(?<OldState>.+) +---> +(?<NewState>.+)
Then the <TheObject> group will always be Car Blue in the presence of Blue.
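A quick way to check the factored form is the sketch below, written in Python (hence the (?P<name>...) group syntax) rather than .NET. The second input string is a made-up example without "Blue".

```python
import re

# The optional suffix is greedy, so " Blue" is consumed whenever it is present.
factored = re.compile(
    r"(?P<TheObject>Car( Blue)?) +(?P<OldState>.+) +---> +(?P<NewState>.+)"
)

with_blue = factored.match("Car Blue Flying ---> Crashed").group("TheObject")
without_blue = factored.match("Car Flying ---> Crashed").group("TheObject")

print(with_blue, "|", without_blue)
```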

Jax-RS overloading methods/paths order of execution

I am writing an API for my app, and I am confused about how Jax-RS deals with certain scenarios
For example, I define two paths:
@Path("user/{name : [a-zA-Z]+}")
and
@Path("user/me")
The first path that I specified clearly encompasses the second path since the regular expression includes all letters a-z. However, the program doesn't seem to have an issue with this. Is it because it defaults to the most specific path (i.e. /me and then looks for the regular expression)?
Furthermore, what happens if I define two regular expressions as the path with some overlap. Is there a default method which will be called?
Say I want to create three paths for three different methods:
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
Is this best practice/appropriate? How will it know which method to call?
Thank you in advance for any clarification.
This is in the spec in "Matching Requests to Resource Methods"
Sort E using (1) the number of literal characters in each member as the primary key (descending order), (2) the number of capturing groups as a secondary key (descending order), (3) the number of capturing groups with non-default regular expressions (i.e. not ‘([^ /]+?)’) as the tertiary key (descending order), ...
What happens is that the candidate methods are sorted by the ordered keys quoted above.
The first sort key is the number of literal characters. So for these three
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
if the requested URI is ../user/me, the last one will always be chosen, as it has the most literal characters (7, / counts). The others only have 5.
Aside from ../user/me, anything else under ../user/.. will depend on the regex. In your case one matches only numbers and one matches only letters. There is no way for these two regexes to overlap. So it will match accordingly.
Now just for fun, let's say we have
@Path("user/{name : .*}")
@Path("user/{id : \\d+}")
@Path("user/me")
If you look at the top two, we now have overlapping regexes. The first will match all numbers, as will the second one. So which one will be used? We can't make any assumptions. This is a level of ambiguity not specified and I've seen different behavior from different implementations. AFAIK, there is no concept of a "best matching" regex. Either it matches or it doesn't.
But what if we wanted the {id : \\d+} to always be checked first. If it matches numbers then that should be selected. We can hack it based on the specification. The spec talks about "capturing groups" which are basically the {..}s. The second sorting key is the number of capturing groups. The way we could hack it is to add another "optional" group
@Path("user/{name : .*}")
@Path("user/{id : \\d+}{dummy: (/)?}")
Now the latter has more capturing groups, so it will always be ahead in the sort. All it does is allow an optional /, which doesn't really affect the API, but ensures that if the request URI is all numbers, this path will always be chosen.
You can see a discussion with some test cases in this answer
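The three sort keys quoted from the spec can be modelled with a short sketch. This is a simplified Python model for illustration only, not how any JAX-RS implementation does it: sort_key and TEMPLATE_VAR are hypothetical names, and the parser assumes the embedded regexes contain no curly braces.

```python
import re

# Matches one template variable: {name} or {name : regex}.
# Assumption: the embedded regex itself contains no '{' or '}'.
TEMPLATE_VAR = re.compile(r"\{\s*(\w+)\s*(?::\s*([^{}]+?))?\s*\}")

def sort_key(template):
    groups = TEMPLATE_VAR.findall(template)
    # Key 1: literal characters, i.e. everything outside the {...} groups.
    literal = TEMPLATE_VAR.sub("", template)
    # Key 3: groups that declare their own regex instead of the default.
    custom = sum(1 for _name, regex in groups if regex)
    # Key 2 is the total number of capturing groups.
    return (len(literal), len(groups), custom)

templates = [
    "user/{name : .*}",
    "user/{id : \\d+}{dummy: (/)?}",
    "user/me",
]

# All three keys are sorted in descending order.
ranked = sorted(templates, key=sort_key, reverse=True)
for t in ranked:
    print(sort_key(t), t)
```

Under this model "user/me" comes first with 7 literal characters, and the extra dummy group puts the \\d+ template ahead of the .* template.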

Regular Expressions (Normal OR Nested Brackets)

So I'm completely new to the overwhelming world of Regex. Basically, I'm using the Gedit API to create a new custom language specification (derived from C#) for syntax-highlighting (for DM from Byond). In escaped characters in DM, you have to use [variable] as an escaping syntax, which is simple enough. However, it could also be nested, such as [array/list[index]] for instance. (It could be nested infinitely.) I've looked through the other questions, and when they ask about nested brackets they only mean exclusively nested, whereas in this case it could be either/or.
Several attempts I've tried:
\[.*\] produces the result "Test [Test[Test] Test]Test[Test] Test"
\[.*?\] produces the result "Test [Test[Test] Test]Test [Test] Test"
\[(?:.*)\] produces the result "Test [Test[Test] Test]Test[Test] Test"
\[(?:(?!\[|\]).)*\] produces the result "Test [Test[Test] Test]Test[Test] Test". This is derived from https://stackoverflow.com/a/9580978/2303154 but like mentioned above, that only matches if there are no brackets inside.
Obviously I've no real idea what I'm doing here in more complex matching, but at least I understand more of the basic operations from other sources.
From @Chaos7Theory:
Upon reading GtkSourceView's Specification Reference, I've figured out that it uses PCRE specifically. I then used that as a lead.
Digging into it and through trial-and-error, I got it to work with:
\[(([^\[\]]*|(?R))*)\]
I hope this helps someone else in the future.
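To see which spans a recursive pattern like this is meant to pick out, here is a rough sketch in Python. Python's built-in re module has no (?R) recursion, so this swaps in a plain depth counter instead of a regex. balanced_spans is a hypothetical helper, and on unbalanced input it may behave differently from the PCRE pattern.

```python
def balanced_spans(text):
    """Return each outermost [...] span, including nested brackets."""
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "[":
            if depth == 0:
                start = i  # remember where the outermost bracket opens
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans

found = balanced_spans("Test [Test[Test] Test]Test[Test] Test")
print(found)
```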

PCRE in Haskell - what, where, how?

I've been searching for some documentation or a tutorial on Haskell regular expressions for ages. There's no useful information on the HaskellWiki page. It simply gives the cryptic message:
Documentation
Coming soonish.
There is a brief blog post which I have found fairly helpful, however it only deals with Posix regular expressions, not PCRE.
I've been working with Posix regex for a few weeks and I'm coming to the conclusion that for my task I need PCRE.
My problem is that I don't know where to start with PCRE in Haskell. I've downloaded regex-pcre-builtin with cabal but I need an example of a simple matching program to help me get going.
1. Is it possible to implement multi-line matching?
2. Can I get the matches back in this format: [(MatchOffset,MatchLength)]?
3. What other formats can I get the matches back in?
Thank you very much for any help!
There's also regex-applicative which I've written.
The idea is that you can assign some meaning to each piece of a regular expression and then compose them, just as you write parsers using Parsec.
Here's an example -- simple URL parsing.
import Text.Regex.Applicative
data Protocol = HTTP | FTP deriving Show
protocol :: RE Char Protocol
protocol = HTTP <$ string "http" <|> FTP <$ string "ftp"
type Host = String
type Location = String
data URL = URL Protocol Host Location deriving Show
host :: RE Char Host
host = many $ psym $ (/= '/')
url :: RE Char URL
url = URL <$> protocol <* string "://" <*> host <* sym '/' <*> many anySym
main = print $ "http://stackoverflow.com/questions" =~ url
There are two main options when wanting to use PCRE-style regexes in Haskell:
regex-pcre uses the same interface as described in that blog post (and also in RWH, as I think an expanded version of that blog post); this can be optionally extended with pcre-less. regex-pcre-builtin seems to be a pre-release snapshot of this and probably shouldn't be used.
pcre-light is bindings to the PCRE library. It doesn't provide the return types you're after, just all the matchings (if any). However, the pcre-light-extras package provides a MatchResult class, for which you might be able to provide such an instance. This can be enhanced using regexqq which allows you to use quasi-quoting to ensure that your regex pattern type-checks; however, it doesn't work with GHC-7 (and unless someone takes over maintaining it, it won't).
So, assuming that you go with regex-pcre:
1. According to this answer, yes.
2. I think so, via the MatchArray type (it returns an array, which you can then get the list out from).
3. See here for all possible results from a regex.
Well, I wrote much of the wiki page and may have written "Coming soonish". The regex-pcre package was my wrapping of PCRE using the regex-base interface, where regex-base is used as the interface for several very different regular expression engine backends. Don Stewart's pcre-light package does not have this abstraction layer and is thus much smaller.
The blog post on Text.Regex.Posix uses my regex-posix package which is also on top of regex-base. Thus the usage of regex-pcre will be very very similar to that blog post, except for the compile & execution options of PCRE being different.
For configuring regex-pcre the Text.Regex.PCRE.Wrap module has the constants you need. Use makeRegexOptsM from regex-base to specify the options.
regexpr is another PCRE-ish lib that's cross-platform and quick to get started with.
I find rex to be quite nice too, its ViewPatterns integration is a nice idea I think.
It can be verbose though but that's partially tied to the regex concept.
parseDate :: String -> LocalTime
parseDate [rex|(?{read -> year}\d+)-(?{read -> month}\d+)-
(?{read -> day}\d+)\s(?{read -> hour}\d+):(?{read -> mins}\d+):
(?{read -> sec}\d+)|] =
LocalTime (fromGregorian year month day) (TimeOfDay hour mins sec)
parseDate v@_ = error $ "invalid date " ++ v
That said, I just discovered regex-applicative, mentioned in one of the other answers, and it may be a better choice: it could be less verbose and more idiomatic. On the other hand, rex has basically zero learning curve if you know regular expressions, which can be a plus.