What does (($2 :: fst $1), snd $1) do in OCaml?

I'm a beginner in OCaml and sometimes need some guidance with the syntax:
(($2 :: fst $1), snd $1)
I know $2 must be the second token in the line, $1 the first, and fst and snd refer to the first and second component of a pair. I know :: usually means building a list?
And the overall placement of the parentheses makes me think it's returning a pair.
But what does this entire line mean, everything taken together?

This syntax comes from the ocamlyacc rule grammar, which is a DSL for writing parsers. Each symbol $N refers to the semantic value of the N-th symbol on the right-hand side of the rule. You can think of them as simple variables that are bound by the rule's pattern. So what does (($2 :: fst $1), snd $1) mean?
It is a pair. The first component is the list $2 :: fst $1, built by prepending $2 to the first component of $1, which is itself a pair whose first component must be a list. The second component of $1 becomes the second component of the resulting pair. E.g., suppose that $1 = ([5], 7) and $2 is 42; you will get ([42; 5], 7) as the result of this semantic action.
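As a rough analogue outside the parser generator, the same semantic action can be sketched in Python, with a tuple standing in for the OCaml pair and list concatenation standing in for ::. The function name semantic_action and the sample values are hypothetical, and the sketch assumes the first component of $1 is a list, as :: requires.

```python
def semantic_action(d1, d2):
    """Mimic (($2 :: fst $1), snd $1).

    d1 plays the role of $1 (a pair whose first component is a list),
    d2 plays the role of $2 (a single element).
    """
    items, other = d1               # fst $1 and snd $1
    return ([d2] + items, other)    # $2 :: fst $1, paired with snd $1


result = semantic_action(([5], 7), 42)
print(result)  # ([42, 5], 7)
```

The new element always goes to the front of the list, so a rule that applies this action repeatedly accumulates its items in reverse order of appearance.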

Related

extract string between two substrings in Haskell

I wanted to adapt the Python regex (PCRE) technique from the SO question "Find string between two substrings" so that I can do the same in Haskell.
But I can't figure out how to make it work in GHC (8.2.1). I've installed the package with cabal install regex-pcre, and came up with the following test code after some searching:
import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
I was hoping to get the first and last instance of the middle string
iwantthis
But I can't get the result right:
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
I haven't used regex or pcre in Haskell before.
Can someone help with the right usage (to extract the first and last occurrence)?
Also, I don't quite understand the ::[[String]] usage here. What does it do and why is it necessary?
I searched the documentation but found no mention of the usage with type conversion to :: [[String]].
The result you obtain is the following:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
This is correct: the first element is the implicit capture group 0 (the entire match), and the second element is capture group 1 (the one that matches (.*)), since it matches like:
+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---
So it still matches between an asdf=5; and a 123jasd part, just not the closest pair.
This is because the Kleene star * is greedy: it tries to capture as much as possible. You can use (.*?) instead, which is a non-greedy quantifier:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]
And now we obtain two matches. Each match has "iwantthis" as capture group 1.
You can use map (head . tail) or map (!!1) on it to obtain a list of captures of the (.*?) part:
Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
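The greedy versus non-greedy behaviour is not specific to Haskell's binding. As a minimal sketch, here is the same comparison using Python's standard re module, which is not PCRE but shares these quantifier semantics. The variable names are illustrative only.

```python
import re

s = "+++asdf=5;iwantthis123jasd---"
text = s + s

# Greedy: .* runs on to the LAST "123jasd" in the doubled string,
# so there is a single match spanning both copies.
greedy = re.findall(r"asdf=5;(.*)123jasd", text)

# Lazy: .*? stops at the NEAREST "123jasd", so each copy matches on its own.
lazy = re.findall(r"asdf=5;(.*?)123jasd", text)

print(greedy)  # ['iwantthis123jasd---+++asdf=5;iwantthis']
print(lazy)    # ['iwantthis', 'iwantthis']
```

With a single capture group, re.findall returns only the captured text of each match, which corresponds to the map (!!1) step above.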

Haskell and Regex with Intersections

I am using regex with Haskell along with Text.Regex.PCRE and in my case I have:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" :: AllTextMatches [] String)
[]
I am expecting some values to be returned, but the list is empty. However, this returns what is expected:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X])([A-Z])([A-Z])(\\d{2,10})" :: AllTextMatches [] String)
["32UMU1078"]
So if I remove the intersections like &&[^IO] there are no problems.
As I just discovered, PCRE doesn't support intersections. Is there an alternative Haskell library that supports them?
PCRE does not support character class intersection/subtraction.
However, you may work around it with negative lookaheads and other methods.
Here, replace "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" with
"(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^
That is, replace each intersection with a negative lookahead: [C-X&&[^IO]] -> (?![IO])[C-X].
Another way, that is more verbose, is to spell out the character classes:
"(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"
So, [C-X] that does not match I and O must be written as [C-HJ-NP-X].
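To check that the two workarounds behave the same, here is a small sketch using Python's standard re module, which supports negative lookaheads like PCRE does. The helper groups and the rejected sample "32UIU1078" are made up for illustration.

```python
import re

# Workaround 1: negative lookahead in front of each letter class.
LOOKAHEAD = r"(\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\d{2,10})"
# Workaround 2: spell out the ranges with I and O left out.
EXPLICIT = r"(\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\d{2,10})"


def groups(pattern, text):
    """Return the captured groups of the first match, or None if no match."""
    m = re.search(pattern, text)
    return m.groups() if m else None


for pattern in (LOOKAHEAD, EXPLICIT):
    print(groups(pattern, "32UMU1078"))  # ('32', 'U', 'M', 'U', '1078')
    print(groups(pattern, "32UIU1078"))  # None, because of the excluded I
```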

Need explanation of tcl regexp inline example in the man page please

While trying to understand regexp and the use of -inline, I saw this example but couldn't understand how it works.
Link to the man page is: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm#M13
There, under the -inline option, this example is given:
regexp -inline -- {\w(\w)} " inlined "
=> {in n}
regexp -all -inline -- {\w(\w)} " inlined "
=> {in n li i ne e}
How does this "{\w(\w)}" yield "{in n}"? Can someone explain please.
Appreciate the help.
Thanks
If -inline but not -all is given, regexp returns a list consisting of one value for the entire region matched and one value for each submatch (regions captured by parentheses). To see what the entire match is, ignore the parentheses: the pattern is now {\w\w}, matching the first two word characters in the string (in). The first submatch is what you get if you skip one word character (the \w outside the parentheses) and then capture the next word character (the \w inside the parentheses), getting n.
If both -inline and -all are given, regexp does this repeatedly, restarting at the first character beyond the last entire match.
I think that to understand -inline, you must first understand that it puts the matches (and submatches) in a list. If you had...
regexp -- {\w(\w)} " inlined " m1 m2
you would have...
% puts $m1
in
% puts $m2
n
The whole match, in, is stored in m1, while the submatch from the capture group, n, is stored in m2.
Putting those in a list (i.e. when using -inline) will give {in n}.
When you have -all and -inline at the same time (assuming that you already know that -all retrieves all non-overlapping matches), you can no longer put variable names after the input string, so you get a single list containing all the matches and submatches. If I name them m and s (for match and submatch respectively), you have:
in  n  li  i  ne  e
m   s  m   s  m   s
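The list-building described above can be imitated in Python with the standard re module. This is only an emulation to show the ordering of matches and submatches, not how Tcl implements it, and the helper names regexp_inline and regexp_all_inline are invented for this sketch.

```python
import re

PATTERN = r"\w(\w)"
TEXT = " inlined "


def regexp_inline(pattern, text):
    """Like `regexp -inline`: whole first match followed by its submatches."""
    m = re.search(pattern, text)
    return [m.group(0), *m.groups()] if m else []


def regexp_all_inline(pattern, text):
    """Like `regexp -all -inline`: the same, for every non-overlapping match."""
    out = []
    for m in re.finditer(pattern, text):
        out.extend([m.group(0), *m.groups()])
    return out


print(regexp_inline(PATTERN, TEXT))      # ['in', 'n']
print(regexp_all_inline(PATTERN, TEXT))  # ['in', 'n', 'li', 'i', 'ne', 'e']
```

The trailing d is left over because the pattern needs two word characters and only one remains after ne.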

Can't capture a group in a string

I want to capture a group in a string:
import Text.Regex.Posix
"somestring; somestring2=\"(.*?)\"" =~ "somestring; somestring2=\"my_super_string123\"" :: String
It returns an empty string "", as opposed to my_super_string123 which I expect. I've tried ::[String] and ::[[String]] and, obviously, they were empty. Your suggestions?
The problem is that you have your string and your pattern swapped. You also will want to have the return type be [[String]]:
> "somestring; somestring2=\"my_super_string123\"" =~ "somestring; somestring2=\"(.*)\"" :: [[String]]
[["somestring; somestring2=\"my_super_string123\"", "my_super_string123"]]
Note that I had to remove the ? from the .*? part of the pattern. This is because POSIX regular expressions don't support the lazy quantifier *?. You'll have to select both of the POSIX flavors from the drop-downs to see, but it says neither supports lazy quantifiers. It's also recommended to use a negated character class instead of laziness, since it avoids the backtracking that a lazy quantifier needs and so performs better. To do this, you'd have to change your pattern to
"somestring; somestring2=\"([^\"]*)\""
To clarify, here's the output from my GHCi:
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: [[String]]
[["s1; s2=\"my_super_string123\"","my_super_string123"]]
it :: [[String]]
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: String
"s1; s2=\"my_super_string123\""
it :: String
As you can see, with the return type as String, it returns whatever text matches the entire pattern, not just the capturing groups. Use [[String]] when you want to get the contents of the individual capturing groups.
I edited the contents of the string so that it would fit without having to scroll horizontally, just for illustrative purposes.
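The two points above (subject first, then pattern; a negated class instead of a lazy quantifier) can be sketched in Python's standard re module. Note that Python takes the pattern as the first argument, the reverse of =~, and that the variable names here are illustrative.

```python
import re

subject = 's1; s2="my_super_string123"'
pattern = r's1; s2="([^"]*)"'   # [^"]* cannot run past the closing quote

m = re.search(pattern, subject)
whole, captured = m.group(0), m.group(1)
print(whole)     # s1; s2="my_super_string123"
print(captured)  # my_super_string123

# Swapping the roles, as in the question, finds nothing: the subject text,
# used as a pattern, does not occur inside the pattern text.
swapped = re.search(subject, pattern)
print(swapped)   # None
```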

Grouping in haskell regular expressions

How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match.
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group, so that x is "xyz"?
I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
    x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
At which point x is an Array Int of matches of an Array Int of group-matches.
Note that "\w" is Perl syntax so you need regex-pcre to access it. If you want Unix/Posix extended regular expressions you should use regex-tdfa which is cross-platform and avoid using regex-posix that hits each platform's bugs in implementing the regex.h library.
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.
Annotate the result with the type [[String]]. Then you'll get a list of matches, each being a list of the matched text followed by the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
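For comparison, the same "list of matches, each with its groups" shape, and the extraction of just the first group that the question asks for, can be sketched with Python's standard re module. The helper all_matches is hypothetical and only mirrors the [[String]] layout.

```python
import re


def all_matches(pattern, text):
    """One list per match: the whole matched text, then each captured group."""
    return [[m.group(0), *m.groups()] for m in re.finditer(pattern, text)]


matches = all_matches(r"(\w+) \w+", "xyz abc more text")
print(matches)       # [['xyz abc', 'xyz'], ['more text', 'more']]

# The first group of the first match is the second element of the first list.
first_group = matches[0][1]
print(first_group)   # xyz
```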