Extract string between two substrings in Haskell

I want to adapt the Python regex (PCRE) technique from the SO question "Find string between two substrings" to Haskell.
But I can't figure out how to make it work in GHC (8.2.1). I ran cabal install regex-pcre and, after some searching, came up with the following test code:
import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
I was hoping to get the first and last instance of the middle string
iwantthis
But I can't get the result right:
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
I haven't used regex or pcre in Haskell before.
Can someone help with the right usage (to extract the first and last occurrence)?
Also, I don't quite understand the ::[[String]] usage here. What does it do and why is it necessary?
I searched the documentation but found no mention of the usage with type conversion to :: [[String]].

The result you obtain is the following:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
This is correct: the first element is the implicit capture group 0 (the entire match), and the second element is capture group 1 (the one that matches (.*)). The regex matches like this:
+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---
So the capture still lies between an asdf=5; and a 123jasd, but it runs from the first asdf=5; to the last 123jasd.
This is because the Kleene star * is greedy: it tries to capture as much as possible. You can use (.*?) instead to get a non-greedy quantifier:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]
And now we obtain two matches. Each match has "iwantthis" as capture group 1.
You can use map (head . tail) or map (!!1) on it to obtain a list of captures of the (.*?) part:
Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
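Since the question asks for the first and last occurrence, here is a minimal sketch that builds on the non-greedy pattern above. It assumes regex-pcre is installed; the helper names middles and firstAndLast are made up for illustration and are not part of any library.

```haskell
import Text.Regex.PCRE ((=~))

-- Capture group 1 of every non-overlapping match of the non-greedy pattern.
-- (Hypothetical helper name.)
middles :: String -> [String]
middles input = map (!! 1) (input =~ "asdf=5;(.*?)123jasd" :: [[String]])

-- First and last element of a list, or Nothing for an empty list (no match).
-- (Hypothetical helper name.)
firstAndLast :: [a] -> Maybe (a, a)
firstAndLast []         = Nothing
firstAndLast xs@(x : _) = Just (x, last xs)

main :: IO ()
main = do
  let s = "+++asdf=5;iwantthis123jasd---"
  -- prints Just ("iwantthis","iwantthis")
  print (firstAndLast (middles (s ++ s)))
```

Returning a Maybe keeps the no-match case explicit instead of failing on an empty list.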

What is the best way to ensure a regex in OCaml matches the entire input string?

In OCaml, I'm trying to check if a regex matches the entire input string, not just a prefix or a suffix or the portion of the input string before the first newline.
For example, I want to avoid a regex of [0-9]+ matching against strings like these:
let negative_matches = [
" 123"; (* leading whitespace *)
"123 "; (* trailing whitespace *)
"123\n"; (* trailing newline *)
]
I see that Str.string_match still returns true when trailing characters do not match the pattern:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+") s 0) negative_matches;;
- : bool list = [false; true; true]
Adding $ to the pattern helps in the second example, but $ is documented to only "match at the end of the line", so the third example still matches:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+$") s 0) negative_matches;;
- : bool list = [false; false; true]
I don't see a true "end of string" matcher (like \z in Java and Ruby) documented, so the best answer I've found is to additionally check the length of the input string against the length of the match using Str.match_end:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+") s 0 && Str.match_end () = String.length s) negative_matches;;
- : bool list = [false; false; false]
Please tell me I'm missing something obvious and there is an easier way.
Edit: note that I'm not always looking to match against a simple regex like [0-9]+. I'd like a way to match an arbitrary regex against the entire input string.

You are missing something obvious. There is an easier way. If
[^0-9]
is matched in the input string you will know it contains a non-digit character.

Unfortunately, I don't think Str offers a better way to ensure the whole string has been matched than your own solution, or the similar, slightly clearer alternative:
Str.string_match (Str.regexp "[0-9]+") s 0 && Str.matched_string s = s
Or you could just check for the presence of a newline character as that is the fly in the ointment as you show.
And, of course, there are other regular expression libraries available that do not have this problem.

Try this for your example:
(?<![^A-z]|\w)[0-9]+(?![^A-z]|\w)
test it here
If you want to build other patterns, you can start from these two constructs:
(?<!'any group you don't want it to appear before your desire')
(?!'any group you don't want it to appear after your desire')

What does (($2 :: fst $1), snd $1) do in ocaml?

I'm a beginner in OCaml and sometimes need some guidance with the syntax:
(($2 :: fst $1), snd $1)
I know $2 must be the second token in the line, $1 the first, and fst and snd refer to the first and second component of a pair. I know :: usually means building a list?
And the overall placement of the parentheses makes me think it's returning a pair.
But what does this entire line mean, everything taken together?

This syntax comes from the ocamlyacc grammar format, a DSL for writing parsers. A symbol $N refers to the semantic value of the N-th symbol on the right-hand side of the rule. You can think of them as simple variables bound by the rule's pattern. So what does (($2 :: fst $1), snd $1) mean?
It is a pair. Its first component is the list $2 :: fst $1, built by prepending $2 to the first component of $1, which is itself a pair whose first component is a list. The second component of $1 becomes the second component of the result. E.g., suppose that $1 = ([5], 7) and $2 is 42; you will get ([42; 5], 7) as the result of this semantic action.

Haskell and Regex with Intersections

I am using regex with Haskell along with Text.Regex.PCRE and in my case I have:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" :: AllTextMatches [] String)
[]
I am expecting some values returned but list is empty. However this returns what is expected:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X])([A-Z])([A-Z])(\\d{2,10})" :: AllTextMatches [] String)
["32UMU1078"]
So if I remove the intersections like &&[^IO] there are no problems.
As I just discovered, PCRE doesn't support intersections. Is there an alternative Haskell library that supports them?

PCRE does not support character class intersection/subtraction.
However, you may work around it with negative lookaheads and other methods.
Here, replace "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" with
"(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^
That is, replace the subtractions with lookaheads, [C-X&&[^IO]] -> (?![IO])[C-X].
Another way, that is more verbose, is to spell out the character classes:
"(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"
So, [C-X] that does not match I and O must be written as [C-HJ-NP-X].
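Both workarounds can be checked side by side with a small sketch. It assumes regex-pcre is installed; the names lookaheadPattern, spelledOutPattern, and groupsOf are made up for illustration, and the second input is an invented example containing an I.

```haskell
import Text.Regex.PCRE ((=~))

-- The two workaround patterns from the answer above.
lookaheadPattern, spelledOutPattern :: String
lookaheadPattern  = "(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
spelledOutPattern = "(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"

-- Every match as a list: the whole match first, then each capture group.
-- (Hypothetical helper name.)
groupsOf :: String -> String -> [[String]]
groupsOf pat input = input =~ pat

main :: IO ()
main = do
  -- Both print [["32UMU1078","32","U","M","U","1078"]]
  print (groupsOf lookaheadPattern "32UMU1078")
  print (groupsOf spelledOutPattern "32UMU1078")
  -- An input with an excluded letter (I) yields no match: prints []
  print (groupsOf lookaheadPattern "32UIU1078")
```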

Can't capture a group in a string

I want to capture a group in a string:
import Text.Regex.Posix
"somestring; somestring2=\"(.*?)\"" =~ "somestring; somestring2=\"my_super_string123\"" :: String
It returns an empty string "", as opposed to my_super_string123 which I expect. I've tried ::[String] and ::[[String]] and, obviously, they were empty. Any suggestions?

The problem is that you have your string and your pattern swapped. You will also want the return type to be [[String]]:
> "somestring; somestring2=\"my_super_string123\"" =~ "somestring; somestring2=\"(.*)\"" :: [[String]]
[["somestring; somestring2=\"my_super_string123\"", "my_super_string123"]]
Note that I had to remove the ? from the .*? part of the pattern, because POSIX doesn't support the lazy quantifier *?. Neither POSIX flavor (basic or extended) supports lazy quantifiers. It's also recommended to use a negated character class instead of laziness, since it avoids backtracking and so improves performance. To do this, you'd have to change your pattern to
"somestring; somestring2=\"([^\"]*)\""
To clarify, here's the output from my GHCi:
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: [[String]]
[["s1; s2=\"my_super_string123\"","my_super_string123"]]
it :: [[String]]
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: String
"s1; s2=\"my_super_string123\""
it :: String
As you can see, with the return type as String, it returns whatever text matches the entire pattern, not just the capturing groups. Use [[String]] when you want to get the contents of the individual capturing groups.
For illustration, I shortened the strings in the GHCi output so that they fit without horizontal scrolling.

Grouping in haskell regular expressions

How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match.
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group so that x is "xyz"?

I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
    x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
At which point x is an Array Int of matches of an Array Int of group-matches.
Note that "\w" is Perl syntax, so you need regex-pcre to use it. If you want Unix/POSIX extended regular expressions, use regex-tdfa, which is cross-platform, and avoid regex-posix, which exposes each platform's bugs in its regex.h implementation.
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.

Annotate the result as [[String]]. Then you'll get a list of matches, each being the list of matched text and the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
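If only the first group of the first match is needed, another RegexContext instance may be more convenient: the 4-tuple (String, String, String, [String]), which holds the text before the match, the match itself, the text after it, and the list of capture groups. The sketch below assumes regex-pcre is installed; firstGroup is a made-up helper name, not a library function.

```haskell
import Text.Regex.PCRE ((=~))

-- First capture group of the first match, or Nothing when the pattern
-- does not match or has no capture groups. (Hypothetical helper name.)
firstGroup :: String -> String -> Maybe String
firstGroup pat input =
  case input =~ pat :: (String, String, String, [String]) of
    (_before, _match, _after, g : _) -> Just g
    _                                -> Nothing

main :: IO ()
main = do
  -- prints Just "xyz"
  print (firstGroup "(\\w+) \\w+" "xyz abc")
  -- prints Nothing (no space-separated pair of words)
  print (firstGroup "(\\w+) \\w+" "xyz")
```

The type annotation is what selects the result shape: the same =~ expression yields a String, a [[String]], or this 4-tuple depending on the type requested.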