I am trying to match all the occurrences of a regex and get the indices as a result. The example from Real World Haskell says I can do
string =~ regex :: [(Int, Int)]
However, this is broken since the regex library has been updated since the publication of RWH. (See All matches of regex in Haskell and "=~" raise "No instance for (RegexContext Regex [Char] [String])"). What is the correct way to do this?
Update:
I found matchAll which might give me what I want. I have no idea how to use it, though.
The key to using matchAll is to give the regex the type annotation :: Regex when you create it:
import Text.Regex
import Text.Regex.Base
re = makeRegex "[^aeiou]" :: Regex
test = matchAll re "the quick brown fox"
This returns a list of arrays. To get a list of (offset,length) pairs, just access the first element of each array:
import Data.Array ((!))
matches = map (!0) $ matchAll re "the quick brown fox"
-- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]
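If you want the matched text rather than the positions, each (offset,length) pair can be turned back into a substring with take and drop. This is only a minimal sketch using base: slice is a hypothetical helper name, not part of any regex package, and the pairs are copied from the output above instead of being computed.

```haskell
-- Hypothetical helper (not from regex-base): cut the matched text
-- out of the subject string, given an (offset, length) pair.
slice :: String -> (Int, Int) -> String
slice str (off, len) = take len (drop off str)

main :: IO ()
main = do
  let text  = "the quick brown fox"
      spans = [(0,1),(1,1),(3,1),(4,1)]  -- first four pairs from the output above
  print (map (slice text) spans)         -- ["t","h"," ","q"]
```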
The usage of the =~ operator has changed since RWH. You should use the predefined types MatchOffset and MatchLength and the special type constructor AllMatches:
import Text.Regex.Posix
re = "[^aeiou]"
text = "the quick brown fox"
test1 = text =~ re :: Bool
-- True
test2 = text =~ re :: String
-- "t"
test3 = text =~ re :: (MatchOffset,MatchLength)
-- (0,1)
test4 = text =~ re :: AllMatches [] (MatchOffset, MatchLength)
-- (not showable)
test4' = getAllMatches $ (text =~ re :: AllMatches [] (MatchOffset, MatchLength))
-- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]
See the docs for Text.Regex.Base.Context for more details on what contexts are available.
UPDATE: I believe the type constructor AllMatches was introduced to resolve the ambiguity that arises when a regex has subexpressions -- e.g.:
foo = "axx ayy" =~ "a(.)([^a])"
test1 = getAllMatches $ (foo :: AllMatches [] (MatchOffset, MatchLength))
-- [(0,3),(4,3)]
-- returns the locations of "axx" and "ayy" but no subexpression info
test2 = foo :: MatchArray
-- array (0,2) [(0,(0,3)),(1,(1,1)),(2,(2,1))]
-- returns only the match with "axx"
Both are essentially a list of offset-length pairs, but they mean different things.
Related
I'm looking at the RWH tutorial here, which suggests using [String] to return multiple results, but that usage produces an error. As you can see:
"I'd like to group by word breaks" =~ "\\S+" :: Bool
"I'd like to group by word breaks" =~ "\\S+" :: String
"I'd like to group by word breaks" =~ "\\S+" :: [[String]]
produce
True
"I'd"
[["I'd"],["like"],["to"],["group"],["by"],["word"],["breaks"]]
respectively.
But the recommended [String] does not, and instead has an error:
"I'd like to group by word breaks" =~ "\\S+" :: [String]
<interactive>:1:1: error:
• No instance for (RegexContext Regex String [String]) arising from a use of ‘=~’
• In the expression: "I'd like to group by word breaks" =~ "\\S+" :: [String]
In an equation for ‘it’: it = "I'd like to group by word breaks" =~ "\\S+" :: [String]
What type can I ask for in place of the missing [String] instance to get the result I'm looking for, namely:
["I'd","like","to","group","by","word","breaks"]
without having to post-process? Also, it's interesting that this seems natural in the context of the other successful type conversions, and it even worked this way at one point when the book was written, and no longer does. What is the explanation for the change?
From the comments on that question, it looks like the recommendations are either:
Prelude Text.Regex.PCRE> getAllTextMatches ("I'd like to group by word breaks" =~ "\\S+") :: [String]
["I'd","like","to","group","by","word","breaks"]
or
Prelude Text.Regex.PCRE> concat $ "I'd like to group by word breaks" =~ "\\S+" :: [String]
["I'd","like","to","group","by","word","breaks"]
Neither of these is as clean as how this used to work.
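One caveat about the concat variant: it only gives the whole matches when the pattern has no capture groups, because each inner list of a [[String]] result holds the whole match followed by its groups. The sketch below uses base only and works on hard-coded result lists (the second one is sample data with one capture group), so it shows the list shapes and does not call the regex library. wholeMatches is a hypothetical helper name.

```haskell
-- Hypothetical helper: keep only element 0 (the whole match) of each match.
wholeMatches :: [[String]] -> [String]
wholeMatches matches = [whole | (whole : _) <- matches]

main :: IO ()
main = do
  -- Shape of a [[String]] result when the pattern has no capture groups.
  let noGroups  = [["I'd"], ["like"], ["to"]]
  -- Shape of a [[String]] result when the pattern has one capture group (sample data).
  let withGroup = [["xyz abc", "xyz"], ["more text", "more"]]
  print (concat noGroups)         -- ["I'd","like","to"]
  print (concat withGroup)        -- ["xyz abc","xyz","more text","more"]
  print (wholeMatches withGroup)  -- ["xyz abc","more text"]
```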
I wanted to adapt the Python regex (PCRE) technique in this SO question, Find string between two substrings, to Haskell.
But I can't figure out how to make it work in GHC (8.2.1). I've installed the package with cabal install regex-pcre, and came up with the following test code after some searching:
import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
I was hoping to get the first and last instance of the middle string
iwantthis
But I can't get the result right:
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
I haven't used regex or pcre in Haskell before.
Can someone help with the right usage (to extract the first and last occurrence) ?
Also, I don't quite understand the ::[[String]] usage here. What does it do and why is it necessary?
I searched the documentation but found no mention of the usage with type conversion to :: [[String]].
The result you obtain is the following:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
This is correct: the first element is the implicit capture group 0 (the entire match), and the second element is capture group 1 (the one that matches (.*)). The match spans:
+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---
So it still matches between the first asdf=5; and the last 123jasd.
This is because the Kleene star * matches greedily: it aims to capture as much as possible. You can, however, use (.*?) to get a non-greedy quantifier:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]
And now we obtain two matches. Each match has "iwantthis" as capture group 1.
You can use map (head . tail) or map (!!1) on it to obtain a list of captures of the (.*?) part:
Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
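Note that (!!1) and head . tail are partial and will crash if a match list is shorter than expected. A total variant could look like the sketch below. It uses base only, nth and captureGroup are hypothetical helper names, and the input is the [[String]] result shown above, hard-coded so the example runs without regex-pcre.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical helper: the n-th element of a list, or Nothing if there is none.
nth :: Int -> [a] -> Maybe a
nth n xs = case drop n xs of
  (x : _) | n >= 0 -> Just x
  _                -> Nothing

-- Hypothetical helper: capture group n of every match,
-- skipping matches that have no such group.
captureGroup :: Int -> [[String]] -> [String]
captureGroup n = mapMaybe (nth n)

main :: IO ()
main = do
  let matches = [ ["asdf=5;iwantthis123jasd", "iwantthis"]
                , ["asdf=5;iwantthis123jasd", "iwantthis"] ]
  print (captureGroup 1 matches)  -- ["iwantthis","iwantthis"]
  print (captureGroup 2 matches)  -- []
```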
I am using regex with Haskell along with Text.Regex.PCRE and in my case I have:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" :: AllTextMatches [] String)
[]
I am expecting some values returned but list is empty. However this returns what is expected:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X])([A-Z])([A-Z])(\\d{2,10})" :: AllTextMatches [] String)
["32UMU1078"]
So if I remove the intersections like &&[^IO] there are no problems.
As I just discovered, PCRE doesn't support intersections. Is there an alternative Haskell library that supports them?
PCRE does not support character class intersection/subtraction.
However, you may work around it with negative lookaheads and other methods.
Here, replace "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" with
"(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^
That is, replace the subtractions with lookaheads, [C-X&&[^IO]] -> (?![IO])[C-X].
Another way, that is more verbose, is to spell out the character classes:
"(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"
So, [C-X] that does not match I and O must be written as [C-HJ-NP-X].
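To double-check a spelled-out class like this, you can compare the two character sets directly. The following is just a sanity-check sketch in plain Haskell (base only, no regex library involved). The names subtracted and spelledOut are made up for the illustration.

```haskell
import Data.List ((\\))

-- The intended set for [C-X&&[^IO]]: the range C..X with I and O removed.
subtracted :: String
subtracted = ['C'..'X'] \\ "IO"

-- The spelled-out class [C-HJ-NP-X].
spelledOut :: String
spelledOut = ['C'..'H'] ++ ['J'..'N'] ++ ['P'..'X']

main :: IO ()
main = print (subtracted == spelledOut)  -- True
```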
I want to capture a group as a string:
import Text.Regex.Posix
"somestring; somestring2=\"(.*?)\"" =~ "somestring; somestring2=\"my_super_string123\"" :: String
It returns an empty string "", as opposed to my_super_string123 which I expect. I've tried ::[String] and ::[[String]] and, obviously, they were empty. Your suggestions?
The problem is that you have your string and your pattern swapped. You also will want to have the return type be [[String]]:
> "somestring; somestring2=\"my_super_string123\"" =~ "somestring; somestring2=\"(.*)\"" :: [[String]]
[["somestring; somestring2=\"my_super_string123\"", "my_super_string123"]]
Note that I had to remove the ? from the .*? part of the pattern. This is because POSIX doesn't support the lazy quantifier *?. You'll have to select both of the POSIX flavors from the drop-downs to see it, but neither supports lazy quantifiers. It's also recommended to use a negated character class instead of laziness, since it avoids the backtracking that a lazy quantifier needs and so performs better. To do this, you'd have to change your pattern to
"somestring; somestring2=\"([^\"]*)\""
To clarify, here's the output from my GHCi:
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: [[String]]
[["s1; s2=\"my_super_string123\"","my_super_string123"]]
it :: [[String]]
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: String
"s1; s2=\"my_super_string123\""
it :: String
As you can see, with the return type as String, it returns whatever text matches the entire pattern, not just the capturing groups. Use [[String]] when you want to get the contents of the individual capturing groups.
I edited the contents of the string so that it would fit without having to scroll horizontally, just for illustrative purposes.
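To see why the negated class ([^\"]*) stops at the closing quote, here is what it does written out by hand, without any regex library. This is only an illustrative sketch: captureQuoted is a hypothetical helper, and unlike the regex it assumes the prefix sits at the very start of the input.

```haskell
import Data.List (stripPrefix)

-- Hypothetical helper mirroring  somestring; somestring2="([^"]*)"
-- with the assumption that the prefix is at the start of the input.
captureQuoted :: String -> Maybe String
captureQuoted input = do
  rest <- stripPrefix "somestring; somestring2=\"" input
  -- [^"]* : take characters up to (not including) the first quote
  let (value, after) = break (== '"') rest
  case after of
    ('"' : _) -> Just value  -- closing quote found
    _         -> Nothing     -- no closing quote, so the pattern would not match

main :: IO ()
main = print (captureQuoted "somestring; somestring2=\"my_super_string123\"")
-- Just "my_super_string123"
```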
How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match.
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group so that x is "xyz"?
I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
    x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
At which point x is an Array Int of matches of an Array Int of group-matches.
Note that "\w" is Perl syntax so you need regex-pcre to access it. If you want Unix/Posix extended regular expressions you should use regex-tdfa which is cross-platform and avoid using regex-posix that hits each platform's bugs in implementing the regex.h library.
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.
Annotate the result with the type [[String]]. Then you'll get a list of matches, each being a list of the matched text followed by the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]