haskell regex substitution - regex

Despite the ridiculously large number of regex matching engines for Haskell, the only one I can find that will substitute is Text.Regex, which, while decent, is missing a few things I like from pcre. Are there any pcre-based packages which will do substitution, or am I stuck with this?

I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this. Including Scheme. So here's some actual resources; my code is from a project where I was trying to replace "qql foo bar baz qq" with text based on calling a function on the stuff inside the qq "brackets", because reasons.
Best option: pcre-heavy:
let newBody = gsub [re|\s(qq[a-z]+)\s(.*?)\sqq\s|] (unWikiReplacer2 titles) body in do
[snip]
unWikiReplacer2 :: [String] -> String -> [String] -> String
unWikiReplacer2 titles match subList = case length subList > 1 of
    True -> " --" ++ subList!!1 ++ "-- "
    False -> match
Note that pcre-heavy directly supports function-based replacement, with any
string type. So nice.
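For comparison, here is a minimal sketch of the same function-based replacement in Python's standard `re` module, the kind of "trivial way" other languages offer. The names `QQ_PATTERN`, `un_wiki_replacer`, and `replace_qq` are made up for illustration; only the pattern and the replacement shape come from the Haskell code above.

```python
import re

# Same pattern as the pcre-heavy example: group 1 is the opening "qq..." tag,
# group 2 is the text between the tag and the closing "qq".
QQ_PATTERN = re.compile(r"\s(qq[a-z]+)\s(.*?)\sqq\s")


def un_wiki_replacer(match: re.Match) -> str:
    """Build the replacement text from the second capture group."""
    return " --" + match.group(2) + "-- "


def replace_qq(body: str) -> str:
    # re.sub accepts a callable and invokes it once per non-overlapping match.
    return QQ_PATTERN.sub(un_wiki_replacer, body)


print(replace_qq("x qql foo bar baz qq y"))  # x --foo bar baz-- y
```

Text without a match is returned unchanged, which corresponds to the `False -> match` fallback in the Haskell version.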
Another option: pcre-light with a small function that works but isn't exactly
performant:
let newBody = replaceAllPCRE "\\s(qq[a-z]+)\\s(.*?)\\sqq\\s" (unWikiReplacer titles) body in do
[snip]
unWikiReplacer :: [String] -> (PCRE.MatchResult String) -> String
unWikiReplacer titles mr = case length subList > 1 of
    True -> " --" ++ subList!!1 ++ "-- "
    False -> PCRE.mrMatch mr
  where
    subList = PCRE.mrSubList mr
-- A very simple, very dumb "replace all instances of this regex
-- with the results of this function" function. Relies on the
-- MatchResult return type.
--
-- https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs
-- was very helpful to me in constructing this
--
-- I also used
-- https://github.com/jaspervdj/hakyll/blob/ea7d97498275a23fbda06e168904ee261f29594e/src/Hakyll/Core/Util/String.hs
replaceAllPCRE :: String                                 -- ^ Pattern
               -> ((PCRE.MatchResult String) -> String)  -- ^ Replacement (called on capture)
               -> String                                 -- ^ Source string
               -> String                                 -- ^ Result
replaceAllPCRE pattern f source =
    if (source PCRE.=~ pattern) == True then
        replaceAllPCRE pattern f newStr
    else
        source
  where
    mr = (source PCRE.=~ pattern)
    newStr = (PCRE.mrBefore mr) ++ (f mr) ++ (PCRE.mrAfter mr)
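One caveat with the recursive version: it rescans the whole rewritten string after every replacement, so it is quadratic, and it never terminates if the replacement text itself matches the pattern. A single-pass variant avoids both problems. The following Python sketch shows the idea (before / replacement / after, resuming after each match); `replace_all` is a hypothetical name, not part of any library.

```python
import re
from typing import Callable


def replace_all(pattern: str, f: Callable[[re.Match], str], source: str) -> str:
    """Replace every match of `pattern` in `source` with f(match), scanning once."""
    pieces = []
    pos = 0
    for m in re.finditer(pattern, source):
        pieces.append(source[pos:m.start()])  # text before this match
        pieces.append(f(m))                   # replacement computed from the match
        pos = m.end()                         # resume scanning after the match
    pieces.append(source[pos:])               # text after the last match
    return "".join(pieces)


# The replacement "aa" matches the pattern "a" again, which would make a
# rescanning implementation loop forever; the single pass does not care.
print(replace_all(r"a", lambda m: "aa", "banana"))  # baanaanaa
```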
Someone else's fix: http://0xfe.blogspot.com/2010/09/regex-substitution-in-haskell.html
Another one, this time embedded in a major library: https://github.com/jaspervdj/hakyll/blob/master/src/Hakyll/Core/Util/String.hs
Another package for this purpose: https://hackage.haskell.org/package/pcre-utils

Update 2020
I totally agree with @rlpowell that
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this.
At the time of this writing, there is also Regex.Applicative.replace for regex substitution, though it's not Perl-compatible.
For pattern-matching and substitution with parsers instead of regex, there is Replace.Megaparsec.streamEdit

The regular expression API in regex-base is generic over the container of characters to match. Doing some kind of generic splicing to implement substitution would be very hard to make efficient. I did not want to provide a crappy generic routine.
Writing a small function to do the substitution exactly how you want is just a better idea, and it can be written to match your container.

Related

Why is my regex failing on certain strings that otherwise succeed?

I have code written in F# that iterates over an array of strings using regex to extract part of those strings. The problem is that the regex appears to randomly match some strings but fail on others, even on exact duplicates from the same list where it previously succeeded. What am I missing? Is this some sort of regex issue that I am not aware of?
Regex Pattern:
(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))
F# code:
[<Literal>]
let ApiPattern = @"(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))"
let parseOutEndpoints (inputs : (int * string) array) =
    let regEx = new Regex(ApiPattern, RegexOptions.Compiled)
    inputs |> Array.map (fun (id, path) -> [|id.ToString(); path|]) |> Array.collect (fun x -> x)
    |> writeRawPathsToFile
    File.ReadAllLines(RawPathsFile)
    |> Array.map(fun (x) ->
        let m = regEx.Match(x)
        if m.Success
        then
            let endpoint = Domain.Endpoint(m.Value)
            endpoint
        else
            let line = $"{x}"
            File.AppendAllLines(FailedRegexMatches, [line], Encoding.UTF8)
            Domain.NoEndpoint
    )
Sample string array Data:
All of these should return a match, but don't. In comparison to this original list, a significantly reduced list of successful matches will be returned.
/enterprise-review/v9/choose?rr=Straight&pr=1%2E35239
/review-id-service/v1/business-id
/orderout/v1/vendor/shipping
/vendor-service/v1/Product/PartnerId/35310108
/Inspect/v1/Recommendation/Products/LaneId/0002,519188,13148,16939,7348,195982
/bin-inventory/v1/vendor?el=1%2E35239
/u-future/v1/fone?fhid=3028
/decline-summary/v1/details/card/65821974
/provide-service/v8/proDetails
/monetary-points/v1/sum/wins/681197
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
This one helped to resolve your issue:
/(.*?/v\d/.*?((?=[\?\d\s])|$))
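The likely cause: the original pattern only accepts a path that is followed by ?, a digit, or \n, so it fails on lines that end right after the path, or that end in \r (the Windows carriage return) or other whitespace. The pattern above also accepts any whitespace and the end of the string ($ in regex).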
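The difference between the two patterns can be checked quickly outside F#. This sketch uses Python's `re` rather than .NET's `Regex`, on the assumption that the two engines agree for these particular constructs (lazy quantifiers, lookaheads, `$`); the helper name `endpoint` is made up.

```python
import re

ORIGINAL = re.compile(r"(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))")
FIXED = re.compile(r"/(.*?/v\d/.*?((?=[\?\d\s])|$))")


def endpoint(pattern: re.Pattern, line: str):
    """Return the matched endpoint text, or None when the line does not match."""
    m = pattern.search(line)
    return m.group(0) if m else None


samples = [
    "/enterprise-review/v9/choose?rr=Straight&pr=1%2E35239",
    "/vendor-service/v1/Product/PartnerId/35310108",
    "/comment/v1/data",    # nothing follows the path
    "/comment/v1/data\r",  # stray carriage return at the end
]

for line in samples:
    print(repr(line), endpoint(ORIGINAL, line), endpoint(FIXED, line))
```

The first two samples match under both patterns because a `?` or a digit follows the path. The last two only match under the fixed pattern.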
Here's your regex and input in Regex Storm, a .NET regex tester:
regex storm
I'd have made this a comment, but Regex Storm's share URLs contain the full regex and input, so it's too long for a comment (and SO doesn't allow URL shorteners in comments).
So, my question is: does this look right to you? Are all the highlighted matches what you're expecting to match? If so, as Regex Storm's engine is .NET based, I don't think there is a problem with the regex part of your code.

OCaml Str.full_split returns the original string instead of the expected substring

I am trying to write a program that will read diff files and return the filenames, just the filenames. So I wrote the following code
open Printf
open Str
let syname: string = "diff --git a/drivers/usc/filex.c b/drivers/usc/filex"
let fileb =
let pat_filename = Str.regexp "a\/(.+)b" in
let s = Str.full_split pat_filename syname in
s
let print_split_res (elem: Str.split_result) =
match elem with
| Text t -> print_string t
| Delim d -> print_string d
let rec print_list (l: Str.split_result list) =
match l with
| [] -> ()
| hd :: tl -> print_split_res hd ; print_string "\n" ; print_list tl
;;
() = print_list fileb
Upon running this I get the original string diff --git a/drivers/usc/filex.c b/drivers/usc/filex back as the output.
Whereas if I use the same regex pattern with the python standard library I get the desired result
import re
p=re.compile('a\/(.+)b')
p.findall("diff --git a/drivers/usc/filex.c b/drivers/usc/filex")
Output: ['drivers/usc/filex.c ']
What am I doing wrong?
Not to be snide, but the way to understand OCaml regular expressions is to read the documentation, not compare to things in another language :-) Sadly, there is no real standard for regular expressions across languages.
The main problem appears to be that parentheses in OCaml regular expressions match themselves. To get grouping behavior they need to be escaped with '\\'. In other words, your pattern is looking for actual parentheses in the filename. Your code works for me if you change your regular expression to this:
Str.regexp "a/\\(.+\\)b"
Note that the backslashes must themselves be escaped so that Str.regexp sees them.
You also have the problem that your pattern doesn't match the slash after b. So the resulting text will start with a slash.
As a side comment, I also removed the backslash before /, which is technically not allowed in an OCaml string.
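To connect this back to the Python comparison in the question, here is a small sketch of what each OCaml pattern corresponds to in Python's `re` syntax. The variable names are made up; the translation rests on the point above that unescaped parentheses are literal characters in Str.

```python
import re

line = "diff --git a/drivers/usc/filex.c b/drivers/usc/filex"

# What Str sees in "a/(.+)b": the parentheses are literal characters.
literal_parens = re.compile(r"a/\(.+\)b")
# What Str sees in the corrected "a/\\(.+\\)b": a capture group.
grouping = re.compile(r"a/(.+)b")

# No literal parentheses occur in the line, so nothing is split off and the
# whole line comes back, as in the question.
print(literal_parens.split(line))

# With grouping, the captured file name appears between the surrounding text,
# and the trailing text starts with a slash.
print(grouping.split(line))
```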

(Ocaml) Using 'match' to extract list of chars from a list of chars

I have just started to learn ocaml and I find it difficult to extract small list of chars from a bigger list of chars.
lets say I have:
let list_of_chars = ['#' ; 'a' ; 'b' ; 'c'; ... ; '!' ; '3' ; '4' ; '5' ];;
I have the following knowledge - I know that in the
list above I have '#' followed by a '!' in some location further in the list .
I want to extract the lists ['a' ;'b' ;'c' ; ...] and ['3' ; '4' ; '5'] and do something with them,
so I do the following thing:
let variable = match list_of_chars with
| '#'::l1@['!']@l2 -> (*[code to do something with l1 and l2]*)
| _ -> raise Exception ;;
This code doesn't work for me, it's throwing errors. Is there a simple way of doing this?
(specifically for using match)
As another answer points out, you can’t use pattern matching for this because pattern matching only lets you use constructors and @ is not a constructor.
Here is how you might solve your problem
let split ~equal ~on list =
  let rec go acc = function
    | [] -> None
    | x::xs -> if equal x on then Some (List.rev acc, xs) else go (x::acc) xs
  in
  go [] list
let variable = match list_of_chars with
  | '#'::rest ->
    (match split rest ~on:'!' ~equal:(Char.equal) with
     | None -> raise Exception
     | Some (left,right) ->
       ... (* your code here *))
  | _ -> raise Exception
I’m now going to hypothesise that you are trying to do some kind of parsing or lexing. I recommend that you do not do it with a list of chars. Indeed, I think there is almost never a reason to have a list of chars in OCaml: a string is better for a string (a char list has an overhead of 23x in memory usage), and while one might use chars as a kind of mnemonic enum in C, OCaml has actual enums (aka variant types or sum types), so those should usually be used instead. I guess you might end up with a char list if you are doing something with a trie.
If you are interested in parsing or lexing, you may want to look into:
Ocamllex and ocamlyacc
Sedlex
Angstrom or another parser combinator library like it
One of the regular expression libraries (e.g. Re, Re2, Pcre; note Re and Re2 are mostly unrelated)
Using strings and functions like lsplit2
@ is an operator, not a valid pattern. Patterns need to be static and can't match a varying number of elements in the middle of a list. But since you know the position of ! it doesn't need to be dynamic. You can accomplish it using just :: patterns:
let variable = match list_of_chars with
| '#'::a::b::c::'!'::l2 -> let l1 = [a;b;c] in ...
| _ -> raise Exception ;;

Regular expressions versus lexical analyzers in Haskell

I'm getting started with Haskell and I'm trying to use the Alex tool to create regular expressions, and I'm a little bit lost. My first problem was the compile step: how do I compile a file with Alex? Then, I think that I have to import the modules that Alex generates into my code, but I'm not sure. If someone can help me, I would be very grateful!
You can specify regular expression functions in Alex.
Here for example, a regex in Alex to match floating point numbers:
$space = [\ \t\xa0]
$digit = 0-9
$octit = 0-7
$hexit = [$digit A-F a-f]
@sign = [\-\+]
@decimal = $digit+
@octal = $octit+
@hexadecimal = $hexit+
@exponent = [eE] [\-\+]? @decimal
@number = @decimal
        | @decimal \. @decimal @exponent?
        | @decimal @exponent
        | 0[oO] @octal
        | 0[xX] @hexadecimal
lex :-
@sign? @number { strtod }
When we match the floating point number, we dispatch to a parsing function to operate on that captured string, which we can then wrap and expose to the user as a parsing function:
readDouble :: ByteString -> Maybe (Double, ByteString)
readDouble str = case alexScan (AlexInput '\n' str) 0 of
    AlexEOF -> Nothing
    AlexError _ -> Nothing
    AlexToken (AlexInput _ rest) n _ ->
        case strtod (B.unsafeTake n str) of d -> d `seq` Just $! (d , rest)
A nice consequence of using Alex for this regex matching is that the performance is good, as the regex engine is compiled statically. It can also be exposed as a regular Haskell library built with cabal. For the full implementation, see bytestring-lexing.
The general advice on when to use a lexer instead of a regex matcher would be that, if you have a grammar for the lexemes you're trying to match, as I did for floating point, use Alex. If you don't, and the structure is more ad hoc, use a regex engine.
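For a sense of what the "regex engine" side of that trade-off looks like, here is a rough sketch of the same number grammar assembled as one regular expression in Python. This is an illustration only, not how bytestring-lexing works: the names are made up, and unlike Alex (which picks the longest match) a backtracking engine tries alternatives in order, so the sketch relies on `fullmatch` to check whole strings.

```python
import re

# Building blocks mirroring the Alex macros above.
DECIMAL = r"[0-9]+"
EXPONENT = rf"[eE][-+]?{DECIMAL}"
NUMBER = (
    rf"(?:{DECIMAL}\.{DECIMAL}(?:{EXPONENT})?"  # 3.14 or 3.14e-2
    rf"|{DECIMAL}{EXPONENT}"                    # 1e5
    rf"|0[oO][0-7]+"                            # 0o17
    rf"|0[xX][0-9A-Fa-f]+"                      # 0x1F
    rf"|{DECIMAL})"                             # 42
)
SIGNED_NUMBER = re.compile(rf"[-+]?{NUMBER}")


def is_number(text: str) -> bool:
    """True when the whole string is a number according to the grammar above."""
    return SIGNED_NUMBER.fullmatch(text) is not None


for candidate in ["3.14e-2", "-12", "0x1F", "1.", "abc"]:
    print(candidate, is_number(candidate))
```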
Why do you want to use alex to create regular expressions?
If all you want is to do some regex matching etc, you should look at the regex-base package.
If it is plain regex you want, the API is specified in Text.Regex.Base. Then there are the implementations Text.Regex.Posix, Text.Regex.PCRE and several others. The Haddock documentation is a bit slim; however, the basics are described in Real World Haskell, chapter 8. Some more in-depth material is described in this SO question.

How to do Erlang pattern matching using regular expressions?

When I write Erlang programs which do text parsing, I frequently run into situations where I would love to do a pattern match using a regular expression.
For example, I wish I could do something like this, where ~ is a "made up" regular expression matching operator:
my_function(String ~ ["^[A-Za-z]+[A-Za-z0-9]*$"]) ->
....
I know about the regular expression module (re) but AFAIK you cannot call functions when pattern matching or in guards.
Also, I wish matching strings could be done in a case-insensitive way. This is handy, for example, when parsing HTTP headers, I would love to do something like this where "Str ~ {Pattern, Options}" means "Match Str against pattern Pattern using options Options":
handle_accept_language_header(Header ~ {"Accept-Language", [case_insensitive]}) ->
...
Two questions:
How do you typically handle this using just standard Erlang? Is there some mechanism / coding style which comes close to this in terms of conciseness and easiness to read?
Is there any work (an EEP?) going on in Erlang to address this?
You really don't have much choice other than to run your regexps in advance and then pattern match on the results. Here's a very simple example that approaches what I think you're after, but it does suffer from the flaw that you need to repeat the regexps twice. You could make this less painful by using a macro to define each regexp in one place.
-module(multire).
-compile(export_all).
multire([],_) ->
nomatch;
multire([RE|RegExps],String) ->
case re:run(String,RE,[{capture,none}]) of
match ->
RE;
nomatch ->
multire(RegExps,String)
end.
test(Foo) ->
test2(multire(["^Hello","world$","^....$"],Foo),Foo).
test2("^Hello",Foo) ->
io:format("~p matched the hello pattern~n",[Foo]);
test2("world$",Foo) ->
io:format("~p matched the world pattern~n",[Foo]);
test2("^....$",Foo) ->
io:format("~p matched the four chars pattern~n",[Foo]);
test2(nomatch,Foo) ->
io:format("~p failed to match~n",[Foo]).
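The "run the regexps first, then dispatch on which one matched" idea is not specific to Erlang. As a language-neutral illustration, here is a sketch in Python where each regexp is written once, next to its handler, which sidesteps the repetition mentioned above. `HANDLERS` and `dispatch` are made-up names.

```python
import re

# Each pattern appears exactly once, paired with the label of its handler.
HANDLERS = [
    (re.compile(r"^Hello"), "the hello pattern"),
    (re.compile(r"world$"), "the world pattern"),
    (re.compile(r"^....$"), "the four chars pattern"),
]


def dispatch(text: str) -> str:
    """Try the patterns in order and report the first one that matches."""
    for regex, label in HANDLERS:
        if regex.search(text):
            return f"{text!r} matched {label}"
    return f"{text!r} failed to match"


print(dispatch("Hello there"))
print(dispatch("xyz"))
```

As in the Erlang version, the order of the list decides which pattern wins when several match.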
A possibility could be to use Erlang Web-style annotations (macros) combined with the re Erlang module. An example is probably the best way to illustrate this.
This is what your final code will look like:
[...]
?MATCH({Regexp, Options}).
foo(_Args) ->
ok.
[...]
The MATCH macro would be executed just before your foo function. The flow of execution will fail if the regexp pattern is not matched.
Your match function will be declared as follows:
?BEFORE.
match({Regexp, Options}, TgtMod, TgtFun, TgtFunArgs) ->
String = proplists:get_value(string, TgtFunArgs),
case re:run(String, Regexp, Options) of
nomatch ->
{error, {TgtMod, match_error, []}};
{match, _Captured} ->
{proceed, TgtFunArgs}
end.
Please note that:
The BEFORE says that macro will be executed before your target function (AFTER macro is also available).
The match_error is your error handler, specified in your module, and contains the code you want to execute if you fail a match (maybe nothing, just block the execution flow)
This approach has the advantage of keeping the regexp syntax and options uniform with the re module (avoid confusion).
More information about the Erlang Web annotations here:
http://wiki.erlang-web.org/Annotations
and here:
http://wiki.erlang-web.org/HowTo/CreateAnnotation
The software is open source, so you might want to reuse their annotation engine.
You can use the re module:
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$").
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$", [caseless]).
EDIT:
match(String, Regexps) ->
case lists:dropwhile(
fun({Regexp, Opts}) -> re:run(String, Regexp, Opts) =:= nomatch;
(Regexp) -> re:run(String, Regexp) =:= nomatch end,
Regexps) of
[R|_] -> R;
_ -> nomatch
end.
example(String) ->
Regexps = ["$RE1^", {"$RE2^", [caseless]}, "$RE3"],
case match(String, Regexps) of
nomatch -> handle_error();
Regexp -> handle_regexp(String, Regexp)
...
For string, you could use the 're' module : afterwards, you iterate over the result set. I am afraid there isn't another way to do it AFAIK: that's why there are regexes.
For the HTTP headers, since there can be many, I would consider iterating over the result set to be a better option instead of writing a very long expression (potentially).
EEP work : I do not know.
Erlang does not handle regular expressions in patterns.
No.
You can't pattern match on regular expressions, sorry. So you have to do
my_function(String) -> Matches = re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$"),
...