How to do Erlang pattern matching using regular expressions? - regex

When I write Erlang programs which do text parsing, I frequently run into situations where I would love to do a pattern match using a regular expression.
For example, I wish I could do something like this, where ~ is a "made up" regular expression matching operator:
my_function(String ~ ["^[A-Za-z]+[A-Za-z0-9]*$"]) ->
....
I know about the regular expression module (re) but AFAIK you cannot call functions when pattern matching or in guards.
Also, I wish matching strings could be done in a case-insensitive way. This is handy, for example, when parsing HTTP headers, I would love to do something like this where "Str ~ {Pattern, Options}" means "Match Str against pattern Pattern using options Options":
handle_accept_language_header(Header ~ {"Accept-Language", [case_insensitive]}) ->
...
Two questions:
How do you typically handle this using just standard Erlang? Is there some mechanism / coding style which comes close to this in terms of conciseness and easiness to read?
Is there any work (an EEP?) going on in Erlang to address this?

You really don't have much choice other than to run your regexps in advance and then pattern match on the results. Here's a very simple example that approaches what I think you're after, but it does suffer from the flaw that you need to repeat the regexps twice. You could make this less painful by using a macro to define each regexp in one place.
-module(multire).
-compile(export_all).
multire([],_) ->
nomatch;
multire([RE|RegExps],String) ->
case re:run(String,RE,[{capture,none}]) of
match ->
RE;
nomatch ->
multire(RegExps,String)
end.
test(Foo) ->
test2(multire(["^Hello","world$","^....$"],Foo),Foo).
test2("^Hello",Foo) ->
io:format("~p matched the hello pattern~n",[Foo]);
test2("world$",Foo) ->
io:format("~p matched the world pattern~n",[Foo]);
test2("^....$",Foo) ->
io:format("~p matched the four chars pattern~n",[Foo]);
test2(nomatch,Foo) ->
io:format("~p failed to match~n",[Foo]).

A possibility could be to use Erlang Web-style annotations (macros) combined with the re Erlang module. An example is probably the best way to illustrate this.
This is how your final code will look like:
[...]
?MATCH({Regexp, Options}).
foo(_Args) ->
ok.
[...]
The MATCH macro would be executed just before your foo function. The flow of execution will fail if the regexp pattern is not matched.
Your match function will be declared as follows:
?BEFORE.
match({Regexp, Options}, TgtMod, TgtFun, TgtFunArgs) ->
String = proplists:get_value(string, TgtArgs),
case re:run(String, Regexp, Options) of
nomatch ->
{error, {TgtMod, match_error, []}};
{match, _Captured} ->
{proceed, TgtFunArgs}
end.
Please note that:
The BEFORE says that macro will be executed before your target function (AFTER macro is also available).
The match_error is your error handler, specified in your module, and contains the code you want to execute if you fail a match (maybe nothing, just block the execution flow)
This approach has the advantage of keeping the regexp syntax and options uniform with the re module (avoid confusion).
More information about the Erlang Web annotations here:
http://wiki.erlang-web.org/Annotations
and here:
http://wiki.erlang-web.org/HowTo/CreateAnnotation
The software is open source, so you might want to reuse their annotation engine.

You can use the re module:
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$").
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$", [caseless]).
EDIT:
match(String, Regexps) ->
case lists:dropwhile(
fun({Regexp, Opts}) -> re:run(String, Regexp, Opts) =:= nomatch;
(Regexp) -> re:run(String, Regexp) =:= nomatch end,
Regexps) of
[R|_] -> R;
_ -> nomatch
end.
example(String) ->
Regexps = ["$RE1^", {"$RE2^", [caseless]}, "$RE3"]
case match(String, Regexps) of
nomatch -> handle_error();
Regexp -> handle_regexp(String, Regexp)
...

For string, you could use the 're' module : afterwards, you iterate over the result set. I am afraid there isn't another way to do it AFAIK: that's why there are regexes.
For the HTTP headers, since there can be many, I would consider iterating over the result set to be a better option instead of writing a very long expression (potentially).
EEP work : I do not know.

Erlang does not handle regular expressions in patterns.
No.

You can't pattern match on regular expressions, sorry. So you have to do
my_function(String) -> Matches = re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$"),
...

Related

Why is my regex failing on on certain strings that otherwise succeed?

I have code written in F# that iterates over an array of strings using regex to extract part of those strings. The problem is that the regex appears to randomly successfully match on some, but fail on others, even on an exact duplicates from the same list where it previously succeeded. What am I missing? Is this some sort of regex issue that I am not aware of?
Regex Pattern:
(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))
F# code:
[<Literal>]
let ApiPattern = #"(?i)/(.*?/v\d/.*?((?=\?)|(?=\d)|(?=\n)))"
let parseOutEndpoints (inputs : (int * string) array) =
let regEx = new Regex(ApiPattern, RegexOptions.Compiled)
inputs |> Array.map (fun (id, path) -> [|id.ToString(); path|]) |> Array.collect (fun x -> x)
|> writeRawPathsToFile
File.ReadAllLines(RawPathsFile)
|> Array.map(fun (x) ->
let m = regEx.Match(x)
if m.Success
then
let endpoint = Domain.Endpoint(m.Value)
endpoint
else
let line = $"{x}"
File.AppendAllLines(FailedRegexMatches, [line], Encoding.UTF8)
Domain.NoEndpoint
)
Sample string array Data:
All of these should return a match, but don't. In comparison to this original list, a significantly reduced list of successful matches will be returned.
/enterprise-review/v9/choose?rr=Straight&pr=1%2E35239
/review-id-service/v1/business-id
/orderout/v1/vendor/shipping
/vendor-service/v1/Product/PartnerId/35310108
/Inspect/v1/Recommendation/Products/LaneId/0002,519188,13148,16939,7348,195982
/bin-inventory/v1/vendor?el=1%2E35239
/u-future/v1/fone?fhid=3028
/decline-summary/v1/details/card/65821974
/provide-service/v8/proDetails
/monetary-points/v1/sum/wins/681197
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
/listen-service/v1/audio-Details
/comment/v1/data
/comment/v1/data
This one helped to resolve your issue:
/(.*?/v\d/.*?((?=[\?\d\s])|$))
The reason behind problem: probably \r (windows carriage return), whitespaces and also end of string (noted as $ in regex).
Here's your regex and input in regexstorm, a .net Rex tester:
regex storm
I'd have made this a comment but RS's share urls contain the full Rex and input so it's too long for a comment (and SO doesn't allow url shorteners in comments)
So, my question is; does this look right to you? Are all the highlighted matches what you're expecting to match? If so, as RS's engine is .net based, I don't think there is a problem with the regex part of your code..

Regular Exp match anything but not specific string

I am handling user input in my program by using regular exp.
the string contains /_MyWord/ and only a-z is accepted before /_MyWord/.
the string not contain /s/123, /s/32A and atr/will in the beginning.
My try:
^(?!.*/s/123)(?!.*/s/32A )(?!.*atr/will)([/a-z]+)/_MyWord/(.*)$
Example:
/s/123/QWERERTYU/_MyWord/45454545 -> fail
/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452 -> OK
HiCanYouHelpMe/_MyWord/fgh -> OK
/_MyWord/HiCanYouHelpMefgh -> OK
Can anyone help me to finish the Regular Exp string
If I got your question correctly, try this regex:
^(?!.*\/s\/123)(?!.*\/s\/32A)(?!.*atr\/will)([\/a-zA-Z]*)\/_MyWord\/(.*)$
Unescaped: ^(?!.*/s/123)(?!.*/s/32A)(?!.*atr/will)([/a-zA-Z]*)/_MyWord/(.*)$
Changed ([\/a-z]+) to ([\/a-zA-Z]*) to include lower and upper case as well as support none (e.g /_MyWord/Test)
Regex101 Demo
Works for
/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452
HiCanYouHelpMe/_MyWord/fgh
/_MyWord/HiCanYouHelpMefgh
Doesn't match:
/s/123/QWERERTYU/_MyWord/45454545
atr/will/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452
Also, you really don't need lookaheads for /s/123 and /s/32A since they contain numbers so they will automatically be rejected because your condition includes [a-zA-Z]. So you might want to remove (?!.*\/s\/123)(?!.*\/s\/32A) from the beginning.

F#: Detecting errors in regex patterns

I am new to programming and F# is my first .NET language.
As a beginner's project, I would like to write an application asking the user to enter a regex pattern and then flagging any errors.
I have looked through the Regex API on MSDN but there doesn't seem to be any methods that would automatically detect any errors in regex patterns. Will more experienced programmers kindly share with me how they would go about accomplishing this?
Thank you in advance for your help.
If you need to check if a regex compiles or not, simply use try-with block. If you need to check if a regex pattern matches your input string, use IsMatch() or .Success. That is quite enough.
An example with code taken from another SO post, but with an error in regex pattern where I replaced (http:\/\/\S+) with (http:\/\/\S+:
try
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches input =
Regex.Matches(input, "(http:\/\/\S+")
|> Seq.cast<Match>
|> Seq.groupBy (fun m -> m.Value)
|> Seq.map (fun (value, groups) -> value, (groups |> Seq.length))
with
| :? System.Exception as ex -> printfn "Exception! %s " (ex.Message); None
More on F# exception raising can be found here or here.

haskell regex substitution

Despite the ridiculously large number of regex matching engines for Haskell, the only one I can find that will substitute is Text.Regex, which, while decent, is missing a few thing I like from pcre. Are there any pcre-based packages which will do substitution, or am I stuck with this?
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this. Including Scheme. So here's some actual resources; my code is from a project where I was trying to replace "qql foo bar baz qq" with text based on calling a function on the stuff inside the qq "brackets", because reasons.
Best option: pcre-heavy:
let newBody = gsub [re|\s(qq[a-z]+)\s(.*?)\sqq\s|] (unWikiReplacer2 titles) body in do
[snip]
unWikiReplacer2 :: [String] -> String -> [String] -> String
unWikiReplacer2 titles match subList = case length subList > 0 of
True -> " --" ++ subList!!1 ++ "-- "
False -> match
Note that pcre-heavy directly supports function-based replacement, with any
string type. So nice.
Another option: pcre-light with a small function that works but isn't exactly
performant:
let newBody = replaceAllPCRE "\\s(qq[a-z]+)\\s(.*?)\\sqq\\s" (unWikiReplacer titles) body in do
[snip]
unWikiReplacer :: [String] -> (PCRE.MatchResult String) -> String
unWikiReplacer titles mr = case length subList > 0 of
True -> " --" ++ subList!!1 ++ "-- "
False -> PCRE.mrMatch mr
where
subList = PCRE.mrSubList mr
-- A very simple, very dumb "replace all instances of this regex
-- with the results of this function" function. Relies on the
-- MatchResult return type.
--
-- https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs
-- was very helpful to me in constructing this
--
-- I also used
-- https://github.com/jaspervdj/hakyll/blob/ea7d97498275a23fbda06e168904ee261f29594e/src/Hakyll/Core/Util/String.hs
replaceAllPCRE :: String -- ^ Pattern
-> ((PCRE.MatchResult String) -> String) -- ^ Replacement (called on capture)
-> String -- ^ Source string
-> String -- ^ Result
replaceAllPCRE pattern f source =
if (source PCRE.=~ pattern) == True then
replaceAllPCRE pattern f newStr
else
source
where
mr = (source PCRE.=~ pattern)
newStr = (PCRE.mrBefore mr) ++ (f mr) ++ (PCRE.mrAfter mr)
Someone else's fix: http://0xfe.blogspot.com/2010/09/regex-substitution-in-haskell.html
Another one, this time embedded in a major library: https://github.com/jaspervdj/hakyll/blob/master/src/Hakyll/Core/Util/String.hs
Another package for this purpose: https://hackage.haskell.org/package/pcre-utils
Update 2020
I totally agree with #rlpowell that
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this.
At the time of this writing, there is also Regex.Applicative.replace for regex substitution, though it's not Perl-compatible.
For pattern-matching and substitution with parsers instead of regex, there is Replace.Megaparsec.streamEdit
The regular expression API in regex-base is generic to the container of characters to match. Doing some kind of splicing generically to implements substitution would be very hard to make efficient. I did not want to provide a crappy generic routine.
Writing a small function to do the substitution exactly how you want is just a better idea, and it can be written to match your container.

Regular expressions versus lexical analyzers in Haskell

I'm getting started with Haskell and I'm trying to use the Alex tool to create regular expressions and I'm a little bit lost; my first inconvenience was the compile part. How I have to do to compile a file with Alex?. Then, I think that I have to import into my code the modules that alex generates, but not sure. If someone can help me, I would be very greatful!
You can specify regular expression functions in Alex.
Here for example, a regex in Alex to match floating point numbers:
$space = [\ \t\xa0]
$digit = 0-9
$octit = 0-7
$hexit = [$digit A-F a-f]
#sign = [\-\+]
#decimal = $digit+
#octal = $octit+
#hexadecimal = $hexit+
#exponent = [eE] [\-\+]? #decimal
#number = #decimal
| #decimal \. #decimal #exponent?
| #decimal #exponent
| 0[oO] #octal
| 0[xX] #hexadecimal
lex :-
#sign? #number { strtod }
When we match the floating point number, we dispatch to a parsing function to operate on that captured string, which we can then wrap and expose to the user as a parsing function:
readDouble :: ByteString -> Maybe (Double, ByteString)
readDouble str = case alexScan (AlexInput '\n' str) 0 of
AlexEOF -> Nothing
AlexError _ -> Nothing
AlexToken (AlexInput _ rest) n _ ->
case strtod (B.unsafeTake n str) of d -> d `seq` Just $! (d , rest)
A nice consequence of using Alex for this regex matching is that the performance is good, as the regex engine is compiled statically. It can also be exposed as a regular Haskell library built with cabal. For the full implementation, see bytestring-lexing.
The general advice on when to use a lexer instead of a regex matcher would be that, if you have a grammar for the lexemes you're trying to match, as I did for floating point, use Alex. If you don't, and the structure is more ad hoc, use a regex engine.
Why do you want to use alex to create regular expressions?
If all you want is to do some regex matching etc, you should look at the regex-base package.
If it is plain Regex you want, the API is specified in text.regex.base. Then there are the implementations text.regex.Posix , text.regex.pcre and several others. The Haddoc documentation is a bit slim, however the basics are described in Real World Haskell, chapter 8. Some more indepth stuff is descriped in this SO question.