Pattern matching on string - ocaml

say i have the following type definition:
type e = AES | DES | SALSA20
Then, i have a function which has an argument of type string, this argument could be "AES", "DES", "SALSA20", or something else. How could I first turn this string into a type and then match this type with possible types? Or, is it necessary because one may argue you can just match a string directly as the following?
match str with
| "AES" -> do something...
| "DES" -> do something else...
If possible, can I also get a basic intro to how pattern-matching works on string? I feel direct string matching will be much slower...

You can write a function like this:
let e_of_string = function
| "AES" -> AES
| "DES" -> DES
| "SALSA20" -> SALSA20
| _ -> raise (Invalid_argument "e_of_string")
I examined the native code generated by OCaml 4.01.0; it compares the supplied string against each of the given strings in order. If there are a lot of strings to check, this is, as you say, pretty slow.
To get more powerful string matching, you could use regular expressions from the Str module, or you could use ocamllex.

You can do it either way. If you are going to perform multiple matches then mapping the string to a data type first makes a lot of sense. For just a single match, I would pick whichever leads to the most straightforward code (or the most consistent, if this is something that is done elsewhere in your code base).
Mapping to a type might look something like:
type thing = AES | DES | SALSA20
let thing_of_string = function
| "AES" -> AES
| "DES" -> DES
| "SALSA20" -> SALSA20
| _ -> failwith "not a thing"
String matching was improved recently and should be acceptably efficient for what it is. I would still avoid performing the same string match many times.

Related

(Ocaml) Using 'match' to extract list of chars from a list of chars

I have just started to learn ocaml and I find it difficult to extract small list of chars from a bigger list of chars.
lets say I have:
let list_of_chars = ['#' ; 'a' ; 'b' ; 'c'; ... ; '!' ; '3' ; '4' ; '5' ];;
I have the following knowledge - I know that in the
list above I have '#' followed by a '!' in some location further in the list .
I want to extract the lists ['a' ;'b' ;'c' ; ...] and ['3' ; '4' ; '5'] and do something with them,
so I do the following thing:
let variable = match list_of_chars with
| '#'::l1#['!']#l2 -> (*[code to do something with l1 and l2]*)
| _ -> raise Exception ;;
This code doesn't work for me, it's throwing errors. Is there a simple way of doing this?
(specifically for using match)
As another answer points out, you can’t use pattern matching for this because pattern matching only lets you use constructors and # is not a constructor.
Here is how you might solve your problem
let split ~equal ~on list =
let rec go acc = function
| [] -> None
| x::xs -> if equal x on then Some (rev acc, xs) else go (x::acc) xs
in
go [] list
let variable = match list_of_chars with
| '#'::rest ->
match split rest ~on:'!' ~equal:(Char.equal) with
| None -> raise Exception
| Some (left,right) ->
... (* your code here *)
I’m now going to hypothesise that you are trying to do some kind of parsing or lexing. I recommend that you do not do it with a list of chars. Indeed I think there is almost never a reason to have a list of chars in ocaml: a string is better for a string (a chat list has an overhead of 23x in memory usage) and while one might use chars as a kind of mnemonic enum in C, ocaml has actual enums (aka variant types or sum types) so those should usually be used instead. I guess you might end up with a chat list if you are doing something with a trie.
If you are interested in parsing or lexing, you may want to look into:
Ocamllex and ocamlyacc
Sedlex
Angstrom or another parser generator like it
One of the regular expression libraries (eg Re, Re2, Pcre (note Re and Re2 are mostly unrelated)
Using strings and functions like lsplit2
# is an operator, not a valid pattern. Patterns need to be static and can't match a varying number of elements in the middle of a list. But since you know the position of ! it doesn't need to be dynamic. You can accomplish it just using :::
let variable = match list_of_chars with
| '#'::a::b::c::'!'::l2 -> let l1 = [a;b;c] in ...
| _ -> raise Exception ;;

haskell regex substitution

Despite the ridiculously large number of regex matching engines for Haskell, the only one I can find that will substitute is Text.Regex, which, while decent, is missing a few thing I like from pcre. Are there any pcre-based packages which will do substitution, or am I stuck with this?
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this. Including Scheme. So here's some actual resources; my code is from a project where I was trying to replace "qql foo bar baz qq" with text based on calling a function on the stuff inside the qq "brackets", because reasons.
Best option: pcre-heavy:
let newBody = gsub [re|\s(qq[a-z]+)\s(.*?)\sqq\s|] (unWikiReplacer2 titles) body in do
[snip]
unWikiReplacer2 :: [String] -> String -> [String] -> String
unWikiReplacer2 titles match subList = case length subList > 0 of
True -> " --" ++ subList!!1 ++ "-- "
False -> match
Note that pcre-heavy directly supports function-based replacement, with any
string type. So nice.
Another option: pcre-light with a small function that works but isn't exactly
performant:
let newBody = replaceAllPCRE "\\s(qq[a-z]+)\\s(.*?)\\sqq\\s" (unWikiReplacer titles) body in do
[snip]
unWikiReplacer :: [String] -> (PCRE.MatchResult String) -> String
unWikiReplacer titles mr = case length subList > 0 of
True -> " --" ++ subList!!1 ++ "-- "
False -> PCRE.mrMatch mr
where
subList = PCRE.mrSubList mr
-- A very simple, very dumb "replace all instances of this regex
-- with the results of this function" function. Relies on the
-- MatchResult return type.
--
-- https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs
-- was very helpful to me in constructing this
--
-- I also used
-- https://github.com/jaspervdj/hakyll/blob/ea7d97498275a23fbda06e168904ee261f29594e/src/Hakyll/Core/Util/String.hs
replaceAllPCRE :: String -- ^ Pattern
-> ((PCRE.MatchResult String) -> String) -- ^ Replacement (called on capture)
-> String -- ^ Source string
-> String -- ^ Result
replaceAllPCRE pattern f source =
if (source PCRE.=~ pattern) == True then
replaceAllPCRE pattern f newStr
else
source
where
mr = (source PCRE.=~ pattern)
newStr = (PCRE.mrBefore mr) ++ (f mr) ++ (PCRE.mrAfter mr)
Someone else's fix: http://0xfe.blogspot.com/2010/09/regex-substitution-in-haskell.html
Another one, this time embedded in a major library: https://github.com/jaspervdj/hakyll/blob/master/src/Hakyll/Core/Util/String.hs
Another package for this purpose: https://hackage.haskell.org/package/pcre-utils
Update 2020
I totally agree with #rlpowell that
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this.
At the time of this writing, there is also Regex.Applicative.replace for regex substitution, though it's not Perl-compatible.
For pattern-matching and substitution with parsers instead of regex, there is Replace.Megaparsec.streamEdit
The regular expression API in regex-base is generic to the container of characters to match. Doing some kind of splicing generically to implements substitution would be very hard to make efficient. I did not want to provide a crappy generic routine.
Writing a small function to do the substitution exactly how you want is just a better idea, and it can be written to match your container.

Regular expressions versus lexical analyzers in Haskell

I'm getting started with Haskell and I'm trying to use the Alex tool to create regular expressions and I'm a little bit lost; my first inconvenience was the compile part. How I have to do to compile a file with Alex?. Then, I think that I have to import into my code the modules that alex generates, but not sure. If someone can help me, I would be very greatful!
You can specify regular expression functions in Alex.
Here for example, a regex in Alex to match floating point numbers:
$space = [\ \t\xa0]
$digit = 0-9
$octit = 0-7
$hexit = [$digit A-F a-f]
#sign = [\-\+]
#decimal = $digit+
#octal = $octit+
#hexadecimal = $hexit+
#exponent = [eE] [\-\+]? #decimal
#number = #decimal
| #decimal \. #decimal #exponent?
| #decimal #exponent
| 0[oO] #octal
| 0[xX] #hexadecimal
lex :-
#sign? #number { strtod }
When we match the floating point number, we dispatch to a parsing function to operate on that captured string, which we can then wrap and expose to the user as a parsing function:
readDouble :: ByteString -> Maybe (Double, ByteString)
readDouble str = case alexScan (AlexInput '\n' str) 0 of
AlexEOF -> Nothing
AlexError _ -> Nothing
AlexToken (AlexInput _ rest) n _ ->
case strtod (B.unsafeTake n str) of d -> d `seq` Just $! (d , rest)
A nice consequence of using Alex for this regex matching is that the performance is good, as the regex engine is compiled statically. It can also be exposed as a regular Haskell library built with cabal. For the full implementation, see bytestring-lexing.
The general advice on when to use a lexer instead of a regex matcher would be that, if you have a grammar for the lexemes you're trying to match, as I did for floating point, use Alex. If you don't, and the structure is more ad hoc, use a regex engine.
Why do you want to use alex to create regular expressions?
If all you want is to do some regex matching etc, you should look at the regex-base package.
If it is plain Regex you want, the API is specified in text.regex.base. Then there are the implementations text.regex.Posix , text.regex.pcre and several others. The Haddoc documentation is a bit slim, however the basics are described in Real World Haskell, chapter 8. Some more indepth stuff is descriped in this SO question.

How can I use a regular expression to match something in the form 'stuff=foo' 'stuff' = 'stuff' 'more stuff'

I need a regexp to match something like this,
'text' | 'text' | ... | 'text'(~text) = 'text' | 'text' | ... | 'text'
I just want to divide it up into two sections, the part on the left of the equals sign and the part on the right. Any of the 'text' entries can have "=" between the ' characters though. I was thinking of trying to match an even number of 's followed by a =, but I'm not sure how to match an even number of something.. Also note I don't know how many entries on either side there could be. A couple examples,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = '51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
Should extract,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33)
and,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637) = '227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Should extract,
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637)
and,
'227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Also note the little tilda bit at the end of the first section is optional, so I can't just look for that.
Actually I wouldn't use a regex for that at all. Assuming your language has a split operation, I'd first split on the | character to get a list of:
'51NL9637X33'
'ISL6262ACRZ-T'
'QFN'(~51NL9637X33) = '51NL9637X33'
'ISL6262ACRZ-T'
'INTERSIL'
'QFN7SQ-HT1_P49'
'()'
Then I'd split each of them on the = character to get the key and (optional) value:
'51NL9637X33' <null>
'ISL6262ACRZ-T' <null>
'QFN'(~51NL9637X33) '51NL9637X33'
'ISL6262ACRZ-T' <null>
'INTERSIL' <null>
'QFN7SQ-HT1_P49' <null>
'()' <null>
You haven't specified why you think a regex is the right tool for the job but most modern languages also have a split capability and regexes aren't necessarily the answer to every requirement.
I agree with paxdiablo in that regular expressions might not be the most suitable tool for this task, depending on the language you are working with.
The question "How do I match an even number of characters?" is interesting nonetheless, and here is how I'd do it in your case:
(?:'[^']*'|[^=])*(?==)
This expression matches the left part of your entry by looking for a ' at its current position. If it finds one, it runs forward to the next ' and thereby only matching an even number of quotes. If it does not find a ' it matches anything that is not an equal sign and then assures that an equal sign follows the matched string. It works because the regex engine evaluates OR constructs from left to right.
You could get the left and right parts in two capturing groups by using
((?:'[^']*'|[^=])*)=(.*)
I recommend http://gskinner.com/RegExr/ for tinkering with regular expressions. =)
As paxdiablo said, you almost certainly don't want to use a regex here. The split suggestion isn't bad; I myself would probably use a parser here—there's a lot of structure to exploit. The idea here is that you formally specify the syntax of what you have—sort of like what you gave us, only rigorous. So, for instance: a field is a sequence of non-single-quote characters surrounded by single quotes; a fields is any number of fields separated by white space, a |, and more white space; a tilde is non-right-parenthesis characters surrounded by (~ and ); and an expr is a fields, optional whitespace, an optional tilde, a =, optional whitespace, and another fields. How you express this depends on the language you are using. In Haskell, for instance, using the Parsec library, you write each of those parsers as follows:
import Text.ParserCombinators.Parsec
field :: Parser String
field = between (char '\'') (char '\'') $ many (noneOf "'\n")
tilde :: Parser String
tilde = between (string "(~") (char ')') $ many (noneOf ")\n")
fields :: Parser [String]
fields = field `sepBy` (try $ spaces >> char '|' >> spaces)
expr :: Parser ([String],Maybe String,[String])
expr = do left <- fields
spaces
opt <- optionMaybe tilde
spaces >> char '=' >> spaces
right <- fields
(char '\n' >> return ()) <|> eof
return (left, opt, right)
Understanding precisely how this code works isn't really important; the basic idea is to break down what you're parsing, express it in formal rules, and build it back up out of the smaller components. And for something like this, it'll be much cleaner than a regex.
If you really want a regex, here you go (barely tested):
^\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?(\(~[^)\n]*\))?\s*=\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?\s*$
See why I recommend a parser? When I first wrote this, I got at least two things wrong which I picked up (one per test), and there's probably something else. And I didn't insert capturing groups where you wanted them because I wasn't sure where they'd go. Now yes, I could have made this more readable by inserting comments, etc. And after all, regexen have their uses! However, the point is: this is not one of them. Stick with something better.

How to do Erlang pattern matching using regular expressions?

When I write Erlang programs which do text parsing, I frequently run into situations where I would love to do a pattern match using a regular expression.
For example, I wish I could do something like this, where ~ is a "made up" regular expression matching operator:
my_function(String ~ ["^[A-Za-z]+[A-Za-z0-9]*$"]) ->
....
I know about the regular expression module (re) but AFAIK you cannot call functions when pattern matching or in guards.
Also, I wish matching strings could be done in a case-insensitive way. This is handy, for example, when parsing HTTP headers, I would love to do something like this where "Str ~ {Pattern, Options}" means "Match Str against pattern Pattern using options Options":
handle_accept_language_header(Header ~ {"Accept-Language", [case_insensitive]}) ->
...
Two questions:
How do you typically handle this using just standard Erlang? Is there some mechanism / coding style which comes close to this in terms of conciseness and easiness to read?
Is there any work (an EEP?) going on in Erlang to address this?
You really don't have much choice other than to run your regexps in advance and then pattern match on the results. Here's a very simple example that approaches what I think you're after, but it does suffer from the flaw that you need to repeat the regexps twice. You could make this less painful by using a macro to define each regexp in one place.
-module(multire).
-compile(export_all).
multire([],_) ->
nomatch;
multire([RE|RegExps],String) ->
case re:run(String,RE,[{capture,none}]) of
match ->
RE;
nomatch ->
multire(RegExps,String)
end.
test(Foo) ->
test2(multire(["^Hello","world$","^....$"],Foo),Foo).
test2("^Hello",Foo) ->
io:format("~p matched the hello pattern~n",[Foo]);
test2("world$",Foo) ->
io:format("~p matched the world pattern~n",[Foo]);
test2("^....$",Foo) ->
io:format("~p matched the four chars pattern~n",[Foo]);
test2(nomatch,Foo) ->
io:format("~p failed to match~n",[Foo]).
A possibility could be to use Erlang Web-style annotations (macros) combined with the re Erlang module. An example is probably the best way to illustrate this.
This is how your final code will look like:
[...]
?MATCH({Regexp, Options}).
foo(_Args) ->
ok.
[...]
The MATCH macro would be executed just before your foo function. The flow of execution will fail if the regexp pattern is not matched.
Your match function will be declared as follows:
?BEFORE.
match({Regexp, Options}, TgtMod, TgtFun, TgtFunArgs) ->
String = proplists:get_value(string, TgtArgs),
case re:run(String, Regexp, Options) of
nomatch ->
{error, {TgtMod, match_error, []}};
{match, _Captured} ->
{proceed, TgtFunArgs}
end.
Please note that:
The BEFORE says that macro will be executed before your target function (AFTER macro is also available).
The match_error is your error handler, specified in your module, and contains the code you want to execute if you fail a match (maybe nothing, just block the execution flow)
This approach has the advantage of keeping the regexp syntax and options uniform with the re module (avoid confusion).
More information about the Erlang Web annotations here:
http://wiki.erlang-web.org/Annotations
and here:
http://wiki.erlang-web.org/HowTo/CreateAnnotation
The software is open source, so you might want to reuse their annotation engine.
You can use the re module:
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$").
re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$", [caseless]).
EDIT:
match(String, Regexps) ->
case lists:dropwhile(
fun({Regexp, Opts}) -> re:run(String, Regexp, Opts) =:= nomatch;
(Regexp) -> re:run(String, Regexp) =:= nomatch end,
Regexps) of
[R|_] -> R;
_ -> nomatch
end.
example(String) ->
Regexps = ["$RE1^", {"$RE2^", [caseless]}, "$RE3"]
case match(String, Regexps) of
nomatch -> handle_error();
Regexp -> handle_regexp(String, Regexp)
...
For string, you could use the 're' module : afterwards, you iterate over the result set. I am afraid there isn't another way to do it AFAIK: that's why there are regexes.
For the HTTP headers, since there can be many, I would consider iterating over the result set to be a better option instead of writing a very long expression (potentially).
EEP work : I do not know.
Erlang does not handle regular expressions in patterns.
No.
You can't pattern match on regular expressions, sorry. So you have to do
my_function(String) -> Matches = re:run(String, "^[A-Za-z]+[A-Za-z0-9]*$"),
...