Regex & String Libraries in Haskell

Regex & String Libraries in Haskell - regex

I'm trying to introduce Haskell into my daily life by using it to write incidental scripts and such.
readProcess is handy for getting the results of exterior commands, but I find myself searching when it comes to processing the String results. I'm coming from ruby where regexes are first-class, so I'm used to having them as a tool.
Any libraries I should read up on to do string processing in haskell? Searching for matching lines, pulling out matching regions of a string, and such?

I found this to be a good starting point: http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/ It only covers the basics, no advanced topics, but it's great to get started IMHO.
Things to note:
Regexes in haskell are different in that they have overloaded return types. This means that you can pull many different kinds of thing out of a regex match. (Bool, String, [String], etc...) Depending on the return type you use, it will give you back a different kind of answer (whether or not the regex matched, the test of the match, all matching subgroups, etc..) This is done using some fairly complex typeclass voodoo. The above link demonstrates the basic kinds, a more complete list is here
There are actually multiple standard modules in haskell that provide regex support (strange but true). The tutorial above shows the POSIX module, because it comes standard in haskell. If you have cabal, you can also pretty easily install other regex modules and use those instead. There's a pcre binding (regex-pcre), as well as some packages that work via DFAs (regex-dfa, among others). Install using a command like: cabal install regex-pcre and you should be good to go.
(The modules have a standardized interface, the difference is mainly in the implementation and the regex flavor)
There IS a regex object in haskell, but you don't really need it to use the =~ or =~~ match operators. (Just use a string, conversion happens automatically). If your task is complicated enough that you want a first class parsing object, consider looking into Parsec as has been mentioned in other answers.
DISCLAIMER: I only really user pcre, myself, so I don't really know much about the other packages.

When I was first teaching myself Haskell I found that learning to use a parser combinator library for string processing was a fantastic investment. They can do everything regular expressions can do, and much more, and writing combinator parsers is a great way to build up intuitions about type classes like monads, applicative functors, etc.
I tend to use Attoparsec these days, but Parsec is probably a better starting point because it's more widely documented and discussed, provides nicer error messages, etc.

A good introduction to regular expressions is to be found in Realworld Haskell
Update: On a side note, for command-processing and pipes and such, checkout HSH.

There are plenty of great regex libs in Haskell, but we have better tools. Let's stick with standard Haskell Strings for now (i.e. lists of Char). The basics are all in Data.List -- http://www.haskell.org/ghc/docs/latest/html/libraries/base-4.3.0.0/Data-List.html. You have lines, unlines, words, unwords, takewhile, dropwhile, etc.etc. Also isPrefixOf and isInfixOf, etc.
You may end up writing your own recursive functions fairly directly, but that's a breeze too. The only really missing operations are splitting ones, for which you can use brent's excellent package: http://hackage.haskell.org/package/split
Fundamentally, the notion is that you want to do incremental processing of streams of characters.
Not everything is as efficient as possible, especially since the string representation is not that efficient. But if/when you move on to other data types, the core concepts of how you process things will translate directly from basic strings.

Related

Can I do versatile mathematical (AST) pattern matching and manipulation with Boost.Spirit?

I was looking into pattern matching in C++, and among things like Mach7, which seems to be a functional approach to the problem, and the more general Visitor Pattern, which seems to be the lowest common denominator: it can do everything but excels at only specific cases.
I would like to manipulate mathematical expressions (simplify, evaluate, and also perform calculations like differential equation solving, and integration, symbollically). Yes, I'm looking to end up with a Computer Algebra System.
For the input, I'm looking at using Boost.Spirit (X3) to parse some form of input (currently playing with getting basic LaTeX support in there, although index vs sub/superscript is a problem for this...).
I then came to the crazy idea of using Boost.Spirit to not only parse the input "text", but also use the non-parser components of the library to actually perform the mathematical manipulations on a resulting AST.
Is this versatile enough for the pattern matching my goal requires or should I look at other solutions? I have tried to find documentation on how other CAS work internally, but short of going through the undoubtetly brilliant code of things like Maxima, I can't seem to find any information on anything but very simple implementations of mathematical ASTs. So I have little input info to determine if Boost.Spirit can do what I eventually will need to do.

I'm not qualified to advise on the subject of symbolic algebra and the requirements there.
I do however know a thing or two about Boost Spirit.
All I can say is: don't do it!
You do not want to burden the parser with such complex responsibilities that are just going to be more difficult to design right inside the "warped" reality that is EDSLs and Phoenix actors.
In fact, I have oft repeated this advice (see e.g. Boost Spirit: "Semantic actions are evil"?, is the most linked to for this, but I've deepened it out in several chat rooms and on occasion in answers where the problem seemed to arise from conflating parsing with processing).

Drop-in, portable parsing

I see umpteen posts a day about "how to do X with regexen". And the best response to most of them seems like it would honestly be, "Why are you trying to drive a screw with a hammer?" But regexen are everywhere, and the syntax is mostly portable, particularly if you keep away from the fancy bits.
Is there anything equivalent to regexen but at the next level up in power and configurability? A "you can use it anywhere" parsing library of some variety, preferably with a gloriously concise DSL as its interface?
I've used Ragel somewhat, but because of the preprocessing step, I'd hesitate to recommend it to someone as "use this instead of some hairy regex". It's awkward to use from Obj-C, and I expect it will be terribly awkward from a language that doesn't have compile-link-run as part of its standard operating procedure.
What I'm looking for is something that will pass the "inline-online-universal" test.
(inline) You can write the notation inline with your other code, as you would with a regex..
(online) You can run the resulting parser just as you would your other code, which would mean right after input to a REPL in the case of something like Python.
(universal) You can move to a different language/platform and use virtually the same code for your parser, modulo dialect differences. In reality, I'd be happy with something that works from Python, Ruby, C, Java, and Haskell.
Most tools I know of fall down at "online". They preprocess a grammar offline and spit out code in the target language (C, Python, Java, C++…). They're standalone tools that aren't themselves integrated into the language environment.
I've had suggestions of PEG parsers and lex/yacc combos. Parser combinator libraries might also be a good fit. Whatever you might propose, I'd like to see demonstrated that it meets these tests. Your answer should demonstrate that the proposed solution meets the inline-online-universal requirements by providing a working demo parser in Python, C, and Haskell. The demo example is up to the author, but it should be something painful using just regexen but trivial using a proper parser.

https://github.com/leblancmeneses/NPEG
Implements PEG.
Meets all 3 ... let me explain.
It is inline only with C# and offline with all the others. C# has an offline version also.
I currently support offline versions: C/C++/Javascript (local right now)/Java pass all unit tests - to make it universal. To add another language takes 25.84 hrs (how long it took to create the offline Javascript version)
To make it online for every language would be to much maintenance(possible) but it took me a lot of work and time just to support the current offline versions. I can now focus my energy on building grammar optimizers and tooling to unit test grammar rules where all offline versions benefit.

Have a look at Lex/Yacc or their counterparts Flex/Bison (or Coco, or all the other "compiler" generators). The combination can be used to parse complex textual data with an (arguably) much more readable syntax than with regexen.
For simple problems though, where regexen are more than sufficient, by any means do use them.

Finding functions with regex?

In notepad++, there is a lot of helpful syntax highlighting for various programming languages, and I was wondering how it does some of it. I want to know how it can tell the different scopes of functions.
For example, how would it differentiate between the inside and the outside function?:
function myFunction(arguments){
function functionInsideMyFunction(arguments){
return 0;
}
}
I'm sure it's very simple, but I'm new to regex and still have a bit of trouble understanding it.
Say, for example, I wanted a regex to only match functions that aren't in other functions. I would want to get myFunction, but not functionInsideMyFunction.

Does not Use RegEx, Uses something called a lexer. The lexer for Jedit is http://jflex.de lexers are confusing to me, but you can learn if you want. If you use java, you can futz with the internals of classes with the YourClass.class.whatev, and you can even manipulate that with http://Commons.apache.org/bcel. NotePad++ uses something similar. RegEx simply isn't expansive enough beyond basic line & String parsing

Regular expressions cover a set of functional languages which is on the lower end of 'expressiveness'... There are a lot of language constructs which you will not be able to solve/parse with regular expression, a good example here is the balanced-parentheses-problem.
What you're looking at with the function definitions above is basically the same thing, with the 'opening parenthesis' being function ...(...){ and the 'closing parenthesis' being a simple }.
This problem is not solvable using regex, as it belongs to a set of languages of higher expressiveness (see also Chomsky Hierarchy).
What you need to parse languages above (Chomsky-)level 3 (this is what regex can parse) is a proper parser. There are many different techniques/algorithms which are each suitable for languages of certain expressiveness. Explaining those here would probably be a bit of an overkill, if you want to really get into it I suggest reading about context free grammars, LR-parsers and LL-parsers (these are used a lot when parsing programming languages).

When is it best to use Regular Expressions over basic string splitting / substring'ing?

It seems that the choice to use string parsing vs. regular expressions comes up on a regular basis for me anytime a situation arises that I need part of a string, information about said string, etc.
The reason that this comes up is that we're evaluating a soap header's action, after it has been parsed into something manageable via the OperationContext object for WCF and then making decisions on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
So I have to ask, what's the typical threshold that people use when trying to decide to use RegEx over typical string parsing. Note that I'm not very strong in Regular Expressions, and because of this, I try to shy away unless it's absolutely vital to avoid introducing more complication than I need.
If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.
EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.
I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted as accordingly.

My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.
For example, consider a tiny "expression language" for evaluating parenthetized arithmetic expressions. Examples of "programs" in this language would look like this:
1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3
A grammar is easy to write, and looks something like this:
DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"
With that grammar, you can build a recursive descent parser in a jiffy.
An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.
Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for regular grammars and recursive descent parsers.
Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.
I interpreted it as "when should use use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"
Given that interpretation, my answer is: never.
Okay.... one more edit.
I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)
I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:
if (str.equals("DooWahDiddy")) // No problemo.
if (str.contains("destroy the earth")) // Okay.
if (str.indexOf(";") < str.length / 2) // Not bad.
Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.
if (str.startsWith("I") && str.endsWith("Widget") &&
(!str.contains("Monkey") || !str.contains("Pox"))) // Madness.
Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).
Here's a great reference:
http://www.regular-expressions.info/
PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.

The regex can be
easier to understand
express more clearly the intent
much shorter
easier to change/adapt
In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.
The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.
Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).
Hm ... still no simple rule-of-thumb, sorry.

[W]e're evaluating a soap header's
action and making decisions on that
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even it if does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.

I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.
As soon as you start finding yourself needing to define a grammar however, i.e. complex string parsing, get back to using some sort of finite state machine or the likes as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and even incapable.
I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.

When your required transformation isn't basic -- but is still conceptually simple.
no reason to pull out Regex if you're doing a straight string replacement, for example... its easier to just use the string.Replace
on the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out

I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.
I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.
Regexes are rarely overkill - if the match is simple, so is the regex.

I would think the easiest way to know when to use regular expressions and when not to, is when your string search requires an IF/THEN statement or anything resembling this or that logic, then you need something better than a simple string comparison which is where regex shines.

Are regex tools (like RegexBuddy) a good idea?

One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:
Some people, when confronted with a
problem, think "I know, I’ll use
regular expressions." Now they have
two problems. - Jamie Zawinski
And:
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as
cleverly as possible, you are, by
definition, not smart enough to debug
it. - Brian Kernighan
My concerns are (respectively:)
That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).
That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.
Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?

Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.

Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote, as with any technology or technique, there are both good and bad ways to apply them.
Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.
As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:
Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array, hash (whatever your language provides) and test the string against those.
Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). it also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.

You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.

I'm not sure why there is so much diffidence against regex.
Yes, they can become messy and obscure, exactly as any other piece of code somebody may write but they have an advantage over code: they represent the set of strings one is interested to in a formally specified way (at least by your language if there are extensions). Understanding which set of strings is accepted by a piece of code will require "reverse engineering" the code.
Sure, you could discurage the use of regex as has already been done with recursion and goto's but this would be justifed to me only if there's a good alternative.
I would prefer maintain a single line regex code than a convoluted hand-made functions that tries to capture a set of strings.
On using a tool to understand a regex (or write a new one) I think it's perfectly fine! If somebody wrote it with the tool, somebody else could understand it with a tool! Actually, if you are worried about this, I would see tools like RegexBuddy your best insurance that the code will not be unmaintainable just because of the regex's

Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.

Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.
Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance and writing Regular Expressions is not the whole of coding so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyways.
If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.
Regexbuddy is a useful shortcut, to make sure that the regular expressions you are writing do what you expect- it certainly makes life easier, but it's the matter of using them at all that is what seems important to me about your question.

Like others have said, I think using or not using such a tool is a neutral issue. More to the point: If a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking it down into several steps of matching, either with multiple match statements (=~), or by building up a regexp with sub regexps.
Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.

What you should be doing is getting your other devs hooked up with RB.
Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.

I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.
It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.

Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.
So, documentation can help.
This is a real trivial example is a table like the following (just a suggestion)
Expression Match Reason
^ Pos 0 Start of input
\s+ " " At least one space
(abs|floor|ceil) ceil One of "abs", "floor", or "ceil"
...
I see the issue, though. You probably want to discourage people from building more complex regular expression than they can parse. I think standards can address this, by always requiring expanded REs and check that the annotation is proper.
However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.
It's relative.

A couple of regex tools (for Node/JS, PHP and Python) i made (for some other projects) are available online to play and experiment.
regex-analyzer and regex-composer
github repo

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js