Primitive but efficient grep clone in Haskell?

Whenever I consider learning a new language -- Haskell in this case -- I try to hack together a primitive grep clone to see how good the language implementation and/or its libraries are at text processing, because that's a major use case for me.
Inspired by code on the Haskell wiki, I came up with the following naive attempt:
{-# LANGUAGE FlexibleContexts, ExistentialQuantification #-}
import Text.Regex.PCRE
import System.Environment

io :: ([String] -> [String]) -> IO ()
io f = interact (unlines . f . lines)

regexBool :: forall r l .
  (RegexMaker Regex CompOption ExecOption r,
   RegexLike Regex l) =>
  r -> l -> Bool
regexBool r l = l =~ r :: Bool

grep :: forall r l .
  (RegexMaker Regex CompOption ExecOption r, RegexLike Regex l) =>
  r -> [l] -> [l]
grep r = filter (regexBool r)

main :: IO ()
main = do
  argv <- getArgs
  io $ grep $ argv !! 0
This appears to be doing what I want it to, but unfortunately, it's really slow -- about 10 times slower than a Python script doing the same thing. I assume it's not the regex library that's at fault here, because it's calling into PCRE, which should be plenty fast (switching to Text.Regex.Posix slows things down quite a bit further). So it must be the String implementation, which is instructive from a theoretical point of view but inefficient according to what I've read.
Is there an alternative to Strings in Haskell that's both efficient and convenient (i.e. there's little or no friction when switching to it from Strings) and that fully and correctly handles UTF-8-encoded Unicode, as well as other encodings without too much hassle if possible? Something that everybody uses when doing text processing in Haskell but that I just don't know about because I'm a complete beginner?

It's possible that the slow speed is caused by using the standard library's list type: String is just [Char], a lazy linked list with one heap cell per character. I've often run into performance problems with it in the past.
It would be a good idea to profile your executable to see where it spends its time; see "Tools for analyzing performance of a Haskell program". Profiling Haskell programs is really easy: with GHC you compile with a switch (for example -prof -fprof-auto -rtsopts) and run your program with an added argument (+RTS -p), and the report is written to a .prof text file in the current working directory.
As a side note, I use exactly the same approach as you when learning a new language: create something that works. My experience doing this with Haskell is that I can easily gain an order of magnitude or two in performance by profiling and making relatively simple changes (usually a couple of lines).
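One cheap way to check whether String is the bottleneck is to run the same line filter over strict ByteStrings. The sketch below is only that: it swaps the regex match for a plain substring test (isInfixOf from Data.ByteString.Char8) so it needs nothing beyond the bytestring package, and the names grepLines and runGrep are made up for illustration. Because Char8 treats every byte as a character, this version is only safe for an ASCII needle; it is not a UTF-8-aware replacement.

```haskell
import qualified Data.ByteString.Char8 as B

-- Keep only the lines that contain the literal needle.
-- NOTE: plain substring search, not regex matching.
grepLines :: B.ByteString -> B.ByteString -> B.ByteString
grepLines needle = B.unlines . filter (needle `B.isInfixOf`) . B.lines

-- Hypothetical entry point: reads all of stdin strictly and writes
-- the matching lines to stdout.
runGrep :: String -> IO ()
runGrep pat = B.interact (grepLines (B.pack pat))
```

If this runs much faster than the String version on the same input, that points at the string representation rather than at PCRE.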

Related

"Eval" a string in OCaml

I'm trying to "eval" a string representing an OCaml expression in OCaml. I'm looking to do something equivalent to Python's eval.
So far I've not been able to find much. The Parsing module looks like it could be helpful, but I was not able to find a way to just eval a string.
Here is how to do it, but I didn't tell you. (Also the Parsing module is about Parsing, not executing code)
#require "compiler-libs" (* Assuming you're using utop, if compiling then this is the package you need *)
let eval code =
  let as_buf = Lexing.from_string code in
  let parsed = !Toploop.parse_toplevel_phrase as_buf in
  ignore (Toploop.execute_phrase true Format.std_formatter parsed)
example:
eval "let () = print_endline \"hello\";;"
Notice the trailing ;; in the code sample.
To use ocamlbuild, you will need to use both compiler-libs and compiler-libs.toplevel.
OCaml is a compiled (not interpreted) language. So there's no simple way to do this. Certainly there are no language features that support it (as there are in almost every interpreted language). About the best you could do would be to link your program against the OCaml toplevel (which is an OCaml interpreter).

"smaller" keyword(?) in OCaml

In the solutions for the OCaml tutorials from here, the one about eliminating consecutive duplicates of list elements is written like this:
let rec compress = function
  | a :: (b :: _ as t) -> if a = b then compress t else a :: compress t
  | smaller -> smaller;;
I've never seen the keyword(?) "smaller" before. I looked it up online but failed to find it. Although I understand its meaning in this case, I still wonder if anyone can explain more about it. Thanks!
smaller is not a keyword, it's an identifier, just like a, b and t are on the line before.
The pattern smaller simply matches any possible value (that has not been matched by any previous pattern) and gives it the name smaller.
You may want to read the chapter Lists and Patterns in the book Real World OCaml.
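For readers who know the Haskell threads on this page better than OCaml, here is my own transliteration of the same function (it is not part of the tutorial). The last clause shows the same idea: smaller is an ordinary variable pattern that matches whatever the first clause did not, i.e. the empty list or a one-element list.

```haskell
-- Drop consecutive duplicates, mirroring the OCaml version clause by clause.
compress :: Eq a => [a] -> [a]
compress (a : t@(b : _))
  | a == b    = compress t
  | otherwise = a : compress t
-- Variable pattern: binds any remaining list (length 0 or 1) to `smaller`.
compress smaller = smaller
```

Renaming smaller to anything else, say rest, changes nothing about the behaviour.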

Conduit, replacement for lists?

I was thinking about lists in Haskell, and I thought in other languages, one doesn't use lists for everything. Sure, you might want to store a list if you need the values later on, but if it's just a one off, say iterating from [1..n], why use a list where all that's really needed is a variable that's incremented?
I also read about "list fusion" and noted that whilst Haskell compilers try to implement this optimization to eliminate intermediate lists, they often are unsuccessful, resulting in the garbage collector having to clean up lists which are only used once.
Also, if you're not careful, you can easily share a list, which means the garbage collector doesn't clean it up, and an algorithm that was designed to run in constant space can end up running out of memory.
So I thought it would be best to avoid lists completely, at least when one doesn't actually want to "store" the list.
I then came across conduit, which says it is "a solution to the streaming data problem, allowing for production, transformation, and consumption of streams of data in constant memory."
This sounded perfect. I know conduit is designed for IO problems with resource acquisition and release issues, but can one just use it as a drop in replacement for lists?
For example, could I do the following:
fold f3 $ take 10 $ map f2 $ unfold f1 init_value
And with a few appropriately placed type annotations, use conduits for the whole process instead of lists?
I was hoping that perhaps classy-prelude would allow such code, but I'm not sure. If it's possible, could someone give an example, say like the above?
List computations stream in constant memory in the same circumstances as they would for conduit. The presence or absence of intermediate data structures does not affect whether or not it runs in constant memory. All it changes is the efficiency and the size of the constant memory that it inhabits.
Do not expect conduit to run in less memory than the equivalent list computation. It should actually take more memory because conduit steps have a greater overhead than list cells. Also, conduit currently does not have stream fusion. Somebody did experiment with that some time ago, although that did not get incorporated into the library. Lists, on the other hand, can and do fuse in many circumstances to remove intermediate data structures.
The important thing to remember is that streaming does not necessarily imply deforestation (i.e. removal of intermediate data structures).
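To make that concrete, here is a minimal list-only sketch of the pipeline from the question, using unfoldr and foldl' from Data.List. The definitions of f1, f2 and f3 are placeholders I picked (a Fibonacci generator, show, and string concatenation), and firstTen is a made-up name. Laziness means each element is produced, transformed and consumed before the next one is generated, whether or not GHC manages to fuse the intermediate lists away.

```haskell
import Data.List (foldl', unfoldr)

-- Placeholder generator: Fibonacci numbers, never terminates on its own.
f1 :: (Int, Int) -> Maybe (Int, (Int, Int))
f1 (x, y) = Just (x, (y, x + y))

-- Placeholder transformation.
f2 :: Int -> String
f2 = show

-- Placeholder fold step: append each item on its own line.
f3 :: String -> String -> String
f3 acc s = acc ++ s ++ "\n"

-- fold f3 $ take 10 $ map f2 $ unfold f1 init_value, with plain lists.
firstTen :: String
firstTen = foldl' f3 "" (take 10 (map f2 (unfoldr f1 (1, 1))))
```

Only take 10 keeps the infinite unfoldr from running forever; no list of all Fibonacci numbers is ever materialised.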
conduit was definitely not designed for this kind of a use case, but it can in theory be used that way. I did so personally for the markdown package, where it was more convenient to have the extra conduit plumbing than to deal directly with lists.
If you put this together with classy-prelude-conduit, you can get some relatively simple code. And we could certainly add more exports to classy-prelude-conduit to better optimize for this use case. For now, here's an example following the basic gist of what you laid out above:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.List (unfold, isolate)
import Data.Functor.Identity (runIdentity)

main = putStrLn
     $ runIdentity
     $ unfold f1 init_value
    $$ map f2
    =$ isolate 10
    =$ fold f3 ""

f1 :: (Int, Int) -> Maybe (Int, (Int, Int))
f1 (x, y) = Just (x, (y, x + y))

init_value = (1, 1)

f2 :: Int -> Text
f2 = show

f3 :: Text -> Text -> Text
f3 x y = x ++ y ++ "\n"

Is FC++ used by any open source projects?

The FC++ library provides an interesting approach to supporting functional programming concepts in C++.
A short example from the FAQ:
take (5, map (odd, enumFrom(1)))
FC++ seems to take a lot of inspiration from Haskell, to the extent of reusing many function names from the Haskell prelude.
I've seen a recent article about it, and it's been briefly mentioned in some answers on stackoverflow, but I can't find any usage of it out in the wild.
Are there any open source projects actively using FC++? Or any history of projects which used it in the past? Or does anyone have personal experience with it?
There's a Customers section on the web site, but the only active link is to another library by the same authors (LC++).
As background: I'm looking to write low latency audio plugins using existing C++ APIs, and I'm looking for tooling which allows me to write concise code in a functional style. For this project I want to use a C++ library rather than a separate language, to avoid introducing FFI bindings (because of the complexity) or garbage collection (to keep the upper bound on latency in the sub-millisecond range).
I'm aware that the STL and Boost libraries already provide support for many FP concepts -- this may well be a more practical approach. I'm also aware of other promising approaches for code generation of audio DSP code from functional languages, such as the FAUST project or the Haskell synthesizer package.
This isn't an answer to your question proper, but my experience with embedding of functional style into imperative languages has been horrid. While the code can be almost as concise, it retains the complexity of reasoning found in imperative languages.
The complexity of the embedding usually requires the most intimate knowledge of the details and corner cases of the language. This greatly increases the cost of abstraction, as these things must always be taken into careful consideration. And with a cost of abstraction so high, it is easier just to put a side-effectful function in a lazy stream generator and then die of subtle bugs.
An example from FC++:
struct Insert : public CFunType<int,List<int>,List<int> > {
   List<int> operator()( int x, const List<int>& l ) const {
      if( null(l) || (x > head(l)) )
         return cons( x, l );
      else
         return cons( head(l), curry2(Insert(),x,tail(l)) );
   }
};
struct Isort : public CFunType<List<int>,List<int> > {
   List<int> operator()( const List<int>& l ) const {
      return foldr( Insert(), List<int>(), l );
   }
};
I believe this is trying to express the following Haskell code:
-- transliterated, and generalized
insert :: (Ord a) => a -> [a] -> [a]
insert x [] = [x]
insert x (a:as) | x > a     = x:a:as
                | otherwise = a:insert x as

isort :: (Ord a) => [a] -> [a]
isort = foldr insert []
I will leave you to judge the complexity of the approach as your program grows.
I consider code generation a much more attractive approach. You can restrict yourself to a minuscule subset of your target language, making it easy to port to a different target language. The cost of abstraction in an honest functional language is nearly zero, since, after all, they were designed for that (just as abstracting over imperative code in an imperative language is fairly cheap).
I'm the primary original developer of FC++, but I haven't worked on it in more than six years. I have not kept up with C++/boost much in that time, so I don't know how FC++ compares now. The new C++ standard (and implementations like VC++) has a bit of stuff like lambda and type inference help that makes some of what is in there moot. Nevertheless, there might be useful bits still, like the lazy list types and the Haskell-like (and similarly named) combinators. So I guess try it and see.
(Since you mentioned real-time, I should mention that the lists use reference counting, so if you 'discard' a long list there may be a non-trivial wait in the destructor as all the cells' ref-counts go to zero. I think typically in streaming scenarios with infinite streams/lists this is a non-issue, since you're typically just tailing into the stream and only deallocating things one node at a time as you stream.)

Is it possible to test the return value of Haskell I/O functions?

Haskell is a pure functional language, which means Haskell functions have no side effects. I/O is implemented using monads that represent chunks of I/O computation.
Is it possible to test the return value of Haskell I/O functions?
Let's say we have a simple 'hello world' program:
main :: IO ()
main = putStr "Hello world!"
Is it possible for me to create a test harness that can run main and check that the I/O monad it returns has the correct 'value'? Or does the fact that monads are supposed to be opaque blocks of computation prevent me from doing this?
Note, I'm not trying to compare the return values of I/O actions. I want to compare the return value of I/O functions - the I/O monad itself.
Since in Haskell I/O is returned rather than executed, I was hoping to examine the chunk of I/O computation returned by an I/O function and see whether or not it was correct. I thought this could allow I/O functions to be unit tested in a way they cannot in imperative languages where I/O is a side-effect.
The way I would do this would be to create my own IO monad containing the actions that I want to model. Then I would run the monadic computations I want to compare within my monad and compare the effects they had.
Let's take an example. Suppose I want to model printing stuff. Then I can model my IO monad like this:
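{-# LANGUAGE GADTs #-}
import Prelude hiding (IO, putChar)
import Control.Monad (ap, liftM, (>=>))

-- A model of IO that can only print characters.
data IO a where
  Return  :: a -> IO a
  Bind    :: IO a -> (a -> IO b) -> IO b
  PutChar :: Char -> IO ()

-- Functor and Applicative are superclasses of Monad in current GHC.
instance Functor IO where fmap = liftM
instance Applicative IO where
  pure  = Return
  (<*>) = ap

instance Monad IO where
  Return a  >>= f = f a
  Bind m k  >>= f = Bind m (k >=> f)
  PutChar c >>= f = Bind (PutChar c) f

putChar :: Char -> IO ()
putChar c = PutChar c

-- Run a modelled computation purely: its result plus everything it printed.
runIO :: IO a -> (a, String)
runIO (Return a)  = (a, "")
runIO (Bind m f)  = (b, s1 ++ s2)
  where (a, s1) = runIO m
        (b, s2) = runIO (f a)
runIO (PutChar c) = ((), [c])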
Here's how I would compare the effects:
compareIO :: IO a -> IO b -> Bool
compareIO ioA ioB = outA == outB
  where (_, outA) = runIO ioA
        (_, outB) = runIO ioB
There are things that this kind of model doesn't handle. Input, for instance, is tricky. But I hope that it will fit your use case. I should also mention that there are more clever and efficient ways of modelling effects in this way. I've chosen this particular way because I think it's the easiest one to understand.
For more information I can recommend the paper "Beauty in the Beast: A Functional Semantics for the Awkward Squad" which can be found on this page along with some other relevant papers.
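As one possible variation on the same idea (my own sketch, not taken from the paper), you can abstract the printing effect behind a type class instead of shadowing IO. The class MonadPrint and the names printStr, hello and capture are hypothetical. The test interpreter relies on the Monad instance that base provides for pairs whose first component is a Monoid, which concatenates the first components as actions are sequenced.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Hypothetical class: the only effect our programs may perform is printing.
class Monad m => MonadPrint m where
  printStr :: String -> m ()

-- Production interpreter: really print.
instance MonadPrint IO where
  printStr = putStr

-- Test interpreter: the pair monad appends first components, so the
-- String slot accumulates everything that would have been printed.
instance MonadPrint ((,) String) where
  printStr s = (s, ())

-- A program written against the class runs unchanged in both interpreters.
hello :: MonadPrint m => m ()
hello = printStr "Hello" >> printStr " world!"

-- Run a program in the pure interpreter and return its output.
capture :: (String, a) -> String
capture = fst
```

In a test you would compare capture hello against the expected text, while main simply runs hello in the real IO monad. Like the model above, this does not cover input.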
Within the IO monad you can test the return values of IO functions. To test return values outside of the IO monad is unsafe: this means it can be done, but only at risk of breaking your program. For experts only.
It is worth noting that in the example you show, the value of main has type IO (), which means "I am an IO action which, when performed, does some I/O and then returns a value of type ()." Type () is pronounced "unit", and there are only two values of this type: the empty tuple (also written () and pronounced "unit") and "bottom", which is Haskell's name for a computation that does not terminate or otherwise goes wrong.
It is worth pointing out that testing return values of IO functions from within the IO monad is perfectly easy and normal, and that the idiomatic way to do it is by using do notation.
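A small sketch of what that looks like, using lookupEnv from System.Environment as the IO function whose result gets inspected. The function name describeHome and the choice of the HOME variable are just for illustration.

```haskell
import System.Environment (lookupEnv)

-- Run an IO action, bind its result with <-, and branch on the value.
describeHome :: IO String
describeHome = do
  home <- lookupEnv "HOME"
  return $ case home of
    Just dir -> "HOME is " ++ dir
    Nothing  -> "HOME is not set"
```

Inside the do block, home is an ordinary Maybe String, so any pure test or pattern match can be applied to it.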
You can test some monadic code with QuickCheck 2. It's been a long time since I read the paper, so I don't remember if it applies to IO actions or to what kinds of monadic computations it can be applied. Also, it may be that you find it hard to express your unit tests as QuickCheck properties. Still, as a very satisfied user of QuickCheck, I'll say it's a lot better than doing nothing or than hacking around with unsafePerformIO.
I'm sorry to tell you that you can not do this.
unsafePerformIO basically lets you accomplish this. But I would strongly prefer that you do not use it.
System.IO.Unsafe.unsafePerformIO :: IO a -> a
:/
I like this answer to a similar question on SO and the comments to it. Basically, IO will normally produce some change which may be noticed from the outside world; your testing will need to have to do with whether that change seems correct. (E.g. the correct directory structure was produced etc.)
Basically, this means 'behavioural testing', which in complex cases may be quite a pain. This is part of the reason why you should keep the IO-specific part of your code to a minimum and move as much of the logic as possible to pure (therefore super easily testable) functions.
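A toy illustration of that split, with made-up names (greeting, greetUser): the pure function holds all of the logic and can be checked with plain equality, while the IO wrapper is too thin to be worth testing on its own.

```haskell
-- Pure core: all of the logic lives here.
greeting :: String -> String
greeting name = "Hello " ++ name ++ "!"

-- Thin IO shell: read a line, hand it to the pure core, print the result.
greetUser :: IO ()
greetUser = do
  name <- getLine
  putStrLn (greeting name)
```

A unit test then only needs something like greeting "world" == "Hello world!".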
Then again, you could use an assert function:
actual_assert :: String -> Bool -> IO ()
actual_assert _ True = return ()
actual_assert msg False = error $ "failed assertion: " ++ msg
faux_assert :: String -> Bool -> IO ()
faux_assert _ _ = return ()
assert = if debug_on then actual_assert else faux_assert
(You might want to define debug_on in a separate module constructed just before the build by a build script. Also, this is very likely to be provided in a more polished form by a package on Hackage, if not a standard library... If someone knows of such a tool, please edit this post / comment so I can edit.)
I think GHC will be smart enough to skip any faux assertions it finds entirely, whereas actual assertions will definitely crash your programme upon failure.
This is, IMO, very unlikely to suffice -- you'll still need to do behavioural testing in complex scenarios -- but I guess it could help check that the basic assumptions the code is making are correct.
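On the question of a ready-made tool: base already ships Control.Exception.assert, which takes a condition and a value and returns the value unchanged when the condition holds. GHC ignores these assertions when compiling with optimisation (-O) or with -fignore-asserts, so no debug_on switch is needed. The function safeDiv below is only a hypothetical example of using it.

```haskell
import Control.Exception (assert)

-- assert returns its second argument if the condition holds and throws an
-- AssertionFailed exception (with source location) otherwise.
safeDiv :: Int -> Int -> Int
safeDiv x y = assert (y /= 0) (x `div` y)
```

Unlike the hand-rolled version, this works in pure code as well as inside IO.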