Haskell Regex performance - regex

I've been looking at the existing options for regexes in Haskell, and I wanted to understand where the performance gap comes from when comparing the various options with each other, and especially with a simple call to grep.
I have a relatively small trace file (~110 MB, compared to the tens of GB typical of most of my use cases):
$ du radixtracefile
113120 radixtracefile
$ wc -l radixtracefile
1051565 radixtracefile
I first tried to find how many matches of the (arbitrary) pattern .*504.*ll were in there using grep:
$ time grep -nE ".*504.*ll" radixtracefile | wc -l
309
real 0m0.211s
user 0m0.202s
sys 0m0.010s
I looked at Text.Regex.TDFA (version 1.2.1) with Data.ByteString:
import Control.Monad.Loops
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Regex.TDFA
import qualified Data.ByteString as B
main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches
Building and running:
$ ghc -O2 test-TDFA.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-TDFA.hs, test-TDFA.o )
Linking test-TDFA ...
$ time ./test-TDFA | wc -l
309
real 0m4.463s
user 0m4.431s
sys 0m0.036s
Then, I looked at Data.Text.ICU.Regex (version 0.7.0.1) with Unicode support:
import Control.Monad.Loops
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Regex
main = do
    re <- regex [] $ T.pack ".*504.*ll"
    f <- TIO.readFile "radixtracefile"
    setText re f
    whileM_ (findNext re) $ do
        a <- start re 0
        putStrLn $ "last match at :" ++ show a
Building and running:
$ ghc -O2 test-ICU.hs
[1 of 1] Compiling Main ( test-ICU.hs, test-ICU.o )
Linking test-ICU ...
$ time ./test-ICU | wc -l
309
real 1m36.407s
user 1m36.090s
sys 0m0.169s
I use GHC version 7.6.3. I haven't had the occasion to test other Haskell regex options. I knew that I would not get the performance I had with grep, and I was more than happy with that, but roughly 20 times slower for TDFA with ByteString... that is very scary. And I can't really understand why, as I naively thought this was a wrapper over a native backend. Am I somehow not using the module correctly?
(And let's not even mention the ICU + Text combo, which goes through the roof.)
Is there an option that I haven't tested yet that would make me happier?
EDIT:
Text.Regex.PCRE (version 0.94.4) with Data.ByteString:
import Control.Monad.Loops
import Data.Maybe
import Text.Regex.PCRE
import qualified Data.ByteString as B
main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches
Building and running:
$ ghc -O2 test-PCRE.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRE.hs, test-PCRE.o )
Linking test-PCRE ...
$ time ./test-PCRE | wc -l
309
real 0m1.442s
user 0m1.412s
sys 0m0.031s
Better, but still with a factor of ~7-ish ...

So, after looking at other libraries for a bit, I ended up trying PCRE.Light (version 0.4.0.4):
import Control.Monad
import Text.Regex.PCRE.Light
import qualified Data.ByteString.Char8 as B
main = do
    f <- B.readFile "radixtracefile"
    let lines = B.split '\n' f
    let re = compile (B.pack ".*504.*ll") []
    forM_ lines $ \l -> maybe (return ()) print $ match re l []
Here is what I get out of that:
$ ghc -O2 test-PCRELight.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRELight.hs, test-PCRELight.o )
Linking test-PCRELight ...
$ time ./test-PCRELight | wc -l
309
real 0m0.832s
user 0m0.803s
sys 0m0.027s
I think this is decent enough for my purposes. I might try to see what happens with the other libs when I manually do the line splitting like I did here, although I doubt it's going to make a big difference.
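For reference, a minimal, untested sketch of that per-line experiment with Text.Regex.PCRE might look like the following, precompiling the pattern once (the makeRegex/matchTest calls come from regex-base, which regex-pcre re-exports; file name and pattern are the same as above):
import Control.Monad
import Text.Regex.PCRE
import qualified Data.ByteString.Char8 as B

-- Untested sketch: split into lines first, then run the precompiled pattern on each line.
main :: IO ()
main = do
    f <- B.readFile "radixtracefile"
    let re = makeRegex (".*504.*ll" :: String) :: Regex
    forM_ (B.lines f) $ \l ->
        when (matchTest re l) $ B.putStrLn l
Piping it through wc -l as before should show whether splitting first makes any difference for the PCRE binding.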

Related

Is there a complete example to write a mathematical expression in sympy to a Microsoft Word document?

This might be a silly question, but I am desperate. I am a math teacher and I am trying to generate math tests. I tried Python for this and got some things done. However, I am not a professional programmer, so I get lost with MathML, prettyprint() and the like.
Is there anybody who can supply me with a complete example that I can execute? It may just contain one small, silly equation; that does not matter. I just want to see how I can get it into a Word document. After that, I can use it as a basis. I work on a Mac.
I hope anyone can help me out. Thanks in advance!
Best regards, Johan
This works for me:
from sympy import *
from docx import Document
from lxml import etree
# create expression
x, y = symbols('x y')
expr1 = (x+y)**2
# create MathML structure
expr1xml = mathml(expr1, printer = 'presentation')
tree = etree.fromstring('<math xmlns="http://www.w3.org/1998/Math/MathML">'+expr1xml+'</math>')
# convert to MS Office structure
xslt = etree.parse('C:/MML2OMML.XSL')
transform = etree.XSLT(xslt)
new_dom = transform(tree)
# write to docx
document = Document()
p = document.add_paragraph()
p._element.append(new_dom.getroot())
document.save("simpleEq.docx")
How about the following? The capture utility captures whatever is printed; in this case I use pprint to print the expression that I want written to file. There are lots of options you can use with pprint (including line wrapping, which you might want to set to False). The quality of the output will depend on the fonts you use. I don't do this at all, so I don't have a lot of hints for that.
from sympy import pprint  # sympy's pretty printer, not the stdlib pprint module
from sympy.utilities.iterables import capture
from sympy.abc import x
from sympy import Integral
with open('out.doc', 'w', encoding='utf-8') as f:
    f.write(capture(lambda: pprint(Integral(x**2, (x, 1, 3)))))
When I double click (in Windows) on the out.doc file, a word equation with the integral appears.
Here is the actual IPython session:
IPython console for SymPy 1.6.dev (Python 3.7.3-32-bit) (ground types: python)
These commands were executed:
>>> from __future__ import division
>>> from sympy import *
>>> x, y, z, t = symbols('x y z t')
>>> k, m, n = symbols('k m n', integer=True)
>>> f, g, h = symbols('f g h', cls=Function)
>>> init_printing()
Documentation can be found at https://docs.sympy.org/dev
In [1]: pprint(Integral(x**2, (x, 1, 3)))
3
⌠
⎮  2
⎮ x  dx
⌡
1
In [2]: from pprint import pprint
...: from sympy.utilities.iterables import capture
...: from sympy.abc import x
...: from sympy import Integral
...: with open('out.doc','w',encoding='utf-8') as f:
...: f.write(capture(lambda:pprint(Integral(x**2, (x, 1, 3)))))
...:
{problems pasting the unicode here, but it shows up as an integral symbol in console}

Printing abstract syntax tree using ppx_deriving

Could someone please tell me why this code is not compiling? I am trying to print the abstract syntax tree using the ppx_deriving library.
type prog = command list
[@@deriving show]
and command =
  | Incv | Decv
  | Incp | Decp
  | Input | Output
  | Loop of command list
[@@deriving show]

let _ = Format.printf "%s" (show_prog ([Incv, Incv]))
hello:brainfuckinter mukeshtiwari$ ocamlbuild -package ppx_deriving.std ast.byte
+ /Users/mukeshtiwari/.opam/4.02.1/bin/ocamlc.opt -c -I /Users/mukeshtiwari/.opam/4.02.1/lib/ppx_deriving -o ast.cmo ast.ml
File "ast.ml", line 10, characters 28-37:
Error: Unbound value show_prog
Command exited with code 2.
Compilation unsuccessful after building 2 targets (1 cached) in 00:00:00.
hello:brainfuckinter mukeshtiwari$ ocaml
OCaml version 4.02.1
Add -use-ocamlfind as the first argument of ocamlbuild. It should solve the issue.
(You also have a typo in [Incv, Incv]: the , should be a ;.)
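For example, keeping everything else the same, the invocation would be:
$ ocamlbuild -use-ocamlfind -package ppx_deriving.std ast.byte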

Python program calling prolog and print prolog output

I'm new to Prolog, and I have a Python program that calls Prolog via os.system(prolog_command) and gets a result (true or false), but I want my program to also show in the console the same output (the lines that Prolog writes).
Can anyone help me, please?
Thank you in advance.
I tried this simple approach:
Contents of foo.pl:
foo :- write('hello, world'), nl.
And then in python:
>>> import commands
>>> commands.getoutput('echo "foo." | swipl -q -f foo.pl')
'hello, world\ntrue.\n\n'
>>> x = commands.getoutput('echo "foo." | swipl -q -f foo.pl')
>>> x
'hello, world\ntrue.\n\n'
>>>

Missing characters using Text.Regex.PCRE to parse web page title

I recently made a website that needs to retrieve talk titles from the TED website.
So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now
From the web page source, I get:
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
Either way, the parsed title is missing some characters on the left and has some unintended characters on the right. It seems to have something to do with the -- in the talk title. However,
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Luckily, this is not a problem with Text.Regex.Posix.
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
My recommendation would be: don't use a regex for parsing HTML. Use a proper HTML parser instead. Here's an example using the html-conduit parser together with the xml-conduit cursor library (and http-conduit for download).
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid (mconcat)
import Network.HTTP.Conduit (simpleHttp)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (attributeIs, content, element,
                        fromDocument, ($//), (&//), (>=>))
main = do
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    let doc = parseLBS lbs
        cursor = fromDocument doc
    print $ mconcat $ cursor $// element "title" &// content
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content
The code is also available as active code on the School of Haskell.

Insert a newline character every 64 characters using Python

Using Python I need to insert a newline character into a string every 64 characters. In Perl it's easy:
s/(.{64})/$1\n/
How could this be done using regular expressions in Python?
Is there a more pythonic way to do it?
Same as in Perl, but with a backslash instead of the dollar for accessing groups:
s = "0123456789"*100 # test string
import re
print re.sub("(.{64})", "\\1\n", s, 0, re.DOTALL)
re.DOTALL is the equivalent of Perl's /s modifier.
Without a regexp:
def insert_newlines(string, every=64):
    lines = []
    for i in xrange(0, len(string), every):
        lines.append(string[i:i+every])
    return '\n'.join(lines)
Shorter, but less readable (IMO):
def insert_newlines(string, every=64):
    return '\n'.join(string[i:i+every] for i in xrange(0, len(string), every))
The code above is for Python 2.x. For Python 3.x, you want to use range and not xrange:
def insert_newlines(string, every=64):
    lines = []
    for i in range(0, len(string), every):
        lines.append(string[i:i+every])
    return '\n'.join(lines)

def insert_newlines(string, every=64):
    return '\n'.join(string[i:i+every] for i in range(0, len(string), every))
I'd go with:
import textwrap
s = "0123456789"*100
print('\n'.join(textwrap.wrap(s, 64)))
Taking @J.F. Sebastian's solution one step further (this is nearly criminal! :-) ):
import textwrap
s = "0123456789"*100
print textwrap.fill(s, 64)
Look ma... no regexes! Because, as you know... http://regex.info/blog/2006-09-15/247
Thanks for introducing us to the textwrap module... although it's been in Python since 2.3, I wasn't aware of it until now (yes, I'll admit that publicly)!
Tiny, not nice:
"".join(s[i:i+64] + "\n" for i in xrange(0,len(s),64))
I suggest the following method:
"\n".join(re.findall("(?s).{,64}", s))[:-1]
This is, more-or-less, the non-RE method taking advantage of the RE engine for the loop.
On a very slow computer I have as a home server, this gives:
$ python -m timeit -s 's="0123456789"*100; import re' '"\n".join(re.findall("(?s).{,64}", s))[:-1]'
10000 loops, best of 3: 130 usec per loop
AndiDog's method:
$ python -m timeit -s "s='0123456789'*100; import re" 're.sub("(?s)(.{64})", r"\1\n", s)'
1000 loops, best of 3: 800 usec per loop
gurney alex's 2nd/Michael's method:
$ python -m timeit -s "s='0123456789'*100" '"\n".join(s[i:i+64] for i in xrange(0, len(s), 64))'
10000 loops, best of 3: 148 usec per loop
I don't consider the textwrap method to be correct for the specification of the question, so I won't time it.
EDIT
Changed answer because it was incorrect (shame on me!)
EDIT 2
Just for the fun of it, the RE-free method using itertools. It rates third in speed, and it's not Pythonic (too lispy):
"\n".join(
it.imap(
s.__getitem__,
it.imap(
slice,
xrange(0, len(s), 64),
xrange(64, len(s)+1, 64)
)
)
)
$ python -m timeit -s 's="0123456789"*100; import itertools as it' '"\n".join(it.imap(s.__getitem__, it.imap(slice, xrange(0, len(s), 64), xrange(64, len(s)+1, 64))))'
10000 loops, best of 3: 182 usec per loop
itertools has a nice recipe for a function grouper that is good for this, particularly if your final slice is less than 64 chars and you don't want a slice error:
from itertools import izip_longest  # use zip_longest on Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Use like this:
big_string = <YOUR BIG STRING>
output = '\n'.join(''.join(chunk) for chunk in grouper(big_string, 64, fillvalue=''))  # pad the last chunk with '' so the join doesn't hit None