Using cython to speed up thousands of set operations - python-2.7

I have been trying to get over my fear of Cython (fear because I literally know NOTHING about C or C++).
I have a function which takes two arguments: a set (we'll call it testSet) and a list of sets (we'll call that targetSets). The function iterates through targetSets, computes the length of each set's intersection with testSet, and appends that value to a list, which is then returned.
Now, this isn't by itself that slow, but the problem is that I need to run simulations of the testSet (a large number of them, ~10,000), and targetSets is about 10,000 sets long.
So for a small number of simulations to test, the pure Python implementation was taking ~50 secs.
I tried writing a Cython function, and it works; it's now running at ~16 secs.
If there is anything else anyone can think of that I could do to the Cython function, that would be great (Python 2.7, btw).
Here is my Cython implementation in overlapFunc.pyx
def computeOverlap(set testSet, list targetSets):
    cdef list obsOverlaps = []
    cdef int i, N
    cdef set overlap
    N = len(targetSets)
    for i in range(N):
        overlap = testSet & targetSets[i]
        if len(overlap) <= 1:
            obsOverlaps.append(0)
        else:
            obsOverlaps.append(len(overlap))
    return obsOverlaps
and the setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("overlapFunc",
                         ["overlapFunc.pyx"])]

setup(
    name = 'computeOverlap function',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)
and some code to build some random sets for testing and to time the function, in test.py:
import numpy as np
from overlapFunc import computeOverlap
import time

def simRandomSet(n):
    for i in range(n):
        simSet = set(np.random.randint(low=1, high=100, size=50))
        yield simSet

if __name__ == '__main__':
    np.random.seed(23032014)
    targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]
    simulatedTestSets = simRandomSet(200)
    start = time.time()
    for i in simulatedTestSets:
        obsOverlaps = computeOverlap(i, targetSet)
    print time.time()-start
I tried changing the def at the start of the computeOverlap function, as in:
cdef list computeOverlap(set testSet, list targetSets):
but I get the following warning message when I run the setup.py script:
'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]
and then when I run something that tries to use the function I get an ImportError:
from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap
Thanks in advance for your help,
Cheers,
Davy

In the following lines, the extension module name and the filename do not match the actual filename of your .pyx file.
ext_modules = [Extension("computeOverlapWithGeneList",
["computeOverlapWithGeneList.pyx"])]
Replace it with:
ext_modules = [Extension("overlapFunc",
["overlapFunc.pyx"])]

Related

python-for-android, Cython, C++, CythonRecipe: Operation only allowed in c++

I have this setup.py for my Cython project:
from setuptools import setup
from Cython.Build import cythonize

setup(
    name = 'phase-engine',
    version = '0.1',
    ext_modules = cythonize(["phase_engine.pyx"] + ['music-synthesizer-for-android/src/' + p for p in [
        'fm_core.cc', 'dx7note.cc', 'env.cc', 'exp2.cc', 'fm_core.cc', 'fm_op_kernel.cc', 'freqlut.cc', 'lfo.cc', 'log2.cc', 'patch.cc', 'pitchenv.cc', 'resofilter.cc', 'ringbuffer.cc', 'sawtooth.cc', 'sin.cc', 'synth_unit.cc'
    ]],
        include_path = ['music-synthesizer-for-android/src/'],
        language = 'c++',
    )
)
When I run buildozer, it gets angry about some Cython features only being available in C++ mode:
    def __dealloc__(self):
        del self.p_synth_unit
        ^
------------------------------------------------------------
phase_engine.pyx:74:8: Operation only allowed in c++
from which I understand it's ignoring my setup.py and doing its own somehow. How do I give it all these parameters?
CythonRecipe doesn't work well for Cython code that imports C/C++ code. Try CompiledComponentsPythonRecipe, or if you're having issues with #include <ios> or some other thing from the C++ STL, CppCompiledComponentsPythonRecipe:
from pythonforandroid.recipe import IncludedFilesBehaviour, CppCompiledComponentsPythonRecipe
import os
import sys

class MyRecipe(IncludedFilesBehaviour, CppCompiledComponentsPythonRecipe):
    version = 'stable'
    src_filename = "../../../phase-engine"
    name = 'phase-engine'
    depends = ['setuptools']
    call_hostpython_via_targetpython = False
    install_in_hostpython = True

    def get_recipe_env(self, arch):
        env = super().get_recipe_env(arch)
        env['LDFLAGS'] += ' -lc++_shared'
        return env

recipe = MyRecipe()
The dependency on setuptools is essential; without it you get the error no module named setuptools. The two other flags were also related to that error; various sources said they were relevant, so I tried combinations of values until one worked.
The LDFLAGS thing fixes an issue I had later (see buildozer + Cython + C++ library: dlopen failed: cannot locate symbol symbol-name referenced by module.so).

Compiled console app quits immediately when importing ConfigParser (Python 2.7.12)

I am very new to Python and am trying to add some functionality to an existing Python program. I want to read values from a config INI file like this:
[Admin]
AD1 = 1
AD2 = 2
RSW = 3
When I execute the following code from IDLE, it works as it should (I was already able to read in values from the file, but deleted that part for a shorter code snippet):
#!/usr/bin/python
import ConfigParser
# builtin python libs
from time import sleep
import sys

def main():
    print("Test")
    sleep(2)

if __name__ == '__main__':
    main()
But the compiled exe quits before printing and waiting 2 seconds. If I comment out the import of ConfigParser, the exe runs fine.
This is how I compile it into an exe:
from distutils.core import setup
import py2exe, sys

sys.argv.append('py2exe')

setup(
    options = {'py2exe': {'bundle_files': 1}},
    zipfile = None,
    console=['Test.py'],
)
What am I doing wrong? Is there maybe another easy way to read in a configuration, if ConfigParser for some reason doesn't work in a compiled exe?
Thanks in advance for your help!
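For what it's worth, here is a minimal sketch of reading the values from the INI file above with Python 2.7's ConfigParser (the filename config.ini is made up for illustration; it is not from the original post):
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('config.ini')            # hypothetical path to the INI shown above

ad1 = config.getint('Admin', 'AD1')  # -> 1
ad2 = config.getint('Admin', 'AD2')  # -> 2
rsw = config.getint('Admin', 'RSW')  # -> 3
print(ad1, ad2, rsw)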

How can I specify a non-theano based likelihood?

I saw a post from a few days ago by someone else: pymc3 likelihood math with non-theano function. Even though I think the problem at its core is the same, I thought I would ask with a simpler example:
Inside logp_wrap, I put some made up definition of a likelihood function. It depends on the rv and an observation. In this case I could do this with theano operations, but let's say that I want this function to be more complex and so I cannot use theano.
The problem comes when I try to define the likelihood both in terms of an RV and observations. From what I have seen, this format would work if I was specifying everything in 'logp_wrap' as theano operations.
I have searched around for a solution to this, but haven't found anything where this problem is fully addressed.
The problem in my attempt to do this is actually that the logp_ function is correctly decorated, but the logp_wrap function is only correctly decorated for its input, and not for its output, so I get the error
TypeError: 'TensorVariable' object is not callable.
Would be great if someone had a solution - don't think I am the only one with this problem.
The theano version of this that works (and uses the same function-within-a-function definition) without the @as_op code is here: https://pymc-devs.github.io/pymc3/notebooks/lda-advi-aevb.html?highlight=densitydist (specifically the sections "Log-likelihood of documents for LDA" and "LDA model section")
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
from theano import as_op
import theano.tensor as T
from scipy.stats import norm

# Some data that we observed
g_observed = [0.0, 1.0, 2.0, 3.0]

# Define a function to calculate the logp without using theano.
# This as_op is where the problem is - the input is an rv but the output is a
# function.
@as_op(itypes=[T.dscalar], otypes=[T.dscalar])
def logp_wrap(rv):
    # We are not using theano so we wrap the function.
    @as_op(itypes=[T.dvector], otypes=[T.dscalar])
    def logp_(ob):
        # Some made up likelihood -
        # the key here is that lp depends on the rv input and the observations
        lp = np.log(norm.pdf(rv + ob))
        return lp
    return logp_

hb1_model = pm.Model()
with hb1_model:
    I_mean = pm.Normal('I_mean', mu=0.1, sd=0.05)
    xs = pm.DensityDist('x', logp_wrap(I_mean), observed=g_observed)

with hb1_model:
    step = pm.Metropolis()
    trace = pm.sample(1000, step)
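For reference, one workaround that is often suggested for this kind of problem (an untested sketch, not from the original post) is to apply as_op to a single function that takes both the random variable and the observations as inputs, and then hand DensityDist a small lambda that closes over the RV. Metropolis does not use gradients, so the lack of a gradient for the as_op-wrapped function is not an issue here:
# Sketch only: same imports and g_observed as above are assumed.
@as_op(itypes=[T.dscalar, T.dvector], otypes=[T.dscalar])
def logp_numpy(rv, ob):
    # plain numpy/scipy likelihood, evaluated outside of theano
    return np.array(np.sum(np.log(norm.pdf(rv + ob))), dtype=np.float64)

with pm.Model() as hb1_model:
    I_mean = pm.Normal('I_mean', mu=0.1, sd=0.05)
    xs = pm.DensityDist('x', lambda ob: logp_numpy(I_mean, ob),
                        observed=np.asarray(g_observed, dtype=np.float64))
    step = pm.Metropolis()
    trace = pm.sample(1000, step=step)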

Using c++ complex functions in Cython

I am trying to do a complex exponential in Cython.
I have been able to cobble together the following code for my pyx:
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI, floor, ceil
cdef extern from "complex.h":
    double complex cexp(double complex z)
import numpy as np
cimport numpy as np
import cython
from cython.parallel cimport prange, parallel

def try_cexp():
    cdef:
        double complex rr1
        double complex rr2

    rr1 = 1j
    rr2 = 2j

    print(rr1*rr2)
    #print(cexp(rr1))
Note that the print(cexp(rr1)) is commented. When the line is active, I get the following error when running setup.py:
error: command 'C:\\WinPYthon\\Winpython-64bit-3.4.3.6\\python-3.4.3.amd64\\scripts\\gcc.exe' failed with exit status 1
Note that when cexp is commented out, everything runs as expected... I can run setup.py, and when I test the function it prints out the product of the two complex numbers.
Here is my setup.py file. Note that it includes code to run openmp in Cython using g++:
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np
import os

os.environ["CC"] = "g++-4.7"
os.environ["CXX"] = "g++-4.7"

# These were added based on some examples I had seen of cexp in Cython. No effect.
#import pyximport
#pyximport.install(reload_support=True)

ext_modules = [
    Extension('complex_test',
              ['complex_test.pyx'],
              language="c++",
              extra_compile_args=['-fopenmp'],
              extra_link_args=['-fopenmp', '-lm'])  # Note that '-lm' was
              # added due to an example where someone mentioned g++ required
              # this. Same results with and without it.
]

setup(
    name='complex_test',
    cmdclass={'build_ext': build_ext},
    ext_modules=ext_modules,
    include_dirs=[np.get_include()]
)
Ultimately my goal is to speed up a calculation that looks like k*exp(z), where k and z are complex. Currently I am using Numerical expressions, however that has a large memory overhead, and I believe it's possible to optimize further than it can.
Thank you for your help.
You're using cexp instead of exp as it is in C++. Change your cdef extern to:
cdef extern from "<complex.h>" namespace "std":
double complex exp(double complex z)
float complex exp(float complex z) # overload
and your print call to:
print(exp(rr1))
and it should work as a charm.
I know the compilation messages are lengthy, but in there you can find the error that points to the culprit:
complex_test.cpp: In function ‘PyObject* __pyx_pf_12complex_test_try_cexp(PyObject*)’:
complex_test.cpp:1270:31: error: cannot convert ‘__pyx_t_double_complex {aka std::complex<double>}’ to ‘__complex__ double’ for argument ‘1’ to ‘__complex__ double cexp(__complex__ double)’
__pyx_t_3 = cexp(__pyx_v_rr1);
It's messy, but you can see the cause: you're supplying a C++-defined type (__pyx_t_double_complex in Cython jargon) to a C function which expects a different type (__complex__ double).
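To see the fix in context, here is an untested sketch of how the corrected pyx might look; note that exp is deliberately left out of the libc.math cimport in this sketch so the name refers only to the C++ overloads declared below (an assumption to avoid clashing declarations):
from libc.math cimport sin, cos, acos, sqrt, fabs, M_PI, floor, ceil  # no exp here
import numpy as np
cimport numpy as np

# C++ std::exp overloads for complex arguments
cdef extern from "<complex.h>" namespace "std":
    double complex exp(double complex z)
    float complex exp(float complex z)

def try_cexp():
    cdef:
        double complex rr1
        double complex rr2
    rr1 = 1j
    rr2 = 2j
    print(rr1 * rr2)
    print(exp(rr1))   # resolves to std::exp(std::complex<double>)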

Why can't I access par inside the function even if I use the keyword global in the following python program?

I am trying to access the variable par (declared inside the main module) inside the function func(). But I am getting the exception 'global name par is not defined'. What am I doing wrong?
Main.py
if __name__ == '__main__':
    import Second as S
    par = {1 : 'one'}
    S.func2()

def func():
    global par
    print('In func')
    print(par[1])
Second.py
import Main as M

def func2():
    M.func()
If you import the file, then the value of __name__ won't be "__main__" and the par dict never gets defined. (__name__ will instead be the name of the module, in this case "Main".)
if __name__ == "__main__": is used to shield bits of code designed only to run when the script is run directly (i.e. python Main.py). If the file is imported, then that if condition evaluates to False.
I think the root of your confusion is that normally, when multiple Python files import the same file, it enters sys.modules as a single entry/object, so they share the same namespace. However, the main script that is invoked gets a special name (__main__), so if you happen to import it, Python doesn't find it under its own name and creates a new module object.
import sys

if __name__ == '__main__':
    import Second as S
    par = {1 : 'one'}
    S.func2()

def func():
    print(sys.modules["__main__"])  # here par is defined
    print(sys.modules["Second"].M)  # here it isn't
    global par
    print('In func')
    print(par[1])
Your first example ran just fine on repl.it, but you probably did not run it as the main file. Your new example shows that you import the file from your second script, so __name__ == "__main__" evaluates to False, ergo:
par does not get initialized!
Why do you include the if branch in the first place?
By the way, you do not need the global declaration if you just want to print par. global is only required if you intend to assign to the name.
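A tiny stand-alone sketch of that last point:
par = {1: 'one'}

def read_par():
    # reading (or even mutating) a module-level name needs no global statement
    print(par[1])

def rebind_par():
    # only rebinding the name itself requires global
    global par
    par = {2: 'two'}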