python ThreadPool not working (executed sequentially instead of in parallel) - python-2.7

I'm using ThreadPool from multiprocessing.pool, but it doesn't seem to be working (i.e. the tasks are executed sequentially rather than in parallel: res1 takes 14 s, res2 takes 11 s, and using ThreadPool takes 27 s instead of 14 s).
I think the problem is in another_method, because it uses a (shared) read_only_resource.
I've tried putting a time.sleep(val) in another_method instead of calling another method (that works as expected: it takes as long as the largest value I pass), and I've also tried passing a deep copy of read_only_resource (that doesn't help; it still takes 27 s).
I have run out of things to try to make this work:
from multiprocessing.pool import ThreadPool

def method(text_type, read_only_resource):
    value = some_processesing(text_type)
    return another_method(value, read_only_resource)

def main():
    same_read_only_resource = get_read_only_resource()
    pool = ThreadPool(processes=2)
    res1 = pool.apply_async(method, (some_text_type, same_read_only_resource))
    res2 = pool.apply_async(method, (other_text_type, same_read_only_resource))
    results1 = res1.get()
    results2 = res2.get()

It looks like you want to use map instead of apply_async. The apply_async function isn't for parallelizing multiple function calls. Rather, it's to asynchronously call a single instance of the function. Since you call it twice and get the results in order, you get serialized performance.
Calling map will run multiple instances of a function in parallel. It requires packing the inputs into a single object, e.g. a tuple, since it only allows a single argument to be passed to your function. Then all packed inputs can be placed in a list or other iterable and given to map, for example:
work_args = [(some_text, read_only_resource), (other_text, read_only_resource), ...]
results = pool.map(method, work_args)
Additionally, you can use itertools.izip() together with itertools.repeat() to build work_args, so that each text is paired with the same resource, e.g.:
work_args = itertools.izip([some_text, other_text, ...], itertools.repeat(read_only_resource))
Note that since you are using ThreadPool, you still might not get much performance increase, depending on the work being done. The topic has been discussed in many places, but Python may not parallelize with threads like you expect, due to the Global Interpreter Lock. See here for a good summary. In short, if your function is going to do I/O, ThreadPool can help.
However if you use multiprocessing.Pool, you will have multiple processes executing simultaneously. Just replace ThreadPool with Pool to use it.
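Putting it together, here is a minimal, self-contained sketch of the map approach. The names are only stand-ins: the sleeping worker replaces the question's some_processesing / another_method, and the dictionary stands in for the real read-only resource, neither of which I have.
from multiprocessing.pool import ThreadPool
import time

def method(args):
    text_type, read_only_resource = args   # map passes a single argument, so unpack the tuple
    time.sleep(2)                          # stand-in for the real (I/O-bound) processing
    return (text_type, len(read_only_resource))

if __name__ == "__main__":
    read_only_resource = {"a": 1, "b": 2}  # illustrative shared resource
    work_args = [("some_text", read_only_resource), ("other_text", read_only_resource)]
    pool = ThreadPool(processes=2)
    results = pool.map(method, work_args)  # both items are processed concurrently (~2 s total)
    print(results)
For CPU-bound work, the same structure works with multiprocessing.Pool (replace ThreadPool with Pool), which sidesteps the GIL at the cost of pickling the arguments to each worker process.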

Related

How to hit cache?

I am using tdm64/mingw32.
Let's say I have a class with a constructor:
class test {
    int number;
    int dependent;
    test(int set_number) {
        number = set_number;
        dependent = some_function(number);
    }
};
Will the code be faster if I switch to:
dependent = some_function(set_number);
I also need some explanation. Basically, does the first option have to wait to make the function call until number has been written back to RAM? Or does it stall because the instruction queue is partially empty, waiting on a value that hasn't been computed yet? Is number pulled back for the operation from cache, and if so, from L1, L2, or L3? Will it have to wait multiple cycles or just one? Which assembly instructions will be generated from the two following lines?
number = set_number;
dependent = some_function(number);
What sort of assembly instructions will be generated to assign number and then to pass number into the function for the operation?
What if I have multiple situations like this, with array operations mixed in between them?

Nim: Parallel loop with mutating state

I'm new to the Nim language. I wanted to learn it by implementing a simple genetic algorithm that evolves strings (arrays of integers at the moment), distributing the work across the CPU cores:
https://github.com/peheje/nim_genetic
I have successfully parallelized the creation of the "Agents", but I cannot do the same for the call to a function "life" that must mutate the passed-in state:
...
type
  Agent* = ref object
    data*: seq[int]
    fitness*: float
...
var pool: seq[Agent] = newAgentsParallel(POPULATION_SIZE)
# Run generations
for gen in 0..<N_GENERATIONS:
  let wheel = createWheel(pool)
  let partitions: seq[seq[Agent]] = partition(pool, N_THREADS)
  parallel:
    for part in partitions:
      echo "spawning"
      spawn life(part, pool, wheel)
  pool = createPool(pool, wheel)
...
proc life(part: seq[Agent], pool: seq[Agent], wheel: seq[float]) =
  for a in part:
    if random(1.0) < CROSSOVER_PROP:
      a.crossover(pool, wheel)
    if random(1.0) < MUTATE_PROP:
      a.mutate()
    a.calcFitness()
  echo "life done"
The CPU is pegged at 100% and it seems like Nim is copying the data passed to "life", as RAM usage skyrockets after "spawning". The Nim manual says this about the parallel block:
"Every other complex location loc that is used in a spawned proc
(spawn f(loc)) has to be immutable for the duration of the parallel
section. This is called the immutability check. Currently it is not
specified what exactly "complex location" means. We need to make this
an optimization!"
And I'm very much using it as mutable state, so maybe that is why Nim is copying the data? How can I get around this and pass only references? A few points:
I guess I could avoid mutating, instead returning new instances that are modified, but I still need to pass in the pool and wheel to read from.
If the parallel: statement can't be used here, how would I implement this using threads?
Is random() thread-safe? If not, what should I use instead?
Anything else I could do differently? E.g. an easier way of unwrapping the FlowVar?
Coming from Kotlin with Java8 Streams I feel really spoiled.

Marshaling object code for a Numba function

I have a problem that could be solved by Numba: creating Numpy ufuncs for a query server to (a) coalesce simple operations into a single pass over the data, reducing my #1 hotspot (memory bandwidth), and (b) wrap up third-party C functions as ufuncs on the fly, providing more functionality to users of the query system.
I have an accumulator node that splits up the query and collects results, and compute nodes that actually run Numpy (distinct computers in a network). If the Numba compilation happens on the compute nodes, it is duplicated effort, since they're working on different data partitions of the same query: same query means same Numba compilation. Moreover, even the simplest Numba compilation takes 96 milliseconds, as long as running a query calculation over millions of points, and that time is better spent on the compute nodes.
So I want to do the Numba compilation once on the accumulator node, then send it to the compute nodes so they can run it. I can guarantee that both have the same hardware, so the object code is compatible.
I've been searching the Numba API for this functionality and haven't found it (apart from a numba.serialize module with no documentation; I'm not sure what its purpose is). The solution might not be a "feature" of the Numba package, but a technique that takes advantage of someone's insider knowledge of Numba and/or LLVM. Does anyone know how to get at the object code, marshal it, and reconstitute it? I can have Numba installed on both machines if that helps, I just can't do anything too expensive on the destination machines.
Okay, it's possible, and the solution makes heavy use of the llvmlite library under Numba.
Getting the serialized function
First we define some function with Numba.
import numba

@numba.jit("f8(f8)", nopython=True)
def example(x):
    return x + 1.1
We can get access to the object code with
cres = list(example.overloads.values())[0]  # first (and only) type signature; list() needed on Python 3
elfbytes = cres.library._compiled_object
If you print out elfbytes, you'll see that it's an ELF-encoded byte array (bytes object, not a str if you're in Python 3). This is what would go into a file if you were to compile a shared library or executable, so it's portable to any machine with the same architecture, same libraries, etc.
There are several functions inside this bundle, which you can see by dumping the LLVM IR:
print(cres.library.get_llvm_str())
The one we want is named __main__.example$1.float64 and we can see its type signature in the LLVM IR:
define i32 @"__main__.example$1.float64"(double* noalias nocapture %retptr, { i8*, i32 }** noalias nocapture readnone %excinfo, i8* noalias nocapture readnone %env, double %arg.x) #0 {
entry:
  %.14 = fadd double %arg.x, 1.100000e+00
  store double %.14, double* %retptr, align 8
  ret i32 0
}
Take note for future reference: the first argument is a pointer to a double that gets overwritten with the result, the second and third arguments are pointers that never get used, and the last argument is the input double.
(Also note that we can get the function names programmatically with [x.name for x in cres.library._final_module.functions]. The entry point that Numba actually uses is cres.fndesc.mangled_name.)
We transmit this ELF and function signature from the machine that does all the compiling to the machine that does all the computing.
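How you ship the bytes is up to you; as a minimal file-based sketch (the file name is made up), something like this would do, since elfbytes is an ordinary bytes object and the function name is a plain string:
import pickle

# On the compiling (accumulator) machine: bundle the object code with the entry-point name.
payload = {
    "elf": elfbytes,
    "name": "__main__.example$1.float64",  # or take it from cres.fndesc.mangled_name, as noted above
}
with open("example_kernel.pkl", "wb") as f:  # "example_kernel.pkl" is an arbitrary name
    pickle.dump(payload, f)

# On the computing machine, after copying the file over by whatever means you like:
with open("example_kernel.pkl", "rb") as f:
    payload = pickle.load(f)
elfbytes, funcname = payload["elf"], payload["name"]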
Reading it back
Now on the compute machine, we're going to use llvmlite with no Numba at all (following this page). Initialize it:
import llvmlite.binding as llvm
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter() # yes, even this one
Create an LLVM execution engine:
target = llvm.Target.from_default_triple()
target_machine = target.create_target_machine()
backing_mod = llvm.parse_assembly("")
engine = llvm.create_mcjit_compiler(backing_mod, target_machine)
And now hijack its caching mechanism to have it load our ELF, named elfbytes:
def object_compiled_hook(ll_module, buf):
    # called after a compile; we have nothing to cache, so do nothing
    pass

def object_getbuffer_hook(ll_module):
    # called when the engine looks for a cached object: hand it our ELF bytes
    return elfbytes

engine.set_object_cache(object_compiled_hook, object_getbuffer_hook)
Finalize the engine as though we had just compiled an IR, but in fact we skipped that step. The engine will load our ELF, thinking it's coming from its disk-based cache.
engine.finalize_object()
We should now find our function in this engine's space. If the following returns 0L, something's wrong. It should be a function pointer.
func_ptr = engine.get_function_address("__main__.example$1.float64")
Now we need to interpret func_ptr as a ctypes function we can call. We have to set up the signature manually.
import ctypes
pdouble = ctypes.c_double * 1
out = pdouble()
pointerType = ctypes.POINTER(None)
dummy1 = pointerType()
dummy2 = pointerType()
# restype first then argtypes...
cfunc = ctypes.CFUNCTYPE(ctypes.c_int32, pdouble, pointerType, pointerType, ctypes.c_double)(func_ptr)
And now we can call it:
cfunc(out, dummy1, dummy2, ctypes.c_double(3.14))
print(out[0])
# 4.24, which is 3.14 + 1.1. Yay!
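To make the deserialized kernel feel like an ordinary Python function, you can hide the ctypes plumbing in a small wrapper. This is just a sketch (example_remote is a name I'm introducing, not something Numba or llvmlite provides), assuming, per Numba's calling convention, that a zero return status means success:
def example_remote(x):
    out = pdouble()                                      # space for the result
    status = cfunc(out, dummy1, dummy2, ctypes.c_double(x))
    if status != 0:
        raise RuntimeError("compiled function returned error status %d" % status)
    return out[0]

print(example_remote(3.14))   # 4.24, as before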
More complications
If the JITed function has array inputs (after all, you want to do the tight loop over many values in the compiled code, not in Python), Numba generates code that recognizes Numpy arrays. The calling convention for this is quite complex, including pointers-to-pointers to exception objects and all the metadata that accompanies a Numpy array as separate parameters. It does not generate an entry point that you can use with Numpy's ctypes interface.
However, it does provide a very high-level entry point, which takes a Python *args, **kwds as arguments and parses them internally. Here's how you use that.
First, find the function whose name starts with "cpython.":
name = [x.name for x in cres.library._final_module.functions if x.name.startswith("cpython.")][0]
There should be exactly one of them. Then, after serialization and deserialization, get its function pointer using the method described above:
func_ptr = engine.get_function_address(name)
and cast it with three PyObject* arguments and one PyObject* return value. (LLVM thinks these are i8*.)
class PyTypeObject(ctypes.Structure):
    _fields_ = ("ob_refcnt", ctypes.c_int), ("ob_type", ctypes.c_void_p), ("ob_size", ctypes.c_int), ("tp_name", ctypes.c_char_p)

class PyObject(ctypes.Structure):
    _fields_ = ("ob_refcnt", ctypes.c_int), ("ob_type", ctypes.POINTER(PyTypeObject))

PyObjectPtr = ctypes.POINTER(PyObject)

cpythonfcn = ctypes.CFUNCTYPE(PyObjectPtr, PyObjectPtr, PyObjectPtr, PyObjectPtr)(func_ptr)
The first of these three arguments is a closure (global variables that the function accesses), and I'm going to assume we didn't need that. Use explicit arguments instead of closures. We can use the fact that CPython's id() implementation returns the pointer value to make PyObject pointers.
def wrapped(*args, **kwds):
    closure = ()
    return cpythonfcn(ctypes.cast(id(closure), PyObjectPtr), ctypes.cast(id(args), PyObjectPtr), ctypes.cast(id(kwds), PyObjectPtr))
Now the function can be called as
wrapped(whatever_numpy_arguments, ...)
just like the original Numba dispatcher function.
Bottom line
After all that, was it worth it? Doing the end-to-end compilation with Numba, the easy way, takes 50 ms for this simple function. Asking for -O3 instead of the default -O2 makes that about 40% slower still.
Splicing in a pre-compiled ELF file, however, takes 0.5 ms: a factor of 100 faster. Moreover, compilation times will increase with more complex functions but the splicing-in procedure should always take 0.5 ms for any function.
For my application, this is absolutely worth it. It means that I can perform computations on 10 MB at a time and be spending most of my time computing (doing real work), rather than compiling (preparing to work). Scale this up by a factor of 100 and I'd have to perform computations on 1 GB at a time. Since a machine is limited to order-of 100 GB and it has to be shared among order-of 100 processes, I'd be in greater danger of hitting resource limitations, load balancing issues, etc., because the problem would be too granular.
But for other applications, 50 ms is nothing. It all depends on your application.

How to make something lwt supported?

I am trying to understand the term lwt supported.
So assume I have a piece of code which connects to a database and writes some data: Db.write conn data. It has nothing to do with Lwt yet, and each write costs 10 seconds.
Now I would like to use Lwt. Can I directly write code like the below?
let write_all data_list = Lwt_list.iter (Db.write conn) data_list
let _ = Lwt_main.run(write_all my_data_list)
Suppose there are 5 data items in my_data_list: will all 5 items be written to the database sequentially or in parallel?
Also, in the Lwt manual or http://ocsigen.org/tutorial/application, they say
Using Lwt is very easy and does not cause troubles, provided you never
use blocking functions (non cooperative functions). Blocking functions
can cause the entire server to hang!
I don't quite understand how to avoid using blocking functions. For each of my own functions, can I just use Lwt.return to make it Lwt-supported?
Yes, your code is correct. The principle of being "Lwt supported" is that everything that can potentially take time in your code should return an Lwt value.
As for Lwt_list.iter, you can choose whether the processing is parallel or sequential by choosing between iter_p and iter_s:
In iter_s f l, iter_s will call f on each element of l, waiting for completion between each element.
On the contrary, in iter_p f l, iter_p will call f on all elements of l, then wait for all the threads to terminate.
As for non-blocking functions, the principle of lightweight threads is that they keep running until they reach a "cooperation point", i.e. a point where the thread can safely be interrupted or has nothing to do, as in a sleep.
But you have to declare that you are entering a "cooperation point" before actually doing the sleep. This is why the whole Unix library has been wrapped, so that when you want to do an operation that takes time (e.g. a write), a cooperation point is reached automatically.
For your own functions, if you use I/O operations from Unix, you should use the Lwt versions instead (Lwt_unix.sleep instead of Unix.sleep).

Strange behavior of go routine

I just tried the following code, but the result seems a little strange. It prints odd numbers first, and then even numbers. I'm really confused by it. I had hoped it would output odd and even numbers alternately, like 1, 2, 3, 4, ... Can anyone help me?
package main

import (
    "fmt"
    "time"
)

func main() {
    go sheep(1)
    go sheep(2)
    time.Sleep(100000)
}

func sheep(i int) {
    for ; ; i += 2 {
        fmt.Println(i, "sheeps")
    }
}
More than likely you are only running with one CPU thread, so it runs the first goroutine and then the second. If you tell Go it can run on multiple threads, then both will be able to run simultaneously, provided the OS has spare time on a CPU to do so. You can demonstrate this by setting GOMAXPROCS=2 before running your binary. Or you could try adding a runtime.Gosched() call in your sheep function and see if that triggers the runtime to allow the other goroutine to run.
In general, though, it's better not to assume any ordering of operations between two goroutines unless you establish explicit synchronization points, using a sync.Mutex or by communicating between them on channels.
Unsynchronized goroutines execute in a completely undefined order. If you want to print out something like
1 sheeps
2 sheeps
3 sheeps
....
in that exact order, then goroutines are the wrong way to do it. Concurrency works well when you don't care so much about the order in which things occur.
You could impose an order in your program through synchronization (locking a mutex around the fmt.Println calls or using a channel), but it's pointless since you could more easily just write code that uses a single goroutine.