How to execute multiple sessions in TensorFlow via one graph on the CPU - C++

Currently I am building a TensorFlow graph and then calling graph->getSession()->Run(input,model,output) in C++ on the CPU.
I want to achieve concurrency. What are my options for executing in parallel so that I can serve multiple requests concurrently?
Can I run sessions in a multi-threaded way?
When executing multiple sessions in parallel, will the processing time stay constant? Example: if one session takes 100 ms, does running 2 sessions concurrently also take approximately 100 ms?
Note: I want to run this on the CPU.

The first thing to note is that TensorFlow will use all cores for processing by default. You have some limited control over this via inter- and intra-op parallelism, discussed in this authoritative answer:
Tensorflow: executing an ops with a specific core of a CPU
The second point to note is that a session is thread-safe: you can call it from multiple threads. Each call will see a consistent point-in-time snapshot of the variables as they were when the call began. This is a question I asked once upon a time:
How are variables shared between concurrent `session.run(...)` calls in tensorflow?
The moral:
If you are running lots of small, sequential operations, you can run them concurrently against one session and may be able to squeak out some improved performance if you limit TensorFlow's use of parallelism. If you are running large operations (such as large matrix multiplies, for example), which benefit more from distributed multi-core processing, you don't need to deal with parallelism yourself; TensorFlow is already distributing work across all CPU cores by default.
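As a rough, minimal sketch of both points (hedged: the tensor names "input:0"/"output:0", the RunConcurrently helper and the thread counts are placeholders, not taken from your graph), limiting TensorFlow's own thread pools and issuing Session::Run from several std::threads could look like this:

#include <memory>
#include <thread>
#include <vector>
#include "tensorflow/core/public/session.h"

// Run the same loaded graph concurrently from several threads.
// All tensor names below ("input:0", "output:0") are illustrative placeholders.
void RunConcurrently(const tensorflow::GraphDef& graph_def,
                     const std::vector<tensorflow::Tensor>& requests) {
  tensorflow::SessionOptions options;
  // Limit TensorFlow's internal threading so the concurrent Run() calls
  // are not all competing for the same cores.
  options.config.set_intra_op_parallelism_threads(1);
  options.config.set_inter_op_parallelism_threads(1);

  std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));
  TF_CHECK_OK(session->Create(graph_def));

  std::vector<std::thread> workers;
  for (const auto& request : requests) {
    workers.emplace_back([&session, &request] {
      std::vector<tensorflow::Tensor> outputs;
      // Session::Run is thread-safe, so each request gets its own call.
      TF_CHECK_OK(session->Run({{"input:0", request}}, {"output:0"}, {}, &outputs));
    });
  }
  for (auto& w : workers) w.join();
}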
Also, if your graph's dependencies lend themselves to any amount of parallelization, TensorFlow handles this as well. You can set up profiling to see this in action.
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d

Related

Determining the number of parallel processes in a multi-instance subprocess

I am modelling a process which at times will require a very large number of parallel sub-processes (tens of thousands) to be launched. Obviously it’s not possible for these all to run in parallel simultaneously - how will the Camunda process engine handle this? Is it possible to control how many subprocesses will run at a time?
Camunda 7 uses a job executor thread pool. This determines the concurrency level of jobs, such as an asynchronously started call activity.
The number of sub-processes you mentioned is very high, though. What history level did you have in mind? It is likely better to handle this differently.
Camunda 8 was released two days ago. It has a fundamentally different architecture: no relational DB, event-streaming concepts, designed for massive volumes. It may be more suitable for your use case.

What is the point of being able to spawn dozens of processes efficiently if only very few of them can be executed in parallel?

Erlang is very efficient at spawning new processes, but what is the point if the CPU can only execute, e.g., 4 of them in parallel?
The rest therefore have to wait for the Erlang "context switch".
Do you get more things done faster with, for example, 10k processes than you would using Java/C#/C++?
There are many reasons:
Conceptually, processes are easy to reason about. Asynchronous callbacks and promises in languages like JavaScript are harder to reason about because the code in the callbacks can change the values of variables used by other code in the thread.
Processes provide isolation for the code running inside them. A process can only affect other processes by placing messages in their mailboxes. A process cannot meddle with the state of other processes.
Processes are granular. This means:
If you have 400 processes on a 4-core machine, the scheduler will make sure to distribute them across the threads in such a way as to fully utilize the 4 cores. One core is always going to be handling OS stuff, so the scheduler would likely end up giving the thread running on that core less work than the other 3 threads. But it adapts, so in any situation the scheduler will do its best to make sure processes wait as little as possible and threads always have a queue of processes waiting for CPU time.
Moving to better hardware with more cores doesn't require changes to the code or architecture of the application. Moving your Erlang application from a 4-core machine to a 64-core machine means it will run roughly 16 times faster without any changes, assuming your application is structured in such a way that it can take advantage of the extra cores (usually this means making sure tasks that could be done in parallel are executed in separate processes).
Processes are very lightweight, so there is very little overhead. In most applications the benefits provided by processes and the scheduler far outweigh the small overhead from running thousands of processes. Commodity hardware can easily handle hundreds of thousands of processes.
So in closing, whether or not processes execute in parallel isn't that important. The other benefits they provide are enough to justify their usage.

Parallel Thread Execution to achieve performance

I am a little bit confused about multithreading. Normally we create multiple threads to break the main process into sub-tasks, to achieve responsiveness and to remove waiting time.
But here I have a situation where I have to execute the same task using multiple threads in parallel.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to my colleague, he said nothing will happen: we are already executing many threads in many other applications (browser threads, kernel threads, etc.), so he told me to create multiple threads for the same task.
But if I create more than 4 threads that are supposed to execute in parallel, won't that cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they just execute one after the other, so the performance stays the same?
So what should I do in the above situation, and are these assumptions correct?
edit
1 thread: time to process is 120 seconds.
2 threads: time to process is about 60 seconds.
3 threads: time to process is still about 60 seconds (no change from 2 threads).
Is it because my hardware can only run 2 threads at once (because it is dual-core)?
Software thread = a piece of code.
Hardware thread = a core (processor) that runs a software thread.
So my CPU supports only 2 concurrent threads; if I purchase an AMD CPU with 8 or 12 cores, can I achieve higher performance?
Multi-tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only a part of the application can be worked on in parallel (there is always a first part that splits up the work into multiple tasks). So the first question is: how much of the work can be done in parallel and how much of it needs to be synchronized? (In some cases you can stop here, because so little can be done in parallel that the whole exercise isn't worth it.)
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situation). Here we need to synchronize access to this data/resources. If all tasks need write access to the same object during the WHOLE process, then we cannot work in parallel.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's law states, in a nutshell, that the performance boost you receive from parallel execution is limited by the part of your code that must run sequentially.
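As a concrete illustration, Amdahl's law says the speedup on n cores is S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. Here is a tiny sketch (the 90% parallel fraction is only an assumption for illustration):

#include <cstdio>

// Amdahl's law: speedup on n cores when a fraction p of the work is parallel.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double p = 0.9;  // assume 90% of the work parallelizes
    for (int n : {1, 2, 4, 8, 16}) {
        std::printf("%2d cores -> %.2fx speedup\n", n, amdahl_speedup(p, n));
    }
    // Even with infinitely many cores the speedup is capped at 1/(1-p) = 10x here.
}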
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores. This becomes more complicated when threads must wait for work (i.e. blocking on IO), but in general you want to keep your cores as busy as possible running your program, not switching out threads.
Unless you absolutely need to use threads and sync primitives directly, try using a task scheduler or parallel algorithms library to parallelize your work. Examples would be Intel TBB, Thrust or Apple's libDispatch.
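As a hedged example of that last suggestion, a parallel loop with Intel TBB might look roughly like this (squaring a vector is just a placeholder workload); TBB's scheduler decides how to split the range and which worker threads run it:

#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Square every element in parallel; no explicit threads or locks needed.
void square_all(std::vector<double>& data) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i) {
                data[i] *= data[i];  // placeholder computation
            }
        });
}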

How to run each thread on a different core?

I have a UDP server that receives data and processes it.
I have two threads, one for each role.
My CPU has 8 cores and I send data at various speeds.
But at maximum I only use about 14% of my CPU, with two cores at 50%. If I send a higher data volume my buffer fills up and the CPU usage doesn't go any higher.
Why does each core only reach 50% and not more?
I am thinking of dividing these two roles across separate cores.
I want to be sure that each one runs on a different core.
How can I explicitly choose which core each thread runs on?
My program is written in C++ with Visual Studio 9, runs on Windows 7, and uses boost::thread.
The scheduler will deal with where your threads etc. will run. This is OS-specific, so if you want to alter how your code is scheduled you need an OS-specific API that lets you set a thread's affinity, etc.
Also, it depends what your application is like; it's a client-server by the looks of it, so it's not totally CPU-bound. How many threads do you have in total? You mention 2 per role. A thread can only run on one CPU at a time. Try to create units of work that can truly run in parallel, so that they can truly run independently, ideally on different cores.
The OS will generally do a good job of running your code since it will have a better overall picture.
You cannot make one thread use more than one core. To achieve better CPU utilization you need to redesign your program to create more threads and let the OS schedule them for you. There's no need to manually restrict the threads to specific cores. OSes are really good at figuring out how to allocate cores to threads.
In your case, if the data-computing tasks are CPU-heavy, you could spawn a new thread per request or have a worker-thread pool that picks up incoming tasks and processes them. This is just one idea. It's difficult to say more without knowing your application architecture and the problems it's trying to solve.
In each thread you can use SetThreadAffinityMask to choose the CPUs that the thread should run on. But I suggest you create a new worker thread for each incoming request (and if you use a thread pool you'll see a considerable performance boost).
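A minimal sketch of that approach, assuming a Windows build; the core indices 0 and 1 are just examples, and std::thread is used for brevity (the same calls work from inside a boost::thread):

#include <windows.h>
#include <thread>

// Pin the calling thread to one logical core (coreIndex is 0-based).
void pin_current_thread_to_core(unsigned coreIndex) {
    SetThreadAffinityMask(GetCurrentThread(), static_cast<DWORD_PTR>(1) << coreIndex);
}

int main() {
    std::thread receiver([] {
        pin_current_thread_to_core(0);   // receiving thread on core 0
        // ... recvfrom() loop would go here ...
    });
    std::thread computer([] {
        pin_current_thread_to_core(1);   // computation thread on core 1
        // ... processing loop would go here ...
    });
    receiver.join();
    computer.join();
}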
Take care that the compiler and linker settings enable multithreading.
Best practice is also not to start many short-lived threads but a few long-living threads that work through a queue of tasks such as computations or downloads.

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the DB and, for each one, calls a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they can be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .net is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
import processing

def worker(i):
    return i*i

num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap, which distributes calls over to processes. You can also use the apply_async methods of the pool and store lazy result objects in a list:
results = []
for i in range(10000):
    # note: the args argument must be a tuple
    results.append(pool.apply_async(worker, (i,)))
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
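A rough sketch of that idea in C++, with std::async standing in for the "machines" (the chunk type and process_chunk are placeholder stand-ins for the deterministic, side-effect-free function from the question):

#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Map step: process one chunk (placeholder: just sum it).
double process_chunk(const std::vector<double>& chunk) {
    return std::accumulate(chunk.begin(), chunk.end(), 0.0);
}

// Map chunks onto async tasks, then reduce the per-chunk results.
double map_reduce(const std::vector<std::vector<double>>& chunks) {
    std::vector<std::future<double>> futures;
    for (const auto& chunk : chunks) {
        // Each "machine" of the MapReduce paper becomes one async task.
        futures.push_back(std::async(std::launch::async, process_chunk, std::cref(chunk)));
    }
    double total = 0.0;
    for (auto& f : futures) {
        total += f.get();  // reduce step
    }
    return total;
}

This is deliberately naive (one task per chunk); a real version would cap the number of in-flight tasks at roughly the core count.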
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays. One holds the worker threads (the thread pool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
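A minimal sketch of that master/worker scheme, here using std::thread and a condition variable in place of raw pthreads and the two arrays (the Chunk type and the DB loop are placeholders):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { /* placeholder for one chunk of DB data */ };

std::queue<Chunk> work;          // chunks handed over by the master
std::mutex m;
std::condition_variable cv;
bool done = false;

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !work.empty() || done; });
        if (work.empty()) return;        // master finished and queue drained
        Chunk c = work.front();
        work.pop();
        lock.unlock();
        // ... process the chunk of data here ...
    }
}

int main() {
    const int N = 4;                     // tweak N to your liking
    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i) pool.emplace_back(worker);

    for (int i = 0; i < 100; ++i) {      // master loop: "get chunk from DB"
        {
            std::lock_guard<std::mutex> lock(m);
            work.push(Chunk{});
        }
        cv.notify_one();                 // hand the chunk to a free worker
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (auto& t : pool) t.join();
}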
If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code in such a way that certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc4.2 will support openmp, for example.
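As a hedged sketch of what that annotation looks like in practice (the loop body is just a placeholder; compile with something like gcc's -fopenmp):

#include <omp.h>
#include <vector>

void process_all(std::vector<double>& chunks) {
    // Each iteration is independent, so OpenMP may split them across cores.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(chunks.size()); ++i) {
        chunks[i] = chunks[i] * chunks[i];   // placeholder computation
    }
}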
The same thread-pool approach is used in Java. There, the units of work handed to the pool (not the threads themselves) can be made serialisable, sent to other computers and deserialised to run there.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.