I am doing OCR using Tesseract on a quad-core processor.
For better speed, I want to read 4 words at a time, using 4 threads.
Is it safe to call Tesseract from multiple threads concurrently?
Note: each thread will be working on a different, non-shared image.
Note: guarding the calls with locks is not acceptable, because it would serialize the work and defeat the speed gain.
From the release notes, Tesseract has been (mostly, and to the degree you describe needing) thread-safe since 3.01 (Oct 21, 2011):
Thread-safety! Moved all critical globals and statics to members of
the appropriate class. Tesseract is now thread-safe (multiple
instances can be used in parallel in multiple threads.) with the minor
exception that some control parameters are still global and affect all
threads.
I've been using it successfully on multiple cores for that long (or longer, via the dev branch).
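To make the instance-per-thread pattern concrete, here is a minimal hedged sketch using the C++ API with Leptonica for image loading; the image file names and the "eng" language are placeholder assumptions for your own setup:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <string>
#include <thread>
#include <vector>

// Each thread owns its own TessBaseAPI instance; nothing is shared.
std::string ocrOne(const char* path) {
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "eng") != 0) return "";
    Pix* image = pixRead(path);
    api.SetImage(image);
    char* text = api.GetUTF8Text();
    std::string out = text ? text : "";
    delete[] text;
    pixDestroy(&image);
    api.End();
    return out;
}

int main() {
    const char* paths[] = {"w0.png", "w1.png", "w2.png", "w3.png"};  // placeholders
    std::vector<std::string> results(4);
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&results, &paths, i] { results[i] = ocrOne(paths[i]); });
    for (auto& t : threads) t.join();
}

Note that Init is relatively expensive, so in a real system you would keep one long-lived instance per worker thread rather than re-initialising per word.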
I don't think tesseract is currently parallelizable (see this thread), although one of the main goals for v3.0 is to make it more thread-safe.
However, you could always parallelize by running n concurrent tesseract processes. If you want to parallelize the OCR of a single image, it would be up to you to split it and feed each part to one of these n processes (basically a map/reduce).
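For illustration, here is a hedged POSIX sketch of the fan-out step, assuming the stock tesseract CLI and pre-split image parts named part0.png through part3.png (the merge/reduce step is left out):

#include <string>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    const int n = 4;
    for (int i = 0; i < n; ++i) {
        if (fork() == 0) {  // child: run one tesseract process on one part
            std::string in = "part" + std::to_string(i) + ".png";
            std::string out = "part" + std::to_string(i);  // tesseract appends .txt
            execlp("tesseract", "tesseract", in.c_str(), out.c_str(), (char*)nullptr);
            _exit(127);  // only reached if exec failed
        }
    }
    for (int i = 0; i < n; ++i) wait(nullptr);  // a "reduce" step would merge the partN.txt files
}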
Related
I am wondering whether there is any performance difference between running a single program (exe) with 10 threads and running that program single-threaded 10 times in parallel (started from a .bat file), assuming the work done is the same and only the number of threads spawned by the program changes.
I am developing a client/server communication program and want to test its throughput. I am currently learning about parallel programming and threading, and I wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way in both cases? Will there be a performance difference?
The machine the program runs on has 4 hardware threads.
Threads are somewhat lighter weight than processes, because there are many things a process gets its own copy of. This is especially visible when you compare the time it takes to start a new thread with the time to start a new process from scratch (fork, where available, also avoids a lot of that cost). In either case, though, you can generally get even better performance by using a worker pool where possible, rather than repeatedly starting and stopping fresh processes or threads.
The other major difference is that threads all share the same memory by default, while processes each get their own and must communicate through more explicit means (which may include blocks of shared memory). This can make it easier for a threaded solution to avoid copying data, but it is also one of the dangers of multithreaded programming when care is not taken with how shared memory and objects are used.
There may also be APIs that are oriented around a single process. For example, Windows has IO Completion Ports, which is built around the idea of having many in-progress IO operations on different files, sockets, etc., with multiple threads (generally far fewer than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.
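As a rough illustration of that loop, here is a hedged, Windows-only sketch of an IOCP worker; creating the completion port and associating file/socket handles with it are assumed to happen elsewhere:

#include <windows.h>

DWORD WINAPI WorkerLoop(LPVOID param) {
    HANDLE port = static_cast<HANDLE>(param);  // completion port created elsewhere
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;                 // identifies the file/socket
        LPOVERLAPPED overlapped = nullptr; // identifies the specific operation
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE);
        if (!ok && overlapped == nullptr) break;  // port closed or fatal error
        // Dispatch on (key, overlapped) and handle the completed IO here.
    }
    return 0;
}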
Currently I am building a TensorFlow graph and then calling graph->getSession()->Run(input,model,output) in C++ on the CPU.
I want to achieve concurrency. What are my options for executing in parallel, so that I can serve multiple requests concurrently?
Can I run sessions in a multi-threaded way?
If I execute multiple sessions in parallel, will the processing time stay constant? For example: if one session takes 100 ms, will running 2 sessions concurrently also take approximately 100 ms?
Note: I want to run this on the CPU.
The first thing to note is that TensorFlow uses all cores for processing by default. You have a limited degree of control over this via the inter-op and intra-op parallelism settings discussed in this authoritative answer:
Tensorflow: executing an ops with a specific core of a CPU
The second point to note is that a session is thread-safe: you can call it from multiple threads. Each call sees a consistent point-in-time snapshot of the variables as they were when the call began. This is a question I asked once upon a time:
How are variables shared between concurrent `session.run(...)` calls in tensorflow?
The moral:
If you are running lots of small, sequential operations, you can run them concurrently against one session and may be able to squeeze out some extra performance by limiting TensorFlow's own use of parallelism. If you are running large operations (large matrix multiplies, for example), which benefit more from multi-core processing, you don't need to handle parallelism yourself: TensorFlow already distributes them across all CPU cores by default.
Also, if your graph's dependencies lend themselves to any amount of parallelization, TensorFlow handles that as well. You can set up profiling to see this in action:
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
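To make the concurrent-session pattern concrete, here is a hedged C++ sketch of several threads sharing one session; the graph file name ("graph.pb"), tensor names ("input", "output"), and shape are placeholder assumptions, not part of any real model:

#include <thread>
#include <vector>
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/tensor.h"

int main() {
    tensorflow::SessionOptions options;
    options.config.set_intra_op_parallelism_threads(2);  // threads per op
    options.config.set_inter_op_parallelism_threads(2);  // ops run concurrently

    tensorflow::Session* session = nullptr;
    TF_CHECK_OK(tensorflow::NewSession(options, &session));

    tensorflow::GraphDef graph_def;
    TF_CHECK_OK(tensorflow::ReadBinaryProto(tensorflow::Env::Default(),
                                            "graph.pb", &graph_def));
    TF_CHECK_OK(session->Create(graph_def));

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([session] {
            tensorflow::Tensor input(tensorflow::DT_FLOAT,
                                     tensorflow::TensorShape({1, 10}));
            std::vector<tensorflow::Tensor> outputs;
            // Session::Run is safe to call from multiple threads.
            TF_CHECK_OK(session->Run({{"input", input}}, {"output"}, {}, &outputs));
        });
    }
    for (auto& t : workers) t.join();
    session->Close();
}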
I am a little bit confused about multithreading. Normally we create multiple threads to break the main task into subtasks, to achieve responsiveness and to remove waiting time.
But here I have a situation where I have to execute the same task using multiple threads in parallel.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to my colleague, he said nothing bad will happen: we already run many threads in other applications (browser threads, kernel threads, etc.), so he suggests creating multiple threads for the same task.
But won't creating more than 4 threads that are meant to run in parallel just cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they simply execute one after the other, so that performance stays the same?
So what should I do in the situations above, and are my assumptions correct?
Edit:
1 thread: processing took 120 seconds.
2 threads: processing took about 60 seconds.
3 threads: processing took about 60 seconds (no change from 2 threads).
Is this because my hardware can only run 2 threads at once (being dual-core)?
Software thread = a piece of code to run.
Hardware thread = a core (processor) that runs a software thread.
So my CPU supports only 2 concurrent threads; if I purchase an AMD CPU with 8 or 12 cores, can I achieve higher performance?
Multi-Tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only part of the application can be run in parallel (there is always a first part that splits the work into multiple tasks). So the first question is: how much of the work can be done in parallel, and how much of it needs to be synchronized? (In some cases you can stop here, because so little can be done in parallel that the whole effort isn't worth it.)
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situations). Here we need to synchronize access to those data/resources. If all tasks need write access to the same object during the WHOLE process, then we cannot work in parallel at all.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's law states, in a nutshell, that the performance boost you receive from parallel execution is limited by the part of your code that must run sequentially.
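Concretely, if a fraction p of the work can run in parallel on n cores, the maximum speedup is

S(n) = 1 / ((1 - p) + p / n)

So with p = 0.9 on 4 cores the speedup is at most 1 / (0.1 + 0.9/4) ≈ 3.1x, and even with infinitely many cores it can never exceed 1 / 0.1 = 10x.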
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores. This becomes more complicated when threads must wait for work (i.e. block on IO), but in general you want to keep each core as busy as possible running your program, not switching between threads.
Unless you absolutely need to use raw threads and sync primitives, try using a task scheduler or a parallel-algorithms library to parallelize your work. Examples would be Intel TBB, Thrust, or Apple's libdispatch; a small TBB sketch follows.
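As a small, hedged illustration of the library approach, here is what a data-parallel loop looks like with TBB (assumes TBB is installed; link with -ltbb):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);
    // TBB splits the range across a worker pool sized to the machine,
    // so you never touch threads or locks directly.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0;
                      });
}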
I have seen it said in some posts that to use multiple processor cores you should use the Boost thread library (i.e. multi-threading). Usually threads are not visible to the operating system, so how can we be sure that multi-threading will make use of multiple cores? Is there a difference between Java threads and Boost threads?
The operating system is also called a "supervisor" because it has access to everything. Since it is responsible for managing preemptive threads, it knows exactly how many a process has, and can inspect what they are doing at any time.
Java may add a layer of indirection (green threads) to make many threads look like one, depending on the JVM and its configuration. Boost does not do this; it only wraps the POSIX interface, which usually communicates directly with the OS kernel.
Massively multithreaded applications may benefit from coalescing threads so that the number of ready-to-run threads matches the number of logical CPU cores. Reducing everything to one thread may be going too far, though :v) and @Voo says that green threads are only a legacy technology, so a good JVM should support true multithreading; check your configuration options. On the C++ side, there are libraries like Intel TBB and Apple GCD to help manage parallelism.
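A minimal sketch, assuming Boost.Thread is installed (link with -lboost_thread): each boost::thread is a real kernel thread, so the OS scheduler is free to place them on separate cores:

#include <boost/thread.hpp>
#include <iostream>

int main() {
    // Logical cores visible to the OS; each boost::thread maps to a kernel thread.
    unsigned n = boost::thread::hardware_concurrency();
    std::cout << n << " hardware threads available\n";
    boost::thread_group group;
    for (unsigned i = 0; i < n; ++i)
        group.create_thread([i] { /* per-core work would go here */ (void)i; });
    group.join_all();
}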
I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Suppose I have a loop that iterates over a large number of data chunks arriving from the DB, and for each one calls a deterministic function with no side effects. How would I make the program not wait for the function to return, but instead set the next calls going, so they can be processed in parallel? A naive approach demonstrating the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .NET is an option, Microsoft has put a lot of effort into its Parallel Computing platform.
If you still plan on using Python, you might want to have a look at the processing package (the predecessor of the standard-library multiprocessing module). It uses processes rather than threads for parallel computing (sidestepping the Python GIL) and provides classes for distributing "work items" onto several processes. Using the Pool class, you can write code like the following:
import processing  # pre-2.6 package name; it entered the standard library as multiprocessing

def worker(i):
    return i * i

num_workers = 2
pool = processing.Pool(num_workers)
# imap returns the results lazily, in order, as the workers finish
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap that distributes the calls across processes. You can also use the pool's apply_async method and store the lazy result objects in a list:
results = []
for i in range(10000):
    # args must be a tuple; apply_async returns an AsyncResult immediately
    results.append(pool.apply_async(worker, (i,)))
values = [r.get() for r in results]  # get() blocks until each result is ready
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork() where available, so you have to be careful on Win32 (guard your entry point with if __name__ == '__main__')
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce paper without having physically separate machines. Just treat each of those "machines" as a thread; the OS automatically distributes threads across the cores of a multi-core machine.
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small thread pool with N threads in it and have one master thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free, wait
Hand over chunk to worker thread
Go back and get next chunk from DB
Meanwhile, the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The way you implement this can be as simple as two mutex-controlled arrays: one holds the worker threads (the thread pool), and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
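A minimal sketch of the scheme, written with C++ standard threads (thin wrappers over pthreads on POSIX); for simplicity the free/busy bookkeeping is replaced by a mutex-protected work queue plus a condition variable, and fetching/processing chunks are placeholders:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<int> work;             // pending chunks (placeholder payload)
std::mutex m;
std::condition_variable cv;
bool done = false;

void processChunk(int chunk) { /* placeholder: reduce the chunk here */ }

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return done || !work.empty(); });
        if (work.empty()) return; // done and queue drained
        int chunk = work.front();
        work.pop();
        lock.unlock();
        processChunk(chunk);      // heavy work happens outside the lock
    }
}

int main() {
    const int N = 4;              // tweak N to your liking
    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i) pool.emplace_back(worker);

    for (int chunk = 0; chunk < 100; ++chunk) {  // stand-in for "get chunk from DB"
        { std::lock_guard<std::mutex> lock(m); work.push(chunk); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : pool) t.join();
}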
If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code so that certain loops are parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc 4.2 will support OpenMP, for example.
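A minimal sketch of the loop annotation (compile with -fopenmp on gcc/clang); the array and the work inside the loop are placeholders:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> data(n);
    // The pragma asks the compiler to split these independent
    // iterations across a team of threads.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] = i * 0.5;
    printf("ran with up to %d threads\n", omp_get_max_threads());
}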
A similar thread-pool pattern is used in Java. In the distributed variants, it is the tasks (not the threads themselves) that are serialised, sent to other machines, and deserialised to run there.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but has not yet been accepted as an official one. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might also be of interest to you.