Hi i have started to work on a project where i use parallel computing to separate job loads among multiple machines, such as hashing and other forms of mathematical calculations. Im using C++
it is running on a Master/slave or Server/Client model if you prefer where every client connects to the server and waits for a job. The server can than take a job and seperate it depending on the number of clients
1000 jobs -- > 3 clients
IE: client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since every machine are not likely (almost 100%) not going to have the same hardware, i cannot arbitrarily decide on a number of threads to run on every client.
i would like to know if one of you guys knew a way to evaluate the load a thread has on the cpu and extrapolate the number of threads that can be run concurently on the machine.
there are ways i see of doing this.
I start threads one by one, evaluating the cpu load every time and stop when i reach a certain prefix ceiling of (50% - 75% etc) but this has the flaw that ill have to stop and re-separate the job every time i start a new thread.
(and this is the more complex)
run some kind of test thread and calculate its impact on the cpu base load and extrapolate the number of threads that can be run on the machine and than start threads and separate jobs accordingly.
any idea or pointer are welcome, thanks in advance !
Related
Is there any way to run two tasks with the 8051 μC simultaneously? For example,
Task one
Delay 1 sec
P2.B2 = 1
Delay 1 sec
P2.B2 = 0
Task 2
If P1.B0 = 1
P2.B3=1
So at any time, press the switch connected to P2.0 is 1, LED at P2.3=ON, and P2.2 keeps LED at P2.2 blinking.
A task is something what is typically provided by the underlying OS. If you are running on a bare metal system without any OS, you have no tasks at the first point.
But your application can build its own tasks. The job is more or less easy. You have to build a scheduler, typically triggered by a hardware clock for task switching, create stacks for each of the tasks and some control structures for the maintenance of the tasks. As you have no MMU and no memory protection on bare metal systems like 8051, you simply can modify stack pointers to do the task switching.
That is exactly what a library like FreeRTos can do for you. There is a port for 8051 available as I know. Searching on the web returns a lot of links to 8051 FreeRtos. Maybe there are some more libraries which offering tasks for you.
But mostly the overhead of scheduling and all the administration effort is much to high. Running an endless loop which is doing some jobs by reading some kind of queues or flags is much easier and often the more efficient solution. Also running some jobs in interrupt service routines fits well to bare metal requirements.
I assume you are running on bare metal with no battery saving requirements. I assume you can now write a program, load it to your device and run it. What I suggest you do is roughly this.
This program should have a main loop, which in its most simple would be like this:
MAX_TIME is the largest possible value of system clock, should never be reached
task_table is table with
next execution time as system clock time (MAX_TIME means disabled)
function pointer
initialize task-table with the three tasks below
forever
for each task with time 0
set task time MAX_TIME (disable)
call task function (task probably enables itself or other task)
find a task with lowest non-zero time in task_queue
if task time is in past or now
set task time MAX_TIME (disable)
call task function (task probably enables itself or other task)
Time 0 tasks are checked separately, and then tasks with time, so that the time 0 tasks don't block each others or the tasks with time from ever being called. Same could be achieved in different ways, this is just an example.
Then your requirements really call for 3 "tasks":
task_p2_b2_0:
P2.B2 = 0
enable task task_p2_b2_1 at current_time + 1 second
task_p2_b2_1:
P2.B2 = 1
enable task task_p2_b2_0 at current_time + 1 second
task_p1_b0_poll:
If P1.B0 = 1
P2.B3=1
enable task task_p1_b0_poll at time 0 (or current time + 10 ms or whatever)
Future development: Above is for a small number of static tasks. Iterating up to... 5-10 item table is so fast that there is no point trying to optimize it. Once you have more tasks than that, you should consider using a priority heap to store the tasks. Then you could also consider making main loop sleep when it has nothing to do, and use interrupt to wake it up (timer interrupt, serial port interrupt, pin activation interrupt etc). Also, you could have different task types, such as tasks which are activated when there is some IO (button press, byte from serial port, whatever). Etc. At the upper end of this adding features is a complete operating system really, but for simple things what I wrote above is really enough.
I have a certain program that recieves input and returnes an output with a run-time of about 2 seconds,
Now, i want to run this program online on a server that can handle multiple connections (lets say up to 100k),
on each client-server session the program will launch,
the client will hand the server the program's input and will wait for the program to end to recieve the server's respond (program's output),
Lets say the server's host is a very powerful machine - e.g 16 cores,
Can this work or it is to much runtime for each client?
What is the maximum runtime this kind of program can have?
I'm posting this as an answer because it's too large to place as a comment.
Can this work? It depends. It depends because there are a lot of variables in this problem. Let's look at some of them:
you say it takes 2 seconds to compute a result. Where and how are those seconds spent? Is this pure computation or are you accessing a database, or the file system? Is this CPU bound or I/O bound? If you run computations for the full 2 seconds then you are consuming CPU which means that you can simultaneously serve only 16 clients, one per core. Are you hitting a database? Is this on the powerful server or on some other machine? If the database is the bottleneck than move this to the powerful server and have SSD drives on it.
can you improve processing for one client? What's more efficient, doing the processing on one core or spread it across all the cores? If you can parallelize, can you limit thread contention?
is CPU all you need? How about memory? Any backend service you access? Are those over the network? Do you have enough bandwidth?
related to memory, what language/platform are you using? Does it have a garbage collector? Do you generate a lot of object to compute a result? Does the GC kick in and pauses your application so it cleans up and compacts the memory? Do you allocate enough memory for the application to run?
can you cache responses and serve them to other clients or are responses custom to each client? Can you precompute the results and then just serve them to clients or can't you predict the inputs?
Have you tried running some performance tests and profile the application to see where hotspots might show up? Can you do something about them?
have you any imposed performance criteria? How many clients do you want to support simultaneously? Is 2 seconds too much? Can clients live with more? How much more? How many seconds does it mean an unacceptable response time?
do you need a big server to run this setup or smaller ones work better (i.e. scale horizontally instead of vertically)?
etc
Nobody can answer this for you. You have to do an analysis of your application, run some tests, profile it, optimize it, then repeat until you are satisfied with the results.
I implemented a simple http server link, but the result of the test (ab -n 10000 -c 100 http://localhost:8080/status) is very bad (look through the test.png in the previous link)
I don't understand why it doesn't work correctly with multiple threads.
I believe that, by default, Netty's default thread pool is configured with as many threads as there are cores on the machine. The idea being to handle requests asynchronously and non-blocking (where possible).
Your /status test includes a database transaction which blocks because of the intrinsic design of database drivers etc. So your performance - at high level - is essentially a result of:-
a.) you are running a pretty hefty test of 10,000 requests attempting to run 100 requests in parallel
b.) you are calling into a database for each request so this is will not be quick (relatively speaking compared to some non-blocking I/O operation)
A couple of questions/considerations for you:-
Machine Spec.?
What is the spec. of the machine you are running your application and test on?
How many cores?
If you only have 8 cores available then you will only have 8 threads running in parallel at any time. That means those batches of 100 requests per time will be queueing up
Consider what is running on the machine during the test
It sound like you are running the application AND Apache Bench on the same machine so be aware that both your application and the testing tool will both be contending for those cores (this is in addition to any background processes going on also contending for those cores - such as the OS)
What will the load be?
Predicting load is difficult right. If you do think you are likely to have 100 requests into the database at any one time then you may need to think about:-
a. your production environment may need a couple of instance to handle the load
b. try changing the config. of Netty's default thread pool to increase the number of threads
c. think about your application architecture - can you cache any of those results instead of going to the database for each request
May be linked to the usage of Database access (synchronous task) within one of your handler (at least in your TrafficShappingHandler) ?
You might need to "make async" your database calls (other threads in a producer/consumer way for instance)...
If something else, I do not have enough information...
(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs and feeding 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
1 CPU 2 CPU 3 CPU 4 CPU
BUT, it appears, that each cell has different evaluation time, some cells are evaluated very quickly, and some are not.
So, instead of wasting "relaxed CPU", I think to feed EACH cell to EACH CPU at time and continue until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
1cpu 2cpu 3cpu 4cpu
if, 2cpu finished his job at cell "2", it can jump to the first empty cell "5" and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
1cpu 3cpu 4cpu 2cpu
|-------------->
if 1cpu finished, it can take sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
3cpu 4cpu 2cpu 1cpu
|------------------------>
and so on, until the full array is done.
QUESTION:
I do not know a priori which cell is "quick" and which cell is "slow", so I cannot spread cpus according to the load (more cpus to slow, less to quick).
How one can implement such algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach, how to divide the entire job into chunks, with IO-MPI:
given: array[NNN] and nprocs - number of available working units:
for (int i=0;i<NNN/nprocs;++i)
{
do_what_I_need(start+i);
}
MPI_File_write(...);
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (Not to completely re-write in terms of Master/Slave paradigm) in such a way, that each CPU will get only ONE iteration (and not NNN/nprocs) and after it completes its job and writes its part to the file, will Continue to the next cell and not to relax.
Thanks!
There is a well known parallel programming pattern, known under many names, some of which are: bag of tasks, master / worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop but until all cells have been processed. Initially it sends each worker a cell and then starts a loop. In this loop it receives a message from any worker using the wildcard source value of MPI_ANY_SOURCE and if there are more cells to be processed, sends one of them to the same worker that have returned the result. Otherwise it sends a message with a tag set to the termination value.
There are many many many readily available implementations of this model on the Internet and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master will give each process some work to be done, then sit and wait until a process completes (using nonblocking receives and a wait_all). Once a process completes, have it send the data to the master then wait for the master to respond with more work. Continue this until the work is done.
The thing i want to ask from you is : i have several steps of a big source code (each step has a virtual computation time and virtual communication data,i am taking only virtual as i wants to model the latency and somehow i managed to measure them throughout the source code) . I need to test this by make the code sleep for the computation time and transferring the data equivalent to the communication data . Can you suggest some configuration models to the same ? My aim is to minimizing the overall execution time of the program and thus obviously i want to diminish the overhead that the process can have.
The simplest that strikes my mind are :
Do the computation on all processes and send the virtual data by making asynchronous call to the root processes
Do the same with Synchronous call .
Assume the communication time is linear with the communication data . Use some algorithm to divide the tasks formerly to each process (inspired from load balancing)
Start from the first task with the root process and sends data to the next process , do sleep on that process and show on.
can you please give me some idea or verify ,if this strategy makes a lot of difference ?