Fault tolerance in MapReduce when a map worker fails - mapreduce

Recently I was reading Google's paper, "MapReduce: Simplified Data Processing on Large Clusters". The passage below confuses me. It says:
When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the reexecution. Any reduce task that has not already read the data from worker A will read the data from worker B.
I guess the workers executing reduce tasks are just doing what they should do. If they have read data from worker A, they can continue their tasks. If they haven't, they fail the task and report an error to the master. Then the master can reassign the reduce task to other workers after worker B finishes. So why should they be notified of the re-execution immediately? I think it's unnecessary for the reducers that have already read the data they want from worker A.

So why should they be notified of the re-execution immediately? I think
it's unnecessary for the reducers that have already read the data they
want from worker A.
The thing is, reducers cannot know that they have already read all the data they want from a mapper, because that mapper failed and never finished writing its data.
Reducers may start reading early, before the mapper completes, and thus read only partial data. The mapper could have produced more data had it not failed.
The mapper produced partial result files, then failed, and a new attempt was started.
Typically, mappers and reducers are single-threaded and deterministic, which is what allows restarts and speculative execution. That assumes you do not use non-deterministic functions like rand() or multi-threading in the mapper (a custom non-deterministic mapper). The network/shuffle also adds non-determinism: a multi-core or multi-threaded mapper can produce differently ordered output after a restart. Mappers can even consume the output of other mappers or reducers (for example, a map-side join in modern implementations). The whole result must be deterministic to make restarts possible, but the order need not be; the same records can be grouped differently into a different number of files.
If the reducer is deterministic and commutative (typically yes), you can restart it and get the same result; because it is commutative, the order of rows is not a problem.
But is it possible to mix partial results from one mapper instance (the failed one) with partial results from another (the new attempt), e.g. read files 0000-0004 from Map1_attempt1 and files 0005-0006 from Map1_attempt2? Only if the mapper always produces exactly the same files in the same order. So even though the whole result of a mapper must be deterministic, a partial result may not be. It depends on the implementation.
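A tiny self-contained C++ sketch of that last point (illustrative only, not MapReduce code): two threads emit the same multiset of records on every run, but the interleaving, and therefore any prefix of the output, depends on scheduling:

#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<std::string> output;
    std::mutex m;
    auto emit = [&](const std::string& prefix) {
        for (int i = 0; i < 3; ++i) {
            std::lock_guard<std::mutex> lock(m);
            output.push_back(prefix + std::to_string(i));
        }
    };
    std::thread t1(emit, std::string("a"));
    std::thread t2(emit, std::string("b"));
    t1.join();
    t2.join();
    // The six records are the same on every run, but their order -- and
    // therefore how a prefix of them would be split into numbered output
    // files -- may differ between runs.
    for (const auto& r : output) std::cout << r << "\n";
}

Run it twice and the first few lines may differ even though the full (sorted) output is identical; the same applies to the partial files of a failed and re-executed multi-threaded mapper.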

Related

Threading - The fastest way to handle recurring threads?

I am writing my first threaded application for an industrial machine that has a very fast line speed. I am using MFC for the UI, and once the user pushes the "Start" machine button, I need to be executing three operations simultaneously. I need to collect data, process it, and output results very quickly, while also checking whether the user has turned the machine "off". When I say very quickly, I expect the analyze portion of the execution to take the longest, and it needs to happen in well under a second. I am mostly concerned with eliminating the overhead associated with threads. What is the fastest way to implement the loop below:
void Scanner(CString& m_StartStop) {
    std::thread Collect(CollectData);
    while (m_StartStop == "Start") {
        Collect.join();
        std::thread Analyze(AnalyzeData);
        Collect = std::thread(CollectData);  // start collecting the next part
        Analyze.join();
        std::thread Send(SendData);
        Send.join();
    }
    Collect.join();
}
I realize this sample is likely way off base, but hopefully it gets the point across. Should I be creating three threads and suspending them instead of creating and joining them over and over? Also, I am a little unclear whether the UI needs its own thread, since the user needs to be able to pause or stop the line at any time.
In case anyone is wondering why this needs to be threaded as opposed to sequential, the answer is that the line speed of the machine means data for the second part must be collected while the first part is being analyzed. Every 1 second equates to 3 ft of linear part movement down this machine.
Think about the functional problem before thinking about the implementation.
So we have a continuous flow of data that needs to be collected, analyzed, and sent elsewhere, with a supervision point to be able to stop or pause the process.
collection is limited by the input flow
analysis should be CPU-bound only
sending should be I/O-bound
You just need to make sure that collection remains the slowest (pacing) stage.
That is a correct use case for threads. The implementation could use:
a pool of input buffers that are filled by the collect task and consumed by the analyze task
one thread that continuously:
checks whether it should exit (a dedicated variable)
takes an input object from the pool
fills it with data
passes it to the analyze task
one thread that continuously:
waits for whichever comes first: an input object from the collect task or a request to exit
analyzes the object and prepares the output
sends the output
Optionally, you can have a separate thread for processing the output. In that case, the last line becomes:
passes an output object to the sending task
and we must add:
one thread that continuously:
waits for whichever comes first: an output object from the analyze task or a request to exit
sends the output
And you must provide a way to signal the request for pause or exit, either with a completely external program and a signalling mechanism, or with a GUI thread. A sketch of this pipeline follows.
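Here is a minimal sketch of that pipeline in standard C++ (Buffer, CollectData, AnalyzeData and SendData are placeholders for the machine-specific code, and the bounded buffer pool is simplified to an unbounded queue):

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Buffer { /* raw samples from one part */ };

std::atomic<bool> stop{false};   // set by the GUI to pause or exit

template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    // Returns false once `stop` is set and the queue is drained.
    bool pop(T& item) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return stop.load() || !q_.empty(); });
        if (q_.empty()) return false;
        item = std::move(q_.front());
        q_.pop();
        return true;
    }
    void wake() { cv_.notify_all(); }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

BlockingQueue<Buffer> collected, analyzed;

void collector() {                  // paced by the input flow
    while (!stop) {
        Buffer b;
        // CollectData(b);           // fill the buffer from the line
        collected.push(std::move(b));
    }
    collected.wake();                // let the analyzer observe `stop`
}

void analyzer() {                   // CPU-bound stage
    Buffer b;
    while (collected.pop(b)) {
        // AnalyzeData(b);
        analyzed.push(std::move(b));
    }
    analyzed.wake();                 // let the sender observe `stop`
}

void sender() {                     // I/O-bound stage
    Buffer b;
    while (analyzed.pop(b)) {
        // SendData(b);
    }
}

The GUI thread starts the three threads once; to stop, it sets stop = true and joins them (each stage wakes the next as it exits).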
Any threads you need should already be running, waiting for work. You should not create or join threads.
If job A has to finish before job B can start, the completion of job A should trigger the start of job B. That is, when the thread doing job A finishes, it should either do job B itself or trigger the dispatch of job B. There should not need to be some other thread waiting for job A to finish just so it can start job B.
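A trivially small illustration of that dispatch pattern (a single-threaded stand-in for a real pool; the names are made up for the example):

#include <functional>
#include <iostream>
#include <queue>

// Stand-in for a pool: a queue of jobs drained by one long-lived worker.
// The point is only the dispatch pattern, not the threading machinery.
std::queue<std::function<void()>> work;

void jobB() { std::cout << "B runs after A\n"; }

void jobA() {
    std::cout << "A done\n";
    work.push(jobB);   // A's completion triggers B's dispatch; nothing joins
}

int main() {
    work.push(jobA);
    while (!work.empty()) {          // the worker's service loop
        auto job = std::move(work.front());
        work.pop();
        job();
    }
}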

CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes processing ArticleEditCommand messages concurrently, even though the commands are taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a SQL transaction to update an article where version = expectedVersion and note the number of records changed, but my rules are more complex and can't live solely in SQL. I would like the entire logic of command processing to be guarded against concurrency between ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to let the user know that there is a pending operation on that article that must complete before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.
First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually solved not by message brokers but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking, which can be handled trivially in SQL, contradicts complicated business logic/updates; maybe you could elaborate more?
It's actually quite simple and I did that. Basically, it looks like this (pseudocode):
// message handler
ModelTools.TryUpdateEntity(() =>
{
    var entity = _repo.Get(myId);
    entity.Do(whateverCommand);
    _repo.Save(entity);
}, 10); // retry up to 10 times before giving up

// repository
long? _version;

public MyObject Get(Guid id)
{
    // query data and version
    _version = data.version;
    return data.ToMyObject();
}

public void Save(MyObject data)
{
    // update the row in the db where version = _version.Value
    if (rowsUpdated == 0)
    {
        // things have changed since we retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools general-purpose library (available on NuGet).
The idea is to try doing things normally; if the object version (rowversion/timestamp in SQL) has changed, the whole operation is retried after waiting a couple of milliseconds. That is exactly what the TryUpdateEntity() method does, and you can tweak how long to wait between tries and how many times to retry the operation.
If you need to notify the user, then forget about retrying; just catch the exception directly and tell the user to refresh or something.
Partition-based solution
Achieve node stickiness by routing each incoming command based on the object's ID (e.g. articleId modulo your number of nodes) to make sure the commands of User 1 and User 2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one, or, if you want to parallelize execution, partition the commands on something like the ID, odd/even, by country, or similar. A minimal sketch of the routing function follows.
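For illustration only (hypothetical names; note that changing the number of nodes remaps every article):

#include <cstddef>
#include <iostream>

// All commands for the same articleId land on the same node, so that
// node can process them one at a time.
std::size_t nodeFor(long articleId, std::size_t numNodes) {
    return static_cast<std::size_t>(articleId) % numNodes;
}

int main() {
    std::cout << nodeFor(42, 4) << "\n";   // always node 2 for article 42
}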
Grid-based solution
Use an in-memory grid (e.g. Hazelcast or Coherence) and use a distributed executor service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself whether it is really a problem if User 2's command is accepted and User 1 gets a concurrency error back. As long as User 1's changes are not lost and can be re-applied after a refresh of the article, it might be perfectly fine.

Unbalanced load (v2.0) using MPI

(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs, feeding 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
1 CPU 2 CPU 3 CPU 4 CPU
BUT, it appears that each cell has a different evaluation time; some cells are evaluated very quickly, and some are not.
So, instead of wasting "relaxed CPU" time, I am thinking of feeding ONE cell to EACH CPU at a time and continuing until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
1cpu 2cpu 3cpu 4cpu
if 2cpu finishes its job at cell "2", it can jump to the first empty cell "5" and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
1cpu 3cpu 4cpu 2cpu
|-------------->
if 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
3cpu 4cpu 2cpu 1cpu
|------------------------>
and so on, until the full array is done.
QUESTION:
I do not know a priori which cells are "quick" and which are "slow", so I cannot distribute CPUs according to the load (more CPUs for slow cells, fewer for quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach to divide the entire job into chunks, with MPI-IO:
given: array[NNN] and nprocs - number of available working units:
for (int i = 0; i < NNN/nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (Not to completely re-write in terms of Master/Slave paradigm) in such a way, that each CPU will get only ONE iteration (and not NNN/nprocs) and after it completes its job and writes its part to the file, will Continue to the next cell and not to relax.
Thanks!
There is a well known parallel programming pattern, known under many names, some of which are: bag of tasks, master / worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but until all cells have been processed. Initially it sends each worker a cell and then starts the loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the same worker that returned the result. Otherwise it sends a message with the tag set to the termination value.
There are many many many readily available implementations of this model on the Internet and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool, if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master gives each process some work to do, then sits and waits until a process completes (using nonblocking receives and a wait_all). Once a process completes, have it send the data to the master, then wait for the master to respond with more work. Continue this until the work is done. A minimal sketch of this scheme follows.
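A minimal MPI bag-of-tasks sketch along those lines (the payloads are illustrative: one int cell index out, one double result back; it assumes at least as many cells as workers, and real code would also store or write the results, e.g. with MPI_File_write):

#include <mpi.h>

const int TAG_WORK = 1;
const int TAG_STOP = 2;

// Stand-in for the real per-cell computation.
double do_what_I_need(int cell) { return cell * 2.0; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    const int NNN = 12;                     // number of cells

    if (rank == 0) {                        // master: hand out cells one by one
        int next = 0;
        for (int r = 1; r < nprocs; ++r) {  // prime every worker (assumes NNN >= nprocs-1)
            MPI_Send(&next, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
            ++next;
        }
        for (int done = 0; done < NNN; ++done) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            // ... store `result` (workers could also send back the cell index) ...
            if (next < NNN) {               // more work: reuse the now-idle worker
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {                        // no more cells: tell it to stop
                int dummy = -1;
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                                // worker: loop until told to stop
        for (;;) {
            int cell;
            MPI_Status st;
            MPI_Recv(&cell, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = do_what_I_need(cell);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
}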

Why use a semaphore for each processor in a spooling simulation

I'm working on a project that simulates multiple processors handling commands and queuing strings to be printed via one spooler.
There are up to ten processors, each executing a series of jobs that have "compute" and "print" statements. Compute is just a mathematical process that takes up time to simulate other work, while print transfers a short string to the spooler to be printed. There is one spooler, with one printer hooked up to it. Each processor will handle a number of jobs before terminating, all print statements from a specific job on a specific processor should print together (no interleaving of printing from individual jobs), and the spooler should never be blocked by a process that is computing.
I generally understand how to code this using semaphore and mutex structures, but a statement in the specifications confused me:
Try to maximize the concurrency of your system. (You might consider using
an array of semaphores indexed by processor id.)
Is there a specific advantage I'm missing to using a semaphore for each individual processor?
If further clarification is needed, let me know--I tried to describe the problem in a concise way.
EDIT:
Another possibly important piece: each processor has a buffer that can hold up to ten strings for sending to the spooler. Could the semaphores for each processor be for waiting when the buffer is full?
EDIT 2:
A job can contain multiple compute and print statements mixed in with each other:
Job 1
Calculate 4
Print Foo
Calculate 2
Print Bar
End Job
Print statements within a job should all be printed in order (Foo and Bar should be printed sequentially without a print from another job/processor in between).
The important information is here:
(no interleaving of printing from individual jobs),
This implies a new Semaphore(1) (if you are using Java).
And
and the spooler should never be blocked on a process that is computing.
If you had a single semaphore that admits one party at a time for all processors, this last requirement would not be satisfied: an executing processor would have to wait for another to complete, even though they could run in parallel.
You can avoid this by creating a striped set of semaphores, indexed by processor ID, so that each thread/processor serializes its own printing without waiting for the other processors to complete.
Semaphore[] semaphores = new Semaphore[NUMBER_OF_PROCESSORS];
// initialize every index with new Semaphore(1)

semaphores[processorId].acquire();
// send this job's strings to the spooler
semaphores[processorId].release();

Beginner needs help with design/planning of report functionality in a C++ application

I'm stuck implementing reporting functionality in my log-parser application.
This is what I did so far:
I'm writing an application that reads log files and searches the strings for multiple regular expressions that can be defined in a user configuration file. For every so-called "StringPipe" definition parsed from the configuration, the main process spawns a worker thread that searches for a single regex. The more definitions the user creates, the more worker threads are spawned. The main function reads a bunch of log strings and then sends the workers off to process them, and so on.
Now I want every worker thread to report information about the number of matches it has found, how long it took, what it did with those strings, and so on. This information is used for exporting to CSV, writing to a DB, and so on.
I'm stuck at the point where I created a class "Report". This class provides member functions that are called by the worker threads so the Report class can gather the info needed to generate the report.
For that, my workers (which are boost::threads / functors) have to create a Report object on which they can call those reporting functions.
The problem is in my design: when a worker thread finishes its job, it is destroyed, and for the next bunch of strings a new instance of the worker functor is spawned, so it needs to create a new Report object.
This is a problem because I need some kind of container where every worker can store its reported info, and finally a global report that contains such info as how long the whole processing took, which worker was slowest, and so on.
I just need to collect all this info together, but how can I do this? Every time a worker stops, reports, and starts again, its Report object and its members are destroyed, so all the info from the previous work is gone.
How can I solve this problem or how is such a thing handled in general?
First, I would not spawn a new thread to do the RE searching and such. Rather, you almost certainly want a pool of threads to handle the jobs as they arise.
As far as retrieving and processing the results goes, it sounds like what you want are futures. The basic idea is that you create an object to hold the result of the computation, and a future to keep track of when the computation is complete. You can either wait for the result to be complete or register a callback to be called when the future completes. A sketch with standard C++ futures follows.
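A short sketch of the idea with standard C++ futures (the question uses boost::thread, and boost::future works the same way; WorkerReport and searchChunk are made-up stand-ins for the real per-worker report and regex search):

#include <future>
#include <string>
#include <vector>

struct WorkerReport {                       // per-chunk results (illustrative fields)
    int matches = 0;
    double seconds = 0.0;
};

// Stand-in for the real regex search over one bunch of log strings.
WorkerReport searchChunk(const std::vector<std::string>& lines) {
    WorkerReport r;
    r.matches = static_cast<int>(lines.size());
    return r;
}

int main() {
    std::vector<std::vector<std::string>> chunks(4, std::vector<std::string>(10));
    std::vector<std::future<WorkerReport>> futures;
    for (const auto& chunk : chunks)
        futures.push_back(std::async(std::launch::async, searchChunk, std::cref(chunk)));
    WorkerReport total;
    for (auto& f : futures) {               // get() blocks until that result is ready
        WorkerReport r = f.get();
        total.matches += r.matches;
        total.seconds += r.seconds;         // or track the slowest worker, etc.
    }
}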
Instead of having the worker thread create the report object, why not have the main thread create the empty report and pass a pointer to it to the worker thread when the worker is created? The worker thread then reports back when it has completed its report, and the main thread adds the data from that report to a main report.
So the worker thread never owns the actual report; it just populates the report's data fields and reports back to the main thread. A minimal sketch follows.
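A minimal sketch of that ownership scheme (names and fields are illustrative):

#include <thread>
#include <vector>

struct Report {                        // illustrative fields
    int matches = 0;
    long long micros = 0;
};

// The worker only fills in the slot the main thread gave it.
void workerThread(Report* out) {
    // ... search strings, time the work ...
    out->matches += 1;
}

int main() {
    const int nWorkers = 4;
    std::vector<Report> reports(nWorkers);     // owned by the main thread
    std::vector<std::thread> threads;
    for (int i = 0; i < nWorkers; ++i)
        threads.emplace_back(workerThread, &reports[i]);
    for (auto& t : threads) t.join();
    // The worker functors are gone now, but their data lives on in
    // `reports`; aggregate it into the global report here.
}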