kill mapper with error file in mapreduce

Hello, I need to kill the mapper that is processing a bad file.
When a file is parsed in the mapper and it turns out to contain errors, the current mapper should be killed and the file discarded. The other mappers with correct files should continue, and the entire job should still be successful.

Depending on the file input format, the mapper may or may not be processing an entire file at a time.
Generally each mapper processes one split at a time.
If there is any parsing issue you can simply skip the offending records and return from the mapper. There is no need to kill the mapper; if you kill it, the MapReduce job will fail.
So, just skip the processing and return from the mapper.
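If it helps to see the idea concretely, the sketch below shows the skip-and-continue pattern as a Hadoop Streaming mapper written in C++ rather than the Java Mapper API the question implies; the tab-separated record layout is an assumption for illustration only.

    // Hypothetical Hadoop Streaming mapper: skip bad records, never abort.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream fields(line);
            std::string key, value;
            // Assumed layout: "key<TAB>value"; anything else is a bad record.
            if (!std::getline(fields, key, '\t') || !std::getline(fields, value, '\t')) {
                continue;  // skip the bad record instead of killing the task
            }
            std::cout << key << '\t' << value << '\n';
        }
        return 0;  // a non-zero exit would fail the map task and the whole job
    }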

Related

How to get correct file size only on the completion of a detected file change, not at the beginning?

I'm using libuv's uv_fs_event_t to monitor file changes, and once a change is detected, I open the file in the uv_fs_event_cb callback.
However, my program also needs the full file size when opening the file, so that I know how much memory to allocate. I found that whether I use libuv's uv_fs_fstat, POSIX's stat/stat64, or fseek+ftell, I never get the correct file size immediately, because the file is still being written when my program opens it.
My program runs in a tight single thread with callbacks, so delay/sleep isn't the best option here (and offers no guaranteed correctness either).
Is there any way to handle this, with or without libuv, so that I can hold off opening and reading the file until the write to it has completed? In other words, instead of immediately detecting the start of a change to a file, can I somehow detect the completion of a file change?
One approach is to have the writer create an intermediate file and finish its I/O by renaming it to the target file, as sketched below. This is what happens in most browsers: the file keeps a "downloading.tmp" name until the download is complete, to discourage you from opening it.
Another approach is to write/touch a "finished" marker file after writing the main target file, and have the reader wait to see that file before starting its job.
The last option I can see, if the file format can be altered slightly: have the writer put the file size in the first bytes of the file. The reader can then preallocate correctly even if the file is not fully written, and it then insists on reading all of the data.
Overall I'm suggesting that instead of a completion event, you make the writer produce some event that can be monitored only after it has completed its task, and have the reader wait/synchronize on that event.
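Here is a minimal sketch of the writer side of the rename approach, assuming the writer can be modified; the temporary file name is a placeholder. rename is atomic on POSIX as long as source and target are on the same filesystem, so the watcher only ever sees the complete file appear under its final name.

    #include <cstddef>
    #include <cstdio>
    #include <fstream>

    void write_atomically(const char* target, const char* data, std::size_t len) {
        const char* tmp = "output.tmp";                 // assumed temporary name
        {
            std::ofstream out(tmp, std::ios::binary);
            out.write(data, static_cast<std::streamsize>(len));
        }                                               // closed (and flushed) here
        std::rename(tmp, target);                       // the fs_event watcher now
                                                        // sees a fully written file
    }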

Co-ordinating batch processing in Akka

I need to batch process 2 large files one after the other in Akka and I'm trying to figure out the best way to co-ordinate that in a controlling actor. The lines in each file can be processed in parallel but all of the lines from the first file must be processed before any of the lines from the second file can be processed.
I was thinking of having the following actors:
File1WorkerActor - Processes a single line from the first file.
File2WorkerActor - Processes a single line from the second file.
File1Actor - Delegates the lines from the first file to multiple worker actors.
File2Actor - Delegates the lines from the second file to multiple worker actors.
TopLevelActor - Asks File1Actor to process file 1, waits for it to complete then asks File2Actor to process file 2.
The thing I'm not sure about is, how do the file actors know when all the workers have finished and how does the TopLevelActor know when File1Actor is finished?
I was thinking that the FileActor would just hold a counter for the number of lines in a given file, and the workers would send a message back for each processed line. Once the counter counts down to zero, it would send a message back to TopLevelActor. Is there any problem with this approach? Or would it be better to implement some kind of Future handling?
Your solution sounds correct to me. You may also be interested in checking out FSM and/or become/unbecome functionality, to avoid submitting another task to the workers while the previous task is not yet completed.

update a file simultaneously without locking file

Problem: multiple processes want to update a file simultaneously. I do not want to use file locking, because in a highly loaded environment a process may block for a while, which I don't want. I want something like this: all processes send data to a queue or some shared place, and one master process keeps taking data from there and writing it to the file, so that no process gets blocked.
One possibility is socket programming: all the processes send data to a single port, and the master keeps listening on this port and stores the data to the file. But what if the master goes down for a few seconds? If that happens, I could write to some file based on a timestamp and sync it later. But I am putting this on hold and looking for some other solution (no data loss).
Another possibility may be taking a lock on the particular segment of the file the process wants to write to; basically each process would write a line. I am not sure how well that would work on a highly loaded system.
Please suggest a solution for this problem.
Have a 0mq instance handle the writes (as you initially proposed for the socket) and have the workers connect to it and add their writes to the queue (example in many languages).
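A minimal sketch of that idea, using the plain libzmq C API from C++; the endpoint, the PUSH/PULL socket pair and the file name are assumptions. Workers would open a ZMQ_PUSH socket, zmq_connect to the same endpoint and zmq_send each record, so they never touch the file directly.

    #include <zmq.h>
    #include <cstdio>

    int main() {
        void* ctx  = zmq_ctx_new();
        void* pull = zmq_socket(ctx, ZMQ_PULL);
        zmq_bind(pull, "tcp://127.0.0.1:5555");          // workers connect here

        std::FILE* out = std::fopen("shared.log", "a");  // only this process writes
        if (!out) return 1;
        char buf[4096];
        for (;;) {
            int n = zmq_recv(pull, buf, sizeof(buf) - 1, 0);
            if (n < 0) break;                            // interrupted / shutting down
            if (n > static_cast<int>(sizeof(buf)) - 1)   // oversized message was
                n = sizeof(buf) - 1;                     // truncated by zmq_recv
            buf[n] = '\0';
            std::fprintf(out, "%s\n", buf);
            std::fflush(out);
        }
        std::fclose(out);
        zmq_close(pull);
        zmq_ctx_destroy(ctx);
        return 0;
    }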
Each process can write to its own file (pid.temp) and periodically rename it (pid-0.data, pid-1.data, ...) for a master process that grabs all these files.
You may not need to construct anything like this. If you do not want processes to get blocked, just use the LOCK_NB flag of Perl's flock. Periodically try to flock; if it does not succeed, continue processing and store the values in an array. Once the lock is obtained, write the buffered data from the array to the file.
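The same non-blocking idea from C/C++ would look roughly like this (flock(2) is what Perl's flock wraps on most systems; the file name and record type are placeholders):

    #include <fcntl.h>
    #include <sys/file.h>      // flock, LOCK_EX, LOCK_NB, LOCK_UN
    #include <unistd.h>
    #include <string>
    #include <vector>

    std::vector<std::string> pending;     // records buffered while the file is busy

    void try_write(const std::string& line) {
        pending.push_back(line);
        int fd = open("shared.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0) return;
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {         // did not block
            for (const std::string& s : pending)
                write(fd, (s + "\n").data(), s.size() + 1);
            pending.clear();
            flock(fd, LOCK_UN);
        }                                                // else: keep buffering
        close(fd);
    }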

Multi threaded application in C++

I'm working on a multi-threaded application written in C++. It uses some temporary files to pass data between my threads. One thread writes the data to be processed into files in a directory. Another thread scans the directory for work files, reads them, processes them further, then deletes those files. I have to use these files, because if my app gets killed I have to retain the data which has not been processed yet.
But I hate using multiple files. I just want to use a single file: one thread continuously writing to it, and the other thread reading and then deleting the data which has been read.
Like a vessel that is filled from the top while I take and remove the data from the bottom. How can I do this efficiently in C++? First of all, is there even a way?
As was suggested in the comments to your question, using a database like SQLite may be a very good solution.
However, if you insist on using a file then this is of course possible.
I did it myself once - created a persistent queue on disk using a file.
Here are the guidelines on how to achieve this:
The file should contain a header which points to the next unprocessed record (entry) and to the next available place to write to.
If the records have variable length then each record should contain a header which states the record length.
You may want to add to each record a flag that indicates whether the record was processed.
File locking can be used to ensure no one reads from the portion of the file that is being written to.
Use low-level I/O - don't use buffered streams of any kind; use direct write semantics.
And here are the schemes for reading and writing (probably with some small logical bugs, but you should be able to take it from there):
READER
Lock the file header, read it, and unlock it.
Go to the position of the next unprocessed record.
Read the record header and the record.
Write the record header back with the processed flag turned on.
If you are not at the end of the file, lock the header and write the new location of the next unprocessed record; otherwise write some marker to indicate there are no more records to process.
Make sure that the next record to write points to the correct place.
You may also want the reader to compact the file for you once in a while:
Lock the entire file.
Copy all unprocessed records to the beginning of the file (you may want some logic so as not to overwrite your unprocessed records - maybe compact only if the processed space is larger than the unprocessed space).
Update the header.
Unlock the file.
WRITER
Lock the header of the file, see where the next record is to be written, then unlock it.
Lock the file from the place to be written for the length of the record.
Write the record and unlock.
Lock the header; if the unprocessed-record marker indicates there are no records to process, make it point to the new record; unlock the header.
Hope this sets you on the write track
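A minimal C++ sketch of that layout, using low-level POSIX I/O as suggested; locking, the per-record processed flag, compaction and error handling are all omitted, and the structures are illustrative assumptions rather than a finished design:

    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdint>
    #include <string>

    struct FileHeader {            // lives at offset 0 of the queue file
        uint64_t next_read;        // offset of the next unprocessed record
        uint64_t next_write;       // offset of the next free place to write
    };

    struct RecordHeader {
        uint32_t length;           // payload length (records have variable length)
    };

    void append_record(int fd, const std::string& payload) {
        FileHeader h;
        pread(fd, &h, sizeof h, 0);
        RecordHeader rh{static_cast<uint32_t>(payload.size())};
        pwrite(fd, &rh, sizeof rh, static_cast<off_t>(h.next_write));
        pwrite(fd, payload.data(), payload.size(),
               static_cast<off_t>(h.next_write + sizeof rh));
        h.next_write += sizeof rh + payload.size();
        pwrite(fd, &h, sizeof h, 0);                 // publish the new write position
        fsync(fd);
    }

    bool pop_record(int fd, std::string& payload) {
        FileHeader h;
        pread(fd, &h, sizeof h, 0);
        if (h.next_read >= h.next_write) return false;   // nothing to process
        RecordHeader rh;
        pread(fd, &rh, sizeof rh, static_cast<off_t>(h.next_read));
        payload.resize(rh.length);
        pread(fd, &payload[0], rh.length, static_cast<off_t>(h.next_read + sizeof rh));
        h.next_read += sizeof rh + rh.length;
        pwrite(fd, &h, sizeof h, 0);                 // advance the unprocessed pointer
        fsync(fd);
        return true;
    }

The file would be created once with next_read == next_write == sizeof(FileHeader), and the writer and reader threads would still need to serialize access to the header, either with an in-process mutex or with the file locking mentioned above.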
The Win32 API function CreateFileMapping() enables processes to share data: multiple processes can use memory-mapped files backed by the system paging file.
A few good links:
http://msdn.microsoft.com/en-us/library/aa366551(VS.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx
http://www.codeproject.com/Articles/34073/Inter-Process-Communication-IPC-Introduction-and-S
http://www.codeproject.com/Articles/19531/A-Wrapped-Class-of-Share-Memory
http://www.bogotobogo.com/cplusplus/multithreaded2C.php
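As a minimal sketch (the mapping name and size are assumptions, error handling omitted), a named, paging-file-backed mapping looks like this; another process sees the same bytes after calling OpenFileMappingA and MapViewOfFile with the same name:

    #include <windows.h>
    #include <cstring>

    int main() {
        const DWORD kSize = 4096;
        // Backed by the system paging file (INVALID_HANDLE_VALUE), visible to
        // any process that opens the same name.
        HANDLE mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                            PAGE_READWRITE, 0, kSize,
                                            "Local\\MySharedBlock");
        char* view = static_cast<char*>(
            MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, kSize));

        std::strcpy(view, "hello from the writer process");
        // ... readers map the same name and see this string ...
        UnmapViewOfFile(view);
        CloseHandle(mapping);
        return 0;
    }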
You can write the data line by line, with a delimiter or flag on each line indicating whether that record has been processed or not.

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multiple-process system (pre-fork model, similar to Apache). All processes write to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs (say, log.bin.10).
opening a new log.bin file.
The problem is that the other processes are unaware and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the opened file stream, and somehow detect that the file was renamed under them and reopen the log.bin file.
None of those is very elegant in my opinion.
Thoughts? Recommendations?
Your solution seems fine, but you should store the inode number of the current logging file in shared memory (see stat(2) and its st_ino member).
This way, each process keeps a local variable with the inode of the file it has opened.
The shared variable is updated only by the one process that performs the rotation, and every other process becomes aware of it by checking for a difference between its local inode and the shared inode, which should trigger a reopen.
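A minimal sketch of that check, performed under the existing shared-memory lock before each write; the pointer into shared memory and the file name are placeholders, and the rotating process is assumed to store the st_ino of the freshly created log.bin into the shared variable right after reopening it:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int   log_fd    = -1;    // this process's descriptor for log.bin
    ino_t local_ino = 0;     // inode of the file this process has open

    void reopen_if_rotated(const ino_t* shared_ino /* lives in shared memory */) {
        if (log_fd < 0 || local_ino != *shared_ino) {
            if (log_fd >= 0) close(log_fd);
            log_fd = open("log.bin", O_WRONLY | O_APPEND | O_CREAT, 0644);
            struct stat st;
            fstat(log_fd, &st);
            local_ino = st.st_ino;    // remember which file we actually opened
        }
    }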
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
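A minimal sketch of that sequence; lock() and unlock() stand in for the existing shared-memory lock and are assumed to be provided elsewhere:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>

    void lock();     // the existing shared-memory lock (assumed, defined elsewhere)
    void unlock();

    void log_entry(const char* buf, std::size_t len) {
        lock();                                           // serialize with rotation
        int fd = open("log.bin", O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd >= 0) {
            write(fd, buf, len);    // always lands in the current log.bin,
            close(fd);              // even immediately after a rotation
        }
        unlock();
    }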
Or you could create a dedicated logging process, which receives log messages from the other processes and handles all the rotating transparently to them.
You don't say what language you're using, but your processes should all log to a logging process, and the logging process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
That way the other processes can still write through their old file pointer, which now refers to the freshly emptied file.
See also man logrotate:
copytruncate
Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
Since you're using shared memory, and assuming you know how many processes are using the log file, you can create an array of flags in shared memory telling each of the processes that the file has been rotated. Each process then resets its own flag so that it doesn't re-open the file continuously.
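A minimal sketch of those flags; NUM_PROCESSES, each process's index and the placement of the structure inside the existing shared-memory segment are assumptions:

    #include <fcntl.h>
    #include <unistd.h>

    const int NUM_PROCESSES = 8;           // assumed to be known up front

    struct SharedState {                   // lives in the existing shared memory
        int rotated[NUM_PROCESSES];        // 1 = "you must reopen log.bin"
    };

    // Called by the process that performed the rotation (under the shared lock).
    void announce_rotation(SharedState* s) {
        for (int i = 0; i < NUM_PROCESSES; ++i) s->rotated[i] = 1;
    }

    // Called by every process before writing (under the shared lock).
    void maybe_reopen(SharedState* s, int my_index, int* log_fd) {
        if (s->rotated[my_index]) {
            if (*log_fd >= 0) close(*log_fd);
            *log_fd = open("log.bin", O_WRONLY | O_APPEND | O_CREAT, 0644);
            s->rotated[my_index] = 0;      // reset so we don't reopen every time
        }
    }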