How likely are two processes to grab a "free" directory? - concurrency

If I have a multiprocess system that needs to process a bunch of directories, 1 directory per process, how likely is it that two processes will happen to grab the same directory?
Say I have dir/1 all the way to dir/99. I figure that if I touch a .claimed file in the dir that the process is working on, there won't be conflicts. Are there problems with my approach?
There's a bit more complexity. It's not only multi-process, but it's distributed across several computers.

I recall that directory creation is atomic, so your .claimed marker ought to be a directory rather than a file: mkdir() either succeeds or fails with EEXIST, with nothing in between. File creation can be made atomic too (open() with O_CREAT | O_EXCL), but that guarantee was historically unreliable over NFS, which matters here since your directories are shared across several machines; directory creation is the traditional portable choice.
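A minimal sketch of that claim step, assuming POSIX (the marker name matches your .claimed idea; the rest is illustrative):

#include <sys/stat.h>
#include <cerrno>
#include <cstdio>
#include <string>

// Try to claim a work directory by atomically creating a ".claimed"
// subdirectory inside it: mkdir() either succeeds (we own the claim)
// or fails with EEXIST (another process got there first).
bool try_claim(const std::string& dir) {
    std::string marker = dir + "/.claimed";
    if (mkdir(marker.c_str(), 0755) == 0)
        return true;                 // we won the race
    if (errno == EEXIST)
        return false;                // someone else already claimed it
    perror("mkdir");                 // unexpected error (permissions etc.)
    return false;
}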
I'd take a different approach: list all the directories you want to process, writing the output to a pipe, which acts as a work queue that each process will read from. Pipe semantics (named or anonymous) guarantee that each byte is delivered to exactly one reader, and writes of up to PIPE_BUF bytes are atomic, so two processes will not be able to read the same data. (Note that a pipe only connects processes on the same machine, though.)
A master process could write the list to a pipe and spawn the worker processes, or the worker processes could just block trying to read until you manually write the list to the pipe.
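A hedged sketch of the named-pipe queue, using fixed-size records so a read can't tear a directory name in half (record size and FIFO path are illustrative):

#include <sys/stat.h>   // mkfifo
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

const size_t REC = 64;                      // fixed record size, <= PIPE_BUF

// master: mkfifo("/tmp/dirq", 0644) once, then write one padded record per dir
void enqueue(int fd, const char* dir) {
    char rec[REC] = {0};
    strncpy(rec, dir, REC - 1);             // null-padded, null-terminated
    write(fd, rec, REC);                    // atomic since REC <= PIPE_BUF
}

// worker: each read() pulls one whole record that no other worker will see
void worker() {
    int fd = open("/tmp/dirq", O_RDONLY);
    char rec[REC];
    while (read(fd, rec, REC) == (ssize_t)REC)
        printf("processing %s\n", rec);
    close(fd);                              // read() returns 0 at EOF
}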

If you are worried about collisions, then I would have a master process that delegates the directories out to the processes. Another option that I've used before is to list all of your directories in a database table. Then you can use the database's built-in concurrency features to pull out records and mark them as locked.
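For the database variant, here is a sketch of the claim query, assuming PostgreSQL with libpq and a hypothetical work table (dir TEXT, claimed BOOL); SKIP LOCKED lets each worker atomically grab a different unclaimed row:

#include <libpq-fe.h>
#include <cstdio>

int main() {
    PGconn* conn = PQconnectdb("dbname=queue");   // illustrative; error handling omitted
    PGresult* res = PQexec(conn,
        "UPDATE work SET claimed = true "
        "WHERE dir = (SELECT dir FROM work WHERE NOT claimed "
        "             LIMIT 1 FOR UPDATE SKIP LOCKED) "
        "RETURNING dir");
    if (PQntuples(res) == 1)
        printf("claimed %s\n", PQgetvalue(res, 0, 0));  // this worker owns that dir
    PQclear(res);
    PQfinish(conn);
}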

If the number of worker processes and the number of directories are known, you can divide the range of directories between the processes and thus avoid collisions.
So e.g. process 1 knows to take care of dir/1 to dir/10.

I do not know how your application works, but if it processes folders recursively from a given root folder, it is very likely you will double-process your files.
Here are some options:
Option 1
If you have full control of the application, you can modify it to read a list of folders from a configuration file.
myprogram.exe file1.config
myprogram.exe file2.config
where file1.config contains the names of directories 1-50
and file2.config contains the names of directories 51-100
Option 2
Use your OS's for loop to specify explicitly which folders your program should process. (Note: I have used DOS command syntax; please modify it according to your OS.)
for %f in (dir1, dir2, dir3, dir4) do start myprogram.exe %f
for %f in (dir11, dir12, dir13, dir14) do start myprogram.exe %f

Related

log4cpp stops working properly after some time

I have a log4cpp implementation in a multiple-process environment. The logger is configured once during initialization and then shared among forked processes which serve HTTP requests.
During the first minute or so, I see the log roll perfectly fine at the query-per-second load (say it runs at 100 QPS).
After that, the log slows down dramatically. So I logged the pid as well and noticed that only one process gets to write to the log for a while (around 10-15 seconds), then another process starts writing, and so on. The processes don't die; they just don't get a chance to write.
This is different from what happens when the server starts. At that time, every other log line is written by a different process. (Also, I write one log line per process at the end of serving each request.)
At this point, I can't think of what could be going wrong.
This is how my log4cpp conf file looks:
log4cpp.rootCategory=DEBUG,rootAppender
log4cpp.appender.rootAppender=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.rootAppender.fileName=/tmp/mylogfile.log
log4cpp.appender.rootAppender.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.rootAppender.layout.ConversionPattern=%d|%p|%m%n
log4cpp.category.http.server.main=INFO,MAIN
log4cpp.additivity.http.server.main=false
log4cpp.appender.MAIN=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.MAIN.maxBackupIndex=10
log4cpp.appender.MAIN.maxFileAge=1
log4cpp.appender.MAIN.append=true
log4cpp.appender.MAIN.fileName=/tmp/mylogfile.log
log4cpp.appender.MAIN.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.MAIN.layout.ConversionPattern=%d|%p|%m%n
Edit: more updates. Thanks @Botje for your time.
I see that whenever a new child process is created, only that process gets to write to the log. That tells me that all the references the other processes were holding become invalid.
I also tried setting the additivity property to true. With that, the server starts properly writing into /tmp/myfile.log, then switches to writing into /tmp/myfile.log.1 within a minute, and then stops writing after a minute.
At that point the logs get directed to stderr, which is redirected to another log file.
Also,
I did notice that the log4cpp FileAppender uses seek to determine the file size before writing log entries. If the file handle is shared between processes, that will cause writes to end up at the start of the file instead of the end. Even if you fix that, you still have multiple processes that think they are in charge of log file rotation.
I suggest you have all processes write to a common UDP/TCP/Unix socket and designate one process that collects all log entries and actually writes them to a file. You don't have to reinvent the wheel: you can use the syslog protocol and either the system syslog or a copy running in userspace.
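A minimal sketch of the syslog route (tag and message are illustrative); syslogd serializes entries from all the forked children, so no process has to coordinate rotation:

#include <syslog.h>

int main() {
    openlog("httpserver", LOG_PID, LOG_DAEMON);      // LOG_PID adds the pid to each entry
    syslog(LOG_INFO, "request served in %d ms", 42); // safe from any forked child
    closelog();
}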

Multithreading – known similarity solution

I am looking for a known solution (like the producer-consumer problem) for this situation.
In my case there are two kinds of links:
a link to an image,
a link to a text file containing links to images and links to other text files (which in turn contain more links).
I'm trying to create a multi-threaded downloader in C++ (on Unix) using POSIX mutexes and semaphores.
The application starts with a link to the first text file.
The threads sleep (semaphore = 0).
The main thread downloads the first text file.
It parses it for other links and puts them in a queue (semaphore += links_count, so the other threads wake up).
The other threads consume links and produce further links.
What should the main thread do?
How do I detect that the other threads have finished?
With a bounded queue there can be a deadlock: a text file may contain many links (the queue fills up with other text files), so no text file can ever be finished.
Thank you for your ideas.
Well, your problem is still kind of a producer/consumer problem but your consumers are also producers. Some ways to deal with the problem:
Do not limit your queue size. Simply fail when your process runs out of memory. Not very elegant but will probably work in 99.99% of all download scenarios (assuming 100 bytes per download link on average and about 2GB available memory you would have to store more than 20 million links in your queue before running out of memory).
Split your producer and consumer by using the hard drive as buffer. Download files into a temporary folder. Have a thread watch that folder for new files. Once a new file appears, parse it and put the items in the consumer queue. Once the file is finished parsing put it into the final download location. This way you are only limited by disk space. This way your producer (parser) is a different thread than your consumers (downloader).
Edit
You can wait on your worker threads with pthread_join in the main thread.
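A sketch of the consumers-are-also-producers queue with clean shutdown, using an unbounded queue as in option 1: pending counts links that are queued or in flight, and when it drops to zero no thread can produce more work, so everyone is woken to exit (the fetch/parse step is a placeholder for your download code):

#include <pthread.h>
#include <queue>
#include <string>

std::queue<std::string> q;
int pending = 0;                            // queued + in-flight items
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

void push(const std::string& link) {        // called by main thread and workers
    pthread_mutex_lock(&m);
    q.push(link);
    ++pending;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

void* worker(void*) {
    for (;;) {
        pthread_mutex_lock(&m);
        while (q.empty() && pending > 0)    // work may still arrive
            pthread_cond_wait(&cv, &m);
        if (q.empty()) {                    // pending == 0: everything is done
            pthread_mutex_unlock(&m);
            return nullptr;
        }
        std::string link = q.front();
        q.pop();
        pthread_mutex_unlock(&m);

        // fetch(link); if it was a text file, call push() for each link found

        pthread_mutex_lock(&m);
        if (--pending == 0)                 // last item: wake sleeping threads
            pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
    }
}

The main thread pushes the first link, spawns the workers, and then simply pthread_joins each of them.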

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multiple-process system (pre-fork model, similar to Apache). All processes are writing to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared memory lock, and when the file reaches a certain size the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs (say, log.bin.10).
opening a new log.bin file
The problem is that the other processes are unaware, and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the opened file stream, and somehow detect that the file was renamed under them and reopen the log.bin file.
None of those is very elegant in my opinion.
thoughts? recommendations?
Your solution seems fine, but you should store the inode number of the current log file in shared memory (see stat(2) and the st_ino member).
This way, each process keeps a local variable with the inode of the file it has open.
The shared variable is updated only by the one process doing the rotation; every other process detects the rotation by comparing its local inode against the shared one, and a mismatch triggers a reopen.
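A sketch of that check, assuming log_fd/my_ino are initialized the same way at startup and shared_ino lives in the shared memory segment you already have:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int log_fd;                  // our open descriptor for log.bin
ino_t my_ino;                // inode of the file we opened

void maybe_reopen(const char* path, volatile ino_t* shared_ino) {
    if (my_ino == *shared_ino)
        return;              // still writing to the current file
    close(log_fd);           // the file was rotated under us
    log_fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    struct stat st;
    fstat(log_fd, &st);
    my_ino = st.st_ino;      // remember the new inode
}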
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
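In code, the per-entry open could look like this (a sketch; O_APPEND makes each write go to the end of whatever file log.bin names right now, so a rename-based rotation is picked up on the very next entry):

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

void log_entry(const char* msg) {
    // shared memory lock is held by the caller, as in the steps above
    int fd = open("log.bin", O_WRONLY | O_APPEND | O_CREAT, 0644);
    write(fd, msg, strlen(msg));
    close(fd);
}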
Or you could create a logging process, which receives log messages from the other processes and handles all the rotating transparently from them.
You don't say what language you're using but your processes should all log to a log process and the log process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
So the processes can still write to the old file pointer, which now refers to the empty file.
See also man logrotate:
copytruncate
    Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
Since you're using shared memory and you know how many processes are using the log file, you can create an array of flags in shared memory, telling each of the processes that the file has been rotated. Each process then resets its own flag so that it doesn't re-open the file continuously.
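A sketch of those flags, assuming a known maximum process count and that each process knows its own slot (all names are illustrative):

#include <fcntl.h>
#include <unistd.h>

const int kMaxProcs = 16;                            // assumed known process count
struct Shared { volatile int rotated[kMaxProcs]; };  // lives in shared memory

// the rotating process sets rotated[i] = 1 for every i after renaming

void before_write(Shared* shm, int my_slot, int* log_fd) {
    if (shm->rotated[my_slot]) {
        shm->rotated[my_slot] = 0;                   // consume our notification
        close(*log_fd);
        *log_fd = open("log.bin", O_WRONLY | O_APPEND | O_CREAT, 0644);
    }
}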

Preventing multiple process instances on Linux

What is the best way on Linux platform for the process (C++ application) to check its instance is not already running?
The standard way to do this is to create a pidfile somewhere, typically containing the pid of your program.
You don't need to put the pid in there, you could just put an exclusive lock on it. If you open it for reading/writing and flock it with LOCK_EX | LOCK_NB, it will fail if the file is already locked. This is race-condition free, and the lock will be automatically released if the program crashes.
Normally you'd want to do it per-user, so the user's home directory is a good place to put the file.
If it's a daemon, somewhere like /var/run is better.
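A minimal sketch of the lock-file approach (the path is illustrative; pick the home directory for per-user, /var/run for a daemon):

#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/var/run/myapp.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || flock(fd, LOCK_EX | LOCK_NB) != 0) {
        fprintf(stderr, "another instance is already running\n");
        return 1;
    }
    // ... run the application; keep fd open for the whole lifetime ...
    // the kernel drops the lock automatically when the process exits
    return 0;
}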
You can use files and file locks to accomplish this, but beware: it isn't perfect, and don't copy the infamous Firefox bug where it sometimes refuses to start even if it isn't already running.
The basic logic of it is:
Invariant:
    File xxxxx will exist if and only if the program is running, and the
    contents of the file will contain the PID of that program.

On startup:
    If file xxxxx exists:
        If there is a process with the PID contained in the file:
            Assume there is some instance of the program, and exit
        Else:
            Assume that the program terminated abnormally, and
            overwrite file xxxxx with the PID of this program
    Else:
        Create file xxxxx, and save the current PID to that file.

On termination (typically registered via atexit):
    Delete file xxxxx
In addition to the logic above, you should also use a second file that you lock in order to synchronize access to the PID file (i.e. to act as a mutex to make it safe in terms of process-level concurrency).
A related alternative to Michael's solution is to create a directory in a known location (probably under /var/run or /tmp) and use the success/failure of the mkdir() call as the mechanism for ensuring mutual exclusion. This is the same mutual-exclusion trick CVS has used for years, as directory creation is atomic on most (maybe all) commodity OSes. A PID file is still useful in the case where the process that created the directory and PID file dies unexpectedly and fails to clean up. Additionally, when checking whether the existing directory + PID is valid, I'd suggest explicitly checking the /proc/<PID>/exe symlink to verify that it points to your executable rather than just assuming the PID hasn't been recycled.
For a desktop app, it is probably more feasible to check whether an instance is started for the current user, so that two users can have their own instances running.
You could use a library (libunique (GTK+) or QtSingleApplication (Qt)), or do it yourself. In addition to the pid-file mentioned earlier, you can open a FIFO or UNIX-domain socket somewhere in the user's home directory. This way, you can communicate with the running instance, e.g. raise its window or tell it to open a new file/URI/whatever.
You could use a POSIX named semaphore to do this. It is much safer than using a file lock.

Writing concurrently to a file

I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a master process creates the file, then starts a sequence of other processes, one at a time, that open the file and write to it (the master sometimes chimes in with additional content; a slave process may or may not have it open and be writing something).
I'd like, as much as possible, not to use more IPC than what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other than the CRT and Win32 API, and I would like not to start writing serialization code.
Here is some code that shows where I've gone:
// open the file: truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, std::ios::out | (isClient ? std::ios::app : std::ios::openmode(0)));
// do stuff...
// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave process is ordered as expected, what the master writes is either bunched together or at the wrong place, when it exists at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one? Am I going about this the right way?
If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block trying to acquire the mutex, and in general you don't want your threads blocked from doing work just so they can log. In that case, you'd want your worker threads to write log entries to a queue (which just requires moving stuff around in memory), and have a dedicated thread pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible.
You can do even better than this by using async I/O, but that gets a bit more tricky.
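A sketch of the queue-plus-dedicated-writer idea, assuming pthreads and an illustrative file name; workers hold the lock only long enough to push a string, and the writer thread does the slow I/O:

#include <pthread.h>
#include <queue>
#include <string>
#include <fstream>

std::queue<std::string> log_q;
pthread_mutex_t log_m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t log_cv = PTHREAD_COND_INITIALIZER;

void log_line(const std::string& line) {        // called by worker threads
    pthread_mutex_lock(&log_m);
    log_q.push(line);                           // cheap: memory only
    pthread_cond_signal(&log_cv);
    pthread_mutex_unlock(&log_m);
}

void* log_writer(void*) {                       // one dedicated thread
    std::ofstream out("tool.log", std::ios::app);
    for (;;) {
        pthread_mutex_lock(&log_m);
        while (log_q.empty())
            pthread_cond_wait(&log_cv, &log_m);
        std::string line = log_q.front();
        log_q.pop();
        pthread_mutex_unlock(&log_m);
        out << line << '\n' << std::flush;      // slow I/O outside the lock
    }
}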
As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I missed was the synchronization between the master and the slave (the former assumed a particular operation was synchronous when it was not).
Edit: Oh well, there still was a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely screwed the output when writing to cout, as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
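In code, that final sequence for the master might look like this (a sketch; filename as in the snippet above):

{ std::ofstream create_only(filename, std::ios::out); }  // create/truncate, then close
std::ofstream log(filename, std::ios::app);              // reopen for appending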
I made a little log system that has its own process to handle the writing. The idea is quite simple: the processes that use the log just send entries to a pending queue, which the log process then writes to a file. It's like batch processing in any realtime rendering app. This way you get rid of excessive open/close file operations. If I can, I'll add the sample code.
How do you create that mutex?
For this to work it needs to be a named mutex, so that both processes actually lock on the same thing.
You can check that your mutex is actually working with a small piece of code that locks it in one process while another process tries to acquire it.
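A sketch with the Win32 API (the mutex name is illustrative); because the name matches, CreateMutexA returns a handle to the same kernel object in the master and in every slave:

#include <windows.h>

HANDLE hLogMutex = CreateMutexA(NULL, FALSE, "MyToolLogMutex");

void locked_write() {
    WaitForSingleObject(hLogMutex, INFINITE);  // blocks while another process holds it
    // ... write and flush the shared file here ...
    ReleaseMutex(hLogMutex);
}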
I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task is interrupted by text from a higher-priority thread; it doesn't look very pretty.
Also, write the log in comma-separated format, or some other format that can easily be loaded into a spreadsheet. Include the thread ID and a timestamp. The interlacing of the text lines shows how the threads are interacting; the ID parameter lets you sort by thread, and the timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet-friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
One option is to use ACE::logging. It has an efficient implementation of concurrent logging.