How to keep data for progress bar - c++

My program downloads files from servers and parses them.
I already have a progress bar for downloading, but I also want one for parsing.
Parsing takes a lot of time and CPU, so the progress tracking itself must be cheap.
The structure is: a few servers -> a few files per server -> lines in each file.
That is, I download the files from a server (about 4-5 files) and start parsing each one as soon as it has been downloaded.
When there is more than one server, my program downloads from all of them, so with two servers I have twice as many files. The file names are the same on every server, so after downloading I rename each file to "world"+"orginalfile.txt".
I thought about something like this:
std::map<int /* server */, std::map<int /* file; should be an enum */, std::pair<int /* current line */, int /* max lines */>>> (a structure)
While reading a file, I want to emit signals that send data to the window.
When reading starts, I want to send (file, lines_in_file, server).
While reading, I want to send (file, current_line, world).
The window that receives this data stores it in some variable (like the example above) and runs a second function to calculate the progress bar value.
I.e.:
servers[] -> files[] -> download thread -> reading thread (these threads start per file, so with 2 servers and 4 files they start 8 times) -> emit init signal (file, lines_in_file, server_number) + emit (currentLineWhenReading, file, server) signal while reading line by line
So what is the best way to collect and hold this much data while using little CPU to calculate the progress?

Parse in a separate thread. While parsing, increment a counter (protected by a mutex, or use an atomic increment), as in this answer.
The parsing thread is probably mostly I/O bound, so it will spend most of its time waiting for disk I/O. In that case, the small overhead of locking a mutex is insignificant.
In the main GUI thread, set something up (e.g. a Qt timer) to read the parse counter every second (with the mutex) and update the progress bar.
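The counter idea above can be sketched with std::atomic instead of a mutex. This is only a minimal sketch under some assumptions: ParseProgress and parse_text are hypothetical names, the text is parsed from a string rather than a real file, and one shared pair of counters covers all servers and files instead of the nested map from the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <thread>

// Shared progress state: written by the parsing threads, read by the GUI thread.
struct ParseProgress {
    std::atomic<std::size_t> lines_done{0};
    std::atomic<std::size_t> lines_total{0};

    // Percentage in [0, 100]; returns 0 while no file has been registered yet.
    int percent() const {
        const std::size_t total = lines_total.load(std::memory_order_relaxed);
        if (total == 0) return 0;
        const std::size_t done = lines_done.load(std::memory_order_relaxed);
        // Clamp, because another file may register between the two loads.
        return static_cast<int>(std::min<std::size_t>(done * 100 / total, 100));
    }
};

// Hypothetical parser: registers the file's line count, then bumps the
// shared counter once per parsed line.
void parse_text(const std::string& text, std::size_t total_lines,
                ParseProgress& progress) {
    progress.lines_total.fetch_add(total_lines, std::memory_order_relaxed);
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // ... real parsing of `line` would go here ...
        progress.lines_done.fetch_add(1, std::memory_order_relaxed);
    }
}
```

The GUI side would then call progress.percent() from its timer callback (for example once per second) and pass the result to the progress bar, so the parsing threads never have to emit a signal per line.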

Related

log4cpp stops working properly after sometime

I have a log4cpp implementation in a multi-process environment. The logger is configured once during initialization and is then shared among forked processes which serve HTTP requests.
During the first minute or so, I see the log roll along perfectly fine at the query-per-second load (say it runs at 100 qps).
After that, the log slows down dramatically. So I logged the pid as well and noticed that only one process gets to write to the log for a while (around 10-15 seconds), then another process starts writing, and so on. The processes don't die. They just don't get a chance to write.
This is different from what happens when the server starts. At that time, every other log line is written by a different process. (Also, I write one log line per process at the end of serving the request.)
At this point, I can't think of what could be going wrong.
This is how my log4cpp conf file looks
log4cpp.rootCategory=DEBUG,rootAppender
log4cpp.appender.rootAppender=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.rootAppender.fileName=/tmp/mylogfile.log
log4cpp.appender.rootAppender.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.rootAppender.layout.ConversionPattern=%d|%p|%m%n
log4cpp.category.http.server.main=INFO,MAIN
log4cpp.additivity.http.server.main=false
log4cpp.appender.MAIN=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.MAIN.maxBackupIndex=10
log4cpp.appender.MAIN.maxFileAge=1
log4cpp.appender.MAIN.append=true
log4cpp.appender.MAIN.fileName=/tmp/mylogfile.log
log4cpp.appender.MAIN.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.MAIN.layout.ConversionPattern=%d|%p|%m%n
Edit: more updates. Thanks @Botje for your time.
I see that whenever a new child process is created, only that process gets to write to the log. That tells me that all the references the other processes were holding become invalid.
I also tried setting the additivity property to true. With that, the server starts out writing properly into /tmp/myfile.log and then switches to writing into /tmp/myfile.log.1 within a minute. Then it stops writing after a minute.
At that point the logs get directed to stderr, which is redirected to another log file.
Also,
I did notice that the log4cpp FileAppender uses seek to determine the file size before writing log entries. If the file handle is shared between processes that will cause writes to end up at the start of the file instead of the end. Even if you fix that, you still have multiple processes that think they are in charge of log file rotation.
I suggest you have all processes write to a common UDP/TCP/Unix socket and designate one process that collects all log entries and actually writes them to a file. You don't have to reinvent the wheel: you can use the syslog protocol and either the system syslog or a copy running in userspace.
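As a rough illustration of the "one collector process" idea, here is a minimal sketch that uses a Unix datagram socket pair created before forking. It is not syslog and not part of log4cpp; make_log_channel, send_log_entry and receive_log_entry are hypothetical helper names, and the 4096-byte entry limit is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Create a connected pair of Unix datagram sockets before forking.
// fds[0] stays with the collector, fds[1] is inherited by the workers.
bool make_log_channel(int fds[2]) {
    return socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) == 0;
}

// Worker side: one send() per log entry. Datagram sockets keep each
// entry whole, so entries from different processes cannot interleave.
bool send_log_entry(int fd, const std::string& entry) {
    const ssize_t n = send(fd, entry.data(), entry.size(), 0);
    return n == static_cast<ssize_t>(entry.size());
}

// Collector side: block until one entry arrives and hand it back.
// Only the collector touches the log file, so only it rotates the file.
bool receive_log_entry(int fd, std::string& entry) {
    char buf[4096];  // assumed upper bound on the size of one entry
    const ssize_t n = recv(fd, buf, sizeof buf, 0);
    if (n < 0) return false;
    entry.assign(buf, static_cast<std::size_t>(n));
    return true;
}
```

The collector would loop on receive_log_entry and write each entry through its own, unshared appender, so the file size check and the rotation happen in exactly one process.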

Multithreading – known similarity solution

I am looking for a known solution (like the producer-consumer problem) for this situation.
In my case a link can be one of two things:
a link to an image,
a text file with links to images and links to other text files (with further links).
I'm trying to create a multi-threaded downloader in C++ (on Unix) using POSIX mutexes and POSIX semaphores.
The application has a link to the first text file.
The threads sleep (semaphore = 0).
The main thread downloads the first text file.
It parses it for other links and puts them in a queue (semaphore += links_count --> the other threads wake up).
The other threads produce further links.
What should the main thread do?
How can I check whether the other threads have finished?
With a finite queue there can be a deadlock: a text file contains many links, the queue fills up with other text files, and no text file can be finished.
Thank you for your ideas.
Well, your problem is still kind of a producer/consumer problem but your consumers are also producers. Some ways to deal with the problem:
Do not limit your queue size. Simply fail when your process runs out of memory. Not very elegant but will probably work in 99.99% of all download scenarios (assuming 100 bytes per download link on average and about 2GB available memory you would have to store more than 20 million links in your queue before running out of memory).
Split your producer and consumer by using the hard drive as a buffer. Download files into a temporary folder. Have a thread watch that folder for new files. Once a new file appears, parse it and put the items in the consumer queue. Once the file has been parsed, move it to the final download location. This way you are only limited by disk space, and your producer (the parser) is a different thread than your consumers (the downloaders).
Edit
You can wait on your worker threads with pthread_join in the main thread.
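For the "how do I know the other threads have finished" part, one possible sketch is an unbounded queue that also counts unfinished links; when that count drops to zero, every worker returns and the main thread can join them. It uses std::thread and std::condition_variable instead of raw POSIX calls, and LinkQueue and run_workers are hypothetical names, not something from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Unbounded queue that tracks links that are queued or still being processed.
class LinkQueue {
public:
    void push(std::string link) {
        {
            std::lock_guard<std::mutex> lock(m_);
            links_.push_back(std::move(link));
            ++pending_;
        }
        cv_.notify_one();
    }

    // Blocks until a link is available. Returns false once no link is
    // queued or in progress anywhere, i.e. all work is finished.
    bool pop(std::string& link) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !links_.empty() || pending_ == 0; });
        if (links_.empty()) return false;
        link = std::move(links_.front());
        links_.pop_front();
        return true;
    }

    // Call after a popped link is fully handled, including pushing its children.
    void done() {
        bool finished;
        {
            std::lock_guard<std::mutex> lock(m_);
            ++completed_;
            finished = (--pending_ == 0);
        }
        if (finished) cv_.notify_all();  // wake every waiting worker so it can exit
    }

    std::size_t completed() const {
        std::lock_guard<std::mutex> lock(m_);
        return completed_;
    }

private:
    mutable std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> links_;
    std::size_t pending_ = 0;
    std::size_t completed_ = 0;
};

// Runs `n` workers; `handle(link, queue)` processes one link and may push more.
// The first link must be pushed before calling this.
template <typename Handler>
void run_workers(LinkQueue& queue, unsigned n, Handler handle) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::string link;
            while (queue.pop(link)) {
                handle(link, queue);
                queue.done();
            }
        });
    }
    for (auto& w : workers) w.join();  // plays the role of pthread_join
}
```

Because the queue is unbounded, a worker never blocks while pushing, which avoids the deadlock described for the finite queue. A handler that throws would leave the counter stuck, so real code would need to guard against that.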

Request for suggestions on doing IPC/event capture

I have a simple python server script which forks off multiple instances (say N) of C++ program. The C++ program generates some events that need to be captured.
The events are currently being captured in a log file (1 log file per forked process). In addition, I need to periodically (every T minutes) report the rate at which events are being produced across all child processes, either to the Python server or to some other program listening for these events (I'm still not sure which). Based on the rate of these events, some "re-action" may be taken by the server (say, reducing the number of forked instances).
Some pointers I have briefly looked at:
grep log files - go through the running process log files (.running), filter those entries generated in the last T minutes, analyse the data and report
socket ipc - add code to c++ program to send the events to some server program which analyses the data after T minutes, reports and starts all over again
redis/memcache (not sure completely) - add code to c++ program to use some distributed store to capture all the generated data, analyses the data after T minutes, reports and starts all over again
Please let me know your suggestions.
Thanks
If time is not of the essence (T minutes sounds long compared to whatever events are happening in the C++ programs that are kicked off), then don't make things any more complicated than they need to be. Forget IPC (sockets, shared memory, etc.). Just have each C++ program log what you need to know about time/performance and let the Python script check the logs every T minutes. Don't waste time overcomplicating something that you can do in a simple manner.
As an alternative to your socket IPC suggestion, how about 0mq? It's a library (with a C API and Python bindings available) that can do message transfer on an inter-thread, inter-process or inter-machine level. It's pretty simple to get going, and pretty quick.
I'm not affiliated with it. I'm just evaluating it for other uses and thought it might be a fit for you as well.

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multi-process system (pre-fork model, similar to Apache). All processes write to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs. (say, log.bin.10)
opening a new log.bin file
The problem is that the other processes are unaware of this and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the opened file stream, and somehow detect that the file was renamed under them and reopen log.bin file.
None of those is very elegant in my opinion.
thoughts? recommendations?
Your solution seems fine, but you should store an integer with the inode of the current log file in shared memory (see stat(2) and the st_ino member of struct stat).
This way, each process keeps a local variable holding the inode of the file it has open.
The shared variable must be updated during rotation by only one process. All other processes notice the rotation by comparing their local inode with the shared inode, and a difference should trigger a reopen.
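A minimal sketch of that inode check, assuming POSIX. LogHandle, open_log, reopen_if_rotated and rotate_log are hypothetical names, and the shared inode is passed in as a plain value here; in the real system it would live in the shared memory segment and be read while holding the existing lock.

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Per-process view of the log: the open descriptor and the inode behind it.
struct LogHandle {
    int fd = -1;
    ino_t inode = 0;
};

// Opens (or creates) the log by name and remembers its inode.
bool open_log(const char* path, LogHandle& log) {
    const int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return false;
    struct stat st{};
    if (fstat(fd, &st) != 0) {
        close(fd);
        return false;
    }
    if (log.fd >= 0) close(log.fd);  // drop the descriptor of the rotated file
    log.fd = fd;
    log.inode = st.st_ino;
    return true;
}

// Called before each write, with the shared lock held.
// Returns true if the log had been rotated and was reopened.
bool reopen_if_rotated(const char* path, ino_t shared_inode, LogHandle& log) {
    if (log.inode == shared_inode) return false;
    return open_log(path, log);
}

// Rotating process: rename the current file, open a fresh one, and return
// the inode to publish in shared memory (0 on failure).
ino_t rotate_log(const char* path, const char* rotated_path, LogHandle& log) {
    if (std::rename(path, rotated_path) != 0) return 0;
    return open_log(path, log) ? log.inode : 0;
}
```

Comparing two integers per write is much cheaper than calling stat() on the file name every time, which is the main point of keeping the inode in shared memory.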
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
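The five steps above could look roughly like this. It is only a sketch: append_entry is a hypothetical name, and the std::mutex stands in for the shared memory lock from the question (a std::mutex by itself does not synchronize separate processes).

```cpp
#include <cassert>
#include <fstream>
#include <mutex>
#include <string>

// Appends one entry, reopening the file by name on every call so that a
// rotation (rename + new file) is picked up automatically.
bool append_entry(const std::string& path, const std::string& entry,
                  std::mutex& lock) {
    std::lock_guard<std::mutex> guard(lock);                    // get lock
    std::ofstream out(path, std::ios::binary | std::ios::app);  // open file by name
    if (!out) return false;
    out << entry << '\n';                                       // write log entry
    out.close();                                                // close file
    return !out.fail();
}                                                               // lock released here
```

The cost is one open and one close per entry, which is simple and robust but can become noticeable at high log rates.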
Or you could create a logging process, which receives log messages from the other processes and handles all the rotating transparently from them.
You don't say what language you're using but your processes should all log to a log process and the log process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
That way the other processes can keep writing through their old file handle, which now refers to an empty file.
See also man logrotate:
copytruncate
Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
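Done by hand instead of through logrotate, the same copy-then-truncate approach might be sketched as follows. copy_truncate is a hypothetical helper, and the sketch assumes the writers opened the log in append mode; a writer that keeps its own file offset would continue at the old position after the truncation.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Copies the log to `copy_path`, then truncates the log in place.
// Writers that hold the file open in append mode keep working unchanged.
bool copy_truncate(const std::string& log_path, const std::string& copy_path) {
    {
        std::ifstream src(log_path, std::ios::binary);
        std::ofstream dst(copy_path, std::ios::binary | std::ios::trunc);
        if (!src || !dst) return false;
        // Inserting an empty rdbuf() sets failbit, so skip the copy for an empty log.
        if (src.peek() != std::ifstream::traits_type::eof()) {
            dst << src.rdbuf();
        }
        if (!dst) return false;
    }
    // Anything written between the copy above and this truncation is lost,
    // which is the "very small time slice" the man page warns about.
    std::ofstream truncate(log_path, std::ios::binary | std::ios::trunc);
    return static_cast<bool>(truncate);
}
```

In the setup from the question, this would run while holding the shared memory lock, which closes that window for the cooperating processes.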
Since you're using shared memory, and if you know how many processes are using the log file, you can create an array of flags in shared memory, one per process, telling each process that the file has been rotated. Each process then resets its own flag after reopening, so that it doesn't reopen the file continuously.

Writing concurrently to a file

I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a process creates the file, then starts a sequence of other processes, one at a time, that then open the file and write to it (the master sometimes chimes in with additional content; the slave process may or may not be open and writing something).
I'd like, as much as possible, not to use more IPC than what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other than the CRT and the Win32 API, and I would like not to start writing serialization code.
Here is some code that shows where I've gone:
// open the file. Truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, std::ios::out | (isClient ? std::ios::app : std::ios::openmode{}));
// do stuff...
// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave process is ordered as expected, what the master writes is either bunched together or at the wrong place, when it exists at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one ? Am I going the right way anyway ?
If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block while trying to acquire the mutex, and in general you don't want your threads blocked from doing work just so they can log. In that case, you'd want your worker threads to put log entries on a queue (which just requires moving data around in memory), and have a dedicated thread pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible.
You can do even better than this by using async I/O, but that gets a bit more tricky.
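A minimal sketch of the queue-plus-dedicated-writer design for threads within one process. AsyncLogger is a hypothetical class name, and the sketch truncates the file on construction; it does not address the multi-process case from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Worker threads only enqueue; one dedicated thread does all file I/O.
class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path, std::ios::out | std::ios::trunc),
          writer_([this] { run(); }) {}

    // Drains whatever is still queued, then stops the writer thread.
    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_one();
        writer_.join();
    }

    // Called by workers: holds the mutex only long enough to move a string.
    void log(std::string entry) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push_back(std::move(entry));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // stopping and fully drained
            std::deque<std::string> batch;
            batch.swap(queue_);          // take everything queued so far
            lock.unlock();               // write without blocking the workers
            for (const auto& entry : batch) out_ << entry << '\n';
            out_.flush();
            lock.lock();
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool stopping_ = false;
    std::thread writer_;  // declared last so it starts after the other members exist
};
```

Swapping out the whole queue in one step means the writer flushes entries in batches, so slow disk writes never happen while the mutex is held.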
As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I missed out is the synchronization between the master and the slave (the former was assuming a particular operation was synchronous where it was not).
edit: Oh well, there still was a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely screwed the output when writing to cout as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
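That final open sequence might look like the following sketch. open_shared_log is a hypothetical helper name; the mutex and the flush after every write from the question are still needed around the actual writes.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Master: create or truncate the file first, then reopen it in append mode.
// Slaves: open in append mode only, so they never truncate.
std::ofstream open_shared_log(const std::string& filename, bool is_master) {
    if (is_master) {
        // Opening with trunc creates or empties the file; the stream is
        // closed again when it goes out of scope at the end of this block.
        std::ofstream truncate(filename, std::ios::out | std::ios::trunc);
    }
    // In append mode every write goes to the current end of the file, so
    // one process no longer overwrites what another has just written.
    return std::ofstream(filename, std::ios::out | std::ios::app);
}
```

With every process in append mode there is no need for seekp(), which was the part that interfered with the rest of the output.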
I made a little log system that has its own process and handles the writing; the idea is quite simple. The processes that use the log just send their entries to a pending queue, which the log process then writes to a file. It's like batch processing in any realtime rendering app. This way you get rid of excessive open/close file operations. If I can, I'll add the sample code.
How do you create that mutex?
For this to work this needs to be a named mutex so that both processes actually lock on the same thing.
You can check that your mutex is actually working correctly with a small piece of code that locks it in one process while another process tries to acquire it.
I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task is interrupted by text from a higher priority thread; doesn't look very pretty.
Also, put the format into Comma Separated format, or some format that can be easily loaded into a spreadsheet. Include thread ID and timestamp. The interlacing of the text lines shows how the threads are interacting. The ID parameter allows you to sort by thread. Timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
One option is to use ACE::logging. It has an efficient implementation of concurrent logging.