I am working with Linux, and I have a directory that has subdirectories with files inside them.
I have to monitor changes to those files. In C++ I am using Boost: I go through all the directories every 30 seconds and check the last_write_time. In principle, it works. But every time this scan runs, my CPU goes nuts and I see 15%-25% CPU usage
just for this program in top. I have read about inotify. If I use inotify, would the CPU usage stay more or less the same, or would it improve? Is there a good alternative to what I am doing?
When you use inotify, you do not need to poll all the files to check for changes. You get a callback-style system that notifies you when a watched file or directory is changed.
The kernel/filesystem already has this information, so the resource/CPU usage is not just moved to another application, it is actually reduced.
The article Monitor file system activity with inotify provides more detail on why to use inotify, shows its basic usage, and helps you set it up.
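A minimal sketch of what this looks like (the directory path is a placeholder, and error handling is reduced to the essentials):

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    int fd = inotify_init1(0);                       // blocking inotify instance
    if (fd < 0) { std::perror("inotify_init1"); return EXIT_FAILURE; }

    // Watch one directory for writes, creations and deletions.
    int wd = inotify_add_watch(fd, "/path/to/dir",
                               IN_MODIFY | IN_CREATE | IN_DELETE);
    if (wd < 0) { std::perror("inotify_add_watch"); return EXIT_FAILURE; }

    alignas(inotify_event) char buf[4096];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);     // blocks until events arrive
        if (len <= 0) break;

        for (char* p = buf; p < buf + len; ) {
            auto* ev = reinterpret_cast<inotify_event*>(p);
            if (ev->len > 0)                         // name is filled in for directory watches
                std::printf("event 0x%x on %s\n", ev->mask, ev->name);
            p += sizeof(inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}

Note that inotify watches are per directory and not recursive, so for a tree like yours you add one watch per subdirectory (for example while walking the tree once at startup).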
Related
I need to check if a file is currently opened by another process, e.g. a text editor (but it needs to apply to everything else too).
I tried using std::ofstream::is_open() etc., but this did not work: I could open the file in my text editor while my program was checking whether it was open, and the program saw it as a closed file and went on. It only worked if I had opened the file as another ofstream myself.
I'm using the filesystem library to copy files, and they may only be copied (and later removed) if the file is not currently being written to by another process on the client server.
Really curious about this one. Been wondering this for quite some time but never found a good way for it myself.
I'm currently making a program that needs to be able to run on both Linux and Windows. Every 5 seconds it copies all files from directories a, b, c, d to x (this can be set by the client in rules). After everything has been copied, all the files may be removed. After a day (or whatever the client tells the program), all those files from x need to be zipped and archived at location y. Hence the problem: files may only be deleted (and copied) if the other programs that place files in directories a, b, c, d are not touching that specific file right now. I hope that makes the question clearer.
And before anybody starts: yes, I know about the race condition. I do not care about it for now. The program does absolutely nothing with the contents of a file, and after a file is closed by the other process, it stays closed forever.
I need to check if a file is currently opened by another process
This is heavily operating-system specific (and might be useless).
So first read a good textbook on operating systems.
On Linux specifically you might use inotify(7) facilities, or /proc/ pseudo-file system (see proc(5)), or perhaps lsof(8). They work only for local file systems (not remote ones, like NFS). See also Advanced Linux Programming and syscalls(2).
And you could have surprises (e.g. a process scheduled right after your check could remove the file before you have time to do anything with it).
For Windows, take the time to read its documentation.
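As an illustration of the /proc route on Linux (essentially what lsof does), here is a rough sketch, assuming C++17 std::filesystem; the target path in main is just an example, and as noted above the answer is only a snapshot that can change the moment the scan finishes:

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Returns true if some process on this machine currently has `target` open.
bool is_open_by_some_process(const fs::path& target)
{
    std::error_code ec;
    const fs::path want = fs::canonical(target, ec);
    if (ec) return false;

    try {
        for (const auto& proc : fs::directory_iterator("/proc")) {
            fs::directory_iterator fds(proc.path() / "fd", ec);
            if (ec) { ec.clear(); continue; }        // not a PID directory, or no permission
            for (const auto& fd : fds) {
                fs::path link = fs::read_symlink(fd.path(), ec);
                if (!ec && link == want) return true;
            }
        }
    } catch (const fs::filesystem_error&) {
        // /proc entries come and go while we scan; treat that as "not found"
    }
    return false;
}

int main()
{
    std::cout << std::boolalpha
              << is_open_by_some_process("/tmp/example.txt") << '\n';
}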
I'm currently making a program that needs to be able to run on both Linux and Windows. Every 5 seconds it copies all files from directories a, b, c, d to x.
You might look, at least for inspiration, inside the source code of rsync.
I don't understand what your actual problem is, but rsync might be part of the solution, and it is rumored to run on both Windows and Linux.
It is easier to explain with an example.
When two text editors edit the same text file at the same time and one of them saves the file, the other one notices that it was modified and asks what to do.
How is it possible to get a signal that a file was modified outside the program?
I am working with C++ (though I think it isn't important) and on Linux (a solution for Windows would be good too).
ISO-C++ does not offer this functionality, so you have to stick with what the operating system provides.
On Linux that would be inotify, on Windows you would use directory change notifications.
① Check the timestamp of the file as close as possible before writing. If it is not what it was when you last opened the file for reading, then beware! (A sketch of this approach follows this list.)
② You can build a checksum of the file and compare this to one you built earlier.
③ Register with a system service that informs you about file activities. This depends on the goodwill of the OS you are using; if this notification service isn't working properly, your stuff will fail. On Linux, have a look at inotify.
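A minimal sketch of option ①, assuming C++17 std::filesystem; the file name is illustrative:

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main()
{
    const fs::path file = "document.txt";              // example path
    const auto seen = fs::last_write_time(file);       // timestamp when we loaded the file

    // ... the user edits the in-memory copy ...

    if (fs::last_write_time(file) != seen) {
        std::cout << "File was modified outside the program -- beware!\n";
        // ask the user whether to overwrite, reload, merge, ...
    }
}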
Thanks for your time and sorry for this long message!
My work environment
Linux, C/C++ (but I'm new to the Linux platform)
My question in brief
In the software I'm working on we write a LOT of log messages to local files, which makes the files grow quickly and eventually use up all the disk space (ouch!). We want these log messages for trouble-shooting purposes, especially after the software is released to the customer site. It is of course unacceptable to take up all the disk space on the customer's computer, but I have no good idea how to handle this. So I'm wondering if somebody has a good idea here. More info goes below.
What I am NOT asking
1). I'm NOT asking for a recommended C++ log library. We wrote a logger ourselves.
2). I'm NOT asking about what details (such as time stamp, thread ID, function name, etc.) should be written in a log message. Some suggestions can be found here.
What I have done in my software
I separate the log messages into 3 categories:
SYSTEM: Only log the important steps in my software. Example: an external invocation of an interface method of my software. The idea is that from these messages we can see what is generally happening in the software. There aren't many such messages.
ERROR: Only log the error situations, such as an ID is not found. There usually aren't many such messages.
INFO: Log the detailed steps running inside my software. For example, when an interface method is called, a SYSTEM log message is written as mentioned above, and the entire calling routine into the internal modules within the interface method is recorded with INFO messages. The idea is that these messages help us identify the detailed call stack for trouble-shooting or debugging. This is the source of the use-up-disk-space issue: there are always SO MANY INFO messages when the software is running normally.
My tries and thoughts
1). I tried not recording any INFO log messages. This resolves the disk space issue, but I also lose a lot of information for debugging. Think about this: my customer is in a different city and it's expensive to go there often. Besides, they use an intranet that is 100% inaccessible from outside. Therefore we can't always send engineers on-site as soon as they meet problems, and we can't start a remote debug session. Thus log files, I think, are the only thing we can use to figure out the root of the trouble.
2). Maybe I could make the logging strategy configurable at run time (currently it is set before the software runs), that is: at normal run time the software only records SYSTEM and ERROR logs; when a problem arises, somebody could change the logging configuration so the INFO messages get logged. But still: who would change the configuration at run time? Maybe we should train the software admin?
3). Maybe I could always turn the INFO message logging on but pack the log files into a compressed package periodically? Hmm...
Finally...
What is your experience in your projects/work? Any thoughts/ideas/comments are welcome!
EDIT
THANKS for all your effort!!! Here is a summary of the key points from all the replies below (and I'll give them a try):
1). Do not use large log files. Use relatively small ones.
2). Deal with the oldest ones periodically (either delete them, or zip them and move them to larger storage).
3). Implement run-time configurable logging strategy.
There are two important things to take note of:
Extremely large files are unwieldy. They are hard to transmit, hard to investigate, ...
Log files are mostly text, and text is compressible
In my experience, a simple way to deal with this is:
Only write small files: start a new file for a new session or when the current file grows past a preset limit (I have found 50 MB to be quite effective). To help locate the file in which the logs have been written, make the date and time of creation part of the file name.
Compress the logs, either offline (once the file is finished) or online (on the fly).
Put a cleaning routine in place: delete all files older than X days, or, whenever you reach more than 10, 20 or 50 files, delete the oldest (a rough sketch of this scheme appears at the end of this answer).
If you wish to keep the System and Error logs longer, you might duplicate them into a specific rotating file that only tracks them.
Putting it all together, this gives the following log folder:
Log/
info.120229.081643.log.gz // <-- older file (to be purged soon)
info.120306.080423.log // <-- complete (50 MB) file started at log in
(to be compressed soon)
info.120306.131743.log // <-- current file
mon.120102.080417.log.gz // <-- older mon file
mon.120229.081643.log.gz // <-- older mon file
mon.120306.080423.log // <-- current mon file (System + Error only)
Depending on whether you can schedule (cron) the cleanup task, you may simply spin up a thread for cleanup within your application. Whether you go with a purge date or a number of files limit is a choice you have to make, either is effective.
Note: from experience, a 50 MB file ends up weighing around 10 MB when compressed on the fly and less than 5 MB when compressed offline (on-the-fly compression is less efficient).
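A rough sketch of the scheme above: write to a timestamped file, roll over to a new one past a size limit, and purge the oldest files. The 50 MB limit and the Log/ directory mirror the example values in this answer; compression itself (gzip, zlib, ...) is left out.

#include <algorithm>
#include <cstdint>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

class RotatingLog {
public:
    explicit RotatingLog(fs::path dir, std::uintmax_t limit = 50 * 1024 * 1024)
        : dir_(std::move(dir)), limit_(limit) { open_new_file(); }

    void write(const std::string& line)
    {
        if (fs::exists(current_) && fs::file_size(current_) > limit_)
            open_new_file();                           // roll over past the size limit
        std::ofstream(current_, std::ios::app) << line << '\n';
    }

    // Keep only the `keep` newest files in the log directory.
    void purge_old(std::size_t keep = 20)
    {
        std::vector<fs::path> files;
        for (const auto& e : fs::directory_iterator(dir_))
            if (e.is_regular_file()) files.push_back(e.path());
        std::sort(files.begin(), files.end());         // timestamped names sort oldest first
        while (files.size() > keep) {
            fs::remove(files.front());
            files.erase(files.begin());
        }
    }

private:
    void open_new_file()
    {
        std::time_t t = std::time(nullptr);
        char stamp[32];
        std::strftime(stamp, sizeof stamp, "%y%m%d.%H%M%S", std::localtime(&t));
        fs::create_directories(dir_);
        current_ = dir_ / ("info." + std::string(stamp) + ".log");
    }

    fs::path dir_;
    fs::path current_;
    std::uintmax_t limit_;
};

Usage would be something like RotatingLog log("Log"); log.write("..."); with purge_old() called periodically from a cleanup thread or a cron-driven task, as discussed above.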
Your (3) is standard practice in the world of UNIX system logging.
When a log file reaches a certain age or maximum size, start a new one
Zip or otherwise compress the old one
Throw away the nth-oldest compressed log
One way to deal with it is to rotate log files.
Start logging into a new file once you reach a certain size, and keep the last couple of log files before you start overwriting the first one.
You will not have all possible info, but you will at least have some of what was happening leading up to the issue.
The logging strategy sounds unusual but you have your reasons.
I would
a) Make the level of detail in the log messages configurable at run time.
b) Create a new log file for each day. You can then get cron to either compress them and/or delete them, or perhaps transfer them to off-line storage.
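For (a), one common pattern on Linux is a global log level that the program re-reads when it receives SIGHUP, which an administrator can send with kill -HUP <pid>. A sketch, assuming hypothetical names and a made-up config file path:

#include <atomic>
#include <csignal>
#include <fstream>
#include <iostream>
#include <string>

enum Level { LVL_SYSTEM = 0, LVL_ERROR = 1, LVL_INFO = 2 };

std::atomic<int> g_level{LVL_ERROR};             // default: SYSTEM + ERROR only
volatile std::sig_atomic_t g_reload = 0;

extern "C" void on_sighup(int) { g_reload = 1; }

void maybe_reload()
{
    if (!g_reload) return;
    g_reload = 0;
    std::ifstream cfg("/etc/myapp/loglevel");    // file contains e.g. "2" for INFO
    int lvl;
    if (cfg >> lvl) g_level = lvl;
}

void log(Level lvl, const std::string& msg)
{
    maybe_reload();
    if (lvl <= g_level)
        std::cout << msg << '\n';                // real code writes to the log file
}

int main()
{
    std::signal(SIGHUP, on_sighup);
    log(LVL_SYSTEM, "service started");
    log(LVL_INFO,   "suppressed until the admin raises the level");
}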
My answer is to write long logs and then tease out the info you want.
Compress them on a daily basis - but keep them for a week
I like to log a lot. In some programs I've kept the last n lines in memory and written to disk in case of an error or the user requesting support.
In one program it would keep the last 400 lines in memory and save them to a logging database upon an error. A separate service monitored this database and sent an HTTP request containing summary information to a service at our office, which added it to a database there.
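A sketch of that "last n lines in memory" idea: keep a bounded deque of recent log lines and only flush it to disk when an error occurs (or the user asks for support). Sizes and the output path are illustrative.

#include <deque>
#include <fstream>
#include <string>

class CrashLog {
public:
    explicit CrashLog(std::size_t max_lines = 400) : max_(max_lines) {}

    void add(const std::string& line)
    {
        lines_.push_back(line);
        if (lines_.size() > max_) lines_.pop_front();   // drop the oldest line
    }

    // Called from the error path: write everything we have buffered.
    void dump(const std::string& path = "crash.log") const
    {
        std::ofstream out(path);
        for (const auto& l : lines_) out << l << '\n';
    }

private:
    std::size_t max_;
    std::deque<std::string> lines_;
};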
We had a program on each of our desktop machines that showed a list (updated by F5) of issues, which we could assign to ourselves and mark as processed. But now I'm getting carried away :)
This worked very well to help us support many users at several customers. If an error occurred on a PDA somewhere running our software then within a minute or so we'd get a new item on our screens. We'd often phone a user before they realised they had a problem.
We had a filtering mechanism to automatically process or assign issues that we knew we'd fixed or didn't care much about.
In other programs I've had hourly or daily files which are deleted after n days either by the program itself or by a dedicated log cleaning service.
My question seems to be simple, but Google is silent. Maybe I'm banned? :)
So the question is: can I check whether any file in a directory or its subdirectories is blocked from deletion before I delete it? Is there a simple way to do it?
No, there isn't.
And even if there was, it wouldn't work. Consider this sequence of events:
You perform the check and it succeeds (there are no blocked files).
Another process receives CPU quantum and opens a file without FILE_SHARE_DELETE flag.
Your process gains the CPU back and proceeds to delete the directory -- only to discover that it can't, because now there is a blocked file.
Alright, so to start: this is strictly for Windows, and I'd prefer to use C++ over .NET, but I'm not opposed to boost::filesystem, although if it can be avoided in favor of the straight Windows API I'd prefer that.
Now the scenario is that an application on another machine, which I can't change, is going to create files in a particular directory on my machine, and I need to make backups of those files and do some extra processing. Currently I've made a little application which sits and listens for change notifications in a target directory using the FindFirstChangeNotification and FindNextChangeNotification Windows APIs.
The problem is that while I can get notified when files in the directory are created, modified, change size, etc., it only notifies me once and does not tell me specifically which files changed. I've looked at ReadDirectoryChangesW as well, but it's the same story there, except that I can get slightly more specific information.
Now I can scan the directory and try to acquire locks or open the files to determine what specifically changed since the last notification and whether the files are available for further use. But in the case of copying a large file I've found this isn't good enough: the file won't be ready to be manipulated, and I won't get any other notifications after the first, so there is no way to tell when it's actually done copying unless, after the first notification, I continually try to acquire locks until one succeeds.
The only other thing I can think of that would be less hackish would be some kind of end-token file, but since I don't have control over the application creating the files in the first place, I don't see how I'd go about doing that, and it's still not ideal.
Any suggestions?
This is a fairly common problem and one that doesn't have an easy answer. Acquiring locks is one of the best options when you cannot change the thing at the remote end. Another I have seen is to watch the file at intervals until the size doesn't change for an interval or two.
Other strategies include writing a no-byte file as a trigger when the main file is complete and writing to a temp directory then moving the complete file to the real destination. But to be reliable, it must be the sender who controls this. As the receiver, you are constrained to watching the directory and waiting for the file to settle.
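A sketch of the "size doesn't change for an interval or two" heuristic, assuming C++17 std::filesystem; the interval and threshold are example values:

#include <chrono>
#include <cstdint>
#include <filesystem>
#include <thread>

namespace fs = std::filesystem;

// Returns true once the file size has stayed the same for `stable_checks`
// consecutive polls, false if the file vanishes or becomes unreadable.
bool wait_until_stable(const fs::path& file,
                       std::chrono::seconds interval = std::chrono::seconds(2),
                       int stable_checks = 2)
{
    std::error_code ec;
    std::uintmax_t last = fs::file_size(file, ec);
    int stable = 0;
    while (stable < stable_checks) {
        std::this_thread::sleep_for(interval);
        std::uintmax_t now = fs::file_size(file, ec);
        if (ec) return false;
        stable = (now == last) ? stable + 1 : 0;
        last = now;
    }
    return true;
}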
It looks like ReadDirectoryChangesW is going to be your best bet. For each file copy operation, you should receive FILE_ACTION_ADDED followed by a bunch of FILE_ACTION_MODIFIED notifications. On the last FILE_ACTION_MODIFIED notification, the file should no longer be locked by the copying process. So, if you try to acquire a lock after each FILE_ACTION_MODIFIED of the copy, it should fail until the copy completes. It's not a particularly elegant solution, but there don't seem to be any notifications available for when a file copy completes.
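The lock probe itself can be as simple as trying to open the file with no sharing allowed: while the copier still holds the file open, the open fails with a sharing violation; once it succeeds, the copy is done. A sketch of that probe (straight Win32, as preferred in the question):

#include <windows.h>

bool is_copy_finished(const wchar_t* path)
{
    HANDLE h = CreateFileW(path,
                           GENERIC_READ,
                           0,                        // no sharing: exclusive probe
                           nullptr,
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;                                // still locked (or gone)
    CloseHandle(h);
    return true;                                     // nobody else has it open
}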
You can process the data once the file is closed, right? So the task is to track when the file is closed. This can be done using a file system filter driver. You can write your own or you can use our CallbackFilter product.