What is your strategy for writing logs in your software to deal with a possibly HUGE amount of log messages? - c++

Thanks for your time and sorry for this long message!
My work environment
Linux, C/C++ (but I'm new to the Linux platform)
My question in brief
In the software I'm working on, we write a LOT of log messages to local files, which makes the files grow fast and eventually use up all the disk space (ouch!). We want these log messages for troubleshooting purposes, especially after the software is released to the customer site. I believe it's of course unacceptable to take up all the disk space of the customer's computer, but I have no good idea how to handle this. So I'm wondering if somebody has any good ideas here. More info goes below.
What I am NOT asking
1). I'm NOT asking for a recommended C++ log library. We wrote a logger ourselves.
2). I'm NOT asking about what details (such as time stamp, thread ID, function name, etc.) should be written in a log message. Some suggestions can be found here.
What I have done in my software
I separate the log messages into 3 categories:
SYSTEM: Only log the important steps in my software. Example: an external invocation of an interface method of my software. The idea is that from these messages we can see what is generally happening in the software. There aren't many such messages.
ERROR: Only log the error situations, such as an ID is not found. There usually aren't many such messages.
INFO: Log the detailed steps running inside my software. For example, when an interface method is called, a SYSTEM log message is written as mentioned above, and the entire calling routine into the internal modules within that interface method is recorded with INFO messages. The idea is that these messages help us identify the detailed call stack for troubleshooting or debugging. This is the source of the use-up-disk-space issue: there are always SO MANY INFO messages when the software is running normally.
My tries and thoughts
1). I tried not recording any INFO log messages. This resolves the disk space issue, but I also lose a lot of information for debugging. Think about this: my customer is in a different city and it's expensive to go there often. Besides, they use an intranet that is 100% inaccessible from outside. Therefore: we can't always send engineers on-site as soon as they meet problems, and we can't start a remote debug session. Thus log files, I think, are the only means we have to figure out the root of the trouble.
2). Maybe I could make the logging strategy configurable at run-time (currently it is set before the software runs), that is: at normal run-time, the software only records SYSTEM and ERROR logs; when a problem arises, somebody could change the logging configuration so the INFO messages get logged. But still: who could change the configuration at run-time? Maybe we should educate the software admin?
3). Maybe I could always leave INFO message logging on but pack the log files into a compressed archive periodically? Hmm...
Finally...
What is your experience in your projects/work? Any thoughts/ideas/comments are welcome!
EDIT
THANKS for all your effort!!! Here is a summary of the key points from all the replies below (and I'll give them a try):
1). Do not use large log files. Use relatively small ones.
2). Deal with the oldest ones periodically (either delete them, or zip them and move them to larger storage).
3). Implement a run-time configurable logging strategy (a minimal sketch of this idea follows below).
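For point 3), here is a rough sketch of what a run-time switch could look like on Linux. It is only an illustration under my own assumptions: the level variable, the function names and the use of SIGUSR1/SIGUSR2 are hypothetical, not part of the logger described above.

// Hypothetical sketch: toggle INFO logging at run time via signals,
// e.g. an admin runs `kill -USR1 <pid>` to enable INFO and `kill -USR2 <pid>` to disable it.
#include <atomic>
#include <csignal>
#include <cstdio>

enum LogLevel { LOG_SYSTEM = 0, LOG_ERROR = 1, LOG_INFO = 2 };

// Current verbosity; a lock-free atomic int is safe to store from a signal handler.
static std::atomic<int> g_log_level{LOG_ERROR};

extern "C" void on_sigusr1(int) { g_log_level.store(LOG_INFO); }
extern "C" void on_sigusr2(int) { g_log_level.store(LOG_ERROR); }

void install_log_level_switch()
{
    std::signal(SIGUSR1, on_sigusr1);
    std::signal(SIGUSR2, on_sigusr2);
}

void log_info(const char* msg)
{
    if (g_log_level.load() >= LOG_INFO)      // cheap check; INFO is skipped entirely when off
        std::fprintf(stderr, "INFO: %s\n", msg);
}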

There are two important things to take note of:
Extremely large files are unwieldy. They are hard to transmit, hard to investigate, ...
Log files are mostly text, and text is compressible
In my experience, a simple way to deal with this is:
Only write small files: start a new file for a new session or when the current file grows past a preset limit (I have found 50 MB to be quite effective). To help locate the file in which the logs have been written, make the date and time of creation part of the file name.
Compress the logs, either offline (once the file is finished) or online (on the fly).
Put a cleaning routine in place: delete all files older than X days, or whenever you reach more than 10, 20 or 50 files, delete the oldest.
If you wish to keep the System and Error logs longer, you might duplicate them in a specific rotating file that only tracks them.
Put together, this gives the following log folder:
Log/
info.120229.081643.log.gz // <-- older file (to be purged soon)
info.120306.080423.log // <-- complete (50 MB) file started at log in (to be compressed soon)
info.120306.131743.log // <-- current file
mon.120102.080417.log.gz // <-- older mon file
mon.120229.081643.log.gz // <-- older mon file
mon.120306.080423.log // <-- current mon file (System + Error only)
Depending on whether you can schedule (cron) the cleanup task, you may simply spin up a thread for cleanup within your application. Whether you go with a purge date or a number of files limit is a choice you have to make, either is effective.
Note: from experience, a 50 MB file ends up weighing around 10 MB when compressed on the fly and less than 5 MB when compressed offline (on-the-fly compression is less efficient).
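To make the rotation part concrete, here is a sketch in C++17 of the scheme above (50 MB limit, creation time in the file name, purge beyond a file-count limit). The class name and details are my own illustration, not production code, and compression is left to an external step.

// Sketch of size-based rotation with time-stamped file names (C++17 <filesystem>).
#include <algorithm>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

class RotatingLog {
public:
    explicit RotatingLog(fs::path dir,
                         std::size_t max_bytes = 50 * 1024 * 1024,
                         std::size_t max_files = 20)
        : dir_(std::move(dir)), max_bytes_(max_bytes), max_files_(max_files) { open_new_file(); }

    void write(const std::string& line) {
        out_ << line << '\n';
        bytes_ += line.size() + 1;
        if (bytes_ >= max_bytes_) {              // past the preset limit: start a new file
            out_.close();
            purge_old_files();
            open_new_file();
        }
    }

private:
    void open_new_file() {
        // Put the creation date/time in the name, e.g. info.120306.131743.log
        std::time_t t = std::time(nullptr);
        char stamp[32];
        std::strftime(stamp, sizeof(stamp), "%y%m%d.%H%M%S", std::localtime(&t));
        out_.open(dir_ / ("info." + std::string(stamp) + ".log"));
        bytes_ = 0;
    }

    void purge_old_files() {
        // Keep at most max_files_ "info.*" files; delete the oldest ones beyond that.
        std::vector<fs::path> logs;
        for (const auto& entry : fs::directory_iterator(dir_))
            if (entry.path().filename().string().rfind("info.", 0) == 0)
                logs.push_back(entry.path());
        std::sort(logs.begin(), logs.end());     // the time stamp in the name sorts oldest first
        while (logs.size() > max_files_) {
            fs::remove(logs.front());
            logs.erase(logs.begin());
        }
    }

    fs::path dir_;
    std::size_t max_bytes_;
    std::size_t max_files_;
    std::size_t bytes_ = 0;
    std::ofstream out_;
};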

Your (3) is standard practice in the world of UNIX system logging.
When a log file reaches a certain age or maximum size, start a new one
Zip or otherwise compress the old one
Throw away the nth oldest compressed log

One way to deal with it is to rotate log files.
Start logging into a new file once you reach a certain size, and keep the last couple of log files before you start overwriting the oldest one.
You will not have all possible info but you will have at least some stuff leading up to the issue.
The logging strategy sounds unusual but you have your reasons.

I would
a) Make the level of detail in the log messages configurable at run time.
b) Create a new log file for each day. You can then get cron to compress and/or delete them, or perhaps transfer them to off-line storage.

My answer is to write detailed logs and then tease out the info you want.
Compress them on a daily basis - but keep them for a week

I like to log a lot. In some programs I've kept the last n lines in memory and written them to disk in case of an error or the user requesting support.
In one program it would keep the last 400 lines in memory and save them to a logging database upon an error. A separate service monitored this database and sent an HTTP request containing summary information to a service at our office, which added it to a database there.
We had a program on each of our desktop machines that showed a list (updated by F5) of issues, which we could assign to ourselves and mark as processed. But now I'm getting carried away :)
This worked very well to help us support many users at several customers. If an error occurred on a PDA somewhere running our software then within a minute or so we'd get a new item on our screens. We'd often phone a user before they realised they had a problem.
We had a filtering mechanism to automatically process or assign issues that we knew we'd fixed or didn't care much about.
In other programs I've had hourly or daily files which are deleted after n days either by the program itself or by a dedicated log cleaning service.
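For the keep-the-last-N-lines-in-memory idea above, a minimal sketch could look like the following. The class name and the 400-line capacity are just illustrative assumptions; the database/HTTP reporting described in this answer is out of scope here.

// Sketch: keep the last N log lines in memory; write them out only when an error occurs.
#include <deque>
#include <fstream>
#include <mutex>
#include <string>

class CrashTrace {
public:
    explicit CrashTrace(std::size_t capacity = 400) : capacity_(capacity) {}

    void add(const std::string& line) {
        std::lock_guard<std::mutex> lock(mutex_);
        lines_.push_back(line);
        if (lines_.size() > capacity_)
            lines_.pop_front();                  // drop the oldest line
    }

    // Called from the error path (or a "send support info" action).
    void dump(const std::string& path) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::ofstream out(path, std::ios::app);
        for (const auto& line : lines_)
            out << line << '\n';
    }

private:
    std::size_t capacity_;
    std::deque<std::string> lines_;
    std::mutex mutex_;
};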

Related

File modification time gets overwritten by background cache flushing

I have code that performs the following steps:
open file
write data
set file timestamps (via SetFileInformationByHandle(FileBasicInfo))
close file
When the file is stored on certain NAS devices (and accessed via a share), its modification time ends up being set to the current time.
According to Process Monitor, the Close() in step 4 results in a Write (the local cache gets flushed/pushed to the NAS device) that (seemingly) updates the file's mtime on the server.
If I add FlushFileBuffers() (or sleep for a few seconds) between steps 2 and 3 -- everything is fine.
Is this a bug in the SMB implementation of this NAS device (Dell EMC Isilon), or did SetFileInformationByHandle() never promise anything in the first place?
What is the best way to deal with this situation? I would really like to avoid having to call FlushFileBuffers()...
Edit: Great... :-/ It looks like for executables (and only executables) atime (last access time) gets screwed up too (in the same way). Only these are harder to reproduce -- you need to run this logic a few times. Could be some antivirus... I am still investigating.
Edit 2: According to procmon, the access time gets updated by EXPLORER.EXE -- when it sees an executable, it can't resist opening it and reading portions of it (probably extracting the icon).
You can't really do anything -- I guess Isilon's SMB implementation doesn't support certain things (that would've preserved timestamps).
I simply added FlushFileBuffers() before SetFileInformationByHandle() and made sure there are no related race conditions in my code.
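For reference, the workaround boils down to ordering the calls like this. It is a sketch with error handling trimmed; the function name and the caller-supplied FILE_BASIC_INFO are placeholders, not the asker's actual code.

// Sketch: flush the write-behind cache before setting the timestamps, so the
// server-side write triggered at Close() cannot clobber the mtime afterwards.
#include <windows.h>

bool WriteWithTimestamps(const wchar_t* path, const void* data, DWORD size,
                         FILE_BASIC_INFO timestamps)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    DWORD written = 0;
    bool ok = WriteFile(h, data, size, &written, nullptr) != 0;

    // Push any cached data to the server *before* touching the timestamps.
    ok = ok && FlushFileBuffers(h) != 0;

    ok = ok && SetFileInformationByHandle(h, FileBasicInfo,
                                          &timestamps, sizeof(timestamps)) != 0;
    CloseHandle(h);
    return ok;
}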

Manipulating large files that are being used

I want to retrieve handles of very large files (hundreds of GB to x TB) that are being used by some processes. I am thinking about turning off the running processes for a while, then copying the files to some specific location. But this approach looks clumsy for a couple of reasons.
Files are too large, so copying them from place to place takes time on different disk types.
After the file copy process is done, I have to turn the stopped processes back on. But what if my users need to load/copy other large files whose handles those processes are controlling? I have to stop them again. I don't want to do this because they have to do many critical tasks on my machine.
So I have 2 questions:
Please tell me whether my approach is wrong. What I say above is only my personal idea; no coding has been done yet for any of it.
Are there any methods to clone ~50 large files (1-5 TB) fast (some tens of seconds or so) and silently in the background?
If the original process opens the file in non-sharing mode (which seems likely for such large files), you are probably out of luck doing this without closing that process - or at least getting it to relinquish the file. If it allows at least read sharing, I would suggest you use transactional NTFS - despite the warnings in all the docs about it.
Create a KTM transaction manager (via CreateTransactionManager), create a KTM transaction (with CreateTransaction), then use CopyFileTransacted to do the actual copy. Finally, commit the transaction (CommitTransaction) and then close all the handles. Doing it this way will ensure that the file is in a consistent state (no partial writes from the original process).
It may also be that the backup API can ignore share mode (I know it can ignore security checks if your process has the appropriate privileges enabled, not sure about sharing).
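A rough sketch of that sequence follows (error handling mostly omitted). Note that it links against KtmW32.lib, skips the explicit transaction manager (the default one is used), and that Microsoft discourages new use of TxF, so treat it as illustrative only.

// Sketch: copy a file inside a KTM transaction so the destination appears all-or-nothing.
#include <windows.h>
#include <ktmw32.h>   // CreateTransaction / CommitTransaction (link with KtmW32.lib)

bool TransactedCopy(const wchar_t* src, const wchar_t* dst)
{
    HANDLE tx = CreateTransaction(nullptr, nullptr, 0, 0, 0, 0, nullptr);
    if (tx == INVALID_HANDLE_VALUE) return false;

    BOOL cancel = FALSE;
    bool ok = CopyFileTransactedW(src, dst, nullptr, nullptr, &cancel,
                                  COPY_FILE_FAIL_IF_EXISTS, tx) != 0;

    if (ok) ok = CommitTransaction(tx) != 0;   // without a commit, the copy is rolled back
    CloseHandle(tx);
    return ok;
}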

How do I debug lower level File access exceptions/crashes in C++ unmanaged code?

I'm currently working on trying to resolve a crash/exception on an unmanaged C++ application.
The application crashes with some predictability. The program basically processes a high volume of files combined with running a bunch of queries through the Access DB.
It's definitely occurring during a file access. The error message is:
"failed reading. Network name is no longer available."
It always seems to be crashing in the same lower level file access code.
It's doing a lower level library Seek(), then a Read(). The exception occurs during the read.
To further complicate things, we can only get the errors to occur when we're running a disk balancing utility. The utility essentially examines file access history and moves more frequently/recently used files to faster storage, while files that are used less frequently are moved to a slower retrieval area. I don't fully understand the architecture of this particular storage device, but essentially it's got an area for "fast" retrieval and one for "archived/slower" retrieval.
The issues are more easily/predictably reproducible when the utility app is started and stopped several times. According to the disk manufacturer, we should be able to run the utility in the background without affecting the behaviour of the client's main application.
Any suggestions on how to proceed here? There are theories floating around here that it's somehow related to latency on the storage device. Is there a way to prove/disprove that? We've written a small sample app that basically goes out and accesses/reads a whole mess of files on the drive. We've (so far) been unable to reproduce the issue, even running with SmartPools. My thought for testing the latency theory is to have multiple apps reading volumes of files from disk while running the utility application.
The memory usage and CPU usage do not look out of line in the Task Manager.
Thoughts? This is turning into a bit of a hairball.
Thanks,
JohnB
Grab your debug binaries.
Set up Application Verifier and add your application to its list.
Hopefully wait for a crash dump.
Put that through WinDBG.
Try command: !avrf
See what you get....

Running out of file descriptors for mmaped files despite high limits in multithreaded web-app

I have an application that mmaps a large number of files. 3000+ or so. It also uses about 75 worker threads. The application is written in a mix of Java and C++, with the Java server code calling out to C++ via JNI.
It frequently, though not predictably, runs out of file descriptors. I have upped the limits in /etc/security/limits.conf to:
* hard nofile 131072
/proc/sys/fs/file-max is 101752. The system is a Linode VPS running Ubuntu 8.04 LTS with kernel 2.6.35.4.
Opens fail from both the Java and C++ bits of the code after a certain point. Netstat doesn't show a large number of open sockets ("netstat -n | wc -l" is under 500). The number of open files in either lsof or /proc/{pid}/fd is about the expected 2000-5000.
This has had me grasping at straws for a few weeks (not constantly, but in flashes of fear and loathing every time I start getting notifications of things going boom).
There are a couple of other loose threads that have me wondering whether they offer any insight:
Since the process has about 75 threads, if the mmaped files were somehow taking up one file descriptor per thread, then the numbers would add up. That said, doing a recursive count of the entries in /proc/{pid}/tasks/*/fd currently lists 215575 fds, so it would seem that it should already be hitting the limits and it's not, so that seems unlikely.
Apache + Passenger are also running on the same box, and come in second for the largest number of file descriptors, but even with children none of those processes weigh in at over 10k descriptors.
I'm unsure where to go from there. Obviously something's making the app hit its limits, but I'm completely blank for what to check next. Any thoughts?
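One hedged suggestion for the "what to check next" question: confirm, from inside the process itself, which limit it actually ended up with; /etc/security/limits.conf is applied through PAM, so a daemon that is not started via a login session may not pick the new values up. A tiny check could look like this:

// Sketch: print the file-descriptor limit as seen by this process (RLIMIT_NOFILE).
#include <sys/resource.h>
#include <cstdio>

void print_fd_limit()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        std::printf("nofile soft=%llu hard=%llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
}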
So, from all I can tell, this appears to have been an issue specific to Ubuntu 8.04. After upgrading to 10.04, after one month, there hasn't been a single instance of this problem. The configuration didn't change, so I'm led to believe that this must have been a kernel bug.
Your setup uses a huge chunk of code that may be guilty of leaking too: the JVM. Maybe you can switch between the Sun and the open-source JVMs as a way to check whether that code is by chance guilty. Also, there are different garbage collector strategies available for the JVM. Using a different one, or different heap sizes, will cause more or fewer garbage collections (which in Java include the closing of descriptors).
I know it's kind of far-fetched, but it seems like you have already followed all the other options ;)

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seem to be when the program is shutting down or crashes, but I'm concerned that this isn't the only time something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have the most up-to-date data saved, so one idea someone mentioned was to alternately write to two log files; then if the program crashes, at least one will still have proper integrity. But this doesn't smell right to me, as I haven't really seen any other application use this method.
Are there any "best practises" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can just roll back by deleting the temp file, and the original will be untouched.
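A sketch of that temp-file-and-rename plan, assuming a POSIX platform where rename() atomically replaces the destination on the same filesystem (the hash check from the list above is left out; the function name is just illustrative):

// Sketch: write to a temp file, flush it to disk, then atomically swap it into place.
#include <cstdio>
#include <fcntl.h>     // open
#include <fstream>
#include <string>
#include <unistd.h>    // fsync, close

bool save_atomically(const std::string& path, const std::string& contents)
{
    const std::string tmp = path + ".tmp";    // hypothetical naming convention
    {
        std::ofstream out(tmp, std::ios::trunc | std::ios::binary);
        out << contents;
        if (!out.flush()) return false;       // data has at least reached the OS buffers
    }
    // Make sure the bytes hit the disk before the rename makes the file "official".
    int fd = ::open(tmp.c_str(), O_RDONLY);
    if (fd < 0) return false;
    bool synced = (::fsync(fd) == 0);
    ::close(fd);
    if (!synced) return false;

    // rename() replaces the old file in one step; on failure the old file is untouched.
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}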
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). But the app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code which would surprise me).
My guess is that the app is multi-threaded and the logging code is being called from several threads, which can easily lead to the data being corrupted before it is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use logfile rollover, i.e. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (like cerr in a C++ program) and live with the (in my experience very occasional) snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using the Active Object pattern (PDF) to put a queue in front of the log and make all writes happen on a single thread. That thread can commit the log in the background. All log writes will be asynchronous, and in order, but not necessarily written immediately.
The active object can also batch writes.
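A condensed sketch of that active-object idea: a background thread owns the file and drains a queue, so writes from other threads never interleave. The class name is illustrative, not a specific framework.

// Sketch: all log writes are queued; one background thread owns the file and drains the queue.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path, std::ios::app), worker_(&AsyncLogger::run, this) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();                          // drains and flushes remaining messages
    }

    void log(std::string line) {                 // called from any thread, returns quickly
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(line));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {            // batch: drain everything that is queued
                std::string line = std::move(queue_.front());
                queue_.pop();
                lock.unlock();
                out_ << line << '\n';            // only this thread ever touches the file
                lock.lock();
            }
            if (done_) { out_.flush(); return; }
        }
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;
};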