MRTG ThreshProgI/O stopped working after switching to rrdtool

I was using MRTG to log usage data to .log files, and I used 'ThreshProgI' to get threshold alerts; it was working fine.
I then changed the logging format to rrdtool, and now my system logs all usage to .rrd files without any issues.
The problem is that it no longer sends threshold alerts. I have tested it with a low 'ThreshMaxI' value, but the 'ThreshProgI' script never runs.
Has anyone faced this issue?

Have you defined a ThreshDir that is writeable by the MRTG process?
To make Thresholds work with MRTG/RRD, you need to have a writeable ThreshDir defined, as well as your ThreshProgI[] and ThreshMaxI[]. If the directory is not defined, or is not writeable, thresholds do not get checked.
E.g.:
ThreshDir: /u01/rrdtool/thresholds
ThreshMaxO[offset]: 5
ThreshProgO[offset]: /usr/local/bin/notify

Why is build time of local application affected by network?

Building an XPages application containing several JARs, Java sources, and ~50 XP/CC elements takes about a minute on a server over the WAN. I replicated the application locally, and the build time dropped to ~10s.
A few days ago, building the local application became extremely slow, about 2-5 minutes. After some experiments I found a workaround: disabling the TCP port in the location document drops build times to just a few seconds. Even though it works, it does not help much - testing requires the user to be authenticated, so I need to replicate design changes to a remote or local server, and that means changing location (online/offline) every time.
UPDATE 2013-04-04: I duplicated my current location document and removed the home and directory servers. To my surprise, with this location, build times went back to a few seconds - with the TCP port enabled, so replication is possible. The bigger surprise was that returning the home/directory servers to the new location did not reproduce the problem - in fact, they do not affect performance. I know this because I renamed the current location document and everything went back to normal. From my understanding, "something" in the client configuration was tied to the location name. Thanks to Simon's tips, I will investigate further.
The question is still open: I am looking for some (Eclipse) preference controlling this behavior - unintended communication with the server during the build of a local application.
Solution:
Teamstudio CIAO hooks into Designer and checks every update of a design element. It seems like a lack of code optimization to me: it checks whether each design element being built (every single one, one by one) should be controlled in the CIAO config database.
This explains why the problem was solved by renaming the location document. I was disappointed yesterday when the performance problems started again. Fortunately, I recalled setting up CIAO for that location document around that time. CIAO uses the teamstudio.ini file in the DATA directory to configure which CIAO configuration database is used for each location document. Look for the entry:
CIAOConfigDb[location name]=server name;CIAO\CIAOConfig.nsf
For development on local replicas with a connection to the server (for replication or a local server), use a location document with CIAO disabled.
This works only with the property ForceConfigLocation=0.
Not a solution (yet!), but may help in the investigation. I'll update further if you post results later.
Debug instructions.
Add the following to the shortcut that launches the Designer client.
-RPARAMS -console -debug -separateSysLogFiles -consoleLog
Start the designer client. This will also open up the OSGi console.
Reproduce the issue. While it is still in progress, type the following in the OSGi console:
dump threads
Do this three times, with a small amount of time between each dump. Once done, open the three dumps (in the IBM_TECHNICAL_SUPPORT folder) in the Heap Dump Analyser.
It will show you which threads are consistent through all three dumps. Take a look at those and look for package names/calls that appear to belong to a functional area. Once you have that, you can try adding debug logging for the related classes.
For example: let's say you notice "com.ibm.designer.domino.ui.commons." in the threads; then you would edit the rcpinstall.properties file. It will be in:
<Notes Install>\Data\workspace\.config\rcpinstall.properties
and you would add (start with FINE, then FINEST if nothing shows up):
com.ibm.designer.domino.ui.commons.level=FINE
Now when you restart the Designer client, it will generate debug output for that package in the workspace\logs folder. You then need to go through the trace logs, looking for the time when the delay occurred, and check whether they reference any related design elements.
Other open applications may get built at the same time (which looks like a bug to me). Be sure to close all other applications and the server-based replica. Open applications have their icons showing in the application list, and they stay open even if you close and reopen Designer. In Designer 9, right-click the application and select "Close Application". In 8.5 you need to use Package Explorer to close it.
Another good way is to use Working Sets. Only applications in the open Working Set will be built (AFAIK). Have a Working Set with this one app only (and the app only in this Working Set).
update 1
If these don't help, I would delete/rename bookmark.nsf, Cache.NDK, and desktop8.ndk. Then open just this one app and see what happens.
update 2
Check that there are no referenced projects. Right-click the application and select "Project Properties". From there, open "Project References" and make sure no checkboxes are checked.
update 3
Based on your update, I would check the item names starting with $ in the location document. Sometimes there are saved IP addresses etc. that could cause this problem. All those items can be removed.
If possible (and if you are not using it yet), try version 9 of Domino Designer (you do not have to use Domino 9 to do that - it works fine with Domino 8.5.3).
For our projects, build times went down from a few minutes to only a few seconds. I guess IBM finally noticed that the build process relied heavily on the connection to the server and did something about it.
With the new Designer you don't even have to replicate to local; you can work directly on your local server.

Monitoring file changes in C++ on Linux

I am working on Linux, and I have a directory with subdirectories that contain files.
I have to monitor changes to these files. In C++, I am using Boost: every 30 seconds I go through all the directories and check last_write_time. In principle it works, but every time this runs, my CPU usage spikes - top shows 15%-25% for this program alone. I have read about inotify. If I used inotify, would the CPU usage stay more or less the same, or would it improve? Is there a good alternative to what I am doing?
When you use inotify, you do not need to poll all files to check for changes. You get a callback system that notifies you when a watched file or directory changes.
The kernel/filesystem already has this information, so the resource/CPU usage is not just moved into another application; it is actually reduced.
The article "Monitor file system activity with inotify" explains in more detail why to use inotify, shows its basic usage, and helps you set it up.
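To make that concrete, here is a minimal sketch of the raw inotify API (the watched path is a placeholder; error handling is trimmed). Note that inotify watches are per-directory and not recursive, so with nested subdirectories you still add one watch per directory:

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = inotify_init();  // one inotify instance
    if (fd < 0) { perror("inotify_init"); return 1; }

    // Watch a single directory for modify/create/delete events.
    int wd = inotify_add_watch(fd, "/path/to/dir",
                               IN_MODIFY | IN_CREATE | IN_DELETE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    alignas(inotify_event) char buf[4096];
    for (;;) {
        // read() blocks until events arrive - no periodic scanning,
        // so the process uses no CPU while nothing changes.
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) break;
        for (char* p = buf; p < buf + len; ) {
            const inotify_event* ev = reinterpret_cast<inotify_event*>(p);
            if (ev->len) std::printf("event on: %s\n", ev->name);
            p += sizeof(inotify_event) + ev->len;
        }
    }
    close(fd);
}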

Jython 2.5.3 and time.sleep

I'm developing a small in-house alternative to Tripwire, so I've coded a small script that hashes files on a JBoss EAP server and stores the path and the hash in a MySQL database.
Every day the script compares the hashes in the filesystem with those saved in the DB, so any change is logged and finally reported using JasperServer.
The script runs at night using cron. To avoid a large number of scripts querying the DB at the same time, it calls time.sleep(RANDOM_NUMBER_OF_SECONDS) before doing the fun stuff. But sometimes time.sleep seems to sleep forever and the script ends without any error; I checked the mail cron sends and no error is logged. Any help would be appreciated. I'm using jython-standalone-2.5.3, IBM's JDK, and RHEL 5.6 running inside VMware.
I just found http://bugs.jython.org/issue1974 and a code comment seems to point that OS signals can cause this behavior, but not sure if this is my case.
If you want to see the code, check it out at http://code.google.com/p/pysnapshot/
Luis García Bustos.
I don't know why you think time.sleep() would reduce the number of scripts querying the DB.
IMO it is better to use cron to call the program periodically. After it starts, it should check whether there is a "semaphore" file in the /tmp/ directory, for example /tmp/snapshot_working.txt. If there is no semaphore file, create it and write something like "snapshot started: 2012-12-05 22:00:00" into it. After the program completes its checks, it should remove the file. If at startup the program finds the semaphore file, it can either just stop, or check whether the date and time saved in the file look "old". If they are "old", remove the file and start normally, logging that an "old" file was found (the administrator can then find such long-running snapshots and terminate them).
The only reason to use time.sleep() in your case is if you want to run such a script during normal working hours without mounting a denial-of-service attack on your own DB. For example: after making 100 DB queries, you could sleep a little to give the DB time to serve other users' queries. But I think the sooner the program finishes, the better.

How Can I Increase the Timeout for Visual Studio Tests?

I'm working on a pretty sizable suite of tests for some code I'm writing (in Visual Studio 2012). For the most part, running the unit tests is no big deal, but I'm also including a lot of integration tests that have more external infrastructure dependencies. The number of tests, combined with resetting the infrastructure dependencies between tests, has resulted in a rather lengthy run for the full suite (around 45 minutes at the moment).
Running the tests is no big deal. Unit tests will be run on check-in, integration tests nightly. However, I'm running into an issue when trying to analyze code coverage for all of the tests. No code coverage results are created, and the output window says the following:
This request operation sent to net.pipe://megara/vstest.discoveryengine/14108 did not receive a reply within the configured timeout (00:30:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.
I'm not sure where it's directing me here. I don't use IContextChannel for anything; all of the test running is built into Visual Studio. So I don't really know where or how I can increase any kind of timeout. Does anybody know where I should look?
Try changing the time-out values in your solution's .testsettings file.
If you don't have one, you can add it to the solution via right-click on the solution -> Add New Item -> Test Settings. In there you can set time-outs for individual tests (the default is 30 minutes), or set the timeout for an entire test run.
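For reference, the relevant fragment of a .testsettings file looks roughly like this - values are in milliseconds, and the exact attributes here are from memory, so verify against a file generated by Visual Studio:

<?xml version="1.0" encoding="UTF-8"?>
<TestSettings name="LongRunning"
              xmlns="http://microsoft.com/schemas/VisualStudio/TeamTest/2010">
  <Execution>
    <!-- runTimeout caps the whole run; testTimeout caps each single test -->
    <Timeouts runTimeout="7200000" testTimeout="3600000" />
  </Execution>
</TestSettings>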
It's not clear if this is the root cause or not, but it is worth ruling out.
Old topic, but perhaps some new info: I didn't have any luck setting the timeout in the .testsettings file with Visual Studio 2015. No matter what I set it to, my tests would stop after 30 minutes.
There is now a [Timeout(milliseconds)] attribute that can be applied to individual test methods. This is even better than .testsettings, since you can fine-tune individual tests to make sure they aren't taking longer than expected.
Unfortunately, while a .testsettings file was in use, I could not get this attribute to have any effect for values higher than 30 minutes, even if the .testsettings file itself defined a higher timeout. Values lower than 30 minutes were honored, but higher values would still stop at 30 minutes regardless of what the test settings said.
After I removed the .testsettings file, the timeout attributes seem to be working as expected - the test will run up to whatever timeout I set it to.
If you have trouble getting the timeout attribute to work, try removing .testsettings.

What is your strategy to write logs in your software to deal with possible HUGE amount of log messages?

Thanks for your time and sorry for this long message!
My work environment
Linux, C/C++ (but I'm new to the Linux platform)
My question in brief
In the software I'm working on, we write a LOT of log messages to local files, which makes them grow fast and eventually use up all the disk space (ouch!). We want these log messages for troubleshooting purposes, especially after the software has been released to the customer site. It is of course unacceptable to take up all the disk space on the customer's computer, but I have no good idea how to handle this, so I'm wondering if somebody has any good ideas. More info below.
What I am NOT asking
1). I'm NOT asking for a recommended C++ log library. We wrote a logger ourselves.
2). I'm NOT asking about what details (such as time stamp, thread ID, function name, etc.) should be written in a log message. Some suggestions can be found here.
What I have done in my software
I separate the log messages into 3 categories:
SYSTEM: Only logs the important steps in the software. Example: an outer invocation of an interface method of my software. The idea is that from these messages we can see what is generally happening in the software. There aren't many such messages.
ERROR: Only logs error situations, such as an ID not being found. There usually aren't many such messages.
INFO: Logs the detailed steps running inside the software. For example, when an interface method is called, a SYSTEM log message is written as mentioned above, and the entire call path into the internal modules within the interface method is recorded with INFO messages. The idea is that these messages help us identify the detailed call path for troubleshooting or debugging. This is the source of the use-up-disk-space issue: there are always SO MANY INFO messages when the software is running normally.
My tries and thoughts
1). I tried not recording any INFO log messages. This resolves the disk-space issue, but I also lose a lot of information for debugging. Think about this: my customer is in a different city, and it's expensive to go there often. Besides, they use an intranet that is 100% inaccessible from outside. Therefore we can't always send engineers on-site as soon as they hit problems, and we can't start a remote debug session. So log files are, I think, the only means we can use to figure out the root of the trouble.
2). Maybe I could make the logging strategy configurable at run time (currently it is fixed before the software runs), that is: at normal run time, the software only records SYSTEM and ERROR logs; when a problem arises, somebody could change the logging configuration so that INFO messages are logged too. But still: who would change the configuration at run time? Maybe we should educate the software admin?
3). Maybe I could leave INFO message logging always on, but pack the log files into a compressed package periodically? Hmm...
Finally...
What is your experience in your projects/work? Any thoughts/ideas/comments are welcome!
EDIT
THANKS for all your effort!!! Here is a summary of the key points from the replies below (and I'll give them a try):
1). Do not use large log files; use relatively small ones.
2). Deal with the oldest ones periodically (either delete them, or zip them and move them to larger storage).
3). Implement a run-time configurable logging strategy.
There are two important things to take note of:
Extremely large files are unwieldy. They are hard to transmit, hard to investigate, ...
Log files are mostly text, and text is compressible
In my experience, a simple way to deal with this is:
Only write small files: start a new file for a new session or when the current file grows past a preset limit (I have found 50 MB to be quite effective). To help locate the file in which the logs have been written, make the date and time of creation part of the file name.
Compress the logs, either offline (once the file is finished) or online (on the fly).
Put a cleaning routine in place: delete all files older than X days, or whenever you reach more than 10, 20, or 50 files, delete the oldest.
If you wish to keep the System and Error logs longer, you might duplicate them into a specific rotating file that tracks only them.
Put together, this gives the following log folder:
Log/
info.120229.081643.log.gz // <-- older file (to be purged soon)
info.120306.080423.log // <-- complete (50 MB) file started at login (to be compressed soon)
info.120306.131743.log // <-- current file
mon.120102.080417.log.gz // <-- older mon file
mon.120229.081643.log.gz // <-- older mon file
mon.120306.080423.log // <-- current mon file (System + Error only)
If you cannot schedule the cleanup task externally (e.g. with cron), you can simply spin up a cleanup thread within your application. Whether you go with a purge date or a limit on the number of files is a choice you have to make; either is effective.
Note: from experience, a 50 MB file ends up weighing around 10 MB when compressed on the fly, and less than 5 MB when compressed offline (on-the-fly compression is less efficient).
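A minimal sketch of the size-based rotation described above, assuming a C++11 compiler; the 50 MB limit matches the answer, while the class name and file-name pattern are just illustrative:

#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical rotating writer: starts a new file once the current one
// exceeds kMaxBytes, embedding the creation date/time in the file name
// so the right file is easy to locate later.
class RotatingLog {
    static const long kMaxBytes = 50L * 1024 * 1024;  // ~50 MB per file
    std::FILE* file_ = nullptr;
    long written_ = 0;

    void open_new() {
        if (file_) std::fclose(file_);
        char name[64];
        std::time_t now = std::time(nullptr);
        // Same naming scheme as the folder listing: info.YYMMDD.HHMMSS.log
        std::strftime(name, sizeof(name), "Log/info.%y%m%d.%H%M%S.log",
                      std::localtime(&now));
        file_ = std::fopen(name, "w");  // assumes the Log/ directory exists
        written_ = 0;
    }

public:
    RotatingLog() { open_new(); }
    ~RotatingLog() { if (file_) std::fclose(file_); }

    void write(const std::string& line) {
        if (written_ >= kMaxBytes) open_new();  // rotate past the limit
        written_ += std::fprintf(file_, "%s\n", line.c_str());
        std::fflush(file_);
    }
};

Compressing the finished file and purging old ones can then be left to a background thread or an external scheduler, as discussed above.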
Your (3) is standard practice in the world of UNIX system logging.
When a log file reaches a certain age or maximum size, start a new one.
Zip or otherwise compress the old one.
Throw away the nth-oldest compressed log.
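On Linux you usually get this for free from the stock logrotate tool. A sketch of a logrotate configuration implementing that scheme (the log path is a placeholder):

/var/log/myapp/*.log {
    daily            # rotate once a day; "size 50M" is an alternative trigger
    rotate 7         # keep only the 7 most recent rotations
    compress         # gzip the rotated files
    missingok        # no error if the log file is absent
    notifempty       # skip rotation when the file is empty
}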
One way to deal with it is to rotate log files.
Start logging into a new file once you reach a certain size, and keep the last couple of log files before you start overwriting the first one.
You will not have all possible info, but you will at least have the events leading up to the issue.
The logging strategy sounds unusual but you have your reasons.
I would
a) Make the level of detail in the log messages configurable at run time (see the sketch below).
b) Create a new log file for each day. You can then get cron to compress and/or delete them, or perhaps transfer them to off-line storage.
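For (a), a minimal sketch of a log level that can be changed while the program runs; raising verbosity on SIGHUP is just one illustrative trigger, not something from the answer:

#include <atomic>
#include <csignal>
#include <cstdio>

enum Level { SYSTEM = 0, ERROR = 1, INFO = 2 };

// Current verbosity; std::atomic so it can be changed safely from a
// signal handler or another thread while logging is in progress.
std::atomic<int> g_level(ERROR);

// Toggle INFO logging on SIGHUP, e.g. the admin runs `kill -HUP <pid>`.
void on_sighup(int) {
    g_level = (g_level == INFO) ? ERROR : INFO;
}

void log(Level level, const char* msg) {
    if (level > g_level) return;  // suppressed at the current verbosity
    std::fprintf(stderr, "%s\n", msg);
}

int main() {
    std::signal(SIGHUP, on_sighup);
    log(SYSTEM, "service started");               // always written
    log(INFO, "hidden until verbosity is raised");
}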
My answer is to write long logs and then tease out the info you want.
Compress them on a daily basis - but keep them for a week
I like to log a lot. In some programs I've kept the last n lines in memory and written them to disk on an error or when the user requested support.
In one program, it kept the last 400 lines in memory and saved them to a logging database upon an error. A separate service monitored this database and sent an HTTP request containing summary information to a service at our office, which added it to a database there.
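A minimal sketch of that keep-the-last-n-lines idea: a fixed-size buffer that is only flushed when an error occurs. The 400-line capacity comes from the answer; the class and file path are illustrative (the original saved to a logging database instead):

#include <deque>
#include <fstream>
#include <string>

// Keeps only the most recent `capacity` log lines in memory; nothing
// touches the disk until an error actually happens.
class RecentLogBuffer {
    std::deque<std::string> lines_;
    std::size_t capacity_;

public:
    explicit RecentLogBuffer(std::size_t capacity = 400)
        : capacity_(capacity) {}

    void log(const std::string& line) {
        if (lines_.size() == capacity_) lines_.pop_front();  // drop oldest
        lines_.push_back(line);
    }

    // Called from the error handler: dump the retained context to a file.
    void flush_on_error(const std::string& path) {
        std::ofstream out(path, std::ios::app);
        for (const std::string& line : lines_) out << line << '\n';
        lines_.clear();
    }
};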
We had a program on each of our desktop machines that showed a list (refreshed with F5) of issues, which we could assign to ourselves and mark as processed. But now I'm getting carried away :)
This worked very well to help us support many users at several customers. If an error occurred on a PDA somewhere running our software, then within a minute or so we'd get a new item on our screens. We'd often phone a user before they realised they had a problem.
We had a filtering mechanism to automatically process or assign issues that we knew we'd fixed or didn't care much about.
In other programs I've had hourly or daily files which are deleted after n days, either by the program itself or by a dedicated log-cleaning service.