Percent difference in log files - regex

While going through log files, I often come across the same error message time and time again. Of course, two lines are never identical due to time stamps, usernames, IP addresses, etc.
I'm looking for a way to set a "percent difference", and ignore any lines that are say 90% similar to an already reported error message. Another thought is to always ignore time stamp differences too.
Procedure:
User inputs search term(s) (either regex or simple text)
User inputs tolerance for differences
[Start]
Grep finds string matching search term and sends to new text file
Grep continues searching logs, and finds the same error message. Difference might be the time stamp, date, and possibly username. Since the line is at least 90% similar to what's already in the new file, grep doesn't copy it over and continues searching
Grep finds new line that matches search term. Line is less than 90% similar, so it gets copied to new file and becomes another line that grep matches future results against.
*Edit: Sorry if I was not clear the first time. I'll gladly explain more if need be.
Thanks.
Log.1 - DD:MM:YYYY HH:MM:SS:MS Error - USER failed to login at IPADDRESS
Log.1 - DD:MM:YYYY HH:MM:SS:MS Hardware failed when booting up
Log.2 - DD:MM:YYYY HH:MM:SS:MS Resources are stretched thin, warning - check RAM

I'm not aware of any full out-of-the-box solutions, but Text::Levenshtein and similar algorithms can help you work out how similar one generic string is to another.
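To make the "90% similar" idea concrete, here is a minimal Python sketch. It is an illustration rather than a finished tool, and it swaps in the standard library's difflib.SequenceMatcher ratio in place of a true Levenshtein distance. The function name dedupe_similar and the sample log lines are made up for the example.

```python
from difflib import SequenceMatcher


def dedupe_similar(lines, threshold=0.9):
    """Keep a line only if it is less than `threshold` similar to every kept line."""
    kept = []
    for line in lines:
        # ratio() returns a similarity score between 0.0 and 1.0
        is_duplicate = any(
            SequenceMatcher(None, line, seen).ratio() >= threshold for seen in kept
        )
        if not is_duplicate:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Invented sample lines in the spirit of the question's log format
    sample = [
        "01:02:2020 10:00:00:001 Error - alice failed to login at 10.0.0.1",
        "01:02:2020 10:00:05:123 Error - alice failed to login at 10.0.0.1",
        "01:02:2020 10:01:00:000 Hardware failed when booting up",
    ]
    for line in dedupe_similar(sample):
        print(line)
```

Every new line is compared against every kept line, so the cost grows with the number of distinct messages. That is probably fine for a few hundred distinct errors but may need a smarter index beyond that.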

Another idea is to cache your log messages with a timestamp, so you don't repeat a message you've seen in the last, say, minute.
my %msg_cache = ();
sub log_filter {
    my $msg = shift;
    # Skip the message if it was already logged within the last 60 seconds
    if (defined($msg_cache{$msg}) && $msg_cache{$msg} > time - 60) {
        return;
    }
    # First sighting, or the last one is older than a minute: record it and log it
    $msg_cache{$msg} = time;
    return 1;
}
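A cache keyed on the exact message only helps if the volatile parts are stripped first, which is the "always ignore time stamp differences" idea from the question. The Python sketch below shows one way this might look. The two regexes are assumptions based on the DD:MM:YYYY HH:MM:SS:MS layout in the sample lines, and normalize and unique_messages are hypothetical names.

```python
import re

# Assumed patterns for the volatile fields; adjust them to the real log format.
TIMESTAMP = re.compile(r"\b\d{2}:\d{2}:\d{4} \d{2}:\d{2}:\d{2}:\d+\b")
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def normalize(line):
    """Replace timestamps and IPv4 addresses with fixed placeholders."""
    line = TIMESTAMP.sub("<TS>", line)
    return IPV4.sub("<IP>", line)


def unique_messages(lines):
    """Yield the first line seen for each normalized form."""
    seen = set()
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            yield line
```

Usernames are left untouched here, so the same error for two different users still counts as two messages. Adding another placeholder pattern would collapse those as well.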


Are word timestamps always immediately consecutive and always start from 0?

In google cloud speech to text, I'm getting the timestamps of the words as documented here using PHP.
Two issues:
The first word always starts at 0s, even if the audio file doesn't have any sound until after.
Each word timestamp is immediately followed by another, even when the speaker pauses between words.
Is it possible to get a more precise word timestamp with PHP?
Based on the documentation, it seems there isn’t an option to modify any parameter in order to get a more precise word timestamp.
However, you can report this issue by providing all the information requested within the form.

mIRC Search for multiple words in text file

I am trying to search a text file that will return a result if more than one word is found in that line. I don't see this explained in the documentation and I have tried various loops with no success.
What I would like to do is something similar to this:
$read(name.txt, s, word1|word2|word3)
or even something like this:
$read(name.txt, w, word1*|*word2*|*word3)
I don't know RegEx that well so I'm assuming this can be done with that but I don't know how to do that.
The documentation in the client itself is good, but I also recommend this site: http://en.wikichip.org/wiki/mirc. For your problem there is a nice article: http://en.wikichip.org/wiki/mirc/text_files
All the info is taken from there. So credits to wikichip.
alias testForString {
while ($read(file.txt, nw, *test*, $calc($readn + 1))) {
var %line = $v1
; you can add your own words in the regex, separate them with a pipe (|)
; the g modifier makes the regex collect every match on the line, not just the first
noop $regex(%line,/(word1|word2|word3|test)/g)
echo -a Amount of results: $regml(0)
}
}
$readn is an identifier that returns the number of the line that $read() last matched. It is used here to resume the wildcard search (for *test* in this case) on the next line.
In the code above, $readn starts at 0, so $calc($readn + 1) makes the first search start at line 1. After every match, $read() resumes searching on the following line. When there are no more matches after the specified line, $read() returns $null, which terminates the loop.
The w switch is used to use a wildcard in your search
The n switch prevents the text that is read from being evaluated as mSL code. Use the n switch in almost every case, unless you really need the text evaluated. Using $read() without the n switch could leave your script highly vulnerable.
The result is stored in a variable named %line to use it later in case you need it.
After that we use a noop to execute a regex to match your needs. In this case you can use $regml(0) to find the amount of matches which are specified in your regex search. Using an if-statement you can see if there are two or more matches.
Hope you find this helpful, if there's anything unclear, I will try to explain it better.
EDIT
@cp022
I can't comment, so I'll post my comment here, so how does that help in any way to read content from a text file?

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems, as with google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.
Regex is not on the list of search features, and it was on (more or less, as Better message search functionality (i.e. Wildcard and partial word search)) the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, but they were not supported either, and there no longer seems to be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot: the pre-canned list of feature requests was, er, canned. The original is on archive.org dated 2012; now you just get redirected to a dumbed-down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761, "Does Gmail support all IMAP features?". Since some time in 2015 that answer silently redirects to the answer about IMAP client configuration; archive.org has a copy dated 2014.
With the current search facility brackets of any form () {} [] are used for grouping, they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word-separator, + - : and " do have special meaning though, see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to a different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), so the matches would need further processing to apply a regex.
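The IMAP route might look roughly like the Python sketch below, which uses the standard imaplib and email modules. The server-side SUBJECT search narrows the candidates, and a regex applied locally enforces the "only at the start" rule. The function names are hypothetical, the host and credentials are placeholders the caller must supply, and moving or labelling the matched messages is left out.

```python
import email
import imaplib
import re

RELEASE_PREFIX = re.compile(r"^\[RELEASE\]")


def subject_has_prefix(subject):
    """Return True if the subject starts with the literal text [RELEASE]."""
    return bool(RELEASE_PREFIX.match(subject or ""))


def find_release_messages(host, user, password, mailbox="INBOX"):
    """Return message numbers in `mailbox` whose subject starts with [RELEASE]."""
    matches = []
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)
        # The server-side search only narrows the candidates
        _, data = conn.search(None, "SUBJECT", '"RELEASE"')
        for num in data[0].split():
            # Fetch just the Subject header without marking the message as read
            _, parts = conn.fetch(num, "(BODY.PEEK[HEADER.FIELDS (SUBJECT)])")
            header = email.message_from_bytes(parts[0][1])
            # The regex applied locally does the anchoring IMAP cannot express
            if subject_has_prefix(str(header["Subject"] or "")):
                matches.append(num)
    finally:
        conn.logout()
    return matches
```

The match is case-sensitive and requires the square brackets, which is what the question asks for. Encoded (non-ASCII) subjects are not decoded in this sketch.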
For completeness, one last workaround is: Gmail supports plus addressing, so if you can change the destination address to youraddress+jenkinsrelease@gmail.com it will still be sent to your mailbox, where you can filter by recipient address. Make sure to filter using the full email address to:youraddress+jenkinsrelease@gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)
Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++) {
var subject = threads[i].getFirstMessageSubject();
const regex = /^\[RELEASE\]/; //change this to whatever regex you want, this one should cover OP's scenario
let hasReleasePrefix = regex.test(subject)
if (hasReleasePrefix) {
Logger.log(subject);
// Now do what you want to do with the email thread. For example, skip inbox and add an already existing label, like so (addLabel expects a GmailLabel object, not a string):
threads[i].moveToArchive().addLabel(GmailApp.getUserLabelByName("customLabel"))
}
}
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger(){ //you only need to run this once, then the trigger executes the function every hour in perpetuity
ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}
The only option I have found to do this is to find some exact wording and put that under the "Has the words" option. It's not the best option, but it works.
I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

eval failing to match regex after sometime

I get first input from user which is a tree (having significant height and depth) of nodes. Each of the node contains a regex and modifiers. This tree gets saved in memory. This is taken only once at the application startup.
The second input is a value which is matched starting at the root node of the tree till an exact matching leaf node is found (Depth First Search). The match is determined as follows :
my $evalstr = <<EOEVAL;
if(\$input_value =~ /\$node_regex/$node_modifiers){
1;
}else{
-1;
}
EOEVAL
no strict 'refs';
my $return_value = eval "no strict;$evalstr";
The second input is provided continuously throughout the application's life time by a source.
problem:
The above code works very well for some time (approx. 10 hours), but after continuous input for that long, the eval starts failing consistently and I get -1 in $return_value. All other features of the application, including other comparison statements, keep working fine. If I restart the application, matching resumes and gives proper results.
Observations:
1) I get deep recursion warning many times, but I read somewhere it is normal as stack size for me would be more than 100 many a times, considering the size of the input tree.
2) If I use simple logic for regex match without eval as above, I don't get any issue for any continuous run of the application.
if($input_value =~ /$node_regex/){
$return_value=1;
}else{
$return_value=-1;
}
but then I have to sacrifice dynamic modifiers, as per Dynamic Modifiers
Checks:
1) I checked $@ but it is empty.
2) Also printed the respective values of $input_value,$node_regex and $node_modifiers, they are correct and should have matched the value with regex at the failure point.
3) I checked for memory usage, but it's fairly constant over the time for the perl process.
4) Was using perl 5.8.8 then updated it to 5.12, but still face the same issue.
Question :
What could be the cause of the above issue? Why does it fail after some time, but work well when the application is restarted?
A definitive answer would require more knowledge of perl internals than I have. But given what you are doing, continuous parsing of large trees, it seems safe to assume that some limit is being reached, some resource is exhausted. I would take a close look at things and make sure that all resources are being released between each iteration of a parse. I would be especially concerned with circular references in the complex structures, and making sure that there are none.
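Separately from the resource question, dynamic modifiers do not strictly require a string eval. Perl accepts inline modifiers inside the pattern, as in (?i:...), for modifiers such as i, m, s and x. The same idea in Python is sketched below as an analogy only, not as a diagnosis of the failure above: modifier letters are mapped to flags and each node's compiled pattern is cached. The names compile_node and node_matches are hypothetical.

```python
import re
from functools import lru_cache

# Map single-letter modifiers to Python flags (only these four are handled here).
FLAG_BY_LETTER = {
    "i": re.IGNORECASE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    "x": re.VERBOSE,
}


@lru_cache(maxsize=None)
def compile_node(pattern, modifiers=""):
    """Compile a node's regex once per (pattern, modifiers) pair."""
    flags = 0
    for letter in modifiers:
        flags |= FLAG_BY_LETTER[letter]  # raises KeyError on an unknown modifier
    return re.compile(pattern, flags)


def node_matches(input_value, pattern, modifiers=""):
    """Return 1 on a match and -1 otherwise, mirroring the question's convention."""
    return 1 if compile_node(pattern, modifiers).search(input_value) else -1
```

Because nothing is built from strings and evaluated at match time, a failure here would surface as an exception instead of a silent -1.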

Automatically finding numbering patterns in filenames

Intro
I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z position, then wait a certain amount of time (next timepoint) and take 10 slices again.
They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point.
Question
Once you have all these file names, you can use a regex pattern to extract the Z and T value, if you know the backbone pattern of the file name. This I know how to do.
The question I have is: do you know a way to automatically generate a regex pattern from the file name list? For instance, there is an awesome tool on the net that does a similar thing: txt2re.
What algorithm would you use to parse all the file name list and generate a most likely regex pattern?
There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is
my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";
outputs:
this\ is\ (?:Perl|Ruby)
Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't give you capturing of numbers etc. so it wouldn't be completely automatic. After getting the diff you would have to hand-edit or do some kind of substitution to get a working final regex.
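A simpler heuristic than diffing pairs is to split every filename into digit runs and non-digit runs, then turn the digit runs that vary across the list into capture groups. The Python sketch below illustrates that idea. It assumes a non-empty list in which all filenames share one skeleton (same literal text, same number of digit fields), and infer_pattern is a hypothetical name.

```python
import re


def infer_pattern(filenames):
    """Build a regex from filenames that share one skeleton.

    Digit runs that vary between files become capture groups;
    digit runs that never change are kept as literal text.
    """
    token_re = re.compile(r"\d+|\D+")
    rows = [token_re.findall(name) for name in filenames]
    first = rows[0]
    for row in rows[1:]:
        # Same number of tokens, digits in the same places, identical literals
        same_shape = len(row) == len(first) and all(
            a.isdigit() == b.isdigit() and (a.isdigit() or a == b)
            for a, b in zip(first, row)
        )
        if not same_shape:
            raise ValueError("filenames do not share one skeleton")
    pieces = []
    for column in zip(*rows):
        if column[0].isdigit() and len(set(column)) > 1:
            pieces.append(r"(\d+)")  # varying number: capture it
        else:
            pieces.append(re.escape(column[0]))  # constant text: match literally
    return "".join(pieces)
```

For a list where only the Z and T fields change, the resulting pattern has exactly two groups, in filename order. The date and the trailing digit of "experiment1" stay literal because they never vary.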
First of all, you are trying to do this the hard way. I suspect that this may not be impossible, but you would have to apply some artificial intelligence techniques and it would be far more complicated than it is worth. Either neural networks or a genetic algorithm system could be trained to recognize the Z numbers and T numbers, assuming that the format of Z[0-9]+ and T[0-9]+ is always used somewhere in the filename.
What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.
I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would represent the count of filenames with 1 match, and the ones with multiple matches. And I would count the total number of filenames processed.
At the end, I would report as follows:
nnnnnnnnnn filenames processed
Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.
T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
If you are lucky, there will be no multiple matches at all, and you could use the regexes above to extract your numbers. However, if there are any significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether or not a simple adjustment to the regex might work.
For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.
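Probably something like [ /.=-]T[0-9]+[ /.=-] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ /.=-]T[0-9]+[ /.=-] (the hyphen is placed last in each character class so that it is read as a literal hyphen rather than as a range).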
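The counting script described above could be sketched in Python as follows. It only tallies how many filenames contain exactly one, or more than one, Z-number and T-number. The counter keys and the function name count_matches are made up for the example.

```python
import re
from collections import Counter

Z_RE = re.compile(r"Z[0-9]+")
T_RE = re.compile(r"T[0-9]+")


def count_matches(filenames):
    """Tally how many filenames contain one, or several, Z- and T-numbers."""
    counts = Counter()
    for name in filenames:
        counts["total"] += 1
        for label, regex in (("Z", Z_RE), ("T", T_RE)):
            hits = len(regex.findall(name))
            if hits == 1:
                counts[label + "_once"] += 1
            elif hits > 1:
                counts[label + "_multiple"] += 1
    return counts


def report(counts):
    """Print the summary in the format outlined in the answer."""
    print(counts["total"], "filenames processed")
    for label in ("Z", "T"):
        print(f"{label}-numbers matched only once in {counts[label + '_once']} filenames.")
        print(f"{label}-numbers matched multiple times in {counts[label + '_multiple']} filenames.")
```

Calling report(count_matches(names)) on the full list shows whether the plain Z[0-9]+ and T[0-9]+ patterns are already unambiguous or whether a stricter pattern is needed.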
For Python, see this question about TemplateMaker.