I'm writing a loganalysis application and wanted to grab apache log records between two certain dates. Assume that a date is formated as such: 22/Dec/2009:00:19 (day/month/year:hour:minute)
Currently, I'm using a regular expression to replace the month name with its numeric value, remove the separators, so the above date is converted to: 221220090019 making a date comparison trivial.. but..
Running a regex on each record for large files, say, one containing a quarter million records, is extremely costly.. is there any other method not involving regex substitution?
Thanks in advance
Edit: here's the function doing the convertion/comparison
function dateInRange(t, from, to) {
sub(/[[]/, "", t);
split(t, a, "[/:]");
match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
a[2] = sprintf("%02d", (RSTART + 2) / 3);
s = a[3] a[2] a[1] a[4] a[5];
return s >= from && s <= to;
}
"from" and "to" are the intervals in the aforementioned format, and "t" is the raw apache log date/time field (e.g [22/Dec/2009:00:19:36)
I once had the same problem of a very slow AWK program that involved regular expressions. When I translated the whole program to Perl, it ran at much greater speed. I guess it was because GNU AWK compiles a regular expression every time it interprets the expression, where perl just compiles the expression once.
Well, here is an idea, assuming records in a log are ordered by date.
Instead of running regexp on every line in a file and checking if that record is within the required range, do a binary search.
Get total number of lines in a file. Read a line from the middle and check its date. If it is older than your range - then anything before that line can be ignored. Split what's left in half and check a line from the middle again. And so on until you find your range boundaries.
Here is a Python program I wrote to do a binary search through a log file based on dates. It could be adapted to work for your use.
It seeks to the middle of the file then syncs to a newline, reads and compares the date, repeats the process splitting the previous half in half, doing that until the date matches (greater than or equal), rewinds to make sure there's no more with the same date right before, then reads and outputs lines until the end of the desired range. It's very fast.
I have a more advanced version in the works. Eventually I'll get it completed and post the updated version.
Chopping files just to identify a range sounds a bit heavy handed for such a simple task (binary search is worth considering, though)
here's my modified function, which is obviously much faster since the regex is eliminated
BEGIN {
months["Jan"] = 1
months["Feb"] = 2
....
months["Dec"] = 12
}
function dateInRange(t, from, to) {
split(t, a, "[/:]");
m = sprintf("%02d", months[a[2]]);
s = a[3] m a[1] a[4] a[5];
ok = s >= from && s <= to;
if(!ok && seen == 1){exit;}
return ok;
}
An array is defined and subsequently used to index months.
It's ensured that the program doesnt continue checking records once date is out of the range (variable seen is set on first match)
Thank you all for your inputs.
Related
Friends:
I have to process a CSV file, using Perl language and produce an Excel as output, using the Excel::Writer::XSLX module. This is not a homework but a real life problem, where I cannot download whichever Perl version (actually, I need to use Perl 5.6), or whichever Perl module (I have a limited set of them). My OS is UNIX. I can also use (embedding in Perl) ksh and csh (with some limitation, as I have found so far). Please, limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer, but coming from other languages, I have already done my work. However, the customer is asking for extra processing where I am getting stuck on.
1) The stones in the road I found are coming from two sides: from Perl and from Excel particular styles of processing data. I already found a workaround to handle the Excel, but -as mentioned in the subject- I have difficulties while processing zeroes found in CSV input file. To handle the Excel, I am using the '0 way which is the final way for data representation that Excel seems to have while using the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in whichever line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid loosing your valuable time. I am providing more details after my question:
Research and question:
I tried to use Perl regex to find standalone "0" and replace them by whichever string, planning to replace them back to "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv`
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Are working; but only from command line. They aren't working when I call them from the Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
Do not know why... I have already tried using exec and eval, instead of system, with the same results.
Note that I have a ton of regex that work perfectly with the same structure, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks have not the same behavior, since qx// is complaining about the boundaries \b because of the forward slash.
I have tried using sed -i, but my System is rejecting -i as invalid flag (do not know if this happens in all UNIX, but at least happens in the one at work. However is accepting perl -i).
I have tried embedding awk (which is working from command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == \"0\" { $1 = "string" }1' myfile.csv > myfile_copy.csv
But this works only for the first column (in command line) and, other than having the disadvantage of having extra copy file, Perl is complaining for > redirection, assuming it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk is working from command line, but only 5 columns. But not in Perl using #.
All the combinations of exec and eval have also been tested without success.
I have also tried passing to system each one of the awk components, as arguments, separated by commas, but did not find any valid way to pass the redirector (>), since Perl is rejecting it because of the mentioned reason.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, thus, I get rid off it, and turned back to a traditional looping in csv line by line and a spliter for commas, preserving the zeroes in that way. However I found the "mystery" of isdual in Perl, and because of the limitation of modules I have, I cannot use the Dumper. Then, I also explored the guts of binaries in Perl and tried the $x ^ $x, which was deprecated since version 5.22 but valid till that version (I said mine is 5.6). This is useful to catch numbers vs strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not returns TRUE when $x = 0. [UPDATE: I tried this in a devoted Perl script, just for this purpose, and it is working. I believe that my probable wrong conclusion ("not returning TRUE") was obtained when I did not still realize that Text::CSV was swallowing my zeroes. Doing new tests...].
I will appreciate very much your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database which is handover to me and I pickup programmatically from a folder. Dynamic means that it might have whichever amount of tables, whichever amount of columns in each table, whichever names as column headers, whichever amount of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. Fist row is header.
b) I call "standalone zeroes" those "clean" zeroes which are coming in the CSV file, from second row onwards, between commas, like 0, (if the case is the first position in the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case, are "words" or "strings". In this case, 4 names (note that two of them are zeroes, which required to be treated as strings). Thus, we have a 4 names-columns example (Alfa,Alfa1,Beta,Gamma are headers for those columns, but only in this scenario). From that point onwards, in the second row, you can see floating point (*.00) numbers and, among them, you can see 2 zeroes, which are numbers. Finally, in the third line, you can read M1,0,X888,Z, which are the names for the first 4 columns. Note, please, that the 4th column in the second row has 0 as name, while the 4th column in the third row has ZZ as name.
Summary: as a general picture, I have a table-report divided in 2 parts, from left to right: 4 columns for names, and 8 columns for numbers.
Always the first M columns are names and the last N columns are numbers.
- It is unknown which number is M: which amount of columns devoted for words / strings I will receive.
- It is unknown which number is N: which amount of columns devoted for numbers I will receive.
- It is KNOWN that, after the M amount of columns ends, always starts N, and this is constant for all the rows.
I have done a quick research on Perl boundaries for regex ( \b ), and I have not found any relevant information regarding if it applies or not in Perl 5.6.
However, since you are using and old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from Shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the work doing the change at the start of the each line in your CSV file, if matches.
Or, maybe better (if you have those "standalone" zeroes, and want avoid any unwanted change in some "leading zeroes" string):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma, after the zero; and, of course, after the string].
Note that the first regex should work; the second one is just a "caveat", to be cautious.
I am implementing a function which is to check a blurb (e.g. a message/forum post, etc) against a (potentially long) list of banned words/phrases, and simply return true if any one or more of the words is found in the blurb, and false if not.
This is to be done in vbScript.
The old developer currently has a very large IF statement using instr() e.g.
If instr(ucase(contactname), "KORS") > 0 OR _
instr(ucase(contactname), "D&G") > 0 OR _
instr(ucase(contactname), "DOLCE") > 0 OR _
instr(ucase(contactname), "GABBANA") > 0 OR _
instr(ucase(contactname), "TIFFANY") > 0 OR _
'...
Then
I am trying to decide between two solutions to replace the above code:
Using regular expression to find matches, where the regex would be a simple (but potentially long) regex like this: "KORS|D&G|DOLCE|GABBANA|TIFFANY" and so on, and we would do a regular expression test to return true if any one or more of the words is found.
Using an array where each array item contains a banned word, and loop through each array item checking it against the blurb. Once a match is found the loop would terminate and a variable would be set to TRUE, etc.
It seems to me that the regular expression option is the best, since it is one "check" e.g. the blurb tested against the pattern. But I am wondering if the potentially very long regex pattern would add enough processing overhead to negate the simplicity and benefit of doing the one "check" vs. the many "checks" in the array looping scenario?
I am also open to additional options which I may have overlooked.
Thanks in advance.
EDIT - to clarify, this is for a SINGLE test of one "blurb" e.g. a comment, a forum post, etc. against the banned word list. It only runs one time during a web request. The benchmarking should test size of the word list and NOT the number of executions of the use case.
You could create a string that contains all of your words. Surround each word with a delimiter.
Const TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"
Then, test to see if your word (plus delimiter) is contained within this string:
If InStr(1, TEST_WORDS, "|" & contactname & "|", vbTextCompare) > 0 Then
' Found word
End If
No need for array loops or regular expressions.
Seems to me (without checking) that such complex regexp would be slower, and also evaluating such complex 'Or' statement wold be slow (VBS will evaluate all alternatives).
Should all alternatives be evaluated to know expression value - of course not.
What I would do, is to populate an array with banned words and then iterate through it, checking if the word is within text being searched - and if word is found discontinue iteration.
You could store the most 'popular' banned words on the top of the array (some kind of rank), so you would be most likely to find them in few first steps.
Another benefit of using array is that it is easier to manage its' values compared to 'hardcoded' values within if statement.
I just tested 1 000 000 checks with regexp ("word|anotherword") vs InStr for each word and it seems I was not right.
Regex check took 13 seconds while InStr 71 seconds.
Edited: Checking each word separately with regexp took 78 seconds.
Still I think that if you have many banned words checking them one by one and breaking if any is found would be faster (after last check I would consider joining them by (5? 10?) and checking not such complex regexp each time).
I've written a program to parse through a log file. The file in question is around a million entries or so, and my intent is to remove all duplicate entries by date. So if there's 100 unique log-ins on a date, it will only show one log-in per name. The log output I've created is in the form:
AA 01/Jan/2013
AA 01/Jan 2013
BB 01/Jan 2013
etc. etc. all through the month of January.
This is what I've written so far, the constant i in the for loop is the amount of entries to be sorted through and namearr & datearr are the arrays used for name and date. My end game is to have no repeated values in the first field that correspond to each date. I'm trying to follow proper etiquette and protocols so if I'm off base with this question I apologize.
My first thought in solving this myself is to nest a for loop to compare all previous names to the date, but since I'm learning about Data Structures and Algorithm Analysis, I don't want to creep up to high run times.
if(inFile.is_open())
{
for(int a=0;a<i;a++)
{
inFile>>name;//Take input file name
namearr[a]=name;//Store file name into array
//If names are duplicates, erase them
if(namearr[a]==temp)
{
inFile.ignore(1000,'\n');//If duplicate, skip to next line
}
else
{
temp=name;
inFile.ignore(1,' ');
inFile>>date;//Store date
datearr[a]=date;//Put date into array
inFile.ignore(1000,'\n');//Skip to next like
cout<<namearr[a]<<" "<<datearr[a]<<endl;//Output code to window
oFile<<namearr[a]<<" "<<datearr[a]<<endl;//Output code to file
}
}
}
Ughhh... You better use a Regular Expression library to easily deal with that size of a file. Check Boost Regex
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/index.html
You can construct a key composed of the name and the date with simple string concatenation. That string becomes the index to a map. As you are processing the file line by line, check to see if that string is already in the map. If it is, then you have encountered the name on that day once before. If you've seen it already do one thing, if it's new do another.
This is efficient because you're constructing a string that will only be found a second time if the name has already been seen on that date and maps efficiently search the space of keys to find if a key exists in the map or not.
I am new to perl and I have a problem that I'm trying to solve. At this stage in my program, I have placed a file into an array and created a hash where all the keys are numbers, that increase by a user specified bin size, within a range The values of all keys are set to 0. My goal is to loop through the array and find numbers that match the keys of my hash, and increment the corresponding value by 1 in the event of a match. To make finding the specific value within the array a bit easier, each line of the array will only contain one number of interest, and this number will ALWAYS be a decimal, so maybe I can use the regex:
=~ m{(\d+\.\d+)}
to pick out the numbers of interest. After finding the number of interest, I need to round down the number (at the minute I an using "Math::Round 'nlowmult';") so that it can drop into the appropriate bin (if it exists), and if the bin does not exist, the loop needs to continue until all lines of the array have been scanned.
So the overall aim is to have a hash which has a record of the number of times that values in this array appear, within a user specified range and increment (bin size).
At the minute my code attempting this is (MathRound has been called earlier in the program):
my $msline;
foreach $msline (#msfile) {
chomp $msline;
my ($name, $pnum, $m2c, $charge, $missed, $sequence) = split (" ", $msline);
if ($m2c =~ /[$lowerbound,$upperbound]/) {
nlowmult ($binsize, $m2c);
$hash{$m2c}++;
}
}
NOTE: each line of the array contains 6 fields, with the number of interest always appearing in the third field "m2c".
The program isn't rounding the values down, neither is it adding values to the keys, it is making new keys and incrementing these. I also don't think using split is a good idea, since a real array will contain around 40,000 lines. This may make the hash population process really slow.
Where am I going wrong? Can anybody give me any tips as to how I can go about solving this problem? If any aspects of the problem needs explaining further, let me know!
Thank you in advance!
Change:
if ($m2c =~ /[$lowerbound,$upperbound]/) {
nlowmult ($binsize, $m2c);
$hash{$m2c}++;
}
to:
if ($m2c >= $lowerbound && $m2c <= $upperbound) {
$m2c = nlowmult ($binsize, $m2c);
$hash{$m2c}++;
}
You can't use a regular expression like that to test numeric ranges. And you're using the original value of $m2c as the hash key, not the rounded value.
I think the main problem is your line:
nlowmult ($binsize, $m2c);
Changing this line to:
$m2c = nlowmult ($binsize, $m2c);
would solve at least that problem, because nlowmult() doesn't actually modify $m2c. It just returns the rounded result. You need to tell perl to store that result back into $m2c.
You could combine that line and the one below it if you don't want to actually modify the contents of $m2c:
$hash{nlowmult ($binsize, $m2c)}++;
Probably not a compete answer, but I hope that helps.
As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/2-1/4 a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contained complete garbage, just random letters.
I can't check for impossible compinations of letters, because that string would be thrown out because of the 'tm', in between 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at faster pace, because the "false starts" will be eliminated.
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice store the longer substring.
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
Idea 2
Expand upon that idea by noting the longest word in the dictionary and get rid of letters from those strings in the arrays that are longer than that distance away.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with 1 and 2. Its an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (Its a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
| | |
| o -> b -> * y -> *
| |
| d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and if one matches, more forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note I built one of these some time back by writing a program that wrote the code that ran the algorithm directly instead of having code looking at the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
costsum = 0
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
costsum += c
i -= k
return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.