Perl: Looping Through an Array to Increment the Values of a Hash - regex

I am new to Perl and I have a problem that I'm trying to solve. At this stage in my program, I have read a file into an array and created a hash whose keys are numbers that increase by a user-specified bin size within a range. The values of all keys are set to 0. My goal is to loop through the array, find numbers that match the keys of my hash, and increment the corresponding value by 1 in the event of a match. To make finding the specific value within the array a bit easier, each line of the array will only contain one number of interest, and this number will ALWAYS be a decimal, so maybe I can use the regex:
=~ m{(\d+\.\d+)}
to pick out the numbers of interest. After finding the number of interest, I need to round it down (at the moment I am using Math::Round's nlowmult) so that it drops into the appropriate bin (if one exists); if the bin does not exist, the loop needs to continue until all lines of the array have been scanned.
So the overall aim is to have a hash that records the number of times the values in this array appear, within a user-specified range and increment (bin size).
At the moment my code attempting this is (Math::Round is imported earlier in the program):
foreach my $msline (@msfile) {
    chomp $msline;
    my ($name, $pnum, $m2c, $charge, $missed, $sequence) = split (" ", $msline);
    if ($m2c =~ /[$lowerbound,$upperbound]/) {
        nlowmult ($binsize, $m2c);
        $hash{$m2c}++;
    }
}
NOTE: each line of the array contains 6 fields, with the number of interest always appearing in the third field "m2c".
The program isn't rounding the values down, nor is it incrementing existing keys; it is creating new keys and incrementing those. I also don't think using split is a good idea, since a real input will contain around 40,000 lines, which may make populating the hash really slow.
Where am I going wrong? Can anybody give me any tips as to how I can go about solving this problem? If any aspects of the problem need explaining further, let me know!
Thank you in advance!

Change:
if ($m2c =~ /[$lowerbound,$upperbound]/) {
    nlowmult ($binsize, $m2c);
    $hash{$m2c}++;
}
to:
if ($m2c >= $lowerbound && $m2c <= $upperbound) {
    $m2c = nlowmult ($binsize, $m2c);
    $hash{$m2c}++;
}
You can't use a regular expression like that to test numeric ranges: [$lowerbound,$upperbound] is a character class, which matches any single character appearing in the interpolated bounds, not a number between them. And you're using the original value of $m2c as the hash key, not the rounded value.

I think the main problem is your line:
nlowmult ($binsize, $m2c);
Changing this line to:
$m2c = nlowmult ($binsize, $m2c);
would solve at least that problem, because nlowmult() doesn't actually modify $m2c. It just returns the rounded result. You need to tell Perl to store that result back into $m2c.
You could combine that line and the one below it if you don't want to actually modify the contents of $m2c:
$hash{nlowmult ($binsize, $m2c)}++;
Probably not a complete answer, but I hope that helps.
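Putting both fixes together, here is a minimal sketch of the whole loop (assuming @msfile, %hash, $lowerbound, $upperbound, and $binsize are set up as described in the question):

use Math::Round 'nlowmult';

foreach my $msline (@msfile) {
    chomp $msline;
    # The number of interest is always the third whitespace-separated field.
    my (undef, undef, $m2c) = split ' ', $msline;
    next unless defined $m2c && $m2c =~ /^\d+\.\d+$/;
    if ($m2c >= $lowerbound && $m2c <= $upperbound) {
        # Round down to the nearest multiple of the bin size, then count it.
        $hash{ nlowmult($binsize, $m2c) }++;
    }
}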

Related

perform mathematical operations on a number without changing the attached text

I need a formula that can multiply or divide all the numbers in a string without changing the text attached to the numbers.
I need the numbers in the next column to change automatically according to the given mathematical operation, but the text from the original line must remain unchanged.
I've tried using a combination of REGEXMATCH and REGEXEXTRACT, but that just gives me the result of multiplying/dividing all the numbers in the string (no text whatsoever).
I also had no success with REGEXREPLACE. I'm not even sure it can be used in this case; maybe I need a different formula instead. Perhaps you first need to extract the numbers, multiply them, and use something like TEXTJOIN or CONCATENATE to put them back together in a string with the values already changed. Is that even possible in this specific example? It's totally fine to perform the operation in several steps if needed (for example, adding a SPLIT function or something like that), but the format of the raw data we need to enter and recalculate, unfortunately, cannot be modified.
A sample table for better visualisation can be seen below. Any help would be greatly appreciated!
Raw data             Operation   Desired outcome
25STR/40DEX/70FRES   *0.25       6.25STR/10DEX/17.5FRES
80VIT/30INT/50CRES   *0.75       60VIT/22.5INT/37.5CRES
60VIT/20STR/45LRES   *1.25       75VIT/25STR/56.25LRES
You may try the following, which splits each row on "/", applies the operation from column B (multiplication or division) to the numeric part of each piece, re-appends its text suffix, and joins the pieces back together:
=byrow(index(bycol(split(A2:A,"/"),
   lambda(z, ifna(
     ifs(left(B2:B,1)="*", regexextract(z,"\d+")*mid(B2:B,2,99),
         left(B2:B,1)="/", round(regexextract(z,"\d+")/mid(B2:B,2,99),2))
     &regexextract(z,"\d+(.*)"))))),
 lambda(y, if(y="",, join("/",y))))

Best way to show blank cell if value is zero

=COUNTIFS(Orders!$T:$T,$B4)
is a formula that gives 0 or a positive result.
I use this across 1,500 cells, which fills the sheet with 0s.
I'd like to remove the zeros by using the following formula:
=IF(COUNTIFS(Orders!$T:$T,$B3,Orders!$F:$F,""&P$1&"*")=0,
    "",
    COUNTIFS(Orders!$T:$T,$B3,Orders!$F:$F,""&P$1&"*"))
But this calculates every formula twice and increases the calculation time.
How can I do this in one formula, where if the value is 0 the cell stays empty, and otherwise it displays the answer?
I suggest this cell-function:
=IFERROR(1/(1/COUNTIFS(Orders!$T:$T,$B4)))
EDIT:
I'm not sure what to add as an explanation. Basically, to replace the result of a complex calculation with a blank cell when it results in 0, you can wrap the complex function in
IFERROR(1/(1/ ComplexFunction() ))
It works by taking the inverse (1/X) of the result twice, thus returning the original result in all cases except 0, where a DIV/0 error is generated. This error is then caught by IFERROR, resulting in a blank cell.
The advantage of this method is that it doesn't need to calculate the complex function twice, so it can give a significant speed/readability increase, and it doesn't merely disguise the output the way a custom number format does, which can be important if this cell is used in further formulas.
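Applied to the longer formula from the question, that would be:
=IFERROR(1/(1/COUNTIFS(Orders!$T:$T,$B3,Orders!$F:$F,""&P$1&"*")))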
You only need to set the number format for your range of cells.
Go to the menu Format-->Number-->More Formats-->Custom Number Format...
In the entry area at the top, enter the following: #;-#;""
The "format" of the format string is
(positive value format) ; (negative value format) ; (zero value format)
You can apply colors or commas or anything else. See this link for details
Instead of your =COUNTIFS(Orders!$T:$T,$B4), use:
=REGEXREPLACE(""&COUNTIFS(Orders!$T:$T,$B4), "^0$", )
Also, to speed things up, you should avoid per-row formulae and use ArrayFormulas.
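For example, a single array formula at the top of the column could replace all 1,500 per-row copies. A sketch only (it assumes the criteria live in B2:B and relies on COUNTIF accepting an array of criteria in Google Sheets):
=ARRAYFORMULA(IF($B2:$B="",,REGEXREPLACE(""&COUNTIF(Orders!$T:$T,$B2:$B),"^0$",)))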

How to use environments for lookups

My question builds upon the topic of matching a string against multiple patterns. One solution discussed there is to use sapply(keywords, grepl, strings, ignore.case=TRUE), which yields a two-dimensional matrix.
However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hours).
One idea is to use hash tables, or environments in R. However, I don't get how to translate/convert my strings into an environment while keeping the numerical index.
I have strings[1] through strings[60000]:
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
    assign(x=i, value=strings, envir=e)
}
As x in assign() must be a character, I can't use it like this, but I hope you get my idea: I want to be able to index the environment with the same numbers as in my strings[...] vector.
Thanks for your help!
R environments are not used as much as Perl hashes are, I think just because there are no widely understood 'idioms' for doing so. In your case the key question is: do you really want the numerical index? If so, it should be the value. The key is your string; that's the whole point of the exercise.
e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed)  # note! not unique
sapply(1:length(strings), function(i) assign(strings[i], i, e))
e$horsebean  # returns 10
In this example only the last index associated with each string is kept, but you can assign anything that might be useful to each key, such as a vector of indices.
You can then look up your data in a number of ways. You can regex-search for keys using ls(), for example, and retrieve the values using mget():
# find all keys containing 'bean'
ls(e, pattern='bean')
# retrieve bean data
mget(ls(e, pattern='bean'), e)
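If you do want to keep every index rather than only the last one, here is a small sketch under the same setup, accumulating a vector of indices per key:

e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed)
for (i in seq_along(strings)) {
  key <- strings[i]
  # append i to whatever indices this key already holds
  old <- if (exists(key, envir=e, inherits=FALSE)) get(key, envir=e) else integer(0)
  assign(key, c(old, i), envir=e)
}
e$horsebean  # now returns 1 2 3 4 5 6 7 8 9 10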

Checking if a string contains an English sentence

As of right now, I take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next one, then I use string.find() to see if that English word is somewhere in the input. This takes a VERY long time: each word takes about a quarter to half a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed of each one (multithreading), but it still only checks about 10 a second (I need thousands).
I'm currently writing code to precompile a large array containing every word in the English language, which should speed it up a lot, but still not reach the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't simply reject impossible combinations of letters, because a valid string could be thrown out on account of the 'tm' inside 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You only need to do this once. Your searches for individual words will then proceed at a faster pace, because the "false starts" are eliminated.
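To make that concrete, here is a minimal Python sketch (Python matching the other answers in this thread): the failure table is built once per word, and the scan then never backs up in the text:

def kmp_table(word):
    # table[i] = length of the longest proper prefix of word[:i+1]
    # that is also a suffix of it
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, word, table):
    # return the index of the first occurrence of word in text, or -1
    k = 0
    for i, ch in enumerate(text):
        while k and ch != word[k]:
            k = table[k - 1]
        if ch == word[k]:
            k += 1
        if k == len(word):
            return i - k + 1
    return -1

word = "must"
print(kmp_find("hithisisastringthatmustbechecked", word, kmp_table(word)))  # 19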
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search only the array element indicated by its first letter. This limits the amount of text that has to be searched. Plus, you can't ever find a word beginning with, say, 'r' anywhere before the first 'r' in the string. And some words won't trigger a search at all if their first letter isn't in the string.
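A quick Python sketch of this indexing scheme (keeping the earliest, i.e. longest, suffix per letter):

def build_index(text):
    # map each letter to the longest suffix of text starting with that letter
    index = {}
    for pos, ch in enumerate(text):
        if ch not in index:  # first occurrence gives the longest suffix
            index[ch] = text[pos:]
    return index

idx = build_index("astringthatmustbechecked")
print(idx["b"])                    # 'bechecked'
print("must" in idx.get("m", ""))  # True: only the 'm' bucket is searched
print(idx.get("f"))                # None, so 'f' words skip the search entirely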
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary, and dropping letters from the stored substrings that are farther away than that distance.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can refine this further by using the longest word starting with each particular letter when condensing the strings.
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|    |
|    d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives; when one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code that walks the binary-tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column, and each arrow as a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
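A compact Python sketch of Idea 3, using a dict-of-dicts trie in place of the binary-tree representation described above:

END = "*"  # marks a complete word

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_words(text, trie):
    # yield (start, word) for every dictionary word found anywhere in text
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if END in node:
                yield start, text[start:i + 1]

trie = build_trie(["arun", "bob", "bill", "billy", "body", "jose"])
print(list(find_words("xxbillyxbodyx", trie)))
# [(2, 'bill'), (2, 'billy'), (8, 'body')]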
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell checkers rely on Bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
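A toy Python sketch of that workflow; the two bit positions per word are derived from one SHA-256 digest purely for illustration, and a real implementation would size the filter and choose the number of hashes from the expected word count and target false-positive rate:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20):
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # two bit positions from one digest (illustrative, not tuned)
        d = hashlib.sha256(item.encode()).digest()
        yield int.from_bytes(d[:8], "big") % self.size
        yield int.from_bytes(d[8:16], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
for word in open("english125k.txt").read().split():
    bf.add(word)

print(bf.might_contain("checked"))  # True (definitely added)
print(bf.might_contain("zzzqx"))    # almost certainly False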
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum
Using the same dictionary as that answer, testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word at all, it returns inf, since it would have to split everything into single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: training may take a few minutes (assuming you have already compiled gold-standard "sentences" and "random scrambled strings" texts). You only need to train once; you can easily save the trained model to disk and reuse it for subsequent runs by loading it from disk, which may take a few seconds. Making a call on a string takes a trivially small number of floating-point multiplications to get a probability, so after you finish training, it should be very fast.
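A rough Python sketch of the character-bigram version of this idea; the training file name and the decision threshold are placeholders you'd supply:

from collections import defaultdict
from math import log

def train(text):
    # log-probability of each letter following another, add-one smoothed
    text = "".join(ch for ch in text.lower() if ch.isalpha())
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    model = {}
    for a, following in counts.items():
        total = sum(following.values()) + 26
        model[a] = {b: log((following[b] + 1) / total)
                    for b in map(chr, range(ord('a'), ord('z') + 1))}
    return model

def avg_logprob(s, model, floor=log(1e-6)):
    # average bigram log-probability; higher means more English-like
    s = "".join(ch for ch in s.lower() if ch.isalpha())
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return floor
    return sum(model.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)

# Train once on any English text (spaces are stripped), then score strings.
model = train(open("training_text.txt").read())  # placeholder file name
print(avg_logprob("hithisisastringthatmustbechecked", model))
print(avg_logprob("xqzvvkjwppqlrrm", model))     # should score much lower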

regex matching multiple values when they might not exist

I am trying to write a preg_match_all to match horse race distances.
My source lists races as:
xmxfxy
I want to match the m value, the f value, and the y value. However, different races may have only m, or f, or y, or two of them, or even all three.
// e.g. $raw = 5f213y;
preg_match_all('/(\d{1,})m|(\d{1,})f|(\d{1,})y/', $raw, $distance);
The above sort of works, but the matches appear in unpredictable positions in the returned array. I guess that is because the match runs three times, once for each alternative. How do I match all three (which may or may not exist) in a single run?
EDIT
A full sample string is:
Hardings Catering Services Handicap (Div I) Cl6 5f213y
If I understand you correctly, you're processing listings (like the one in your question) one at a time. If that's the case, you should be using preg_match, not preg_match_all, and the regex should match the whole "distance" code, not individual components of it. Try this:
preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#',
           $raw, $distance);
The results are now stored in a one-dimensional array, but you don't need to worry about the group numbers anyway; you can access them by name instead (e.g., $distance['M'], $distance['F'], $distance['Y']).
Note that, while this regex matches codes with one, two, or three components, it doesn't require the letters to be unique. There's nothing to stop it from matching something like 1m2m3m (a weakness shared by your own approach, by the way).
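For example, a quick sketch using the sample string from the question (the ?? fallbacks cover components that are absent from the match; they need PHP 7+):

<?php
$raw = 'Hardings Catering Services Handicap (Div I) Cl6 5f213y';

if (preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#', $raw, $distance)) {
    // unmatched groups come back empty or missing; default them to 0
    $miles    = (int) ($distance['M'] ?? 0);
    $furlongs = (int) ($distance['F'] ?? 0);
    $yards    = (int) ($distance['Y'] ?? 0);
    printf("%dm %df %dy\n", $miles, $furlongs, $yards);  // 0m 5f 213y
}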
You can use "?" to make each group optional:
preg_match_all('/((\d{1,})m)?|((\d{1,})f)?|((\d{1,})y)?/', $raw, $distance);
If I understand what you're asking correctly, you would like to get each number from these values separately? This works for me:
$input = "Hardings Catering Services Handicap (Div I) Cl6 5f213y";
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);
After the preg_match_all() executes, $matches[2] holds an array of the numbers that matched (in this case, $matches[2][0] is 5 and $matches[2][1] is 213).
If all three values exist, m will be in $matches[2][0], f in $matches[2][1], and y in $matches[2][2]. If any values are missing, the next value gets bumped up a spot. It may also come in handy that $matches[3] will hold an array of the corresponding letter matched on, so if you need to check whether it was an m, f, or y, you can.
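To turn that into a keyed structure, you could pair the letters with the numbers (a quick sketch):

<?php
$input = 'Hardings Catering Services Handicap (Div I) Cl6 5f213y';
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);

// keys from the letters, values from the numbers: ['f' => '5', 'y' => '213']
$distance = array_combine($matches[3], $matches[2]);
print_r($distance);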
If this isn't what you're after, please provide an example of the output you would like to see for this or another sample input.