Calculating the distance between characters - regex

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
I.E. A DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text into a variable with OCR and check whether user_id is found anywhere in that text. If not, flag the record/image in a log; if so, do nothing and move on.
My example is simple. In the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it has a specific format with its own significance).
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be: sometimes a 7 is read as a 1, or a 9 as a 7, etc. The result is a large number of false positives, especially among images with low-quality scans.
I've addressed some of the image quality issues with some processing on my side (increasing the image size, adjusting the black/white threshold) and have had satisfying results. I'd like to add the ability for the program to recognize, for example, that "81*7*23103" is not very far from "81*9*23103".
The only way I know how to do that is to check every string at least as long as what I'm looking for, calculate the distance between each pair of corresponding characters, take the average, and set a limit on what counts as a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
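The calculation described above can be sketched in Python as follows. This is only an illustration of the averaging idea; the function names `digit_distance` and `best_match` and the sample OCR string are made up for the example, and it assumes the ID consists of ASCII digits only.

```python
ASCII_DIGITS = set("0123456789")


def digit_distance(expected, found):
    """Average absolute difference between corresponding digits."""
    if len(expected) != len(found):
        raise ValueError("strings must have the same length")
    total = sum(abs(int(a) - int(b)) for a, b in zip(expected, found))
    return total / len(expected)


def best_match(expected, ocr_text):
    """Slide a window over the OCR text and return (distance, window)
    for the all-digit window closest to `expected`, or None."""
    n = len(expected)
    best = None
    for i in range(len(ocr_text) - n + 1):
        window = ocr_text[i:i + n]
        if not set(window) <= ASCII_DIGITS:
            continue  # skip windows containing non-digit characters
        d = digit_distance(expected, window)
        if best is None or d < best[0]:
            best = (d, window)
    return best


print(digit_distance("81723103", "81923103"))  # Ex 1 above
print(digit_distance("81723103", "81158988"))  # Ex 2 above
print(best_match("81723103", "ID: 81923103 page 2"))
```

One thing to keep in mind: numeric distance treats a 7 misread as a 1 as a large error (6), even though that is a common OCR confusion. Counting mismatched positions instead (Hamming distance) may be worth comparing against this measure.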
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.

Related

Linear Programming: How to implement with multiple constraints?

I’m trying to solve a linear programming model and need some help. I’m not a programming expert, but I can draw up the problem conceptually and am hoping for some help implementing it.
I’m looking into an asset allocation problem for an investment portfolio from a theoretical perspective, but for simplicity of this post I’m going to use generic terms.
I have a list of 500+ choices that all have an assigned cost and value add. My goal is to maximize the sum of the value add, given a constraint on how much I can spend. These 500 choices are divided into 5 categories and there are restrictions on how many choices I can have from each category.
Category 1 = 1
Category 2 = 1
Category 3 = 2 or 3
Category 4 = 1 or 2
Category 5 = 2
Category 3 + Category 4 = 4
I figure I’ll need a binary variable x attached to each choice, where 1 means I’m picking that choice and 0 means I’m not. In the end, 8 variables should have the value 1 and the rest 0, giving the maximum value add subject to the cost and category constraints.
I ultimately hope to be able to run and say for example “what is the nth highest value” so instead of getting the maximum value add I can get the second highest value add and so on.
Is this possible and what software/language would be best to do it? Thanks for your help!
Just to simplify writing everything down, let's assume you had 15 assets, with value added v_1, v_2, ..., v_15 and costs c_1, c_2, ..., c_15. Let's assume assets 1, 2, and 3 are in category 1, assets 4, 5, and 6 are in category 2, assets 7, 8, and 9 are in category 3, assets 10, 11, and 12 are in category 4, and assets 13, 14, and 15 are in category 5. Finally, let's assume a budget B.
We would create binary variables x_1, x_2, ..., x_15 to indicate whether we bought each asset. Now, the objective function of our integer program is:
max v_1*x_1 + v_2*x_2 + ... + v_15*x_15
Our budget constraint is:
c_1*x_1 + c_2*x_2 + ... + c_15*x_15 <= B
Exactly one choice from category 1:
x_1 + x_2 + x_3 = 1
Exactly one choice from category 2:
x_4 + x_5 + x_6 = 1
Either 2 or 3 choices from category 3:
x_7 + x_8 + x_9 >= 2
x_7 + x_8 + x_9 <= 3
Either 1 or 2 choices from category 4:
x_10 + x_11 + x_12 >= 1
x_10 + x_11 + x_12 <= 2
Exactly 2 choices from category 5:
x_13 + x_14 + x_15 = 2
Exactly 4 choices from categories 3 and 4 combined:
x_7 + x_8 + x_9 + x_10 + x_11 + x_12 = 4
Finally, you would specify all variables to be binary.
Note that the only adjustment you would need to your problem is to change the variables in each of these constraints to be the variables associated with each of your five categories.
All that remains would be to implement the model. There are a myriad of linear programming packages in all major languages; check out this survey for details. Since Stack Overflow is not a software recommendation site and you haven't really given any details about your situation (e.g. free vs. non-free solvers or the programming language you're using), I will refrain from suggesting a particular package.
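To make the formulation concrete without committing to any solver, here is a minimal brute-force sketch in Python for the 15-asset toy example. The values, costs, and budget are invented for illustration, and enumerating all 2^15 assignments is only viable at this toy size; for 500+ choices an integer programming solver would be needed.

```python
from itertools import product

# Hypothetical data for the 15-asset example (not from the question).
values = [4, 7, 5, 6, 3, 8, 5, 9, 4, 6, 7, 3, 8, 5, 6]
costs = [3, 6, 4, 5, 2, 7, 4, 8, 3, 5, 6, 2, 7, 4, 5]
budget = 40
# Assets 1-3 are category 1, 4-6 category 2, and so on.
category = [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3 + [5] * 3


def feasible(x):
    """Check the budget and category constraints for a 0/1 vector x."""
    count = {k: 0 for k in range(1, 6)}
    for xi, cat in zip(x, category):
        count[cat] += xi
    spend = sum(c * xi for c, xi in zip(costs, x))
    return (spend <= budget
            and count[1] == 1
            and count[2] == 1
            and 2 <= count[3] <= 3
            and 1 <= count[4] <= 2
            and count[5] == 2
            and count[3] + count[4] == 4)


def ranked_solutions():
    """All feasible selections as (total value, x), best first."""
    sols = [(sum(v * xi for v, xi in zip(values, x)), x)
            for x in product((0, 1), repeat=len(values))
            if feasible(x)]
    sols.sort(key=lambda s: s[0], reverse=True)
    return sols


solutions = ranked_solutions()
print(solutions[0])  # best selection
print(solutions[1])  # second-best selection
```

Because the feasible selections are sorted, the nth entry answers the "nth highest value" question directly. With a real solver the usual approach is different: solve, add a constraint that excludes the solution just found, and solve again.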

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside variable
0 if FO.variable is not inside variable
Seems like a simple thing... in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to do this in R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a single string. My early attempts failed because I tried to combine grepl and ifelse, which resulted in grepl using only the first value in the column instead of the entire column.
My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there... the code ran for a while, then errored with an invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for loops before, as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right; I'm not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
if(dd[i,4] %in% dd[i,2])
dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how best to approach this.
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps each element of the unlisted v to the row it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)
    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
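For readers who want to see the logic outside R, here is a rough Python sketch of the two ideas being compared: exact matching against the comma-separated parts (what f0 does) versus fixed substring matching (what the grepl variants do). The function names and the "TV Ads" row are invented for illustration. The two can disagree when one channel name is contained in another, which does not happen in the sample data.

```python
def keep_exact(variable, fo_variable):
    """True where fo_variable equals one of the comma-separated parts."""
    return [fo in parts.split(",")
            for parts, fo in zip(variable, fo_variable)]


def keep_substring(variable, fo_variable):
    """True where fo_variable occurs anywhere inside variable."""
    return [fo in parts for parts, fo in zip(variable, fo_variable)]


variable = ["Direct / Unknown,Organic Search",
            "Organic Search,System Email",
            "TV Ads,System Email"]          # hypothetical extra row
fo_variable = ["Direct / Unknown", "Direct / Unknown", "TV"]

print(keep_exact(variable, fo_variable))
print(keep_substring(variable, fo_variable))
```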
I would go with a simple mapply in your case; as you correctly said, by-row operations will be very slow. Also (as suggested by Martin), setting fixed = TRUE and converting to character beforehand will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
#    VisitorIDTrue                        variable value      FO.variable FO.value  Keep
# 22      44888657 Direct / Unknown,Organic Search     1 Direct / Unknown        1  TRUE
# 2       44888657   Direct / Unknown,System Email     1 Direct / Unknown        1  TRUE
# 6       44888657             Direct / Unknown,TV     1 Direct / Unknown        1  TRUE
# 10      44888657     Organic Search,System Email     1 Direct / Unknown        1 FALSE
# 18      44888657               Organic Search,TV     1 Direct / Unknown        1 FALSE
# 14      44888657                 System Email,TV     1 Direct / Unknown        1 FALSE
# 24      44888657 Direct / Unknown,Organic Search     1   Organic Search        1  TRUE
# 4       44888657   Direct / Unknown,System Email     1   Organic Search        1 FALSE
...
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & FO.variable pairs (in the original table, dt).

Replace zeros with missing values in certain cases

I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years' worth of data to work on (2000-2014), so I was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.
Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
    replace started = 1 if y`y' > 0
    replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
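To see what the loop does row by row, here is a small Python sketch of the same scan. It is an illustration only: the function name is made up, and None stands in for Stata's missing value.

```python
def blank_zeros_after_start(row):
    """Replace zeros that occur after the first positive value with None."""
    started = False
    out = []
    for value in row:
        if value is not None and value > 0:
            started = True
        # A zero counts as "missing" only once the facility has started.
        out.append(None if started and value == 0 else value)
    return out


# Rows from the example above (y2011 .. y2014).
print(blank_zeros_after_start([0, 0, 25, 0]))
print(blank_zeros_after_start([5, 10, 0, 0]))
# A facility that stops and re-starts: the gap year is blanked as well.
print(blank_zeros_after_start([5, 0, 7, 0]))
```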
A more general comment is that this is not the best data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" technique in Stata.

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first of all vectorised the documents in order to easily apply numpy
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np
import math
from math import log
v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
def kl(p, q):
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    return np.sum(np.where(p != 0,(p-q) * np.log10(p / q), 0))
for x in v:
    KL=kl(x,c)
    print KL
Here is the result of the above code: [0.0, 0.602059991328, 0.0].
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.
Does anyone have an idea of what I'm not doing right with regard to KL? Many thanks for your suggestions.
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - they are, according to the following documentation, the same) is designed to measure the difference between probability distributions. This means basically that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as
Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk *
log(pk / qk), axis=0).
A bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass, i.e., how they are constructed from the data).
Hope this helps; at least a library provides it, so you don't have to code your own.
After a bit of googling to understand the KL concept, I think that your problem is due to the vectorization: you're comparing the number of appearances of different words. You should either link each column index to one word, or use a dictionary:
# The boy is having a lad relationship It lovely day in NY
1)[1 1 1 1 1 1 1 0 0 0 0 0]
2)[1 2 1 1 1 0 1 0 0 0 0 0]
3)[0 0 1 0 1 0 0 1 1 1 1 1]
Then you can use your kl function.
To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.
A potential issue might be in your NumPy definition of KL. Read the Wikipedia page for the formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
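A minimal sketch of that approach, using only the standard library, might look like the following. Two choices here are assumptions rather than part of the answers above: words are lowercased and split on whitespace, and add-one smoothing (the `alpha` parameter) is used so that no probability is zero, since KL is undefined when q is 0 where p is not.

```python
import math
from collections import Counter

docs = ["The boy is having a lad relationship",
        "The boy is having a boy relationship",
        "It is a lovely day in NY"]

# Shared vocabulary: every column index corresponds to one fixed word.
vocab = sorted({word for doc in docs for word in doc.lower().split()})


def distribution(text, vocab, alpha=1.0):
    """Smoothed word-probability vector over the shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return [(counts[w] + alpha) / total for w in vocab]


def kl(p, q):
    """KL divergence sum(p * log(p / q)) using the natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


dists = [distribution(doc, vocab) for doc in docs]
for i, d in enumerate(dists, start=1):
    print("KL(text 1 || text %d) = %.4f" % (i, kl(dists[0], d)))
```

With a shared vocabulary, text 2 comes out closer to text 1 than text 3 does, which matches intuition. Note that KL is not symmetric, so kl(p, q) and kl(q, p) generally differ.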
Note that you multiply (p-q) by the log result. In accordance with the KL formula, this should only be p:
return np.sum(np.where(p != 0,(p) * np.log10(p / q), 0))
That may help...

How to count rating?

My question is more mathematical. There is a post on the site. Users can like or dislike it, and below the post the counts are shown, for example -5 dislikes and +23 likes. Based on these values I want to compute a rating in the range 0 to 10 (or -10 to 10). How do I do this correctly?
This may not answer your question as you need a rating between [-10,10] but this blog post describes the best way to give scores to items where there are positive and negative ratings (in your case, likes and dislikes).
A simple method like
(Positive ratings) - (Negative ratings), or
(Positive ratings) / (Total ratings)
will not give optimal results.
Instead he uses a method called Binomial proportion confidence interval.
The relevant part of the blog post is copied below:
CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the "real" fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:
(p + z^2/(2n) ± z * sqrt((p(1 - p) + z^2/(4n)) / n)) / (1 + z^2/n), where z = zα/2
(source: evanmiller.org)
(Use minus where it says plus/minus to calculate the lower bound.) Here p is the observed fraction of positive ratings, zα/2 is the (1-α/2) quantile of the standard normal distribution, and n is the total number of ratings.
Here it is, implemented in Ruby, again from the blog post.
require 'statistics2'
def ci_lower_bound(pos, n, confidence)
    if n == 0
        return 0
    end
    z = Statistics2.pnormaldist(1-(1-confidence)/2)
    phat = 1.0*pos/n
    (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
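For anyone not using Ruby, a rough Python equivalent is sketched below. It mirrors the Ruby function above and uses `statistics.NormalDist` (Python 3.8+) for the normal quantile. The `rating_0_to_10` helper is my own addition for the question's 0 to 10 scale and is not part of the blog post.

```python
import math
from statistics import NormalDist


def ci_lower_bound(pos, n, confidence=0.95):
    """Lower bound of the Wilson score interval for pos positives out of n."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = pos / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


def rating_0_to_10(likes, dislikes, confidence=0.95):
    """Hypothetical helper: scale the lower bound to a 0-10 rating."""
    return round(10 * ci_lower_bound(likes, likes + dislikes, confidence))


# The example from the question: 23 likes and 5 dislikes.
print(ci_lower_bound(23, 28))
print(rating_0_to_10(23, 5))
```

The lower bound rises as more votes arrive at the same like ratio, so a post with 230 likes out of 280 scores higher than one with 23 out of 28.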
This is an extension of Shepherd's answer.
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
It depends on the number of visitors to your app. Let's say you expect about 100 users to rate your app. When the first user clicks dislike, the approach above rates the app as 0. That is not logically right, since the sample is far too small to justify a zero. The same applies to a single like: the app gets a rating of 10.
A better approach is to add a constant value to the numerator and denominator. If the app has about 100 visitors, it's safe to assume that until we get 10 ups/downs we should not go to extremes (neither a 0 nor a 10 rating). So just add 5 to both the likes and the dislikes.
num_likes = num_likes + 5;
num_dislikes = num_dislikes + 5;
total_votes = num_likes + num_dislikes;
rating = round(10*(num_likes)/(total_votes));
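The effect of the added constant is easy to check with a small Python sketch. The function name and the `prior` parameter are made up for this illustration; `prior=5` corresponds to the "+5" above, and `prior=0` reproduces the unsmoothed formula.

```python
def smoothed_rating(num_likes, num_dislikes, prior=5):
    """0-10 rating with `prior` pseudo-votes added to each side."""
    likes = num_likes + prior
    dislikes = num_dislikes + prior
    total = likes + dislikes
    if total == 0:
        return None  # no votes and no prior: rating is undefined
    return round(10 * likes / total)


print(smoothed_rating(0, 1, prior=0))  # unsmoothed, one dislike
print(smoothed_rating(0, 1))           # smoothed, one dislike
print(smoothed_rating(1, 0))           # smoothed, one like
print(smoothed_rating(23, 5))          # the example from the question
```

With the prior in place, a single vote barely moves the rating away from the midpoint, while a post with many votes is dominated by its real counts.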
It sounds like what you want is basically a percentage liked/disliked. I would do 0 to 10, rather than -10 to 10, because that could be confusing. So on a 0 to 10 scale, 0 would be "all dislikes" and 10 would be "all liked"
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
And that's basically it.