I am running a quasi-experiment and am interested in estimating the ATT. I have data with 260k entries where Ti = 0 and 5k entries where Ti = 1. I am calculating the ATT with the IPTW technique, where I achieve great balance, and the treatment effect on the treated comes out at about -450 euros, but it is not significant.
Weight calculation:
If treatment = 1, weight = 1; else weight = propensity score / (1 - propensity score).
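In code, this weighting looks roughly like the sketch below (a minimal illustration only, assuming hypothetical column names 'T' for the treatment indicator and 'Y' for the outcome in euros, with the propensity score taken from a logistic regression):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def att_iptw(df: pd.DataFrame, covariates: list) -> float:
    # propensity score from a simple logistic regression on the covariates
    ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["T"])
    ps = ps_model.predict_proba(df[covariates])[:, 1]

    # ATT weights: treated units get weight 1, controls get p / (1 - p)
    w = np.where(df["T"] == 1, 1.0, ps / (1.0 - ps))

    treated = df["T"] == 1
    y_treated = df.loc[treated, "Y"].mean()
    y_control = np.average(df.loc[~treated, "Y"], weights=w[~treated])
    return y_treated - y_control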
Then, to compare against another methodology, I use nearest-neighbour matching with ratio = 1, and balance is again achieved. I get the treatment effect (which is the ATT by default in matching) as about +750 euros, and significant.
Shouldn't both methods generate similar results? Which method should I go for in this case, and why?
When you match, are there any treated individuals without a match?
In expectation, IPTW and matching should both give the same answer. One possible explanation is that some treated individuals don't have a close match, so they are dropped. When this happens, the population for which the causal effect is defined changes. This could result in different answers between the methods.
Each of these methods needs to be evaluated differently.
For IPW you need to check that you did not get samples with extremely low (or extremely high) propensity scores. If they are close to 0 or 1, you need to evaluate why this happened and probably remove samples like that from the data. Since your labels are very unbalanced, this could certainly happen.
For matching, as @pzivich said, you need to check whether there are samples that did not get matched (similar to a very low propensity).
Finally, I like checking the balance on held-out data to make sure there is no over-fitting.
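A minimal sketch of the extreme-propensity check described above (assuming ps is a NumPy array of estimated propensity scores and t the treatment indicator; the 0.01/0.99 cut-offs are only illustrative):

import numpy as np

def check_and_trim(ps, t, low=0.01, high=0.99):
    extreme = (ps < low) | (ps > high)
    print(f"{extreme.sum()} of {len(ps)} units have propensity outside [{low}, {high}]")
    print("largest control weight:", (ps[t == 0] / (1 - ps[t == 0])).max())
    keep = ~extreme  # one blunt option: drop units with extreme propensity
    return ps[keep], t[keep]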
I have a question about doing negative sampling in word2vec.
To frame it as a binary classification problem, I know that words around the target word are labeled positive and words outside that range are labeled negative. The problem is that there are far too many negative words, so negative sampling draws negatives according to the frequency of words across the entire corpus.
In this case, if a word that is a positive example is also sampled as a negative, how is that handled?
For example, if you look at the target word love in "I love pizza", (love, pizza) should have a positive label. But isn't it possible for (love, pizza) to also be drawn as a negative pair later on through negative sampling?
The negative-sampling isn't done based on words elsewhere in the sentence, but the word-frequencies across the entire training corpus.
That is, while the observed neighbors become positive examples, the negative examples are chosen at random from the entire distribution of all known words.
These negative examples might even, by bad luck, be the same as other nearby in-context-window intended-positive words! But this isn't fatal: across all training examples, across all passes, across all negative-samples drawn, the net effect of training remains roughly the same: the positive-examples have their predictions very-often up-weighted, the negative-examples down-weighted, and even occasionally where the negative-example is sometimes also a neighbor, that effect "washes out" across all the (non-biasing) errors, still leaving that real neighbor net-reinforced over the full training run.
So: you're right to note that the process seems a little sloppy, but it all evens out in the end.
FWIW, the word2vec implementations with which I'm most familiar do check if the negative-sample is exactly the positive word we're trying to predict, but don't do any other checking against the whole current window.
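A minimal sketch of frequency-based negative sampling along these lines (using the unigram-distribution-raised-to-the-0.75-power heuristic from the original word2vec paper; the toy counts are made up). Note that only the exact positive target is excluded, not the rest of the context window:

import numpy as np

def build_sampling_table(word_counts):
    # unigram distribution raised to the 0.75 power, as in the word2vec paper
    words = list(word_counts)
    weights = np.array([word_counts[w] ** 0.75 for w in words])
    return words, weights / weights.sum()

def draw_negatives(words, probs, positive_word, k=5, rng=np.random.default_rng()):
    negatives = []
    while len(negatives) < k:
        w = rng.choice(words, p=probs)
        if w != positive_word:  # only the exact positive target is skipped
            negatives.append(w)
    return negatives

# (love, pizza) is a positive pair here, yet 'pizza' can still be drawn as a
# negative for other positive pairs elsewhere in the corpus.
words, probs = build_sampling_table({"i": 50, "love": 30, "pizza": 20, "sky": 40})
print(draw_negatives(words, probs, positive_word="pizza"))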
If I have a sentence, ex: “get out of here”
and I want to use word2vec embeddings to represent it, I found three different ways to do that:
1- for each word, we compute the average of its embedding vector's values, so each word is replaced by a single value.
2- as in 1, but using the standard deviation of the embedding vector's values.
3- or by adding the embedding vectors as they are, one after the other. So if I use a 300-length embedding vector, for the above example I will end up with a final vector of length 1200 (300 * 4 words) to represent the sentence.
Which one of them is most suitable, specifically for sentence-similarity applications?
The way you describe option (1) makes it sound like each word becomes a single number. That wouldn't work.
The simple approach that's often used is to average all word-vectors for words in the sentence together - so with 300-dimensional word-vectors, you still wind up with a 300-dimensional sentence-average vector. Perhaps that's what you mean by your option (1).
(Sometimes, all vectors are normalized to unit-length before this operation, but sometimes not - because the non-normalized vector lengths can sometimes indicate the strength of a word's meaning. Sometimes, word-vectors are weighted by some other frequency-based indicator of their relative importance, such as TF/IDF.)
I've never seen your option (2) used and don't quite understand what you mean or how it could possibly work.
Your option (3) would be better described as "concatenating the word-vectors". It gives different-sized vectors depending on the number of words in the sentence. Slight differences in word placement, such as comparing "get out of here" and "of here get out", would result in very different vectors, which usual methods of comparing vectors (like cosine-similarity) would not detect as being 'close' at all. So it doesn't make sense, and I've not seen it used.
So, only your option (1), as properly implemented to (weighted-)average word-vectors, is a good baseline for sentence-similarities.
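A minimal sketch of that averaging baseline plus cosine similarity, assuming a trained gensim KeyedVectors object named wv (TF-IDF weighting and unit-normalization omitted):

import numpy as np

def sentence_vector(tokens, wv):
    # average the word-vectors of all in-vocabulary tokens
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g. wv = Word2Vec(corpus, vector_size=300).wv
# sim = cosine(sentence_vector("get out of here".split(), wv),
#              sentence_vector("of here get out".split(), wv))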
But, it's still fairly basic and there are many other ways to compare sentences using text-vectors. Here are just a few:
One algorithm closely related to word2vec itself is called 'Paragraph Vectors', and is often called Doc2Vec. It uses a very word2vec-like process to train vectors for full ranges of text (whether they're phrases, sentences, paragraphs, or documents) that work kind of like 'floating document-ID words' over the full text. It sometimes offers a benefit over just averaging word-vectors, and in some modes can produce both doc-vectors and word-vectors that are also comparable to each other.
If your interest isn't just pairwise sentence similarities, but some sort of downstream classification task, then Facebook's 'FastText' refinement of word2vec has a classification mode, where the word-vectors are trained not just to predict neighboring words, but to be good at predicting known text classes, when simply added/averaged together. (Text-vectors constructed from such classification vectors might be good at similarities too, depending on how well the training-classes capture salient contrasts between texts.)
Another way to compute pairwise similarities, using just word-vectors, is "Word Mover's Distance". Rather than averaging all the word-vectors for a text together into a single text-vector, it considers each word-vector as a sort of "pile of meaning". Compared to another sentence, it calculates the minimum routing work (distance along lots of potential word-to-word paths) to move all the "piles" from one sentence into the configuration of another sentence. It can be expensive to calculate, but usually represents sentence-contrasts better than the simple single-vector-summary that naive word-vector averaging achieves.
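For reference, a rough sketch of how the Doc2Vec and Word Mover's Distance options above look in gensim (the tiny corpus and parameters are placeholders, and wmdistance additionally needs the pyemd or POT package depending on the gensim version):

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["get", "out", "of", "here"], ["into", "the", "clear", "blue", "sky"]]

# Paragraph Vectors / Doc2Vec: one trained vector per tagged text
docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)
vec = d2v.infer_vector(["get", "out", "of", "here"])

# Word Mover's Distance between two token lists, using ordinary word-vectors
w2v = Word2Vec(sentences, vector_size=100, min_count=1)
dist = w2v.wv.wmdistance(sentences[0], sentences[1])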
import numpy as np
from gensim.models import Word2Vec

# `sentences` is assumed to be a list of tokenized sentences (lists of words)
model = Word2Vec(sentences, vector_size=100, min_count=1)

def sent_vectorizer(sent, model):
    # average the vectors of all in-vocabulary words in the sentence
    sent_vec = np.zeros(model.vector_size)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model.wv[w])
            numw += 1
        except KeyError:
            pass  # skip out-of-vocabulary words
    return np.asarray(sent_vec) / max(numw, 1)

X = []
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))

print("========================")
print(X)
Visual Studio / XPath / RegEx:
Given Expression:
(?<TheObject>(Car|Car Blue)) +(?<OldState>.+) +---> +(?<NewState>.+)
Given Searched String:
Car Blue Flying ---> Crashed
I expected:
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
What I get:
TheObject = "Car"
OldState = "Blue Flying"
NewState = "Crashed"
Given new RegEx:
(?<TheObject>(Car Blue|Car)) +(?<OldState>.+) +---> +(?<NewState>.+)
Result is (what I want):
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
I conceptually get what's happening under the hood; the RegEx is putting the first (left-to-right) match it finds in the OR'd list into the <TheObject> group and then goes on.
The OR'd list is built at run time and cannot guarantee the order that "Car" or "Car Blue" is added to the OR'd list in <TheObject> group. (This is dramatically simplified OR'd list)
I could brute force it, by sorting the OR'd list from longest to shortest, but, I was looking for something a little more elegant.
Is there a way to make <TheObject> group capture the largest it can find in the OR'd list instead of the first it finds? (Without me having to worry about the order)
Thank you,
I would normally automatically agree with an answer like ltux's, but not in this case.
You say the alternation group is generated dynamically. How frequently is it generated dynamically? If it's every user request, it's probably faster to do a quick sort (either by longest length first, or reverse-alphabetically) on the object the expression is built from than to write something that turns (Car|Car Red|Car Blue) into (Car( Red| Blue)?).
The regex may take a bit longer (you probably won't even notice a difference in the speed of the regex) but the assembly operation may be much faster (depending on the architecture of the source of your data for the alternation list).
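A quick sketch of that sort-then-join approach (shown in Python, where named groups are written (?P<name>) instead of .NET's (?<name>); the idea carries over directly):

import re

options = ["Car", "Car Blue", "Car Red"]  # built at run time, order unknown
alternation = "|".join(re.escape(o) for o in sorted(options, key=len, reverse=True))
regex = re.compile(rf"(?P<TheObject>{alternation}) +(?P<OldState>.+) +---> +(?P<NewState>.+)")

m = regex.match("Car Blue Flying ---> Crashed")
print(m.group("TheObject"), "|", m.group("OldState"), "|", m.group("NewState"))
# Car Blue | Flying | Crashed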
In a simple test of an alternation with 702 options, across three methods, the results are comparable using an option set like this, but none of these results take into account the amount of time needed to build the string, which grows as the complexity of the string grows.
The options are all the same set of words (zap, yes, xerox, ..., apple), just arranged in different formats.
Using Google Chrome and Javascript, I tried three (edit: four) different formats and saw consistent results for all between 0-2ms.
'Optimized factoring' a(?:4|3|2|1)?
Reverse alphabetically sorting (?:a4|a3|a2|a1|a)
Factoring a(?:4)?|a(?:3)?|a(?:2)?|a(?:1)?. All are consistently coming in at 0 to 2ms (the difference being what else my machine might be doing at the moment, I suppose).
Update: I found a way that you may be able to do this without sorting in Regular Expressions, using a lookahead like this (?=a|a1|a2|a3|a4|a5)(.{15}|.{14}|.{13}|...|.{2}|.) where 15 is the upper bound, counting all the way down to the lower bound.
Without some restraints on this method, I feel like it can lead to a lot of problems and false positives. It would be my least preferred result. If the lookahead matches, the capture group (.{15}|...) will capture more than you'll desire on any occasion where it can. In other words, it will reach ahead past the match.
Though I made up the term Optimized Factoring in comparison to my Factoring example, I can't recommend my Factoring example syntax for any reason. Sorted would be the most logical, coupled with easier to read/maintain than exploiting a lookahead.
You haven't given much insight into your data but you may still need to sort the sub groups or factor further if the sub-options can contain spaces and may overlap, further diminishing the value of "Optimized Factoring".
Edit: To be clear, I am providing a thorough examination as to why no form of factoring is a gain here. At least not in any way that I can see. A simple Array.Sort().Reverse().Join("|") gives exactly what anyone in this situation would need.
In a backtracking regex engine like .NET's, the | operator tries its alternatives from left to right and keeps the first alternative that lets the rest of the pattern match; it does not look for the longest alternative. We can't change the behaviour of the | operator.
So the solution is to avoid using | operator. Instead of (Car Blue|Car) or (Car|Car Blue), use (Car( Blue)?).
(?<TheObject>(Car( Blue)?)) +(?<OldState>.+) +---> +(?<NewState>.+)
Then the <TheObject> group will always be Car Blue in the presence of Blue.
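For illustration, the same factored pattern in Python syntax (Python writes named groups as (?P<name>) rather than .NET's (?<name>)):

import re

regex = re.compile(r"(?P<TheObject>Car( Blue)?) +(?P<OldState>.+) +---> +(?P<NewState>.+)")
m = regex.match("Car Blue Flying ---> Crashed")
print(m.groupdict())
# {'TheObject': 'Car Blue', 'OldState': 'Flying', 'NewState': 'Crashed'}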
I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters...
What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.
Is that a valid/proven approach? Or is it plain stupid?
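For what it's worth, a minimal sketch of that normalization idea (dividing by the length of the longer string is one common variant; the helper names are made up):

def levenshtein(a: str, b: str) -> int:
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # 1.0 for identical strings, 0.0 when every character differs
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("cat", "scat"))                 # 0.75
print(similarity("difference", "differences"))   # ~0.91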
Is Levenshtein distance supposed to be used as an absolute value?
It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than distance per se.)
I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?
Are you doing spell-checking? or conforming invalid input to a known set of values?
What are your priorities?
Minimize false positives (try to make sure all suggested words are very "similar", and list of suggestions is short)
Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
Maximize average matching accuracy
You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.
It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which takes into account the length of the strings and the number of characters in common:
The Jaro distance d_j of two given strings s1 and s2 is
d_j = (m / |s1| + m / |s2| + (m - t) / m) / 3
where m is the number of matching characters and t is the number of transpositions (matching characters that appear in a different order, divided by two).
Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning, for a set prefix length l.
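A minimal sketch of Jaro and Jaro-Winkler similarity along those lines (a direct implementation of the definition above, not a tuned library routine):

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0
    # count matching characters within the allowed window
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len(s2))
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    t, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)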
The Levenshtein distance is a relative value between two words. Comparing the LD to the length is not relevant, e.g.
cat -> scat = 1 (75% similar??)
difference -> differences = 1 (90% similar??)
Both these words have lev distances of 1 ie they differ by one character, but when compared to their lengths the second set would appear to be 'more' similar.
I use soundexing to rank words that have the same Levenshtein distance, e.g.
cat and fat both have an LD of 1 relative to kat, but the intended word is more likely to be cat than fat when using soundex (assuming the word is incorrectly spelt, not incorrectly typed!)
So the short answer is just use the lev distance to determine the similarity.
I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide how closely they match. In my case, spelling and punctuation are going to be an issue; I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper, I hope it is helpful to those reading this question, and on the topic of similar string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement with Levenshtein distance as the cost for each word.
Finding the cost of rearrangement is kind of like the pancake sorting problem. So you can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of permutation distance and Levenshtein distance on each word pair.
edit:
Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertions: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertions: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Substitutions: 2, deletions: 1, insertions: 1)
However you build the cost function, the second choice will have the lowest cost, which is what you expected!
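A brute-force sketch of this idea for a handful of words (a simple out-of-place count stands in for the flip-based rearrangement distance used in the worked examples above; the helper names are made up):

from itertools import permutations

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rearrangement_cost(s1, s2):
    # brute-force the best word alignment; fine for a handful of words
    w1, w2 = s1.lower().split(), s2.lower().split()
    # pad the shorter sentence with empty 'words' so missing words cost their length
    while len(w1) < len(w2):
        w1.append("")
    while len(w2) < len(w1):
        w2.append("")
    best = None
    for perm in permutations(range(len(w2))):
        out_of_place = sum(1 for k, p in enumerate(perm) if k != p)
        char_edits = sum(levenshtein(w1[k], w2[p]) for k, p in enumerate(perm))
        cost = out_of_place + char_edits
        best = cost if best is None else min(best, cost)
    return best

a = "Into the clear blue sky"
print(rearrangement_cost(a, "In the blue clear sky"))  # lower cost
print(rearrangement_cost(a, "The color is sky blue"))  # higher cost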
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
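A minimal sketch of the scoring formula above, using zlib as comp (zlib's fixed header overhead makes the score noisy for very short strings, so treat the numbers as rough):

import zlib

def clen(s: str) -> int:
    # length in bytes of the zlib-compressed UTF-8 encoding of s
    return len(zlib.compress(s.encode("utf-8")))

def compression_similarity(s: str, t: str) -> int:
    # symmetric form of the score described above; higher means more similar
    return 2 * (clen(s) + clen(t)) - clen(s + t) - clen(t + s)

print(compression_similarity("Into the clear blue sky", "In the blue clear sky"))
print(compression_similarity("Into the clear blue sky", "The color is sky blue"))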
One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculating how many edits it takes to transform one string into another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello", when compared, needed to give me five characters that are the same; in this case they would be the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using std::string;

// a is the longer (or equal-length) string; finds the longest run of
// consecutive matching characters at any offset of b within a.
int bigger(const string& a, const string& b){
    int maxcount = 0;
    for(size_t i = 0; i < a.size(); ++i){
        int currentcount = 0; // length of the current run of matching characters
        for(size_t j = 0; j < b.size() && i + j < a.size(); ++j){
            if(a[i + j] == b[j]){
                ++currentcount;
            }
            else{
                if(currentcount > maxcount){
                    maxcount = currentcount;
                }
                currentcount = 0;
            }
        }
        if(currentcount > maxcount){ // the run may extend to the end of b
            maxcount = currentcount;
        }
    }
    return (int)(((float)maxcount / (float)a.size()) * 100);
}

int compare(const string& a, const string& b){
    return a.size() > b.size() ? bigger(a, b) : bigger(b, a);
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution.
In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform, and so is image B. Multiplying F(A) by the complex conjugate of F(B) and transforming back (cross-correlation via the FFT) gives you a mostly black image with a few bright spots. Those spots indicate where A matches B strongly, and the total sum of the spots would indicate an overall similarity. I'm not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string, e.g. they both have "blue" and "sky", and they're in the same sentence, etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that would solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
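For what it's worth, a rough sketch of the word-pair-distance idea described above (the penalty for pairs missing from the reference, and skipping the direction-based down-weighting, are simplifications not specified above):

from collections import defaultdict
from itertools import combinations

def pair_distances(words):
    # map 'word1|word2' -> list of signed position differences in the text
    table = defaultdict(list)
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        table[f"{w1}|{w2}"].append(j - i)
        table[f"{w2}|{w1}"].append(i - j)  # reverse key, negated distance
    return table

def pair_score(reference, comparison):
    ref = pair_distances(reference.lower().split())
    comp = comparison.lower().split()
    total = 0.0
    for (i, w1), (j, w2) in combinations(enumerate(comp), 2):
        candidates = ref.get(f"{w1}|{w2}")
        if candidates:
            total += min(abs((j - i) - d) for d in candidates)
        else:
            total += j - i + 1  # crude penalty for a pair absent from the reference
    return total / (len(comp) ** 2)

a = "Into the clear blue sky"
print(pair_score(a, "In the blue clear sky"))  # smaller score = closer match
print(pair_score(a, "The color is sky blue"))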