Produce new word2vec model from existing one - word2vec

Here is a word2vec model: model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True), and it contains words in uppercase. How can I produce a new model from this one that contains all of its words, but lowercased? All words should keep the same vectors as in the source model.

When you're using a set of pre-trained vectors, like GoogleNews-vectors-negative300.bin.gz, the creator of those vectors determined what words, with what case-handling, are included.
Once loaded, lookup in such a model is by exact, case-sensitive string matching.
There's no built-in capability in Gensim for performing later case-normalization, such as converting all keys to lowercase. And if there were, there would be an open question of how to deal with situations where multiple existing keys would all flatten to the same key.
For example, what if a vector set includes separate vectors for "USA", "Usa", and "usa", but you want a case-insensitive lookup of "usa"? Should just one of the vectors be retained, discarding the others? Should the vector returned be some average of the three? What if there's some odd mixed casing, say "usA", that's late in the list of all vectors (and thus was relatively infrequent in the training data)? Should that vector have no weight, lesser weight, or equal weight compared to whatever casing is most frequent?
If you know how you'd want to resolve such cases, you could certainly tamper with the model itself to modify its mappings. For example, you could look through the w2v_model.index2entity list, which shows the word in each 'slot' of the model, and modify both that list and the w2v_model.vocab dictionary so that they only include the mappings you'd prefer.
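A minimal sketch against the gensim 3.x API (where index2entity exists), assuming you resolve collisions by keeping only the most-frequent casing of each word; the output filename is just a placeholder:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',
                                       binary=True)

# index2entity lists words in descending frequency order, so the first
# casing seen for each lowercased form is the most-frequent one.
chosen = {}
for word in kv.index2entity:
    low = word.lower()
    if low not in chosen:
        chosen[low] = kv[word]

lower_kv = KeyedVectors(kv.vector_size)
lower_kv.add(list(chosen.keys()), np.array(list(chosen.values())))
lower_kv.save_word2vec_format('GoogleNews-lowercased.bin', binary=True)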


Word2Vec word vectors: most_similar

I trained a Word2Vec model and I am trying to formulate the most_similar function mathematically.
I am thinking of a set that contains the n most similar words, given a word as reference.
Is there a good definition somewhere?
You can view the source code which implements most_similar() for the gensim Python library's KeyedVectors abstraction (for holding & performing common actions on sets of word-vectors):
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/keyedvectors.py#L491
Roughly, it first computes a target vector – by combining any positive & negative examples the caller has provided. In the common case, this might just be a single ('positive') word-vector.
Then, it calculates the cosine-similarity between that target vector and every other vector, sorts those similarities from highest to lowest, and returns the top-N results.
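In symbols, one way to write the common single-('positive')-word case (a paraphrase of that code, not an official definition): with every vector unit-normalized,

\operatorname{most\_similar}(w, n) \;=\; \operatorname*{arg\,top\text{-}n}_{v \in V \setminus \{w\}} \cos(\mathbf{v}, \mathbf{w}),
\qquad
\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert\mathbf{v}\rVert\,\lVert\mathbf{w}\rVert}

When positive examples P and negative examples N are given, \mathbf{w} is replaced by the combined target \mathbf{t} \propto \sum_{p \in P} \hat{\mathbf{p}} - \sum_{q \in N} \hat{\mathbf{q}}, i.e. roughly a mean of the unit vectors with the negatives weighted -1.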

Why is a set used instead of a map? C++

Sets are used to check for an object by providing the whole object, usually to test whether the data exists. A map is used to get at an object by using a key (a single piece of data). Correct me if I am wrong. Now the question is: why would we need a set in the first place? Can't we use a map to see if the data exists? Why would we need to provide all the information just to see if it exists?
There are many operations where you just need a set; using a map would just waste extra space. For example (see the sketch below):
Set operations (union, intersection, etc.).
Keeping unique elements from a collection of numbers, objects, etc.
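A minimal sketch of both uses with the standard library:

#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

int main() {
    // Keeping unique elements: duplicates collapse on insertion.
    std::vector<int> raw = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
    std::set<int> unique(raw.begin(), raw.end());   // {1, 2, 3, 4, 5, 6, 9}

    // Set operations: the intersection of two sets.
    std::set<int> a = {1, 2, 3, 4}, b = {3, 4, 5};
    std::set<int> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(common, common.begin()));  // {3, 4}
}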
A set serves to group items of the same type that are different among themselves (i.e., they are not equal). For example, the numbers 1 and 2 are both of int type, but 1!=2.
set containers are useful when you want to keep track of collections of homogeneous things as a group, and perform mathematical operations on such groups (like intersection, union, difference, etc.). For example, imagine a set of search results containing all the documents mentioning the words cat and dog, and then another set containing all the documents mentioning the word pet. The union of those two sets would give you the group of documents mentioning cat and dog, or pet. Notice that such a group will have no repetitions: if a document was in both sets initially, it will appear only once in the union.
A map is most certainly not a set, but it can be seen as an arrangement which allows you to associate a value with every element of a set. Maps are used to represent relationships. For example, each person in the set of people working for a company has an associated employee_number; in this case a map would be a useful structure to represent such a relationship.
Going back to the previous example, if you wanted to know how many times each page has been accessed, you could create a map along the lines of std::map<Page, int>, that is, a relationship between the pages and the number of times each has been visited.
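A minimal sketch of that counting pattern (using std::string for the page URL, since the Page type above is just illustrative):

#include <map>
#include <string>

std::map<std::string, int> visits;   // page URL -> access count

void record(const std::string& url) {
    ++visits[url];   // operator[] default-constructs the count to 0 on first access
}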
Notice that the keys of a map form a set (probably this is what confuses many people), and an implication of this property is that you can only have a given key once (there are containers such as std::multimap where a key can map to several values, though).
So, if you need to interact with groups and collections as a whole, and with the members of the group itself, you probably want a set. If you need to associate certain things with members of a group or a collection, you probably want a map. If the association spans more than one dimension, you probably want a std::multimap.
Important: note that in C++, std::set and std::map are ordered. C++11 offers unordered alternatives called std::unordered_set and std::unordered_map.
A set contains unique, ordered values, while a map stores values (which need not be unique) that are accessed through unique keys.
Either could be used to determine whether an object exists; it depends on your use case and how you need to access that object. Can you test whether the set contains an object you already hold a reference to, or do you need to look it up by one or more keys in order to compare it?

Is std::map a good solution?

All,
I have the following task.
I have a finite number of strings (categories). In each category there will be a set of team/value pairs. The number of teams is finite, based on the user selection.
Neither count exceeds 25.
Now the values will change based on user input, and when a value changes the teams should be re-sorted by that value.
I was hoping that the STL has some kind of auto-sorted vector or list container, but the only thing I could find is std::map<>.
So what I think I need is:
struct Foo
{
    std::string team;
    double value;
    bool operator<(const Foo& other) const { return value < other.value; }
};
std::map<std::string, std::vector<Foo>> myContainer;
and just call std::sort() when a value changes.
Or is there more efficient way to do it?
[EDIT]
I guess I need to clarify what I mean.
Think about it this way.
You have a table. The rows of this table are teams. The columns are categories. Each cell of this table is divided in half; the top half holds the category value for a given team, and this value increases with every player added.
Now, when a player is added to a team, the player's scoring categories are added to that team's values, and the data in the columns is re-sorted. So, for category "A" the order may be team1, team2; and for category "B" it may be team2, team1.
Then, based on the position of each team, a score will be assigned for each team/category.
And that score I will need to display.
I hope this clarifies what I am trying to achieve and makes it clearer what I'm looking for.
[/EDIT]
It really depends on how often you are going to modify the data in the map, versus how often you're just going to search for a std::string and grab the vector.
If your access pattern is: add a map entry, fill all entries in its vector, move on to the next and fill all entries in that vector, and so on, and only afterwards randomly access the map for the vectors, then no, a map is probably not the best container. You'd be better off using a vector containing pairs of the string and the vector, then sorting it once everything has been added (see the sketch below).
In fact, organising it as above is probably the most efficient way of setting it up (I admit this is not always possible, however). Furthermore, it would be highly advisable to use some sort of hash value in place of the std::string, as a hash compare is many times faster than a string compare; you also have the string stored in Foo anyway.
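A minimal sketch of the fill-then-sort approach (reusing Foo from the question; the function names are mine):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, std::vector<Foo>>> entries;

// After all entries have been appended, sort once by category name...
void finalize() {
    std::sort(entries.begin(), entries.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
}

// ...then look categories up with a binary search instead of a map.
const std::vector<Foo>* find_category(const std::string& name) {
    auto it = std::lower_bound(entries.begin(), entries.end(), name,
        [](const auto& e, const std::string& key) { return e.first < key; });
    return (it != entries.end() && it->first == name) ? &it->second : nullptr;
}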
A map will, however, work, but it really depends on exactly what you are trying to do.

When do we actually use a Trie?

I am starting to read about tries. I also got references from friends here in: Tutorials on Trie.
I am not clear on the following:
It seems that to go ahead and use a trie, one assumes that all the input strings that form the search space and are used to build the trie are already separated at distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:
S={ball, bid, byte, car, cat, mac, map etc...}
Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean, before starting to read about tries I imagined that S would be an arbitrarily long text, e.g. a Shakespeare passage.
Then using a Trie we could find things really fast.
But it seems this is not the case.
Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?
So if one wants to search for patterns (the way you do when you Google a query that itself contains spaces, and matching pages are still found), a trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?
Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable, a trie may require less storage for a large dictionary but may well take longer to look up. One example where I have used one is for mapping URLs to operations on a web server, where there may be inheritance of functionality based on the prefix. Here, recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular URL. It would also be efficient for storing a dictionary.
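A minimal prefix-trie sketch in C++ (std::map-based for brevity; a real implementation would likely use arrays or a hash per node):

#include <map>
#include <string>

struct TrieNode {
    bool is_word = false;
    std::map<char, TrieNode> children;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word)
        node = &node->children[c];   // operator[] creates the child if absent
    node->is_word = true;
}

bool contains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;
        node = &it->second;
    }
    return node->is_word;
}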
For doing text searches you would typically represent documents using a token vector of lexemes with weights (perhaps based on occurrence frequency), and then search against that to get a ranking of documents against a particular search vector. There are a number of standard libraries that do this, which I would suggest using rather than writing your own, particularly for removing stopwords, dealing with synonyms, and stemming.
We can use tries for substring searching in linear time, without pre-processing the string every time. You can find a good tutorial on suffix tree generation here:
Ukkonen's suffix tree algorithm in plain English?
As the other examples have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:
My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time); each of these names had to be unique, and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)
The trie tag probably has many more examples.
There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However, tries can also be used to fully index a complete text. Either you use Ukkonen's suffix tree algorithm to produce a suffix trie, or you explicitly construct the suffix trie by storing suffixes (much slower than Ukkonen's algorithm, but also much simpler). As this is preprocessing that needs to be done only once, speed is not that crucial.
For this you would just take your text, insert the full text, then chop off the first letter, insert the resulting text, chop off the second letter, insert...
So if we have the text "The Text" we would insert the following set:
{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}
In the resulting suffix trie we can easily search for any kind of prefix (of a suffix, i.e. any substring). It is also space-efficient, because we do not need to store each suffix as a whole separate string: common prefixes are stored only once. A sketch of the insertion loop follows.
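Here is that insertion loop, reusing the TrieNode and insert helpers from the trie sketch in the earlier answer:

#include <cstddef>
#include <string>

int main() {
    std::string text = "The Text";
    TrieNode root;
    // Insert every suffix: "The Text", "he Text", ..., "t".
    for (std::size_t i = 0; i < text.size(); ++i)
        insert(root, text.substr(i));
}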
If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.
So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example an inverted index (if order is important) or a vector-space-based retrieval algorithm (in case word order does not matter).

Given 200 strings, what is a good way to key a LUT of relationship values

I've got 200 strings. Each string has a relationship (measured by a float between 0 and 1) with every other string. This relationship is two-way; that is, relationship A/B == relationship B/A. This yields n(n-1)/2 relationships, or 19,900.
What I want to do is store these relationships in a lookup table so that given any two words I can quickly find the relationship value.
I'm using C++, so I'd probably use a std::map to store the LUT. The question is what's the best key to use for this purpose.
The key needs to be unique and needs to be able to be calculated quickly from both words.
My approach is going to be to create a unique identifier for each word pair. For example given the words "apple" and "orange" then I combine them together as "appleorange" (alphabetical order, smallest first) and use that as the key value.
Is this a good solution, or can someone suggest something cleverer? :)
Basically you are describing a function of two parameters with the added property that order of parameters is not significant.
Your approach will work if you do not have ambiguity between words when changing order (I would suggest putting a comma or similar separator between the two words to remove possible ambiguities). Any 2D array would also work.
I would probably convert each keyword to some unique identifier (using a simple map) before trying to find the relationship value, but it does not change much from what you are proposing.
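A minimal sketch of such a canonical key (the separator choice is just one option; the function name is mine):

#include <string>
#include <utility>

// Order-insensitive key: sort the two words, then join them with a
// separator that cannot occur inside a word, so ("ab","c") and
// ("a","bc") cannot collide the way plain concatenation would.
std::string pair_key(std::string a, std::string b) {
    if (b < a) std::swap(a, b);
    return a + '\x1F' + b;   // ASCII "unit separator" as the delimiter
}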
If Boost/TR1 is acceptable, I would go for an unordered_map with the pair of strings as the key. The main question would then be: what about the order of the strings? This could be handled by the hash function, e.g. by hashing the lexicographically smaller string first.
Remark: this is just a suggestion after reading the design issue, not a study.
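A sketch of that idea with std::unordered_map (C++11), making both the hash and the equality order-insensitive; the type names here are mine:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Symmetric hash: XOR is commutative, so the word order does not matter.
struct PairHash {
    std::size_t operator()(const std::pair<std::string, std::string>& p) const {
        std::hash<std::string> h;
        return h(p.first) ^ h(p.second);
    }
};

// Equality that also ignores the word order.
struct PairEq {
    bool operator()(const std::pair<std::string, std::string>& a,
                    const std::pair<std::string, std::string>& b) const {
        return a == b || (a.first == b.second && a.second == b.first);
    }
};

std::unordered_map<std::pair<std::string, std::string>, double,
                   PairHash, PairEq> lut;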
How "quickly" is quickly? Given you don't care about the order of the two words, you could try a map like this:
std::map<std::set<std::string>, double> lut;
Here the key is a set of the two words, so if you insert "apple" and "orange", the order is the same as for "orange" and "apple"; and given that std::set supports the less-than operator, it can function as a key in a map. NOTE: I intentionally did not use a pair for the key, given that the order matters there...
I'd start with something fairly basic like this, profile and see how fast/slow the lookups etc. are before seeing if you need to do anything smarter...
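For example, a minimal usage sketch of the lut declared above:

#include <map>
#include <set>
#include <string>

std::map<std::set<std::string>, double> lut;

int main() {
    lut[{"apple", "orange"}] = 0.8;
    // Order doesn't matter: the set normalizes it.
    double r = lut.at({"orange", "apple"});   // r == 0.8
    (void)r;
}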
If you create a sorted array with the 200 strings, then you can binary search it to find the matching indices of the two strings, then use those two indices in a 2D array to find the relationship value.
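A sketch of that layout, assuming the 200 strings are kept sorted in an array (the names here are mine):

#include <algorithm>
#include <array>
#include <string>

constexpr int N = 200;
std::array<std::string, N> words;   // kept sorted
float rel[N][N];                    // rel[i][j] == rel[j][i]

float relationship(const std::string& a, const std::string& b) {
    // Binary-search each word to recover its index in the sorted array.
    auto i = std::lower_bound(words.begin(), words.end(), a) - words.begin();
    auto j = std::lower_bound(words.begin(), words.end(), b) - words.begin();
    return rel[i][j];
}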
If your 200 strings are in an array, your 20,100 similarity values (including each string's similarity with itself) can be in a one-dimensional array too. It all comes down to how you index into that array. Say x and y are the indexes of the strings you want the similarity for. Swap x and y if necessary so that y >= x, then look at entry i = x + y(y+1)/2 in the large array.
(x,y) of (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),(0,3),(1,3)... will take you to entry 0,1,2,3,4,5,6,7...
So this uses space optimally, and it gives faster lookup than a map would. I'm assuming efficiency is at least mildly important to you, since you are using C++!
[If you're not interested in self-similarity values where y == x, then use i = x + y(y-1)/2 instead.]
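A sketch of that triangular indexing (the function name is mine):

#include <cstddef>
#include <utility>

// Packs the pair (x, y) of a symmetric matrix into a flat array of
// n*(n+1)/2 entries (self-pairs included), in the order given above.
std::size_t tri_index(std::size_t x, std::size_t y) {
    if (x > y) std::swap(x, y);   // order-insensitive: ensure y >= x
    return x + y * (y + 1) / 2;
}
// e.g. (0,0)->0, (0,1)->1, (1,1)->2, (0,2)->3, (1,2)->4, (2,2)->5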