Word2vec build vocab adds TM to words - word2vec

I'm trying to convert my text data to vectors. I would like to transform the word ultraram to a vector. I added the word to the model using model.build_vocab, but only ultraram™ is added. What did I do wrong?
model.save("word2vec.model2")
model = Word2Vec.load("word2vec.model2")
model.build_vocab(data_tokenized, update=True)
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors2")
# Load back with memory-mapping = read-only, shared across processes.
self.wv = KeyedVectors.load("word2vec.wordvectors2", mmap='r')
for i in self.wv.key_to_index:
    if "ultrar" in i:
        print(i)
ultraram™
manufactureultraram™
ultrarobust
ultrarare
ultrarealistic
ultrarelativistic
It shows some words with a ™. What does this mean, and how can I add the word "ultraram" without the ™?

If there's a ™ at the end of some tokens, then those tokens, with the ™, were exactly what had been passed into the model when its vocabulary was first established.
If you don't want them, you'd have to strip them during your tokenization. (Your current question/code doesn't show how you might have tokenized your data.)
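As a minimal sketch of that stripping step, assuming the data is already a list of token lists (as data_tokenized appears to be), something like the following could run before build_vocab. The set of symbols and the name clean_tokens are illustrative assumptions, not part of gensim:

```python
import re

# Symbols to drop from tokens; this set is an assumption, extend it as needed.
MARKS = re.compile("[\u2122\u00ae\u00a9]")  # trademark, registered, copyright

def clean_tokens(tokens):
    """Remove trademark-style symbols and drop tokens that become empty."""
    cleaned = (MARKS.sub("", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Toy input standing in for the real tokenized corpus.
data_tokenized = [["the", "ultraram\u2122", "chip"], ["\u2122", "ultrarobust"]]
data_clean = [clean_tokens(sentence) for sentence in data_tokenized]
print(data_clean)
```

Words cleaned this way only appear in the vocabulary if the cleaned tokens are what gets passed to the model.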
Separately:
Directly using .load() to replace a Word2Vec model's existing KeyedVectors won't in general be reliable: a Word2Vec isn't expecting that to change separate from its own initialization/training.
It may work in this limited case (reloading exactly the same word-vectors as were just saved), but in such a case it's unclear why you'd want to do it. From the comment it seems the motivation here might be to save some memory. However, if you're only looking up word-vectors, you don't need the full Word2Vec model at all. You can just use the set of KeyedVectors alone (for more memory savings).

Related

Finding word similarities along the output or input vectors using gensim word2vec?

I know that you can use model.wv.most_similar(...) to get words by cosine similarity in gensim.
I also know gensim gives you input and output vectors in e.g. model.syn0 and model.syn1neg.
Is there a simple way to calculate cosine similarity and create a list of most similar using only the input or output vectors, one or the other? E.g. I want to try doing it using just output vectors.
There's no built-in facility, but I believe you could achieve it by creating a separate KeyedVectors instance where you replace the usual ('input projection'/'syn0') vector-array with the same-sized output array (as exists in negative-sampling models).
Roughly the following may work:
full_w2v_model = ... # whatever training/loading is necessary
full_w2v_model.wv.save_word2vec_format(PATH) # saves just the word-vectors
out_vecs = KeyedVectors.load_word2vec_format(PATH) # reloads as separate object
out_vecs.vectors = full_w2v_model.syn1neg # clobber raw vecs with model's output layer
(Let me know if this works as-is or needs some further touch-up to work!)
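If swapping arrays inside a KeyedVectors feels fragile, the ranking itself is easy to compute by hand. The sketch below is plain Python over toy data; in practice the word list and matrix might come from model.wv.index_to_key and model.syn1neg, but that pairing is an assumption, and most_similar here is a hypothetical helper, not the gensim method:

```python
import math

def most_similar(word, words, vectors, topn=3):
    """Rank words by cosine similarity to `word`, using the given vector rows."""
    def unit(v):
        # Scale to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    units = [unit(v) for v in vectors]
    query = units[words.index(word)]
    scores = [
        (other, sum(a * b for a, b in zip(query, u)))
        for other, u in zip(words, units)
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy stand-ins for the vocabulary and the output-layer rows.
words = ["cat", "dog", "car"]
output_vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
ranked = most_similar("cat", words, output_vectors, topn=2)
print(ranked)
```

This assumes no row is all zeros (the normalization would divide by zero) and that row i of the matrix belongs to word i.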

Handling OOV words in GoogleNews-vectors-negative300.bin

I need to calculate the word vectors for each word of a sentence that is tokenized as follows:
['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'].
If I were using the pretrained fastText embeddings (cc.en.300.bin.gz by Facebook), I could handle OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it returns an InvalidKey Error. My question is: how do we calculate the word vectors for OOV words then? I searched online but could not find anything. Of course, one way to do this is to remove all the sentences that have words not listed in Google's word2vec. However, I noticed that only 5550 out of 16134 sentences have all their words in the embedding.
I did also
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)
model.train(sentences_with_OOV_words)
However, tensorflow 2 returns an error.
Any help would be greatly appreciated.
If a word is not in the vocabulary, initialize it with a zero vector of the same size (the GoogleNews word2vec vectors have 300 dimensions):
import numpy as np

try:
    # model is the KeyedVectors object loaded above, so call get_vector on it directly
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.
(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)
You can check if a word is available, using the in keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.
But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
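A minimal sketch of the "ignore unknown words" option, using a plain dict as a stand-in for the loaded KeyedVectors (which supports the same in check and [] lookup). Averaging the remaining vectors is one common way to get a sentence vector and is an assumption here, as is the helper name sentence_vector:

```python
def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional stand-in for the 300-dimensional GoogleNews vectors.
toy_vectors = {"my": [1.0, 0.0], "aunt": [0.0, 1.0], "teddy": [1.0, 1.0]}
tokens = ["my", "aunt", "give", "me", "a", "teddy", "ruxpin"]
avg = sentence_vector(tokens, toy_vectors)
print(avg)
```

Only the three known words contribute to the average, so the unknown ones do not pull the result toward zero.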
Awesome! Thank you very much.
def get_vectorOOV(s):
    try:
        # model is a KeyedVectors object, so get_vector is called on it directly
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))

What are the ways of Key-Value extraction from unstructured text?

I'm trying to figure out what approaches exist (and which of them is best) for extracting values for predefined keys from unstructured text.
Input:
The doctor prescribed me a drug called favipiravir.
His name is Yury.
Ilya has already told me about that.
The weather is cold today.
I am taking a medicine called nazivin.
Key list: ['drug', 'name', 'weather']
Output:
['drug=favipiravir', 'drug=nazivin', 'name=Yury', 'weather=cold']
So, as you can see, in the 3rd sentence there is no explicit key 'name' and therefore no value is extracted (I think this is the difference from NER). At the same time, 'drug' and 'medicine' are synonyms, so we should treat 'medicine' as the 'drug' key and extract the value as well.
And the next question: what if the key set is mutable?
Should I use a regexp-based approach as a baseline because the keys are predefined, or is there a way to implement it with supervised learning/a neural network? (But in that case, how would I deal with mutable keys?)
You can use a parser to tag words. Your problem is similar to Named Entity Recognition (NER). A lot of libraries, like NLTK in Python, have POS taggers available. You can try those. They are generally trained to identify names, locations, etc. Depending on the type of words you need, you may need to train the parser. So you'll need some labeled data also. Check out this link:
https://nlp.stanford.edu/software/CRF-NER.html
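For comparison, the regexp baseline mentioned in the question can be sketched in a few lines. Everything here is an assumption made for illustration: the synonym table, the two trigger words ("called", "is"), and the single-word values. It handles the five example sentences but is nowhere near a general solution:

```python
import re

# Hypothetical synonym table: surface word -> canonical key.
SYNONYMS = {"drug": "drug", "medicine": "drug", "name": "name", "weather": "weather"}

# One pattern: a key word, then "called" or "is", then a single-word value.
PATTERN = re.compile(
    r"\b(" + "|".join(SYNONYMS) + r")\b\s+(?:called|is)\s+(\w+)", re.IGNORECASE
)

def extract(text):
    """Return 'key=value' strings for every key word followed by a value."""
    return [
        f"{SYNONYMS[m.group(1).lower()]}={m.group(2)}"
        for m in PATTERN.finditer(text)
    ]

text = (
    "The doctor prescribed me a drug called favipiravir. "
    "His name is Yury. "
    "Ilya has already told me about that. "
    "The weather is cold today. "
    "I am taking a medicine called nazivin."
)
pairs = extract(text)
print(pairs)
```

With a mutable key set, the table and the compiled pattern would have to be rebuilt whenever keys change, which is cheap here but does not discover new synonyms on its own.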

Selectively loading elements from jld file in Julia

I saved an object named results in Julia with the JLD package by writing
@save "res.jld" results
The object results is a
81-element Array{Tuple{Int64,Float64,Array{Array{Array{Int64,1},1},1},Array{Array{Array{Int64,1},1},1},Array{Int64,1}},1}
where each element has 5 elements: Int64, Float64, Array{Array{Array{Int64,1},1},1}, Array{Array{Array{Int64,1},1},1} and Array{Int64,1}.
How can I access the first 2 elements of each element (the Int64 and the Float64) without loading the whole file, which requires a large amount of memory? I want to avoid @load "res.jld" because it's too heavy.
What you are looking for isn't quite possible, I'm afraid. There is hyperslabbing, and it is also partially supported by JLD (simple example here). It will allow you to read in each element one by one. However, it doesn't enable you to load only the first two components of each element.
Nonetheless, iterating over each element one by one might be still useful as you can avoid loading the full dataset into memory (hence you could process a dataset that is too large to be kept in memory). It probably isn't faster than loading the full dataset (if you can) though.
Creating some (simplified) fake data and saving it to disk
using JLD
results = [(i, Float64(i), rand(3)) for i in 1:1000];
@save "res.jld" results
Basically, what I was describing above would look like this
jldopen("res.jld") do f
    for k in 1:length(f["results"])
        f["results"][k][1][1:2] # read k-th element and extract first two components.
    end
end

How can I perform search on a lookup table without loading it in memory?

I have a file recording the entries of a lookup table. If the number of entries is small, I can simply load this file into an STL map and perform the search in my code. But what if there are a great many entries? If I do it the way above, it may cause an error such as running out of memory. I'm here to listen to your advice...
P.S. I just want to perform search without loading all entries into memory.
Can Key-value database solve this problem?
You'll have to load the data from the hard drive eventually, but if the table is huge it won't fit into memory for a linear search, so:
1. Think about whether you can split the data into a set of files.
2. Make an index table of which file contains which entries (say the first 100 entries are in "file1_100", the second hundred in "file101_200", and so on).
3. Using the index table from step 2, locate the file to load.
4. Load the file and do a linear search.
That is a really simplified version of what a typical database management system does, so you may want to use one such as MySQL, PostgreSQL, MsSQL, or Oracle.
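As a rough illustration of the database route, here is a sketch using SQLite through Python's standard sqlite3 module rather than one of the servers named above (that substitution is mine, chosen because it needs no setup). The table name, file name, and generated entries are made up for the example:

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; the table lives on disk, not in a Python dict.
db_path = os.path.join(tempfile.mkdtemp(), "lookup.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE lookup (key TEXT PRIMARY KEY, value TEXT)")

# Load entries in one batch; the generator avoids building a big list first.
entries = ((f"key{i}", f"value{i}") for i in range(10000))
conn.executemany("INSERT INTO lookup VALUES (?, ?)", entries)
conn.commit()

def find(key):
    """Look up one key; the PRIMARY KEY index avoids a full table scan."""
    row = conn.execute("SELECT value FROM lookup WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

hit = find("key4242")
miss = find("no-such-key")
print(hit, miss)
conn.close()
```

Only the pages needed to answer each query are read from disk, so the full table never has to fit in memory.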
If that's a study project, then after you're done with the search problem, consider optimizing the linear operations (by switching to something like binary search) and the tables (real databases use balanced tree structures, hash tables, and the like).
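The split-and-index scheme above, with binary search in place of the linear scans, might look like this. It is a sketch under assumptions: entries are sorted by key, keys and values contain no tabs or newlines, and the file layout and function names (build_chunks, lookup) are invented for the example:

```python
import bisect
import os
import tempfile

def build_chunks(sorted_items, chunk_size, directory):
    """Write sorted (key, value) pairs into chunk files; return the index."""
    index = []  # list of (first_key_in_chunk, file_path), kept in key order
    for start in range(0, len(sorted_items), chunk_size):
        chunk = sorted_items[start:start + chunk_size]
        path = os.path.join(directory, f"chunk_{start}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            for key, value in chunk:
                fh.write(f"{key}\t{value}\n")
        index.append((chunk[0][0], path))
    return index

def lookup(key, index):
    """Find the chunk that could hold `key`, load only it, binary-search it."""
    first_keys = [first for first, _ in index]
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key sorts before every stored key
    with open(index[pos][1], encoding="utf-8") as fh:
        rows = [line.rstrip("\n").split("\t") for line in fh]
    keys = [row[0] for row in rows]
    i = bisect.bisect_left(keys, key)
    return rows[i][1] if i < len(keys) and keys[i] == key else None

# Toy data: 1000 sorted entries split into files of 100 entries each.
items = sorted((f"word{i:04d}", str(i)) for i in range(1000))
index = build_chunks(items, chunk_size=100, directory=tempfile.mkdtemp())
found = lookup("word0542", index)
missing = lookup("zebra", index)
print(found, missing)
```

Only the small index and one chunk are in memory during a lookup; the chunk size is the knob that trades memory for the number of files.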
One method would be to reorganize the data in the file into groups.
For example, let's consider a full language dictionary. Usually, dictionaries are too huge to read completely into memory. So one idea is to group the words by first letter.
In this example, you would first read in the appropriate group based on the letter. So if the word you are searching for begins with "m", you would load the "m" group into memory.
There are other methods of grouping such as word (key) length. There can also be subgroups too. In this example, you could divide the "m" group by word lengths or by second letter.
After grouping, you may want to write the data back to another file so you don't have to modify the data anymore.
There are many ways to store groups on the file, such as using a "section" marker. These would be for another question though.
The ideas here, including those from @047, are to structure the data for the most efficient search, given your memory constraints.