Implementation of vector space model in Java - data-mining

Can anyone tell me how to transform text documents into vectors using the bag-of-words concept, and how to implement a vector space model in Java? I have preprocessed my text data set up to the stemming stage, and now I need to transform those text documents into a vectorized model using bag-of-words. How can I implement this in Java?

Build a dictionary: assign every word a unique integer index.
That index is the word's dimension in the VSM, and each document then becomes a vector whose component at that dimension is the count (or weight) of the corresponding word.
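A minimal sketch of this idea in Python (the example documents are made up); in Java, the same structure maps to a HashMap<String, Integer> for the dictionary and an int[] counts array per document:

# Documents are assumed to be already tokenized and stemmed.
documents = [["cat", "sat", "mat"], ["dog", "sat", "log"]]

# 1. Build the dictionary: word -> unique integer index (its dimension in the VSM).
vocab = {}
for doc in documents:
    for word in doc:
        if word not in vocab:
            vocab[word] = len(vocab)

# 2. Turn each document into a bag-of-words count vector over those dimensions.
vectors = []
for doc in documents:
    vec = [0] * len(vocab)
    for word in doc:
        vec[vocab[word]] += 1
    vectors.append(vec)

print(vocab)    # {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3, 'log': 4}
print(vectors)  # [[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]]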

Related

Handling OOV words in GoogleNews-vectors-negative300.bin

I need to calculate the word vectors for each word of a sentence that is tokenized as follows:
['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'].
If I use the pretrained fastText embeddings cc.en.300.bin.gz from Facebook, I can get vectors for OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for OOV words in that case? I searched online but could not find anything. One way, of course, is to remove all the sentences that contain words not listed in Google's word2vec, but I noticed that only 5550 out of 16134 sentences have all of their words in the embedding.
I also tried:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)
model.train(sentences_with_OOV_words)  # fails: KeyedVectors is not a trainable model
However, TensorFlow 2 returns an error.
Any help would be greatly appreciated.
If a word is not in the vocabulary, initialize it with a zero vector of the same size (the Google word2vec vectors have 300 dimensions):
import numpy as np

try:
    # model is the KeyedVectors object loaded above; get_vector raises KeyError for OOV words
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.
(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)
You can check whether a word is available using the in keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.
But it's often better to just ignore such words entirely, i.e. pretend they're not even in your text. (Using a zero vector instead, and then feeding that zero vector into other parts of your system, makes those unknown words dilute the influence of the other nearby word vectors, which often isn't what you want.)
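For example, a minimal sketch of the ignore-unknown-words approach, assuming model is the KeyedVectors loaded earlier and using the tokenized sentence from the question:

import numpy as np

tokens = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']
known = [w for w in tokens if w in model]           # membership check with the `in` keyword
word_vectors = np.array([model[w] for w in known])  # one 300-dimensional vector per known word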
Awesome! Thank you very much.
import numpy as np

def get_vectorOOV(s):
    # Return the word's 300-dimensional vector, or a zero vector if it is OOV.
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))

How to make a vectorized file in Python? I need to convert tweets to vector form in order to run them through a Bayesian network

Is it possible to at least make a dataset? I am doing sentiment analysis and getting the polarity of each message.
I was following this tutorial, but it does not use the kind of data set I need:
http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
It would be great if anyone could explain the CSV file given there.
Basically, the process of converting a collection of text documents into numerical feature vectors is called vectorization. There are several techniques for vectorizing text documents (e.g. word embeddings, bag of words, etc.).
Bag of words is one of the simplest ways to vectorize text into numerical features, and tf-idf is an effective vectorization technique built on the bag-of-words concept.
On a very basic level, tf-idf takes the set of unigrams or bigrams (n-grams in general) from the entire text corpus and uses them as the features for all your text documents (tweets in your case). If you imagine your text corpus as a table of numerical values, then each row is a text document (a tweet) and each column is a unigram (basically a word). The value of each cell (i, j) depends on the term frequency of unigram j in tweet i (the number of times that unigram occurs in the tweet) and on the inverse document frequency of unigram j (based on the number of tweets in which that unigram occurs across the whole corpus). Hence, each tweet ends up as a vector of numerical tf-idf values, one per feature (unigram).
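If you use scikit-learn (see the links below), a minimal sketch with its TfidfVectorizer looks like this (the example tweets are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "i love this movie",
    "this movie is terrible",
    "love love love it",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigram features
X = vectorizer.fit_transform(tweets)              # sparse matrix: one row per tweet

print(vectorizer.get_feature_names_out())         # the unigram columns (get_feature_names() on older scikit-learn)
print(X.toarray())                                # tf-idf value for each (tweet, unigram) cell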
For more information on how to implement tf-idf, look at the following links:
http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

python k-means clustering text

I am trying to find an example to assist me to cluster some textual data I have. The data is in the form:
A,B,3
C,D,5
A,D,57
The first two entries are the members of a pair; the number is how often this pair occurs in the dataset. I have over 200,000 unique pairs.
Any tips? Thanks!!
Don't use k-means on such data.
It will not work.
What you have is a similarity matrix, not continuous vectors as needed for k-means. You can try hierarchical clustering (with a sparse similarity, not a distance; no, I won't write the code for you).

Django: Storing huge matrix in table or in a file?

I use Django to manage a machine learning process. At the end of the calculation stage, I'm left with a huge matrix of data (~50 MB of floats). Should I store it in my Django model (a binary field?) or in a file (a FileField)? There seem to be pros and cons to both alternatives.
My specific case: I just need to write the data once the training is finished and load it into memory each time I want to use the learned model. No queries; just read the entire matrix and write the entire matrix to a table/file.
Thanks for the follow-up!
I am adapting my answer to your use case.
Since you just need to write the data once after training and read it back in full, you could try the file-based approach sketched below.
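A minimal sketch of that file-based approach, assuming numpy and a Django model with a FileField (TrainedModel, weights_file and the helper functions are hypothetical names):

import io
import numpy as np
from django.core.files.base import ContentFile
from django.db import models

class TrainedModel(models.Model):
    # Hypothetical model: the learned matrix lives on disk, referenced by a FileField.
    weights_file = models.FileField(upload_to='trained_models/')

def save_matrix(instance, matrix):
    # Serialize the matrix in .npy format and hand it to the FileField's storage.
    buf = io.BytesIO()
    np.save(buf, matrix)
    instance.weights_file.save('weights.npy', ContentFile(buf.getvalue()))

def load_matrix(instance):
    # Read the whole matrix back into memory when the learned model is needed.
    with instance.weights_file.open('rb') as f:
        return np.load(f)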

How to create an index for a collection of vectors/histograms for content based image retrieval

I'm currently writing a Bag of visual words-based image retrieval system which is similar to the Vector Space Model in text retrieval. Under this framework, each image is represented by a vector (or sometimes also called histogram in the literature). Basically each number in the vector counts the number of times each "visual word" occur in that image. If 2 images have vectors which are "close" together, this means they have many image features in common and are hence similar.
I'm basically trying to create an inverted file index for a set of such vectors. I want something that can scale from thousands (during the trial stage) to hundreds of thousands or millions of images, so a home-made data-structure hack will not work.
I've looked at Lucene but apparently it only indexes text (correct me if I'm wrong) whereas in my case I want it to index numbers (i.e. the vectors themselves). I've seen cases where people convert the vector to a text document in the following way:
<3, 6, ..., 5> --> "w1 w2 ... wn". Basically, any component that is non-zero is replaced by a textual word "wn", where n is the index of that component. This "document" is then passed to Lucene to index.
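For concreteness, a small Python sketch of that conversion (the vector values are made up):

vector = [3, 6, 0, 0, 5]
doc = " ".join(f"w{i}" for i, value in enumerate(vector, start=1) if value != 0)
print(doc)  # "w1 w2 w5" -- note that the counts 3, 6 and 5 are lost in the text form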
The problem with this method is that the text representation of the vector does not encode how frequently each particular "word" occurs, so the ranking of the retrieved images would not be good.
Does anyone know of a mature indexing API that can handle vectors, or perhaps suggest a different encoding scheme for my vectors so that I can continue to use Lucene? I've also looked at the Lucene for Image Retrieval (LIRE) project and tried the demo that came with it, but the number of exceptions generated when I ran that demo makes me unsure about using it.
As for language of API, I'm open to C++ or Java.
Thanks in advance for any replies.
You can try GRire, which is a Java library that implements the Bag of Visual Words model. It is my project, and I am currently working on implementing an inverted index.
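For reference, here is a minimal, library-independent sketch in Python of the inverted-file idea for such sparse count vectors; all names and data are illustrative:

from collections import defaultdict

# image id -> {visual-word index: count} (sparse bag-of-visual-words vectors)
image_vectors = {
    "img1": {0: 3, 1: 6, 4: 5},
    "img2": {1: 2, 3: 1},
}

# Inverted index: visual-word index -> postings list of (image id, count).
index = defaultdict(list)
for image_id, vec in image_vectors.items():
    for word, count in vec.items():
        index[word].append((image_id, count))

def rank(query_vec):
    # Score only images that share at least one visual word with the query
    # (a simple dot product over the non-zero query components).
    scores = defaultdict(float)
    for word, q_count in query_vec.items():
        for image_id, count in index.get(word, []):
            scores[image_id] += q_count * count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank({1: 1, 4: 2}))  # [('img1', 16.0), ('img2', 2.0)]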