Is there any pretrained word2vec model with data containing both single word or multiple words coalesced together such as 'drama', 'drama_film' or '‘africanamericancommunity’. Is there any such model trained with huge dataset such as dataset trained for gloVE?
I did a quick search on google, but unfortunately I could not find a pretrained model. One way to train your own model to detect phrases is to use a bigram model. So, you can take a big wikipedia dump, for instance, preprocess is using bigrams and train the word2vec model.
A good github project which can help you to achieve this is https://github.com/KeepFloyding/wikiNLPpy
A nice article on the topic: https://towardsdatascience.com/word2vec-for-phrases-learning-embeddings-for-more-than-one-word-727b6cf723cf
As stated in google pre-trained word2vec, the pre-trained model by google already contains some phrases (bigrams).
Related
is there any way to train an ml .net model in runtime through user input?
I've created a text classification model, trained it local, deployed it and now my users are using it.
Needed workflow:
Text will be categorized, category is displayed to user, he can accept it or select another of the predefined categories, than this feedback should train the model again.
Thanks!
What you are describing seems like online learning.
ML.NET doesn't have any true 'online' models (by which I mean, models that can adapt to new data example by example and instantaneously refresh): all ML.NET algorithms are 'batch' trainers, that require a (typically large) corpus of training data to produce a model.
If your situation allows, you could aggregate the users' responses as 'additional training data', and re-train the model periodically using this data (in addition to the older data, possibly down-sampled or otherwise decayed).
As #Jon pointed out, a slight modification of the above mechanism is to 'incrementally train an existing model on a new batch of data'. This is still a batch method, but it can reduce the retraining time.
Of ML.NET's multiclass trainers, only LbfgsMaximumEntropyMulticlassTrainer supports this mode (see documentation).
It might be tempting to take this approach to the limit, and 'retrain' the model on each 'batch' of one example. Unless you really, really know what you are doing, I would advise against it: more likely than not, such a training regime will be overfitting rapidly and disastrously.
I Already have a Doc2Vec model. I have trained it with my train data.
Now after a while I want to use Doc2Vec for my test data. I want to add my test data vocabulary to my existing model's vocabulary. How can I do this?
I mean how can I update my vocabulary?
Here is my model:
model = model.load('my_model.Doc2vec')
Words that weren't present for training mean nothing to Doc2Vec, so quite commonly, they're just ignored when encountered in later texts.
It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.
But, while such continued incremental training is theoretically possible, it also requires a lot of murky choices of how much training should be done, at what alpha learning rates, and to what extent older examples should also be retrained to maintain model consistency. There's little published work suggesting working rules-of-thumb, and doing it blindly could just as likely worsen the model's performance as improve it.
(Also, while the parent class for Doc2Vec, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary-expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)
Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is using Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labeling) in the Doc2Vec training set, so that it learns their words and the (unsupervised) relations to other words. The separate supervised classifier would then only be trained on non-test items, and their known labels.
H2O recently added word2vec in its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However even greater possibilities exist from using big data and big computers, of the type that software vendors like Google or H2O.ai, but not so many end-users of H2O, may have access to, due to network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors that were built on a very large corpus as infrastructure in specific applications. Using general purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to computer vision deep learning generic lowest layers that learn to detect edges in photographs. Higher layers detect specific kinds of objects composed from the edge layers below them.
For example Google provides some pretrained word vectors with their word2vec package. The more examples the better is often true with unsupervised learning. Further, sometimes it's practically difficult for an individual data scientist to download a giant corpus of text on which to train your own word vectors. And there is no good reason for every user to recreate the same wheel by training word vectors themselves on the same general purpose corpuses (corpi?) like wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.
Three questions:
1 - Does H2O currently provide any general purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other public-owned (government) websites, or wikipedia or twitter or craigslist, or other free or Open Commons sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpuses, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?
thank you for your questions.
You are absolutely right, there are many situations when you don't need a custom model and pre-trained model will work well. I assume people will mostly build their own models on smaller problems in their specific domain and use pre-trained models to complement the custom model.
You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.
To do that import the model into a Frame (just like with any other dataset):
w2v.frame <- h2o.importFile("pretrained.glove.txt")
And then convert it to a regular H2O word2vec model:
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)
Please note that you need to provide the size of the embeddings.
H2O doens't plan to provide a model exchange/model market for w2v model as far as I know. You can use models that are available on-line: https://github.com/3Top/word2vec-api
We currently do not support importing Google's binary format of word embeddings, however the support is on our road map as it makes a lot of sense for our users.
There is only 2 kinds of in-built prediction/classification models in AWS Machine Learning. Logistic regression and linear regression. Is it possible somehow in current version of AWS ML to:
1) Re-build this what is under the hood of logistic and linear regression models
2) Build your own models written in Python/R, implement them on AWS ML and run things such as neural nets, random forests, clustering alghoritms?
In AWS ML Developer Guide latest version I could not find answers on those questions explicite, that it is impossible to do so. Any tips?
A bit of background first...
Amazon Machine Learning can build models for three kinds of machine learning problems (binary/multiclass classification & regression). As you previously mentioned, the model selected and trained by the platform is abstracted from the user.
This "black box" implementation is perhaps the largest deficiency of Amazon's machine learning platform. You have no information on what model or how the model is trained (beyond, for ex. linear regression, stochastic gradient descent). Amazon is quite clear that this is intentional, as they want the platform to be built into an application, and not just used to train models for one. See the 47:25 and 53:30 mark of this Q&A.
So, to answer your questions:
You cannot see how the exactly models have been trained, for example what constants in a linear regression (although you may be able to deduce by testing the model). When you query the model, the response includes a field which indicates the algorithm used for that particular model (for ex. SGD). A full list of learning algorithms can be found here.
Unfortunately not. You cannot create your own models and import them into AWS Machine Learning, meaning that no decision trees or neural network models can run on the platform.
I have generated two source code for two different model from training set for a classifier in Weka.
Is there any way through which I can combine these two different models generated from the same Weka Classifier?
Any suggestions or solutions would be of great help.
Thank you.
There are multiple ways to combine two or more classifiers into a single classifier. Weka offers schemes such as vote and stacking (weka.classifiers.meta). Boosting and bagging schemes (such as Adaboost) are technically also ways of combining multiple classifiers into one, but they typically work around a single type of classifier and help you train it more intensively.
When using vote you can include multiple classifiers, each of which assign a class to an instance. Vote itself will assign the class that was assigned most often. Stacking trains a meta classifier over the included classifiers. In a way it helps you train a way of interpreting multiple classifications.
You can accomplish that in the Weka Explorer (classification tab), by selecting the pre build models under preBuildClassifiers, or setting them in your code with setPreBuiltClassifiers.