Does or will H2O provide any pretrained vectors for use with h2o word2vec? - word2vec

H2O recently added word2vec in its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However even greater possibilities exist from using big data and big computers, of the type that software vendors like Google or H2O.ai, but not so many end-users of H2O, may have access to, due to network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors that were built on a very large corpus as infrastructure in specific applications. Using general purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to computer vision deep learning generic lowest layers that learn to detect edges in photographs. Higher layers detect specific kinds of objects composed from the edge layers below them.
For example Google provides some pretrained word vectors with their word2vec package. The more examples the better is often true with unsupervised learning. Further, sometimes it's practically difficult for an individual data scientist to download a giant corpus of text on which to train your own word vectors. And there is no good reason for every user to recreate the same wheel by training word vectors themselves on the same general purpose corpuses (corpi?) like wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.
Three questions:
1 - Does H2O currently provide any general purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other public-owned (government) websites, or wikipedia or twitter or craigslist, or other free or Open Commons sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpuses, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?

thank you for your questions.
You are absolutely right, there are many situations when you don't need a custom model and pre-trained model will work well. I assume people will mostly build their own models on smaller problems in their specific domain and use pre-trained models to complement the custom model.
You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.
To do that import the model into a Frame (just like with any other dataset):
w2v.frame <- h2o.importFile("pretrained.glove.txt")
And then convert it to a regular H2O word2vec model:
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)
Please note that you need to provide the size of the embeddings.
H2O doens't plan to provide a model exchange/model market for w2v model as far as I know. You can use models that are available on-line: https://github.com/3Top/word2vec-api
We currently do not support importing Google's binary format of word embeddings, however the support is on our road map as it makes a lot of sense for our users.

Related

add new vocabulary to existing Doc2vec model

I Already have a Doc2Vec model. I have trained it with my train data.
Now after a while I want to use Doc2Vec for my test data. I want to add my test data vocabulary to my existing model's vocabulary. How can I do this?
I mean how can I update my vocabulary?
Here is my model:
model = model.load('my_model.Doc2vec')
Words that weren't present for training mean nothing to Doc2Vec, so quite commonly, they're just ignored when encountered in later texts.
It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.
But, while such continued incremental training is theoretically possible, it also requires a lot of murky choices of how much training should be done, at what alpha learning rates, and to what extent older examples should also be retrained to maintain model consistency. There's little published work suggesting working rules-of-thumb, and doing it blindly could just as likely worsen the model's performance as improve it.
(Also, while the parent class for Doc2Vec, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary-expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)
Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is using Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labeling) in the Doc2Vec training set, so that it learns their words and the (unsupervised) relations to other words. The separate supervised classifier would then only be trained on non-test items, and their known labels.

Specific topics on Tensorflow for CNN

I have a mini project for my new course in Tensorflow for this semester with random topics. Since I have some background on Convolution Neuron Network, I intend to use it for my project. My computer can only run CPU version of TensorFlow.
However, as a new bee, I realize that there are a lot of topics such that MNIST, CIFAR-10, etc, thus I don't know which suitable topic I should pick out from them. I only have two weeks left. It would be great if the topic is not too complicated but too not easy for study because it matchs my intermediate level.
In your experience, could you give me some advice about the specific topic I should do for my project?
Moreover, it would be better if in this topic I can provide my own data to test my training, because my professor said that it is a plus point to get A grade in my project.
Thanks in advance,
I think that to answer this question you need to properly evaluate the marking criteria for your project. However, I can give you a brief overview of what you've just mentioned.
MNIST: MNIST is a Optical Character Recognition task for individual numbers 0-9 in images size 28px square. This is considered the "Hello World" of CNNs. It's pretty basic and might be too simplistic for your requirements. Hard to gauge without more information. Nonetheless, this will run pretty quickly with CPU Tensorflow and the online tutorial is pretty good.
CIFAR-10: CIFAR is a much bigger dataset of objects and vehicles. The image sizes are 32px square so individual image processing isn't too bad. But the dataset is very large and your CPU might struggle with it. It takes a long time to train. You could try training on a reduced dataset but I don't know how that would go. Again, depends on your course requirements.
Flowers-Poets: There is the Tensorflow for Poets re-training example which might not be suitable for your course, you could use the flowers dataset to build your own model.
Build-your-own-model: You could use tf.Layers to build your own network and experiment with it. tf.Layers is pretty easy to use. Alternatively you could look at the new Estimators API that will automate a lot of the training processes for you. There are a number of tutorials (of varying quality) on the Tensorflow website.
I hope that helps give you a run-down of what's out there. Other datasets to look at are PASCAL VOC and imageNet (however they are huge!). Models to look at experimenting with may include VGG-16 and AlexNet.

machine learning for any cancer diagnosis on image dataset with python

Blockquote
i am working on this project asssigned by university as final project. But the issue is i am not getting any help from the internet so i thought may be asking here can solve issue. i had read many articles but they had no code or guidance and i am confused what to do. Basically it is an image processing work with machine learning. Data set can be found easily but issue is python python learning algorithm and code
Blockquote
I presume if it's your final project you have to create the program yourself rather than ripping it straight from the internet. If you want a good starting point which you can customise Tensor Flow from Google is very good. You'll want to understand how it works (i.e. how machine learning works) but as a first step there's a good example of image processing on the website in the form of number recognition (which is also the "Hello World" of machine learning).
https://www.tensorflow.org/get_started/mnist/beginners
This also provides a good intro to machine learning with neural nets: https://www.youtube.com/watch?v=uXt8qF2Zzfo
One note on Tensor Flow, you'll probably have to use Python 3.5+ as in my experience it can be difficult getting it on 2.7.
First of all I need to know what type of data are you using because depending on your data, if it is a MRI or PET scan or CT, there could be different suggestion for using machine learning in python for detection.
However, I suppose your main dataset consist of MR images, I am attaching an article which I found it a great overview of different methods>
This project compares four different machine learning algorithms: Decision Tree, Majority, Nearest Neighbors, and Best Z-Score (an algorithm of my own design that is a slight variant of the Na¨ıve Bayes algorithm)
https://users.soe.ucsc.edu/~karplus/abe/Science_Fair_2012_report.pdf
Here, breast cancer and colorectal cancer have been considered and the algorithms that performed best (Best Z-Score and Nearest Neighbors) used all features in classifying a sample. Decision Tree used only 13 features for classifying a sample and gave mediocre results. Majority did not look at any features and did worst. All algorithms except Decision Tree were fast to train and test. Decision Tree was slow, because it had to look at each feature in turn, calculating the information gain of every possible choice of cutpoint.
My Solution:-
Lung Image Database Consortium provides open access dataset for Lung Cancer Images.
Download it then apply any machine learning algorithm to classify images having tumor cells or not.
I attached a link for reference paper. They applied neural network to classify the images.
For coding part, use python "OpenCV" for image pre-processing and segmentation.
When it comes for classification part, use any machine learning libraries (tensorflow, keras, torch, scikit-learn... much more) as you are compatible to work with and perform classification using any better outperforming algorithms as you wish.
That's it..
Link for Reference Journal

Amazon Machine Learning models rebuilding possibilities

There is only 2 kinds of in-built prediction/classification models in AWS Machine Learning. Logistic regression and linear regression. Is it possible somehow in current version of AWS ML to:
1) Re-build this what is under the hood of logistic and linear regression models
2) Build your own models written in Python/R, implement them on AWS ML and run things such as neural nets, random forests, clustering alghoritms?
In AWS ML Developer Guide latest version I could not find answers on those questions explicite, that it is impossible to do so. Any tips?
A bit of background first...
Amazon Machine Learning can build models for three kinds of machine learning problems (binary/multiclass classification & regression). As you previously mentioned, the model selected and trained by the platform is abstracted from the user.
This "black box" implementation is perhaps the largest deficiency of Amazon's machine learning platform. You have no information on what model or how the model is trained (beyond, for ex. linear regression, stochastic gradient descent). Amazon is quite clear that this is intentional, as they want the platform to be built into an application, and not just used to train models for one. See the 47:25 and 53:30 mark of this Q&A.
So, to answer your questions:
You cannot see how the exactly models have been trained, for example what constants in a linear regression (although you may be able to deduce by testing the model). When you query the model, the response includes a field which indicates the algorithm used for that particular model (for ex. SGD). A full list of learning algorithms can be found here.
Unfortunately not. You cannot create your own models and import them into AWS Machine Learning, meaning that no decision trees or neural network models can run on the platform.

Starting with Data Mining

I have started learning Data Mining and wish to create a small project in C++/Java that allows me to utilize a database, say from twitter and then publish a particular set of results (for eg. all the news items on a feed). I want to know how to go about it? Where should I start?
This is a really broad question, so it's hard to answer. Here are some things to consider:
Where are you going to get the data? You mention twitter, but you'll still need to collect the data in some way. There are probably libraries out there for listening to twitter streams, or you could probably buy the data if someone is selling it.
Where are you going to store the data? Depending on how much you'll have and what you plan to do with it, a traditional relational database may or may not be the best fit. You may be better off with something that supports running mapreduce jobs out-of-the box.
Based on the answers to those questions, the choice of programming languages and libraries will be easier to make.
If you're really set on Java, then I think a Hadoop cluster is probably what you want to start out with. It supports writing mapreduce jobs in Java, and works as an effective platform for other systems such as HBase, a column-oriented datastore.
If your data are going to be fairly regular (that is, not much variation in structure from one record to the next), maybe Hive would be a better fit. With Hive, you can write SQL-like queries, given only data files as input. I've never used Mahout, but I understand that its machine learning capabilities are suited for data mining tasks.
These are just some ideas that come to mind. There are lots of options out there and choosing between them has as much to do with the particular problem you're trying to solve and your own personal tastes as anything else.
If you just want to start learning about Data Mining there are two books that I particularly really enjoy:
Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer.
And this one, which is for free:
http://infolab.stanford.edu/~ullman/mmds.html
Good references for you are
AI course taught by people who actually know the subject,Weka website, Machine Learning datasets, Even more datasets, Framework for supporting the mining of larger datasets.
The first link is a good introduction on AI taught by Peter Norvig and Sebastian Thrun, Google's Research Director, and Stanley's creator (the autonomous car), respectively.
The second link you get you to Weka website. Download the software - which is pretty intuitive - and get the book. Make sure you understand all the concepts: what's data mining, what's machine learning, what are the most common tasks, and what are the rationales behind them. Play a lot with the examples - the software package bundles some datasets - until you understand what generated the results.
Next, go to real datasets and play with them. When tackling massive datasets, you may face several performance issues with Weka - which is more of a learning tool as far as my experience can tell. Thus I recommend you to take a look at the fifth link, which will get you to Apache Mahout website.
It's far from being a simple topic, however, it's quite interesting.
I can tell you how I did it.
1) I got the data using twitter4j.
2) I analyzed the data using JUNG.
You have to define a class representing edges and a class representing vertices.
These classes will contain the attributes of the edges and vertices.
3) Then, there is a simple function to add an edge g.addedge(V1,V2,edgeFromV1ToV2) or to add a vertex g.addVertex(V).
The class that defines edges or vertices is easy to create. As an example :
`public class MyEdge {
int Id;
}`
The same is done for vertices.
Today I would do it with R, but if you don't want to learn a new programming language, just import jung which is a java library.
Data mining is broad fields with many different techniques; classification, clustering, association and pattern mining, outlier detection, etc.
You should first decide what you want to do and then decide wich algorithm you need.
If you are new to data mining, I would recommend to read some books like Introduction to Data Mining by Tan, Steinbach and Kumar.
I would like to suggest you to use python or R for data mining process. Doing work with java or c , it bit difficult in the sense you need to do a lot coding