Add new vocabulary to an existing Doc2Vec model - word2vec

I already have a Doc2Vec model that I trained on my training data.
Now, some time later, I want to use Doc2Vec on my test data, and I want to add the test data's vocabulary to my existing model's vocabulary. How can I do this?
In other words, how can I update the model's vocabulary?
Here is my model:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('my_model.Doc2vec')

Words that weren't present for training mean nothing to Doc2Vec, so quite commonly, they're just ignored when encountered in later texts.
It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.
But, while such continued incremental training is theoretically possible, it also requires a lot of murky choices of how much training should be done, at what alpha learning rates, and to what extent older examples should also be retrained to maintain model consistency. There's little published work suggesting working rules-of-thumb, and doing it blindly could just as likely worsen the model's performance as improve it.
(Also, while the parent class for Doc2Vec, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary-expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)
Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is using Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labeling) in the Doc2Vec training set, so that it learns their words and the (unsupervised) relations to other words. The separate supervised classifier would then only be trained on non-test items, and their known labels.

Related

Getting low accuracy on two fields after labelling using the tool, Form Recognizer, Custom Label

I need help with the recognition of two particular fields: credit date and credit type. I am getting low accuracy after labelling (~30% on the training set) and even lower accuracy on the test set (~10%).
I am using the Custom Label API after labelling, tagging and training.
I think this happens because these two fields appear at different positions relative to the other fields, since different receipts contain different numbers of entries.
Is there anything I can do to improve the accuracy for these fields?
Cognitive Services Form Recognizer service has added support for new and exciting features - multiple forms models (model compose), language expansion, pre-built business cards model, selection marks and lots more are now available in the Form Recognizer v2.1 release.
Form Recognizer sample labeling tool has been updated to support the new release functionality, see this quick start for getting started with custom train with labels.
Please find the snapshot for the JSON for the image that you are trying.

Tensorflow.js constant retraining

I have an application where we gather and classify images triggered by motion detection. We have a model trained on a lot of images that works OK. I have converted it to TF.js format and am able to make predictions in the browser, so far so good.
However, we have cameras at many different locations, the lighting and surroundings vary at each location, and we also put up new cameras each year. So we would need to retrain the model often, and I am also afraid that the model will be too generic and not very accurate at each specific location.
All data we gather from the motion detection is uploaded to our server, and we use a web interface to classify all the images as "false positive", "positive", etc. and store everything in a MySQL database.
The best solution, I think, would be to have a generic model trained on a lot of data. This model would be deployed at each specific location. While we manually interpret each image as we normally would, we would retrain the generic model so that it becomes specific to each location.
To do this we have to serve the models on our server or on some host and be able to write to the model, since many different people interpret the data in different browsers and on different computers.
Would this be possible, and is it a good solution? I would love some input before I invest more time into this. I haven't found much information about serving writable models and reinforcement learning with TensorFlow.js.
So I was wondering whether it is possible to serve a TensorFlow.js model on our server that was trained on our data, where for every manual interpretation the model would "relearn" with the new image.

How to train ML .Net model in runtime

Is there any way to train an ML.NET model at runtime through user input?
I've created a text classification model, trained it locally and deployed it, and now my users are using it.
Needed workflow:
The text is categorized and the category is displayed to the user, who can accept it or select another of the predefined categories. This feedback should then be used to train the model again.
Thanks!
What you are describing seems like online learning.
ML.NET doesn't have any true 'online' models (by which I mean, models that can adapt to new data example by example and instantaneously refresh): all ML.NET algorithms are 'batch' trainers, that require a (typically large) corpus of training data to produce a model.
If your situation allows, you could aggregate the users' responses as 'additional training data', and re-train the model periodically using this data (in addition to the older data, possibly down-sampled or otherwise decayed).
As @Jon pointed out, a slight modification of the above mechanism is to 'incrementally train an existing model on a new batch of data'. This is still a batch method, but it can reduce the retraining time.
Of ML.NET's multiclass trainers, only LbfgsMaximumEntropyMulticlassTrainer supports this mode (see documentation).
It might be tempting to take this approach to the limit, and 'retrain' the model on each 'batch' of one example. Unless you really, really know what you are doing, I would advise against it: more likely than not, such a training regime will be overfitting rapidly and disastrously.
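The periodic retraining described above (collect user feedback, then retrain on it together with down-sampled older data) can be sketched as follows. This is a language-agnostic illustration written in Python, not ML.NET code; the names FeedbackBuffer and build_retraining_set, the batch size, and the sample data are all hypothetical, and the actual trainer call is left out.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str]  # (text, category)


class FeedbackBuffer:
    """Collects user corrections until there are enough for a batch."""

    def __init__(self, min_batch_size: int) -> None:
        self.min_batch_size = min_batch_size
        self.items: List[Example] = []

    def add(self, text: str, category: str) -> None:
        self.items.append((text, category))

    def ready(self) -> bool:
        # Retrain only on a reasonably sized batch, never per single example.
        return len(self.items) >= self.min_batch_size

    def drain(self) -> List[Example]:
        items, self.items = self.items, []
        return items


def build_retraining_set(old: List[Example], feedback: List[Example],
                         keep_fraction: float = 0.5, seed: int = 0) -> List[Example]:
    """Combine all new feedback with a random sample of the older data."""
    rng = random.Random(seed)
    n_keep = int(len(old) * keep_fraction)
    return rng.sample(old, n_keep) + list(feedback)


old_data = [("invoice overdue", "billing"), ("reset my password", "account"),
            ("card declined", "billing"), ("change email", "account")]

buffer = FeedbackBuffer(min_batch_size=2)
buffer.add("refund please", "billing")
ready_after_one = buffer.ready()          # False: not enough feedback yet
buffer.add("delete my profile", "account")

training_set: List[Example] = []
if buffer.ready():
    training_set = build_retraining_set(old_data, buffer.drain())
    # A real system would now hand training_set to the batch trainer.
print(len(training_set))
```

In practice the minimum batch size would be far larger than two; the small value here only keeps the example short.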

Does or will H2O provide any pretrained vectors for use with h2o word2vec?

H2O recently added word2vec in its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However even greater possibilities exist from using big data and big computers, of the type that software vendors like Google or H2O.ai, but not so many end-users of H2O, may have access to, due to network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors that were built on a very large corpus as infrastructure in specific applications. Using general purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to computer vision deep learning generic lowest layers that learn to detect edges in photographs. Higher layers detect specific kinds of objects composed from the edge layers below them.
For example Google provides some pretrained word vectors with their word2vec package. The more examples the better is often true with unsupervised learning. Further, sometimes it's practically difficult for an individual data scientist to download a giant corpus of text on which to train your own word vectors. And there is no good reason for every user to recreate the same wheel by training word vectors themselves on the same general purpose corpuses (corpi?) like wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.
Three questions:
1 - Does H2O currently provide any general purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other public-owned (government) websites, or wikipedia or twitter or craigslist, or other free or Open Commons sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpuses, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?
Thank you for your questions.
You are absolutely right, there are many situations where you don't need a custom model and a pre-trained model will work well. I assume people will mostly build their own models on smaller problems in their specific domain and use pre-trained models to complement the custom model.
You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.
To do that, import the model into a Frame (just like with any other dataset):
w2v.frame <- h2o.importFile("pretrained.glove.txt")
And then convert it to a regular H2O word2vec model:
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)
Please note that you need to provide the size of the embeddings.
H2O doesn't plan to provide a model exchange/model market for w2v models as far as I know. You can use models that are available online: https://github.com/3Top/word2vec-api
We currently do not support importing Google's binary format of word embeddings, however the support is on our road map as it makes a lot of sense for our users.

Correct implementation of the Filter (Criteria) Design Pattern

The design pattern is explained here:
http://www.tutorialspoint.com/design_pattern/filter_pattern.htm
I'm working on software very similar to Adobe Lightroom or ACDSee but with a different purpose. The user (a photographer) is able to import thousands of images from their hard drive (it wouldn't be unusual to have over 100k/200k images).
We have a side panel where users can create custom "filters" which are expressions like:
Does contain the keyword: "car"
AND
Does not contain the keyword "woods"
AND
(
Camera model is "Nikon D300s"
OR
Camera model is "Canon 7D Mark II"
)
AND
NOT
Directory is "C:\today_pictures"
You can get the idea from the above example.
We have a SQLite database where all image information is stored. The question is: should we load ALL Photo objects into memory from the database the first time the program starts and implement the Criteria/Filter design pattern as explained on the website cited above, so that our Criteria classes filter objects? Or is it better to have the Criteria classes generate an SQL query that is then executed to retrieve only what's needed from the database?
We are developing the program in C++ (Qt).
TL;DR: It's already properly implemented in SQLITE3, and look at how long that took. You'll face the same burden.
It'd be a horrible case of data duplication to read the data from the database and store it again in another data structure. Use database queries to implement the query that the user gave you. Let the database execute the query. That's what databases are for.
By reimplementing a search/query system for ~500k records, you'll be rewriting large chunks of a bog-standard database yourself. It'd be a mostly pointless exercise. SQLITE3 is very well tested and is essentially foolproof. It'll cost you thousands of hours of work to reimplement even a small fraction of its capabilities and reliability/resiliency. If that doesn't scream "reinventing the wheel", I don't know what does.
The database also allows you to very easily implement lookahead/dropdowns to aid the user in writing the query. For example, as you're typing out "camera model is", the user can have an option of autocompletion or a dropdown to select one or more models from.
You paid the "price" of a database, it'd be a shame for it all to go to waste. So, use it. It'll give you lots of leverage, and allow you to implement features two orders of magnitude faster than otherwise.
The pattern you've linked to is just a pattern. It doesn't mean that it's an exact blueprint of how to design your application to perform on real data. You'll be, eventually, fighting things such as concurrency (a file scanning thread running to update the metadata), indexing, resiliency in face of crashes, etc. In the end you'll end up with big chunks of SQLITE reimplemented for your particular application. 500k metadata records are nothing much, if you design your query translator well and support it with proper indexes, it'll work perfectly well.
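As a rough sketch of the "query translator" idea, the Criteria classes can each emit a parameterized SQL fragment instead of filtering objects in memory. The example is in Python with the standard sqlite3 module for brevity (the same structure carries over to C++/Qt); the table layout, column names, class names, and sample rows are all assumptions made for illustration, not the asker's schema.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple

class Criteria:
    def to_sql(self) -> Tuple[str, list]:
        """Return (SQL fragment, bound parameters)."""
        raise NotImplementedError

@dataclass
class HasKeyword(Criteria):
    keyword: str
    def to_sql(self):
        return ("EXISTS (SELECT 1 FROM photo_keywords k "
                "WHERE k.photo_id = photos.id AND k.keyword = ?)", [self.keyword])

@dataclass
class CameraIs(Criteria):
    model: str
    def to_sql(self):
        return ("photos.camera = ?", [self.model])

@dataclass
class DirectoryIs(Criteria):
    path: str
    def to_sql(self):
        return ("photos.directory = ?", [self.path])

@dataclass
class Not(Criteria):
    inner: Criteria
    def to_sql(self):
        sql, params = self.inner.to_sql()
        return (f"NOT ({sql})", params)

def _join(op: str, parts: List[Criteria]) -> Tuple[str, list]:
    # Parenthesize every part so operator precedence is never ambiguous.
    pairs = [p.to_sql() for p in parts]
    sql = op.join(f"({s})" for s, _ in pairs)
    return (sql, [v for _, ps in pairs for v in ps])

@dataclass
class And(Criteria):
    parts: List[Criteria]
    def to_sql(self):
        return _join(" AND ", self.parts)

@dataclass
class Or(Criteria):
    parts: List[Criteria]
    def to_sql(self):
        return _join(" OR ", self.parts)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos (id INTEGER PRIMARY KEY, directory TEXT, camera TEXT);
    CREATE TABLE photo_keywords (photo_id INTEGER, keyword TEXT);
    CREATE INDEX idx_keywords ON photo_keywords (keyword, photo_id);
""")
conn.executemany("INSERT INTO photos VALUES (?, ?, ?)", [
    (1, r"C:\trips", "Nikon D300s"),
    (2, r"C:\today_pictures", "Nikon D300s"),
    (3, r"C:\trips", "Canon 7D Mark II"),
    (4, r"C:\trips", "Sony A7"),
])
conn.executemany("INSERT INTO photo_keywords VALUES (?, ?)",
                 [(1, "car"), (2, "car"), (3, "car"), (3, "woods"), (4, "car")])

# The filter from the question, expressed as a Criteria tree.
user_filter = And([
    HasKeyword("car"),
    Not(HasKeyword("woods")),
    Or([CameraIs("Nikon D300s"), CameraIs("Canon 7D Mark II")]),
    Not(DirectoryIs(r"C:\today_pictures")),
])
where, params = user_filter.to_sql()
rows = conn.execute(f"SELECT id FROM photos WHERE {where} ORDER BY id", params)
matching_ids = [row[0] for row in rows]
print(matching_ids)
```

User-supplied values are only ever passed as bound parameters, never spliced into the SQL text, so the translator stays safe against injection while the database does the actual filtering.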