Getting low accuracy on two fields after labelling using the tool, Form Recognizer, Custom Label - computer-vision

I need help with the recognition of two particular fields: credit date and credit type. Accuracy is low on the training set (~30%) after labelling, and even lower on the test set (~10%).
I am using the Custom Label API after labelling, tagging and training.
I think this is because these two fields appear at different positions relative to the other fields, since the number of entries differs from receipt to receipt.
Is there anything I can do to improve the accuracy of these fields?

Cognitive Services Form Recognizer service has added support for new and exciting features - multiple forms models (model compose), language expansion, pre-built business cards model, selection marks and lots more are now available in the Form Recognizer v2.1 release.
Form Recognizer sample labeling tool has been updated to support the new release functionality, see this quick start for getting started with custom train with labels.
Please find the snapshot for the JSON for the image that you are trying.

Related

Tensorflow.js constant retraining

I have an application where we gather and classify images triggered by motion detection. We have a model trained on a lot of images that works OK. I have converted it to TF.js format and am able to make predictions in the browser, so far so good.
However, we have cameras at a lot of different locations, the lighting and surroundings vary at each location, and we also put up new cameras each year. So we would need to retrain the model often, and I am also afraid that the model will be too generic and not very accurate at each specific location.
All data we gather from the motion detection is uploaded to our server, and we use a web interface to classify all the images as "false positive", "positive", etc. and store everything in a MySQL database.
The best solution, I think, would be to have a generic model trained on a lot of data. This model would be deployed at each specific location. Then, while we manually interpret each image as we normally would, we would retrain the generic model so that it becomes specific to each location.
To do this we have to serve the models on our server or on some host, and be able to write to the model, since a lot of different people interpret the data in different browsers and on different computers.
Would this be possible, and a good solution? I would love some input before I invest more time into this. I haven't found a whole lot of information about serving writable models and reinforcement learning with TensorFlow.js.
So I was wondering whether it is possible to serve a TensorFlow.js model on our server that was trained on our data, and have the model "relearn" with the new image after every manual interpretation.
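To make the idea concrete, here is a minimal sketch of "one generic model, fine-tuned per location from manual labels". It is plain Python with a tiny logistic regression standing in for the real image model, so it is not TensorFlow.js code. Every name in it (GENERIC_WEIGHTS, location_models, fine_tune, record_feedback) is hypothetical, and the feature vectors are assumed to come from some upstream step.

```python
import math

# Hypothetical generic model: [bias, w1, w2] of a tiny logistic regression.
GENERIC_WEIGHTS = [0.0, 1.0, -1.0]

# One set of weights per camera location, kept on the server.
location_models = {}


def predict(weights, features):
    """Return the probability of the positive class."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(start_weights, samples, lr=0.1, epochs=20):
    """Run a few SGD passes over (features, label) pairs on a copy of the weights."""
    weights = list(start_weights)  # never modify the starting model in place
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, features) - label
            weights[0] -= lr * error
            for i, x in enumerate(features):
                weights[i + 1] -= lr * error * x
    return weights


def record_feedback(location, labelled_batch):
    """Update the location's model, starting from the generic one the first time."""
    start = location_models.get(location, GENERIC_WEIGHTS)
    location_models[location] = fine_tune(start, labelled_batch)
    return location_models[location]
```

Because all writes go through one server-side function, several people labelling in different browsers update the same per-location weights, and the generic model stays untouched as a starting point for new cameras.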

How to train ML .Net model in runtime

Is there any way to train an ML.NET model at runtime through user input?
I've created a text classification model, trained it locally, deployed it, and now my users are using it.
Needed workflow:
Text is categorized and the category is displayed to the user, who can accept it or select another of the predefined categories; this feedback should then train the model again.
Thanks!
What you are describing seems like online learning.
ML.NET doesn't have any true 'online' models (by which I mean, models that can adapt to new data example by example and instantaneously refresh): all ML.NET algorithms are 'batch' trainers, that require a (typically large) corpus of training data to produce a model.
If your situation allows, you could aggregate the users' responses as 'additional training data', and re-train the model periodically using this data (in addition to the older data, possibly down-sampled or otherwise decayed).
As @Jon pointed out, a slight modification of the above mechanism is to 'incrementally train an existing model on a new batch of data'. This is still a batch method, but it can reduce the retraining time.
Of ML.NET's multiclass trainers, only LbfgsMaximumEntropyMulticlassTrainer supports this mode (see documentation).
It might be tempting to take this approach to the limit, and 'retrain' the model on each 'batch' of one example. Unless you really, really know what you are doing, I would advise against it: more likely than not, such a training regime will be overfitting rapidly and disastrously.
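The "aggregate feedback, then retrain periodically" approach can be sketched as follows. This is a language-neutral illustration in Python, not ML.NET code: FeedbackRetrainer and train_fn are hypothetical names, and train_fn stands for whatever batch training routine you already have. Older data is down-sampled on each retrain, as suggested above.

```python
import random


class FeedbackRetrainer:
    """Collect user feedback and retrain in batches, never one example at a time."""

    def __init__(self, train_fn, base_data, batch_size=100, keep_old=0.5, seed=0):
        self.train_fn = train_fn          # callable: list of (text, label) -> model
        self.old_data = list(base_data)   # data the current model was trained on
        self.pending = []                 # feedback not yet used for training
        self.batch_size = batch_size      # retrain once this many examples arrive
        self.keep_old = keep_old          # fraction of old data kept per retrain
        self.rng = random.Random(seed)
        self.model = train_fn(self.old_data)

    def add_feedback(self, text, label):
        """Store one accepted or corrected category; return True if a retrain ran."""
        self.pending.append((text, label))
        if len(self.pending) >= self.batch_size:
            self.retrain()
            return True
        return False

    def retrain(self):
        """Train on a down-sampled slice of old data plus all pending feedback."""
        sample_size = int(len(self.old_data) * self.keep_old)
        sampled_old = self.rng.sample(self.old_data, sample_size)
        self.model = self.train_fn(sampled_old + self.pending)
        self.old_data.extend(self.pending)
        self.pending = []
```

The batch size and the fraction of old data to keep are tuning knobs; the values above are placeholders, not recommendations.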

add new vocabulary to existing Doc2vec model

I already have a Doc2Vec model, which I trained on my training data.
Now, after a while, I want to use Doc2Vec on my test data, and I want to add the vocabulary of my test data to my existing model's vocabulary. How can I do this?
In other words, how can I update my vocabulary?
Here is my model:
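from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('my_model.Doc2vec')  # load() is called on the class, not on an undefined variable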
Words that weren't present for training mean nothing to Doc2Vec, so quite commonly, they're just ignored when encountered in later texts.
It would only make sense to add new words to a model if you could also do more training, including those new words, to somehow integrate them with the existing model.
But, while such continued incremental training is theoretically possible, it also requires a lot of murky choices of how much training should be done, at what alpha learning rates, and to what extent older examples should also be retrained to maintain model consistency. There's little published work suggesting working rules-of-thumb, and doing it blindly could just as likely worsen the model's performance as improve it.
(Also, while the parent class for Doc2Vec, Word2Vec, offers an experimental update=True option on its build_vocab() step for later vocabulary-expansion, it wasn't designed or tested with Doc2Vec in mind, and there's an open issue where trying to use it causes memory-fault crashes: https://github.com/RaRe-Technologies/gensim/issues/1019.)
Note that since Doc2Vec is an unsupervised method for creating features from text, if your ultimate task is using Doc2Vec features for classification, it can sometimes be sensible to include your 'test' texts (without class labeling) in the Doc2Vec training set, so that it learns their words and the (unsupervised) relations to other words. The separate supervised classifier would then only be trained on non-test items, and their known labels.
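A minimal sketch of that last suggestion, assuming gensim 4.x (where document vectors live in model.dv; older releases used model.docvecs). The helper names, the tag scheme ("train_0", "test_0", ...) and the hyperparameter values are made up for illustration. The gensim import sits inside the function so the corpus helper can be used without gensim installed.

```python
def build_corpus(train_texts, test_texts):
    """Tokenize and tag every text; test texts are included but carry no class label."""
    corpus = []
    for i, text in enumerate(train_texts):
        corpus.append((text.lower().split(), f"train_{i}"))
    for i, text in enumerate(test_texts):
        corpus.append((text.lower().split(), f"test_{i}"))
    return corpus


def train_doc2vec(corpus, vector_size=50, epochs=20):
    """Train one Doc2Vec model on train and test texts together (unsupervised)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # requires gensim

    documents = [TaggedDocument(words=words, tags=[tag]) for words, tag in corpus]
    model = Doc2Vec(vector_size=vector_size, min_count=1, epochs=epochs)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def classifier_inputs(model, train_labels):
    """Pair each training document vector with its label; test items are left out."""
    return [(model.dv[f"train_{i}"], label) for i, label in enumerate(train_labels)]
```

The supervised classifier is then fitted only on the output of classifier_inputs, so no test labels are ever seen, while the Doc2Vec vocabulary already covers the words in the test texts.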

WSO2 ML Cross Validation and Grid Search

I would like to know whether WSO2 ML implements cross-validation and grid search for best-model selection.
Presently (as of version 1.1.0), WSO2 Machine Learner does not have a direct method for hyperparameter optimization. As mentioned in your question, we are planning to include Random Search and Grid Search in one of the upcoming releases. To track the progress of this work, I have created a public JIRA [1]. When the new feature is ready I will notify you via this SO question.
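For reference, a grid search itself is simple to run outside the product in the meantime. The following is a minimal sketch in plain Python, not WSO2 ML's implementation: grid_search, param_grid and score_fn are hypothetical names, and score_fn stands for whatever trains a model with the given hyperparameters and returns a validation score (higher is better).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)  # fixed order so runs are reproducible
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The number of combinations grows multiplicatively with each added hyperparameter, which is the usual reason to consider random search for larger grids.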
Next, let me briefly describe the cross-validation process we use in the WSO2 Machine Learning server. In the third step of the ML Wizard of the ML Server, you can set the training data fraction (please see the attached screenshot).
So let's say you pick 0.7 of your data for training. The model-building process will then use 70% of your data for training, and the rest of the dataset (i.e. 30%) will be used for cross-validation. As you might recognize, this is the most basic approach to cross-validation (a single hold-out split), and it is not particularly suitable for small datasets. So in upcoming releases we are planning to include K-fold cross-validation [2] in addition to the currently available method.
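To illustrate the difference between the two schemes, here is a minimal sketch in plain Python. It is not the code used inside the ML Server; holdout_split and k_fold_splits are hypothetical names, and the fixed seed is only there to make the shuffle repeatable.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle once and cut: e.g. 70% for training, the remaining 30% for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(rows, k=5, seed=0):
    """Yield k (training, validation) pairs; every row is validated exactly once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation
```

With the hold-out split, 30% of the rows never contribute to training and the score depends on one particular cut. With K folds, every row is used for both training and validation, which is why it suits small datasets better.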
Yandi, if you need further help regarding this question or anything related to our product please let me know.
Thanks,
Upul
[1] https://wso2.org/jira/browse/ML-313
[2] https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation

Amazon Machine Learning for sentiment analysis

How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?
You can build a good machine learning model for sentiment analysis using Amazon ML.
Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media
Since Amazon ML supports supervised learning and accepts text as an input attribute, you need to get a sample of tagged data and build the model with it.
The tagging can be done with Mechanical Turk, as in the example above, or by interns ("the summer is coming"). The benefit of doing your own tagging is that you can put your own logic into the model. For example, the difference between "The beer was cold" and "The steak was cold", where one is positive and the other is negative, is something that a generic system will find hard to learn.
You can also try to play with some sample data, from the project above or from this Kaggle competition for sentiment analysis on movie reviews: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews. I used Amazon ML on that data set and got fairly good results rather easily and quickly.
Note that you can also use the Amazon ML to run real-time predictions based on the model that you are building, and you can use it to respond immediately to negative (or positive) input. See more here: http://docs.aws.amazon.com/machine-learning/latest/dg/interpreting_predictions.html
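A minimal sketch of such a real-time call, assuming the boto3 "machinelearning" client and its predict operation. The function name, the "text" attribute name (it must match your own data schema) and the placeholder IDs in the usage comment are hypothetical. The client is passed in so the function can be exercised without AWS credentials.

```python
def predict_sentiment(client, model_id, endpoint_url, text, attribute="text"):
    """Send one record to a real-time endpoint and return (label, score)."""
    response = client.predict(
        MLModelId=model_id,
        Record={attribute: text},       # attribute name must match the model's schema
        PredictEndpoint=endpoint_url,
    )
    prediction = response["Prediction"]
    label = prediction.get("predictedLabel")
    # predictedScores maps each label to its raw score when the model returns one.
    score = prediction.get("predictedScores", {}).get(label)
    return label, score


# Usage (requires boto3, AWS credentials and a model with a real-time endpoint):
#   import boto3
#   client = boto3.client("machinelearning")
#   label, score = predict_sentiment(client, "<model-id>", "<endpoint-url>", "The steak was cold")
```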
It is great for starting out. I highly recommend you explore this as an option. However, be aware of the limitations:
- you'll want to build a pipeline because models are immutable--you have to build a new model to incorporate new training data (or new hyperparameters, for that matter)
- you are drastically limited in the tweakability of your system
- it only does supervised learning
- the target variable can't be other text, only a number, boolean or categorical value
- you can't export the model and import it into another system if you want--the model is a black box
Benefits:
- you don't have to run any infrastructure
- it integrates with AWS data sources well
- the UX is nice
- the algorithms are chosen for you, so you can quickly test and see if it is a fit for your problem space.