I am looking at AWS Comprehend for a text classification task that will involve an active learning component. I am trying to understand whether it's possible to incrementally train a custom Comprehend model using batches of newly annotated data, or whether it only supports training from scratch. In this blog post it sounds like they are stitching the annotated data back together with the original training data (i.e. retraining from scratch each time), but I don't see the mentioned CloudFormation template (part 1 has the template for training/deployment, but part 2 seems to be talking about another template).
Is it possible to do incremental training with Comprehend? Or would I need to use a custom text classification model through SageMaker and then do incremental training that way? I am attempting to do the following:
1. Get a pretrained model
2. Fine-tune it on my own classification data
3. Incrementally train it on annotated low-confidence predictions
Steps 1 and 2 can be done with AWS Comprehend, but I'm not sure about 3. Thanks
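For reference, here is my own rough sketch (not the blog post's missing template) of the stitch-and-retrain workflow as I understand it, using boto3; the bucket, keys, classifier name, and role ARN are all placeholders:

```python
# Hypothetical sketch of the "stitch and retrain" approach: append the newly
# annotated rows to the original training CSV in S3, then train a brand-new
# classifier from the combined file (i.e. retraining from scratch, not an
# incremental update). All names, paths, and the role ARN are placeholders.
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")

def append_annotations(bucket, original_key, new_key, combined_key):
    """Concatenate the original training CSV with the newly annotated batch."""
    original = s3.get_object(Bucket=bucket, Key=original_key)["Body"].read()
    new_rows = s3.get_object(Bucket=bucket, Key=new_key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=combined_key, Body=original + new_rows)

append_annotations("my-bucket", "train/original.csv", "train/new_batch.csv", "train/combined.csv")

# Train a new custom classifier on the combined dataset.
comprehend.create_document_classifier(
    DocumentClassifierName="my-classifier-v2",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder
    LanguageCode="en",
    InputDataConfig={"S3Uri": "s3://my-bucket/train/combined.csv"},
)
```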
I'd like to do image classification. In my dataset, even though image features (colors, shapes, etc.) are a strong signal for this classification, some categories of images will be hard to distinguish without interpreting the text inside the image.
I don't think VertexAI/AutoML will use pre-trained models to facilitate classification when, in some cases, the only difference is the text. I know Google Vision/OCR is capable of doing such extraction. But is there a way to do image classification (VertexAI/AutoML) using Google Cloud Vision extraction as an additional image feature?
Currently my project uses 3 models (no Google Cloud):
model 1: classifies an image using image features
model 2: classifies an image using only OCR + regex (same categories)
model 3: combines both models and decides when to use model 1 or model 2
I'd like to switch to Vertex AI because I think it will improve my project quality for the following reasons:
AutoML classification seems very good for model 1
I need to use a tool to manage my datasets (Vertex AI managed dataset)
Vertex AI has interesting pipeline training features
If it is confirmed that AutoML won't perform well when some image categories differ only in their text, I would recreate a similar 3-tier model setup using Vertex AI custom training scripts. I can easily create model 1 with VertexAI/AutoML. However, I have no idea if:
I can create model 2 with a Vertex AI custom training script that uses Google Cloud Vision/OCR to do image classification
I can create model 3 that would use models 1 and 2 created by Vertex AI.
Could you give me recommendations on how to achieve that using Google Cloud Platform?
For this purpose, I recommend the following:
1. Model 2:
Keep your images in GCS.
Use the Detect text in images feature of the Cloud Vision API to generate your text dataset, e.g. {"gcs":"gs://path_to_image/image_1","text":["text1"...]}.
Use AutoML on this text dataset processed by the Vision API, or just use a regexp on this data, or insert it into a BigQuery dataset and query it, and so on... (see the sketch below)
2. Model 3:
I would follow a similar approach, processing the images with the Cloud Vision API and generating a text dataset, but this time the images that don't have any text in them will produce records with an empty "text" field, e.g. {"gcs":"gs://path_to_image/image_2","text":[]}. Your own script can then separate the records with text from those without, generating a dataset for model 2 and a dataset for model 1.
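A minimal sketch of that extraction-and-split step, assuming the google-cloud-vision client library; the bucket, image list, and output file names are placeholders:

```python
# Hypothetical sketch: run Cloud Vision OCR on images stored in GCS, build the
# {"gcs": ..., "text": [...]} records, and split them into a "model 1" dataset
# (no text found) and a "model 2" dataset (text found).
import json
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def extract_text(gcs_uri):
    """Run OCR on one image in GCS and return the detected words."""
    image = vision.Image(source=vision.ImageSource(image_uri=gcs_uri))
    response = client.text_detection(image=image)
    # text_annotations[0] is the full text block; the rest are individual words.
    return [a.description for a in response.text_annotations[1:]]

image_uris = ["gs://path_to_image/image_1", "gs://path_to_image/image_2"]  # placeholder list

with open("model1_images.jsonl", "w") as f1, open("model2_text.jsonl", "w") as f2:
    for uri in image_uris:
        words = extract_text(uri)
        record = {"gcs": uri, "text": words}
        # Images with no text go to the image-only model, the rest to the OCR model.
        (f2 if words else f1).write(json.dumps(record) + "\n")
```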
Note that your models 2 and 3 are not strictly classification models. Model 2 is an OCR problem whose output you then post-process, and model 3 basically processes your data and separates it into the proper datasets.
I hope this insight may help you.
I am facing a problem while building a custom model with AutoML. I am supplying AutoML with JSONL training data with the label DISEASE.
My service account has the permission healthcare.nlpservce.analyzeEntities, and before starting training I am choosing the option to Enable Healthcare Entity Extraction.
But still, after 4+ hours of training, the model detects only the DISEASE label. It is not detecting Problems, Procedures, etc.
I am following the steps mentioned in the documentation. Attached is a photo of the service account's permissions (no utilization analysis). Can anyone please point me in the right direction?
I have data from web users in Firestore.
I have inserted some of this data in Google BigQuery in order to run a machine learning model.
I have experience training machine learning models, but I don't have experience obtaining predictions for new data once the model is trained.
I have read that I can upload this trained model to Google Cloud Storage and then put it in AI Platform, but I don't know the process I have to follow. New data is going to be inserted into BigQuery; with this new data I want to make predictions, then take those predictions and put them back into Firestore.
I think this could be done with Dataflow (Apache Beam) or Cloud Composer (Airflow), where I could automate the process and schedule it to run every week, but I don't have experience using these technologies. Can anyone recommend which technology would be better for this particular case, so I can look up information on how to use it?
One possibility could be to save the model in AI Platform or in Google Cloud Storage and, with Cloud Functions, call this saved model to make predictions and save them in Firestore?
BigQuery ML supports external TensorFlow models.
TensorFlow model importing. This feature allows you to create BigQuery ML models from previously-trained TensorFlow models, then perform prediction in BigQuery ML. See the CREATE MODEL statement for importing TensorFlow models for more information.
So what you want to achieve is:
Get a table in BigQuery
Build out a feature set for your model (select statements)
CREATE MODEL in BigQuery (rerun this to re-train)
Run the ML.PREDICT (or equivalent) to get predictions on new data
As new data arrives into BigQuery you can:
- retrain the model (externally or internally, depending on the type of algorithm you have)
- use the new rows in predictions
https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
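Putting the CREATE MODEL and ML.PREDICT steps together, a minimal sketch using the Python BigQuery client; the project, dataset, table, and GCS path are placeholders, and it assumes a TensorFlow SavedModel as in the quoted documentation:

```python
# Hypothetical sketch: import a trained TensorFlow SavedModel from GCS into
# BigQuery ML, then score newly arrived rows with ML.PREDICT.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

# Step 3: create (or re-create, to "re-train" after exporting a new SavedModel).
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.my_imported_model`
    OPTIONS (MODEL_TYPE='TENSORFLOW',
             MODEL_PATH='gs://my-bucket/saved_model/*')
""").result()

# Step 4: run predictions on the rows that arrived since the last run.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.my_imported_model`,
                    (SELECT * FROM `my_dataset.new_data`
                     WHERE ingestion_date = CURRENT_DATE()))
""").result()

for row in rows:
    print(dict(row))
```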
For doing this you need 2 services:
One for the prediction, which serves your model
One for getting the prediction and storing the result in Firestore
Personally, I don't recommend storing your model in AI Platform today (a new release should happen by the end of the month, but today, it's a no!). I wrote an article on hosting a TensorFlow model in Cloud Run. It should work with other frameworks, but I only built a TensorFlow model, and I used it for my tests.
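This is not the article's code, just a rough sketch of the shape such a Cloud Run prediction service can take, assuming Flask and a TensorFlow SavedModel copied into the container image; the route and field names are made up:

```python
# Hypothetical sketch of a Cloud Run prediction service: load a SavedModel at
# startup and expose a /predict endpoint that scores JSON feature vectors.
import os
from flask import Flask, request, jsonify
import tensorflow as tf

app = Flask(__name__)
model = tf.saved_model.load("model/")              # SavedModel baked into the image
predict_fn = model.signatures["serving_default"]

@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]    # e.g. [[1.0, 2.0, ...], ...]
    outputs = predict_fn(tf.constant(instances, dtype=tf.float32))
    # Return the first output tensor as plain lists.
    first_key = list(outputs.keys())[0]
    return jsonify({"predictions": outputs[first_key].numpy().tolist()})

if __name__ == "__main__":
    # Cloud Run injects the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```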
The best solution, if your new data are in BigQuery and your model is in TensorFlow, is to load your model into BigQuery. The prediction is free of charge; you only pay for the data scanned by your query (I'm also writing an article on this, but I'm waiting for the new AI Platform release to provide a correct comparison between both solutions).
After getting the prediction (result of BigQuery + call to Cloud Run, OR result of BigQuery with the predict clause), you have to iterate over the results to store them in Firestore. I recommend a batch write to Firestore.
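A minimal sketch of that batch-write step, assuming the predictions are already available as an iterable of dicts (rows from ML.PREDICT or the Cloud Run response); the collection name and document fields are placeholders:

```python
# Hypothetical sketch: write prediction records to Firestore in batches.
from google.cloud import firestore

db = firestore.Client()

def write_predictions(predictions):
    """Write prediction records to Firestore, 500 per batch (the API limit)."""
    batch = db.batch()
    count = 0
    for pred in predictions:
        doc_ref = db.collection("predictions").document(pred["user_id"])  # placeholder key
        batch.set(doc_ref, {"score": pred["score"],
                            "predicted_at": firestore.SERVER_TIMESTAMP})
        count += 1
        if count % 500 == 0:        # flush every 500 writes
            batch.commit()
            batch = db.batch()
    if count % 500:
        batch.commit()              # flush the remainder
```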
I have read that I can upload this trained model in Google cloud storage
If you want to do this you can use Dataflow. You can write a pipeline that reads data from BigQuery and writes them to GCS.
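A minimal Apache Beam sketch of such a pipeline; the project, table, and bucket names are placeholders, and you would swap DataflowRunner for DirectRunner to test locally:

```python
# Hypothetical sketch: read rows from a BigQuery table and write them to GCS as
# JSON lines, executed on Dataflow.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",         # or "DirectRunner" for local testing
    project="my-project",            # placeholder project id
    temp_location="gs://my-bucket/tmp",
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table="my-project:my_dataset.new_data")
        | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
        | "WriteToGCS" >> beam.io.WriteToText("gs://my-bucket/exports/data",
                                              file_name_suffix=".json")
    )
```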
(I am not sure I understand how you want your job to interact with AI platform and Firestore)
I'm training a small model with AutoML entity extraction, but the training keeps failing with the error message "INTERNAL" and no other details.
I'm doing this from the Google Cloud console, and I've followed the same steps I've used successfully to train other models.
The dataset has two labels with a few hundred text items each, so I doubt it's a timeout or anything like that.
What might be causing this and is there a way to debug/get more visibility?
It could be that the dataset contains duplicate columns, which is not currently supported. If this is not your case, I'd suggest reaching out to GCP Support so they can check it internally.
I am trying to create a model using the Amazon Machine Learning service, but I am facing a problem: my data set is only 20 KB and contains only 120 rows. As the data set is very small, the learning algorithm will not be able to learn the data pattern correctly, and eventually my prediction model will be inaccurate.
Is there any way that Amazon itself can create some data on my behalf, using the data that I have provided for training?
You're correct that 120 rows of data would not be enough to train a good model, and then there is data needed for the evaluation part too, so the data that you end up with for training will be about 30% less.
Amazon cannot create additional data on your behalf, but if you can collect more data points without the labels (the "correct" answers that are used to train the model), you may find Amazon Mechanical Turk useful for collecting these labels.
The next question is how to use Mechanical Turk; that I am unsure of and am still trying to learn, so I can't help you with that.