How to make an AutoML model with Healthcare Entity Extraction as the base model? - google-cloud-platform

I am facing a problem while building a custom model with AutoML. I am supplying AutoML with JSONL training data with the label DISEASE.
My service account has the healthcare.nlpservice.analyzeEntities permission, and before starting training I choose the option to Enable Healthcare Entity Extraction.
But still, after 4+ hours of training, the model detects only the DISEASE label.
It is not detecting Problems, Procedures, etc.
I am following the steps mentioned in the documentation.
Attached is a photo of the service account's permissions (no utilization analysis).
Can anyone please point me in the right direction?
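For reference, each line of such a JSONL file pairs a text snippet with character-offset annotations. Below is a minimal Python sketch that writes one training line in the AutoML Natural Language entity-extraction format; the sample sentence, offsets, and output file name are purely illustrative and not taken from the original dataset:

    import json

    # Illustrative sentence only; the real training data would come from your corpus.
    content = "Patient was diagnosed with diabetes mellitus."
    entity = "diabetes mellitus"
    start = content.index(entity)        # character offset where the span starts
    end = start + len(entity)            # end offset (exclusive in the AutoML format)

    # One JSONL line labelling the span with the custom DISEASE label from the question.
    line = {
        "text_snippet": {"content": content},
        "annotations": [
            {
                "text_extraction": {"text_segment": {"start_offset": start, "end_offset": end}},
                "display_name": "DISEASE",
            }
        ],
    }

    with open("train.jsonl", "a") as f:  # hypothetical output file
        f.write(json.dumps(line) + "\n")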

Related

Possible to do incremental training with AWS comprehend?

I am looking at AWS Comprehend for a text classification task that will involve an active learning component. I am trying to understand whether it's possible to incrementally train a custom Comprehend model using batches of newly annotated data, or if it only supports training from scratch. In this blog post it sounds like they are stitching the annotated data back together with the original training data (i.e. retraining from scratch each time), but I don't see the mentioned CloudFormation template (part 1 has the template for training/deployment, but part 2 seems to be talking about another template).
Is it possible to do incremental training with Comprehend? Or would I need to use a custom text classification model through SageMaker and then do incremental training that way? I am attempting to do the following:
1. Get a pretrained model
2. Fine-tune it on my own classification data
3. Incrementally train on annotated low-confidence predictions
Steps 1 and 2 can be done with AWS Comprehend, but I'm not sure about step 3. Thanks

Google Vertex AI image AutoML classification when an important image feature is text inside the image

I'd like to do image classification. In my dataset, although image features (colors, shapes, etc.) are a strong signal for this classification, some categories of images will be hard to distinguish without interpreting the text inside the image.
I don't think Vertex AI/AutoML will use pre-trained models to handle classification when, in some cases, the only difference is the text. I know Google Vision/OCR is capable of doing such extraction. But is there a way to do image classification (Vertex AI/AutoML) using Google Cloud Vision extraction as an additional image feature?
Currently my project uses 3 models (no google cloud):
model 1: classify an image using images features
model 2: classify an image, only using OCR + regex (same categories)
model 3: combine both models and decide when to use model 1 or model 2
I'd like to switch to Vertex AI because it would improve my project quality in the following ways:
AutoML classification seems very good for model 1
I need to use a tool to manage my datasets (Vertex AI managed dataset)
Vertex AI has interesting pipeline training features
If it is confirmed that AutoML won't perform well when some image categories differ only in their text, I would recreate a similar 3-tier model using Vertex AI custom training scripts. I can easily create model 1 with Vertex AI/AutoML. However, I have no idea whether:
I can create model 2 with a Vertex AI custom training script that uses Google Cloud Vision/OCR to do image classification
I can create model 3 so that it uses models 1 and 2 created by Vertex AI.
Could you give me recommendations on how to achieve that using Google Cloud Platform?
For this purpose, I recommend the following:
1. model 2:
Keep your images in a GCS bucket.
Use the Detect text in images feature of the Cloud Vision API to generate your text dataset, e.g. {"gcs":"gs://path_to_image/image_1","text":["text1"...]}.
Use AutoML on this text dataset processed by the Vision API, or just apply a regexp to the data, or insert it into a BigQuery dataset and query it, and so on...
2. model 3:
I would follow a similar approach, processing the images using the Cloud Vision API and generating a text dataset, but this time the images that don't have any text in them will produce entries with an empty "text" field: {"gcs":"gs://path_to_image/image_2","text":[]}. Your own script can then split the records, generating a dataset for model 2 (entries with text) and a dataset for model 1 (entries without text), as sketched below.
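A minimal Python sketch of this dataset-generation step, assuming the images live in GCS and using the Cloud Vision API text detection; the URIs and output file names are placeholders:

    import json
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # Placeholder GCS URIs of the images to process.
    image_uris = [
        "gs://path_to_image/image_1",
        "gs://path_to_image/image_2",
    ]

    with_text, without_text = [], []
    for uri in image_uris:
        image = vision.Image(source=vision.ImageSource(image_uri=uri))
        response = client.text_detection(image=image)
        texts = [t.description for t in response.text_annotations]
        record = {"gcs": uri, "text": texts}
        if texts:
            with_text.append(record)      # goes to the text dataset (model 2)
        else:
            without_text.append(record)   # goes to the image-only dataset (model 1)

    with open("model2_text_dataset.jsonl", "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in with_text)
    with open("model1_image_dataset.jsonl", "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in without_text)

The records in the model 2 dataset can then be fed to AutoML text classification, a regexp, or BigQuery as described above.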
I see that your models 2 and 3 are not strictly classification models. Model 2 is an OCR problem, and then you process the output data. Model 3 basically processes your data and separates it into the proper datasets.
I hope this insight may help you.

Using google cloud for image classification, cropping and OCR

Please allow me to ask a rather newbie question. So far, I have been using local tools like imagemagick or GOCR to perform the job, but that is rather old-fashioned, and I am urged to "move to google cloud AI".
The setup
I have a (training) data set of various documents (as JPG and PDF) of different kinds, and by certain features (like prevailing color or repetitive layout) I intend to classify them, e.g. as invoice type 1, invoice type 2, or not an invoice. In a second step, I would like to OCR certain predefined areas of each document and extract, e.g., the address of the company sending the invoice and the date.
The architecture I am envisioning
On a modern platform as a service (PaaS), I have already set up a UI where I can upload new files. These are then stored locally in a directory with filenames (or in a MongoDB). Meta info like upload timestamp, user, and original file name is stored in a DB.
The newly uploaded file should then be submitted to Google Cloud, which should do the classification step and deliver back the label to be saved in the database.
The document pages should be auto-cropped, i.e. black or white margins removed, most probably with Google Cloud as well. The parameters of the crop should be persisted in the DB.
In case it is e.g. an invoice, OCR should be performed (again by Google Cloud) for certain regions of the document, e.g. a bounding box spanning from the middle of the page to the right margin in the upper 10% of the cropped page. The results of the OCR should again be persisted locally.
The problem
I seem to be missing the correct search term to figure out how to do this with Google Cloud. Is there a Google API (e.g. REST) I can upload to and which gives me back the results of steps 2 to 4?
I think that your best option here is to use Document AI (REST API and Libraries).
Using Document AI, you can:
Convert images to text
Classify documents
Analyze and extract entities
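As a rough illustration (a sketch, not an exact recipe), processing a single document with the Document AI Python client could look like the following; the project, location, and processor IDs are placeholders, and you would first create a processor of a suitable type in the console:

    from google.cloud import documentai_v1 as documentai

    # Placeholder identifiers; create the processor in the Document AI console first.
    project_id = "my-project"
    location = "us"              # non-US locations may need a regional api_endpoint via client_options
    processor_id = "my-processor-id"

    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project_id, location, processor_id)

    with open("invoice.pdf", "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw_document)
    )
    document = result.document

    print(document.text[:200])            # OCR'd text of the document
    for entity in document.entities:      # extracted entities, depending on the processor type
        print(entity.type_, entity.mention_text)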
Additionally, for your use case, we have a new Document AI feature that is still in preview and has limited access: the Invoice parser.
Invoice parser is similar to Form parser but for invoices instead of forms. Check out the Invoice parser page and you will see what I mean by preview and limited access.
AFAIK, there isn't any GCP tool for image editing.

How to automate predictions with a trained model in google cloud

I have data from web users in Firestore.
I have inserted some of this data into Google BigQuery in order to run a machine learning model.
I have experience in training machine learning models, but I don't have experience in obtaining predictions for new data once the model is trained.
I have read that I can upload this trained model to Google Cloud Storage and then put it in AI Platform, but I don't know the process I have to follow, because new data is going to be inserted into BigQuery; with this new data I want to make predictions and then take those predictions and put them in Firestore again.
I think this could be done with Dataflow (Apache Beam) or Cloud Composer (Airflow), where I can automate the process and schedule it to run every week, but I don't have experience with these technologies. Can anyone recommend which technology would be better for this particular case, so I can look up information on how to use it?
One possibility could be to save the model in AI Platform or in Google Cloud Storage and, with Cloud Functions, call this saved model and make predictions to save them in Firestore?
BigQuery ML supports external TensorFlow models.
TensorFlow model importing. This feature allows you to create BigQuery ML models from previously-trained TensorFlow models, then perform prediction in BigQuery ML. See the CREATE MODEL statement for importing TensorFlow models for more information.
So what you want to achieve is:
1. Get a table in BigQuery
2. Build out a feature set for your model (SELECT statements)
3. CREATE MODEL in BigQuery (rerun this to re-train)
4. Run ML.PREDICT (or equivalent) to get predictions on new data
As new data arrives in BigQuery you can:
- retrain the model (externally or internally, depending on the type of algorithm you have)
- use the new rows in predictions
https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
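A rough Python sketch of that flow using the BigQuery client library; the project, dataset, table, and GCS paths are placeholders, and it assumes the trained model was exported as a TensorFlow SavedModel to GCS:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # Import a previously trained TensorFlow SavedModel from GCS into BigQuery ML.
    client.query("""
        CREATE OR REPLACE MODEL `my_dataset.my_model`
        OPTIONS (model_type='tensorflow',
                 model_path='gs://my-bucket/saved_model/*')
    """).result()

    # Score newly arrived rows with ML.PREDICT.
    rows = client.query("""
        SELECT *
        FROM ML.PREDICT(MODEL `my_dataset.my_model`,
                        (SELECT * FROM `my_dataset.new_data`))
    """).result()

    for row in rows:
        print(dict(row))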
For doing this you need two services:
One for the prediction, which serves your model
One for getting the prediction and storing the result in Firestore
Personally, I don't recommend storing your model in AI Platform today (a new release should happen by the end of the month, but today, no!). I wrote an article on hosting a TensorFlow model in Cloud Run. It should work with other frameworks, but I have only built a TensorFlow model, and I used it for my tests.
The best solution, if your new data is in BigQuery and your model is in TensorFlow, is to load your model into BigQuery. The prediction is free of charge; you only pay for the data in your query (I'm also writing an article on this, but I'm waiting for the new AI Platform release to provide a correct comparison between both solutions).
After getting the prediction (result of BigQuery + call to Cloud Run, OR result of BigQuery with a predict clause), you have to iterate over the results to store them in Firestore. I recommend a batch write to Firestore.
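A minimal sketch of such a batch write with the Firestore Python client; the collection name and fields are placeholders, and `predictions` stands for whatever rows your prediction step returned:

    from google.cloud import firestore

    db = firestore.Client()

    # Placeholder predictions, e.g. rows coming back from ML.PREDICT or Cloud Run.
    predictions = [
        {"user_id": "u1", "score": 0.87},
        {"user_id": "u2", "score": 0.12},
    ]

    batch = db.batch()
    pending = 0
    for pred in predictions:
        doc_ref = db.collection("predictions").document(pred["user_id"])
        batch.set(doc_ref, pred)
        pending += 1
        # A Firestore batch accepts at most 500 operations; flush and start a new one.
        if pending == 500:
            batch.commit()
            batch = db.batch()
            pending = 0
    if pending:
        batch.commit()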
I have read that I can upload this trained model in Google cloud storage
If you want to do this, you can use Dataflow. You can write a pipeline that reads data from BigQuery and writes it to GCS.
(I am not sure I understand how you want your job to interact with AI Platform and Firestore.)
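If you go the Dataflow route, a minimal Apache Beam (Python) sketch of that BigQuery-to-GCS export could look like this; the project, dataset, and bucket names are placeholders, and non-JSON-serializable column types (e.g. timestamps) are stringified here for simplicity:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder pipeline options; use runner="DirectRunner" to test locally.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `my-project.my_dataset.new_data`",
                use_standard_sql=True)
            | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
            | "WriteToGCS" >> beam.io.WriteToText(
                "gs://my-bucket/exports/new_data", file_name_suffix=".json")
        )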

Getting error "INTERNAL" when training a model with AutoML

I'm training a small model with AutoML entity extraction, but the training keeps failing with the error message "INTERNAL" and no other details.
I'm doing this from the Google Cloud console, and I've followed the same steps I've used successfully to train other models.
The dataset has two labels with a few hundred text items each, so I doubt it's a timeout or anything like that.
What might be causing this and is there a way to debug/get more visibility?
It could be that the dataset contains duplicate columns, which is not currently supported. If this is not your case, I'd suggest reaching out to GCP Support so they can check it internally.