How to explicitly set sagemaker autopilot's validation set? - amazon-web-services

The example notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/autopilot/autopilot_customer_churn.ipynb states that in the Analyzing Data step:
The dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
Presumably, Autopilot uses this validation set to select the best-performing model candidates to return to the user. However, I have not found a way to manually set the validation set used by SageMaker Autopilot.
For example, Google AutoML allows users to add TRAIN, VALIDATE, and TEST keywords to a data_split column to manually set which data points are in which set.
Is something like this currently possible with SageMaker Autopilot?

I'm afraid you can't do this at the moment. The validation set is indeed built by Autopilot itself.
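For reference, here is a minimal sketch (using boto3; bucket, role, and column names are hypothetical) of how an Autopilot job is launched. Note that the job only accepts the full training dataset, and the train/validation split happens internally, with no parameter in this call to control it:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Hypothetical names -- replace with your own bucket, role, and target column.
sm.create_auto_ml_job(
    AutoMLJobName="churn-autopilot-demo",
    InputDataConfig=[
        {
            # Only the full dataset is supplied; Autopilot performs the
            # train/validation split internally.
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/churn/train.csv",
                }
            },
            "TargetAttributeName": "Churn?",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/churn/output"},
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
)
```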

Related

Possible to do incremental training with AWS Comprehend?

I am looking at AWS Comprehend for a text classification task which will involve an active learning component. I am trying to understand if it's possible to incrementally train a custom Comprehend model using batches of newly annotated data, or if it only supports training from scratch. In this blog post it sounds like they are stitching the annotated data back together with the original training data (i.e. retraining from scratch each time), but I don't see the mentioned CloudFormation template (part 1 has the template for training/deployment, but part 2 seems to be talking about another template).
Is it possible to do incremental training with Comprehend? Or would I need to use a custom text classification model through SageMaker and then do incremental training that way? I am attempting to do the following:
Get a pretrained model
Fine tune on own classification data
Incrementally train on annotated low-confidence predictions
Steps 1 and 2 can be done with AWS Comprehend, but I'm not sure about step 3. Thanks
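For step 3, the workaround described in the blog post amounts to retraining from scratch on the combined data. A minimal sketch of that approach with boto3 (bucket, role, and file layout are hypothetical) might look like:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Hypothetical S3 location: the combined CSV contains the original
# training rows plus the newly annotated low-confidence predictions.
response = comprehend.create_document_classifier(
    DocumentClassifierName="my-classifier-v2",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/MyComprehendRole",
    InputDataConfig={"S3Uri": "s3://my-bucket/training/combined.csv"},
    LanguageCode="en",
)
print(response["DocumentClassifierArn"])
```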

Data validation in Google Cloud - BigQuery

I want to validate the data that is exported from Y42 to BigQuery in Google Cloud (e.g. given a predefined schema, I want to check whether all columns appear in the data, the ranges of the values, etc.).
I created a Python script that validates the data that comes in a CSV file. However, I do not know how to run the script before exporting the data to Google Cloud. I can create a VM instance in Google Cloud and run a Python script there, but I don't know how to use the data that is stored in Google Cloud in my script. Can anyone give me any hints regarding this issue?
I investigated whether there are any other ways to validate data directly in Google Cloud, but I did not find anything. Is someone aware of any data validation methods in Google Cloud?
What I usually do is import the data into BigQuery (into a temporary table so as not to break my clean prod table) and run a query on it. That query performs all the checks that I want.
If the query returns rows, those rows are in error; the others are OK. I then merge the valid data into the clean prod table, and the bad data into a log table for further analysis.
The whole sequence is orchestrated with Cloud Workflows.
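A minimal sketch of that pattern with the google-cloud-bigquery Python client (project, dataset, table, and column names are made up for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the checks against the temporary staging table.
#    Rows returned by this query are the ones in error.
check_sql = """
SELECT *
FROM `my_project.staging.my_table_tmp`
WHERE id IS NULL
   OR value NOT BETWEEN 0 AND 100
"""
bad_rows = list(client.query(check_sql).result())

if bad_rows:
    # 2a. Copy the bad rows into a log table for further analysis.
    client.query("""
        INSERT INTO `my_project.logs.bad_rows`
        SELECT * FROM `my_project.staging.my_table_tmp`
        WHERE id IS NULL OR value NOT BETWEEN 0 AND 100
    """).result()

# 2b. Merge only the valid rows into the clean prod table.
client.query("""
    INSERT INTO `my_project.prod.my_table`
    SELECT * FROM `my_project.staging.my_table_tmp`
    WHERE id IS NOT NULL AND value BETWEEN 0 AND 100
""").result()
```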

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project that involves crunching data and doing a lot of processing, so I chose to work with BigQuery as it has good support for running analytical queries. However, the final computed result is stored in a table that has to power my webpage (used transactionally/OLTP). My understanding is that BigQuery is not suitable for transactional queries. I looked into other alternatives and realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems. First I have to create a pipeline to move the data to a GCP bucket and then move it to Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job From SQL" to process and move data to Cloud SQL.
Consider a simple example from Robinhood:
Compute a user's returns by looking at their portfolio and show a graph of the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
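A minimal sketch of the Cloud Storage export step with the BigQuery Python client (project, dataset, table, and bucket names are hypothetical); the exported CSV can then be loaded into Cloud SQL, e.g. with gcloud sql import csv:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Export the computed results table to Cloud Storage as CSV.
# Cloud Storage is the only supported export destination.
extract_job = client.extract_table(
    "my-project.analytics.user_returns",
    "gs://my-bucket/exports/user_returns-*.csv",
)
extract_job.result()  # wait for the export to finish

# The CSV can then be imported into Cloud SQL, for example:
#   gcloud sql import csv my-instance \
#       gs://my-bucket/exports/user_returns-000000000000.csv \
#       --database=mydb --table=user_returns
```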

Getting error "INTERNAL" when training a model with AutoML

I'm training a small model with AutoML entity extraction, but the training keeps failing with the error message "INTERNAL" and no other details.
I'm doing this from the Google Cloud console, and I've followed the same steps I've used successfully to train other models.
The dataset has two labels with a few hundred text items each, so I doubt it's a timeout or anything like that.
What might be causing this and is there a way to debug/get more visibility?
It could be that the dataset contains duplicate columns, which is not currently supported. If that is not your case, I'd suggest reaching out to GCP Support so they can check it internally.
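If you want to rule that out before contacting support, a quick check of the header row for duplicate column names (the CSV path is hypothetical) could be:

```python
import csv

# Hypothetical path to the CSV that was uploaded to AutoML.
with open("training_data.csv", newline="") as f:
    header = next(csv.reader(f))

# Collect any column names that appear more than once.
duplicates = {name for name in header if header.count(name) > 1}
if duplicates:
    print("Duplicate columns found:", duplicates)
else:
    print("No duplicate column names detected.")
```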

AWS Machine Learning Retrain Model

I have some models created in AWS Machine Learning from an S3 CSV file.
After a lot of searching, I haven't found a good way to retrain my models.
I would like to know if there is any option to retrain my models with new data, or if I need to create a new one each time.
Amazon ML provides a set of APIs (and SDKs) that allow you to programmatically create a pipeline that takes new data from S3 and generates the datasource and the ML models from it.
All the components, including datasources, ML models, evaluations, etc., are immutable; if you want to retrain, you need to recreate them. This also allows you to roll back to a previous model if the performance of the new model is not better than that of the old model.
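A minimal sketch of that recreate flow with boto3 (the IDs, bucket, and schema location are hypothetical):

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# 1. Create a new datasource from the updated CSV in S3.
ml.create_data_source_from_s3(
    DataSourceId="ds-retrain-001",
    DataSourceName="retrain-datasource",
    DataSpec={
        "DataLocationS3": "s3://my-bucket/new-data.csv",
        "DataSchemaLocationS3": "s3://my-bucket/new-data.csv.schema",
    },
    ComputeStatistics=True,  # required for datasources used in training
)

# 2. Create a new ML model from that datasource; the old model remains
#    available, so you can roll back if the new one underperforms.
ml.create_ml_model(
    MLModelId="ml-retrain-001",
    MLModelName="retrained-model",
    MLModelType="BINARY",
    TrainingDataSourceId="ds-retrain-001",
)
```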