Process online prediction request - google-cloud-ml

When using ML Engine for online prediction, we send a request and get the prediction results. That's great, but the request usually differs from the model's input. For example:
A categorical variable may appear in the request, but the model expects an integer mapped to that category.
For a given feature, we may need to create multiple features, such as splitting text into two or more features.
We might also need to exclude some features in the request, such as a constant feature that's useless for the model.
How do you handle this process? My solution is to receive the request with an App Engine app, send it to Pub/Sub, process it in Dataflow, save it to GCS, and trigger a Cloud Function that sends the processed request to the ML Engine endpoint and gets the predicted result. This may be over-engineering and I want to avoid that. If you have any advice specific to XGBoost models, I'd appreciate it.

We are testing a feature that allows a user to provide some Python code to be run server-side. This will let you do the types of transformations you are trying to do, either as a scikit-learn pipeline or as a Python function. If you'd like to test it out, please contact cloudml-feedback@google.com.
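For illustration, here is a minimal sketch of expressing that kind of request transformation as a scikit-learn pipeline in front of an XGBoost model; the column name and the dropped fields are made-up assumptions for the example:

    # Hedged sketch: map a categorical request field to integers, drop unused
    # fields, and feed the result to XGBoost, all inside one pipeline object.
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder
    from xgboost import XGBClassifier

    preprocess = ColumnTransformer(
        transformers=[
            # the model expects an integer code, not the raw category string
            ("category", OrdinalEncoder(), ["product_category"]),
        ],
        remainder="drop",  # drop constant / useless request fields
    )

    model = Pipeline([
        ("preprocess", preprocess),
        ("xgb", XGBClassifier()),
    ])

    # model.fit(train_df, labels) at training time; at serving time,
    # model.predict(request_df) applies the same transformations to each request.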

Related

PythonOperator or SimpleHttpOperator to make HTTP get request and save results to GCP storage

I am in the early stages of learning Airflow, which I'm using to build a simple ETL (ELT?) data pipeline, and I am in the process of figuring out the architecture for the pipeline (which operators I should use). The basics of my data pipeline are:
Make HTTP GET request from API for raw data.
Save raw JSON results into a GCP bucket.
Transform the data and save into a BigQuery database.
...and the pipeline will be scheduled to run once daily.
As the title suggests, I am trying to determine whether the SimpleHttpOperator or the PythonOperator is more appropriate for making the HTTP GET requests for data. In a somewhat related Stack Overflow post, the author simply concluded:
Though I think I'm going to simply use the PythonOperator from now on
It seems simple enough to write a 10-20 line Python script that makes the HTTP request, identifies the GCP storage bucket, and writes to that bucket. However, I'm not sure if this is the best approach for this type of task (call API --> get data --> write to GCS bucket).
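A rough sketch of what that script might look like (the API URL, bucket name, and object path are placeholders):

    # Hypothetical sketch: fetch raw JSON from an API and write it to a GCS bucket.
    import json

    import requests
    from google.cloud import storage


    def fetch_and_store():
        data = requests.get("https://api.example.com/raw-data").json()
        bucket = storage.Client().bucket("my-raw-data-bucket")
        blob = bucket.blob("raw/latest.json")
        blob.upload_from_string(json.dumps(data), content_type="application/json")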
Any help or thoughts on this, example links for building similar pipelines, etc. would be greatly appreciated. Thanks in advance.
I recommend seeing Airflow as glue between processing steps. The processing performed inside Airflow should be limited to conditionally triggering steps, looping over steps, and handling errors.
Why? Because if tomorrow you choose to change your workflow tool, you won't have to rewrite your processing code, only the workflow logic. A simple separation of concerns.
Therefore, I recommend deploying your 10-20 lines of Python code as a Cloud Function and using a SimpleHttpOperator to call it, as sketched below.
In addition, it's far easier to test a function than a workflow (both to run it and to read the code). Deployments and updates will also be easier.
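A minimal sketch of that setup, assuming the fetch-and-store code is already deployed as a Cloud Function; the DAG id, connection id, endpoint, and the exact import path (which depends on your Airflow version) are placeholders:

    # Hypothetical DAG: once a day, just trigger the Cloud Function over HTTP.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.http.operators.http import SimpleHttpOperator

    with DAG(
        dag_id="daily_api_to_gcs",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        trigger_ingest = SimpleHttpOperator(
            task_id="trigger_ingest_function",
            http_conn_id="cloud_function_ingest",  # connection whose host is the function's base URL
            endpoint="ingest",
            method="GET",
        )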

AWS Sagemaker - using cross validation instead of dedicated validation set?

When I train my model locally I use a 20% test set and then cross-validation. SageMaker seems to need a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test and 10% validation, leaving 70% to train, so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result.
I could just take my locally trained models and overwrite the SageMaker models stored in S3, but that seems like a bit of a workaround. Is there a way to use SageMaker without having a dedicated validation set?
Thanks
SageMaker seems to allow only a single training set, whereas in cross-validation you iterate over, for example, 5 different training sets, each validated on a different hold-out set. So the SageMaker training service is not well suited for cross-validation out of the box. Of course, cross-validation is usually most useful with small (more precisely, low-variance) data, so in those cases you can set the training infrastructure to local mode (so each run doesn't take long) and iterate manually to achieve cross-validation, as in the sketch below.
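A rough sketch of that manual iteration, assuming CSV data; the file names, the S3 upload, and the metric handling are left as placeholders:

    # Hypothetical sketch: split locally, then run one (local-mode) training job per fold.
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.read_csv("train.csv")
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
        df.iloc[train_idx].to_csv(f"fold_{fold}_train.csv", index=False)
        df.iloc[val_idx].to_csv(f"fold_{fold}_validation.csv", index=False)
        # Upload both files to S3, launch a SageMaker training job for this fold
        # (instance_type="local" keeps each run cheap for small datasets),
        # then collect the fold's validation metric and average across folds.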
Sorry, can you please elaborate on which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)"?
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata
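For example, with the SageMaker Python SDK you pass one S3 prefix per channel when calling fit; the image, role, and paths below are placeholders, and exact parameter names depend on the SDK version:

    # Hypothetical sketch: each dict key becomes a channel mounted under
    # /opt/ml/input/data/<channel_name> inside the training container.
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<your-training-image>",
        role="<your-sagemaker-execution-role>",
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    estimator.fit({
        "train": "s3://my-bucket/train/",
        "validation": "s3://my-bucket/validation/",
    })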

Django API beyond simple data handling

I have a Django application whose model logic and data handling are managed through the admin interface.
In the same project I also have a Python file (scriptcl.py) that uses the model data to perform heavy calculations that take some time to process, for example 5 seconds.
I have migrated the project to the cloud, and now I need an API that calls this file (scriptcl.py) with parameters, performs the computation according to the parameters and the data in the DB (maintained in the admin), and then responds.
All Django REST Framework (DRF) examples I've seen so far only cover authentication and data handling (create, read, update, delete).
Could anyone suggest an idea to approach this?
In my opinion, the correct approach would be to use Celery to perform these calculations asynchronously.
Write a class that inherits from the DRF APIView, handle authentication there, run whatever logic you want or call whichever function you need, get the final result, and send back the JSON response. But as you mentioned, if the API takes too long to respond, you might have to think of something else, such as returning a request_id and polling the server with that request_id every 5 seconds to get the data.
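A minimal sketch of that APIView approach; scriptcl.heavy_calculation is assumed to be the asker's own function, and its signature is made up:

    # Hypothetical sketch: authenticated endpoint that runs the calculation
    # synchronously and returns the result as JSON.
    from rest_framework.permissions import IsAuthenticated
    from rest_framework.response import Response
    from rest_framework.views import APIView

    from . import scriptcl  # the existing module with the heavy calculation


    class CalculationView(APIView):
        permission_classes = [IsAuthenticated]

        def post(self, request):
            params = request.data  # parameters sent by the client
            result = scriptcl.heavy_calculation(**params)  # the ~5 s computation
            return Response({"result": result})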
Just to give some feedback on this: the approach I took was to build another API using Flask and plain Python scripts.
I also used SQLAlchemy to access the database and retrieve the necessary data.
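A rough sketch of that Flask approach; the connection string, table, and calculation function are placeholders:

    # Hypothetical sketch: a small Flask API that reads the needed rows with
    # SQLAlchemy and hands them to the existing calculation script.
    from flask import Flask, jsonify, request
    from sqlalchemy import create_engine, text

    import scriptcl  # the existing calculation code

    app = Flask(__name__)
    engine = create_engine("postgresql://user:password@host/dbname")  # placeholder DSN


    @app.route("/calculate", methods=["POST"])
    def calculate():
        params = request.get_json()
        with engine.connect() as conn:
            rows = conn.execute(text("SELECT * FROM model_data")).fetchall()
        result = scriptcl.heavy_calculation(rows, **params)
        return jsonify({"result": result})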

Google Tag Manager clickstream to Amazon

The question has more to do with which services I should be using to get efficient performance.
Context and goal:
What I am trying to do is use a Tag Manager custom HTML tag so that after each Universal Analytics tag (event or pageview) fires, an HTTP request is sent to my own EC2 server with a payload similar to what is sent to Google Analytics.
What I have thought, planned, and researched so far:
At this moment I have two big options:
Use AWS Kinesis, which seems like a great idea, but the problem is that it only drops the information into one Redshift table, and I would like to have at least 4 or 5 so I can differentiate pageviews from events, etc. My solution to this would be to split each request, on the server side, into a separate stream.
The other option is to use Spark + Kafka. (Here is a detailed explanation.)
I know that at some point this means I'm building a parallel Google Analytics, with everything that implies. I still need to decide what information I should send (I'm referring to which parameters, for example source and medium), how to format it correctly, and how to process it correctly.
Questions and debate points:
Which option is more efficient and easier to set up?
Should I send this information directly from the page/app server, or from the user side, making the browser issue the requests as I explained before?
Has anyone done something like this in the past? Any personal recommendations?
You'd definitely benefit from the Google Analytics custom task feature instead of custom HTML. More on this from Simo Ahava. Also, Google BigQuery is quite a popular destination for streaming hit data, since it allows many on-the-fly computations such as sessionization, and there are many ready-to-use cases for BQ.
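If you do end up streaming hits into BigQuery, the insert side could look roughly like this sketch; the project, dataset, table, and hit fields are assumptions:

    # Hypothetical sketch: stream one hit row into a BigQuery table.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.analytics.hits"  # placeholder

    hit = {
        "hit_timestamp": "2019-01-01T00:00:00Z",
        "hit_type": "pageview",
        "page_path": "/home",
        "source": "google",
        "medium": "organic",
    }

    errors = client.insert_rows_json(table_id, [hit])
    if errors:
        print("Insert failed:", errors)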

Machine Learning (tensorflow / sklearn) in Django?

I have a Django form that collects user responses. I also have a TensorFlow sentence-classification model. What is the best/standard way to put the two together?
Details:
The TensorFlow model was trained on the Movie Review data from Rotten Tomatoes.
Every time a new row is created in my response model, I want the TensorFlow code to classify it (+ or -).
Basically, I have a Django project directory and two .py files for classification. Before going ahead myself, I wanted to know the standard way to integrate machine learning algorithms into a web app.
It'd be awesome if you could suggest a tutorial or a repo.
Thank you!
Asynchronous processing
If you don't need the classification result from the ML code to be returned immediately to the user (e.g. as a response to the same POST request that submitted the form), then you can always queue the classification job to run in the background, or even on a different server with more CPU/memory resources (e.g. with django-background-tasks or Celery).
A queued task would, for example, populate the field UserResponse.class_name (positive, negative) on the database rows where that field is blank (not yet classified).
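A rough sketch of such a queued task using django-background-tasks; UserResponse, class_name, and classify_sentence are assumed names based on the question:

    # Hypothetical sketch: periodically classify the rows that are still blank.
    from background_task import background

    from myapp.ml import classify_sentence  # wraps the TensorFlow model
    from myapp.models import UserResponse


    @background(schedule=5)
    def classify_pending_responses():
        for response in UserResponse.objects.filter(class_name=""):
            response.class_name = classify_sentence(response.text)  # "positive" / "negative"
            response.save(update_fields=["class_name"])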
Real time notification
If the ML code is slow and you want to return the result to the user as soon as it is available, you can use the asynchronous approach described above and pair it with a real-time notification (e.g. socket.io to the browser; this can be triggered from the queued task).
This becomes necessary if the ML execution time is so long that it might time out the HTTP request in the synchronous approach described below.
Synchronous processing, if ML code is not CPU intensive (fast enough)
If you need the classification result returned immediately and the ML classification is fast enough*, you can do it within the HTTP request-response cycle (the POST request returns after the ML code is done, synchronously).
*Fast enough here means it wouldn't time out the HTTP request/response, and the user wouldn't lose patience.
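A minimal sketch of the synchronous variant; classify_sentence and the field names are assumptions, not from the original post:

    # Hypothetical sketch: classify inside the POST handler and return the label.
    from django.http import JsonResponse
    from django.views.decorators.http import require_POST

    from myapp.ml import classify_sentence  # loads the TensorFlow model once at import time


    @require_POST
    def submit_response(request):
        text = request.POST.get("response_text", "")
        label = classify_sentence(text)  # must stay fast enough for the request cycle
        return JsonResponse({"classification": label})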
Well, I had to develop the same solution myself. In my case, I used Theano. If you are using TensorFlow or Theano, you can save the model you have built. So first train the model on your training dataset, then save it using the library you have chosen. You only need to deploy into your Django web application the part of your code that handles prediction. Then, with a simple POST, you can return the predicted class of the sentence to the user quickly enough. Also, if you think it is needed, you can run a job periodically to retrain your model with the new input patterns and save it once more.
I would suggest not using Django, since it will add execution time to the solution.
Instead, you could use Node to serve a React frontend that interacts with a TensorFlow REST API running as a standalone server.
As the answer above suggests, it would be better to use WebSockets; you could use a React WebSocket module so your components refresh once the component state changes.
Hope this helps.