How can I post a Watson Machine Learning scoring request with a sparse matrix as a parameter

Because of the current limitation on publishing scikit-learn models to the Watson ML service, which does not allow custom transformers in the pipeline (https://datascience.ibm.com/docs/content/analyze-data/ml-scikit-learn.html), I ended up deploying a pipeline that contains only the SVC classifier, without the TfidfVectorizer.
This means I need to "transform" my raw test data with the TfidfVectorizer before invoking the model on Watson ML.
This works fine as long as I don't use the online deployment approach (which I need, since I want an app to POST requests to my model).
How should I serialise the sparse matrix returned by TfidfVectorizer.transform and pass it as a JSON payload to the WML service?
Thanks !

So actually, I am answering my own question ;-)
If you get into the situation where you have to send a sparse matrix to WML, you can use:
<yourmatrix>.todense().tolist()
So, to put it back in the context of my initial issue, I can send the result of the transform like this:
import requests

# densify the sparse matrix: "values" must be a JSON-serializable list of records, which tolist() already produces
valuesList = tfidf_vectorizer.transform(test).todense().tolist()
payload_scoring = {"values": valuesList}
response_scoring = requests.post(scoringUrl, json=payload_scoring, headers=header)
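For background on why the conversion is needed: scipy sparse matrices are not JSON-serializable, while the nested Python lists produced by todense().tolist() are. A minimal sketch (the toy matrix is purely for illustration):
import json
from scipy.sparse import csr_matrix

sparse = csr_matrix([[0.0, 0.5], [0.3, 0.0]])   # stand-in for the TfidfVectorizer output
dense_rows = sparse.todense().tolist()          # numpy matrix -> nested Python lists
print(json.dumps({"values": dense_rows}))       # now serializes cleanly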

Related

AWS Sagemaker CustomerError: Encoding Mismatch when monitoring input

I've deployed a Pipeline model in AWS and am now trying to use ModelMonitor to assess incoming data behavior, but it fails when generating the monitoring report.
The pipeline consists of a preprocessing step followed by a regular XGBoost container. The model is invoked with Content-Type: application/json.
I set everything up as stated in the docs, but it fails with the following error:
Exception in thread "main" com.amazonaws.sagemaker.dataanalyzer.exception.CustomerError: Error: Encoding mismatch: Encoding is JSON for endpointInput, but Encoding is CSV for endpointOutput. We currently only support the same type of input and output encoding at the moment.
I found this issue on GitHub, but it didn't help me.
Digging deeper into what XGBoost outputs, I found that it's CSV-encoded, so the error makes sense; but even deploying the model while enforcing the serializers fails (code in the section below).
I'm configuring the schedule as recommended by AWS; I've only changed the location of my constraints (had to adjust them manually).
---> Tried so far (all attempts fail with the exact same error)
As mentioned in the issue, but since I'm expecting a JSON payload, I've used:
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    json_content_types=['application/json'],
    destination_s3_uri=MY_BUCKET)
Tried enforcing the (de)serializer of the predictor (I'm not sure that even makes sense):
predictor = Predictor(
    endpoint_name=MY_ENDPOINT,
    # hoping that I could force the output to be JSON (note: an instance, not the class)
    deserializer=sagemaker.deserializers.JSONDeserializer())
and later
predictor = Predictor(
    endpoint_name=MY_ENDPOINT,
    # hoping that I could force the input to be CSV
    serializer=sagemaker.serializers.CSVSerializer())
Setting the (de)serializer during deploy:
p_model = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name=MY_ENDPOINT,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    wait=True)
I came across a similar issue earlier while invoking the endpoint using the boto3 SageMaker runtime. Try adding the 'Accept' parameter to the invoke_endpoint call with the value 'application/json'.
For more details, see https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_RequestSyntax
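For example, a minimal sketch of such a call (the endpoint name and payload here are placeholders, not taken from the question):
import json
import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',                          # placeholder
    ContentType='application/json',                      # encoding of the request body
    Accept='application/json',                           # ask the container to answer in JSON
    Body=json.dumps({'instances': [[0.1, 0.2, 0.3]]}))   # hypothetical payload
result = json.loads(response['Body'].read())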

What does the fit method do when loading a pretrained model (e.g. from an ONNX file)?

Could I get rid of the pipeline.Fit(trainingData) call if I load a fully trained model (e.g. from an ONNX file)?
What does the Fit method do anyway? Some sources say it performs a training step; others say it fits the pipeline (whatever that means). I also read that the Fit method just performs the steps defined in the pipeline before it.
But do I need those steps from the pipeline if I load a fully trained model?
When I load a model from a .zip file, I don't need the Fit method.
To clarify my question, I added some code...
(The code doesn't run without errors... I suspect some problems with the naming of some input and output columns... but that's not part of the question. ;) )
I want to call CreatePredictionEngine without the Fit method.
(As said before, this is possible with saved .zip models.)
Thanks in advance for any clarification. ;)
var pipeline = mlContext.Transforms.LoadImages(outputColumnName: "image", imageFolder: "", inputColumnName: nameof(ImageData.ImagePath))
    .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image", imageWidth: ImageNetSettings.imageWidth, imageHeight: ImageNetSettings.imageHeight, inputColumnName: "image"))
    .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "inception_v3_input", inputColumnName: "image"))
    .Append(mlContext.Transforms.ApplyOnnxModel(modelFile: modelLocation, outputColumnNames: new[] { TinyYoloModelSettings.ModelOutput }, inputColumnNames: new[] { TinyYoloModelSettings.ModelInput }))
    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "LabelKey", inputColumnName: "Label"))
    .Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "LabelKey", featureColumnName: TinyYoloModelSettings.ModelOutput))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabelValue", "PredictedLabel"))
    .AppendCacheCheckpoint(mlContext);
IDataView trainingData = mlContext.Data.LoadFromTextFile<ImageData>(path: _trainTagsTsv, hasHeader: false);
ITransformer model = pipeline.Fit(trainingData);
var imageData = new ImageData()
{
    ImagePath = _url
};
var predictor = mlContext.Model.CreatePredictionEngine<ImageData, ImagePrediction>(model);
var prediction = predictor.Predict(imageData);
I would highly recommend reading this document on the high-level concepts of ML.NET. As a fellow developer, it may speak to you better than the derived docs and recipes :)
That doc is unfortunately a little bit outdated: I wrote it before we finalized the API on prediction engines, so the code in 'prediction function' will not compile. The rest of the document appears to still hold.
In the ML.NET API design, we followed Spark's naming conventions. Unfortunately for us, sklearn uses the same names with completely different semantics. So, ML.NET does what Spark does, not what sklearn does.
In short, the 'pipeline' is an Estimator. Estimators have only one operation: Fit, which takes data and produces a Transformer.
Transformers, on the other hand, take data and produce data. The ZIP file that you save the model in contains the transformer.
PredictionEngine is constructed out of a Transformer.
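To illustrate the Spark convention being followed here, a minimal PySpark sketch (Spark code, not ML.NET):
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["label"])

indexer = StringIndexer(inputCol="label", outputCol="labelIndex")  # an Estimator
model = indexer.fit(df)          # Fit: data in, Transformer out
indexed = model.transform(df)    # Transformer: data in, data out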
Typically, an Estimator is a 'pipeline' or 'chain' of trainable and non-trainable operators that includes an ML algorithm. However, this is not a requirement: you can build a pipeline out of only non-trainable operators (such as loading an ONNX model from a file). It will still be an Estimator, and therefore you have to call Fit to get the Transformer, even though in this case Fit will be a no-op.
The MLContext's Append methods, by design, only create Estimators. Call it the price of strong typing, but Fit is a requirement.
In this explanation I deliberately didn't use the term 'model': unfortunately, it has become so loaded that it's hard to tell whether 'model' refers to 'the ML algorithm', or 'a mutable object that can train itself', or 'the result of such training'.

How to provide autocomplete street address functionality in Django?

Is there any way to provide automatic street address suggestions in Django, so that when a user starts typing their address into an input box they get autocomplete suggestions?
I searched autocomplete-light but could not find anything specifically related to that.
I've implemented this functionality with Google's Place Autocomplete. The sample code in the link is pretty spot on from memory.
https://developers.google.com/maps/documentation/javascript/examples/places-autocomplete-addressform
You could use Python to make the requests yourself and implement some light JavaScript to fill your inputs.
This will return a list of predictions based on the input address you supply. You just need an address and your API key to run the query; note that in my example I'm using the components/types parameters as well.
Here's the documentation: https://developers.google.com/maps/documentation/places/web-service/autocomplete
import requests

url = f'https://maps.googleapis.com/maps/api/place/autocomplete/json?input=<address>&components=country:us&types=address&key=<your key>'
r = requests.get(url)
predictions = r.json()['predictions']
for p in predictions:
    print(p['description'])
    print('-----------------------')
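If it helps, here is a minimal sketch of wrapping that lookup in a Django view (the view name and the 'q' query parameter are assumptions for illustration; the key placeholder is still yours to fill in):
import requests
from django.http import JsonResponse

def address_autocomplete(request):
    query = request.GET.get('q', '')
    url = ('https://maps.googleapis.com/maps/api/place/autocomplete/json'
           f'?input={query}&components=country:us&types=address&key=<your key>')
    r = requests.get(url)
    predictions = r.json().get('predictions', [])
    return JsonResponse({'suggestions': [p['description'] for p in predictions]})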

How to save and restore a tf.estimator.Estimator model with export_savedmodel?

I started using TensorFlow recently and I'm trying to get used to tf.estimator.Estimator objects. I would like to do something that seems quite natural a priori: after having trained my classifier, i.e. an instance of tf.estimator.Estimator (with the train method), I would like to save it to a file (whatever the extension) and then reload it later to predict the labels of some new data. Since the official documentation recommends using the Estimator APIs, I assume something that important is implemented and documented.
I saw on another page that the method to do this is export_savedmodel (see the official documentation), but I simply don't understand the documentation. There is no explanation of how to use this method. What is the serving_input_fn argument? I never encountered it in the Creating Custom Estimators tutorial or in any of the tutorials I've read. From some googling, I discovered that around a year ago estimators were defined using another class (tf.contrib.learn.Estimator), and it looks like tf.estimator.Estimator reuses some of the previous APIs. But I can't find clear explanations about it in the documentation.
Could someone please give me a toy example? Or explain to me how to define/find this serving_input_fn?
And then how would one load the trained classifier again?
Thank you for your help!
Edit: I discovered that one doesn't necessarily need to use export_savedmodel to save the model; it is actually done automatically. If we later define a new estimator with the same model_dir argument, it will automatically restore the previous estimator, as explained here.
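For example, a minimal sketch of that restore-by-model_dir behaviour (model_fn and the input functions are assumed to be defined elsewhere):
import tensorflow as tf

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/my_model')
estimator.train(input_fn=train_input_fn, steps=1000)   # checkpoints land in model_dir

# Later, even in a fresh process: the same model_fn + model_dir restores the weights.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/my_model')
predictions = estimator.predict(input_fn=predict_input_fn)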
As you figured out, the estimator automatically saves and restores the model for you during training. export_savedmodel might be useful if you want to deploy your model to the field (for example, providing the best model for TensorFlow Serving).
Here is a simple example:
def serving_input_fn():
    inputs = {'features': tf.placeholder(tf.float32, [None, 128, 128, 3])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

est.export_savedmodel(export_dir_base=FLAGS.export_dir, serving_input_receiver_fn=serving_input_fn)
Basically, serving_input_fn is responsible for replacing dataset pipelines with a placeholder. At deployment time, you can feed data to this placeholder as the input to your model for inference or prediction.
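To load the exported model back for prediction, one option in TF 1.x is the contrib predictor; a minimal sketch, assuming export_dir is the timestamped directory created by export_savedmodel and batch_of_images is your numpy batch:
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(export_dir)
# the dict key matches the one defined in serving_input_fn above
predictions = predict_fn({'features': batch_of_images})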

Tensorflow error using tf.image.random: 'numpy.ndarray' object has no attribute 'get_shape'

Intro
I am using a modified version of the TensorFlow tutorial "Deep MNIST for Experts" with the Python API for a medical image classification project using convolutional networks.
I want to artificially increase the size of my training set by applying random modifications to its images.
Problem
When I run the line:
flipped_images = tf.image.random_flip_left_right(images)
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'
My "images" input is an ndarray of shape [batch, im_size, im_size, channels], i.e. "batch" ndarrays of shape [im_size, im_size, channels].
Just to check whether my input data was packed with the right shape and type, I tried applying this simple function in the (unmodified) "TensorFlow Mechanics 101" tutorial, and I got the same error.
Finally, I still get the same error when trying the following functions:
tf.image.random_flip_up_down()
tf.image.random_brightness()
tf.image.random_contrast()
Questions
As input data is usually carried in TensorFlow as ndarrays, I would like to know:
Is it a bug in the TensorFlow Python API, or is it my "fault" because of the type/shape of my input data?
How could I get it to work and be able to apply tf.image.random_flip_left_right to my training set?
This seems like an inconsistency in the TensorFlow API, since almost all other op functions accept NumPy arrays wherever a tf.Tensor is expected. I've filed an issue to track the fix.
Fortunately, there is a simple workaround, using tf.convert_to_tensor(). Replace your code with the following:
flipped_images = tf.image.random_flip_left_right(tf.convert_to_tensor(images))
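A minimal runnable sketch of that workaround (TF 1.x session style, with a dummy image; note that older versions of this op documented a single 3-D image input, so this flips one image rather than a batch):
import numpy as np
import tensorflow as tf

image = np.random.rand(64, 64, 3).astype(np.float32)   # dummy 3-D image
flipped = tf.image.random_flip_left_right(tf.convert_to_tensor(image))

with tf.Session() as sess:
    flipped_value = sess.run(flipped)   # a numpy array, flipped half the time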