what does fit method do when loading pretrained model (e.g. from onnx file) - ml.net

Could I get rid of the pipeline.Fit(trainingData) method if I load a fully trained model (e.g. from an onnx file)?
What does the Fit method do anyway? Some sources say it performs a training step; others say it fits the pipeline (whatever that means); still others say it just executes the steps defined in the pipeline before it.
But do I need these pipeline steps if I load a fully trained model?
When I load a model from a .zip file I don't need the Fit method.
To clarify my question I added some code...
(The code doesn't run without errors; I suspect problems with the naming of some input and output columns, but that's not the point of the question. ;) )
I want to call CreatePredictionEngine without the Fit method.
(As said before, this works with saved .zip models.)
Thanks in advance for any clarification. ;)
var pipeline = mlContext.Transforms.LoadImages(outputColumnName: "image", imageFolder: "", inputColumnName: nameof(ImageData.ImagePath))
    .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image", imageWidth: ImageNetSettings.imageWidth, imageHeight: ImageNetSettings.imageHeight, inputColumnName: "image"))
    .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "inception_v3_input", inputColumnName: "image"))
    .Append(mlContext.Transforms.ApplyOnnxModel(modelFile: modelLocation, outputColumnNames: new[] { TinyYoloModelSettings.ModelOutput }, inputColumnNames: new[] { TinyYoloModelSettings.ModelInput }))
    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "LabelKey", inputColumnName: "Label"))
    .Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "LabelKey", featureColumnName: TinyYoloModelSettings.ModelOutput))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabelValue", "PredictedLabel"))
    .AppendCacheCheckpoint(mlContext);
IDataView trainingData = mlContext.Data.LoadFromTextFile<ImageData>(path: _trainTagsTsv, hasHeader: false);
ITransformer model = pipeline.Fit(trainingData);
var imageData = new ImageData()
{
    ImagePath = _url
};
var predictor = mlContext.Model.CreatePredictionEngine<ImageData, ImagePrediction>(model);
var prediction = predictor.Predict(imageData);

I would highly recommend reading this document on the high-level concepts of ML.NET. As a fellow developer, it may speak to you better than the derived docs and recipes :)
That doc is unfortunately a little bit outdated: I wrote it before we finalized the API on prediction engines, so the code in 'prediction function' will not compile. The rest of the document appears to still hold.
In the ML.NET API design, we followed Spark's naming conventions. Unfortunately for us, sklearn uses the same names with completely different semantics. So, ML.NET does what Spark does, not what sklearn does.
In short, the 'pipeline' is an Estimator. Estimators have only one operation: Fit, which takes data and produces a Transformer.
Transformers, on the other hand, take data and produce data. The ZIP file that you save the model in contains the transformer.
PredictionEngine is constructed out of a Transformer.
Typically, an Estimator is a 'pipeline' or 'chain' of trainable and non-trainable operators that includes an ML algorithm. However, this is not a requirement: you can build a pipeline out of only non-trainable operators (such as loading an ONNX model from a file). It will still be an Estimator (and therefore you have to call Fit to get the Transformer, even though in this case Fit will be a no-op).
The MLContext's Append methods, by design, only create Estimators. Call it the price of strong typing, but Fit is a requirement.
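Since ML.NET deliberately mirrors Spark's semantics here, the same two-step pattern can be illustrated in PySpark; a minimal sketch, where training_df and test_df are hypothetical DataFrames:
from pyspark.ml.feature import StringIndexer

# StringIndexer is an Estimator: its only operation is fit().
indexer = StringIndexer(inputCol="label", outputCol="labelKey")
# fit() consumes data and produces a Transformer (a StringIndexerModel).
model = indexer.fit(training_df)
# Transformers take data and produce data.
scored = model.transform(test_df)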
In this explanation I deliberately didn't use the term 'model': unfortunately, it has become so loaded that it's hard to tell whether 'model' refers to 'the ML algorithm', or 'a mutable object that can train itself', or 'the result of such training'.

Related

How to save and restore a tf.estimator.Estimator model with export_savedmodel?

I started using Tensorflow recently and am trying to get used to tf.estimator.Estimator objects. I would like to do something a priori quite natural: after having trained my classifier, i.e. an instance of tf.estimator.Estimator (with the train method), I would like to save it to a file (whatever the extension) and then reload it later to predict the labels for some new data. Since the official documentation recommends using the Estimator APIs, I guess something as important as that should be implemented and documented.
I saw on some other page that the method to do that is export_savedmodel (see the official documentation), but I simply don't understand the documentation. There is no explanation of how to use this method. What is the argument serving_input_fn? I never encountered it in the Creating Custom Estimators tutorial or in any of the tutorials that I read. By doing some googling, I discovered that around a year ago the estimators were defined using another class (tf.contrib.learn.Estimator), and it looks like tf.estimator.Estimator is reusing some of the previous APIs. But I can't find clear explanations about this in the documentation.
Could someone please give me a toy example? Or explain how to define/find this serving_input_fn?
And then how would one load the trained classifier again?
Thank you for your help!
Edit: I discovered that one doesn't necessarily need to use export_savedmodel to save the model. It is actually done automatically. Then if we later define a new estimator with the same model_dir argument, it will also automatically restore the previous estimator, as explained here.
As you figured out, the estimator automatically saves and restores the model for you during training. export_savedmodel might be useful if you want to deploy your model to the field (for example, providing the best model for TensorFlow Serving).
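For instance, a minimal sketch of that automatic restore, where my_model_fn and my_input_fn stand in for your own functions and the model_dir path is a placeholder:
import tensorflow as tf

# Pointing a new Estimator at an existing model_dir restores the
# latest checkpoint automatically before train/evaluate/predict.
est = tf.estimator.Estimator(model_fn=my_model_fn, model_dir="/tmp/my_model")
predictions = list(est.predict(input_fn=my_input_fn))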
Here is a simple example:
def serving_input_fn():
    # Replace the training input pipeline with a placeholder that the
    # deployed model will be fed through.
    inputs = {'features': tf.placeholder(tf.float32, [None, 128, 128, 3])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

est.export_savedmodel(export_dir_base=FLAGS.export_dir, serving_input_receiver_fn=serving_input_fn)
Basically serving_input_fn is responsible for replacing dataset pipelines with a placeholder. In deployment you can feed data to this placeholder as the input to your model for inference or prediction.
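To load the exported model back for prediction, one option in TF 1.x is the contrib predictor; a sketch, assuming export_dir points at the timestamped directory that export_savedmodel created and some_images matches the placeholder's shape:
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(export_dir)
# The dict key matches the 'features' placeholder in serving_input_fn.
predictions = predict_fn({'features': some_images})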

load the GoogleNews-vectors-negative300.bin and predict_output_word

I tried to load GoogleNews-vectors-negative300.bin and call the predict_output_word method.
I tested three ways, but every one failed; the code and error for each are shown below.
import gensim
from gensim.models import Word2Vec
The first:
I first used this line:
model=Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)
print(model.wv.predict_output_word(['king','man'],topn=10))
error:
DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.
The second:
Then I tried:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)
print(model.wv.predict_output_word(['king','man'],topn=10))
error:
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
The third:
model = gensim.models.Word2Vec.load('GoogleNews-vectors-negative300.bin')
print(model.wv.predict_output_word(['king','man'],topn=10))
error:
_pickle.UnpicklingError: invalid load key, '3'.
I read the document at
https://radimrehurek.com/gensim/models/word2vec.html
but still have no idea which class the predict_output_word method belongs to.
Anybody can help?
Thanks.
The GoogleNews set of vectors is just the raw vectors – without a full trained model (including internal weights). So it:
can't be loaded as a fully-functional gensim Word2Vec model
can be loaded as a lookup-only KeyedVectors, but that object alone doesn't have the data or protocols necessary for further model training or other functionality
Google hasn't released the full model that was used to create the GoogleNews vector set.
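If you want predict_output_word, one option is to train a full Word2Vec model on your own corpus; a minimal sketch with a toy corpus (a real corpus would need to be far larger), using the gensim API of that era:
from gensim.models import Word2Vec

# Toy corpus just to illustrate. predict_output_word needs a full model
# trained with negative sampling (negative > 0), not hierarchical softmax.
sentences = [["king", "man", "crown"], ["queen", "woman", "crown"]]
model = Word2Vec(sentences, size=50, min_count=1, negative=5, hs=0)
# Note: called on the full model, not on model.wv.
print(model.predict_output_word(["king", "man"], topn=3))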
Note also that the predict_output_word() function in gensim should be considered an experimental curiosity. It doesn't work in hierarchical-softmax models (because generating ranked predictions there isn't as simple). It doesn't quite match the same context-window weighting as is used during training.
Predicting words isn't really the point of the word2vec algorithm, and many implementations don't offer any interface for making individual word-predictions outside of the sparse bulk training process. Rather, word2vec uses the exercise of (sloppily) trying to make predictions to train word-vectors that turn out to be useful for other, non-word-prediction, purposes.

Django: how do i create a model dynamically

How do I create a model dynamically upon uploading a csv file? I have done the part where it can read the csv file.
This doc explains very well how to dynamically create models at runtime in Django. It also links to an example of doing so.
However, as you will see after looking at the document, it is quite complex and cumbersome to do this. I would not recommend doing this and believe it is quite likely you can determine a model ahead of time that is flexible enough to handle the CSV. This would be much better practice since dynamically changing the schema of your database as your application is running is a recipe for a ton of bugs in your code.
I understand that you want to create new schemas on the fly based on the fields in a CSV. While that's a valid use case and could be the absolutely right call, I doubt it; it lends itself to a data model for a single-tenant SaaS application that could have goofy performance and migration issues.
I'd try Mongo or some other NoSQL solution, as others have mentioned. But a simpler approach may be a modified star schema implemented in SQL. In this case you create a dimensions table that stores each header, then create an instance of each data element that has a foreign key to its dimension and records the value of that dimension.
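The pseudocode below assumes two models along these lines (names and field types are illustrative):
from django.db import models

class Dimension(models.Model):
    # One row per CSV header.
    name = models.CharField(max_length=255, unique=True)

class DimensionRecord(models.Model):
    # One row per cell value, keyed to its header.
    dimension = models.ForeignKey(Dimension, on_delete=models.CASCADE)
    value = models.TextField()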
If you read the CSV, the pseudocode would look something like this:
for row in DictReader(file):
    for k in row.keys():
        try:
            dim = Dimension.objects.get(name=k)
        except Dimension.DoesNotExist:
            dim = Dimension(name=k)
            dim.save()
        DimensionRecord(dimension=dim, value=row[k]).save()
Obviously you could better handle reading the headers and error-trapping when dimensions already exist, but this would be an example of how you could dynamically load variably-headered CSVs into a SQL database.

Tensorflow error using tf.image.random : 'numpy.ndarray' object has no attribute 'get_shape'

Intro
I am using a modified version of the Tensorflow tutorial "Deep MNIST for experts" with the Python API for a medical image classification project using convolutional networks.
I want to artificially increase the size of my training set by applying random modifications on the images of my training set.
Problem
When I run the line:
flipped_images = tf.image.random_flip_left_right(images)
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'
My Tensor "images" is an ndarray (shape=[batch, im_size, im_size, channels]) of "batch" ndarrays (shape=[im_size, im_size, channels]).
Just to check whether my input data was packed in the right shape and type, I tried to apply this simple function in the (unmodified) tutorial "Tensorflow Mechanics 101" and I get the same error.
Finally, I still get the same error trying to use the following functions :
tf.image.random_flip_up_down()
tf.image.random_brightness()
tf.image.random_contrast()
Questions
As input data is usually carried in Tensorflow as ndarrays, I would like to know:
Is it a bug in the Tensorflow Python API, or is it my "fault" because of the type/shape of my input data?
How could I get it to work and be able to apply tf.image.random_flip_left_right to my training set?
This seems like an inconsistency in the TensorFlow API, since almost all other op functions accept NumPy arrays wherever a tf.Tensor is expected. I've filed an issue to track the fix.
Fortunately, there is a simple workaround, using tf.convert_to_tensor(). Replace your code with the following:
flipped_images = tf.image.random_flip_left_right(tf.convert_to_tensor(images))
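For instance, a minimal self-contained sketch in TF 1.x style, with a hypothetical 28x28 RGB image:
import numpy as np
import tensorflow as tf

# A single image as an ndarray; random_flip_left_right expects a 3-D tensor.
image = np.random.rand(28, 28, 3).astype(np.float32)
flipped = tf.image.random_flip_left_right(tf.convert_to_tensor(image))

with tf.Session() as sess:
    result = sess.run(flipped)  # back to a numpy ndarray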

How to create separate python script for uploading data into ndb

Can anyone point me in the right direction as to where I should place a script solely for loading data into ndb? I wish to upload all the data into GAE ndb so that the application can run queries on it.
Right now, the loading of data is done in my application. I wish to place it separately from the main application.
Should it be configured in the yaml file?
EDITED
This is a snippet of the entity and the handler to upload the data into GAE ndb.
I wish to place this chunk of code separately from my main application .py, because this data upload won't be done frequently and I want to keep the code in the main application "cleaner".
class TagTrend_refine(ndb.Model):
    tag = ndb.StringProperty()
    trendData = ndb.BlobProperty(compressed=True)

class MigrateData(webapp2.RequestHandler):
    def get(self):
        listOfEntities = []
        f = open("tagTrend_refine.txt")
        lines = f.readlines()
        f.close()
        for line in lines:
            temp = line.strip().split("\t")
            data = TagTrend_refine(
                tag=temp[0],
                trendData=temp[1]
            )
            listOfEntities.append(data)
        ndb.put_multi(listOfEntities)
For example, if I placed the above code in a file called dataLoader.py, where should I invoke this script?
In app.yaml alongside my main application (knowledgeGraph.application)?
- url: /.*
  script: knowledgeGraph.application
You don't show us the application object (no doubt a WSGI app) in your knowledge.py module, so I can't know what URL you want to serve with the MigrateData handler -- I'll just guess it's something like /migratedata.
So the class TagTrend_refine should be in a separate file (usually called models.py) so that both your dataloader.py, and your knowledge.py, can import models to access it (and models.py will need its own import of ndb of course). (Then of course access to the entity class will be as models.TagTrend_refine -- very basic Python).
Next, you'll complete dataloader.py by defining a WSGI app, e.g., at the end of the file,
app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])
(of course this means this module will need to import webapp2 as well -- can I take for granted a knowledge of super-elementary Python?).
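Putting those pieces together, dataloader.py might look roughly like this (a sketch, assuming models.py defines TagTrend_refine as described):
# dataloader.py
import webapp2
from google.appengine.ext import ndb

import models

class MigrateData(webapp2.RequestHandler):
    def get(self):
        entities = []
        with open("tagTrend_refine.txt") as f:
            for line in f:
                tag, trend = line.strip().split("\t")
                entities.append(models.TagTrend_refine(tag=tag, trendData=trend))
        ndb.put_multi(entities)

app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])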
In app.yaml, as the first URL, before that /.*, you'll have:
- url: /migratedata
  script: dataloader.app
Given all this, when you visit /migratedata, your handler will read the tagTrend_refine.txt file that you uploaded together with your .py, .yaml, and other files in your overall GAE application, and unconditionally create one entity per line of that file (assuming you fix the multiple indentation problems in your code as displayed above; but, again, this is just super-elementary Python). Presumably you've used both tabs and spaces and they show up OK in your editor, but not here on SO... I recommend you use strictly, only, spaces, never tabs, in Python code.
However this does seem to be a peculiar task. If /migratedata gets visited twice, it will create duplicates of all entities. If you change the tagTrend_refine.txt and deploy a changed variation, then visit /migratedata... all old entities will stick around and all the new entities will join them. And so forth.
Moreover -- /migratedata is NOT idempotent (if visited more than once it does not produce the same state as running it just once) so it shouldn't be a GET (and now we're on to super-elementary HTTP for a change!-) -- it should be a POST.
In fact I suspect (but I'm really flying blind here, since you see fit to give such tiny amounts of information) that you in fact want to upload a .txt file to a POST handler and do the updates that way (perhaps avoiding duplicates...?). However, I'm no mind reader, so this is about as far as I can go.
I believe I have fully answered the question you posted (though perhaps not the one you meant but didn't express:-) and by SO's etiquette it would be nice to upvote and accept this answer, then, if needed, post another question, expressing MUCH more clearly and completely what you're trying to achieve, your current .py and .yaml (ideally with correct indentation), what they actually do and why you'd like to do something different. For POST vs GET in particular, just study When should I use GET or POST method? What's the difference between them? ...
Alex's solution will work, as long as all your data can be loaded in under 1 minute, as that's the timeout for an App Engine request.
For larger data, consider calling the datastore API directly from your own computer where you have the source. It's a bit of a hassle because it's a different API; it's not ndb. But it's still a pretty simple API. Here's some code that calls the API:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/2-structured-data/bookshelf/model_datastore.py
Again, this code can run anywhere. It doesn't need to be uploaded to app engine to run.
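For example, a minimal sketch using the google-cloud-datastore client library (the project ID is a placeholder, and the kind name just mirrors the ndb model above):
from google.cloud import datastore

client = datastore.Client(project="my-project-id")

entities = []
with open("tagTrend_refine.txt") as f:
    for line in f:
        tag, trend = line.strip().split("\t")
        # One entity per line; the kind mirrors the ndb model's name.
        entity = datastore.Entity(key=client.key("TagTrend_refine"))
        entity.update({"tag": tag, "trendData": trend})
        entities.append(entity)
client.put_multi(entities)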