Implementing TensorFlow Attention OCR on iOS - C++

I have successfully trained (using Inception V3 weights as initialization) the Attention OCR model described here: https://github.com/tensorflow/models/tree/master/attention_ocr and frozen the resulting checkpoint files into a graph. How can this network be implemented using the C++ API on iOS?
Thank you in advance.

As suggested by others, you can use some existing iOS demos (1, 2) as a starting point, but pay close attention to the following details:
Make sure you use the right tools to "freeze" the model. The SavedModel is the universal serialization format for TensorFlow models.
A model export script can, and usually does, perform some kind of input normalization. Note that the Model.create_base function expects a tf.float32 tensor of shape [batch_size, height, width, channels] with values normalized to [-1.25, 1.25]. If you do image normalization as part of the TensorFlow computation graph, make sure images are passed in unnormalized, and vice versa.
To get the names of the input/output tensors, you can simply print them, e.g. somewhere in your export script:
# 'model' is the Attention OCR Model instance built elsewhere in the export script
data_images = tf.placeholder(dtype=tf.float32, shape=[batch_size, height, width, channels], name='normalized_input_images')
endpoints = model.create_base(data_images, labels_one_hot=None)
# The printed tensor names are the ones you will reference from the C++ side
print(data_images, endpoints.predicted_chars, endpoints.predicted_scores)
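
On the iOS side, the frozen graph can then be loaded and run with the TensorFlow C++ API. Below is a minimal sketch, not production code: the file name passed in and the input/output tensor names are assumptions, so substitute whatever your export script actually printed. Pixel values must already be mapped into [-1.25, 1.25] (e.g. value / 255.0 * 2.5 - 1.25) if the normalization is not part of the graph.

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/public/session.h"

tensorflow::Status RunAttentionOcr(const std::string& graph_path,
                                   const std::vector<float>& normalized_pixels,
                                   int batch, int height, int width, int channels,
                                   std::vector<tensorflow::Tensor>* outputs) {
  // Load the frozen GraphDef produced by the freeze/export step.
  tensorflow::GraphDef graph_def;
  tensorflow::Status status = tensorflow::ReadBinaryProto(
      tensorflow::Env::Default(), graph_path, &graph_def);
  if (!status.ok()) return status;

  std::unique_ptr<tensorflow::Session> session(
      tensorflow::NewSession(tensorflow::SessionOptions()));
  status = session->Create(graph_def);
  if (!status.ok()) return status;

  // Build the input tensor; values must already be normalized to [-1.25, 1.25],
  // since that is what Model.create_base expects (see above).
  tensorflow::Tensor input(
      tensorflow::DT_FLOAT,
      tensorflow::TensorShape({batch, height, width, channels}));
  std::copy(normalized_pixels.begin(), normalized_pixels.end(),
            input.flat<float>().data());

  // Hypothetical tensor names -- replace them with the names printed by the export script.
  return session->Run({{"normalized_input_images:0", input}},
                      {"AttentionOcr_v1/predicted_chars:0",
                       "AttentionOcr_v1/predicted_scores:0"},
                      {}, outputs);
}

The outputs come back as tensorflow::Tensor objects; the predicted characters are class ids, which you then map back to text using the same charset that was used for training.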

Related

How to find logos on a website screenshot

I'm looking for a way to check whether a given logo appears in a screenshot of a web page. So basically, I need to be able to find a small predefined image within a larger image that may or may not contain it. A match could be at a different scale or with somewhat different colors, and I also need to judge how similar the match is. I need some pointers on what to look at; I've never worked with computer vision before.
The simplest (though not trivial) way to do it is a normal CNN trained on an augmented dataset of the logos.
To keep the answer short: build a CNN in TensorFlow and train it on lots of logo images, with a label on each training image. It's a straightforward task, and even a not-very-sophisticated CNN should be able to get the job done.
CNN: Convolutional Neural Network
Reference: https://etasr.com/index.php/ETASR/article/view/3919

TensorFlow Lite Micro - Implementing a CNN for Binary Image Classification on ESP32

I am an Electrical & Electronics Engineer trying to implement a binary image classifier that uses a convolutional neural network in TensorFlow Lite Micro on an ESP32. I have trained a simple model that takes in an RGB image of resolution 1024 (height) x 256 (width) in PNG format and returns an output of either 0 or 1 to assign the image to one of two classes.
I have read Pete Warden's book on TinyML and have completed the following steps as per the guide:
Trained a simple binary image-classifying CNN in Google Colab.
Normalized the images to improve training performance and reduced the total number of parameters to 1464.
Saved the model in the SavedModel format and converted it to a .tflite file.
Finally, performed post-training quantization, generating four additional .tflite files: dynamic range, float16, full integer with float fallback, and full integer only.
I then used xxd to convert these .tflite files into .cc files as mentioned in the book.
My problem right now is that, even though I have worked quite a bit with microcontrollers and the book explains the steps in detail, I'm having difficulty understanding how to actually deploy the model onto the microcontroller itself in a simple manner. I have been trying to understand the process by studying the file structure of the example projects mentioned in the book, but even the simple hello-world example has a lot of files, and the code is challenging for a beginner to follow.
I would like to know how to pass a ~107 KB RGB PNG file into the model for inference. I understand that I would need to store the image on SPIFFS (SPI Flash File System) as a PNG, since converting it into an array and storing it as a text file would lose the compression and make the file far too large for the microcontroller.
Therefore, I would like some guidance on how to proceed with the C++ code for deploying the CNN model onto my ESP32 microcontroller in a simple way.
Any help is appreciated, Thank you in advance.
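
For reference, the TFLite Micro runtime consumes the xxd-generated C array directly, and the input tensor takes raw (already decoded and quantized) pixel values rather than a PNG file, so the PNG would have to be decoded into a pixel buffer first. Below is a minimal sketch of the usual interpreter setup; the array name g_model_data, the header model_data.h, the arena size, and the int8 input/output are all assumptions, and the exact headers and constructor arguments differ slightly between TFLM releases (older ones, as used in the book, also take an ErrorReporter).

#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

#include "model_data.h"  // hypothetical header for the xxd-generated g_model_data array

namespace {
// A guess; tune until AllocateTensors() succeeds within the ESP32's RAM budget.
constexpr int kTensorArenaSize = 150 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
}  // namespace

float ClassifyImage(const int8_t* pixels, size_t num_pixels) {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // AllOpsResolver is the simplest choice; a MicroMutableOpResolver listing
  // only the ops your model actually uses saves flash and RAM.
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return -1.0f;

  // Copy the decoded, quantized pixels into the input tensor. The model sees
  // raw tensor data, never the PNG container.
  TfLiteTensor* input = interpreter.input(0);
  memcpy(input->data.int8, pixels, num_pixels);

  if (interpreter.Invoke() != kTfLiteOk) return -1.0f;

  // Dequantize the single int8 output back to a float score.
  TfLiteTensor* output = interpreter.output(0);
  return (output->data.int8[0] - output->params.zero_point) * output->params.scale;
}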

Vision Framework with ARKit and Core ML

While researching best practices and experimenting with multiple options for an ongoing project (i.e. a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation and then passing the images through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and Core ML; let me explain.
I am wondering how I would be able to capture ARFrames and use the Vision framework to detect and track a given object using a Core ML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object on a touch gesture, but that is something that could be implemented after getting the core project working.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.

Importing weights into OpenCV MLP?

Say I want to use some other product (R, Python, MATLAB, whatever) to create an MLP, but I want to run that network, i.e. just for prediction, under OpenCV. Assume that the parameters (e.g. the activation function) are compatible between the training product and OpenCV.
How can I import my trained weights into the OpenCV MLP? Perhaps the training product uses an MxN matrix of weights for each layer, where M is the size of the input layer and N the size of the output layer (so W(i,j) would be the weight between input node i and output node j). Perhaps the biases are stored in a separate N-element vector. The specifics of the original format don't matter much, because as long as I know what the weights mean and how they are stored, I can transform them into whatever OpenCV needs.
So, given that, how do I import these weights into a (prediction-only) OpenCV MLP? What weight/bias format does OpenCV need, and how do I set its weights and biases?
I've just run into the same problem. I haven't looked at OpenCV's MLP class enough yet to know whether there's an easier or simpler way, but OpenCV lets you save and load MLPs as .xml and .yml files. So if you build an ANN in OpenCV, you can save it to one of those formats, inspect the file to figure out the layout OpenCV expects, and then write your network into that format from R/Python/MATLAB, or at least into some intermediate format plus a script that translates it into OpenCV's format. Once that's done, it should be as simple as instantiating OpenCV's MLP in the code where you actually want to predict and calling its load("filename") function. (I realize this is a year after the fact, so hopefully you found an answer or a workaround. If you found a better approach, tell me, I'd love to know.)
You must write your model out the same way the 'read' function of OpenCV's MLP parses the xml or yml. I don't think this will be too hard.
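
To make that concrete, here is a sketch of the round-trip approach described above (OpenCV 3.x/4.x ml module; the file names are placeholders). Save a small ANN_MLP trained in OpenCV once, open the resulting .yml to see the layer sizes and weight matrices OpenCV expects, have your converter script write your R/Python/MATLAB weights in the same layout, and then just load and predict:

#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

int main() {
  // Load an MLP stored in OpenCV's own XML/YAML layout (written either by
  // OpenCV itself or by your translation script).
  // OpenCV 3.3+; on older 3.x use cv::Algorithm::load<cv::ml::ANN_MLP>(...).
  cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::load("mlp_weights.yml");

  // Predict on one sample: rows are samples (CV_32F), one column per input node.
  cv::Mat sample = (cv::Mat_<float>(1, 4) << 0.1f, 0.2f, 0.3f, 0.4f);
  cv::Mat response;
  mlp->predict(sample, response);
  // 'response' has one row with one value per output node.
  return 0;
}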

How to set classification colors in GDAL output files

I am using the GDAL C++ library to reclassify raster map images and then create an output image of the new data. However, when I create the new image and open it, the classification values don't seem to have a color defined, so I just get a black image. I can fix this by going into the image properties and setting a color for each of the 10 classification values I'm using, but that is extremely time-consuming given the number of maps and trials I am doing.
My question is: is there a way to set metadata through the GDAL API to define a color for each classification value? Just the name of the right function would be great; I can figure it out from there.
I have tried this in both ArcGIS and Quantum GIS, and both show the same problem. The file type I am using is Erdas Imagine (called "HFA" in GDAL).
You can use the SetColorTable() method on your raster band. The easiest approach is to fetch the color table from a pre-existing raster using GetColorTable() and pass it to your new raster.
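
For a classified raster with values 0-9, building the table from scratch looks roughly like the sketch below (C++ GDAL API; the file name and the per-class colors are placeholders). Alternatively, as suggested above, open a raster that already has the colors you want, call GetColorTable() on its band, and pass that table (or its Clone()) to SetColorTable() on the new band.

#include "gdal_priv.h"

void AttachColorTable(const char* path) {
  GDALAllRegister();
  GDALDataset* ds = static_cast<GDALDataset*>(GDALOpen(path, GA_Update));
  if (ds == nullptr) return;

  GDALRasterBand* band = ds->GetRasterBand(1);

  // One entry per classification value (0-9); pick whatever colors you like.
  GDALColorTable ct(GPI_RGB);
  for (int classValue = 0; classValue < 10; ++classValue) {
    GDALColorEntry entry;
    entry.c1 = static_cast<short>(25 * classValue);        // red
    entry.c2 = static_cast<short>(255 - 25 * classValue);  // green
    entry.c3 = 128;                                        // blue
    entry.c4 = 255;                                        // alpha (opaque)
    ct.SetColorEntry(classValue, &entry);
  }

  band->SetColorInterpretation(GCI_PaletteIndex);
  band->SetColorTable(&ct);  // the caller keeps ownership of ct

  GDALClose(ds);
}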