Is tf.py_func allowed at online prediction time? - google-cloud-ml

Is tf.py_func allowed at online prediction time?
If yes any examples of how to use it?
Does the answer change if I need to install additional pip packages?
My use-case: I work with text, I need to do word stemming (using porter stemmer), I know how to do it using python, tensorflow doesn't have Ops for that. I would like to use the same text processing at training and prediction time - thus I would like to encode it all into a tensorflow graph.
https://www.tensorflow.org/api_docs/python/tf/py_func comes with known limitations and I would like to know if it will work during training and online prediction before I invest more time into it.
Thanks

Unfortunately, no. Py_func can not be restored from a saved model. However, since your use case involves pre-processing, just invoke the py_func explicitly in all three (train, eval, serving) input functions. This won't work if the py_func is in the middle of your graph, but for stemming, it should work just fine.

Related

TSFRESH library for python is taking way too long to process

I came across the TSfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.
I wanted to implement the following code that was shared in the quick start section of the TFresh documentation. And it seems simple enough.
from tsfresh import extract_relevant_features
feature_filtered_direct=extract_relevant_features(result,y,column_id=0,column_sort=1)
My data included 400 000 rows of sensor data, with 6 sensors each for 15 different id's. I started running the code, and 17 hours later it still had not finished. I figured this might be too large of a data set to run through the relevant feature extractor, so I trimmed it down to 3000, and then further down to 300. None of these actions made the code run under an hour, and I just ended up shutting it down after an hour or so of waiting. I tried the standard feature extractor as well
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
Along with trying the example dataset that TSfresh presents on their quick start section. Which includes a dataset that is very similar to my orginal data, with about the same amount of data points as I reduced to.
Does anybody have any experience with this code? How would you go about making it work faster? I'm using Anaconda for python 2.7.
Update
It seems to be related to multiprocessing. Because I am on windows, using the multiprocess code requires to be protected by
if __name__ == "__main__":
main()
Once I added
if __name__ == "__main__":
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
To my code, the example data worked. I'm still having some issues with running the extract_relevant_features function and running the extract features module on my own data set. It seems as though it continues to run slowly. I have a feeling its related to the multiprocess freeze as well, but without any errors popping up its impossible to tell. Its taking me about 30 minutes to run to extract features on less than 1% of my dataset.
which version of tsfresh did you use? Which OS?
We are aware of the high computational costs of some feature calculators. There is less we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.
Have you tried calculating only the basic features by using the MinimalFeatureExtractionSettings? It will only contain basic features such as Max, Min, Median and so on but should run way, way faster.
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time", feature_extraction_settings = MinimalFeatureExtractionSettings())
Also it is probably a good idea to install the latest version from the repo by pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it and the master should contain the newest and freshest version ;).
Syntax has changed slightly (see docs), the current approach would be:
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())
Or
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())
Since version 0.15.0 we have improved our bindings for Apache Spark and dask.
It is now possible to use the tsfresh feature extraction directly in your usual dask or Spark computation graph.
You can find the bindings in tsfresh.convenience.bindings with the documentation here. For example for dask, it would look something like this (assuming df is a dask.DataFrame, for example the robot failure dataframe from our example)
df = df.melt(id_vars=["id", "time"],
value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
var_name="kind", value_name="value")
df_grouped = df.groupby(["id", "kind"])
features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
column_sort="time", column_value="value",
default_fc_parameters=EfficientFCParameters())
# or any other parameter set
Using either dask or Spark (or anything alike) might help you with very large data - both for memory as well as speed (as you can distribute the work over multiple machines). Of course, we still support the usual distributors (docu) as before.
Additional to that, it is also possible to run tsfresh together with a task orchestration system, such as luigi. You can create a task to
* read in the data for only one id and kind
* extract the features
* write out the result to disk
and let luigi handle all the rest. You may find a possible implementation of this here on my blog.
I've found, at least on a multicore machine, that a better way to distribute extract_features calculation over independent subgroups (identified by the column_id value) is through joblib.Parallel with the Loky backend.
For example, you define your features extraction function on a single value of columnd_id and you apply it
from joblib import Parallel, delayed
def map_extract_features(df):
return extract_features(
timeseries_container=df,
default_fc_parameters=settings,
column_id="ID",
column_sort="DATE",
n_jobs=1,
disable_progressbar=True
).reset_index().rename({"index":"ID_CONTO"}, axis=1)
out = Parallel(n_jobs=cpu_count()-1)(
delayed(map_extract_features)(
my_dataframe[my_dataframe["ID"]==id]
) for id in tqdm(my_dataframe["ID"].unique())
)
This method takes way less memory than specifying column_id directly in the extract_features function.

Reduce a Caffe network model

I'd like to use Caffe to extract image features. However, it takes too long to process an image, so I'm looking for ways to optimize for speed.
One thing I noticed is that the network definition I'm using has four extra layers on top the one from which I'm reading a result (and there are no feedback signals, so they should be safe to delete).
I tried to delete them from the definition file but it had no effect at all. I guess I might need to remove the corresponding part of the file that contains pre-trained weights, too. That is, however, a binary file (a protobuffer) so editing it is not that easy.
Do you think that removing the four layers might have a profound effect of the net performance?
If so then how do I get familiar with the file contents so that I could edit it and how do I know which parts to remove?
first, I don't think removing the binary weights will have any effect.
Second, you can do it easily using the python interface: see this tutorial.
Last but not least, have you tried running caffe time to measure the performance of your net? this may help you identify the bottlenecks of your computations.
PS,
You might find this thread relevant as well.
Caffemodel stores data as key-value pair. Caffe only copies weight for those layers (in train.prototxt) having exactly same name as caffemodel. Hence I don't think removing binary weights will work. If you want to change network structure, just modify train.prototxt and deploy.txt.
If you insist to remove weights from binary file, follow this caffe example.
And to make sure you delete right part, this visualizing tool should help.
I would retrain on a smaller input size, change strides, etc. However if you want to reduce file size, I'd suggest quantizing the weights https://github.com/yuanyuanli85/CaffeModelCompression and then using something like lzma compression (xz for unix). We do this so we can deploy to mobile devices. 8 bit weights compress nicely.

Creating custom voice commands (GNU/Linux)

I'm looking for advices, for a personal project.
I'm attempting to create a software for creating customized voice commands. The goal is to allow user/me to record some audio data (2/3 secs) for defining commands/macros. Then, when the user will speak (record the same audio data), the command/macro will be executed.
The software must be able to detect a command in less than 1 second of processing time in a low-cost computer (RaspberryPi, for example).
I already searched in two ways :
- Speech Recognition (CMU-Sphinx, Julius, simon) : There is good open-source solutions, but they often need large database files, and speech recognition is not really what I'm attempting to do. Speech Recognition could consume too much power for a small feature.
- Audio Fingerprinting (Chromaprint -> http://acoustid.org/chromaprint) : It seems to be almost what I'm looking for. The principle is to create fingerprint from raw audio data, then compare fingerprints to determine if they can be identical. However, this kind of software/library seems to be designed for song identification (like famous softwares on smartphones) : I'm trying to configure a good "comparator", but I think I'm going in a bad way.
Do you know some dedicated software or parcel of code doing something similar ?
Any suggestion would be appreciated.
I had a more or less similar project in which I intended to send voice commands to a robot. A speech recognition software is too complicated for such a task. I used FFT implementation in C++ to extract Fourier components of the sampled voice, and then I created a histogram of major frequencies (frequencies at which the target voice command has the highest amplitudes). I tried two approaches:
Comparing the similarities between histogram of the given voice command with those saved in the memory to identify the most probable command.
Using Support Vector Machine (SVM) to train a classifier to distinguish voice commands. I used LibSVM and the results are considerably better than the first approach. However, one problem with SVM method is that you need a rather large data set for training. Another problem is that, when an unknown voice is given, the classifier will output a command anyway (which is obviously a wrong command detection). This can be avoided by the first approach where I had a threshold for similarity measure.
I hope this helps you to implement your own voice activated software.
Song fingerprint is not a good idea for that task because command timings can vary and fingerprint expects exact time match. However its very easy to implement matching with DTW algorithm for time series and features extracted with CMUSphinx library Sphinxbase. See Wikipedia entry about DTW for details.
http://en.wikipedia.org/wiki/Dynamic_time_warping
http://cmusphinx.sourceforge.net/wiki/download

Cross Validation in libsvm

I'm using libsvm library in my project and have recently discovered that it provides out-of-the-box cross validation.
I'm checking the documentation and it says clearly that I have to call svm-train with -n switch to use CV feature
.
When I call it with -v switch I cannot get a model file which is needed by svm-predict.
Implementing Support Vector Machine from scratch is beyond the scope of my project, so I'd rather fix this one if it is broken or ask the community for support.
Can anybody help with that?
Here's the link to the library, implemented in C and C++, and here is the paper that describes how to use it.
Cause libsvm use cv only for parameter selection.
From libsvm FAQ:
Q: After doing cross validation, why there is no model file outputted ?
Cross validation is used for selecting good parameters. After finding them, you want to re-train the whole data without the -v option.
If you are going to use cv for estimating quality of classifier on your data you should implement external cross validation by splitting data, train on some part and test on other.
It's been a while since I used libsvm so I don't think I have the answer you're looking, but if you run the cross-validation and are satisfied with the results, running lib-svm with the same parameters without the -v will yield the same model.

How do I write a Perl script to filter out digital pictures that have been doctored?

Last night before going to bed, I browsed through the Scalar Data section of Learning Perl again and came across the following sentence:
the ability to have any character in a string means you can create, scan, and manipulate raw binary data as strings.
An idea immediately hit me that I could actually let Perl scan the pictures that I have stored on my hard disk to check if they contain the string Adobe. It seems by doing so, I can tell which of them have been photoshopped. So I tried to implement the idea and came up with the following code:
#!perl
use autodie;
use strict;
use warnings;
{
local $/="\n\n";
my $dir = 'f:/TestPix/';
my #pix = glob "$dir/*";
foreach my $file (#pix) {
open my $pic,'<', "$file";
while(<$pic>) {
if (/Adobe/) {
print "$file\n";
}
}
}
}
Excitingly, the code seems to be really working and it does the job of filtering out the pictures that have been photoshopped. But problem is many pictures are edited by other utilities. I think I'm kind of stuck there. Do we have some simple but universal method to tell if a digital picture has been edited or not, something like
if (!= /the origianl format/) {...}
Or do we simply have to add more conditions? like
if (/Adobe/|/ACDSee/|/some other picture editors/)
Any ideas on this? Or am I oversimplifying due to my miserably limited programming knowledge?
Thanks, as always, for any guidance.
Your best bet in Perl is probably ExifTool. This gives you access to whatever non-image information is embedded into the image. However, as other people said, it's possible to strip this information out, of course.
I'm not going to say there is absolutely no way to detect alterations in an image, but the problem is extremely difficult.
The only person I know of who claims to have an answer is Dr. Neal Krawetz, who claims that digitally altered parts of an image will have different compression error rates from the original portions. He claims that re-saving a JPEG at different quality levels will highlight these differences.
I have not found this to be the case, in my investigations, but perhaps you might have better results.
No. There is no functional distinction between a perfectly edited image, and one which was the way it is from the start - it's all just a bag of pixels in the end, after all, and any other metadata you can remove or forge all you want.
The name of the graphics program used to edit the image is not part of the image data itself but of something called meta data - which may be stored in the image file but, as others have noted, is neither required (so some programs may not store it, some may allow you an option of not storing it) nor reliable - if you forged an image, you might have forged the meta data as well.
So the answer to your question is "no, there's no way to universally tell if the pic was edited or not, although some image editing software may write its signature into the image file and it'll be left there by carelessness of the editing person.
If you're inclined to learn more about image processing in Perl, you could take a look at some of the excellent modules CPAN has to offer:
Image::Magick - read, manipulate and write of a large number of image file formats
GD - create colour drawings using a large number of graphics primitives, and emit the drawings in various formats.
GD::Graph - create charts
GD::Graph3d - create 3D Graphs with GD and GD::Graph
However, there are other utilities available for identifying various image formats. It's more of a question for Super User, but for various unix distros you can use file to identify many different types of files, and for MacOSX, Graphic Converter has never let me down. (It was even able to open the bizarre multi-file X-ray of my cat's shattered pelvis that I got on a disc from the vet.)
How would you know what the original format was? I'm pretty sure there's no guaranteed way to tell if an image has been modified.
I can just open the file (with my favourite programming language and filesystem API) and just write whatever I want into that file willy-nilly. As long as I don't screw something up with the file format, you'd never know it happened.
Heck, I could print the image out and then scan it back in; how would you tell it from an original?
As other's have stated, there is no way to know if the image was doctored. I'm guessing what you basically want to know is the difference between a realistic photograph and one that has been enhanced or modified.
There's always the option of running some extremely complex image recognition algorithm that would analyze every pixel in your image and do some very complicated stuff to determine if the image was doctored or not. This solution would probably involve AI which would examine millions of photos that are both doctored and those that are not and learn from them. However, this is more of a theoretical solution and isn't very practical... you would probably only see it in movies. It would be extremely complex to develop and probably take years. And even if you did get something like this to work, it probably still wouldn't be 100% correct all the time. I'm guessing AI technology still isn't at that level and could take a while until it is.
A not-commonly-known feature of exiftool allows you to recognize the originating software through an analysis of the JPEG quantization tables (not relying on image metadata). It recognizes tables written by many applications. Note that some cameras may use the same quantization tables as some applications, so this isn't a 100% solution, but it is worth looking into. Here is an example of exiftool run on two images, the first was edited by photoshop.
> exiftool -jpegdigest a.jpg b.jpg
======== a.jpg
JPEG Digest : Adobe Photoshop, Quality 10
======== b.jpg
JPEG Digest : Canon EOS 30D/40D/50D/300D, Normal
2 image files read
This will work even if the metadata has been removed.
There is existing software out there which uses various techniques (compression artifacting, comparison to signature profiles in a database of cameras, etc.) to analyze the actual image data for evidence of alteration. If you have access to such software and the software available to you provides an API for external access to these analysis functions, then there's a decent chance that a Perl module exists which will interface with that API and, if no such module exists, it could probably be created rather quickly.
In theory, it would also be possible to implement the image analysis code directly in native Perl, but I'm not aware of anyone having done so and I expect that you'd be better off writing something that low-level and processor-intensive in a fully-compiled language (e.g., C/C++) rather than in Perl.
http://www.impulseadventure.com/photo/jpeg-snoop.html
is a tool that does the job almost good
If there has been any cloning , there is a variation in the pixel density..or concentration which sometimes shows up.. upon manual inspection
a Photoshop cloned area will have even pixel density(my meaning is variation of Pixels wrt a scanned image)