TSFRESH library for python is taking way too long to process - python-2.7

I came across the tsfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.
I wanted to implement the following code, which is shared in the quick start section of the tsfresh documentation, and it seems simple enough:
from tsfresh import extract_relevant_features
feature_filtered_direct = extract_relevant_features(result, y, column_id=0, column_sort=1)
My data includes 400,000 rows of sensor data, with 6 sensors each for 15 different IDs. I started running the code, and 17 hours later it still had not finished. I figured the data set might be too large to run through the relevant feature extractor, so I trimmed it down to 3,000 rows, and then further down to 300. Neither reduction made the code finish within an hour, and I just ended up shutting it down after an hour or so of waiting. I tried the standard feature extractor as well:
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
I also tried the example dataset that tsfresh presents in its quick start section, which is very similar to my original data and has about as many data points as my reduced set.
Does anybody have any experience with this code? How would you go about making it run faster? I'm using Anaconda for Python 2.7.
Update
It seems to be related to multiprocessing. Because I am on Windows, code that uses multiprocessing must be protected by
if __name__ == "__main__":
    main()
Once I added
if __name__ == "__main__":
    extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
to my code, the example data worked. I'm still having issues running the extract_relevant_features function and the extract_features module on my own data set: it continues to run slowly. I have a feeling it's related to the multiprocessing freeze as well, but without any errors popping up it's impossible to tell. It's taking me about 30 minutes to extract features from less than 1% of my dataset.

Which version of tsfresh did you use? Which OS?
We are aware of the high computational costs of some feature calculators. There is little we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.
Have you tried calculating only the basic features by using the MinimalFeatureExtractionSettings? It will only contain basic features such as max, min, median and so on, but should run way, way faster:
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time",
                                      feature_extraction_settings=MinimalFeatureExtractionSettings())
Also, it is probably a good idea to install the latest version from the repo via pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it, and master should contain the newest and freshest version ;).

The syntax has changed slightly (see the docs); the current approach would be:
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())
Or
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())

Since version 0.15.0 we have improved our bindings for Apache Spark and dask.
It is now possible to use the tsfresh feature extraction directly in your usual dask or Spark computation graph.
You can find the bindings in tsfresh.convenience.bindings, with the documentation here. For example, for dask it would look something like this (assuming df is a dask.DataFrame, for example the robot failure dataframe from our example):
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk

df = df.melt(id_vars=["id", "time"],
             value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
             var_name="kind", value_name="value")
df_grouped = df.groupby(["id", "kind"])
features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
                                            column_sort="time", column_value="value",
                                            default_fc_parameters=EfficientFCParameters())
# or any other parameter set
Using either dask or Spark (or anything alike) might help you with very large data, both for memory and for speed (as you can distribute the work over multiple machines). Of course, we still support the usual distributors (docu) as before.
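For reference, the distributor route looks roughly like this (a sketch; the worker count of 4 is an arbitrary assumption):

from tsfresh import extract_features
from tsfresh.utilities.distribution import MultiprocessingDistributor

# run the feature calculators on 4 local worker processes
distributor = MultiprocessingDistributor(n_workers=4, disable_progressbar=False,
                                         progressbar_title="Feature Extraction")
features = extract_features(timeseries, column_id="id", column_sort="time",
                            distributor=distributor)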
In addition to that, it is also possible to run tsfresh together with a task orchestration system, such as luigi (a sketch follows below). You can create a task to
* read in the data for only one id and kind
* extract the features
* write out the result to disk
and let luigi handle all the rest. You may find a possible implementation of this here on my blog.
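To make this concrete, a minimal sketch of such a task (the file layout under chunks/ and features/, the parameter set, and the column names are illustrative assumptions, not part of the blog post):

import luigi
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

class FeatureExtractionTask(luigi.Task):
    # hypothetical: one task per id; luigi schedules and parallelizes the tasks
    data_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("features/{}.csv".format(self.data_id))

    def run(self):
        # read in the data for only one id, extract features, write the result
        df = pd.read_csv("chunks/{}.csv".format(self.data_id))
        features = extract_features(df, column_id="id", column_sort="time",
                                    default_fc_parameters=MinimalFCParameters())
        with self.output().open("w") as f:
            features.to_csv(f)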

I've found, at least on a multicore machine, that a better way to distribute the extract_features calculation over independent subgroups (identified by the column_id value) is through joblib.Parallel with the Loky backend.
For example, you define your feature extraction function for a single value of column_id and apply it like this:
from multiprocessing import cpu_count

from joblib import Parallel, delayed
from tqdm import tqdm
from tsfresh import extract_features

def map_extract_features(df):
    # settings is a tsfresh parameter dict, e.g. MinimalFCParameters()
    return extract_features(
        timeseries_container=df,
        default_fc_parameters=settings,
        column_id="ID",
        column_sort="DATE",
        n_jobs=1,
        disable_progressbar=True
    ).reset_index().rename({"index": "ID_CONTO"}, axis=1)

out = Parallel(n_jobs=cpu_count() - 1)(
    delayed(map_extract_features)(
        my_dataframe[my_dataframe["ID"] == id]
    ) for id in tqdm(my_dataframe["ID"].unique())
)
This method takes way less memory than specifying column_id directly in the extract_features function.


Is tf.py_func allowed at online prediction time?
If yes, are there any examples of how to use it?
Does the answer change if I need to install additional pip packages?
My use case: I work with text and need to do word stemming (using the Porter stemmer). I know how to do it using Python, but TensorFlow doesn't have ops for that. I would like to use the same text processing at training and prediction time, and thus would like to encode it all into a TensorFlow graph.
https://www.tensorflow.org/api_docs/python/tf/py_func comes with known limitations, and I would like to know whether it will work during training and online prediction before I invest more time into it.
Thanks
Unfortunately, no. A py_func cannot be restored from a saved model. However, since your use case involves pre-processing, just invoke the py_func explicitly in all three (train, eval, serving) input functions. This won't work if the py_func is in the middle of your graph, but for stemming it should work just fine.
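For illustration, a minimal sketch of calling a Python stemmer from an input function via tf.py_func (this assumes TensorFlow 1.x and NLTK's PorterStemmer; the dataset contents and function names are hypothetical):

import tensorflow as tf
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem(word):
    # runs as plain Python; word arrives as bytes inside the py_func
    return stemmer.stem(word.decode("utf-8")).encode("utf-8")

def train_input_fn():
    # hypothetical token stream; the same map would be repeated in the
    # eval and serving input functions
    dataset = tf.data.Dataset.from_tensor_slices([b"running", b"jumped", b"flies"])
    return dataset.map(lambda w: tf.py_func(stem, [w], tf.string))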

SAS Enterprise Guide - process dependence and parallel execution

I am working on a project in SAS EG (7.1) which involves process dependence and parallel execution, as depicted in the process flow diagram (not reproduced here).
I have the following questions:
Is there a way to retrieve or set relations (i.e. process_C --> program_D) between the processes programmatically? The maintenance is becoming problematic with complex projects. Ideally, I would like to be able to re-create the links between processes from an external table.
I start the whole process with the “Run branch from <>” option. Let’s assume that we have only 2 processors available. Is there a way to set the order of execution between process_A, B and C? The critical path of the whole flow is “begin -> process_C -> process_D -> end”, hence we would like it to start with process_C in order to ensure minimum execution time.
Thank you in advance.
For 1, I think the answer is "no", if you mean a well-defined SAS programmatic method, at least for the relatively limited information and example you provide above. More might be possible with the metadata server; that is not my area of expertise.
You can do some of this using scripting via PowerShell or VBScript. EG's API is fairly wide open and not all that hard to use. I won't suggest how, as my understanding of this is limited as well, but it seems like it should be possible to do what you suggest, though probably not easy.
For your second point:
First off, EG typically runs "top to bottom" if it has no other information on how to process a particular choice, so put C->D above A/B to get it processed first.
Second, you could perhaps use conditional processing. There should be a macro variable that tells you how many CPUs you have (&SYSNCPU on my machine, hopefully the same in other versions). You could use that value to conditionally link to A then B, as opposed to A and B simultaneously. I'm not sure how easy this would be to do in a flexible fashion, though.

Creating custom voice commands (GNU/Linux)

I'm looking for advice for a personal project.
I'm attempting to create software for building customized voice commands. The goal is to allow the user (me) to record a short piece of audio (2-3 seconds) to define a command/macro. Then, when the user speaks the same phrase again, the command/macro will be executed.
The software must be able to detect a command in less than 1 second of processing time on a low-cost computer (a Raspberry Pi, for example).
I have already searched in two directions:
- Speech recognition (CMU Sphinx, Julius, Simon): there are good open-source solutions, but they often need large database files, and speech recognition is not really what I'm attempting to do; it could also consume too much power for such a small feature.
- Audio fingerprinting (Chromaprint -> http://acoustid.org/chromaprint): this seems to be almost what I'm looking for. The principle is to create a fingerprint from raw audio data, then compare fingerprints to determine whether they are identical. However, this kind of software/library seems to be designed for song identification (like the well-known apps on smartphones): I have tried to configure a good "comparator", but I think I'm going down the wrong path.
Do you know of dedicated software or a piece of code that does something similar?
Any suggestion would be appreciated.
I had a more or less similar project in which I intended to send voice commands to a robot. Speech recognition software is too complicated for such a task. I used an FFT implementation in C++ to extract the Fourier components of the sampled voice, and then created a histogram of the major frequencies (the frequencies at which the target voice command has the highest amplitudes). I tried two approaches (see the sketch after this list):
Comparing the similarity between the histogram of the given voice command and those saved in memory, to identify the most probable command.
Using a Support Vector Machine (SVM) to train a classifier to distinguish voice commands. I used LibSVM, and the results were considerably better than with the first approach. However, one problem with the SVM method is that you need a rather large data set for training. Another problem is that, when an unknown voice is given, the classifier will output a command anyway (which is obviously a wrong detection). This can be avoided with the first approach, where I had a threshold for the similarity measure.
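As an illustration of the first approach, here is a minimal sketch (in Python rather than the C++ used above; the bin count, top-k cut-off, and histogram-intersection measure are assumptions):

import numpy as np

def frequency_histogram(samples, rate, n_bins=32, top_k=20):
    # magnitude spectrum of the recorded command
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    # keep only the frequencies with the highest amplitudes
    major = freqs[np.argsort(spectrum)[-top_k:]]
    hist, _ = np.histogram(major, bins=n_bins, range=(0.0, rate / 2.0))
    return hist / float(hist.sum())

def similarity(h1, h2):
    # histogram intersection; reject as "unknown" below some threshold
    return float(np.minimum(h1, h2).sum())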
I hope this helps you to implement your own voice-activated software.
A song fingerprint is not a good idea for this task, because command timings can vary and a fingerprint expects an exact time match. However, it is very easy to implement matching with the DTW algorithm on time series of features extracted with the CMU Sphinx library sphinxbase. See the Wikipedia entry about DTW for details.
http://en.wikipedia.org/wiki/Dynamic_time_warping
http://cmusphinx.sourceforge.net/wiki/download
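For reference, a minimal sketch of DTW matching over acoustic feature frames (frame extraction itself is left to a library such as sphinxbase; the Euclidean frame distance and length normalization are assumptions):

import numpy as np

def dtw_distance(a, b):
    # a, b: arrays of shape (n_frames, n_features) for two utterances
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],     # insertion
                                 cost[i, j - 1],     # deletion
                                 cost[i - 1, j - 1]) # match
    return cost[n, m] / (n + m)  # length-normalized distance

Matching a recorded command then amounts to computing dtw_distance against each stored template and picking the smallest value, again with a rejection threshold for unknown input.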

Is there any way to make django remote api run faster in GAE?

Following up this question here.
I finally wrote a code generation tool to wrap all my database data into something like this:
Pdtfaamt(fano=212373,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='A3',itemamt=75,type=0).save()
Pdtfaamt(fano=212374,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='E1',itemamt=75,type=0).save()
Pdtfaamt(fano=212375,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='S',itemno='E6',itemamt=75,type=0).save()
Pdtfaamt(fano=212376,comsname='SMM',pdtcode='20PLFCL',kind='1',fatype='C',itemno='A3',itemamt=3,type=1).save()
Yes, that's right! I pulled the entire database out and transformed the data into population instructions so that I could migrate my database up to GAE.
So I deployed the django-nonrel project and used the django-nonrel remote API to trigger the data population process.
It works okay, except for one problem: it's extremely slow. Could anyone tell me how I can improve the speed? I have done some calculations; it may take up to 30 days to get all my data up and running on GAE.
PS: I am using django-nonrel and djangoappengine for the backend.
Write your import script to take advantage of Python's multiprocessing Pool:
import multiprocessing

def import_thing(data):
    thing = ThingEntity(**data)
    thing.put()

def main():
    data = [{'fano': '212373', 'comsname': 'SMM'},
            {'fano': '212374', 'comsname': 'SMM'},
            # ...etc
            ]
    pool = multiprocessing.Pool(4)  # split data into 4 parts to run in parallel
    pool.map(import_thing, data)
Since the AppEngine production servers like having lots of connections, you should play around with the pool size to find the best number. This will not work for importing into the dev server, as it is single-threaded.
Also important: make sure you are putting the entities in batches of, say, 10-20, not one at a time, or the round-trips will kill your performance. So an improved script should work in chunks like:
data = [
    [item1, item2, item3],
    [item4, item5, item6],
    [item7, item8, item9],
]
pool.map(import_batch, data)
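A minimal sketch of such an import_batch function, assuming ThingEntity is a db.Model subclass (db.put accepting a list of entities is the standard batching call; the helper itself is hypothetical):

from google.appengine.ext import db

def import_batch(batch):
    # one datastore round-trip for the whole batch instead of one per entity
    db.put([ThingEntity(**data) for data in batch])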
You probably want to look into the Mapper API.

How to measure the amount of data transmitted by my MPI program?

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how it changes when varying the aforementioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I ran into some difficulties with this approach.
First, I don't know how to call tcpdump from my batch program (which is written in C++ and uses the system function for making calls) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch; I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was running, and I couldn't figure out which port MPI was using. It seems to use many ports, and I wanted to know them for filtering the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I captured the whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' discs. So I only have the RAM and my 4 GB USB flash drive; I can't keep huge log files.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. So far I have only tested the Sun Performance Analyzer. The problem is that I guess it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
I hope someone has a solution.
Approaches like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the BCCD website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that counts messages and sizes; you should be able either to parse the XML output, or to use the post-processing tools and just manually integrate the graphs (say, either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program has finished running, for the file I/O.
If you were really worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface that wrap MPI_Send, MPI_Recv, etc., count the number of bytes sent and received by each process, and output only that total at the end.