I am using Python 2.7 for general data analysis of various sorts; however, I am facing a problem with fitting my data.
If I try to fit my data using binned likelihood, unbinned likelihood, chi2, KMeans, etc., the software always returns exactly the same value, even if I supply different starting parameters to the fit. For some methods, the returned value is completely off; see the attached KMeans plot.
The following packages show the problem:
Probfit
Iminuit
Sklearn
Scipy
Any ideas what is going on?
My own
Same code on a coworker's computer (same setup)
I am looking to replicate a plot similar to this:
with the following criteria:
NON-ribbon connections between points (thin lines)
Text outside each dot
Only 10 or so points, instead of the 30+ shown above.
Additional colors (lines/text) not necessary.
I'm aware that there are packages like Plotly, but these seem to be dedicated to ribbon-like connections instead of thin lines. I'm also aware of Circos, but it seems difficult to use for something that is relatively simple.
Is there a straightforward Python 2.7 script that I can use (an online example, or an online tool)?
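One possible starting point is to place the points evenly on a circle and join them with plain matplotlib line segments. The sketch below is only an illustration under assumptions: the helper names `circle_layout` and `draw_connections`, the labels, the edge list, and the output file name are all hypothetical, and straight chords are used rather than curved ones.

```python
import math


def circle_layout(labels, radius=1.0, label_offset=0.15):
    """Place labels evenly on a circle; text positions sit just outside it."""
    layout = {}
    for k, label in enumerate(labels):
        angle = 2.0 * math.pi * k / len(labels)
        text_radius = radius + label_offset
        layout[label] = {
            "point": (radius * math.cos(angle), radius * math.sin(angle)),
            "text": (text_radius * math.cos(angle), text_radius * math.sin(angle)),
        }
    return layout


def draw_connections(labels, edges, filename="connections.png"):
    """Draw dots, thin connecting lines, and labels outside each dot."""
    # matplotlib is imported here so the layout helper works without it
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    layout = circle_layout(labels)
    fig, ax = plt.subplots(figsize=(6, 6))
    for start, end in edges:
        (x0, y0), (x1, y1) = layout[start]["point"], layout[end]["point"]
        ax.plot([x0, x1], [y0, y1], color="gray", linewidth=0.8)
    for label, pos in layout.items():
        ax.plot(pos["point"][0], pos["point"][1], "o", color="black")
        ax.text(pos["text"][0], pos["text"][1], label, ha="center", va="center")
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(filename)
    plt.close(fig)
```

A call such as `draw_connections(["A", "B", "C", "D"], [("A", "C"), ("B", "D")])` would write the figure to `connections.png`.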
I am trying to use Lasagne to train a simple neural network, and to use my own C++ code to do inference. I use the weights generated by Lasagne, but I am not able to get good results. Is there a way I can print the output of a hidden layer and/or the calculations themselves? I want to see how it works under the hood, so I can implement it the same way in C++.
I can help with Lasagne + Theano in Python. I am not sure from your question whether you work entirely in C++ or only need the results from Python + Lasagne in your C++ code.
Let's consider you have a simple network like this:
l_in = lasagne.layers.InputLayer(...)
l_in_drop = lasagne.layers.DropoutLayer(l_in, ...)
l_hid1 = lasagne.layers.DenseLayer(l_in_drop, ...)
l_out = lasagne.layers.DenseLayer(l_hid1, ...)
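You can get the output of each layer by calling lasagne.layers.get_output on that layer. Each call returns a symbolic Theano expression, which you still need to compile with theano.function to obtain numeric values: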
lasagne.layers.get_output(l_in, deterministic=False) # this will just give you the input tensor
lasagne.layers.get_output(l_in_drop, deterministic=True)
lasagne.layers.get_output(l_hid1, deterministic=True)
lasagne.layers.get_output(l_out, deterministic=True)
When you are dealing with dropout and you are not in the training phase, it's important to remember to call get_output method with the deterministic parameter set to True, to avoid non-deterministic behaviours. This applies to all layers that are preceded by one or more dropout layers.
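For checking a C++ port, it may help to reproduce the forward pass by hand. The sketch below assumes that a `DenseLayer` computes `nonlinearity(x . W + b)` with `W` of shape `(num_inputs, num_units)`, and that dropout acts as the identity when `deterministic=True`. The rectifier nonlinearity and the toy weights are assumptions for illustration; substitute whatever your network uses (the trained values can be read with `lasagne.layers.get_all_param_values(l_out)`). It is written with plain loops so that it maps directly onto C++.

```python
def rectify(values):
    """Rectifier (ReLU), applied element-wise."""
    return [max(0.0, v) for v in values]


def dense_forward(x, W, b, nonlinearity=rectify):
    """Compute nonlinearity(x . W + b) for a single input vector.

    x: list of num_inputs floats
    W: nested list of shape num_inputs x num_units
    b: list of num_units floats
    """
    pre_activation = []
    for j in range(len(b)):
        total = b[j]
        for i in range(len(x)):
            total += x[i] * W[i][j]
        pre_activation.append(total)
    return nonlinearity(pre_activation)


# Toy values, purely illustrative
x = [1.0, 2.0]
W_hid = [[0.5, -1.0], [0.25, 0.5]]
b_hid = [0.0, -1.0]
hidden = dense_forward(x, W_hid, b_hid)
print(hidden)  # [1.0, 0.0]
```

Comparing these hand-computed values layer by layer against the compiled `get_output` results is one way to find where a C++ implementation starts to diverge.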
I hope this answers your question.
I came across the tsfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.
I wanted to run the following code, which is shared in the quick start section of the tsfresh documentation, and it seemed simple enough.
from tsfresh import extract_relevant_features
feature_filtered_direct=extract_relevant_features(result,y,column_id=0,column_sort=1)
My data included 400 000 rows of sensor data, with 6 sensors each for 15 different ids. I started running the code, and 17 hours later it still had not finished. I figured this might be too large a data set to run through the relevant feature extractor, so I trimmed it down to 3000 rows, and then further down to 300. Neither reduction made the code finish in under an hour, and each time I ended up shutting it down after an hour or so of waiting. I tried the standard feature extractor as well:
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
I also tried the example dataset that tsfresh presents in its quick start section, which is very similar to my original data and has about the same number of data points as my reduced set.
Does anybody have any experience with this code? How would you go about making it work faster? I'm using Anaconda for Python 2.7.
Update
It seems to be related to multiprocessing. Because I am on Windows, code that uses multiprocessing needs to be protected by
if __name__ == "__main__":
    main()
Once I added
if __name__ == "__main__":
    extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
to my code, the example data worked. I'm still having some issues with running the extract_relevant_features function and with running extract_features on my own data set. It still seems to run slowly. I have a feeling it's related to the multiprocessing freeze as well, but without any errors popping up it's impossible to tell. It takes about 30 minutes to extract features on less than 1% of my dataset.
Which version of tsfresh did you use? Which OS?
We are aware of the high computational costs of some feature calculators. There is little we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.
Have you tried calculating only the basic features by using the MinimalFeatureExtractionSettings? It will only contain basic features such as Max, Min, Median and so on but should run way, way faster.
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time", feature_extraction_settings = MinimalFeatureExtractionSettings())
Also it is probably a good idea to install the latest version from the repo by pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it and the master should contain the newest and freshest version ;).
The syntax has changed slightly (see the docs); the current approach would be:
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())
Or
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())
Since version 0.15.0 we have improved our bindings for Apache Spark and dask.
It is now possible to use the tsfresh feature extraction directly in your usual dask or Spark computation graph.
You can find the bindings in tsfresh.convenience.bindings with the documentation here. For example for dask, it would look something like this (assuming df is a dask.DataFrame, for example the robot failure dataframe from our example)
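from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction import EfficientFCParameters

# Reshape to long format: one row per (id, time, kind, value)
df = df.melt(id_vars=["id", "time"],
             value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
             var_name="kind", value_name="value")
df_grouped = df.groupby(["id", "kind"])
features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
                                            column_sort="time", column_value="value",
                                            default_fc_parameters=EfficientFCParameters())
                                            # or any other parameter set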
Using either dask or Spark (or anything similar) might help you with very large data, both for memory and for speed (as you can distribute the work over multiple machines). Of course, we still support the usual distributors (docu) as before.
In addition, it is also possible to run tsfresh together with a task orchestration system, such as luigi. You can create a task to
* read in the data for only one id and kind
* extract the features
* write out the result to disk
and let luigi handle all the rest. You may find a possible implementation of this here on my blog.
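The per-chunk pattern described above can be sketched without luigi or tsfresh. In this illustration every name is hypothetical (`split_by_id_and_kind`, `basic_features`, `process_chunk`, the row layout, and the output file naming), and `basic_features` is only a stand-in for a real feature extraction call; the point is that each (id, kind) chunk is processed and written independently, so memory use stays bounded by the chunk size.

```python
import json
import os
import statistics
import tempfile
from collections import defaultdict


def split_by_id_and_kind(rows):
    """Group time-sorted values by (id, kind)."""
    chunks = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["time"]):
        chunks[(row["id"], row["kind"])].append(row["value"])
    return chunks


def basic_features(values):
    """Stand-in for a real feature extractor: a few minimal features."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "median": statistics.median(values),
    }


def process_chunk(key, values, out_dir):
    """Extract features for one chunk and write them to their own file."""
    chunk_id, kind = key
    path = os.path.join(out_dir, "%s_%s.json" % (chunk_id, kind))
    with open(path, "w") as handle:
        json.dump(basic_features(values), handle)
    return path


# Toy data, purely illustrative
rows = [
    {"id": 1, "time": 0, "kind": "F_x", "value": 3.0},
    {"id": 1, "time": 1, "kind": "F_x", "value": 1.0},
    {"id": 1, "time": 2, "kind": "F_x", "value": 2.0},
    {"id": 2, "time": 0, "kind": "F_x", "value": 5.0},
]
out_dir = tempfile.mkdtemp()
paths = [process_chunk(key, values, out_dir)
         for key, values in sorted(split_by_id_and_kind(rows).items())]
```

An orchestrator would run one such task per chunk and skip chunks whose output file already exists, which also makes interrupted runs resumable.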
I've found, at least on a multicore machine, that a better way to distribute extract_features calculation over independent subgroups (identified by the column_id value) is through joblib.Parallel with the Loky backend.
For example, you define your feature extraction function on a single value of column_id and apply it to each subgroup:
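from multiprocessing import cpu_count

from joblib import Parallel, delayed
from tqdm import tqdm
from tsfresh import extract_features

def map_extract_features(df):
    # Extract features for the rows of a single ID; `settings` is the
    # feature-calculator parameter set, defined elsewhere.
    return extract_features(
        timeseries_container=df,
        default_fc_parameters=settings,
        column_id="ID",
        column_sort="DATE",
        n_jobs=1,  # parallelism is handled by joblib below
        disable_progressbar=True
    ).reset_index().rename({"index": "ID_CONTO"}, axis=1)

# One joblib task per unique ID, leaving one core free
out = Parallel(n_jobs=cpu_count() - 1)(
    delayed(map_extract_features)(
        my_dataframe[my_dataframe["ID"] == id_value]
    ) for id_value in tqdm(my_dataframe["ID"].unique())
)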
This method takes way less memory than specifying column_id directly in the extract_features function.
I'm running into a weird issue with some linear regression stuff from sklearn. Specifically, linear_model.
I'm trying to do some basic machine learning, and so I have a part of my script that combs through my data and extracts features into a list (of lists) X, and then another part that feeds those features into the fit function. So I've got (roughly)
from sklearn import linear_model
X, y = extractFeaturesFromData(data,numfeatures) # my homemade function
reg = linear_model.LinearRegression()
reg.fit(X,y)
When I run this, I get (along with the traceback)
ValueError: setting an array element with a sequence.
The example here ran fine. And the X and y that extractFeaturesFromData returns are of type 'list', same as in the example. If I use the dummy X and y from the example page, it works fine, but using mine causes it to throw an error.
I've tried varying the number of features extracted into X, and printing out the X and y returned from my function (which shows them to be in the same format as their dummy counterparts from the example), but haven't had any luck so far. I'm running Python 2.7 on a MacBook running OS X 10.9.5. Any idea why this might be happening? Any help would be much appreciated.
Figured it out! It was completely unrelated to my code itself; one of the files that I was importing was a good bit larger than the others, and (I think) was being automatically split into an array, causing the error. Removing that file made everything run fine.
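More generally, this error commonly appears when the rows of X have different lengths or contain nested sequences, so that NumPy cannot build a rectangular array of floats. A minimal sketch for locating such rows before calling fit; `find_irregular_rows` is a hypothetical helper, not part of scikit-learn, and it assumes X is a non-empty list of lists:

```python
def find_irregular_rows(X):
    """Return (index, reason) pairs for rows that would break a 2-D float array."""
    problems = []
    expected = len(X[0]) if X else 0
    for idx, row in enumerate(X):
        if len(row) != expected:
            problems.append((idx, "length %d, expected %d" % (len(row), expected)))
        elif any(isinstance(v, (list, tuple)) for v in row):
            problems.append((idx, "contains a nested sequence"))
    return problems


# Toy data: row 1 is too long, row 2 hides a list inside a feature slot
X = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0, [7.0, 8.0]]]
print(find_irregular_rows(X))
# [(1, 'length 3, expected 2'), (2, 'contains a nested sequence')]
```

An empty result means every row has the same length and holds only scalars, in which case the cause lies elsewhere.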