SGDClassifier with HashingVectorizer and TfidfTransformer - python-2.7

I would like to understand if it is possible to train an online SGDClassifier (with partial_fit) using HashingVectorizer and TfidfTransformer. Simply joining them in a Pipeline will not work as TfidfTransformer is stateful so that would break the online learning process. This post says it's not possible to use tf-idf in an online fashion but a comment on this post suggests that it may somehow be possible: "In particular if you use stateful transformers as TfidfTransformer you will need to do several passes on your data". Is that possible without loading the whole training set into memory? If so, how? If not, is there an alternative solution to combine HashingVectorizer with tf-idf on large datasets?

Is that possible without loading the whole training set into memory?
No. TfidfTransformer needs to have the entire X matrix in memory. You'll need to roll your own tf-idf estimator, use that to compute per-term document frequencies in one pass over the data, then do another pass to produce tf-idf features and fit a classifier to them.
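To make the two-pass idea concrete, here is a rough sketch (my own, not an off-the-shelf scikit-learn recipe). `iter_batches` is a placeholder for however you stream (texts, labels) chunks from disk, so only one chunk is in memory at a time; the IDF formula mirrors TfidfTransformer's smoothed default.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import normalize

n_features = 2 ** 20
vec = HashingVectorizer(n_features=n_features, alternate_sign=False, norm=None)

# Pass 1: accumulate per-feature document frequencies, chunk by chunk.
df = np.zeros(n_features)
n_docs = 0
for texts, _ in iter_batches():                      # placeholder batch generator
    X = vec.transform(texts)
    # each CSR row stores a column index at most once, so this counts documents
    df += np.bincount(X.indices, minlength=n_features)
    n_docs += X.shape[0]

# Smoothed IDF, following TfidfTransformer's default formula.
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Pass 2: weight each chunk by the fixed IDF vector and train incrementally.
clf = SGDClassifier()
classes = np.array([0, 1])                           # placeholder: full label set, needed by partial_fit
for texts, y in iter_batches():
    X = vec.transform(texts).multiply(idf).tocsr()   # scale columns by IDF
    X = normalize(X)                                 # L2-normalise, as TfidfTransformer would
    clf.partial_fit(X, y, classes=classes)
```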

Related

How to use uncertainties to weight residuals in a Savitzky-Golay filter.

Is there a way to incorporate the uncertainties on my data set into the result of the Savitzky-Golay fit? Since I am not passing this information into the function, I assume that it is simply calculating the 'best fit' via an unweighted least-squares process. I am currently working with data that has non-uniform uncertainty, so the fit could be improved by including the errors that I have for my main dataset.
The Wikipedia page for the Savitzky-Golay filter suggests how I might go about altering the calculation of the fit coefficients, and I have been staring at the code for scipy.signal.savgol_filter, but I cannot work out what I need to adjust to make it do what I want.
Are there any ready-made weighted SG filters floating about? I find it hard to believe that no-one else has ever needed this tool in Python, but maybe I have missed something.
Check out this Python module: https://github.com/surhudm/savitzky_golay_with_errors
This Python script improves upon the traditional Savitzky-Golay filter by accounting for errors or covariance in the data. The inputs and arguments are all modelled after scipy.signal.savgol_filter.
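If you want to stay within plain NumPy, a minimal sketch of the underlying idea is below; it is my own illustration, not the module's code. Each point is replaced by the value of a weighted least-squares polynomial fit over its window, with weights taken from the per-point uncertainties (np.polyfit's `w` argument multiplies the residuals, so passing 1/sigma corresponds to the usual 1/sigma**2 variance weighting).

```python
import numpy as np

def weighted_savgol(y, sigma, window_length, polyorder):
    """Sliding-window weighted polynomial smoothing (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    half = window_length // 2
    smoothed = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        x = np.arange(lo, hi) - i                   # window coordinates centred on point i
        coeffs = np.polyfit(x, y[lo:hi], polyorder, w=1.0 / sigma[lo:hi])
        smoothed[i] = np.polyval(coeffs, 0.0)       # value of the local fit at the centre
    return smoothed
```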
Matlab function sgolayfilt supports weights. Check the documentation.

TF-IDF vectorizer doesn't work better than CountVectorizer (scikit-learn)

I am working on a multilabel text classification problem with 10 labels.
The dataset is small, roughly 7000 items and 7500 labels in total. I am using Python scikit-learn and something strange came up in the results. As a baseline I started out with CountVectorizer and was actually planning to move to the TF-IDF vectorizer, which I thought would work better. But it doesn't: with CountVectorizer I get an F1 score about 0.1 higher (0.76 vs 0.65).
I cannot wrap my head around why this could be the case.
There are 10 categories and one is called miscellaneous. This one in particular gets much lower performance with TF-IDF.
Does anyone know when TF-IDF can perform worse than plain counts?
The question is, why not? They are different solutions.
What is your dataset, how many words does it contain, how are the items labelled, and how do you extract your features?
CountVectorizer simply counts the words; if it does a good job, so be it.
There is no reason why IDF would add information for a classification task. It performs well for search and ranking, but classification needs to capture similarity, not singularities.
IDF is meant to spot the singularity of one sample versus the rest of the corpus, whereas what you are looking for is the singularity of one cluster versus the other clusters. IDF smooths out the intra-cluster TF similarity.
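If you want to check the comparison empirically, a sketch like the one below (my own illustration, not from the question) evaluates both vectorisers with the same classifier and the same cross-validation folds. Here `texts` and the binary label-indicator matrix `Y` are placeholders for your data, and LinearSVC is just one reasonable choice of classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Same classifier and same folds; only the vectoriser changes.
for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    pipe = make_pipeline(vectorizer, OneVsRestClassifier(LinearSVC()))
    scores = cross_val_score(pipe, texts, Y, cv=5, scoring="f1_micro")
    print(type(vectorizer).__name__, scores.mean())
```

Adding a TfidfVectorizer(use_idf=False) variant to the same loop would further isolate whether the IDF weighting itself, rather than the TF normalisation, is what hurts the miscellaneous category.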

Django: Storing huge matrix in table or in a file?

I use Django to manage a machine learning process. At the end of the calculation stage I am left with a huge data matrix (~50 MB of floats). Should I store it in my Django model (a binary field?) or in a file (FileField)? There seem to be pros and cons to both alternatives.
My specific case: I just need to write the data once the training is finished and load it into memory each time I want to use the learned model. No queries; I only write the whole matrix to the table/file and read the entire matrix back.
Thanks for the clarification!
I am adapting my answer to your use case.
Since you just need to write the data after each training run, you should try this:
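A minimal sketch of the FileField approach, assuming the matrix is a NumPy array and that saving it as an .npy file is acceptable (the model and helper names here are illustrative, not from the original answer):

```python
import io

import numpy as np
from django.core.files.base import ContentFile
from django.db import models


class TrainedModel(models.Model):
    # The learned matrix lives on disk as an .npy file; the table only stores its path.
    weights = models.FileField(upload_to="models/")


def save_matrix(instance, matrix):
    # One write, straight after training: serialise to an in-memory buffer,
    # then hand the bytes to the FileField.
    buf = io.BytesIO()
    np.save(buf, matrix)
    instance.weights.save("weights.npy", ContentFile(buf.getvalue()))


def load_matrix(instance):
    # Read the whole matrix back into memory whenever the learned model is needed.
    instance.weights.open("rb")
    try:
        return np.load(io.BytesIO(instance.weights.read()))
    finally:
        instance.weights.close()
```

Keeping the blob out of the database table keeps backups and migrations light, while the FileField still gives you the file's location through the ORM.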

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set that consisted of 60-80% of the data that did not have missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how it would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To improve on this, pick a classifier that performs well and then tune its parameters. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.

sequential/online kmeans clustering, how does it work? Existing codes?

I'm a little confused about online k-means clustering. I know that it allows me to cluster with just one data point at a time. But is this all limited to one session? Suppose that I have a bunch of data clustered via this method and I get the clustered result; would I be able to add more data to the clusters in the future?
I've also been looking for implementations of this, but to no avail. Does anyone know of any?
Update:
To clarify more. Here is how my code works right now:
An image is taken from the live video feed; once enough pictures are saved, I run k-means on their SIFT features.
Repeat step 1 with a new batch of live-feed pictures, run k-means again, and concatenate the new cluster centres with the previous ones, like [A B].
You can see that this is bad, because I quickly end up with too many clusters, and each batch of clusters will definitely overlap with clusters from another batch.
What I want:
An image is taken from the live video feed; once enough pictures are saved, run k-means.
Repeat step 1 and run k-means again, this time updating the existing clusters and adding new clusters to them.
Nothing that I've seen could accommodate that, unless I'm just not understanding them correctly.
If you look at the original (!) publications, the method proposed by MacQueen - where the name k-means comes from - was in fact an online algorithm. I'm not sure if MacQueen did multiple passes over the data to improve the result. I believe he used a single pass, and objects would never be reassigned to a different cluster. If so, it was already an online algorithm!
Means are commonly computed as sum / count. This is not very sensible from a numerical point of view. E.g. in the classic Knuth book you can find a method for incrementally updating means. Wikipedia has it also.
Things get slightly more complicated once you actually want to reassign earlier points. But usually in a streaming context you do not know the previous points, so you cannot do that anyway.
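To make the incremental-mean idea concrete, here is a minimal sketch of a MacQueen-style online k-means (my own illustration, not code from the publications): centres are seeded from the first k points, and each new point nudges its nearest centre using the running-mean update mu += (x - mu) / n.

```python
import numpy as np


class OnlineKMeans:
    """MacQueen-style online k-means: one point at a time, no reassignment."""

    def __init__(self, k):
        self.k = k
        self.centers = []   # cluster centres found so far
        self.counts = []    # number of points assigned to each centre

    def partial_fit(self, x):
        x = np.asarray(x, dtype=float)
        if len(self.centers) < self.k:
            # seed the first k centres from the first k points
            self.centers.append(x.copy())
            self.counts.append(1)
            return
        # assign to the nearest centre, then move it towards x by 1/n of the gap
        j = int(np.argmin([np.linalg.norm(x - c) for c in self.centers]))
        self.counts[j] += 1
        self.centers[j] += (x - self.centers[j]) / self.counts[j]
```

Because the centres and counts are all the state there is, you can persist them between sessions and keep feeding new SIFT descriptors into the same object later. If you would rather use an existing implementation, scikit-learn's MiniBatchKMeans exposes a partial_fit method that performs the same kind of batch-wise centre updates.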