Can sklearn.preprocessing.KBinsDiscretizer with strategy='quantile' drop the duplicated bins? - python-2.7

I used sklearn.preprocessing.KBinsDiscretizer(n_bins=10, encode='ordinal') to discretize my continuous feature.
The strategy is 'quantile' by default, but my data is far from uniformly distributed: about 70% of the rows are 0.
As a result I got KBinsDiscretizer.bin_edges_ = [0., 0., 0., 0., 0., 0., 0., 256., 602., 1306., 18464.].
There are many duplicate bin edges. So, is there a way to drop the duplicates in KBinsDiscretizer's bins?
KBinsDiscretizer calculates the quantiles of the input. If most of the input samples are zero, the 10-quantiles will contain multiple zeros. The result I expected is a discretizer with unique bin edges. For the example I mentioned, that is [0., 256., 602., 1306., 18464.].

That will not be possible. Set strategy='uniform' to achieve your goal.
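If quantile-style bins are still wanted, one workaround outside KBinsDiscretizer is to compute the quantile edges yourself, drop the repeated ones, and bin with NumPy. The following is only a minimal sketch under that assumption; the helper name quantile_bins_unique and the sample data are made up for illustration.

```python
import numpy as np

def quantile_bins_unique(x, n_bins=10):
    """Return de-duplicated quantile edges and ordinal bin codes for x."""
    x = np.asarray(x, dtype=float)
    # Quantile edges at 0%, 10%, ..., 100%; these may contain repeats.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # np.unique sorts the edges and removes repeated values.
    edges = np.unique(edges)
    # Bin on the interior edges only, so codes run from 0 to len(edges) - 2.
    codes = np.digitize(x, edges[1:-1], right=False)
    return edges, codes

# Made-up data: 70% zeros plus a few positive values.
x = np.array([0, 0, 0, 0, 0, 0, 0, 5, 50, 500], dtype=float)
edges, codes = quantile_bins_unique(x)
```

The number of resulting bins is then data dependent (here fewer than the 10 requested), which is the price of keeping only unique edges.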

Related

Aws RedShift sampling

For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting the performance?
Like select 100 rows randomly.
How random does it need to be? The classic way to do this is with "WHERE RANDOM() < .001". If you need a repeatable "random" set, you can add a seed. The issue is that your tables are huge, so this means reading (scanning) every row from disk just to throw most of them away. Since a table scan can take a significant amount of time, this isn't what you want to do.
So you may want to take advantage of Redshift "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table sort keys and ordering which will push the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly, and each block of data will represent about 250K rows (depending on the sort key data type, compression, etc.; it COULD range anywhere from <100K rows to 2M rows). This process takes a little inspection of STV_BLOCKLIST. The storage quantum for Redshift is the 1MB block, and the metadata of every block in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key of the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data, make sure that this sampling picks an even number from across all the slices to avoid execution skew).
Now the trick is to translate these min and max metadata values into a WHERE clause that performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data type dependent. If the data type is BIGINT then the hash is quite simple; if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved, as I've done this for just about every data type at this point.
You can even do a random sampling of rows on top of this random sampling of blocks. Or, if you want, you can just pick some narrow ranges of the sort key value and then randomly sample rows, avoiding all this reverse engineering business. The idea is to use Redshift's "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata aware in your choice of sampling windows, which often means a sort key WHERE clause. This is all about understanding how the database engine works and using its capabilities to your advantage.
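The simpler variant (a narrow sort-key window plus row sampling) could be sketched as a small query builder like the one below. This is only an illustration: the function name, the table events, and the sort key event_time are hypothetical, and the values are interpolated without escaping, so it is meant for trusted inputs only.

```python
def build_sample_query(table, sort_key, lo, hi, fraction):
    """Build a SELECT that samples rows inside a narrow sort-key window."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # The range predicate on the sort key is what allows blocks outside the
    # window to be skipped; RANDOM() then thins the rows that were read.
    return (
        "SELECT * FROM {t} "
        "WHERE {k} >= '{lo}' AND {k} < '{hi}' "
        "AND RANDOM() < {f}"
    ).format(t=table, k=sort_key, lo=lo, hi=hi, f=fraction)

# Hypothetical table and sort key: sample about 1% of one day's rows.
query = build_sample_query("events", "event_time", "2020-01-01", "2020-01-02", 0.01)
```

Running the builder over several such windows spread across the sort-key range gives a sample that is less tied to one slice of the data.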
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.

Applying word2vec to find all words above a similarity threshold

The command model.most_similar(positive=['france'], topn=100) gives the top 100 most similar words to "france". However, I would like to know if there is a method which will output the most similar words above a similarity threshold to a given word. Is there a method like the following?:
model.most_similar(positive=['france'], threshold=0.9)
No, you'd have to request a large number (or all, with topn=0) then apply the cutoff yourself.
What you request could theoretically be added as an option.
However, the cosine-similarity absolute magnitudes don't necessarily have a stable meaning, like "90% similar" across different model runs. Their distribution can vary based on model training parameters, such as the vector size, and they are most-often interpreted only in ranked-comparison to other pairwise values from the same model.
For example, the composition of the top-100 most-similar words for 'cold' may be very similar in models with different training parameters, but the range of absolute similarity values for the #1 to #100 words can be quite different. So if you were picking an absolute threshold, you'd likely want to vary the cutoff based on observing the model, or along with other model training metaparameters.
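One way to make a cutoff less dependent on a particular model's similarity scale is to define it relative to the best score for the query word. The sketch below does this on plain NumPy vectors; it is a heuristic assumed here for illustration, not a gensim feature, and the function name, words, and vectors are made up.

```python
import numpy as np

def similar_relative(vectors, words, query, ratio=0.9):
    """Return (word, cosine similarity) pairs within `ratio` of the best match."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = words.index(query)
    sims = unit.dot(unit[q])
    sims[q] = -np.inf  # never return the query word itself
    # Assumes the best similarity is positive, so the cutoff lies below it.
    cutoff = ratio * sims.max()
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order if sims[i] >= cutoff]

# Toy 2-d "embeddings", for illustration only.
words = ["france", "spain", "italy", "banana"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])
hits = similar_relative(vectors, words, "france", ratio=0.9)
```

The ratio still has to be chosen by inspecting the model, but it moves with the model's own range of similarity values instead of being a fixed absolute number.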
Well, let's say you can. Try the following code:
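def find_most_similar(model, wrd, threshold=0.75):
    # Request similarities for the whole vocabulary, then keep those above the cutoff.
    # model.wv.vocab is the gensim < 4.0 attribute; in gensim 4+ use len(model.wv.key_to_index).
    res = [item for item in model.wv.most_similar(wrd, topn=len(model.wv.vocab)) if item[1] > threshold]
    return res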

How to detect and delete noise in rapidminer?

I am new to RapidMiner 5 and just want to know how to find noise in my data, show it in a chart, and delete it.
This is a complex problem because it depends on what you mean by noise.
If you mean finding individual attributes whose values are plain wrong then you could plot a histogram view and work out some sort of limits on what constitutes a valid value. You could then impose that rule by using Filter Examples to remove them.
If you mean finding attributes that have some sort of random jitter applied to them it would be difficult to detect these. Only by knowing beforehand what the expected shape of the distribution is could you compare with observation and do something about it. However, the action to take is by no means obvious.
If you mean finding examples within an example set that are obviously different from other examples then you could consider using the various outlier functions. The simplest one to get started is Detect Outlier (Distances). This finds a set number of outliers (default 10) based on a distance calculation that uses all the attributes for examples. It creates a new attribute called outlier that is set to true or false. You could then use the Filter Examples operator to remove those that are set to true.
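To make the idea behind a distance-based outlier operator concrete, here is a rough NumPy sketch: score each example by the distance to its k-th nearest neighbour over all attributes, flag the n highest scores, and filter them out. It is an approximation written for illustration, not RapidMiner's implementation; the function name and the data are hypothetical.

```python
import numpy as np

def flag_distance_outliers(X, n_outliers=10, k=10):
    """Flag the n_outliers rows farthest from their k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small data; O(n^2) memory).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # After sorting each row, column 0 is the distance to itself (0),
    # so column k is the distance to the k-th nearest other row.
    k = min(k, len(X) - 1)
    kth = np.sort(dist, axis=1)[:, k]
    flags = np.zeros(len(X), dtype=bool)
    flags[np.argsort(-kth)[:n_outliers]] = True
    return flags

# Made-up data: a tight cluster plus one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
outlier = flag_distance_outliers(X, n_outliers=1, k=2)
cleaned = X[~outlier]  # analogous to filtering on outlier == false
```

Because the number of outliers is fixed in advance, this always flags exactly that many examples, whether or not they are truly unusual, so the flagged rows are worth inspecting before deleting them.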
Hope that helps at least as a start.

Words to keep attribute in StringToWordVector filter in weka

What is the meaning of the words to keep attribute in Weka's StringToWordVector filter? Is it better to have a higher value or not, for getting real results?
In general, it is a good idea to set the limit as high as possible in order to retain as many words as possible. Words with small frequencies can marginally help the classifiers you induce later.
Keeping too many words may look like a bad idea for efficiency reasons - the higher the number of attributes, the longer it will take to learn the model. However, you can filter the words to keep the most predictive ones using the AttributeSelection filter with the Ranker search method and the InfoGainAttributeEval measure. In fact, you can play with the threshold in the AttributeSelection filter in order to keep a relatively small number of very predictive words, regardless of their relative frequency.
Additionally, do not forget to set the flag doNotOperateOnPerClassBasis to true in order to keep all the words relevant to all classes.
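To illustrate what ranking words by information gain means, here is a small NumPy sketch that scores each word by how much knowing its presence reduces the entropy of the class label, then keeps the top few. It is a simplified stand-in written for illustration (presence/absence only), not Weka's implementation; all names and data are made up.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return float(-(p * np.log2(p)).sum())

def info_gain(present, labels):
    """H(class) - H(class | word present or absent)."""
    present = np.asarray(present, dtype=bool)
    labels = np.asarray(labels)
    gain = entropy(labels)
    for mask in (present, ~present):
        if mask.any():
            # Weight each branch by the fraction of documents it covers.
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def top_words(doc_term, labels, vocab, n_keep):
    """Return the n_keep words with the highest information gain."""
    gains = [info_gain(doc_term[:, j] > 0, labels) for j in range(len(vocab))]
    order = np.argsort(gains)[::-1][:n_keep]
    return [vocab[j] for j in order]

# Toy document-term counts: rows are documents, columns are words.
vocab = ["free", "the", "meeting"]
doc_term = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 0]])
labels = np.array(["spam", "spam", "ham", "ham"])
kept = top_words(doc_term, labels, vocab, n_keep=2)
```

In this toy data the word that appears in every document scores zero, while a word appearing in only one document can still carry some information, which is why ranking by predictiveness can keep rare words that a frequency cutoff would drop.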

Filtering Attributes with Weka

I have a simple question about filtering attributes in WEKA.
Let's say I have 500 attributes, 30 classes, and 100 samples for each class, which equals 3000 rows and 500 columns. This causes time and memory problems, as you can guess.
How do I filter out attributes that occur only once or twice (or n times) in 3000 rows? And is it a good idea?
Thank you
Use the following filter:
weka.filters.unsupervised.attribute.RemoveUseless
This filter removes attributes that do not vary at all or that vary too much. All constant attributes are deleted automatically, along with any that exceed the maximum percentage of variance parameter.
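For the "occurs only n times" part of the question, the filtering rule itself is easy to state in code. The NumPy sketch below drops columns that are constant or that are non-zero in fewer than min_count rows; it only illustrates the rule and is not a Weka filter, and the function name and threshold are assumptions.

```python
import numpy as np

def drop_rare_and_constant(X, min_count=3):
    """Return (reduced matrix, indices of the columns that were kept)."""
    X = np.asarray(X)
    # Number of rows in which each attribute is non-zero.
    nonzero_rows = (X != 0).sum(axis=0)
    # An attribute varies if any row differs from the first row.
    varies = (X != X[0]).any(axis=0)
    keep = np.flatnonzero((nonzero_rows >= min_count) & varies)
    return X[:, keep], keep

# Made-up data: column 0 is constant, column 1 occurs once, column 3 never.
X = np.array([[1, 0, 2, 0],
              [1, 0, 0, 0],
              [1, 3, 5, 0],
              [1, 0, 7, 0]])
reduced, kept = drop_rare_and_constant(X, min_count=3)
```

Whether dropping rare attributes is a good idea depends on the data: an attribute that occurs only a few times can still be strongly tied to one class, so it may be worth checking predictiveness before discarding it.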