Multi-label text classification with scikit-learn - python-2.7

I'm new to machine learning and I'm having trouble adapting any examples that I've found to my specific problem. The official documentation for scikit is rather spartan and full of terminology I'm unfamiliar with, so I'm not really sure which algorithm I should be using, how to properly prepare my data for it, and how to get the predictions in the form I want.
I already have my feature extraction function for the text in place, which returns a tuple of floats ranging from 0.0 to 100.0. These represent the prevalence of a certain characteristic in the text as a percentage. So my features for a certain piece of text would look something like (0.0, 17.31, 57.0, 93.2, ...). I'm unsure of which algorithm would be the most suitable for this type of data.
As per the title, I also need the ability to classify a piece of text using more than one label. Reading some other SO questions clued me in that I need to use MultiLabelBinarizer and OneVsRestClassifier, but I'm still unsure how to apply them to my data and to whichever algorithm I end up using.
I also didn't find any examples that would return prediction results for the multiple labels in the form I want them. That is, instead of a binary "is or isn't this label", I'd like a percentage chance that the text is of a certain label. So when doing something like classifier.predict(testData) I'd like a return like {"spam":87.3, "code":27.9, "urlList":3.12} instead of something like ["spam", "code", "urlList"]. That way I can make more precise decisions about what to do with a certain text.
I should probably also mention one characteristic of the dataset that I'm using, and that is that 85-90% of the text will be code, and therefore only have one tag, "code". I imagine there are some tweaks to the algorithm required to account for this?
Some simplified and probably unsuitable code:
possibleLabels = ["code", "spam", "urlList"]
trainData, trainLabels = [ (0.0, 17.31, 57.0, 93.2), ... ], [ ["spam"], ["code"], ["code", "urlList"], ... ]
testData, testLabels = [], [] # Separate batch of samples in the same format as above
# Not sure if this is the proper way to prepare my labels,
# nor how to later resolve the binarized versions to their string counterparts.
mlb = preprocessing.MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)
# Feels like I need more to make it suitable for my data
classifier = OneVsRestClassifier()
classifier.fit(trainData, fitTrainLabels)
# Need return as a list of dicts containing probability of tags, ie. [ {"spam":87.3, "code":27.9, "urlList":3.12}, {...}, ... ]
predicted = classifier.predict(testData)
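For reference, a minimal sketch of how these pieces could fit together to produce per-label percentages might look like the following. The choice of LogisticRegression as the base estimator and the tiny inline training data are illustrative assumptions, not recommendations for this dataset:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

possibleLabels = ["code", "spam", "urlList"]
trainData = [(0.0, 17.31, 57.0, 93.2), (12.5, 0.4, 3.1, 10.0), (5.0, 2.2, 88.0, 41.6)]
trainLabels = [["spam"], ["code"], ["code", "urlList"]]
testData = [(1.0, 20.0, 60.0, 90.0)]

# Binarize the label lists into an indicator matrix; mlb.classes_ keeps the
# column order so predictions can be mapped back to label strings later.
mlb = MultiLabelBinarizer(classes=possibleLabels)
binaryLabels = mlb.fit_transform(trainLabels)

# One binary classifier per label; the base estimator must expose
# predict_proba to get percentage-style output.
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(trainData, binaryLabels)

# predict_proba returns one probability per label per sample.
probabilities = classifier.predict_proba(testData)
predicted = [
    {label: round(float(p) * 100, 2) for label, p in zip(mlb.classes_, row)}
    for row in probabilities
]
print(predicted)  # one dict of label -> percentage per test sample

Mapping the predict_proba columns back through mlb.classes_ is what turns the binarized output into the original label strings.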

Regex for XAML Formatting

I'm attempting to build a PowerShell CmdLet that can parse and cleanly reformat a chunk of XAML or any other markup language.
So far, I've had to build an assortment of CmdLets so that I can get the correct information to feed into this thing (indentation, counts, items, child items, and so forth).
What I'm attempting to do is collect ALL of the properties and values in a set of XAML/HTML, etc., and once I have the lengths of all those variables, I can start to chunk them out and properly format them so that they all output down a straight line. It may not make a lot of sense as I describe it, so here's an example.
<Window xmlns = 'http://schemas.microsoft.com/winfx/2006/xaml/presentation'
        xmlns:x = 'http://schemas.microsoft.com/winfx/2006/xaml'
        Title = 'Window Title'
        Height = '600'
        MinHeight = '600'
        Width = '800'
        MinWidth = '800'
        BorderBrush = 'Black'
        ResizeMode = 'CanResize'
        HorizontalAlignment = 'Center'
        WindowStartupLocation = 'CenterScreen'>
The reason I am attempting to build this is so that I can programmatically save the instructions to a smaller footprint. So, instead of having fluctuating numbers for each line and item, with the end result looking like this...
<Window xmlns='http://schemas.microsoft.com/winfx/2006/xaml/presentation'
xmlns:x = 'http://schemas.microsoft.com/winfx/2006/xaml' Title = 'Window Title' Height = '600'
MinHeight = '600' Width = '800' MinWidth = '800' BorderBrush = 'Black' ResizeMode = 'CanResize'
HorizontalAlignment = 'Center' WindowStartupLocation = 'CenterScreen'>
...I then have a set of instructions that can vectorize the content of the XAML so that it has a pattern and less randomness. Sure, the line count might expand quite a bit, but there's no need to be concerned with that if all it is doing is expanding into RAM, which is the point of it.
At any rate, the code that I am having trouble with is essentially a way to preserve the spacing between the quoted values. I feel like I'm beating my head against a wall trying to get this to work correctly when I know it's just a matter of regex...
I've posted the code I'm talking about via this link.
https://github.com/secure-digits-plus-llc/FightingEntropy/blob/master/Format-XAML.ps1
Lines 43-147
It is a script block, and testing with it requires a Xaml Here String.
Any suggestions would be appreciated. I'm not much of a Regex fan, I understand some basics to it but I'm not that great with it yet.
-MC
Found the answer I was looking for.
Not the most eloquent way to solve the issue I was having, but it works.
"(?<=\').+?(?=\')"
When the lines are split and you want to preserve the spacing within the quotes, you need a pattern like the one above.
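If it helps to sanity-check the pattern outside PowerShell, here is a small illustration using Python's re module (the sample line is made up); the same lookbehind/lookahead syntax works in the .NET regex engine:

import re

# One split XAML attribute line, with spacing we want to keep inside the quotes.
line = "Title                 = 'Window Title'"

# (?<=\') and (?=\') anchor on the quotes without consuming them, and the lazy
# .+? grabs only the text between the nearest pair of quotes.
pattern = re.compile(r"(?<=\').+?(?=\')")

print(pattern.findall(line))  # ['Window Title'] - inner spacing untouched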
I was attempting to iterate through a do loop until the array/string contained two single quotes, but what kept happening was the regex saying, in effect: 'oh, I thought you wanted to match 'adbhjikvgrfe' with '21345rfs'.'
No, regex, I wasn't looking to match that. Sigh.
Then it was stripping the spacing out from between the quotes.
Sigh.
I gotta say... anyone who truly writes good programs? I tip my hat to you, good sir/ma'am, because it's a frustrating job. For certain.

Why does the output of model.wv.similarity() in Word2Vec differ from model.wv.most_similar()?

I have trained a Word2Vec model and I am trying to use it.
When I ask for the words most similar to '动力', I get output like this:
动力系统 0.6429724097251892
驱动力 0.5936785936355591
动能 0.5788494348526001
动力车 0.5579575300216675
引擎 0.5339343547821045
推动力 0.5152761936187744
扭力 0.501279354095459
新动力 0.5010953545570374
支撑力 0.48610919713974
精神力量 0.47970670461654663
But the problem is that if I call model.wv.similarity('动力','动力系统') I get the result 0.0, which does not match
0.6429724097251892
What confused me even more is that the similarity of the word '动力' and the word '驱动力' comes out as
3.689349e+19
So why? Have I misunderstood what similarity() does? I need someone to tell me!
And the code is:
res = model.wv.most_similar('动力')
for r in res:
    print(r[0], r[1])
print(model.wv.similarity('动力','动力系统'))
print(model.wv.similarity('动力','驱动力'))
print(model.wv.similarity('动力','动能'))
output:
动力系统 0.6429724097251892
驱动力 0.5936785936355591
动能 0.5788494348526001
动力车 0.5579575300216675
引擎 0.5339343547821045
推动力 0.5152761936187744
扭力 0.501279354095459
新动力 0.5010953545570374
支撑力 0.48610919713974
精神力量 0.47970670461654663
0.0
3.689349e+19
2.0
I have written a function to replace the model.wv.similarity method.
def Similarity(w1, w2, model):
    # Cosine similarity between the vectors for w1 and w2.
    A = model[w1]
    B = model[w2]
    return sum(A * B) / (pow(sum(pow(A, 2)), 0.5) * pow(sum(pow(B, 2)), 0.5))
Here w1 and w2 are the words you input and model is the Word2Vec model you have trained.
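As a quick sanity check, the same cosine can be computed with numpy's vector routines (a sketch; it assumes both words are in the model's vocabulary):

import numpy as np

def cosine(w1, w2, model):
    """Cosine similarity between the vectors for w1 and w2, via numpy."""
    a, b = model.wv[w1], model.wv[w2]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine('动力', '动力系统', model))  # should agree with Similarity() above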
Using the similarity method directly from the model is deprecated. It has a bit of extra logic in it that performs vector normalization before evaluating the result.
You should be using model.wv directly because, as stated in the documentation, for the word vectors it does not matter how they were trained; they should be treated as an independent structure, and the model is just the means of obtaining it.
Here is a short discussion which should give you starting points if you want to investigate further.
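In other words, query the KeyedVectors object rather than the full model. A minimal sketch (the file name is purely illustrative):

from gensim.models import Word2Vec

model = Word2Vec.load("my_word2vec.model")  # illustrative path
word_vectors = model.wv  # KeyedVectors: the trained word vectors, detached from the training machinery

print(word_vectors.similarity('动力', '动力系统'))
print(word_vectors.most_similar('动力', topn=5))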
It may be an encoding issue, where you are not actually comparing the same tokens.
Try the following, to see if it gives results closer to what you expect.
res = model.wv.most_similar('动力')
for r in res:
    print(r[0], r[1])
print(model.wv.similarity('动力', res[0][0]))
print(model.wv.similarity('动力', res[1][0]))
print(model.wv.similarity('动力', res[2][0]))
If it does, you could look further into why the model might be reporting strings which print as 动力系统 (etc), but don't match your typed-in-code string literals like '动力系统' (etc). For example:
print(res[0][0]=='动力系统')
print(type(res[0][0]))
print(type('动力系统'))

Getting significance level, alpha, from KS test results?

I am trying to find the significance level/alpha level (to eventually get the confidence level) of my Kolmogorov-Smirnov test results and I feel like I'm going crazy because this doesn't seem explained well enough anywhere (in a way that I understand.)
I have sample data that I want to see if it comes from one of four probability distribution functions: Cauchy, Gaussian, Students t, and Laplace. (I am not doing a two-sample test.)
Here is sample code for Cauchy:
import scipy.stats as ss

### Cauchy Distribution Function
data = [-1.058, 1.326, -4.045, 1.466, -3.069, 0.1747, 0.6305, 5.194, 0.1024, 1.376, -5.989, 1.024, 2.252, -1.451, -5.041, 1.542, -3.224, 1.389, -2.339, 4.073, -1.336, 1.081, -2.573, 3.788, 2.26, -0.6905, 0.9064, -0.7214, -0.3471, -1.152, 1.904, 2.082, -2.471, 0.6434, -1.709, -1.125, -1.607, -1.059, -1.238, 6.042, 0.08664, 2.69, 1.013, -0.7654, 2.552, 0.7851, 0.5365, 4.351, 0.9444, -2.056, 0.9638, -2.64, 1.165, -1.103, -1.624, -1.082, 3.615, 1.709, 2.945, -5.029, -3.57, 0.6126, -2.88, 0.4868, 0.4222, -0.2062, -1.337, -0.326, -2.784, 6.724, -0.1316, 4.681, 6.839, -1.987, -5.372, 1.522, -2.347, 0.4531, -1.154, -3.631, 0.426, -4.271, 1.687, -1.612, -1.438, 0.8777, 0.06759, 0.6114, -1.296, 0.07865, -1.104, -1.454, -1.62, -1.755, 0.7868, -3.312, 1.054, -2.183, -7.066, -0.04661, 1.612, 1.441, -1.768, -0.2443, -0.7033, -1.16, 0.2529, 0.2441, -1.962, 0.568, 1.568, 8.385, 0.7192, -1.084, 0.9035, 3.376, -0.7172, -0.1221, 3.267, 0.4064, -0.4894, -2.001, 1.63, -2.891, 0.6244, 2.381, -1.037, -1.705, -0.5223, -0.2912, 1.77, -3.792, 0.1716, 4.121, -0.9119, -0.1166, 5.694, -5.904, 0.5485, -2.788, 2.582, -1.553, 1.95, 3.886, 1.066, -0.475, 0.5701, -0.9367, -2.728, 4.588, -5.544, 1.373, 1.807, 2.919, 0.8946, 0.6329, -1.34, -0.6154, 4.005, 0.204, -1.201, -4.912, -4.766, 0.0554, 3.484, -2.819, -5.131, 2.108, -1.037, 1.603, 2.027, 0.3066, -0.3446, -1.833, -2.54, 2.828, 4.763, 0.9926, 2.504, -1.258, 0.4298, 2.536, -1.214, -3.932, 1.536, 0.03379, -3.839, 4.788, 0.04021, -0.2701, -2.139, 0.1339, 1.795, -2.12, 5.558, 0.8838, 1.895, 0.1073, 2.011, -1.267, -1.08, -1.12, -1.916, 1.524, -1.883, 5.348, 0.115, -1.059, -0.4772, 1.02, -0.4057, 1.822, 4.011, -3.246, -7.868, 2.445, 2.271, 0.5377, 0.2612, 0.7397, -1.059, 1.177, 2.706, -4.805, -0.7552, -4.43, -0.4607, 1.536, -4.653, -0.5952, 0.8115, -0.4434, 1.042, 1.179, -0.1524, 0.2753, -1.986, -2.377, -1.21, 2.543, -2.632, -2.037, 4.011, 1.98, -2.589, -4.9, 1.671, -0.2153, -6.109, 2.497]
def C(data):
    stuff = []
    # vary gamma (the scale parameter)
    for scale in xrange(1, 101, 1):
        ks_statistic, pvalue = ss.kstest(data, "cauchy", args=(scale,))
        stuff.append((ks_statistic, pvalue, scale))
    bestks = min(c[0] for c in stuff)
    bestrow = [row for row in stuff if row[0] == bestks]
    return bestrow
I am trying to fit this function to my data, and to return the scale parameter (gamma) that corresponds to the highest probability of being fit with a Cauchy Distribution. The corresponding ks-statistic and p-value also get returned. I thought that this would be done by finding the minimum ks-statistic, which would be the curve that yields the smallest distance between any given data point and distribution-curve point. I realize that I need, though, to find "alpha" so that I can find my probability that the sample data is from a Cauchy Distribution, with the specified scale/gamma value I found.
I have referenced many sources trying to explain how to find "alpha", but I have no clue how to do this in my code.
Thank you for any help and insight!
I think this question is actually outside the scope of SO because it involves statistics; you would probably do better to ask on, say, Cross Validated. However, let me offer one or two remarks.
The K-S is used for testing whether a given set of data has arisen from a given, fully specified distribution function. (Even for this purpose it might not be optimal.) It's not intended, as far as I know, as a measure of fit amongst alternatives.
To make inferences about probabilities one must have a viable probability model for the data in the first place. In this case, what is the space of alternatives and how are probabilities assigned to them under the null and alternative hypotheses?
Now, to get back to that unhelpful comment that I offered (thanks for being so tactful about it!), this is what I was trying to express.
You try scales from 1 to 100 in unit steps; I wanted to point out that scales less than one produce curious results. Now I see some close fits, especially when p-values are considered; there's nothing to tell them apart from the one for scale=2. Here's a plot.
Each triple gives (scale, K-S, p).
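If you want to reproduce that kind of scan yourself, something along these lines would do it (a sketch: the fractional scale grid and the plotting choices are mine, and data is the list defined in the question):

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

# Scan scale values below and above 1, collecting (scale, K-S statistic, p-value).
scales = np.arange(0.1, 5.0, 0.1)
triples = []
for scale in scales:
    ks_statistic, pvalue = ss.kstest(data, "cauchy", args=(scale,))
    triples.append((scale, ks_statistic, pvalue))

plt.plot([t[0] for t in triples], [t[1] for t in triples])
plt.xlabel("scale (gamma)")
plt.ylabel("K-S statistic")
plt.show()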
The main thing might be, what do you want from your data?

Importing unfriendly-formatted data in Excel and forcing messy values as column names

I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
                        sheet = "Quick Lookup", verbose = TRUE)
Naturally, the imported data frame doesn't look good:
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
                                 unique = TRUE, allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is whether there is a way to neatly force the values from the first row into the column names. As I'm working through a lot of data, I'm looking for a solution that is easily reproducible. I can accept a lot of mangling of the actual strings to get syntactically correct names, but ideally I would avoid faffing around with elaborate regular expressions, as I'm often reading files like the one linked here and don't want to be forced to adjust the rules for each single import.
It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument, but a more general way of dealing with this using read.xls seems to be to use the pattern and header arguments, which force the first line that matches the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
                        sheet = "Quick Lookup", verbose = TRUE,
                        pattern = "DATAZONE", header = TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data can be OS dependent; I'm using Windows 8.1. Also, I'm using Strawberry Perl, and you appear to be using something else. So there are some possible reasons for the discrepancy in warning messages, but nothing more specific.
On the second question in your comment, to read the entire file, and convert a particular row ( in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
                        sheet = "Quick Lookup", verbose = TRUE,
                        header = FALSE, stringsAsFactors = FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
                                 unique = TRUE, allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.

Avoiding pandas chained selection

I'm trying to determine the "best practice" for doing the following without incurring a SettingWithCopyWarning. I'm using Python 2.7 and pandas 0.15.2.
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df[df['color'] == 'blue']

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (i.e. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more I think that I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions to do little transformations on the df and chain them.
Transposition, for example, gives a new dataframe (df.T), so there's no need to call .copy().
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.
How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df.loc[df['color'] == 'blue', :].copy()

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with numpy's where():
import numpy as np

cars['price'] = np.where( cars['color'] == 'blue', 10000, np.nan )

  color make  price
0  blue Ford  10000
1  blue  BMW  10000
2   red Ford    NaN
You can also nest where(), so it's really a very powerful and simple method for conditional setting like this; see the nested sketch after the next snippet. You can also use ix/loc/iloc (though you need to create an empty column for 'price' first):
cars.ix[ cars.color == 'blue', 'price' ] = 10000
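And a nested where(), as mentioned above, might look something like this (the 5000 price for red cars is made up purely for illustration):

import numpy as np

# Blue cars get 10000, red cars get 5000, anything else becomes NaN.
cars['price'] = np.where(cars['color'] == 'blue', 10000,
                         np.where(cars['color'] == 'red', 5000, np.nan))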
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left hand side when setting values:
df[ df.y > 5 ]['x'] = df['z']
this is OK though:
df['x'] = df[ df.y > 5 ]['z']
That's because the result of chained indexing may be a copy rather than a reference, which will cause the former to fail but not the latter. You can also get around this by using ix/loc/iloc.
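For example, a single .loc call keeps the row selector and the column label together on the left-hand side, so the assignment writes into the original frame rather than into a possible copy (a sketch with made-up columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.zeros(6), 'y': range(3, 9), 'z': range(10, 16)})

# Same intent as the failing chained version above, but done in one .loc call.
df.loc[df['y'] > 5, 'x'] = df['z']
print(df)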