I am trying to solve a problem in Pyomo. I need Constraint1 to apply only to a subset A of the arcs between nodes (sets I and II). I wrote this model:
from pyomo.environ import *
model = AbstractModel()
model.I = Set()
model.II = SetOf(model.I)
model.J = Set()
model.A = Set(model.I,model.II)
model.c = Param(model.I, model.J, default=0)
model.b = Param(model.I, default=0)
model.x = Var(model.I, model.J, within=NonNegativeReals)
model.y = Var(model.I, model.II, within=NonNegativeReals)
data = DataPortal()
data.load(filename='Data.yaml')
data.load(filename='Table.tab')
def objective_rule(model):
    return (sum(model.c[i,j]*model.x[i,j] for i in model.I for j in model.J))
model.OBJ = Objective(rule=objective_rule, sense = minimize)
def B_rule(model,i):
    Bb = sum(model.x[i,j] for j in model.J)-sum(model.y[i,ii] for ii in model.II if ii != i)+0.01*sum(model.y[ii,i] for ii in model.II if ii != i)
    return model.b[i] == Bb
model.B1 = Constraint(model.I, rule=B_rule)
def constraint1_rule(model,i,ii):
    if (i,ii) in model.A:
        return model.y[i,ii] <= 10000
    return Constraint.Skip
model.constraint1 = Constraint(model.I, model.II, rule = constraint1_rule)
instance = model.create_instance(data)
opt = SolverFactory('cplex')
opt.solve(instance)
instance.OBJ.display()
instance.x.display()
instance.y.display()
The data is presented in the file Data.yaml:
I: [1, 2, 3, 4]
J: [1, 2]
b : {1: 10000, 2: 20000, 3: 25000, 4: 22000}
c:
    - index: [1, 1]
      value: 550
    - index: [2, 2]
      value: 120
    - index: [3, 1]
      value: 650
    - index: [4, 2]
      value: 550
    - index: [1, 1]
      value: 120
    - index: [2, 2]
      value: 650
    - index: [3, 1]
      value: 650
    - index: [4, 2]
      value: 550
The two-dimensional set A is presented in the file Table.tab:
set A : 1 2 3 4 :=
1 - - + +
2 - - + +
3 + + - +
4 + + + - ;
After the solution I get the error:
Unspecified format and data option
How do I correctly represent the set A?
First, your data and sets are so small that I would seriously consider putting them directly into the base file and building a concrete model. That is much easier to troubleshoot. You could even use basic Python to read from CSV files or similar.
Anyhow, you can find a good example of working with .tab files here:
https://pyomo.readthedocs.io/en/stable/working_abstractmodels/data/dataportals.html?highlight=.tab#loading-structured-data
There are three things you need to do.
First, change your declaration of model.A so that it only states which sets its members are drawn from:
model.A = Set(within = model.I*model.I)
Second, clean up your tab file so that the set index is in the top-left corner and the extra punctuation is removed:
I 1 2 3 4
1 - - + +
2 - - + +
3 + + - +
4 + + + -
Third, update your read statement:
data.load(filename='Table.tab', set=model.A, format='set_array')
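As for the suggestion to read the data with basic Python and build a concrete model, here is a minimal sketch of that idea. It assumes the cleaned-up +/- table shown above; `parse_set_array` is a hypothetical helper name, not part of Pyomo, and the table is embedded as a string only to keep the example self-contained.

```python
# Hypothetical helper: turn a +/- membership table into a set of (row, column) pairs.
TABLE = """\
I 1 2 3 4
1 - - + +
2 - - + +
3 + + - +
4 + + + -
"""

def parse_set_array(text):
    lines = [ln.split() for ln in text.strip().splitlines()]
    columns = [int(c) for c in lines[0][1:]]   # header row, skipping the corner label
    pairs = set()
    for row in lines[1:]:
        i = int(row[0])                        # first field is the row index
        for j, flag in zip(columns, row[1:]):
            if flag == "+":                    # "+" marks membership, "-" absence
                pairs.add((i, j))
    return pairs

A = parse_set_array(TABLE)
print(sorted(A))
```

The resulting set of pairs could then be handed to a concrete model, for example with something like `Set(initialize=A, dimen=2)`.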
I would like to create a set of 3 x 3 matrices B_ij (B_11, B_12, B_13, B_21, B_22, B_23, B_31, B_32, B_33), each with 1 at the (i, j)-th entry and 0 everywhere else. For example:
B_12 = [[0,1,0],
        [0,0,0],
        [0,0,0]]
and
B_23 = [[0,0,0],
        [0,0,1],
        [0,0,0]]
I tried with the following code
for z in range(9):
    B = [[0,0,0],
         [0,0,0],
         [0,0,0]]
    for i in range(3):
        for j in range(3):
            if i==j:
                val = 1
            else:
                val = 0
            B[i][j] = val
    print B
But it does not produce the desired matrices.
Could anybody suggest the correct logic?
Thanks
If I understand your question correctly, what you are looking for is code that sets the value at a given index of the matrix, which is pretty simple.
B = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]
]
B[i][j] = 1 # This is your B_ij (i and j are zero-based indices here)
I don't think you need loops here.
If you are looking for something else, kindly rephrase the question properly.
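If the goal is the whole family B_11 ... B_33 at once, one possible reading of the question, the same idea can be sketched with a dictionary keyed by (i, j). The 1-based keys and the use of a dict are assumptions made for illustration; a fresh matrix is built for every key so the nine matrices do not share rows.

```python
n = 3

# B[(i, j)] is the n x n matrix with a single 1 at row i, column j (1-based keys).
B = {
    (i, j): [[1 if (r, c) == (i, j) else 0 for c in range(1, n + 1)]
             for r in range(1, n + 1)]
    for i in range(1, n + 1)
    for j in range(1, n + 1)
}

print(B[(1, 2)])  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(B[(2, 3)])  # [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
```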
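I am trying to solve a multilabel classification problem as follows:
import pickle
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
traindf = pickle.load(open("traindata.pkl","rb"))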
X = traindf['Col1']
X=MultiLabelBinarizer().fit_transform(X)
y = traindf['Col2']
y= MultiLabelBinarizer().fit_transform(y)
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
from sklearn.linear_model import LogisticRegression
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain,ytrain)
print "One vs rest accuracy: %.3f" % clf.score(Xvalidate,yvalidate)
This way, I always get 0 accuracy. Please point out if I am doing something wrong; I am new to multilabel classification. Here is what my data looks like:
Col1 Col2
asd dfgfg [1,2,3]
poioi oiopiop [4]
EDIT
Thanks for your help, @lejlot. I think I am getting the hang of it. Here is what I tried:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
tdf = pd.read_csv("mul.csv", index_col="DocID",error_bad_lines=False)
print tdf
So my input data looks like this:
DocID Content Tags
1 abc abc abc [1]
2 asd asd asd [2]
3 abc abc asd [1,2]
4 asd asd abc [1,2]
5 asd abc qwe [1,2,3]
6 qwe qwe qwe [3]
7 qwe qwe abc [1,3]
8 qwe qwe asd [2,3]
This is just some test data I created. Then I do:
text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, n_iter=5, random_state=42)),
])
t=TfidfVectorizer()
X=t.fit_transform(tdf["Content"]).toarray()
print X
this gives me
[[ 1. 0. 0. ]
[ 0. 1. 0. ]
[ 0.89442719 0.4472136 0. ]
[ 0.4472136 0.89442719 0. ]
[ 0.55247146 0.55247146 0.62413987]
[ 0. 0. 1. ]
[ 0.40471905 0. 0.91444108]
[ 0. 0.40471905 0.91444108]]
then
y=tdf['Tags']
y=MultiLabelBinarizer().fit_transform(y)
print y
gives me
[[0 1 0 0 1 1]
[0 0 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 1 1 1]
[0 0 0 1 1 1]
[1 1 0 1 1 1]
[1 0 1 1 1 1]]
Here I am wondering why there are 6 columns. Shouldn't there be only 3?
Anyway, I then also created a test data file:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
so this looks like
DocID Content PredTags
34 abc abc qwe [1,3]
35 asd abc asd [1,2]
36 abc abc abc [1]
I have the PredTags column to check the accuracy. So finally I fit and predict as follows:
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
which gives me
[[1 1 1 1 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]]
Now, how do I know which tags are being predicted? How can I check the accuracy against my PredTags column?
Update
Thanks a lot, @lejlot :) I also managed to get the accuracy as follows:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
ty=sdf["PredTags"]
ty = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in ty]
yt=MultiLabelBinarizer().fit_transform(ty)
Xt=t.fit_transform(sdf["Content"]).toarray()
print Xt
print yt
print "One vs rest accuracy: %.3f" % clf.score(Xt,yt)
I just had to binarize the test set's PredTags column as well :)
The actual problem is the way you work with text: you should extract some kind of features and use them as the text representation. For example, you can use a bag-of-words representation, TF-IDF, or any more complex approach.
So what is happening now? You call MultiLabelBinarizer on a list of strings, so scikit-learn treats each string as an iterable of labels and builds the label set from its elements, that is, from individual characters. So for example
from sklearn.preprocessing import MultiLabelBinarizer
X = ['abc cde', 'cde', 'fff']
print MultiLabelBinarizer().fit_transform(X)
gives you
array([[1, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1]])
        |  |  |  |  |  |  |
        v  v  v  v  v  v  v
        _  a  b  c  d  e  f
(the classes are sorted, so the space character, shown as _, comes first)
Consequently classification is nearly impossible as this does not capture any meaning of your texts.
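To see where the character classes come from without running scikit-learn, here is a plain-Python imitation of the label-collection step. It is only a sketch of the behaviour described above, not the library's actual implementation; the names `classes` and `rows` are made up for the illustration.

```python
X = ['abc cde', 'cde', 'fff']

# Iterating over a string yields its characters, so the "labels" collected
# from a list of strings are single characters (including the space).
classes = sorted(set(ch for text in X for ch in text))
print(classes)

# One indicator row per input string: 1 if the character occurs in it, else 0.
rows = [[1 if ch in text else 0 for ch in classes] for text in X]
for row in rows:
    print(row)
```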
You could, for example, use count vectorization (bag of words):
from sklearn.feature_extraction.text import CountVectorizer
print CountVectorizer().fit_transform(X).toarray()
gives you
[[1 1 0]
 [0 1 0]
 [0 0 1]]
  | | |
  v | v
abc | fff
    v
   cde
Update
Finally, to get predictions as labels rather than as their binarization, you need to keep a reference to your binarizer:
labels = MultiLabelBinarizer()
y = labels.fit_transform(y)
and later on
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
# use transform (not fit_transform) so the test data reuses the vocabulary fitted on the training data
predicted = clf.predict(t.transform(sdf["Content"]).toarray())
print labels.inverse_transform(predicted)
Update 2
If you only have three classes then the vector should have 3 elements. Yours has 6, so check what you are passing as "y"; there is probably some mistake in your data.
from sklearn.preprocessing import MultiLabelBinarizer
MultiLabelBinarizer().fit_transform([[1,2], [1], [3], [2]])
gives
array([[1, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]])
as expected.
My best guess is that your "tags" are also strings, so you actually call
MultiLabelBinarizer().fit_transform(["[1,2]", "[1]", "[3]", "[2]"])
which leads to
array([[1, 1, 1, 0, 1, 1],
[0, 1, 0, 0, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 0, 1, 1]])
| | | | | |
v v v v v v
, 1 2 3 [ ]
And these are your 6 classes: the three true ones, two "trivial" classes "[" and "]" which are always present, and a nearly trivial class "," which appears for every object belonging to more than one class.
You should convert your tags to actual lists first, for example by
y = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in y]
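Note that the character-based conversion above would split a multi-digit tag such as 10 into separate digits, and on Python 3 `map` returns a lazy iterator rather than a list. An alternative sketch, assuming every tag string is a valid Python list literal such as "[1,2]", is to parse it with the standard library's `ast.literal_eval`. The sample `tags` list is hypothetical.

```python
import ast

# Hypothetical sample of tag strings, as they might come out of the CSV file.
tags = ["[1,2]", "[1]", "[3]", "[10,2]"]

# literal_eval safely parses each string into a real Python list of ints.
y = [list(ast.literal_eval(t)) for t in tags]
print(y)  # [[1, 2], [1], [3], [10, 2]]
```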
I have a file with n lines, each containing two elements, like below:
1 2
2 3
3 4
4 5
1 6
2 7
1 8
I need to build lists like these in Python:
list[1]=[2,6,8]
list[2]=[3,7]
list[3]=[4]
list[4]=[5]
How can I do this?
Try
import pandas as pd
a = [[1,2], [2,3], [3,4], [4, 5], [1, 6], [2,7], [1,8]]
df = pd.DataFrame(a,columns=['b','c'])
print df
z = df.groupby(['b']).apply(lambda tdf:pd.Series(dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['b']])))
z = z.sort_index()
print z
print z['c'][1]
print z['c'][2]
print z['c'][3]
print z['c'][4]
z['d'] = 0.000
z[['d']] = z[['d']].astype(float)
len_b = len(z.index)
z['d'] = float(len_b)
z['e'] = 1/z['d']
z = z[['c', 'e']]
z.to_csv('your output folder')
print z
See this answer for more details: https://stackoverflow.com/a/24112443/2632856
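For comparison, the same grouping can be sketched without pandas using `collections.defaultdict`. This is an alternative approach, not the method from the answer above; the `lines` list stands in for the file contents, and with a real file you would iterate over the open file object instead.

```python
from collections import defaultdict

# Stand-in for the file contents; with a real file, loop over open(path) instead.
lines = ["1 2", "2 3", "3 4", "4 5", "1 6", "2 7", "1 8"]

groups = defaultdict(list)
for line in lines:
    key, value = map(int, line.split())  # first column is the key, second the value
    groups[key].append(value)            # values keep their order of appearance

print(dict(groups))  # {1: [2, 6, 8], 2: [3, 7], 3: [4], 4: [5]}
```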
My code is currently written as:
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - rows.count(0)
print numgood
>> 25
It always comes out as 25, so it's not just that rows contains no 0's.
Have you printed rows?
It's [[0, 1, 0, 0, 2], [1, 2, 0, 1, 2], [3, 1, 1, 1, 1], [1, 0, 0, 1, 0], [0, 3, 2, 0, 1]], so you have a nested list there.
If you want to count the number of 0's in those nested lists, you could try:
import random
convert = {0:0, 1:1, 2:2, 3:3, 4:0, 5:1, 6:2, 7:1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - sum(e.count(0) for e in rows)
print numgood
Output:
18
rows doesn't contain any zeroes; it contains lists, not integers.
>>> row = [1,2,3]
>>> type(row)
<type 'list'>
>>> row.count(2)
1
>>> rows = [[1,2,3],[4,5,6]]
>>> rows.count(2)
0
>>> rows.count([1,2,3])
1
To count the number of zeroes in any of the lists in rows, you could use a generator expression:
>>> rows = [[1,2,3],[4,5,6], [0,0,8]]
>>> sum(x == 0 for row in rows for x in row)
2
You could also use numpy:
import numpy as np
import random
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
numgood = np.count_nonzero(rows)  # already the number of non-zero entries, i.e. 25 minus the zeros
print numgood
Output:
9