How to print numpy matrix nicely with text headers - python - python-2.7

I have a question on python:
how can I print a matrix nicely with headers, like this:
T C G C A
[0 -2 -4 -6 -8 -10]
T [-2 1 -1 -3 -5 -7]
C [-4 -1 2 0 -2 -4]
C [-6 -3 0 1 1 -1]
A [-8 -5 -2 -1 0 2]
I've tried to print it with numpy.matrix(mat)
But all I got was:
[[ 0 -2 -4 -6 -8 -10]
[ -2 1 -1 -3 -5 -7]
[ -4 -1 2 0 -2 -4]
[ -6 -3 0 1 1 -1]
[ -8 -5 -2 -1 0 2]]
And I also didn't succeed in adding the headers.
Thanks!!!
Update
Thank you all.
I've succeeded in installing pandas, but I have 2 new problems.
here is my code:
import pandas as pd
col1 = [' ', 'T', 'C', 'G', 'C', 'A']
col2 = [' ', 'T', 'C', 'C', 'A']
df = pd.DataFrame(mat,index = col2, columns = col1)
print df
But I get this error:
df = pd.DataFrame(mat,index = col2, columns = col1)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 163, in __init__
copy=copy)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 224, in _init_ndarray
return BlockManager([block], [columns, index])
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 237, in __init__
self._verify_integrity()
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 313, in _verify_integrity
union_items = _union_block_items(self.blocks)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 906, in _union_block_items
raise Exception('item names overlap')
Exception: item names overlap
And when I try to change the letters, it works:
T B G C A
0 -2 -4 -6 -8 -10
T -2 1 -1 -3 -5 -7
C -4 -1 2 0 -2 -4
C -6 -3 0 1 1 -1
A -8 -5 -2 -1 0 2
but as you can see, the layout of the matrix is not quite right.
How can I fix these problems?

Numpy does not provide such functionality out of the box.
(a) pandas
You may look into pandas. Printing a pandas.DataFrame usually looks quite nice.
import numpy as np
import pandas as pd
cols = ["T", "C", "S", "W", "Q"]
a = np.random.randint(0,11,size=(5,5))
df = pd.DataFrame(a, columns=cols, index=cols)
print df
will produce
T C S W Q
T 9 5 10 0 0
C 3 8 0 7 2
S 0 2 6 5 8
W 4 4 10 1 5
Q 3 8 7 1 4
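If you just want the formatted table as plain text (e.g. to write it to a file), DataFrame.to_string() returns the same layout as a string; a small sketch, reusing a and cols from above:
s = pd.DataFrame(a, columns=cols, index=cols).to_string()
print s  # same table, now available as a plain string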
(b) pure python
If you only have pure python available, you can use the following function.
import numpy as np

def print_array(a, cols, rows):
    # Print a 2D integer array with column and row labels, working only
    # from numpy's own repr of the array.
    if (len(cols) != a.shape[1]) or (len(rows) != a.shape[0]):
        print "Shapes do not match"
        return
    # Strip the "array(...)" wrapper and the 6-space continuation indent.
    s = a.__repr__()
    s = s.split("array(")[1]
    s = s.replace("      ", "")
    s = s.replace("[[", " [")
    s = s.replace("]])", "]")
    # Column positions are read off the commas in the first line.
    pos = [i for i, ltr in enumerate(s.splitlines()[0]) if ltr == ","]
    pos[-1] = pos[-1] - 1
    empty = " " * len(s.splitlines()[0])
    s = s.replace("],", "]")
    s = s.replace(",", "")
    # Prepend one row label per line.
    lines = []
    for i, l in enumerate(s.splitlines()):
        lines.append(rows[i] + l)
    s = "\n".join(lines)
    # Build the header line, shifting each label left by the number of
    # commas removed before it.
    empty = list(empty)
    for i, p in enumerate(pos):
        empty[p - i] = cols[i]
    s = "".join(empty) + "\n" + s
    print s

c = [" ", "T", "C", "G", "C", "A"]
r = [" ", "T", "C", "C", "A"]
a = np.random.randint(-4, 15, size=(5, 6))
print_array(a, c, r)
giving you
T C G C A
[ 2 5 -3 7 1 9]
T [-3 10 3 -4 8 3]
C [ 6 11 -2 2 5 1]
C [ 4 6 14 11 10 0]
A [11 -4 -3 -4 14 14]

Consider a sample array -
In [334]: arr = np.random.randint(0,25,(5,6))
In [335]: arr
Out[335]:
array([[24, 8, 6, 10, 5, 11],
[11, 5, 19, 6, 10, 5],
[ 6, 2, 0, 12, 6, 17],
[13, 20, 14, 10, 18, 9],
[ 9, 4, 4, 24, 24, 8]])
We can use a pandas dataframe, like so -
import pandas as pd
In [336]: print pd.DataFrame(arr,columns=list(' TCGCA'),index=list(' TCCA'))
T C G C A
24 8 6 10 5 11
T 11 5 19 6 10 5
C 6 2 0 12 6 17
C 13 20 14 10 18 9
A 9 4 4 24 24 8
Note that a pandas dataframe expects headers (column IDs) and indexes for all rows and columns. So, to skip those for the first row and column, we have used label lists whose first entry is blank: ' TCGCA' and ' TCCA'.
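As for the 'item names overlap' error in the question: that pandas version rejects duplicate labels, and both label lists contain 'C' twice. One possible workaround (a sketch, not verified on that exact pandas version) is to pad the duplicates with trailing spaces, so every label is a unique string but prints as the same character:
cols = [' ', 'T', 'C', 'G', 'C ', 'A']   # second 'C' padded with a space
rows = [' ', 'T', 'C', 'C ', 'A']        # second 'C' padded with a space
df = pd.DataFrame(mat, index=rows, columns=cols)
print df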

Here's a quick version of adding labels with plain Python and numpy
Define a function that writes lines. Here it just prints the lines, but it could be set up to print to a file, or to collect all the lines in a list and return that.
def pp(arr, lbl):
    # header row: the labels, space separated
    print(' ', ' '.join(lbl))
    # one labelled line per row
    for i in range(arr.shape[0]):
        print('%s %s' % (lbl[i], arr[i]))
In [65]: arr=np.arange(16).reshape(4,4)
The default display for a 2d array:
In [66]: print(arr)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
In [67]: lbl=list('ABCD')
In [68]: pp(arr,lbl)
A B C D
A [0 1 2 3]
B [4 5 6 7]
C [ 8 9 10 11]
D [12 13 14 15]
Spacing is off because numpy formats each line separately, applying a different element width to each row. But it's a start.
It looks better with a random sample:
In [69]: arr = np.random.randint(0,25,(4,4))
In [70]: arr
Out[70]:
array([[24, 12, 12, 6],
[22, 16, 18, 6],
[21, 16, 0, 23],
[ 2, 2, 19, 6]])
In [71]: pp(arr,lbl)
A B C D
A [24 12 12 6]
B [22 16 18 6]
C [21 16 0 23]
D [ 2 2 19 6]
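If the uneven spacing bothers you, a possible fix (a sketch; pp_aligned is a hypothetical name) is to let numpy format the whole array once, so every row shares the same element widths, and then prepend the labels line by line:
def pp_aligned(arr, lbl):
    # str(arr) formats the whole array at once, so all rows share one
    # element width (unlike indexing row by row as pp does).
    s = str(arr).replace('[[', ' [').replace(']]', ']')
    print('   ' + '  '.join(lbl))   # header row; alignment is approximate
    for label, line in zip(lbl, s.splitlines()):
        print('%s %s' % (label, line.strip()))
The header spacing is still approximate for wide elements, but the rows now line up.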

Related

Python Dask - vertical concatenation of 2 DataFrames

I am trying to vertically concatenate two Dask DataFrames.
I have the following Dask DataFrame:
import pandas as pd
import dask.dataframe as dd

d = [
    ['A', 'B', 'C', 'D', 'E', 'F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)
Here is the data as a Pandas DataFrame
A B C D E F
0 1 4 8 1 3 5
1 6 6 2 2 0 0
2 9 4 5 0 6 35
3 0 1 7 10 9 4
4 0 7 2 6 1 2
Here is the Dask DataFrame
Dask DataFrame Structure:
A B C D E F
npartitions=4
0 int64 int64 int64 int64 int64 int64
1 ... ... ... ... ... ...
2 ... ... ... ... ... ...
3 ... ... ... ... ... ...
4 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks
I am trying to concatenate 2 Dask DataFrames vertically:
ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)
but I get this error:
Traceback (most recent call last):
...
File "...", line 572, in concat
raise ValueError('All inputs have known divisions which cannot '
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
However, if I try:
dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)
then it appears to be working. Is there a problem with setting this to True (in terms of performance - speed)? Or is there another way to vertically concatenate 2 Dask DataFrames?
If you inspect the divisions of the dataframe with ddf.divisions, you will find (assuming one partition) that it holds the edges of the index: (0, 4). This is useful to dask: when you perform an operation on the data, it knows it can skip any partition that cannot contain the required index values. This is also why some dask operations are much faster when the index is appropriate for the job.
When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the values of the index had different ranges in the two partitions.
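A minimal sketch of inspecting those divisions (hypothetical one-partition frame):
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": range(5)})       # default index 0..4
ddf = dd.from_pandas(df, npartitions=1)
print(ddf.divisions)                     # (0, 4): the edges of the index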
mdurant's answer is correct, and this answer elaborates with MCVE code snippets using Dask v2021.08.1. The examples make it easier to understand divisions and interleaving.
Vertically concatenating DataFrames
Create two DataFrames, concatenate them, and view the results.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)

ddf3 = dd.concat([ddf1, ddf2])
print(ddf3.compute())
nums letters
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
0 88 xx
1 99 yy
Divisions metadata when vertically concatenating
Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.
def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1.divisions  # (0, 3, 5)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2.divisions  # (0, 1)

ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions  # (None, None, None, None)
Set interleave_partitions=True to avoid losing the divisions metadata.
ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)
ddf3_interleave.divisions # (0, 1, 3, 5)
When interleaving isn't necessary
Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1 = ddf1.set_index("some_index")

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2 = ddf2.set_index("some_index")
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions # (4, 6, 10, 20)
I wrote a blog post to explain this in more detail. Let me know if you'd like the link.

OneVsRestClassifier gives 0 accuracy

I am trying to solve a multilabel classification problem as follows:
import pickle
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.cross_validation import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

traindf = pickle.load(open("traindata.pkl", "rb"))
X = traindf['Col1']
X = MultiLabelBinarizer().fit_transform(X)
y = traindf['Col2']
y = MultiLabelBinarizer().fit_transform(y)
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain, ytrain)
print "One vs rest accuracy: %.3f" % clf.score(Xvalidate, yvalidate)
In this way, I always get 0 accuracy. Please point out if I am doing something wrong. I am new to multilabel classification. Here is what my data looks like:
Col1 Col2
asd dfgfg [1,2,3]
poioi oiopiop [4]
EDIT
Thanks for your help @lejlot. I think I am getting the hang of it. Here is what I tried:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
tdf = pd.read_csv("mul.csv", index_col="DocID",error_bad_lines=False)
print tdf
So my input data looks like:
DocID Content Tags
1 abc abc abc [1]
2 asd asd asd [2]
3 abc abc asd [1,2]
4 asd asd abc [1,2]
5 asd abc qwe [1,2,3]
6 qwe qwe qwe [3]
7 qwe qwe abc [1,3]
8 qwe qwe asd [2,3]
So this is just some test data I created. Then I do:
text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, n_iter=5, random_state=42)),
])
t=TfidfVectorizer()
X=t.fit_transform(tdf["Content"]).toarray()
print X
This gives me:
[[ 1. 0. 0. ]
[ 0. 1. 0. ]
[ 0.89442719 0.4472136 0. ]
[ 0.4472136 0.89442719 0. ]
[ 0.55247146 0.55247146 0.62413987]
[ 0. 0. 1. ]
[ 0.40471905 0. 0.91444108]
[ 0. 0.40471905 0.91444108]]
Then:
y=tdf['Tags']
y=MultiLabelBinarizer().fit_transform(y)
print y
gives me
[[0 1 0 0 1 1]
[0 0 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 1 1 1]
[0 0 0 1 1 1]
[1 1 0 1 1 1]
[1 0 1 1 1 1]]
Here I am wondering: why are there 6 columns? Shouldn't there be only 3?
Anyway, I then also created a test data file:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
So this looks like:
DocID Content PredTags
34 abc abc qwe [1,3]
35 asd abc asd [1,2]
36 abc abc abc [1]
I have the PredTags column to check for accuracy. So finally I fit and predict as:
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
which gives me
[[1 1 1 1 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]]
Now, how do I know which tags are being predicted? How can I check the accuracy against my PredTags column?
Update
Thanks a lot @lejlot :) I also managed to get the accuracy as follows:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
ty=sdf["PredTags"]
ty = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in ty]
yt=MultiLabelBinarizer().fit_transform(ty)
Xt=t.fit_transform(sdf["Content"]).toarray()
print Xt
print yt
print "One vs rest accuracy: %.3f" % clf.score(Xt,yt)
I just had to binarize the test set prediction column as well :)
The actual problem is the way you work with text: you should extract some kind of features and use them as the text representation. For example, you can use a bag-of-words representation, or tfidf, or any more complex approach.
So what is happening now? You call MultiLabelBinarizer on a list of strings, so scikit-learn creates a set of all the elements of each iterable in the list... leading to a set-of-letters representation. So for example
from sklearn.preprocessing import MultiLabelBinarizer
X = ['abc cde', 'cde', 'fff']
print MultiLabelBinarizer().fit_transform(X)
gives you
array([[1, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 1]])
        |  |  |  |  |  |  |
        v  v  v  v  v  v  v
        a  b  _  c  d  e  f
Consequently classification is nearly impossible as this does not capture any meaning of your texts.
You could, for example, use a count vectorization (bag of words):
from sklearn.feature_extraction.text import CountVectorizer
print CountVectorizer().fit_transform(X).toarray()
gives you
[[1 1 0]
[0 1 0]
[0 0 1]]
  |   |   |
  v   v   v
 abc cde fff
Update
Finally, to make predictions with labels and not their binarization, you need to store your binarizer, thus
labels = MultiLabelBinarizer()
y = labels.fit_transform(y)
and later on
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print labels.inverse_transform(predicted)
Update 2
If you only have three classes, then the vector should have 3 elements; yours have 6, so check what you are passing as "y". There is probably some mistake in your data:
from sklearn.preprocessing import MultiLabelBinarizer
MultiLabelBinarizer().fit_transform([[1,2], [1], [3], [2]])
gives
array([[1, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]])
as expected.
My best guess is that your "tags" are also strings, so you actually call
MultiLabelBinarizer().fit_transform(["[1,2]", "[1]", "[3]", "[2]"])
which leads to
array([[1, 1, 1, 0, 1, 1],
[0, 1, 0, 0, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 0, 1, 1]])
        |  |  |  |  |  |
        v  v  v  v  v  v
        ,  1  2  3  [  ]
And these are your 6 classes: three true ones, two "trivial" classes "[" and "]" which are always present, and the nearly trivial class "," which appears for every object belonging to more than one class.
You should convert your tags to actual lists first, for example by
y = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in y]
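Note that this character-by-character trick breaks for multi-digit tags such as [10, 12]. A more robust sketch, assuming the tags are stored as strings like "[1,2,3]", parses them with ast.literal_eval from the standard library:
import ast

# "[1,2,3]" -> [1, 2, 3]; handles multi-digit tags as well
y = [ast.literal_eval(_y) for _y in y]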

Remove duplicate method for Python Pandas doesn't work

I am trying to remove duplicates based on unique values in column 'new'. I have tried two methods, but the output of df.shape suggests the before/after shapes are the same, meaning the duplicate removal fails.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print df.shape
df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
print df.shape
# output
(10, 6)
(10, 6)
[Finished in 1.0s]
You need to assign the result of drop_duplicates; by default inplace=False, so it returns a copy of the modified df. As you don't pass inplace=True, your original df is unmodified:
In [106]:
df = df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
if you passed inplace=True it would work:
In [108]:
df.drop_duplicates('new', take_last=False, inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5
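Side note: take_last was later deprecated in favour of the keep argument, so on a current pandas the equivalent call would be:
df = df.drop_duplicates('new', keep='first')  # keep='first' corresponds to take_last=False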

How to filter dataframe in pandas by 'str' in column names?

Following this recipe, I tried to filter a dataframe by the column names that contain the string '+'. Here's the example:
import pandas as pd

B = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', '+B', '+C'], index=[1, 2, 3, 4, 5])
So I want a dataframe C with only '+B' and '+C' columns in it.
C = B.filter(regex='+')
However I get the error:
File "c:\users\hernan\anaconda\lib\site-packages\pandas\core\generic.py", line 1888, in filter
matcher = re.compile(regex)
File "c:\users\hernan\anaconda\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "c:\users\hernan\anaconda\lib\re.py", line 244, in _compile
raise error, v # invalid expression
error: nothing to repeat
The recipe says it is for Python 3; I use Python 2.7. However, I don't think that is the problem here.
Hernan
+ has a special meaning in regular expressions (see here). You can escape it with \:
>>> C = B.filter(regex='\+')
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
Or, since all you care about is the presence of +, you could use the like argument instead:
>>> C = B.filter(like="+")
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
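If you prefer to avoid filter and regular expressions altogether, plain column selection does the same job; a small sketch:
C = B[[c for c in B.columns if '+' in c]]  # keep columns whose name contains '+'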

Python read a file and make an nth list from the lines

I have a file in which each line has 2 elements, as below, with n lines:
1 2
2 3
3 4
4 5
1 6
2 7
1 8
I need to make lists in python like this:
list[1] = [2, 6, 8]
list[2] = [3, 7]
list[3] = [4]
list[4] = [5]
How can I do this?
Try
import pandas as pd
a = [[1,2], [2,3], [3,4], [4, 5], [1, 6], [2,7], [1,8]]
df = pd.DataFrame(a,columns=['b','c'])
print df
z = df.groupby(['b']).apply(lambda tdf:pd.Series(dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['b']])))
z = z.sort_index()
print z
print z['c'][1]
print z['c'][2]
print z['c'][3]
print z['c'][4]
z['d'] = 0.000
z[['d']] = z[['d']].astype(float)
len_b = len(z.index)
z['d'] = float(len_b)
z['e'] = 1/z['d']
z = z[['c', 'e']]
z.to_csv('your output folder')
print z
See this answer for more details: https://stackoverflow.com/a/24112443/2632856
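For completeness, a plain-Python sketch using a dict (no pandas needed), assuming the input file is named 'input.txt':
from collections import defaultdict

groups = defaultdict(list)
with open('input.txt') as f:
    for line in f:
        key, value = line.split()           # two whitespace-separated elements
        groups[int(key)].append(int(value))

print groups[1]  # [2, 6, 8]
print groups[2]  # [3, 7]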