Reading a matrix and fetching row and column names in Python - python-2.7

I would like to read a matrix file that looks something like this:
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
I would like to fetch all the pairs that have a value >= 0.8, e.g. sample1,sample3 0.8, sample2,sample3 0.8, etc., from a large file.
When I use csv.reader, each line turns into a list, and keeping track of the row and column names makes the program fragile. I would like to know an elegant way of doing this, for example with numpy or pandas.
Desired output:
sample1,sample3 0.8
sample2,sample3 0.8
A value of 1 can be ignored, because the value between a sample and itself is always 1.

You can mask out everything except the values above the diagonal (the strict upper triangle) with np.triu:
In [11]: df
Out[11]:
sample1 sample2 sample3
sample
sample1 1.0 0.7 0.8
sample2 0.7 1.0 0.8
sample3 0.8 0.8 1.0
In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. , 0.7, 0.8],
[ 0. , 0. , 0.8],
[ 0. , 0. , 0. ]])
In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False, True],
[False, False, True],
[False, False, False]], dtype=bool)
Then to extract the index/columns where it's True I think you have to use np.where*:
In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))
This gives you an array of row positions and an array of column positions, which you can map back to labels (this is the least efficient part of this numpy version):
In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)
In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
('sample2', 'sample3', 0.80000000000000004)]
As desired.
*I may be forgetting an easier way to get this last chunk (Edit: the below pandas code does it, but I think there may be another way too.)
You can use the same trick in pandas but with stack to get the index/columns natively:
In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
sample1 sample2 sample3
sample
sample1 0 0 0.8
sample2 0 0 0.8
sample3 0 0 0.0
In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()
In [23]: res
Out[23]:
sample
sample1 sample1 0.0
sample2 0.0
sample3 0.8
sample2 sample1 0.0
sample2 0.0
sample3 0.8
sample3 sample1 0.0
sample2 0.0
sample3 0.0
dtype: float64
In [24]: res[res!=0]
Out[24]:
sample
sample1 sample3 0.8
sample2 sample3 0.8
dtype: float64
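A more compact variant of the same idea, as a minimal sketch: build the boolean mask once and pass it to DataFrame.where, so nothing has to be multiplied and no zeros need filtering afterwards. The inline sample text and the names upper and pairs are assumptions for illustration, and the snippet is written for Python 3.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the sample file from the question (assumed whitespace-separated).
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0)

# Boolean mask that is True only above the diagonal (k=1 skips the diagonal).
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() replaces everything outside the mask with NaN; stack() gives a
# Series indexed by (row label, column label).
pairs = df.where(upper).stack()

# The comparison also discards any NaN left over from the masking step.
pairs = pairs[pairs >= 0.8]

for (row, col), value in pairs.items():
    print("%s,%s %s" % (row, col, value))
```

Because only the strict upper triangle survives, each pair is reported once and the diagonal never shows up.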

If you want to use Pandas, the following answer will help. I am assuming you will figure out how to read your matrix files into Pandas by yourself. I am also assuming that your columns and rows are labelled correctly. What you will end up with after you read your data is a DataFrame which will look a lot like the matrix you have at the top of your question. I am assuming that all row names are the DataFrame index. As my starting point, I take it that you have read the data into a variable called df.
Pandas stores a DataFrame column by column, so pulling out one column and filtering it with a vectorized comparison is cheap. So, we loop over the columns and, for each one, select the rows that match.
pairs = {}
for col in df.columns:
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist()
    # If row names are not an index, but a different column named 'names', run the following line instead of the line above
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()
Alternatively, you can use apply() to do this, because that too will loop over all columns. (Maybe in 0.17 it will release the GIL for faster results, I do not know because I have not tried it.)
pairs will now contain the column name as key and, as value, a list of the names of the rows where the correlation is at least 0.8 but less than 1.
If you also want to extract correlation values from the DataFrame, replace .tolist() by .to_dict(). .to_dict() will generate a dict such that index is key and value is value: {index -> value}. So, ultimately your pairs will look like {column -> {index -> value}}. It will also be guaranteed free of nan. Note that .to_dict() will only work if your index contains the row names that you want, else it will return the default index, which is just numbers.
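As a rough sketch of the {column -> {index -> value}} variant described above: the filter is applied to the single column before calling .to_dict(), which is one possible reading of the suggestion. The small DataFrame is typed in by hand here and is only an assumption standing in for your real data.

```python
import pandas as pd

# Hand-built stand-in for the matrix in the question.
labels = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=labels,
    columns=labels,
)

pairs = {}
for col in df.columns:
    column = df[col]
    # Keep values that are at least 0.8 but below 1 (drops the diagonal).
    selected = column[(column >= 0.8) & (column < 1)]
    # Series.to_dict() maps each row label to its value.
    pairs[col] = selected.to_dict()

print(pairs)
```

Since the matrix is symmetric, every pair shows up twice in this structure, once under each of its two column names.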
Ps. If your file is huge, I would recommend reading it in chunks. In this case, the piece of code above will be repeated for each chunk. So it should be inside your loop that iterates over chunks. However, then you will have to be careful to append new data coming from the next chunk to pairs. The following links are for your reference:
Pandas I/O docs
Pandas read_csv() function
SO question on chunked read
You might also want to read reference 1 for other types of I/O supported by Pandas.
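The chunked approach could look roughly like this. It is a sketch under assumptions: the file is whitespace-separated, the first column holds the row names, and a chunksize of 2 is used only so the tiny example actually spans more than one chunk. An in-memory buffer stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for the large file on disk.
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""

pairs = {}
# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0, chunksize=2)
for chunk in reader:
    for col in chunk.columns:
        column = chunk[col]
        hits = column[(column >= 0.8) & (column < 1)].index.tolist()
        # Append to what earlier chunks already found for this column.
        pairs.setdefault(col, []).extend(hits)

print(pairs)
```

Each chunk holds a few complete rows and all columns, so the per-column filter works unchanged; only the accumulation into pairs has to be done with care.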

To read it in you need the skipinitialspace and index_col parameters:
a=pd.read_csv('yourfile.txt',sep=' ',skipinitialspace=True,index_col=0)
To get the values pair wise:
[[x,y,round(a[x][y],3)] for x in a.index for y in a.columns if x!=y and a[x][y]>=0.8][:2]
Gives:
[['sample1', 'sample3', 0.8],
['sample2', 'sample3', 0.8]]
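The [:2] slice above only works because this example happens to have exactly two qualifying pairs before the mirrored duplicates start. A sketch that avoids the duplicates in the first place, assuming the row labels and column labels are identical, is to iterate over unordered label pairs with itertools.combinations:

```python
from itertools import combinations

import pandas as pd

# Hand-built stand-in for the DataFrame read from the file.
labels = ["sample1", "sample2", "sample3"]
a = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=labels,
    columns=labels,
)

# combinations() yields each unordered pair once and never pairs a label
# with itself, so the diagonal and the mirrored entries are skipped.
result = [
    [x, y, round(a.loc[x, y], 3)]
    for x, y in combinations(a.index, 2)
    if a.loc[x, y] >= 0.8
]
print(result)
```

This is still a Python-level loop over all pairs, so for a very large matrix one of the vectorized approaches is likely to be faster.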

You can use scipy.sparse.coo_matrix, since it stores the matrix in a "(row, col) data" format.
from scipy.sparse import coo_matrix
import numpy as np
M = np.matrix([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)
Here, S.row and S.col are arrays of row and column indices, S.data is the array of values at those indices. So you can filter by
idx = S.data >= 0.8
And for instance create a new matrix only with those elements:
S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx])))
print S2
The output is
(0, 0) 1.0
(0, 2) 0.8
(1, 1) 1.0
(1, 2) 0.8
(2, 0) 0.8
(2, 1) 0.8
(2, 2) 1.0
Note that (0, 1) and (1, 0) do not appear, since their value is 0.7.
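To get from the integer positions back to sample names and to drop the diagonal and the mirrored entries, one option is to add a row < col condition to the filter. This is a sketch; the names list is an assumption and would come from the header of your file.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Assumed labels, in the same order as the rows/columns of M.
names = ["sample1", "sample2", "sample3"]
M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)

# Keep entries that are large enough AND lie strictly above the diagonal.
keep = (S.data >= 0.8) & (S.row < S.col)

pairs = [
    (names[i], names[j], value)
    for i, j, value in zip(S.row[keep], S.col[keep], S.data[keep])
]
print(pairs)
```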

pandas' read_table can handle regular expressions in the sep parameter.
In [19]: !head file.txt
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
In [20]: df = pd.read_table('file.txt', sep=r'\s+')
In [21]: df
Out[21]:
sample sample1 sample2 sample3
0 sample1 1.0 0.7 0.8
1 sample2 0.7 1.0 0.8
2 sample3 0.8 0.8 1.0
From there, you can filter on all values >= 0.8.
In [23]: df[df >= 0.8]
Out[23]:
sample sample1 sample2 sample3
0 sample1 1.0 NaN 0.8
1 sample2 NaN 1.0 0.8
2 sample3 0.8 0.8 1.0
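To go from that filtered table to the pair list the question asks for, one possible continuation is sketched below. It assumes the first column is read as the index (index_col=0) so that the comparison only touches numeric columns, and it uses an inline buffer in place of file.txt.

```python
import io

import pandas as pd

text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0)

# Values below the threshold become NaN.
filtered = df[df >= 0.8]

# Long format: one entry per (row label, column label); NaN entries removed.
long_form = filtered.stack().dropna()

# For every unordered pair exactly one ordering has row label < column label,
# so this keeps each pair once and drops the diagonal.
rows = long_form.index.get_level_values(0)
cols = long_form.index.get_level_values(1)
long_form = long_form[rows < cols]

print(long_form)
```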

Related

dataframe from 3d list

This is what I have (a 3D list):
[ STOCK NAME
last_price price2 price3
0 0.00 0.0 0.0
1 870.95 7650.0 2371500.0
2 870.95 7650.0 2371500.0
3 870.95 7650.0 2371500.0
4 877.30 7650.0 2371500.0
5 879.20 6800.0 2381700.0]
I want to create a DataFrame exactly like the list above. How do I do so? I tried pd.DataFrame(the_list), but it gave me this error: ValueError: Must pass 2-d input. shape=(190, 6, 3). Thanks.
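One way this might be approached, as a sketch only: treat each 2-D block of the 3-D list as its own small DataFrame and stack the blocks with pd.concat, using the stock name as an extra index level. The exact layout of the_list is not clear from the post, so the tiny 2 x 3 x 3 data and the stock names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the_list: 2 stocks, 3 rows each, 3 price columns.
the_list = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3).tolist()
stock_names = ["STOCK_A", "STOCK_B"]  # hypothetical names
columns = ["last_price", "price2", "price3"]

# One 2-D DataFrame per stock.
frames = [pd.DataFrame(block, columns=columns) for block in the_list]

# keys= adds the stock name as the outer level of a MultiIndex.
df = pd.concat(frames, keys=stock_names, names=["stock", "row"])
print(df)
```

df.loc["STOCK_A"] then gives back the 2-D table for a single stock.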

select column with non-zero values from dataframe

I have data like the data below. I would like to only return the columns from the dataframe that contain at least one non-zero value. So in the example below it would be column ALF. Returning non-zero rows doesn’t seem that tricky but selecting the column and records is giving me a little trouble.
print df
Data:
Type ADR ALE ALF AME
Seg0 0.0 0.0 0.0 0.0
Seg1 0.0 0.0 0.5 0.0
When I try something like the link below:
Pandas: How to select columns with non-zero value in a sparse table
m1 = (df['Type'] == 'Seg0')
m2 = (df[m1] != 0).all()
print (df.loc[m1,m2])
I get a key error for 'Type'
In my opinion you get the KeyError because the first column (Type) is the index, not a regular column.
Solution: use DataFrame.any to build a mask of the columns that have at least one non-zero value, then filter the index of that mask by its True entries:
m2 = (df != 0).any()
a = m2.index[m2]
print (a)
Index(['ALF'], dtype='object')
Or, if you need a list:
a = m2.index[m2].tolist()
print (a)
['ALF']
A similar solution is to filter the column names:
a = df.columns[m2]
Detail:
print (m2)
ADR False
ALE False
ALF True
AME False
dtype: bool
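If you want the columns themselves rather than just their names, the same mask can be passed to .loc. A small sketch, with the DataFrame rebuilt by hand under the assumption that Type is the index:

```python
import pandas as pd

# Hand-built copy of the sample data, with Type as the index.
df = pd.DataFrame(
    {
        "ADR": [0.0, 0.0],
        "ALE": [0.0, 0.0],
        "ALF": [0.0, 0.5],
        "AME": [0.0, 0.0],
    },
    index=pd.Index(["Seg0", "Seg1"], name="Type"),
)

# True for every column that has at least one non-zero value.
m2 = (df != 0).any()

# Boolean column selection keeps all rows and only the flagged columns.
nonzero_cols = df.loc[:, m2]
print(nonzero_cols)
```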

R Shiny interactive plots - data labels

I am plotting two matrices against each other using R Shiny. Both matrices have the same row and column names. When I click on a datapoint I would like the relevant column and row names to appear rather than the coordinates. Here is a sample of the code/data I am using. Thanks!
ui.R
library(shiny)
shinyUI(
  fluidPage(
    titlePanel("Matrix Plot"),
    plotOutput("plot", click = "plot_click"), br(), verbatimTextOutput("info")
  )
)
server.R
library(shiny)
d <- read.csv("d.csv",h=T, row.names=1)
e <- read.csv("e.csv",h=T, row.names=1)
shinyServer(function(input, output) {
  d_matrix <-reactive({
    as.matrix(d)
    d
  })
  e_matrix <-reactive({
    as.matrix(e)
    e
  })
  output$plot<-renderPlot({
    plot(d_matrix(),e_matrix())
  })
  output$info <- renderText({
    #output row and column names here instead of data coordinates
    paste0("x=", input$plot_click$x, "\ny=", input$plot_click$y)
  })
})
d =
A B C D
A 0 1 5 4
B 2 0 5 6
C 3 5 0 8
D 4 6 7 0
e =
A B C D
A 0.0 0.1 0.5 0.4
B 0.2 0.0 0.3 0.6
C 0.3 0.5 0.0 0.8
D 0.4 0.6 0.7 0.0

Strange behaviour when adding columns

I'm using Python 2.7.8 |Anaconda 2.1.0. I'm wondering why the strange behavior below occurs
I create a pandas dataframe with two columns, then add a third column by summing the first two columns
x = pd.DataFrame(np.random.randn(5, 2), columns = ['a', 'b'])
x['c'] = x[['a', 'b']].sum(axis = 1) #or x['c'] = x['a'] + x['b']
Out[7]:
a b c
0 -1.644246 0.851602 -0.792644
1 -0.129092 0.237140 0.108049
2 0.623160 0.105494 0.728654
3 0.737803 -1.612189 -0.874386
4 0.340671 -0.113334 0.227337
All good so far. Now I want to set the values of column c to zero if they are negative
x[x['c']<0] = 0
Out[9]:
a b c
0 0.000000 0.000000 0.000000
1 -0.129092 0.237140 0.108049
2 0.623160 0.105494 0.728654
3 0.000000 0.000000 0.000000
4 0.340671 -0.113334 0.227337
This gives the desired result in column 'c', but for some reason columns 'a' and 'b' have been modified, which I don't want. Why is this happening, and how can I fix it?
You have to specify you only want the 'c' column:
x.loc[x['c']<0, 'c'] = 0
When you just index with a boolean array/series, this will select full rows, as you can see in this example:
In [46]: x['c']<0
Out[46]:
0 True
1 False
2 False
3 True
4 False
Name: c, dtype: bool
In [47]: x[x['c']<0]
Out[47]:
a b c
0 -0.444493 -0.592318 -1.036811
3 -1.363727 -1.572558 -2.936285
That happens because you are assigning zero to every column of the selected rows. You should set it only for column c:
x['c'][x['c']<0] = 0
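For reference, a small self-contained check of the .loc form, using fixed numbers instead of random ones so the result is predictable. The chained form x['c'][...] = 0 can trigger a SettingWithCopyWarning in some pandas versions, so .loc is the safer of the two; Series.clip is shown as an assumed alternative for this particular "floor at zero" case.

```python
import pandas as pd

# Fixed values instead of np.random.randn, so the outcome is reproducible.
x = pd.DataFrame({"a": [-1.0, 0.5, -2.0], "b": [0.25, 0.25, 0.5]})
x["c"] = x["a"] + x["b"]

# Alternative computed up front: clip() floors the column at zero
# and returns a new Series without touching x.
clipped = x["c"].clip(lower=0)

# Row mask plus column label: only column 'c' of the matching rows is set.
x.loc[x["c"] < 0, "c"] = 0

print(x)
```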

Convert 2D numpy.ndarray to pandas.DataFrame

I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.iloc[idx].id1  # positional row lookup
    for idx2, val in enumerate(i):
        id2 = cache2.iloc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the index of the outer array and the inner array to index of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame. Each has ~100k rows.
This takes really really long, like a few hours to complete.
Is there some way I can speed it up?
I suspect your ndarr, if expressed as a 2d np.array, always has the shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. Also, the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
df=pd.DataFrame(np.array(ndarr).ravel(),
index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],names=['idx1', 'idx2']),
columns=['val'])
In [33]:
print df.reset_index()
idx1 idx2 val
0 ABC1234 3276827 4.3
1 ABC1234 98567498 5.6
2 ABC1234 38472837 6.7
3 NCMN7838 3276827 3.2
4 NCMN7838 98567498 4.5
5 NCMN7838 38472837 2.1
[6 rows x 3 columns]
Actually, I also think that keeping the MultiIndex may be a better idea.
Something like this should work:
ndarr = np.asarray(ndarr) # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives (the NaN entries in id2 presumably come from the mistyped 'id' key in the original cache2):
>>> fast_df
value id1 id2
0 4.3 ABC1234 3276827
1 5.6 ABC1234 98567498
2 6.7 ABC1234 NaN
3 3.2 NCMN7838 3276827
4 4.5 NCMN7838 98567498
5 2.1 NCMN7838 NaN
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0].
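Putting the pieces together, here is a sketch that applies the val > 0 filter from the question before building the DataFrame, so only the kept entries are ever materialized. It assumes, as above, that ndarr has one row per entry of cache1 and one column per entry of cache2; one value is set to 0.0 here purely to show the filter doing something.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame([{"id1": "ABC1234"}, {"id1": "NCMN7838"}])
cache2 = pd.DataFrame([{"id2": 3276827}, {"id2": 98567498}, {"id2": 38472837}])
# Same data as the question, except for one zero added for illustration.
ndarr = [[4.3, 0.0, 6.7], [3.2, 4.5, 2.1]]

values = np.asarray(ndarr)

# Positions (row, column) of all entries that pass the filter.
rows, cols = np.nonzero(values > 0)

# Fancy indexing maps the positions to ids without a Python-level loop.
fast_df = pd.DataFrame(
    {
        "id1": cache1["id1"].to_numpy()[rows],
        "id2": cache2["id2"].to_numpy()[cols],
        "value": values[rows, cols],
    }
)
print(fast_df)
```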