I am working with biological datasets, going from transcriptome (RNA) data to finding certain protein sequences. I have a set of protein names for each dataset and want to find which are common to all datasets. Because of how the data is processed, I end up with one variable that contains all the sub-sets.
Because of how the set.intersection() command works, it requires at least two sets as input:
IDs = set.intersection(transc1, transc2)
However, I only have one input, colA, which contains 30 sets of 80 to 100 entries each. Here is what I have so far:
from glob import glob
import pandas

for file in glob('*_query.tsv'):  # input all 30 datasets; the first column holds the protein IDs
    sources = file
    colnames = ['a', 'b', 'c', 'd', 'e', 'f']
    df = pandas.read_csv(sources, sep='\t', names=colnames)  # colnames are the headers for df construction
    colA = df.a.tolist()  # turn column a, the protein IDs, into a list
    IDs = set(colA)  # turn the list into a set
If I print(IDs) inside the loop, the output is something like this, with two of the resulting (unnamed) sets shown:
set(['ID2', 'ID8', 'ID35', 'ID77', 'ID78', 'ID199', 'ID211'])
set(['ID1', 'ID5', 'ID8', 'ID88', 'ID105', 'ID205'])
At this point I get stuck: I can't get set.intersection() working on the IDs set of sets. I also tried pandas.merge(*IDs), for which the syntax seemed to work, but the number of entries for comparison exceeded the maximum (12).
I wanted to use sets because, unlike lists, they should make it quick to find the IDs common to all the datasets. If there is a better way, I am all for it.
Help is much appreciated.
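For reference, set.intersection() will take any number of sets if they are unpacked from a list with *. A minimal sketch along those lines (the file pattern and column names come from the code above; collecting each file's IDs into a list is an assumption about the intended structure):
import pandas
from glob import glob

colnames = ['a', 'b', 'c', 'd', 'e', 'f']
id_sets = []  # one set of protein IDs per *_query.tsv file
for file in glob('*_query.tsv'):
    df = pandas.read_csv(file, sep='\t', names=colnames)
    id_sets.append(set(df.a))  # column a holds the protein IDs

# set.intersection accepts any number of sets, so unpack the list with *
common_ids = set.intersection(*id_sets)
print(common_ids)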
I have a dataframe that represents unique items. Each item is uniquely identified by its set of varA, varB, and varC values (each item has 0 to n values for varA, varB, or varC). My df has multiple rows per unique item, with various combinations of varA, varB, and varC.
The df is like this (ID is unique within the column, but it does not identify the unique item).
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':   [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})
So in the df here, you can see that:
1 and 3 are the same item with: {varA:[a], varB:[b,k], varC: [c,l]}.
2 and 4 are also the same: {varA:[d,m], varB:[e], varC: [f]}
I would like to identify every unique item, give them a unique id, and store their information.
The code I have written is terribly inefficient:
Step 1: I walk through each row of the dataframe and build a list of values for each variable.
When all three variables are new, it's a new item and I give it an id.
When any of the variables is already known, I add the new values to their respective lists and keep walking to the next row.
Step 2: Once I have walked the whole dataframe, I have two subsets:
one with a unique id,
the other without a unique id, but whose information can be found in the rows that do have one, via varA, varB, or varC. So, rather uglily, I merge successively on each variable to recover the unique id.
Result: I have the same df as at the start, but with a column of repeated unique identifiers.
This works well on 20,000 input rows with varA and varB. It runs very slowly and dies before the end (between Step 1 and Step 2) on 100,000 rows, and I need it to handle 1,000,000 rows.
Any pandanique way of doing this?
You can use chained boolean indexing using duplicated (pd.Series.duplicated):
If you want to keep the first occurrence of each duplicate:
myfilter = ~df.varA.duplicated(keep='first') & \
           ~df.varB.duplicated(keep='first') & \
           ~df.varC.duplicated(keep='first')
If you don't want to keep any of the duplicated values:
myfilter = ~df.varA.duplicated(keep=False) & \
           ~df.varB.duplicated(keep=False) & \
           ~df.varC.duplicated(keep=False)
Then you can for example give these an incremental uniqueID:
import numpy as np

df.loc[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
df
ID varA varB varC uniqueID
0 1 a b c 0.0
1 2 d e f 1.0
2 3 a k l NaN
3 4 m e NaN NaN
4 5 Z NaN t 2.0
Is there a way in pandas to merge two data frames with varying lengths by using a conditional statement?
eg:
pd.merge(df1, df2, on=<condition>)
For example, assume there are two data frames, df1 and df2, with 10,000 and 15,000 objects respectively.
I want to match common objects between the two catalogues using their x and y position. Objects should be matched between df1 and df2 such that the matched objects fall within 1m radius of each other.
Other than x and y, there is nothing in common between the two data frames.
The best I can think of so far involves a for loop. I'm sure there is a faster and better way to do this?
delta = 1.0
result = pd.concat([df1, df2], axis=1)

for index, values in result.T.iteritems():
    if len(result[((result.x.iloc[:, 1] - delta) < values.x.iloc[0]) & ((result.x.iloc[:, 1] + delta) > values.x.iloc[0]) &
                  ((result.y.iloc[:, 1] - delta) < values.y.iloc[0]) & ((result.y.iloc[:, 1] + delta) > values.y.iloc[0])]) > 0:
        print values.id
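For comparison, a vectorised sketch using scipy's cKDTree instead of the loop above (this assumes both frames have numeric x and y columns and that scipy is available; it also uses a true circular 1 m radius, whereas the loop above tests a square box):
import pandas as pd
from scipy.spatial import cKDTree

delta = 1.0  # matching radius in metres (assumed unit)

# Build a tree on df2's coordinates and query it with df1's coordinates.
tree = cKDTree(df2[['x', 'y']].values)
neighbours = tree.query_ball_point(df1[['x', 'y']].values, r=delta)

# neighbours[i] lists the positional indices of df2 rows within delta of df1 row i.
pairs = [(i, j) for i, js in enumerate(neighbours) for j in js]
matched = pd.DataFrame(pairs, columns=['df1_idx', 'df2_idx'])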
I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the data is in the form of a dict, you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise by passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x['Physician_Profile_Zip_Code']].state, axis=1)
By using double square brackets we return a df, which allows you to pass the axis param.
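If the aim is to fill in only the rows where Physician_Profile_State is actually missing, a masked assignment along these lines might do it (a sketch only; it assumes every zip code in those rows is one pyzipcode recognises):
# Fill the state only where it is currently missing, looking it up from the zip code.
missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(lambda z: zcdb[z].state)
)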
I'm trying to create a code that will open a file with a list of numbers in it and then take those numbers and smooth them as many times as the user wants. I have it opening and reading the file, but it will not transpose the numbers. In this format it gives this error: TypeError: unsupported operand type(s) for /: 'str' and 'float'. I also need to figure out how to make it transpose the numbers the amount of times the user asks it to. The list of numbers I used in my .txt file is [3, 8, 5, 7, 1].
Here is exactly what I am trying to get it to do:
Ask the user for a filename
Read all floating point data from file into a list
Ask the user how many smoothing passes to make
Display smoothed results with two decimal places
Use functions where appropriate
Algorithm:
Never change the first or last value
Compute new values for all other values by averaging the value with its two neighbors
Here is what I have so far:
filename = raw_input('What is the filename?: ')
inFile = open(filename)
data = inFile.read()
print data
data2 = data[:]
print data2
data2[1]=(data[0]+data[1]+data[2])/3.0
print data2
data2[2]=(data[1]+data[2]+data[3])/3.0
print data2
data2[3]=(data[2]+data[3]+data[4])/3.0
print data2
You almost certainly don't want to be manually indexing the list items. Instead, use a loop:
data2 = data[:]
for i in range(1, len(data)-1):
    data2[i] = sum(data[i-1:i+2]) / 3.0
data = data2
You can then put that code inside another loop, so that you smooth repeatedly:
smooth_steps = int(raw_input("How many times do you want to smooth the data?"))
for _ in range(smooth_steps):
    # code from above goes here
Note that my code above assumes that you have read numeric values into the data list. However, the code you've shown doesn't do this. You simply use data = inFile.read() which means data is a string. You need to actually parse your file in some way to get a list of numbers.
In your immediate example, where the file contains a Python formatted list literal, you could use eval (or ast.literal_eval if you wanted to be a bit safer). But if this data is going to be used by any other program, you'll probably want a more widely supported format, like CSV, JSON or YAML (all of which have parsers available in Python).
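Putting those pieces together, a complete sketch might look like this (Python 2 to match the rest of the thread; ast.literal_eval assumes the file really does contain a list literal such as [3, 8, 5, 7, 1]):
import ast

def read_numbers(filename):
    # Parse a Python-style list literal from the file into a list of floats.
    with open(filename) as f:
        return [float(x) for x in ast.literal_eval(f.read())]

def smooth_once(data):
    # Average every interior value with its two neighbours; endpoints stay fixed.
    result = data[:]
    for i in range(1, len(data) - 1):
        result[i] = sum(data[i-1:i+2]) / 3.0
    return result

filename = raw_input('What is the filename?: ')
data = read_numbers(filename)
passes = int(raw_input('How many smoothing passes? '))
for _ in range(passes):
    data = smooth_once(data)
print ', '.join('%.2f' % x for x in data)  # display with two decimal places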
Let's say we have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random number of a probability distribution and the second column stores a list with its parameters and name.
The dataframe df looks like:
sample params
1 0.85102972 0, 1, Normal
2 0.67313218 5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there an alternative way of storing data like this, something like a multi-dimensional data frame? I do not want to add a column for each parameter because it would litter the data frame with missing values.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal  = list(sample=rnorm(1,0,1),    params=list(mean=0,sd=1,dist="Normal")),
                 gamma   = list(sample=rgamma(1,5,5),   params=list(shape=5,rate=5,dist="Gamma")),
                 binom   = list(sample=rbinom(1,7,0.7), params=list(size=7,prob=0.7,dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),    params=list(mean=2,sd=3,dist="Normal")),
                 tdist   = list(sample=rt(1,3),         params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
normal gamma binom normal2 tdist
"Normal" "Gamma" "Binomial" "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = mean and df$param2 = sd, while a row with df$distribution = 'Student' should have df$param1 = df (the degrees of freedom) and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.