pandas merging of two data-frames using conditions - python-2.7

Is there a way in pandas to merge two data frames with varying lengths by using a conditional statement?
e.g.:
pd.merge(df1, df2, on=<condition>)
For example, assume there are two data frames, df1 and df2, with 10,000 and 15,000 objects respectively.
I want to match common objects between the two catalogues using their x and y positions. Objects should be matched between df1 and df2 such that the matched objects fall within a 1 m radius of each other.
Other than x and y, there is nothing common between the two data frames.
The best I can think of so far involves a for loop. I'm sure there is a faster and better way to do this?
delta = 1.0
result = pd.concat([df1, df2], axis=1)
for index, values in result.T.iteritems():
    if len(result[((result.x.iloc[:, 1] - delta) < values.x.iloc[0]) &
                  ((result.x.iloc[:, 1] + delta) > values.x.iloc[0]) &
                  ((result.y.iloc[:, 1] - delta) < values.y.iloc[0]) &
                  ((result.y.iloc[:, 1] + delta) > values.y.iloc[0])]) > 0:
        print values.id
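For what it's worth, a box test like this can be vectorized with a spatial index. Here is a minimal sketch using scipy's cKDTree (not part of the original question; it assumes both frames have numeric x and y columns, and it matches on true Euclidean distance rather than the square box above):
import numpy as np
from scipy.spatial import cKDTree

# Build a KD-tree on df2's positions, then find each df1 object's
# nearest df2 neighbour within a 1 m radius.
tree = cKDTree(df2[['x', 'y']].values)
dist, idx = tree.query(df1[['x', 'y']].values, distance_upper_bound=1.0)

# Unmatched rows come back with dist == inf; keep only the real matches.
matched = df1[np.isfinite(dist)].copy()
matched['df2_index'] = idx[np.isfinite(dist)]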

Related

How to put parameters obtained through "pandas.describe" in a plot in one go?

Say I have a data frame with four columns and I want to plot any two of them just to visualize my data. We can find the values of all the summary parameters by using this:
df.describe()
count 332.000000
mean 5645.999337
std 391.081389
min 4952.290000
25% 5294.402500
50% 5647.905000
75% 6028.805000
max 6290.980000
Now, how can we put the information that we get from this function (pandas.describe) into the plot in one go, instead of using the usual label function from matplotlib?
Matplotlib has the option ax.text, so you need to convert this info into a single string first.
Here is an example:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3]})
desc=df.describe()
describe() also returns a DataFrame, so you can turn its index and each of its columns into a list of strings:
data1=[i for i in desc.index]
data2=[str(i) for i in desc.A]
Now you can join both with a colon in between:
text = '\n'.join([a + ':' + b for a, b in zip(data1, data2)])
Then in your graph, you can input:
ax.text(pos1, pos2, text , fontsize=15)
where pos1 and pos2 are numbers giving the position of your text.
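Putting the pieces together, here is a minimal end-to-end sketch (the plotted column and the text position are placeholders to adapt to your own data):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': [1, 2, 3]})
desc = df.describe()

# Build one "statistic: value" line per describe() row.
text = '\n'.join(str(i) + ': ' + str(v) for i, v in zip(desc.index, desc.A))

fig, ax = plt.subplots()
ax.plot(df.index, df['A'])
ax.text(0.1, 2.5, text, fontsize=10)  # position chosen to fit this toy data
plt.show()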
Does that help?
Tell me!

How do I compare two columns at once against two different data frames in python (pandas)?

df1 contains two columns of Lat and Long, and several thousand rows. df2 also contains two columns of lat and long with many rows. Essentially, df2 is a list of reference locations that I want to compare df1 with. I want to compare both the Latitude and Longitude of df1 with df2 to say their locations match, or say they don't. i.e.,
my_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
reference_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
In simpler words, I want to say that if the location in each row in my_data is present in reference_data, label it 1, else label it 0. Since this location has two components Lat and Long, they BOTH need to be present next to each other anywhere in the reference dataframe. Is there an easy one-liner?
You could generate this by using the merge function to join the reference_data to my_data with an indicator.
new_df = pd.merge(my_data, reference_data, on=['Lat','Lon'], how='left', indicator='flag')
You'll get a dataframe that should look exactly like my_data but include a new column "flag" which either says "left_only" or "both".
To get it as a [0,1] label:
new_df['bin_flag'] = (new_df['flag']=='both').astype(int)
To my knowledge, there is not an actual one-liner for this one.
You can also do something like this (note the .values: the in operator on a bare Series checks the index, not the values):
my_data.apply(lambda x: (x['Lat'] in reference_data['Lat'].values and x['Lon'] in reference_data['Lon'].values) * 1.0, axis=1)
and then you can just assign it wherever you like.
or, checking the Lat/Lon pair together rather than each column independently:
my_data.apply(lambda x: ((x['Lat'], x['Lon']) in zip(reference_data['Lat'], reference_data['Lon'])) * 1.0, axis=1)
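For larger frames, a set-based variant of the same pair test (my addition, not from the answers above) avoids rebuilding the pair sequence on every row:
# Build the set of reference (Lat, Lon) pairs once, then test each row.
ref_pairs = set(zip(reference_data['Lat'], reference_data['Lon']))
my_data['bin_flag'] = [int((lat, lon) in ref_pairs)
                       for lat, lon in zip(my_data['Lat'], my_data['Lon'])]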

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the zip-to-state data is in the form of a dict, then you should be able to do the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise by passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x['Physician_Profile_Zip_Code']].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param.
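If only the rows with a missing state should be touched, here is a minimal sketch along those lines (using the zcdb and df from the question; the broad error handling is a guess at how pyzipcode reacts to unknown codes):
# Fill in only the rows whose state is missing; leave unknown zips as None.
def state_from_zip(z):
    try:
        return zcdb[z].state
    except Exception:
        return None

missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(state_from_zip)
)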

Column of lists inside a dataframe in R

Let's say we have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random draw from a probability distribution, and the second column stores a list with the distribution's parameters and name.
The dataframe df looks like:
sample params
1 0.85102972 0, 1, Normal
2 0.67313218 5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there any alternative way of storing data like this? Something like a multi-dimensional data frame? I do not want to add a column for each parameter because it would litter the data frame with missing values.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal  = list(sample=rnorm(1,0,1),    params=list(mean=0,sd=1,dist="Normal")),
                 gamma   = list(sample=rgamma(1,5,5),   params=list(shape=5,rate=5,dist="Gamma")),
                 binom   = list(sample=rbinom(1,7,0.7), params=list(size=7,prob=0.7,dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),    params=list(mean=2,sd=3,dist="Normal")),
                 tdist   = list(sample=rt(1,3),         params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
normal gamma binom normal2 tdist
"Normal" "Gamma" "Binomial" "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = mean and df$param2 = sd. A row with df$distribution = 'Student' should have df$param1 = df (the degrees of freedom) and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
                 param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
                           distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
                           param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.

For loop using a t-stat function to create a list

I am using the following function to calculate the t-stat for data in a data frame (x):
wilcox.test.all.genes <- function(x, s1, s2) {
    x1 <- x[s1]
    x2 <- x[s2]
    x1 <- as.numeric(x1)
    x2 <- as.numeric(x2)
    wilcox.out <- wilcox.test(x1, x2, exact=F, alternative="two.sided", correct=T)
    out <- as.numeric(wilcox.out$statistic)
    return(out)
}
I need to write a for loop that will iterate a specific number of times. For each iteration, the columns need to be shuffled, the above function performed and the maximum t-stat value saved to a list.
I know that I can use the sample() function to shuffle the columns of the data frame, and the max() function to identify the maximum t-stat value, but I can't figure out how to put them together to achieve a workable code.
You are trying to generate empirical p-values, corrected for the multiple comparisons you are making because of the multiple columns in your data. First, let's simulate an example data set:
# Simulate data
n.row = 100
n.col = 10
set.seed(12345)
group = factor(sample(2, n.row, replace=T))
data = data.frame(matrix(rnorm(n.row*n.col), nrow=n.row))
Next, calculate the Wilcoxon test statistic for each column, replicating this many times while permuting the class membership of the observations. This gives us an empirical null distribution of the test statistic.
# Re-calculate columnwise test statistics many times while permuting class labels
perms = replicate(500, apply(data[sample(nrow(data)), ], 2,
                             function(x) wilcox.test(x[group==1], x[group==2], exact=F,
                                                     alternative="two.sided", correct=T)$stat))
Calculate the null distribution of the maximum test statistic by collapsing across the multiple comparisons.
# For each permuted replication, calculate the max test statistic across the multiple comparisons
perms.max = apply(perms, 2, max)
By simply sorting the results, we can now determine the p=0.05 critical value.
# Identify critical value
crit = sort(perms.max)[round((1-0.05)*length(perms.max))]
We can also plot our distribution along with the critical value.
# Plot
dev.new(width=4, height=4)
hist(perms.max)
abline(v=crit, col='red')
Finally, comparing a real test statistic to this distribution will give you an empirical p-value, corrected for multiple comparisons by controlling the family-wise error rate at p<0.05. For example, let's pretend a real test stat was 1600. We could then calculate the p-value like:
> length(which(perms.max>1600))/length(perms.max)
[1] 0.074