Checking if two pandas dataframes have already been merged - python-2.7

Suppose that I have DataFrames df and df2. df2 may or may not have already been merged with df via
df = df.merge(df2, how='left', left_index=True, right_index=True)
When unmerged, they should have no column names in common.
What is the cleanest way to check if df and df2 have already been merged?

Combine Index.intersection and Index.empty to determine whether there are any columns in common:
df.columns.intersection(df2.columns).empty
Returning True indicates that there are no columns in common.

Taking advantage of the fact that:
When unmerged, they should have no column names in common.
Check whether any of the column names in df2 exist in df; to do so, use the DataFrame's columns attribute.
For example:
import pandas as pd
# Create DataFrames with no columns in common
df = pd.DataFrame([1, 2, 3], columns=['a'])
df2 = pd.DataFrame([4, 5, 6], columns=['b'])
# False; the DataFrames have not been merged
not df.columns.intersection(df2.columns).empty
# Merge
df = df.merge(df2, how='left', left_index=True, right_index=True)
# True; the DataFrames have been merged
not df.columns.intersection(df2.columns).empty
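Wrapped up as a small helper (the function name here is illustrative, not part of the original answer), the same check reads:
def already_merged(df, df2):
    # True once df already carries at least one of df2's columns
    return not df.columns.intersection(df2.columns).empty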
Update: Suggestion comes from the comments. See similar options here: python: check if an numpy array contains any element of another array

Related

How to convert a rpy2 matrix object into a Pandas data frame?

After reading in a .csv file using pandas, and then converting it into an R dataframe using the rpy2 package, I created a model using some R functions (also via rpy2), and now want to take the summary of the model and convert it into a Pandas dataframe (so that I can either save it as a .csv file or use it for other purposes).
I followed the instructions on the pandas site (source: https://pandas.pydata.org/pandas-docs/stable/r_interface.html) to set this up:
import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
caret = rpackages.importr('caret')
broom = rpackages.importr('broom')
base = rpackages.importr('base')
stats = rpackages.importr('stats')
my_data = pd.read_csv("my_data.csv")
r_dataframe = pandas2ri.py2ri(my_data)
preprocessing = ["center", "scale"]
center_scale = StrVector(preprocessing)
# these are the columns in my data frame that serve as the predictors in the model
predictors = ['predictor1', 'predictor2', 'predictor3']
predictors_vector = StrVector(predictors)
# this column from the dataframe is the outcome of the model
outcome = ['fluorescence']
outcome_vector = StrVector(outcome)
# this line extracts the columns of the predictors from the dataframe
columns_predictors = r_dataframe.rx(True, predictors_vector)
# this line extracts the column of the outcome from the dataframe
columns_response = r_dataframe.rx(True, outcome_vector)
cvCtrl = caret.trainControl(method="repeatedcv", number=20, repeats=100)
model_R = caret.train(columns_predictors, columns_response, method="glmStepAIC", preProc=center_scale, trControl=cvCtrl)
summary_model = base.summary(model_R)
coefficients = stats.coef(summary_model)
pd_dataframe = pandas2ri.ri2py(coefficients)
pd_dataframe.to_csv("coefficents.csv")
Although this workflow is ostensibly correct, the output .csv file did not meet my needs: the names of the columns and rows were removed. When I ran type(pd_dataframe), I found that it is a <type 'numpy.ndarray'>. The information in the table is still present, but the conversion dropped the column and row names.
So I ran the command type(coefficients) and found that it was a <class 'rpy2.robjects.vectors.Matrix'>. Since this Matrix object still retained the names of my columns and rows, I tried to convert it into an R objects DataFrame, but my efforts proved to be futile. Furthermore, I don't know why the line pd_dataframe = pandas2ri.ri2py(coefficients) did not yield a pandas DataFrame object, nor why it did not retain the names of my columns and rows.
Can anybody recommend an approach so I can get some kind of pandas DataFrame that retains the names of my columns and rows?
UPDATE
A method called pandas2ri.ri2py_dataframe, mentioned in the documentation of a slightly older version of the package (source: https://rpy2.readthedocs.io/en/version_2.7.x/changes.html), now gives me a proper data frame instead of the numpy array. However, I still can't get the names of the rows and columns to be transferred properly. Any suggestions?
Maybe it should happen automatically during conversion, but in the meantime the row and column names can easily be obtained from the R object and added to the pandas DataFrame. For example, the column names for the R matrix are documented at: https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.colnames
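A minimal sketch of that idea, assuming coefficients is the rpy2 Matrix from the question and that it carries both rownames and colnames (as the coefficient matrix of an R summary usually does):
import numpy as np
import pandas as pd
# pull the raw values plus the R dimnames, then rebuild a labelled DataFrame
values = np.array(coefficients)
row_labels = list(coefficients.rownames)
col_labels = list(coefficients.colnames)
pd_dataframe = pd.DataFrame(values, index=row_labels, columns=col_labels)
pd_dataframe.to_csv("coefficents.csv")  # same output file as in the question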

Deleting pandas dataframe rows based on a condition

My pandas dataframe has a column where each row is a string which corresponds to a filename. I read my data from a JSON file and extract the column like this:
df = pd.read_json("mergedJSON.txt", lines=True, orient='columns')
df2 = df.set_index("subject")
for key, value in some_dict.iteritems():
    df2.loc[value, "file_name"].to_csv(outfile, index=False, header=False)
I need to drop certain rows from this dataframe based on whether the file is found on disk. Not sure how to do this. Appreciate help.
Just use this as the last line
df2[df2.file_name.str.contains('stringValue')].loc[value,:].to_csv()
First, set_index / reindex: use the filename as the index, and then do df.drop(filename).
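To drop rows based on whether the file is actually found on disk (the original question), a minimal sketch along the same lines, assuming file_name holds paths that are valid relative to the working directory:
import os
# keep only the rows whose file_name points at a file that exists
df2 = df2[df2["file_name"].apply(os.path.exists)]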

Add columns in a pandas dataframe to a list

I have a pandas dataframe in the following format:
Latitude Longitude
53.553 -80.3123
58.1211 -81.3245
I am trying to convert this dataframe to a list of lists to use it in my plotting package.
[[53.553, -80.3123], [58.1211, -81.3245]]
I tried iterating through the pandas rows to append these columns to a list, but for some reason I am not even able to build my first-level list.
list.append(row['Latitude'], row['Longitude'])
Any help would be appreciated.
Use the tolist method on the underlying NumPy array:
df.values.tolist()
[[53.553000000000004, -80.3123], [58.1211, -81.3245]]
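For a self-contained illustration (the frame is rebuilt here from the values in the question), the whole round trip could look like:
import pandas as pd
df = pd.DataFrame({"Latitude": [53.553, 58.1211],
                   "Longitude": [-80.3123, -81.3245]})
# selecting the columns explicitly guarantees their order in the output
coords = df[["Latitude", "Longitude"]].values.tolist()
# [[53.553, -80.3123], [58.1211, -81.3245]] (up to float repr)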

Deleting pandas dataframe rows if value in given column not contained in a list

I have a pandas dataframe called df that contains several columns, including a df['MY STATE'] column. My goal is to remove all the rows from the dataframe which do not contain US states. I want to do this by comparing the value in the cell to a pandas Series I have containing all the state abbreviations. I have seen people use something like the following to clean a dataframe:
df = df[df['COST'] <= 0]
But something like what I need (below) doesn't work
df = df[df['MY STATE'] not in states['Abbreviation'].values]
Is there a way to do this simply?
I have read that df.query() can be used to do something like this, but I have not yet found an example, and have also read that df.query() cannot be used when there is a space in the name of the column.
Thank you,
Michael
IIUC you can use isin with the inverse operator ~:
df = df[~df['MY STATE'].isin(states['Abbreviation'].values)]
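The inverse operator mirrors the not in from the attempt above; to match the stated goal of keeping only rows whose value is a US state abbreviation, the same mask would be used without ~:
df = df[df['MY STATE'].isin(states['Abbreviation'].values)]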

Deleting Entire blank row in an existing Excel Sheet using Python

How can I remove an entire blank row from an existing Excel sheet using Python?
I need a solution which does NOT involve reading the whole file and rewriting it without the deleted row.
Is there any direct solution?
I achieved this using the pandas package:
import pandas as pd
# Read from Excel
xl = pd.ExcelFile("test.xls")
# Parse the first sheet into a DataFrame
dfs = xl.parse(xl.sheet_names[0])
# Update the DataFrame as required
# (here: drop rows with a blank value in the "Name" column;
#  blank Excel cells are read in as NaN, so filter with notnull)
dfs = dfs[dfs['Name'].notnull()]
# Write the updated DataFrame back to the Excel sheet
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)
If memory usage is not an issue, you can achieve this using an extra dataframe.
import pandas as pd
# Read from Excel and parse the first sheet
xl = pd.ExcelFile("test.xls")
dfs = xl.parse(xl.sheet_names[0])
# Filter into a second DataFrame, e.g. rows where Sal is 1000 and Message is "abc"
df1 = dfs[dfs['Sal'] == 1000]
df1 = df1[df1['Message'] == "abc"]
ph_no = df1['Number']
print df1
To delete an Excel row, say row 5 (the column does not matter, so use 1):
sh.Cells(5,1).EntireRow.Delete()
To delete a range of Excel rows, say rows 5 to 20:
sh.Range(sh.Cells(5,1),sh.Cells(20,1)).EntireRow.Delete()
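These snippets assume an sh worksheet object obtained through the Excel COM interface; a minimal end-to-end sketch, assuming Windows with Excel and the pywin32 package installed (the path and sheet index are placeholders):
import win32com.client
excel = win32com.client.Dispatch("Excel.Application")
wb = excel.Workbooks.Open(r"C:\full\path\to\test.xls")  # hypothetical path
sh = wb.Worksheets(1)
# delete row 5 in place, without rewriting the rest of the file
sh.Cells(5, 1).EntireRow.Delete()
wb.Save()
excel.Quit()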