Merging dataframes in pandas - python-2.7
I have 10 dataframes and I'm trying to merge them on the variable names. The goal is to get one file containing all the data for the relevant variables.

I'm using the following code:
pd.merge(df,df1,df2,df3,df4,df5,df6,df7,df8,df9,df10, on = ['RSSD9999', 'RCFD0010','RCFD0071','RCFD0081','RCFD1400','RCFD1773','RCFD2123','RCFD2145','RCFD2160','RCFD3123','RCFD3210','RCFD3300','RCFD3360','RCFD3368','RCFD3792','RCFD6631','RCFD6636','RCFD8274','RCFD8275','RCFDB530','RIAD4000','RIAD4073','RIAD4074','RIAD4079','RCFD1403','RCON3646','RIAD4230','RIAD4300','RIAD4301','RIAD4302','RIAD4340','RIAD4475','RCFD1406','RCFD3230','RCFD2950','RCFD3632','RCFD3839','RCFDB529','RCFDB556','RCON0071','RCON0081','RCON0426','RCON2145','RCON2148','RCON2168','RCON2938','RCON3210','RCON3230','RCON3300','RCON3839','RCONB528','RCONB529','RCONB530','RCONB556','RCONB696','RCONB697','RCONB698','RCONB699','RCONB700','RCONB701','RCONB702','RCONB703','RCONB704','RCON1410','RCON6835','RCFD2210','RCONA223','RCONA224','RCON5311','RCON5320','RCON5327','RCON5334','RCON5339','RCON5340','RCON7204','RCON7205','RCON7206','RCON3360','RCON3368','RCON3385','RIAD3217','RCFDA222','RCFDA223','RCFDA224','RCON3792','RCON0391','RCFD7204','RCFD7206','RCFD7205','RCONB639','RIADG104','RCFDG105','RSSD9017','RSSD9010','RSSD9042','RSSD9050'],how='outer')
But I'm getting the error "merge() got multiple values for keyword argument 'on'". I think the code is correct; can anyone help me understand what's wrong here?
First, you are merging ten dataframes, but pd.merge only accepts two dataframes at a time (left and right). The extra dataframes you pass positionally get bound to the parameters that follow (how, on, ...), which is exactly why Python reports that 'on' received multiple values. Also, every pair of dataframes you merge needs at least one column in common.
import pandas as pd
df = pd.DataFrame(data, columns=[your columns], index=[index names])  # pd.DataFrame, not pd.Dataframe; columns=, not column=
# build all ten dataframes the same way, then merge them two at a time,
# where key_columns is your list of shared variable names:
answer = pd.merge(df, df1, on=key_columns, how='outer')
for other in (df2, df3, df4, df5, df6, df7, df8, df9, df10):
    answer = pd.merge(answer, other, on=key_columns, how='outer')
For example, with three dataframes sharing a 'Key' column, chain the merges:
pd.merge(pd.merge(Ray, Bec, on='Key', how='outer'), Dan, on='Key', how='outer')
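For many dataframes, a more compact pattern is functools.reduce, which folds pd.merge across a whole list. A minimal sketch, assuming df through df10 exist as in the question and key_columns holds the long list of shared variable names:

import pandas as pd
from functools import reduce

frames = [df, df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]

# reduce applies pd.merge pairwise: merge(merge(df, df1), df2), and so on
merged = reduce(lambda left, right: pd.merge(left, right, on=key_columns, how='outer'), frames)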
Related
Any ideas on iterating over a dataframe and applying regex?
This may be a rudimentary problem, but I am new to pandas. I have a CSV-backed dataframe and I want to iterate over each row to extract the string information in a specific column with regex. (The reason I am using regex is that I eventually want to make a separate dataframe from that column.) I tried iterating with a for loop but got a ton of errors. So far, it looks like the loop reads each row as a tuple or Series rather than a string (correct me if I'm wrong). My main functions are iteritems() and findall(), but no good results so far. How can I approach this problem?

My dataframe looks like this:

df = pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]

My approach looks like this:

for Individual_row in df['TEXT'].iteritems():
    parsed = re.findall(r'(.*?)\:\s*?\[(.*?)\]', Individual_row)
    res = {g[0].strip(): g[1].strip() for g in parsed}

Many thanks in advance.
You can try the following instead of a loop (note that na_action is an argument of Series.map, not Series.apply, so map is used here):

import re

df['new_TEXT'] = df['TEXT'].map(
    lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?)\:\s*?\[(.*?)\]', x)],
    na_action='ignore',
)

This will create a new column with your resulting data.
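A quick check of what that produces, assuming the TEXT column holds strings shaped like 'name: [value]' (the sample data below is made up):

import re
import pandas as pd

df = pd.DataFrame({'TEXT': ['name: [John] age: [30]', None]})
df['new_TEXT'] = df['TEXT'].map(
    lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?)\:\s*?\[(.*?)\]', x)],
    na_action='ignore',
)
print(df['new_TEXT'][0])  # [['name', 'John'], ['age', '30']]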
Search pandas column and return all elements (rows) that contain any (one or more) non-digit character
This seems pretty straightforward. The column contains numbers in general, but for some reason some entries have non-digit characters, and I want to find all of them. I am using this code:

df_other_values.total_count.str.contains('[^0-9]')

but I get the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

So I tried this:

df_other_values = df_other.total_countvalues
df_other_values.total_count.str.contains('[^0-9]')

but get the following error:

AttributeError: 'DataFrame' object has no attribute 'total_countvalues'

Rather than going further down the rabbit hole, I was thinking there must be a way to do this without having to change my dataframe into an np.object. Please advise. Thanks.
I believe you need to cast to strings first with astype and then filter by boolean indexing:

df1 = df[df_other_values.total_count.astype(str).str.contains('[^0-9]')]

Alternative solution with isnumeric:

df1 = df[~df_other_values.total_count.astype(str).str.isnumeric()]
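A minimal, self-contained illustration of the boolean-indexing approach (the column name total_count comes from the question; the sample values are made up):

import pandas as pd

df = pd.DataFrame({'total_count': [123, '45a', 678, '9_9']})

# cast everything to str so the .str accessor works, then keep rows with any non-digit
mask = df['total_count'].astype(str).str.contains('[^0-9]')
print(df[mask])  # the '45a' and '9_9' rows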
Formatting a thousands separator for numbers in a pandas dataframe
I am trying to write a dataframe to a CSV and I would like the numbers to be formatted with thousands separators. I don't see any way in the to_csv docs to use a format string or anything like this. Does anyone know a good way to format my output?

My CSV output looks like this:

12172083.89 1341.4078 -9568703.592 10323.7222
21661725.86 -1770.2725 12669066.38 14669.7118

I would like it to look like this:

12,172,083.89 1,341.4078 -9,568,703.592 10,323.7222
21,661,725.86 -1,770.2725 12,669,066.38 14,669.7118
The comma is the default separator. If you want to choose your own separator, declare the sep parameter of pandas' to_csv() method:

df.to_csv(sep=',')

If your goal is to add thousands separators and export the result back to a CSV, you can follow this example:

import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078, -9568703.592, 10323.7222],
                   [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
                  columns=['A', 'B', 'C', 'D'])

for c in df.columns:
    df[c] = df[c].apply(lambda x: '{0:,}'.format(x))

df.to_csv(sep='\t')

If you just want pandas to show separators when the dataframe is printed:

pd.options.display.float_format = '{:,}'.format
print(df)
What you're looking to do has nothing to do with CSV output; it's string formatting:

print('{0:,}'.format(123456789000000.546776362))

produces

123,456,789,000,000.55

(float precision rounds off the trailing digits). See the format string syntax documentation. Also, you'd do well to heed @Peter's comment above about compromising the structure of a CSV in the first place.
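If you do still want comma-formatted numbers inside a comma-separated file, one option (a sketch not taken from the answers above; the file name is illustrative) is to quote every field so the embedded commas don't break the CSV structure:

import csv
import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078], [21661725.86, -1770.2725]], columns=['A', 'B'])

# format every cell with a thousands separator, then quote all fields on write
formatted = df.applymap('{0:,}'.format)
formatted.to_csv('out.csv', quoting=csv.QUOTE_ALL, index=False)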
Extracting the hour from a datetime64[ns] variable
I have a column that I've converted to datetime using pandas, so it's now datetime64:

LOCAL_TIME_ON
2014-06-21 15:32:09
2014-06-07 20:17:13

I want to extract the hour to a new column. The only thing I've found that works is below; however, I get a SettingWithCopyWarning. Does anyone have a cleaner way I can do this?

TRIP_INFO_TIME['TIME_HOUR'] = pd.DatetimeIndex(TRIP_INFO_TIME['LOCAL_TIME_ON']).hour.astype(int)

C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.1.4\helpers\pydev\pydevconsole.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Try this:

TRIP_INFO_TIME['TIME_HOUR'] = TRIP_INFO_TIME['LOCAL_TIME_ON'].dt.hour
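Note that the .dt accessor alone won't make the SettingWithCopyWarning go away if TRIP_INFO_TIME was itself sliced from another dataframe. A common fix, assuming that's how the frame was built (the question doesn't show it), is to take an explicit copy first:

# if TRIP_INFO_TIME came from a slice of a larger dataframe, copy it first
TRIP_INFO_TIME = TRIP_INFO_TIME.copy()
TRIP_INFO_TIME['TIME_HOUR'] = TRIP_INFO_TIME['LOCAL_TIME_ON'].dt.hour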
Adding data to a Pandas dataframe
I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_State values are filled in. I started looking around to figure out how to fill in the missing states and came across the pyzipcode module, which takes a zip code as input and returns the state:

In [39]:
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state

Out[39]:
u'WI'

What I'm struggling with is how to iterate through the dataframe and add the appropriate Physician_Profile_State when that value is missing. Any suggestions would be most appreciated.
No need to iterate. If you have a plain dict mapping zip codes to states, you can pass it straight to map:

df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zip_to_state)

Otherwise you can call apply like so:

df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)

In the case where the above won't work because it can't generate a Series that aligns with your df, you can apply row-wise by passing axis=1:

df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda row: zcdb[row['Physician_Profile_Zip_Code']].state, axis=1)

By using double square brackets we get back a DataFrame, allowing us to pass the axis parameter.
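Since the goal is to fill in only the missing states, a masked .loc assignment keeps the values that are already present. A minimal sketch, assuming df and the pyzipcode lookup from the question:

from pyzipcode import ZipCodeDatabase

zcdb = ZipCodeDatabase()

# look up the state only where Physician_Profile_State is missing
missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = df.loc[missing, 'Physician_Profile_Zip_Code'].apply(lambda z: zcdb[z].state)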