How to Merge Two Dataframes Without Losing Any Rows - python-2.7

I have two data frames:
df1 =
Id ColA ColB ColC
1 aa bb cc
3 11 ww 55
5 11 bb cc
df2 =
Id ColD ColE ColF
1 ff ee rr
2 ww rr 55
3 hh 11 22
4 11 11 cc
5 cc bb aa
I need to merge these two data frames to get the following result:
result =
Id ColA ColB ColC ColD ColE ColF
1 aa bb cc ff ee rr
2 NaN NaN NaN ww rr 55
3 11 ww 55 hh 11 22
4 NaN NaN NaN 11 11 cc
5 11 bb cc cc bb aa
I do the merging this way:
import pandas as pd
result = pd.merge(df1,df2,on='Id')
However, my result looks like this instead of the expected result shown above:
result =
Id ColA ColB ColC ColD ColE ColF
1 aa bb cc ff ee rr
3 11 ww 55 hh 11 22
5 11 bb cc cc bb aa

According to the documentation of merge, you need to set the how parameter to 'outer'. The default is 'inner', which is consistent with what you're getting:
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join)
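As a minimal sketch of that fix, the two frames from the question can be rebuilt by hand and merged with how='outer'. All values are typed as strings here, which is an assumption about the original dtypes:

```python
import pandas as pd

# Frames rebuilt from the question; string dtypes are an assumption.
df1 = pd.DataFrame({
    'Id': [1, 3, 5],
    'ColA': ['aa', '11', '11'],
    'ColB': ['bb', 'ww', 'bb'],
    'ColC': ['cc', '55', 'cc'],
})
df2 = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'ColD': ['ff', 'ww', 'hh', '11', 'cc'],
    'ColE': ['ee', 'rr', '11', '11', 'bb'],
    'ColF': ['rr', '55', '22', 'cc', 'aa'],
})

# how='outer' keeps every Id found in either frame; columns from the
# frame that lacks a given Id are filled with NaN.
result = pd.merge(df1, df2, on='Id', how='outer')
print(result)
```

Because every Id in df1 also appears in df2, how='right' would return the same rows for this particular data; 'outer' is the safer choice when either frame may contain keys the other lacks.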

Looking for SAS coding to compute scores for a whole cohort based on scores calculated for a subgroup

I am looking for SAS code to compute scores for a whole cohort based on scores calculated for a subgroup.
I can create scores when the whole population is my dataset, but I have no experience using the fitted values from a subgroup dataset to compute scores for the whole population.
I work in SAS.
Welcome to Stack Overflow! If I understand your question correctly, this will do what you want.
I grabbed some data from sas support:
Data Neuralgia;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
run;
Then subsetted down to build a model using only the males:
data males;
set Neuralgia;
where sex = "M";
run;
Then I built a model and saved the model details, into the work library, in a file called theMaleModel.
proc logistic data=males outmodel=work.theMaleModel;
class Treatment;
model Pain = Treatment Age Duration ;
run;
Then I apply the male model to the full dataset and save the scored results into a dataset, in the work library, called scoreEverybody:
proc logistic inmodel=work.theMaleModel;
score data=Neuralgia out=scoreEverybody;
run;
You can see more examples like this if you look here. If that answers your question please click the check next to this answer.

Python pandas str.extract regex end of string

data.Hotel_Address.head(10)
0 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
1 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
2 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
3 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
4 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
5 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
6 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
7 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
8 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
9 s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
How can I extract country name after last space till the end of string with regex in pandas?
A regex is not necessary. Split each string on whitespace and select the last element of each resulting list with str[-1]:
data['new'] = data.Hotel_Address.str.split().str[-1]
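Since the question asks about a regex specifically, here is a small sketch comparing both approaches. The three addresses are made-up stand-ins (the real values are truncated in the output above), and the column name new_regex is only illustrative. The pattern (\S+)\s*$ captures the last run of non-space characters before the end of the string:

```python
import pandas as pd

# Hypothetical sample addresses, not taken from the original data set.
data = pd.DataFrame({'Hotel_Address': [
    '1 Example Street 1000 AA Amsterdam Netherlands',
    '2 Sample Road London United Kingdom',
    '3 Rue Exemple 75001 Paris France',
]})

# Approach from the answer: split on whitespace, keep the last token.
data['new'] = data.Hotel_Address.str.split().str[-1]

# Regex alternative: last block of non-space characters at end of string.
data['new_regex'] = data.Hotel_Address.str.extract(r'(\S+)\s*$', expand=False)

print(data[['new', 'new_regex']])
```

Both columns hold only the final word, so a multi-word country such as "United Kingdom" comes back as "Kingdom" with either approach.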

Sqoop --boundary-query

I have the data set below. Can someone help me import this data into HDFS using a Sqoop boundary query on the column id, which contains duplicate keys?
mysql> select id,name,age from employee;
id name age
1 A 30
2 B 35
3 C 40
4 D 23
5 E 26
1 A 24
2 B 16
3 C 78
4 G 66
3 H 56
4 A 63
20 C 58
13 F 47
2 A 49
3 B 60

Merge on one key after the other

I often end up with the following situation. I have a dataframe with two IDs
A = pd.DataFrame([[1,'a', 'a1'], [2, None, 'a2'], [3,'c', 'a3'], [4,'None', 'a3'], [None, 'e', 'a3'], ['None', 'None', 'None']], columns = ['id1', 'id2', 'colA'])
id1 id2 colA
0 1 a a1
1 2 None a2
2 3 c a3
3 4 None a3
4 None e a3
5 None None None
and I have another dataframe with additional info I want to add to the first dataframe
B = pd.DataFrame([[1,'a', 'b1', 'c1'], [2, 'b', 'b2', 'c2'], [3,'c', 'b3', 'c3'], [4, 'd', 'b4', 'c4'], [5, 'e', 'b5', 'c5'], [6, 'e', 'b5', 'c5']], columns = ['id1', 'id2', 'colB', 'colC'])
id1 id2 colB colC
0 1 a b1 c1
1 2 b b2 c2
2 3 c b3 c3
3 4 d b4 c4
4 5 e b5 c5
5 6 e b5 c5
I want to merge on id1, like this
A.merge(B, how='left', on='id1')
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN NaN NaN
5 None None None NaN NaN NaN
This is close to what I want. However for the failed lookups (that is when id1 is not available) I would like to merge on id2, so the result looks like
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN b5 c5
5 None None None NaN NaN NaN
What's the best way to achieve this? Note I don't really want 2 id2 columns in the result and id2 may have duplicates.
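IIUC, you can use fillna, but note that it fills the last row too. fillna(B) aligns on the row index and column names rather than matching on id2, so it gives the wanted values here only because the rows of A and B happen to line up.
df = A.merge(B, how='left', on='id1')
print(df)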
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN NaN NaN
5 None None None NaN NaN NaN
df = df.fillna(B)
print(df)
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN b5 c5
5 None None None NaN b5 c5
As EdChum mentioned in the comments, another option is combine_first, but the output is different:
print(A.combine_first(B))
colA colB colC id1 id2
0 a1 b1 c1 1 a
1 a2 b2 c2 2 b
2 a3 b3 c3 3 c
3 a3 b4 c4 4 None
4 a3 b5 c5 5 e
5 None b5 c5 None None
Timing comparison:
In [142]: %timeit A.combine_first(B)
100 loops, best of 3: 3.44 ms per loop
In [143]: %timeit A.merge(B, how='left', on='id1').fillna(B)
100 loops, best of 3: 2.89 ms per loop
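Neither approach above actually looks values up by id2. One possible sketch of a real key-by-key fallback is shown below; it is not the method from the answer. It assumes the missing ids in A are real None values rather than the string 'None', and that keeping the first row of B for a duplicated id2 is acceptable. The helper names value_cols, by_id1 and by_id2 are illustrative only:

```python
import pandas as pd

# Frames from the question, with missing ids as real None (an assumption).
A = pd.DataFrame(
    [[1, 'a', 'a1'], [2, None, 'a2'], [3, 'c', 'a3'],
     [4, None, 'a3'], [None, 'e', 'a3'], [None, None, None]],
    columns=['id1', 'id2', 'colA'])
B = pd.DataFrame(
    [[1, 'a', 'b1', 'c1'], [2, 'b', 'b2', 'c2'], [3, 'c', 'b3', 'c3'],
     [4, 'd', 'b4', 'c4'], [5, 'e', 'b5', 'c5'], [6, 'e', 'b5', 'c5']],
    columns=['id1', 'id2', 'colB', 'colC'])

value_cols = ['colB', 'colC']

# One lookup table per key; duplicates are dropped so each key is unique
# (the first matching row of B wins).
by_id1 = B.drop_duplicates('id1').set_index('id1')
by_id2 = B.drop_duplicates('id2').set_index('id2')

result = A.copy()
for col in value_cols:
    primary = A['id1'].map(by_id1[col])    # lookup on id1 first
    fallback = A['id2'].map(by_id2[col])   # lookup on id2 as a backup
    result[col] = primary.fillna(fallback)

print(result)
```

This keeps a single id2 column, fills row 4 through its id2 value 'e', and leaves the last row empty because neither key is available there.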

Use ODS Graphics to produce grouped histogram

I have this data set:
data a1q1;
input pid los age gender $ temp wbc anti service $ ;
cards;
1 5 30 F 99 82 2 M
2 10 73 F 98 52 1 M
3 6 40 F 99 122 2 S
4 11 47 F 98 42 2 S
5 5 25 F 99 112 2 S
6 14 82 M 97 61 2 S
7 30 60 M 100 81 1 M
8 11 56 F 99 72 2 M
9 17 43 F 98 72 2 M
10 3 50 M 98 122 1 S
11 9 59 F 98 72 1 M
12 3 4 M 98 32 2 S
13 8 22 F 100 111 2 S
14 8 33 F 98 141 1 S
15 5 20 F 98 112 1 S
16 5 32 M 99 92 2 S
17 7 36 M 99 61 2 S
18 4 69 M 98 62 2 S
19 3 47 M 97 51 2 M
20 7 22 M 98 62 2 S
21 9 11 M 98 102 2 S
22 11 19 M 99 141 2 S
23 11 67 F 98 42 2 M
24 9 43 F 99 52 2 S
25 4 41 F 98 52 2 M
;
I need to use PROC SGPLOT to produce a bar chart identical, or at least similar, to the one produced by the following PROC:
proc gchart data = a1q1;
vbar wbc / group = gender;
run;
I need PROC SGPLOT to group the two genders together and not stack them. I have tried coding this way but to no avail:
proc sgplot data = a1q1;
vbar wbc / group= gender response =wbc stat=freq nostatlabel;
run;
How would I go about coding to get the output I need?
Thank you for your time!
It sounds like you should use SGPANEL, not SGPLOT. SGPLOT can make grouped bar charts, but it cannot automatically create histogram bins without a format (you could do that if you want), and it doesn't support grouping with the histogram plot. SGPANEL, however, can handle that:
proc sgpanel data=a1q1;
panelby gender;
histogram wbc;
run;