Merge multiple rows with same value into one row in pandas - python-2.7

I have seen similar questions, but the answers did not quite fit my needs. I have a dataframe in which some of the rows have exactly the same values in every column:
Column1 Column2 Column3
0 a x x
1 a x x
2 a x x
3 d y y
4 d y y
What I would like to have is:
Column1 Column2 Column3
0 a x x
1 d y y
So basically I want to merge all rows that have the same values in all columns into one row. What is the cleanest way to do that in Python?
Thank you in advance!

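Call drop_duplicates. By default it compares all columns and keeps the first occurrence of each duplicated row, so the surviving rows keep their original index labels: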
In [214]:
df.drop_duplicates()
Out[214]:
Column1 Column2 Column3
0 a x x
3 d y y
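The output above keeps index labels 0 and 3, while the question asks for 0 and 1. A minimal sketch of one way to get that, assuming the frame is rebuilt from the sample in the question (the variable names `df` and `result` are only illustrative):

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Column1": ["a", "a", "a", "d", "d"],
    "Column2": ["x", "x", "x", "y", "y"],
    "Column3": ["x", "x", "x", "y", "y"],
})

# drop_duplicates keeps the first row of each set of identical rows;
# reset_index(drop=True) then renumbers the surviving rows from 0.
result = df.drop_duplicates().reset_index(drop=True)
print(result)
```

If the original labels matter later (for example to trace a row back to its source), skipping the reset_index call is probably the better choice.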

Related

Deleting first instance of a column after group by in sas proc sql

I have the following SAS dataset.
correlation policynum risknum
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
In SAS, I want to filter the above dataset so I get my final output as:
correlation policynum risknum
B X Y
B X Y
B X Y
B X L
B X L
B X L
D Z M
D Z M
D Z M
That is, for each group of policynum and risknum, if multiple values exist for correlation, I want to keep the rows with the second value and drop the rows with the first value.
If only a single value of correlation exists for a group of policynum and risknum, I want to retain that group in my final output too.
What would be the best way to do this? It might be something simple as I am relatively new to SAS.
Thanks in advance!
If sorting the correlation values gives the same order in which they appear row-wise in the data set, you can use SQL. Otherwise SQL cannot be used: it is based on set theory and has no implicit row numbers. In that case a DATA step with a DOW loop can be used.
FYI, SAS coders commonly use the phrase 'DOW loop' for a DO loop that has the SET and BY statements inside it.
Example:
data have;
input correlation $ policynum $ risknum $;
datalines;
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
;
/* keep last group of a nested group */
* SQL can be used only if correlation wanted is ALWAYS highest valued correlation;
proc sql;
create table want as
select * from have
group by policynum, risknum
having correlation = max(correlation)
;
quit;
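* DATA Step DOW loops can be used when correlation wanted is last occurring correlation within by group;
data want;
/* first pass: read one policynum/risknum group to find its last correlation */
do _n_ = 1 by 1 until (last.risknum);
set have;
by policynum risknum notsorted; /* presume at least contiguous */
end;
_want_correlation = correlation;
/* second pass: re-read the same _n_ rows and output only the matching ones */
do _n_ = 1 to _n_;
set have;
if _want_correlation = correlation then OUTPUT;
end;
drop _want_correlation;
run;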
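For readers more at home in Python, the two-pass idea can be sketched as follows. This is only an illustration of the logic, not SAS: it assumes rows are (correlation, policynum, risknum) tuples, groups on policynum and risknum as the question describes, and the function name `keep_last_correlation` is made up for the example.

```python
from itertools import groupby

# Sample rows from the question: (correlation, policynum, risknum).
rows = (
    [("A", "X", "Y")] * 3
    + [("B", "X", "Y")] * 3
    + [("B", "X", "L")] * 3
    + [("C", "Z", "M")] * 3
    + [("D", "Z", "M")] * 3
)


def keep_last_correlation(rows):
    """Per contiguous (policynum, risknum) group, keep rows with the last correlation."""
    kept = []
    # groupby only merges adjacent rows, so the data must be at least
    # contiguous by group, the same assumption as BY ... NOTSORTED.
    for _, group in groupby(rows, key=lambda r: (r[1], r[2])):
        group = list(group)           # first pass: read the whole group
        wanted = group[-1][0]         # correlation of the group's last row
        kept.extend(r for r in group if r[0] == wanted)  # second pass
    return kept


want = keep_last_correlation(rows)
for row in want:
    print(row)
```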

COUNTING VALUE PER PARTICIPANTS

I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a character variable called KEYVAR with three different values. A participant can appear multiple times in my dataset, and each of their rows can hold the same or a different value of KEYVAR. I want to create a new variable called NEWVAR that counts how many times a participant has a specific value of KEYVAR; when a participant has no observation with that value, NEWVAR should be zero.
Here's an example of the dataset I would like (in this example, NEWVAR counts every instance of "Y" per participant):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate over the rows of a group that meet a criterion, and have that aggregate value automatically merged back onto every row of the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-'
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
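The same per-participant count can be sketched in plain Python, which may help when checking the SAS result by hand. This is an illustration only: it assumes rows are (PARTICIPANT, KEYVAR) tuples, and the helper name `add_count` is hypothetical.

```python
from collections import Counter

# Sample rows from the question: (PARTICIPANT, KEYVAR).
have = [
    ("A", "Y"), ("A", "N"),
    ("B", "Y"), ("B", "Y"), ("B", "Y"),
    ("C", "W"), ("C", "N"), ("C", "W"),
    ("D", "Y"), ("D", "N"), ("D", "N"), ("D", "Y"), ("D", "W"),
]


def add_count(rows, target="Y"):
    """Append to every row the number of times its participant has `target`."""
    # Count only the rows whose value equals the target.
    counts = Counter(pid for pid, value in rows if value == target)
    # A Counter returns 0 for participants that never had the target value.
    return [(pid, value, counts[pid]) for pid, value in rows]


want = add_count(have)
for row in want:
    print(row)
```

The count is attached to every row of the participant, including rows whose own value is not the target, which matches the "want" table in the question.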

How to get only last 4 WORKING days data in SAS?

I'm trying to pull only the last 4 working days of data in SAS. I tried the following code, but it does not give me what I intended:
data input;
Input id $ id1 $ id2 $ num date date9.;
Format Date Date9.;
datalines;
x y z 3 19JUL2015
x y z 2 18JUL2015
x y z 3 17JUL2015
x y z 2 16JUL2015
x y z 3 15JUL2015
x y z 2 14JUL2015
x y z 3 13JUL2015
a b c 1 12JUL2015
a b c 1 11JUL2015
a b c 1 10JUL2015
a b c 1 09JUL2015
a b c 1 08JUL2015
a b c 2 07JUL2015
x y z 1 06JUL2015
;
Run;
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
*if intck('weekday',Date,today()) >4;
if 1<Weekday(Date)<7 and Date>=today()-4;
Run;
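I think you need to reverse the > in your commented-out INTCK test, so that it keeps dates at most 4 weekday intervals back, and add a condition that the date itself is a weekday. The example below uses the fixed date '20JUL2015'd in place of today() in the filter: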
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
if intck('weekday',Date,'20JUL2015'd) le 4 and 1<weekday(Date)<7;
*if 1<Weekday(Date)<7 and Date>='20JUL2015'd-5;
Run;
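As a cross-check on which dates should survive the filter, here is a small Python sketch that lists the last n working days before a reference date. It is not a reimplementation of INTCK; it assumes Monday to Friday are working days, ignores holidays, and excludes the reference date itself. The function name `last_working_days` is made up for the example.

```python
from datetime import date, timedelta


def last_working_days(reference, n=4):
    """Return the n most recent Monday-Friday dates strictly before reference."""
    days = []
    current = reference - timedelta(days=1)
    while len(days) < n:
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days.append(current)
        current -= timedelta(days=1)
    return days


reference = date(2015, 7, 20)  # fixed date used in the answer (a Monday)
wanted = last_working_days(reference)
for day in wanted:
    print(day.isoformat())
```

With the reference date 20JUL2015 this yields 17JUL2015, 16JUL2015, 15JUL2015 and 14JUL2015, skipping the weekend of 18 and 19 July.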

Column combine two datasets of different size

I have two datasets of the following structure
ID1 Cat1
1 a
2 a
3 b
4 b
5 b
6 c
7 d
and
ID2 Cat2
11 z
12 z
13 z
14 y
15 x
I want to column-combine them and have the unmatched rows just be missing. So ultimately I want:
ID1 Cat1 ID2 Cat2
1 a 11 z
2 a 12 z
3 b 13 z
4 b 14 y
5 b 15 x
6 c
7 d
The purpose of this is that I have two sorted datasets (by ID) and want to do a matching of the first category (Cat1) with the second (Cat2). The second category has a predefined number of "slots" and those slots should be matched on the order of the IDs. The only relationship between ID1 and ID2 is that they are ordered the same way. So the two lowest should be a match and so on.
You want a one-to-one merge, which is described in the SAS documentation.
To do a one-to-one merge you just merge without a BY statement.
This type of merge simply matches observations by row number, so be careful: it may give unintended results if a row you expected is missing or the data is otherwise not as you assumed.
For example:
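* PROC SORT needs a BY statement, so sort each data set by its own ID;
proc sort data = have1; by ID1; run;
proc sort data = have2; by ID2; run;
* no BY statement here, so rows are paired by position;
data want;
merge have1 have2;
run;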
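The row-by-row pairing can be illustrated in Python with itertools.zip_longest. This is a sketch of the idea only, not of SAS itself: the rows are typed in by hand, the ID1 values follow the desired output in the question, and None stands in for a missing value.

```python
from itertools import zip_longest

# (ID1, Cat1) and (ID2, Cat2) rows; ID1 values follow the desired output.
have1 = [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "b"), (6, "c"), (7, "d")]
have2 = [(11, "z"), (12, "z"), (13, "z"), (14, "y"), (15, "x")]

# Sort both sides by ID, then pair rows purely by position.
# Rows without a partner get (None, None) for the absent columns.
want = [
    left + right
    for left, right in zip_longest(
        sorted(have1), sorted(have2), fillvalue=(None, None)
    )
]

for row in want:
    print(row)
```

As with the SAS merge, nothing checks that the paired rows belong together; a single missing row on either side shifts every pairing after it.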

SQL Left Join logic in SAS Merge or Data step

I have below two datasets and need the third dataset as an output.
ONE          TWO
----------   ----------
ID  FLAG     NUMB
1   N        2
2   Y        3
3   Y        9
4   N        2
5   N        3
9   Y        9
10  Y
OUTPUT
-------
ID  FLAG  NEW
1   N     N
2   Y     Y
3   Y     Y
4   N     N
5   N     N
9   Y     Y
10  Y     N
If ONE.ID is found in TWO.NUMB and its ONE.FLAG = Y, then the new variable NEW = Y;
otherwise NEW = N.
I was able to do this using PROC SQL as below.
proc sql;
create table output as
(
select distinct id, flag, case when numb is null then 'N' else 'Y' end as NEW
from one
left join
two
on id = numb
and flag = 'Y'
);
quit;
Could this be done in DATA step/MERGE?
Since you already have an SQL attempt, here is an improvement on it.
This SQL step uses a subquery, so it does not require a join:
proc sql noprint;
create table output as
select distinct *, case
/* NEW is Y only when the row is flagged Y and its ID appears in TWO.NUMB */
when flag = "Y" and id in (select distinct numb from two) then "Y"
else "N"
end as new
from one
;
quit;
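The rule itself (NEW = Y only when FLAG is Y and the ID occurs in TWO.NUMB) is a simple membership test, sketched here in Python to make the expected output easy to verify. The data is typed in from the question, and the names `one`, `two` and `output` just mirror the data set names.

```python
# ONE as (ID, FLAG) rows and TWO as a list of NUMB values, from the question.
one = [(1, "N"), (2, "Y"), (3, "Y"), (4, "N"), (5, "N"), (9, "Y"), (10, "Y")]
two = [2, 3, 9, 2, 3, 9]

# A set removes the duplicate NUMB values and makes lookups cheap.
numbs = set(two)

# NEW is "Y" only when the row is flagged "Y" and its ID appears in TWO.
output = [
    (id_, flag, "Y" if flag == "Y" and id_ in numbs else "N")
    for id_, flag in one
]

for row in output:
    print(row)
```

Because TWO is reduced to a set of keys before the lookup, duplicate NUMB values cannot multiply the rows of ONE, which is the reason the original join needed DISTINCT.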