I'm trying to sum some columns grouped by several other columns, and then produce a new table containing the results.
Say I have the following data:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 3    1
AAAA BBBB CCCC DDDD 5    1
WWWW XXXX YYYY ZZZZ 1    4
WWWW XXXX YYYY ZZZZ 8    2
I want to sum Col5 and Col6 (separately) wherever Col1 to Col4 are the same, i.e. the output I want is:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 8    2
WWWW XXXX YYYY ZZZZ 9    6
I've put my code below, but it's giving me the following:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 8    2
AAAA BBBB CCCC DDDD 8    2
WWWW XXXX YYYY ZZZZ 9    6
WWWW XXXX YYYY ZZZZ 9    6
Any help would be greatly appreciated to:
a) get this code to work.
b) show me a better (more efficient?) way of doing this. I think I've massively(!) overcomplicated this (I'm very new to SAS!).
--- Code ---
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
data test1;
set XXX;
groupID = put(md5(upcase(catx('|',Col1,Col2,Col3,Col4))),hex32.);
run;
proc sort data = test1;
by groupID;
run;
proc summary data = test1;
var Col5 Col6;
by groupID;
Output out = want sum=;
run;
proc sql;
create table test1_results as
select b.Col1,b.Col2,b.Col3,b.Col4, a.*
from want as a
left join test1 as b
on a.groupID = b.groupID;
run;
data Final_table;
set test1_results;
Keep Col1 Col2 Col3 Col4 Col5 Col6;
run;
I think all you need is PROC SUMMARY. The remaining steps are unnecessary.
Key concept: BY and CLASS statements accept multiple variables.
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
proc summary data=xxx NWAY noprint;
class col1 col2 col3 col4;
var Col5 Col6;
Output out=want (drop=_type_ _freq_) sum=;
run;
proc print data=want;run;
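For readers who think more easily in a general-purpose language, here is a minimal Python sketch of the same idea: group on the combination of Col1 to Col4 and sum Col5 and Col6 per group. It is only an illustration of the logic, not of how PROC SUMMARY works internally; the function name sum_by_group is made up for this example. It also hints at why the original attempt returned four rows: the two summary rows were joined back to the four detail rows, so each group appeared once per detail row.

```python
from collections import defaultdict

# Same sample rows as the DATALINES block above.
rows = [
    ("AAAA", "BBBB", "CCCC", "DDDD", 3, 1),
    ("AAAA", "BBBB", "CCCC", "DDDD", 5, 1),
    ("WWWW", "XXXX", "YYYY", "ZZZZ", 1, 4),
    ("WWWW", "XXXX", "YYYY", "ZZZZ", 8, 2),
]


def sum_by_group(rows):
    """Sum the last two fields for each distinct combination of the first four."""
    totals = defaultdict(lambda: [0, 0])
    for col1, col2, col3, col4, col5, col6 in rows:
        key = (col1, col2, col3, col4)  # composite key, in the spirit of CLASS col1 col2 col3 col4
        totals[key][0] += col5
        totals[key][1] += col6
    return {key: tuple(sums) for key, sums in totals.items()}


totals = sum_by_group(rows)
for key, (sum5, sum6) in sorted(totals.items()):
    print(*key, sum5, sum6)
```

No hash of the key columns is needed: the tuple of the four values is already a perfectly good group identifier.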
The required condition is
"SUBJECT in A = SUBJECT in B
and
VISIT in A NE (not equal to) VISIT in B"
I would like to find the exact mismatched and missing VISIT values between Tables A and B below using PROC SQL. Can anyone help me, please?
Table A
SUBJECT Test VISIT
1001 ABCD 1
1001 ABCD 2
1001 ABCD 3
1001 ABCD 5
Table B
SUBJECT Test VISIT1
1001 ABCD 2
1001 ABCD 1
1001 ABCD 4
Expected output:
SUBJECT Test VISIT VISIT1
1001    ABCD 3     .
1001    ABCD 5     .
1001    ABCD .     4
Visits 3 and 5 are present in dataset A but not in B, and visit 4 is present in dataset B but not in A, and so on.
Code for the datasets:
DATA A;
LENGTH SUBJECT 8 Test $10 visit 8;
INPUT SUBJECT Test $ visit ;
DATALINES;
1001 ABCD 1
1001 ABCD 2
1001 ABCD 3
1001 ABCD 5
;
RUN;
DATA B;
LENGTH SUBJECT 8 Test $10 visit1 8;
INPUT SUBJECT Test $ visit1 ;
DATALINES;
1001 ABCD 2
1001 ABCD 1
1001 ABCD 4
;
RUN;
Thanks in advance!
The code I tried is below (but it is not working as expected):
****************(VISIT ) in A and not in B****;
proc sql;
create table SS1 as
select distinct a.* FROM
A a where a.visit not in(select s.visit1 from B s WHERE A.SUBJECT = S.SUBJECT );
create table INRAVE as
select * from SS1 A
left join
B B
on a.subject=b.SUBJECT and a.VISIT NE b.VISIT1
where b.SUBJECT is not null
;
quit;
****************VISIT in B and not in A****;
proc sql;
create table SS2 as
select distinct a.* from
B a where a.VISIT1 not in(select S.VISIT from A s WHERE A.SUBJECT = S.SUBJECT );
create table INVENDOR as
select * from SS2 A
left join
A B
on a.subject=b.SUBJECT and a.VISIT1 NE b.VISIT
where b.SUBJECT is not null
;
quit;
data ALL;;
set inrave invendor;
where subject=subject ;
RUN;
It seems you know SQL well, so why not try UNION ALL, like this:
proc sql noprint;
create table C as
select *, 'A' as Source from A
where catx('#',SUBJECT,Test,visit) not in (
select distinct catx('#',SUBJECT,Test,visit1) from B
)
union all corr
select *, 'B' as Source from B(rename=VISIT1=VISIT)
where catx('#',SUBJECT,Test,visit) not in (
select distinct catx('#',SUBJECT,Test,visit) from A
)
;
create table D(drop=TmpVISIT Source) as
select *,
case when Source = 'B' then . else TmpVISIT end as VISIT,
case when Source = 'B' then TmpVISIT else . end as VISIT1
from C(rename=VISIT=TmpVISIT);
quit;
This keeps all observations from dataset A that do not also appear in dataset B, and does the opposite for dataset B.
I also have another solution, which is shorter:
proc sql noprint;
select catx('#',SUBJECT,Test,visit) into :Ununique separated by '" "' from (
select * from A union all select * from B(rename=visit1=visit)
)
group by SUBJECT, Test, visit
having count(*) > 1;
quit;
data D;
set A B;
if catx('#',SUBJECT,Test,coalesce(visit1,visit)) in ("&Ununique") then delete;
run;
However, this method is limited by the maximum length of a macro variable.
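Both approaches compute a symmetric difference: rows in A with no match in B, plus rows in B with no match in A. A minimal Python sketch of that logic follows, assuming each row is a (SUBJECT, Test, visit) tuple; the function name unmatched is hypothetical, and None stands in for a SAS missing value.

```python
# Sample rows copied from the DATALINES blocks for datasets A and B.
a = [(1001, "ABCD", 1), (1001, "ABCD", 2), (1001, "ABCD", 3), (1001, "ABCD", 5)]
b = [(1001, "ABCD", 2), (1001, "ABCD", 1), (1001, "ABCD", 4)]


def unmatched(a_rows, b_rows):
    """Return (subject, test, visit, visit1) rows present in only one input."""
    a_keys, b_keys = set(a_rows), set(b_rows)
    # Rows only in A fill the VISIT column; VISIT1 stays missing.
    only_a = [(s, t, v, None) for (s, t, v) in sorted(a_keys - b_keys)]
    # Rows only in B fill the VISIT1 column; VISIT stays missing.
    only_b = [(s, t, None, v) for (s, t, v) in sorted(b_keys - a_keys)]
    return only_a + only_b


result = unmatched(a, b)
for row in result:
    print(row)
```

Comparing whole tuples plays the role of the catx('#', ...) key in the SQL version, without the need to build a delimited string.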
I am trying to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried many variations of code: some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 2
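First, read up on the subsetting IF statement: an IF with no THEN drops every observation that fails the condition, which is why your step keeps only the last row.
I've edited your code to make it work: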
proc sort data=have;
by prod_id store cust;
run;
data want(drop=cust week);
set have;
by prod_id store cust;
retain count;
/* reset at the start of each prod_id/store group */
if first.store then count = 0;
/* count each customer once, on its last row */
if last.cust then count = count + 1;
/* write one row per prod_id/store group */
if last.store then output;
run;
If you have questions, ask.
The only place where the result of the COUNT() aggregate function in SQL might be confusing is that it will not count missing values of the variable.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id ,store
;
quit;
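The counting logic itself is small enough to sketch in plain Python: collect the distinct customers seen for each (prod_id, store) pair and report the size of each set. This is an illustration under assumptions, not the SAS implementation; customers_per_store is a made-up name, the week column is left out because it does not affect the count, and a missing customer is represented by None. Note that for the sample rows exactly as listed in the question, store EFG has a single customer (C) for prod_id 2.

```python
from collections import defaultdict

# One (prod_id, cust, store) tuple per weekly row in the sample data.
have = (
    [(1, "A", "ABC")] * 4 + [(1, "B", "ABC")] * 5
    + [(1, "C", "XYZ")] * 2 + [(1, "F", "XYZ")] * 2
    + [(2, "A", "ABC")] * 4 + [(2, "C", "EFG")] * 5
)


def customers_per_store(rows):
    """Count distinct non-missing customers for each (prod_id, store)."""
    customers = defaultdict(set)
    for prod_id, cust, store in rows:
        if cust is not None:  # mirror COUNT(DISTINCT ...), which skips missing values
            customers[(prod_id, store)].add(cust)
    return {key: len(custs) for key, custs in customers.items()}


counts = customers_per_store(have)
for (prod_id, store), n in sorted(counts.items()):
    print(prod_id, store, n)
```

Using a set per group means repeated weekly rows for the same customer cannot inflate the count, which is the problem the original DATA step ran into.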
SAS PROC SQL lets you do a count(distinct colname) based on one or more GROUP BY dimensions. What is the fastest way to achieve the same thing for SUM(distinct colname)?
data: have
grp1 grp2 col1 col2
a b 20 .
a b 30 10
a b 20 10
a b . 10
data want:
grp1 grp2 col1_sum col2_sum
a b 50(20+30) 10
So basically, for the dimension (a,b), I need a sum of the distinct values in col1 and col2.
sum(distinct col) as mentioned in your question should work:
data have;
input grp1 $ grp2 $ col1 col2;
datalines;
a b 20 .
a b 30 10
a b 20 10
a b . 10
;run;
proc sql;
select
grp1, grp2,
sum(distinct col1) as s1,
sum(distinct col2) as s2
from have
group by grp1, grp2;
quit;
... should yield results:
grp1 grp2 s1 s2
---- ---- ---- ----
a b 50 10
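To make the semantics explicit, here is a small Python sketch of "sum of distinct values per group": for each (grp1, grp2) pair, collect the distinct non-missing values of each column and add them up. It assumes missing values are represented by None, and sum_distinct is a name invented for this example.

```python
from collections import defaultdict

# Sample rows from the question; None stands in for a SAS missing value (.).
have = [
    ("a", "b", 20, None),
    ("a", "b", 30, 10),
    ("a", "b", 20, 10),
    ("a", "b", None, 10),
]


def sum_distinct(rows):
    """Sum the distinct non-missing values of col1 and col2 per (grp1, grp2)."""
    seen = defaultdict(lambda: (set(), set()))
    for grp1, grp2, col1, col2 in rows:
        values1, values2 = seen[(grp1, grp2)]
        if col1 is not None:
            values1.add(col1)
        if col2 is not None:
            values2.add(col2)
    return {key: (sum(v1), sum(v2)) for key, (v1, v2) in seen.items()}


result = sum_distinct(have)
print(result)
```

The duplicate 20 in col1 and the repeated 10 in col2 are each counted once, matching the 50 and 10 in the expected output.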
I have a dataframe (df1) with only one column (col1) having identical values while other columns have missing values, for example as follows:
df1
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 NaT 120 NaN 115 XYZ
1| 1234 2015/01/12 120 Abc 115 NaN
2| 1234 2015/01/12 NaN NaN NaN NaN
I would like to merge the three rows with identical col1 values into one row such that the missing values are replaced with values from the other rows where the values exist in place of missing values. The resulting df will look like this:
result_df
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 2015/01/12 120 Abc 115 XYZ
Can anyone help me with this issue? Thanks in advance!
If your real column names contain duplicates (here col3 and col4 each appear twice), first make them unique:
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 NaT 120.0 NaN 115.0 XYZ
1 1234 2015-01-12 120.0 Abc 115.0 NaN
2 1234 2015-01-12 NaN NaN NaN NaN
Then aggregate with first(), which returns the first non-missing value of each column within each group:
df = df.groupby('col1', as_index=False).first()
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 2015-01-12 120.0 Abc 115.0 XYZ
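What first() does per group can be sketched without pandas: walk the rows in order and, for each column, keep the first value that is not missing. The plain-Python version below assumes rows are dicts with None for missing values and uses the column names from the question; first_non_missing is a hypothetical helper, not a pandas function.

```python
# Rows from the question's df1, with None in place of NaN/NaT.
rows = [
    {"col1": 1234, "col2": None, "col3": 120, "col4": None, "col5": 115, "col6": "XYZ"},
    {"col1": 1234, "col2": "2015/01/12", "col3": 120, "col4": "Abc", "col5": 115, "col6": None},
    {"col1": 1234, "col2": "2015/01/12", "col3": None, "col4": None, "col5": None, "col6": None},
]


def first_non_missing(rows, key):
    """Collapse rows sharing row[key], keeping the first non-missing value per column."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[key], {})
        for col, value in row.items():
            if target.get(col) is None:  # nothing usable stored for this column yet
                target[col] = value
    return list(merged.values())


result = first_non_missing(rows, "col1")
print(result)
```

If two rows in a group disagree on a non-missing value, the earlier row wins, so the row order matters just as it does with first().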
I have a df with
col0 col1 col2 col3
a 1 2 text1
b 1 2 text2
c 1 3 text3
and I have another text file with
col0 col1 col2
met1 a text1
met2 b text2
met3 c text3
How do I match the values in col3 of my first df against col2 of the text file, and add only the matching col0 string from the text file as a new column, without changing the structure of the df?
desired output:
col0 col1 col2 col3 col4
a 1 2 text1 met1
b 1 2 text2 met2
c 1 3 text3 met3
You can use pandas.DataFrame.merge(). E.g.:
df.merge(df2.loc[:, ['col0', 'col2']], left_on='col3', right_on='col2')
print(df)
col0 col1 col2 col3
0 a 1 2 text1
1 b 1 2 text2
2 c 1 3 text3
print(df2)
col0 col1 col2
0 met1 a text1
1 met2 b text2
2 met3 c text3
Merge df and df2:
df3 = df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))
Housekeeping: rename the column you want to keep and drop the rest.
df3 = df3.rename(columns={'col0_1':'col4'}).drop(['col1_1','col2_1'], axis=1)
print(df3)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
Then reassign to df if you wish:
df = df3
OR
df = df.assign(col4=df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))['col0_1'])
print(df)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
Call your df df1. First load the text file into a dataframe using df2 = pd.read_csv('filename.txt') (pass sep=r'\s+' if the file is whitespace-delimited rather than comma-separated). Next, rename the columns in df2 so that the column on which you want to merge has the same name in both dataframes:
df2.columns = ['new_col1', 'new_col2', 'col3']
Then:
pd.merge(df1, df2, on='col3')
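Conceptually, the merge is a keyed lookup: build a mapping from the text file's col2 to its col0, then look up each row's col3 in it. The plain-Python sketch below shows that idea on the sample data, assuming rows are dicts and the key values in the text file are unique. Unlike the default inner merge, it keeps rows with no match and fills col4 with None, which is closer to a left join.

```python
# Rows of the first dataframe.
df_rows = [
    {"col0": "a", "col1": 1, "col2": 2, "col3": "text1"},
    {"col0": "b", "col1": 1, "col2": 2, "col3": "text2"},
    {"col0": "c", "col1": 1, "col2": 3, "col3": "text3"},
]
# Rows of the text file.
lookup_rows = [
    {"col0": "met1", "col1": "a", "col2": "text1"},
    {"col0": "met2", "col1": "b", "col2": "text2"},
    {"col0": "met3", "col1": "c", "col2": "text3"},
]

# Map the join key (text file col2) to the value we want to bring over (text file col0).
lookup = {row["col2"]: row["col0"] for row in lookup_rows}

# Add col4 to every original row; unmatched keys get None.
result = [{**row, "col4": lookup.get(row["col3"])} for row in df_rows]
for row in result:
    print(row)
```

The original columns and row order are untouched; only col4 is appended.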