Below given dataset I am trying to find New Users Vs Repeated Users.
DATE ID Unique_Event
20200901 a12345 1
20200902 a12345 1
20200903 b12345 1
20200903 a12345 1
20200904 c12345 1
In the above dataset, since a12345 appeared on multiple dates, should be counted as a "repeated" user whereas b12345 only appeared once, so he is a "new" user. Please note, this is only sample data as the actual data is quite large. I tried the below code, but I am not getting the correct count. Ideally, tot_num_users-num_new_users should be repeated users, but I am getting incorrect counts. Am I missing something?
Expected Output:
Month new_users repeated_users
9 2 1
Code:
data user_events;
set user_events;
new_date=input(date,yymmdd10.);
run;
proc sql;select month(new_date) as mm,
count(distinct vv.id) as total_num_users,
count(distinct case when v.new_date = vv.minva then v.id end) as num_new_users,
(count(distinct vv.id) - count(distinct case when v.new_date = vv.minva then id end)
) as num_repeated_users
from user_events v inner join
(select t.id, min(new_date) as minva
from user_events t
group by t.id
) vv
on v.id = vv.id
group by 1
order by 1;quit;
In a sub-select, for each ID you can count the number of distinct DATE to determine the new / repeated status. The all ids aggregate computations are made from the sub-select.
proc sql;
create table freq as
select
count(*) as id_count
, sum (status='repeated') as id_repeated_count /* sum counts a logic eval state */
, sum (status='new') as id_new_count
from
( select
id
, case
when count(distinct date) > 1 then 'repeated'
else 'new'
end as status
from
user_events
group by
id
) as statuses
;
An alternative solution not using proc sql (though I'm aware you tagged this with "proc sql").
data final;
set user_events;
Month=month(new_date);
run;
proc sort data=final; by Month ID;
data final;
set final;
by Month ID;
if first.Month then do;
new_users=0;
repeated_users=0;
end;
if last.ID then do;
if first.ID then
new_users+1;
else
repeated_users+1;
end;
if last.Month then
output;
keep Month new_users repeated_users;
run;
Since you are using proc sql, this is a sql question, not a SAS question.
Try something like:
proc sql;
select ID,count(Unique_Event)
from <that table>
group by ID
order by ID
run;
Related
I have a dataset with the first 4 columns and I want to create the last column. My dataset has millions of records.
ID
Date
Code
Event of Interest
Want to Create
1
1/1/2022
101
*
201
1
1/1/2022
201
yes
201
1
1/1/2022
301
*
201
1
1/1/2022
401
*
201
2
1/5/2022
101
*
301
2
1/5/2022
201
*
301
2
1/5/2022
301
yes
301
I want to group records by ID and date. If one of the records in the grouping has a 'yes' in the event of interest variable, I want to assign that code to the entire grouping. I am using base SAS.
Any ideas?
Assuming that you will only have one yes value for each id and date, you can use a lookup table and merge them together. Here are a few ways to do it.
1. Self-merge
Simply merge the data onto itself where event = yes.
data want;
merge have
have(rename=(code = new_code
event = _event_)
where =(upcase(_event_) = 'YES')
)
;
by id date;
drop _event_;
run;
2. SQL Self-join
Same as above, but using a SQL inner join.
proc sql;
create table want as
select t1.*
, t2.code as new_code
from have as t1
INNER JOIN
have as t2
ON t1.id = t2.id
AND t1.date = t2.date
where upcase(t2.event) = 'YES'
;
quit;
3. Hash lookup table
This is more advanced but can be quite performant if you have the memory. Notice that it looks very similar to our merge statement in Option 1. We're creating a lookup table, loading it to memory, and using a hash join to pull values from that in-memory table. h.Find() will check the unique combination of (id, date) in the value read from the set statement against the hash table in memory. If a match is found, it will pull the value of new_code.
data want;
set have;
if(_N_ = 1) then do;
dcl hash h(dataset: "have(rename=(code= new_code)
where =(upcase(event) = 'YES')
)"
, hashexp:20);
h.defineKey('id', 'date');
h.defineData('new_code');
h.defineDone();
call missing(new_code);
end;
rc = h.Find();
drop rc;
run;
You could just remember the last value of CODE you want for the group by using a double DOW loop.
In the first loop copy the code value to the new variable. The second loop can re-read the observations and write them out with the extra variable filled in.
data want;
do until (last.date);
set have;
by id date ;
if 'Event of Interest'n='yes' then 'Want to Create'n=code;
end;
do until (last.date);
set have;
by id date;
output;
end;
run;
I have a data that has column A with following data
Column A
--------
1
2
?
2
I used the query:
proc sql;
select
if A= '?' then A=., count(*) as N_obs
from freq_sex_Partner
group by Number_of_sexual_partners;
quit;
This is not working. Please suggest how can i replace the ? to any standard value?
In SQL it's a CASE statement, not IF/THEN.
proc sql;
select
case when a='?' then .
else a end as a, count(*) as N_obs
from freq_sex_Partner
group by Number_of_sexual_partners;
quit;
Or you could use an IFC() function as well.
proc sql;
select
ifc(a='?', ., a) as a, count(*) as N_obs
from freq_sex_Partner
group by Number_of_sexual_partners;
quit;
Column A contains "?" so it is character valued. The #reeza code should be then "" or ifc(a='?',"", a). Also, if you do not also select the grouping variable the context of the N_obs is lost.
Suggest
data have;
input a $ nsp ;
datalines;
1 2
2 3
? 7
2 7
run;
proc sql;
select
nsp
, case when a='?' then '' else a end as a
, count(*) as nsp_count
from have
group by nsp
;
quit;
The query will also log the message NOTE: The query requires remerging summary statistics back with the original data. as Proc SQL is performing an automatic remerge of group aggregates with individual rows within the group.
I have monthly datasets in SAS Library for customers from Jan 2013 onwards with datasets name as CUST_JAN2013,CUST_FEB2013........CUST_OCT2017. These customers datasets have huge records of 2 million members for each month.This monthly datset has two columns (customer number and customer monthly expenses).
I have one input dataset Cust_Expense with customer number and month as columns. This Cust_Expense table has only 250,000 members and want to pull expense data for each member from SPECIFIC monthly SAS dataset by joining customer number.
Cust_Expense
------------
Customer_Number Month
111 FEB2014
987 APR2017
784 FEB2014
768 APR2017
.....
145 AUG2017
345 AUG2014
I have tried using call execute, but it tries to loop thru each 250,000 records of input dataset (Cust_Expense) and join with corresponding monthly SAS customer tables which takes too much of time.
Is there a way to read input tables (Cust_Expense) by month so that we read all customers for a specific month and then read the same monthly table ONCE to pull all the records from that month, so that it does not loop 250,000 times.
Depending on what you want the result to be, you can create one output per month by filtering on cust_expenses per month and joining with the corresponding monthly dataset
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expenses
;
quit;
proc sql;
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
create table want_&month. as
select *
from cust_expense(where=(month="&month.")) t1
inner join cust_&month. t2
on t1.customer_number=t2.customer_number
;
%end;
quit;
%mend;
%want;
Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expenses
;
quit;
proc sql;
create table want as
select *
from cust_expense t1
inner join (
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
%if &i>1 %then union;
select *, "&month." as month
from cust_&month
%end;
) t2
on t1.customer_number=t2.customer_number
and t1.month=t2.month
;
quit;
%mend;
%want;
In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.
Your first, best answer is to get rid of these monthly separate tables and make them into one large table with ID and month as key. Then you can simply join on this and go on your way. Having many separate tables like this where a data element determines what table they're in is never a good idea. Then index on month to make it faster.
If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.
data want;
set jan feb mar apr ... ;
where put(id,CUSTEXPF1.) = '1';
run;
That only makes one pass through the 250k table and one pass through the monthly tables, plus the very very fast format lookup which is undoubtedly zero cost in this data step (as the disk i/o will be slower).
I guess you could output your data in specific dataset like this example :
data test;
infile datalines dsd;
input ID : $2. MONTH $3. ;
datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
;
run;
data JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;
You will avoid to loop through all your ID (250000)
and you will use dataset statement from SAS
At the end you will get 12 DATASET containing the ID related.
If you case, FEB2014 , for example, you will use a substring fonction and the condition in your dataset will become :
...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...
Regards
I have a consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations if there occurs more than five zeros in spending. That is, the household did not make any purchase for five weeks. Once identified, I will delete all observations associated with the household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think proc SQL would be good here.
This could be done in a single step with a more complex subquery but it is probably better to break it down into 2 steps.
Count how many zeroes each household ID has.
Filter to only include household IDs that have 5 or less zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the Lag operator: to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data exclude;
set sorted;
by household_id;
array prev{*} _L1-_L4;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id; then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
if spend_cnt >= 5 then do;
if spending + sum(of prev) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'excluded' dataset.
proc sql;
create table wanted as
select *
from original_data;
where household_id not in(select distinct household_id from excluded);
quit;
Trying to make a more simple unique identifier from already existing identifier. Starting with just and ID column I want to make a new, more simple, id column so the final data looks like what follows. There are 1million + id's, so it isnt an option to do if thens, maybe a do statement?
ID NEWid
1234 1
3456 2
1234 1
6789 3
1234 1
A trivial data step solution not using monotonic().
proc sort data=have;
by id;
run;
data want;
set have;
by id;
if first.id then newid+1;
run;
using proc sql..
(you can probably do this without the intermediate datasets using subqueries, but sometimes monotonic doesn't act the way you'd think in a subquery)
proc sql noprint;
create table uniq_id as
select distinct id
from original
order by id
;
create table uniq_id2 as
select id, monotonic() as newid
from uniq_id
;
create table final as
select a.id, b.newid
from original_set a, uniq_id2 b
where a.id = b.id
;
quit;