Proc sql: new and continue customers based on look back period - sas

I have the following data:
wei 01feb2018 car
wei 02feb2018 car
wei 02mar2019 bike
carlin 01feb2018 car
carlin 05feb2018 bike
carlin 07mar2018 bike
carlin 01mar2019 car
I want to identify new and continuing customers: if a customer had no purchase in the last 12 months, they count as a new customer. The required output looks like this:
wei 01feb2018 car new
wei 02feb2018 car cont.
wei 02mar2019 bike new
carlin 01feb2018 car new
carlin 05feb2018 bike cont.
carlin 07mar2018 bike cont.
carlin 01mar2019 car new
Now, if a customer has purchased more than one item in the same month (for example, customer A purchased a car on 01jan and a bike on 15jan), then I want to classify customer A as new for Jan in one report, and in another report I want customer A as both new and continuing.
I'm trying, but I can't get the logic right:
proc sql;
select a.*,(select count(name) from t where intnx("month",-12,a.date) >= 356)
as tot
from t a;
Quit;

You appear to want two different 'status' variables, one for continuity over prior year and one for continuity within month.
In SQL an existential reflexive correlated sub-query result can be a case test for rows meeting the over and within criteria. Date arithmetic is used to compute days apart and INTCK is used to compute months apart:
data have; input
customer $ date& date9. item& $; format date date9.; datalines;
wei 01feb2018 car
wei 02feb2018 car
wei 02mar2019 bike
carlin 01feb2018 car
carlin 05feb2018 bike
carlin 07mar2018 bike
carlin 01mar2019 car
run;
proc sql;
create table want as
select *,
case
when exists
(
select * from have as inner
where inner.customer=outer.customer
and (outer.date - inner.date) between 1 and 365
)
then 'cont.'
else 'new'
end as status_year,
case
when exists
(
select * from have as inner
where inner.customer=outer.customer
and outer.date > inner.date
and intck ('month', outer.date, inner.date) = 0
)
then 'cont.'
else 'new'
end as status_month
from have as outer
;
quit;
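If a DATA step is preferred, a rough equivalent of status_year can be sketched with LAG. This is only a sketch under assumptions: the names sorted, want_lag and prev_date are made up for illustration, and it assumes at most one purchase per customer per day (with same-day duplicates it can differ from the SQL above, which looks for any earlier purchase 1 to 365 days back).

```sas
/* Sort so that each customer's purchases are in date order */
proc sort data=have out=sorted;
  by customer date;
run;

data want_lag;
  set sorted;
  by customer;
  length status_year $5;
  /* LAG must run on every row, so call it unconditionally */
  prev_date = lag(date);
  /* Do not carry the previous customer's date into a new group */
  if first.customer then prev_date = .;
  if missing(prev_date) or date - prev_date > 365 then status_year = 'new';
  else status_year = 'cont.';
  drop prev_date;
run;
```

Because the data is sorted by date, the immediately preceding purchase is the closest one, so checking that single gap is enough for the yearly look-back.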

You can use retain:
proc sort data=test out=test2;
by name type date;
run;
data test2 ;
set test2;
retain retain 'new';
by name type date;
if first.type then retain='new';
else retain='con';
run;
proc sort data=test2 out=test2;
by name date;
run;
Output:
+--------+-----------+------+--------+
| name | date | type | retain |
+--------+-----------+------+--------+
| carlin | 01FEB2018 | car | new |
| carlin | 05FEB2018 | bike | new |
| carlin | 01MAR2019 | car | con |
| wei | 01FEB2018 | car | new |
| wei | 02FEB2018 | car | con |
| wei | 02MAR2019 | bike | new |
+--------+-----------+------+--------+

Group rows in PROC TABULATE

I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
INFORMAT id $12.;
INFILE DATALINES DSD;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain to me how to do this? I tried to use another format for the 'crime' variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |
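Use the ALL keyword (labelled here with All='Total') within a table dimension. This assumes a class variable, called group below, that holds the category: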
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";
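A fuller sketch of how that TABLE statement could be wired up is below. The group variable, the crimegrp format and the offenders2 data set are assumptions made for illustration; they are not in the question. The idea is to copy crime into a second variable and give it a format that maps the codes to the two categories.

```sas
/* Hypothetical format mapping crime codes to categories */
proc format;
  value crimegrp 31, 32   = 'Violent crimes'
                 110, 113 = 'Crimes against property';
run;

/* Copy crime into a second variable so it can carry its own format */
data offenders2;
  set offenders;
  group = crime;
run;

proc tabulate data=offenders2;
  format crime crimes. group crimegrp.;
  freq count;
  class group crime offenderSex;
  /* ALL inside the parentheses adds a total row within each category */
  table group='Category' * (crime="Type of crime" All='Total'),
        offenderSex="Sex of the offender" / misstext="0";
run;
```

This still shows the category in its own row-header column; the ALL row is what supplies the per-category totals.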

How to plot a simple lineplot in SAS

My data are structured as follows (these are just sample data as the original data are secret)
id | crime | location | crimedate
------------------------------
1 | Theft | public | 2019-01-04
1 | Theft | public | 2019-02-06
1 | Theft | public | 2019-02-20
1 | Theft | private | 2019-03-10
1 | Theft | private | 2019-03-21
1 | Theft | public | 2019-03-01
1 | Theft | private | 2019-03-14
1 | Theft | public | 2019-06-15
1 | Murder | private | 2019-01-04
1 | Murder | private | 2019-10-20
1 | Murder | private | 2019-11-18
1 | Murder | private | 2019-01-01
1 | Assault | private | 2019-03-19
1 | Assault | private | 2019-01-21
1 | Assault | public | 2019-04-11
1 | Assault | public | 2019-01-10
… | … | … | …
My goal is to create a line plot (time series plot) showing how the numbers of the three crimes have changed over the year. So on the x-axis I would like to show the months (1-12) and on the y-axis the number of crimes in each month. There should be two lines (one for each location).
I started with this code:
DATA new;
SET old;
month=month(datepart(crimedate));
RUN;
PROC sgplot DATA=new;
series x=month y=no_of_crimes / group=location;
run;
But I have no idea how to aggregate the number of crimes per month. Could anyone please give me a hint? I have been looking on the internet for a solution, but the examples usually use data that are already aggregated.
The SG routines will aggregate Y axis values for a VBAR or HBAR statement. To display the same aggregate information with a SERIES statement, it has to come from a prior aggregate computation, which is easily done with Proc SUMMARY.
Additionally, to plot the counts for each crime in a separate visual, you would want a BY CRIME statement, or Proc SGPANEL with PANELBY crime.
The crime datetime value does not have to be converted down to a date value; you can apply an appropriate datetime format in the procedures and they will aggregate based on the formatted value.
Example with some simulated crime data:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
* shorter graphs for SO answer;
ods graphics / height=300px;
proc sgplot data=have;
title "VBAR all crimes combined by location";
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc sgpanel data=have;
title "VBAR crime * location";
panelby crime;
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc summary data=have noprint;
class crime_dt crime location;
format crime_dt dtmonyy7.;
output out=freqs;
run;
proc sgplot data=freqs;
title "SERIES all crimes,summary _FREQ_ * location";
where _type_ = 5;
series x=crime_dt y=_freq_ / group=location;
xaxis type=discrete;
run;
proc sgpanel data=freqs;
title "SERIES all crimes,summary _FREQ_ * crime * location";
where _type_ = 7;
panelby crime;
series x=crime_dt y=_freq_ / group=location;
rowaxis min=0;
colaxis type=discrete;
run;
If you want to group by location without breaking the counts down by type of crime:
proc sql noprint;
create table new as
select id,location
, month(crimedate) as month,count(crime) as crime_n
from old
group by id,location,CALCULATED month;
quit;
proc sgplot data=new;
series x=month y=crime_n /group=location;
run;
To show different series by type of crime you could use sgpanel:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby location;
series x=month y=crime_n /group=crime;
run;
One more way of presenting this data:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby crime;
series x=month y=crime_n /group=location GROUPDISPLAY=cluster;
run;
Of course, you can customize these plots however you want.
To perhaps answer the question more directly, the VLINE or HLINE plots will summarize the data for you, similar to running a proc freq and then proc sgplot with series.
Using Richard's test data, you'll see this matches the plot his PROC SUMMARY -> SERIES gives:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
proc sgplot data=have;
vline crime_dt/group=location groupdisplay=cluster;
format crime_dt dtmonyy7.;
run;

jump to next by group when condition is met

I have a file documenting changes in marital status - ID, type of change (marriage, divorce, being widowed) and year (and month) of change. I want to calculate each person's marital status (married, divorced, widow(er), never been married) for any given year. Since a person can go through many changes and my file is around 20 million rows I'd like to skip to the next person when I find the answer and not continue through all of that person's other records.
I thought to sort by ID and descending date of change and then set by ID. For each ID, if the year I'm interested in is greater than (or equal to) the year of change then calculate marital status and output the ID and marital status. If not, continue to the next record until the condition is met. If no record meets the condition then marital status=never been married.
data a;
length type_change $10;
input ID type_change yr_change mnth_change;
cards;
1 marriage 2006 9
1 divorce 2010 5
10 marriage 2005 2
10 divorce 2012 10
10 marriage 2016 8
23 marriage 2017 6
35 marriage 2002 7
35 widow 2013 12
;
run;
For 2015 I'd like to get:
- ID marital_status
- 1 divorced
- 10 divorced
- 23 never been married
- 35 widowed
Thanks in advance!
/* year of interest, referenced below as &y */
%let y = 2015;
/* do this sort only once and save sorted */
proc sort data = have out = sorted;
by id yr_change mnth_change;
run;
proc sort data = have (keep =id) out = ids nodupkey;
by id;
run;
data step1;
set sorted;
where yr_change <= &y;
by id;
if last.id;
run;
data want;
merge step1 (in =a) ids (in =b);
by id;
if b and not a then status = "never married";
else status = type_change;
run;
If by skip you mean not reading them, then you cannot "skip" observations. But you can ignore them by using an IF statement (or other conditional logic).
Using RETAIN and BY group processing should get you the answer.
%let year=2015;
data want ;
set a ;
by id yr_change mnth_change ;
length status $20;
retain status ;
if first.id then status='never been married ';
if yr_change <= &year then status=type_change ;
if last.id;
keep id status;
run;
Result:
Obs ID status
1 1 divorce
2 10 divorce
3 23 never been married
4 35 widow
If you have access to a master list of IDs you could switch to a WHERE= data set option, which MIGHT reduce the I/O needed to process all of the records. For example, merge the list of IDs with a subset of the marital status change records.
data want;
merge id_list a(in=in2 where=(yr_change <= &year));
by id;
length status $20;
retain status ;
if first.id then status='never been married ';
if in2 then status=type_change ;
if last.id;
keep id status;
run;
A DOW loop will let you compute a result over a group. An implicit output will save the result computed for the group. Because the result is dependent on your year of interest, you will want to track that also in any created data sets.
%let YEAR_CUTOFF = 2015;
data want (keep=id status year_cutoff);
attrib
id length = 8
status length=$20 label="Status at year end &YEAR_CUTOFF"
year_cutoff length = 8
;
retain year_cutoff &YEAR_CUTOFF;
status = 'never been married';
do until (last.ID); /* The DOW loop */
set have (rename=(type_change=status_of_interest)); /* RENAME= needs parentheses; have is the question's data sorted by id and date */
by id;
if yr_change <= &YEAR_CUTOFF then status = status_of_interest;
end;
/* No explicit OUTPUT in the step, so,
* an implicit OUTPUT occurs here at the bottom of the step
*/
run;
Alternatively, build a row for every ID and year, then use a RETAIN statement to carry the status forward.
Extract all IDs:
proc sort data=a out=ids(keep= id) nodupkey ;
by id;
run;
Generate all the years you need for each ID:
data years;
set ids;
must_be_date=2000;
do i = 1 to 20;
must_be_date+1;
output;
end;
drop i;
run;
Join on ID and year:
proc sql;
create table res as
select *
from years left join a on years.must_be_date = a.yr_change and a.id = years.id
;
quit;
proc sort ;
by id must_be_date;
run;
Use retain:
data res;
retain temp "never been married";
set res;
by id must_be_date;
if first.id then temp="never been married";
if type_change="" then type_change = temp;
else temp=type_change;
run;
To check:
data res_2015;
set res;
where must_be_date=2015;
run;
Result table:
+--------------------+----+--------------+-------------+-----------+-------------+
| temp | ID | must_be_date | type_change | yr_change | mnth_change |
+--------------------+----+--------------+-------------+-----------+-------------+
| divorce | 1 | 2015 | divorce | . | . |
| divorce | 10 | 2015 | divorce | . | . |
| never been married | 23 | 2015 | never been | . | . |
| widow | 35 | 2015 | widow | . | . |
+--------------------+----+--------------+-------------+-----------+-------------+

Proc REPORT move group value (row header) closer to totals

I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output: [image not included]
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to specify the statistic along with the variable name, as in TypeA.sum.
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;

Updating value in one row to another row based on same column value

My dataset XXX consists of records where 2 rows form a pair based on the same value of the FRUIT column. One row of each pair has empty COUNTRY and COLOUR fields, while the other row contains the actual COUNTRY and COLOUR values. I would like to copy the COLOUR value from the row where COUNTRY is populated (source) into the empty COLOUR field of the row where COUNTRY is empty (destination).
XXX DATASET [current]
FRUIT | COUNTRY | COLOUR
Banana | . | .
Banana | Spain | Yellow
Apple | . | .
Apple | USA | Red
Pear | China | Green
Pear | . | .
YYY [DESIRED]
FRUIT | COUNTRY | COLOUR
Banana | . | Yellow
Banana | Spain | Yellow
Apple | . | Red
Apple | USA | Red
Pear | China | Green
Pear | . | Green
Of course this example is dumb, but it is a valid business case.
Apologies, I could not attach code here as I am on a bus right now, frantically typing. I tried using first. and last., but somehow the variable cannot be passed across rows.
Can you advise on this?
Here's one way of doing this, using retain to carry over values from previous rows. The trick is to retain a temporary column rather than the one you want to fill in:
data have;
/* INFILE has to come before INPUT so that the delimiter is in effect */
infile cards dlm='|';
input FRUIT $ COUNTRY $ COLOUR $;
cards;
Banana | . | .
Banana | Spain | Yellow
Apple | . | .
Apple | USA | Red
Pear | China | Green
Pear | . | .
;
run;
/*Sort missing values of COLOUR to the bottom within each FRUIT*/
proc sort data = have out = temp;
by FRUIT descending COLOUR;
run;
data want;
set temp;
by FRUIT;
retain t_COLOUR 'placeholder';
if first.FRUIT then do;
call missing(t_COLOUR); /* character variable: do not assign a numeric missing value */
if not(missing(COUNTRY)) then t_COLOUR = COLOUR;
end;
else COLOUR = coalescec(COLOUR, t_COLOUR);
drop t_COLOUR;
run;
Try this out:
proc sort data=have;
by fruit country;
run;
data want( rename=(country1=country colour1=colour));
set have end=eof;
by fruit notsorted;
if first.fruit then do;
point = _N_ + 1;
set have (keep= country colour rename= (country = country1 colour = colour1)) point=point;
end;
else do;
country1=country;
colour1 = colour;
end;
drop country colour;
run;
So you want to apply the non-missing values of COLOUR to every record with the same value of FRUIT? Sounds like a simple MERGE problem.
data YYY;
merge XXX(drop=colour) XXX(keep=fruit colour where=(not missing(colour)));
by fruit;
run;
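The same idea can also be sketched in PROC SQL, as an alternative to the MERGE above. This is a sketch under assumptions: it assumes each FRUIT has only one distinct non-missing COLOUR (otherwise the join would duplicate rows), and note that PROC SQL does not guarantee the original row order.

```sas
proc sql;
  create table YYY as
  select a.fruit,
         a.country,
         /* keep the row's own colour if present, else take the pair's colour */
         coalesce(a.colour, b.colour) as colour
  from XXX as a
       left join
       (select distinct fruit, colour
          from XXX
         where not missing(colour)) as b
    on a.fruit = b.fruit;
quit;
```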