I am trying to find out whether user-defined formats can be created with proc sql instead of proc format. Can we edit a formats catalog using proc sql? I tried to query it, but I could not. Does anyone know if that's achievable?
Thanks!
No, I think proc format is your only option for creating formats, and for editing catalogue files you can only use proc catalog.
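For what it's worth, proc sql can still do most of the work, even though proc format has to create the format itself. A minimal sketch, assuming the control data set route (CNTLIN= / CNTLOUT=), which is not something the question or this answer mentions; the format name typefmt and the table names fmt_ctrl and fmt_dump are made up for illustration:

```sas
/* Build a control data set with PROC SQL (all names are hypothetical). */
proc sql;
  create table fmt_ctrl as
  select distinct
         'typefmt' as fmtname,   /* illustrative format name        */
         'C'       as type,      /* C = character format            */
         c.type    as start,     /* raw value to be formatted       */
         case when c.type in ('SUV', 'Sports') then 'wanted'
              else 'other'
         end       as label      /* text shown for that value       */
  from sashelp.cars as c;
quit;

/* PROC FORMAT still creates the format, reading the control data set. */
proc format cntlin=fmt_ctrl;
run;

/* The other direction: dump the WORK format catalog to a table ... */
proc format library=work cntlout=fmt_dump;
run;

/* ... which PROC SQL can then query like any other table. */
proc sql;
  select fmtname, start, label
  from fmt_dump
  where upcase(fmtname) = 'TYPEFMT';
quit;
```

So the catalog itself is never touched by proc sql; it only prepares or inspects ordinary tables that proc format reads and writes.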
I think User user667489 is correct... however there is a work-around, which might be useful.
You can create classifying variables by applying case logic:
proc sql outobs=30;
select case
when 10 <= vehicle_type and vehicle_type < 20 then 'passenger'
when 20 <= vehicle_type and vehicle_type < 30 then 'lcv'
end as type
from begin;
quit;
See more from Documentation
Edit:
Let's see if I can work out a better example. For the sake of argument, say we're interested in cheap sports cars and SUVs and want to create categorical variables to detect any combination of these.
proc sql outobs=30;
create table results as
select type
,case when type in('SUV' 'Sports') then 1
else 0
end as wanted_type
,MSRP
,case when MSRP < 30000 then 'cheap'
when MSRP > 45000 then 'Expensive'
else 'inconclusive'
end as price_range
from sashelp.cars;
quit;
At this point we have the original data and the new grouping variables wanted_type and price_range, as shown below. Of course, you don't have to select the original values; they are included here only to show the result next to the original.
Type wanted_type MSRP price_range
SUV 1 $36,945 inconclusive
Sedan 0 $23,820 cheap
Sedan 0 $26,990 cheap
Sedan 0 $33,195 inconclusive
Sedan 0 $43,755 inconclusive
Sedan 0 $46,100 Expensive
Sports 1 $89,765 Expensive
Sedan 0 $25,940 cheap
I want to use SAS and eg. proc report to produce a custom table within my workflow.
Why: Prior, I used proc export (dbms=excel) and did some very basic stats by hand and copied pasted to an excel sheet to complete the report. Recently, I've started to use ODS excel to print all the relevant data to excel sheets but since ODS excel would always overwrite the whole excel workbook (and hence also the handcrafted stats) I now want to streamline the process.
The task itself is actually very straightforward. We have some information about IDs, age, and registration, so something like this:
data test;
input ID $ AGE CENTER $;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
The goal is to produce a table report which should look like this structure-wise:
             ID   NO-ID   Total
Count         3       2       5
Age (mean)   27    45.5    34.4
Count by Center:
A             2       1       3
B             0       1       1
C             1       0       1
It seems, proc report only takes variables as columns but not a subsetted data set (ID NE .; ID =''). Of course I could just produce three reports with three subsetted data sets and print them all separately but I hope there is a way to put this in one table.
Is proc report the right tool for this and if so how should I proceed? Or is it better to use proc tabulate or proc template or...?
I found a way to get an almost exact match to what I wanted. First of all, I had to introduce a new variable vID (valid ID: 0 = not valid, 1 = valid) in the data set, like so:
data test;
input ID $ AGE CENTER $;
if ID = '' then vID = 0;
else vID = 1;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
After this I was able to use proc tabulate, as suggested by @Reeza in the comments, to build a table which pretty much resembles what I initially aimed for:
proc tabulate data = test;
class vID Center;
var age;
keylabel N = 'Count';
table N age*mean Center*N, vID ALL;
run;
Still, I wonder if there is a way without introducing the new variable at all and just use the SAS counters for missing and non-missing observations.
UPDATE:
@Reeza pointed out that proc format can be used to assign a value to missing/non-missing ID data. In combination with the missing option (which makes proc tabulate treat missing class values as valid levels), this delivers the output without introducing a new variable:
proc format;
value $ id_fmt
' ' = 'No-ID'
other = 'ID'
;
run;
proc tabulate data = test missing;
format ID $id_fmt.;
class ID Center;
var age;
keylabel N = 'Count';
table N age*(mean median) Center*N, (ID=' ') ALL;
run;
I would like to see all the data from dataset "one". If there is no match between the tables, the value should be overwritten with 0. The current code gives me values only where there is a match. This is the table I need:
data one;
input lastname: $15. typeofcar: $15. mileage;
datalines;
Jones Toyota 3000
Smith Toyota 13001
Jones2 Ford 3433
Smith2 Toyota 15032
Shepherd Nissan 4300
Shepherd2 Honda 5582
Williams Ford 10532
;
data two;
input startrange endrange typeofservice & $35.;
datalines;
3000 5000 oil change
5001 6000 overdue oil change
6001 8000 oil change and tire rotation
8001 9000 overdue oil change
9001 11000 oil change
11001 12000 overdue oil change
12001 14000 oil change and tire rotation
15032 14999 overdue oil change
13001 15999 15000 mile check
;
data combine;
do until (mileage<15000);
set one;
do i=1 to nobs;
set two point=i nobs=nobs;
if startrange = mileage then
output;
end;
end;
run;
proc print;
run;
Description of the code from the SAS support site:
Read the first observation from the SAS data set outside the DO loop. Assign the FOUND variable to 0. Start the DO loop reading observations from the SAS data set inside the DO loop. Process the IF condition; if the IF condition is true, OUTPUT the observation and set the FOUND variable to 1. Assigning the FOUND variable to 1 will cause the DO loop to stop processing because of the UNTIL (FOUND) that is coded on the DO loop. Go back to the top of the DATA step and read the next observation from the data set outside the DO loop and process through the DATA step again until all observations from the data set outside the DO loop have been read.
You could do that with a LEFT JOIN in a proc sql
taking all variables from one
then making 2 conditions to fill startrange and endrange with 0 when missing.
proc sql noprint;
create table want as
select t1.*
, case when t2.startrange=. then 0 else t2.startrange end as startrange
, case when t2.endrange=. then 0 else t2.endrange end as endrange
, t2.typeofservice
from one t1 left join two t2
on (t1.mileage = t2.startrange)
;quit;
Or do it in 2 steps (I personally find the if of the data step cleaner than the case when of the proc sql.)
proc sql noprint;
create table want as select *
from one t1 left join two t2 on (t1.mileage = t2.startrange)
;quit;
data want; set want;
if startrange=. then do; startrange=0; endrange=0; end;
run;
I can't use proc sql because I need Vlookup inside loop UNTIL. I need another solution.
Data step is not the best way to code this. It is much easier to code fuzzy matches using SQL code.
Not sure why you need to have zeros instead of missing values, but coalesce() should make it easy to provide them.
proc sql ;
create table combine as
select a.*
, coalesce(b.startrange,0) as startrange
, coalesce(b.endrange,0) as endrange
, b.typeofservice
from one a left join two b
on a.mileage between b.startrange and b.endrange
;
quit;
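For the case raised above, where proc sql is not an option, the same result can probably be reached in a data step. A minimal sketch, assuming the data sets one and two from the question and the range match used in the query above (the question itself tests startrange = mileage); the found flag and the loop counter i are helper names introduced here:

```sas
data combine;
  set one;
  found = 0;                          /* no match seen yet for this row */
  do i = 1 to nobs;
    set two point=i nobs=nobs;        /* direct access into the lookup  */
    if startrange <= mileage <= endrange then do;
      found = 1;
      output;                         /* keep the first matching range  */
      leave;
    end;
  end;
  if not found then do;
    /* Variables read from TWO still hold the last row that was read, */
    /* so they are overwritten before the unmatched row is written.   */
    startrange = 0;
    endrange = 0;
    typeofservice = ' ';
    output;
  end;
  drop i found;
run;
```

Because the outer set one; drives the step, every row of one is written exactly once, matched or not, and the step ends normally when one runs out of rows.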
I have a dataset that contains the ID and a variable called CC. The CC holds multiple numbered values where each value represents something. It looks like this:
An ID can have the same CC in multiple rows. I just want to flag whether the CC exists or not, so even if Joe had five rows stating that he has CC equal to 3, I just want a 1 or 0 stating whether Joe ever had a CC equal to 3.
I want it to look like this:
I tried coding it as shown below, but although I know an ID can have more than one type of CC, the final dataset created by the code only shows one filled CC for each ID. I think maybe it's overwriting it?
Also I should note that prior to this code I created the CC Flag variables and filled it all as zeros.
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select
ID,
COUNT(ID) as count_ID,
case when CC=1 then 1 end as CC_1,
case when CC=2 then 1 end as CC_2,
case when CC=3 then 1 end as CC_3
from Original_Dataset
group by ID;
quit;
Any help is appreciated, thank you.
Is your issue the fact that after running your new code you still get multiple lines per ID?
If so I propose this:
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select ID
,case when CC_1 >0 then 1 else 0 end as CC_1
,case when CC_2 >0 then 1 else 0 end as CC_2
,case when CC_3 >0 then 1 else 0 end as CC_3
from (
select
ID,
COUNT(ID) as count_ID,
sum(case when CC=1 then 1 end) as CC_1,
sum(case when CC=2 then 1 end) as CC_2,
sum(case when CC=3 then 1 end) as CC_3
from Original_Dataset
group by ID
);
quit;
The reason you are having the issue is that you are only aggregating the count of ID and not the other values. Applying an aggregate function to them as well will eliminate the duplicate records.
Hope this helps
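The same idea can be written more compactly, since a comparison in SAS evaluates to 1 or 0 and can be aggregated directly. This is a sketch of an alternative, not the code above; the sample rows are invented to match the shape described in the question:

```sas
/* Hypothetical sample data: one row per ID/CC occurrence. */
data Original_Dataset;
  input ID $ CC;
  datalines;
Joe 3
Joe 3
Joe 1
Ann 2
;
run;

proc sql;
  create table Flagged_CCs as
  select ID,
         count(*)    as count_ID,
         max(CC = 1) as CC_1,   /* 1 if any row of this ID has CC = 1 */
         max(CC = 2) as CC_2,
         max(CC = 3) as CC_3
  from Original_Dataset
  group by ID;
quit;
```

With these four rows, Joe ends up with CC_1 = 1, CC_2 = 0, CC_3 = 1 and Ann with CC_1 = 0, CC_2 = 1, CC_3 = 0, one row per ID.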
If you're looking for a report here's one method, using PROC TABULATE.
proc format;
value indicator_fmt
low - 0, . = '0'
0 <- high = '1';
run;
proc tabulate data=have;
class id cc;
table id , cc*N=''*f=indicator_fmt.;
run;
Your output will look like this then:
If you want a fully dynamic approach that builds a table without knowing anything ahead of time, such as the number of CCs, a different approach is needed. It's a bit longer, but being dynamic may make it worthwhile to implement.
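One way such a dynamic table could be built is with PROC TRANSPOSE, which creates one column per CC value it finds. This is a sketch under assumptions, not necessarily the approach the author had in mind; the data set names distinct_cc, wide and want and the sample rows are made up:

```sas
/* Hypothetical sample data in the shape used above: have(id, cc). */
data have;
  input id $ cc;
  datalines;
Joe 3
Joe 3
Joe 1
Ann 2
;
run;

/* One row per distinct id/cc pair, so TRANSPOSE sees no duplicates. */
proc sort data=have out=distinct_cc nodupkey;
  by id cc;
run;

data distinct_cc;
  set distinct_cc;
  flag = 1;                           /* value that lands in the CC_n column */
run;

/* One output column per CC value found in the data: CC_1, CC_2, ... */
proc transpose data=distinct_cc out=wide(drop=_name_) prefix=CC_;
  by id;
  id cc;
  var flag;
run;

/* Combinations that never occurred come out missing; turn them into 0. */
data want;
  set wide;
  array flags {*} CC_:;
  do _i = 1 to dim(flags);
    if missing(flags{_i}) then flags{_i} = 0;
  end;
  drop _i;
run;
```

The columns appear in the order the CC values are first encountered, so they may need reordering if a fixed layout matters.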
I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing the twenty most recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC with a BY statement on your customer ID variable. If you set it up as a view, you don't even have to rerun that part: you can create a permanent view pointing to your constantly updating data, and the subsetting to the last 20 records will happen every time you run the analysis PROC.
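As a rough sketch of the view idea, under assumed names (purchases, customer_id, purchase_date, amount, and the views purchases_sorted and recent are all hypothetical, and the cutoff is 2 rather than 20 so the tiny sample shows an effect):

```sas
/* Hypothetical purchase table. */
data purchases;
  input customer_id $ purchase_date : date9. amount;
  format purchase_date date9.;
  datalines;
cust01 21DEC2017 234.57
cust01 23DEC2017 2.88
cust01 24DEC2017 4.99
cust02 21NOV2017 34.50
cust02 24NOV2017 14.01
;
run;

/* SQL view that always delivers the rows newest-first within customer. */
proc sql;
  create view purchases_sorted as
  select *
  from purchases
  order by customer_id, purchase_date desc;
quit;

%let keep_n = 2;   /* would be 20 for the real data */

/* DATA step view that keeps only the first &keep_n rows per customer. */
data recent / view=recent;
  set purchases_sorted;
  by customer_id;
  if first.customer_id then seq = 0;
  seq + 1;                            /* running row number within customer */
  if seq <= &keep_n;
  drop seq;
run;

/* The subsetting is redone each time the view is read. */
proc means data=recent n mean min max;
  by customer_id;
  var amount;
run;
```

Both views live in WORK here and disappear with the session; storing them in a permanent library would be needed for the "permanent view" described above.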
Yes, you can either add a rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a data step to select the top 20 purchases per customer and then compute your summary statistics. My code below will create a table called "WANT" with the top 20 and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort the data by ID and by descending date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select Top 2: Change my 2 to 20 in your case
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1
I'm very new to SAS, trying to learn everything I need for my analytical task. The task I have now is to create a flag for the ongoing application. I think it might be easier to show it in a table, just to illustrate my problem:
[Update 2017.10.27] data sample in code, big thanks to Richard :)
data sample;
input PeopleID ApplicationID Applied_date : yymmdd10. Decision_date : yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
In the system, people apply for the service. When a person does that, he gets a PeopleID, which does not change when the person applies again. Each application also gets an ApplicationID, which is unique, and later applications have larger ApplicationIDs. What I want is to create an ongoing flag for each application. The purpose is to show whether, by the time this application came in, the same person did or did not have an ongoing application (an application which has not received a decision). See some examples from the table above:
Person#2 has two applications #8 and #7, by the time he applied #8, #7 has not been decided, therefore #8 should get ongoing flag.
Person#1 applied multiple times. Applications #3 and #2 have an ongoing application because of App#1. Applications #6 and #5 came in on the same date, but according to the application ID, we can tell that #6 came in later than #5, and as #5 had not been decided by then, #6 gets the ongoing flag.
As you might notice, application with a positive ongoing flag always receives decisions on the same date as it came in. That is because applications with ongoing cases are automatically declined. However, I cannot use this as an indicator: there are many other reasons that trigger an automatic decline.
The ongoing_flag is what I want to create in my dataset. I have tried to sort by 1.peopleID, 2.descending applicationID, 3. descending applied_date, so my entire dataset looks like the small example table above. But then I don't know how to make SAS compare within the same variable (peopleID) but different lines (applicationID) and columns (compare Applied_date with Decision_date). I want to compare, for each person, every application's applied_date with all the previous applications' decision_date, such that I can tell by the time this application came in, whether or not there is an ongoing application from previously in the system.
I know I used too many words to explain my problem. For those who read through, thank you for reading! For those who have any idea on what might be a good approach, please leave your comments! Millions of thanks!
Min:
For problems of this type you want to mentally break the data structure into different parts.
BY GROUP
The variables whose unique combination defines the group. There are one or more rows in a group. Let's call them items.
GROUP DETAILS
Variables that are observational in nature. They may be numbers such as temperature, weight or dollars, or, characters or strings that represent some state being tracked. The details (at the state you are working) themselves might be aggregates for a deeper level of detail.
GOAL
Compute additional variables that further elucidate an aspect of the details over the group. For numeric variables the goal might be statistical, such as MIN, MAX, MEAN, MEDIAN, RANGE, etc. Or it might be identificational, such as which ID had the highest $, or which name was longest, or any other business rule.
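To make the three parts concrete, here is a small sketch on the sashelp.class sample table rather than the application data: the BY GROUP is sex, the GROUP DETAILS are name and height, and the GOAL is identificational (who is tallest in each group). It relies on PROC SQL remerging the group maximum back onto the detail rows, which SAS reports with a note in the log:

```sas
proc sql;
  select sex,                  /* BY GROUP variable                   */
         name, height          /* GROUP DETAILS                       */
  from sashelp.class
  group by sex
  having height = max(height); /* GOAL: the tallest item in the group */
quit;
```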
Your specific problem is one of determining claim activity on a given date. I think of it as a coverage type of problem because the dates in question cover a range. The BY GROUP is person and an 'Activity' date.
Here is one data-centric approach. The original data is expanded to have one row per date from applied to decided. Then simple BY group processing and the automatic first. variable are used to determine whether an application arrived while an earlier one was still undecided.
data have;
input PeopleID ApplicationID Applied_date : yymmdd10. Decision_date : yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
data coverage;
do _n_ = 1 by 1 until (last.PeopleID);
set have;
by PeopleID;
if Decision_date > Max_date then Max_date = Decision_date;
end;
put 'NOTE: ' PeopleID= Max_date= yymmdd10.;
do _n_ = 1 to _n_;
set have;
do Activity_date = Applied_date to ifn(missing(Decision_date),Max_date,Decision_date);
if missing(Decision_date) then Decision_date = Max_date;
output;
end;
end;
keep PeopleID ApplicationID Applied_date Decision_date Activity_date;
format Activity_date yymmdd10.;
run;
proc sort data=coverage;
by PeopleID Activity_date ApplicationID ;
run;
data overlap;
set coverage;
by PeopleID Activity_date;
Ongoing_flag = not (first.Activity_date);
if Activity_date = Applied_date then
output;
run;
proc sort data=overlap;
by PeopleID descending ApplicationID ;
run;
Other approaches could involve arrays, hashes, or SQL. SQL is very different from DATA Step code and some consider it to be more clear.
proc sql;
create table want as
select
PeopleID, ApplicationID, Applied_date, Decision_date
, case
when exists (
select * from have as inner
where inner.PeopleID = outer.PeopleID
and inner.ApplicationID < outer.ApplicationID
and
case
when inner.Decision_date is null and outer.Decision_date is null then 1
when inner.Decision_date is null then 1
when outer.Decision_date is null then 0
else outer.Decision_date < inner.Decision_date
end
)
then 1
else 0
end as Ongoing_flag
from have as outer
;
quit;