New SAS variable conditional on observations - sas

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).

You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
proc sort data=yourdata;
by site date descending;
run;
data yourdata_want;
set yourdata;
by site date descending;
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE DATE DESCENDING, then by SITE with NODUPKEY. That way the latest date is up top. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).

Related

How to accumulate the records of a newly imported table with the records of another table that I have stored on the servers in SAS?

I am new to SAS and I have the following problem:
When trying to join records I just imported (in one table) with records I have stored in another table.
What happens is that I am going to run the code in SAS daily, and I need the table that I am going to create today (17/05/2021) by importing a file 'X', to join the table that I created yesterday (16/05/2021) by importing a file 'Y'.
And so the code will be executed tomorrow, the next day and so on.
In conclusion the records will accumulate as the days go by.
To tackle this problem, I am first creating two variables, one with the date of the day the code will be executed and the other with the date of the last execution.
%let daily_date = 20210423; /*AAAAMMDD*/
%let last_execution_date = 20210422; /*AAAAMMDD*/
Then the import of a file is done, we can see that the name of this created table has the date of the day in which the code is being executed.
data InputAC.RA_ratings&daily_date;
infile "&ruta_InputRA." FIRSTOBS=2
dsd lrecl=4096 truncover;
input
#1 RA_Customer_ID $10.
#11 Rating_ID 10.
#21 ISRM_Model_Overlay_ID $10.
#31 Constant_ID 10.
#41 Value $100.
;
run;
proc sort data=inputac.RA_ratings&daily_date;
by RA_Customer_ID Rating_ID;
quit;
Finally the union of InputAC.RA_ratings&daily_date with InputAC.RA_ratings&last_execution_date is made. ('InputAC.RA_ratings&last_execution_date' should be the table that was imported at an earlier date than today.)
data InputAC.RA_ratings&fec_diario;
merge
InputAC.RA_ratings&fec_diario
InputAC.RA_ratings&ultima_fecha_de_ejecucion;
by RA_Customer_ID Rating_ID;
run;
This is how the tables are being stored on the server.
(Ignore date 20210413, let's imagine it is 20210422)
However, I have to perform this task without using the variable 'last_execution_date'.
I've been researching but I still can't find any SAS function that can help me with this problem.
I hope someone can help me, thank you very much in advance.
This is a pretty complex and interesting question from an operations point of view. The answer depends on a few things.
How much control do you have over the execution of this process?
Is "yesterday" guaranteed, or does the process need to work if "last execution date" is not yesterday?
What should happen if the process is run twice today?
The best practices way to solve this is to have a dataset (or table) that stores the last execution date. That allows you to handle #2 trivially, and the answer to #3 might guide exactly how you store this but is easily handled anyway.
Say for example you have a table, MetaAC.LastExecDate (or, in spanish, MetaAC.UltimaFecha or similar). It could store things this way:
data LastExecDate;
timestamp = datetime();
execdate = input(&daily_date,yymmdd8.);
run;
proc append base=MetaAC.LastExecDate data=LastExecDate;
run;
This lets you store an arbitrary execdate even if it's not today, and also store when you ran it (for audit purposes), and you could even add who ran it if that's interesting (there is a macro variable &sysuserid or similar). Then put all this at the bottom of your process, and it updates as you go.
Then, you can pull out from this the exact info you want - for example,
proc sql;
select max(execdate)
into :last_exec_date
from MetaAC.LastExecDate
where execdate ne today()
;
quit;
Now, if you don't have control over this for some reason, you could determine this in a different way. Again, the exact process depends on your circumstances and your answers to 2 and 3.
If your answer to 2 is you always want it to be yesterday, then this is really easy - just do this:
%let daily_date=20210517;
%let last_execution_date = %sysfunc(putn(%sysevalf(%sysfunc(inputn(&daily_date,yymmdd8.))-1),yymmddn8.));
%put &=last_execution_date;
The two %sysfuncs just do the input/put from SAS datastep inside the macro language, and %sysevalf lets you do math.
If you don't want it to always be the prior day (if there are weekends, or other days you don't necessarily want to assume it's the prior day), then your best bet is to either use the dictionary tables to look at what's there and find the largest date prior to your date, or maybe use a x command to look at the folder and do the same thing (might be easier to use OS command than to use SQL for this, sometimes SQL dictionary tables can be slow).

Percent split with where condition in SAS

I am new to SAS and data analytics in general. So sorry if my question sound too dumb.
I have a dataset of brand medicine with three variables. Variable 1 contains the drug name, variable two contains whether that drug is BRANDED, Generic or Brand-Generic and variable 3 contains the total sale of that drug.
What I want is percent split the BRANDED, GENERIC AND BRANDED GENERIC drugs among total drug sale. The final output should look like
Branded : 35%
Generic : 25%
Branded-Generic : 40%
Any help with a sas code which would do that is greatly appreciated thank you.
So you want a % sale split! You can try using SQL (proc sql) to get your desired answer.
proc sql;
create table want as
select drug_type, sum(total_sale) as tot_sale
from have
group by drug_type;
create table want as
select *, tot_sale/sum(tot_sale) as percent_sale format=percent10.2
from want;
quit;
I created a table 'want' that will have total sale for each drug type. Using that table, I created a column that has the calculated sale percentage and formatted it to a percent (for easy view).
Of course, there are other ways of doing it, like using proc summary or proc freq or even a data step. But as a beginner, I guess starting out with SQL would be a good decision.

Missing values in a FREQ (SAS)

I'm going to ask this with an example...
Suppose i have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (and say this has values 1 for yes and 2 for no.) Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for instances where the count is zero?
By this I mean: if there is a particular value for AGE such that no observation with this AGE value has a 1 in the HASADOG variable, the report will still include a row for this combination (with a row percent of 0.)
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
table sex*age / sparse list;
run;
If the value is nowhere in your data set at all, then there's no way for SAS to know it exists. In this case you'd need a more complex solution, basically a way to tell SAS all values you would be using ahead of time. This can be done via a PRELOADFMT or CLASSDATA option on several procs. There are asked an answered questions on this topic here on SO, so I won't provide a solution for this option, which seems beyond the scope of your question.

using by group processing First. and Last

I just start learning sas and would like some help with understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagrate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step created 2 temporary variables for each by variable (first.varibale/last.variable) and the values are either 1 or 0, but what does first.dept and last.dept exactly do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default Without OUT=, PROC SORT overwrites the original data set.
Tips With in-database sorts, the output data set cannot refer to the input table on the DBMS.
You can use data set options with OUT=.
See SAS Data Set Options: Reference
Example Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that indicates the first observation of a specific BY group. So when you encounter the first record for a department it is identified. Last.DEPT is the last record for that specific department. It means the next record would the first record for a different department.
It sets PAYROLL to 0 at the first of each record. Since you have if last.dept; that means that only the last record for each department is outputted. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY but I assume they were trying to avoid having two passes of the data. Though if you're not sorting it may be just as fast anyways.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;

SAS drop cloned observation

I have a dataset structured with an ID and two other variables.
The id is not unique, it appears in the dataset more than 1 time (a patient could receive more than one clinical treatment).How can I drop the entire observation (the entire line) only if it is a perfect clone of a previous observation (based on the other two variable values)? I don't want to use an insanely long if statement.
Thanks.
proc sql;
select distinct * from olddata;
quit;
Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.
If you specifically want to identify if two consecutive lines are identical (but are not looking to match identical lines separated by other lines), you can use notsorted on a by statement and then first and last variables.
data want;
set have;
by id var1 var2 notsorted;
if first.var2;
run;
That will keep the first record for any group of identical id/var1/var2, so long as they're consecutive on the dataset. Of course if you sort the dataset by id var1 var2 first this will always remove the duplicates, but not sorted this still works for removing consecutive pairs (or more) that are collocated.
I prefer #JJFord's answer, but for the sake of completeness, this can also be done using the nodupe option in proc sort:
proc sort data=mydata nodupe;
by id;
run;
What you choose as the by variable doesn't really matter here. The important bit is just to specify the nodupe option.