Setting op table in SAS via proc tabulate - sas

I have some data about students and their dropout procent. I have information abaout which education they started on, in a city (some educations are found in more cities) and the year they started their education. I also have information about wheter there were a quotient the studetns had to meet to be able to start their education.
The quotient variable can consist of numeric values and character values (see the table)
I want to make a table in SAS where I have the quotient and the dropout % like in the below picure:
So for each education and for each city I have the years out as rows and in the cells I have the quota for that year and the dropout % for the year.
I can not do it in SAS. I have tried:
proc tabulate data= sammensat missing;
var dropout;
class education year city quota ;
Table education* city,year *dropout all/ rts=180;
run;
This gives me part of the output I want. But I want another row showing the quota for each combination of education and city for each year.

Two problems: including the quota, and dealing with the char values.
Including quota is easy, if it's numeric.
proc tabulate data= sammensat missing;
class education year city;
var dropout quota;
Table education*city*(quota dropout),year all/ rts=180;
run;
You might need to add in statistics to those if they're not both the same (and both N); probably *mean for both, not sure exactly what your data looks like.
To deal with the character problem, you need to create a format that has either special values for the quota, if they're just assigned values (this city-year-education combination has no quota by definition), or uses values that show there is no quota (missing, 0, etc.).
proc format;
value quotaf
-1='NO QUOTA'
-9='PASSED AUDITON'
0-high=[3.1]
;
quit;
Then use that to format quota, either in the dataset or with a f= option on quota in the proc tabulate.

Related

SAS How to sum a variable in duplicate records

Noob SAS user here.
I have a hospital data set with patientID and a variable that counts the days between admission and discharge.
Those patients who had more than one hospital admission show up with the same patientID and with a record of how many days they were in hospital each time.
I want to sum the total days in hospital per patient, and then only have one patientID record with the sum of all hospital days across all stays. Does anyone know how I would go about this?
You want to select distinct the sum of days_in_hospital and group by patientID This will get what you want:
proc sql;
create table want as
select distinct
patientID,
sum(days_in_hospital) as sum_of_days
from have
group by patientID;
quit;
Alternatively you can use proc summary.
proc summary data= hospital_data nway;
class patientID;
var days;
output out=summarized_data (drop = _type_ _freq_) sum=;
run;
This creates a new dataset called summarized_data which has the summed days for each patientID. (The nway option removes the overall summary row, and the drop statement removes extra default summary columns you don't need.)

How can I pull certain records and update data in SAS?

I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need to a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing your most twenty recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC BY your customer ID variable. If you set it up as a view, you don't even have to rerun that part - you can create a permanent view pointing to your constantly updating data, and the subsetting to last 20 records will happen every time you run the analysis PROC.
Yes, You can either add a Rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a datasetp to select the top20 purchasers per customer then do your summary statistics. My Code below will create a table called "WANT" with the top 20 and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort Data in Descending order by ID and Date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select Top 2: Change my 2 to 20 in your case
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1

SAS: Multiple patient diagnoses on multiple lines

I have a dataset of patient diagnoses with one diagnosis code per line, resulting in patient diagnoses on multiple lines. Each patient has a unique patientID. I also have age, race, gender, etc. data on these patients.
How do I indicate to SAS when using PROC FREQ, Logistic, Univariate, etc. that they are the same patient?
This is an example of what the data looks like:
patientID diagnosis age gender lab
1 15.02 65 M positive
1 250.2 65 M positive
2 348.2 23 M negative
2 282.1 23 M negative
3 50 F positive
I was given data on every patient who has had a certain lab (regardless of positive result), as well as all of their diagnoses, which each appear on a different line (as a different observation to SAS). First, I will need to exclude every patient who has a negative result for the lab, which I plan on using an IF statement for. The lab determines if the patient has disease X. Some patients do not have any additional diseases, other than disease X, such as patient #3.
Analyses I would like to perform:
Calculate the frequency of each disease using PROC FREQ.
Characterize the age and race relationships for each diagnosis using PROC FREQ chi square.
PROC Logistic to determine risk factors (age, race, gender, etc.)for developing an additional disease on top of disease X.
Thanks!
The answer to your question is you cannot by default. But when you're processing the data you can account for it easily. IMO keeping it long is easier.
You've asked too many questions above so I'll answer just one, how to count the number of people with disease x.
Proc sort data = have out = unique_disease_patient nodupkey;
By patientID Diag;
Run;
Proc freq data = unique_disease_patient noprint;
Table disease / out = disease_patient_count;
Run;
Note that this is much easier in SQL
Proc sql;
Create table want as
Select diag, count(distinct patientID)
From have
Group by diag;
Quit;
I'm assuming this is homework because you're unlikely to do this in practice except for exploratory analysis.

Contingency table in SAS

I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;

edit similar values in sas

I have a dataset of transactional data per week. (quantity, price, week, etc.)
However in the dataset i have two prices for the same week.
eg two observations for week 28 (one at price 5.03 and one at price 5.20)
what i want to do is calculate the weighted average price depending on the quantity and sum the quantity for the two different obs so that i have only one obs for week 28.
this happens frequently so i would like to be able to do this quickly without editing manually all prices and quantities.
Oh and this is in SAS btw!
Thanks!
PROC SUMMARY with the WEIGHT statement applied against price will calculate this for you.
proc summary data=have nway;
class week;
var quantity;
var price / weight=quantity;
output out=want (drop=_:) sum(quantity)= mean(price)=;
run;