I am taking a scripting class and I have no idea what I'm doing!
For my assignment, I am supposed to print min/max/mean/std for each year. The .csv file I was given to use has a year column with the years as
1949.083
1949.167
1949.25
1949.333
1949.417
1949.5
1949.583
1949.667
1949.75
1949.833
1949.917
1950
1950.083
1950.167
and so on, all the way to 1960.
Assuming I am using PROC MEANS, is there a way to maybe combine the years so I can print a single set of calculations (min/max/mean/std) for each year? As in one set of calculations for the year 1949 (data values from 1949-1949.917), another one for 1950 (data values from 1950-1950.917), etc. Not sure if I'm making sense! I've been looking everywhere for hours and I can't figure it out! :(
If you want PROC MEANS to calculate separate statistics per year you can use a CLASS statement. With a CLASS statement it will define the groups based on the formatted value. So if you just use the format 4. with the variable YEAR then each value will be mapped to a simple 4 digit value.
proc means data=have min max mean std ;
class year;
format year 4.;
var analysis_var ;
run;
But that will round values like 1,949.667 to 1950 and not 1949. If you want to ignore the fractional part of the year you can use the INT() function. So first create a new variable and then use that new variable in the CLASS statement.
data step1;
set have;
yrnum = int(year);
run;
proc means data=step1 min max mean std ;
class yrnum ;
var analysis_var ;
run;
Related
I am using the INTNX function to calculate month intervals. I'm finding that the results are frequently one day off from what I would expect... For example, look at this code:
data test;
olddate='20140531';
oldsasdate=input(olddate,yymmdd8.);
newsasdate=intnx('month',oldsasdate,-17);
newdate=put(newsasdate,yymmdd8.);
run;
In this code, I try to find the date 17 months before 05/31/2014. I would expect the function to return 11/30/2012, but it instead returns 12/1/2012. Any idea what's going on here? Is there a way to fix this?
The default for intnx is to align with the start of the month. It basically tracks interval boundaries, so each time it goes from MM/01/YY to MM/30/YY it ticks one interval crossed.
So,
data _null_;
x = intnx('month','31MAY2014'd, -1);
put x= date9.;
run;
Returns '01APR14'd, not '30APR14'd.
You can change it to 'same' alignment with the optional 4th parameter (SAS 9.2+ I believe).
data _null_;
x = intnx('month','31MAY2014'd, -1,'s');
put x= date9.;
run;
I have a question regarding moving average. I use Proc Expand (cmovave 3), but those three days can be non consecutive I suppose. I want to avoid missing data between days and use moving average for just those adjacent days.
Is there any way that I can do this? If I want to put it in another way 'how can I select a part of my data set where I have values for consecutive period (days)?'. I hope you give me some examples for this problem.
Use Expand to make sure you have all the values in the timeseries interval. Then use a data step to calculate the ma3 with the lagN() functions.
If you data already has the correct timeseries interval, then skip the PROC EXPAND step.
data test;
start = "01JAN2013"d;
format date date9.
value best.;
do i=1 to 365;
r = ranuni(1);
value = rannor(1);
date = intnx('weekday',start,i);
dummy=1;
if r > .33 then output;
end;
drop i start r;
run;
proc expand data=test out=test2 to=weekday ;
id date;
var dummy;
run;
data test(drop=dummy);
merge test2 test;
by date;
ma3 = (value + lag(value) + lag2(value))/3;
run;
I use the DUMMY variable so that EXPAND will convert the series to WEEKDAY. Then drop it afterwards.
I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$10. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname1000 endpoint using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternately, you could put out the original dataset as a three column dataset with many rows for each category (was it this way to begin with prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names).
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;
I have this dataset here which looks like this:
Basically I want to manipulate the data set so that I have
GVKEY1 as unique such as 1004 then a unique year number such as 1996 then several gvkey2 after that. However the number of gvkey2 for each year is not the same. Does anyone know how to get around this problem? This means I will have several 12 lines of data for gvkey1 for 1004 since i have years from 1996 to 2008. Then for each year I will have many columns where each column will have a gvkey2.
Best Regards,
Naz
Can you not just use PROC TRANSPOSE?
proc sort data=your_data_set out=temp1;
by gvkey1 year;
run;
proc transpose data=temp1 out=temp2;
by gvkey1 year;
var gvkey2;
run;
This will give you a series of variables COL1 - COLx. Use the PREFIX option for different variable names.
I'm not sure I've understood your question, but if you're looking for unique gvkey1/year pairs, you could do either of these:
proc sql;
create table results as
select distinct gvkey1, year
from _your_data_set;
quit;
or
proc sort data=_your_data_set(keep=gvkey1 year) out=results nodupkey;
by gvkey1 year;
run;
If that's not what you're looking for, I suggest posting an example of the results you want.