Not getting the results from the suggestions trying to subset data - sas

DATA proj4.gasQTR;
SET proj4.gasQTR;
INPUT Q1 Q2 Q3 Q4;
IF MONTH = 1 or 2 or 3 THEN Q1 = 1;
ELSE IF MONTH = 4 or 5 or 6 THEN Q2 = 2;
ELSE IF MONTH = 7 or 8 or 9 THEN Q3 = 3;
ELSE IF MONTH = 10 or 11 or 12 THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
I am trying to get a 1-4 value for each quarter of each year. My error comes from the Quarter qtrw. part of the FORMAT statement: 'ERROR 388-185 Expecting an arithmetic operator'
*Data is already in 1-4 format for the month variable
What am I doing wrong?
Any help would be appreciated!
Thank you!

You normally do not use both a SET statement to retrieve data from an existing dataset and an INPUT statement to read values from a text file in the same data step. And if you do want to INPUT values from a text file you must tell SAS where to find the text, either by including an INFILE statement or by adding the text in-line with the code using a DATALINES (or CARDS) statement.
SAS will consider any number that is not zero or missing as TRUE. So a condition such as MONTH = 1 or 2 or 3 is always TRUE, because it is evaluated as (MONTH = 1) or 2 or 3. So Q1 will always be set to 1 and Q2, Q3 and Q4 will always be missing (or, if they already existed, unchanged). If you want to test whether a variable has any of a number of values, use the IN operator instead of the equality operator: month in (1 2 3)
You also should not be reading and writing the same dataset. If there are logic issues in your coding you might destroy the original dataset. So hopefully you have a backup copy of proj4.gasQTR, or a program that can recreate it.
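To see the difference in practice, here is a minimal sketch that evaluates both forms of the condition for every month. The dataset name DEMO and the variables flag_or and flag_in are made up for illustration.

```sas
/* Illustrative only: DEMO, flag_or and flag_in are hypothetical names. */
data demo;
  do month = 1 to 12;
    /* Evaluated as (month = 1) or 2 or 3; 2 is nonzero, so this is 1 for every month */
    flag_or = (month = 1 or 2 or 3);
    /* 1 only when month is 1, 2 or 3, otherwise 0 */
    flag_in = (month in (1 2 3));
    output;
  end;
run;

proc print data=demo noobs;
run;
```

Comparing the two flag columns in the printed output shows why only the first branch of the original IF/ELSE chain is ever taken.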
What is the format QTRW ? Is that something you created? Show its definition.
Assuming you have a variable named MONTH with integer values in the range 1 to 12 you can calculate QUARTER with integer values in the range 1 to 4 with a simple arithmetic function instead of coding a series of IF conditions.
data want;
set have;
quarter = ceil(month/3) ;
run;
If you actually have a DATE variable then perhaps all you were supposed to do was use the MONTH or QTR format to display the dates as the month number or quarter number that they fall into.
Try this program to see the impact of applying different formats to the same values.
data test;
do month=1 to 12;
date1=mdy(month,1,2022);
date2=date1;
date3=date1;
output;
end;
format date1 date9. date2 month. date3 qtr.;
run;
proc print;
run;

Use the in operator or repeat the equality for every case.
Example from the doc:
You can use the IN operator with character strings to determine whether a variable's value is among a list of character values. The following statements produce the same results:
if state in ('NY','NJ','PA') then region+1;
if state='NY' or state='NJ' or state='PA' then region+1;
Therefore
DATA proj4.gasQTR;
SET proj4.gasQTR;
IF MONTH = 1 or MONTH = 2 or MONTH = 3 THEN Q1 = 1;
ELSE IF MONTH = 4 or MONTH = 5 or MONTH = 6 THEN Q2 = 2;
ELSE IF MONTH = 7 or MONTH = 8 or MONTH = 9 THEN Q3 = 3;
ELSE IF MONTH = 10 or MONTH = 11 or MONTH = 12 THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
is equivalent to
DATA proj4.gasQTR;
SET proj4.gasQTR;
IF MONTH in (1,2,3) THEN Q1 = 1;
ELSE IF MONTH in (4,5,6) THEN Q2 = 2;
ELSE IF MONTH in (7,8,9) THEN Q3 = 3;
ELSE IF MONTH in (10,11,12) THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
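If the goal is a single variable holding 1 to 4, rather than four separate Q1-Q4 variables, the same IN conditions can assign to one variable. This is only a sketch: it assumes MONTH is numeric with values 1 to 12, it writes to a made-up dataset name (WORK.GASQTR_NEW) so the source table is not overwritten, and it leaves out the FORMAT statement because the definition of qtrw. is not shown.

```sas
/* Sketch only: WORK.GASQTR_NEW is a hypothetical output name. */
data work.gasQTR_new;
  set proj4.gasQTR;
  /* One variable, QUARTER, takes the value 1-4 depending on MONTH */
  if month in (1,2,3) then quarter = 1;
  else if month in (4,5,6) then quarter = 2;
  else if month in (7,8,9) then quarter = 3;
  else if month in (10,11,12) then quarter = 4;
run;
```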

Related

SAS-How to count the number of observation over the 10 years prior to certain month

I have a sample that includes two variables: ID and ym. ID is the specific ID for each trader and ym is the year-month variable. I want to create a variable that shows the number of observations over the 10-year period prior to month t, as shown in the following table.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempted to use BY-group processing and first.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the years in ym are not continuous. When the time gap is longer than 10 years, this method does not work. Although I assume I need to count the number of years in a rolling window (10 years), I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to handle an interval that is a multiple of a year, but harder to handle other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
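For a window that is not a whole number of years, one option is to convert YM to a SAS date inside the join and count months with INTCK. The following is a sketch under assumptions: YM is numeric in YYYYMM form, the 18-month window is an arbitrary example, and HAVE_DEMO and WANT_MONTHS are made-up names.

```sas
/* Hypothetical demo data; YM is assumed numeric in YYYYMM form. */
data have_demo;
  input id ym;
  datalines;
1 200101
1 200203
1 200301
1 200506
;
run;

proc sql;
  create table want_months as
  select a.id, a.ym, count(b.ym) as want
  from have_demo a
  left join have_demo b
    on a.id = b.id
   /* INPUT(PUT(ym,6.),YYMMN6.) turns 200203 into the SAS date 01MAR2002 */
   and 1 <= intck('month',
                  input(put(b.ym, 6.), yymmn6.),
                  input(put(a.ym, 6.), yymmn6.)) <= 18
  group by a.id, a.ym
  order by a.id, a.ym
  ;
quit;
```

Because both dates fall on the first of a month, INTCK('month', ...) here returns the exact number of months between the two rows.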
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,100);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
run;
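The year-month to month-index conversion described above can be checked on its own. This is a small sketch; CHECK_INDEX, month_index and gap are made-up names, and the first two input values are the ones used in the worked example.

```sas
/* Hypothetical check of the YYYYMM -> month index shorthand. */
data check_index;
  input ym;
  /* Year part times 12, plus the two-digit month part */
  month_index = 12 * floor(ym / 100) + mod(ym, 100);
  /* Difference from the previous row, in months (missing on the first row) */
  gap = dif(month_index);
  datalines;
200012
200101
201012
;
run;

proc print data=check_index noobs;
run;
```

The first two rows should reproduce 24012 and 24013 from the worked example, with a gap of 1 month. Note that the month part needs MOD(ym,100): MOD(ym,10) would return only the last digit and give wrong results for months 10 to 12.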

How can I select the first and last week of each month in SAS?

I have monthly data with several observations per day. I have day, month and year variables. How can I retain data from only the first and the last 5 days of each month? I have only weekdays in my data, so the first and last five days of the month change from month to month, i.e. for Jan 2008 the first five days can be the 2nd, 3rd, 4th, 7th and 8th of the month.
Below is an example of the data file. I wasn't sure how to share this so I just copied some lines below. This is from Jan 2, 2008.
Would a variation of first.variable and last.variable work? How can I retain observations from the first 5 days and last 5 days of each month?
Thanks.
1 AA 500 B 36.9800 NH 2 1 2008 9:10:21
2 AA 500 S 36.4500 NN 2 1 2008 9:30:41
3 AA 100 B 36.4700 NH 2 1 2008 9:30:43
4 AA 100 B 36.4700 NH 2 1 2008 9:30:48
5 AA 50 S 36.4500 NN 2 1 2008 9:30:49
If you want to examine the data and determine the minimum 5 and maximum 5 values then you can use PROC SUMMARY. You could then merge the result back with the data to select the records.
So if your data has variables YEAR, MONTH and DAY you can make a new data set that has the top and bottom five days per month using simple steps.
proc sort data=HAVE (keep=year month day) nodupkey
out=ALLDAYS;
by year month day;
run;
proc summary data=ALLDAYS nway;
class year month;
output out=MIDDLE
idgroup(min(day) out[5](day)=min_day)
idgroup(max(day) out[5](day)=max_day)
/ autoname ;
run;
proc transpose data=MIDDLE out=DAYS (rename=(col1=day));
by year month;
var min_day: max_day: ;
run;
proc sql ;
create table WANT as
select a.*
from HAVE a
inner join DAYS b
on a.year=b.year and a.month=b.month and a.day = b.day
;
quit;
/****
get some dates to play with
****/
data dates(keep=i thisdate);
offset = input('01Jan2015',DATE9.);
do i=1 to 100;
thisdate = offset + round(599*ranuni(1)+1); *** within 600 days from offset;
output;
end;
format thisdate date9.;
run;
/****
BTW: intnx('month',thisdate,1) = first day of next month. Deduct 1 to get the last day
of the current month.
intnx('month',thisdate,0,"BEGINNING") = first day of the current month
****/
proc sql;
create table first5_last5 AS
SELECT
*
FROM
dates /* replace with name of your data set */
WHERE
/* replace all occurences of 'thisdate' with name of your date variable */
( intnx('month',thisdate,1)-5 <= thisdate <= intnx('month',thisdate,1)-1 )
OR
( intnx('month',thisdate,0,"BEGINNING") <= thisdate <= intnx('month',thisdate,0,"BEGINNING")+4 )
ORDER BY
thisdate;
quit;
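The INTNX calls used in the WHERE clause above can be tried on a single date to see what they return. A minimal sketch; the date 15FEB2008 and the variable names are chosen only for illustration.

```sas
/* Illustrative only: prints month boundaries for one example date to the log. */
data _null_;
  thisdate  = '15FEB2008'd;
  /* First day of the month containing thisdate */
  first_day = intnx('month', thisdate, 0, 'BEGINNING');
  /* Last day of the month, two ways */
  last_day  = intnx('month', thisdate, 0, 'END');
  last_day2 = intnx('month', thisdate, 1) - 1;
  put first_day= date9. last_day= date9. last_day2= date9.;
run;
```

For this date the log should show 01FEB2008 for first_day and 29FEB2008 for both last-day expressions. These are calendar days, so the ranges in the query above count weekends as part of the first and last five days.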
Create some data with the desired structure:
Data inData (drop=_:); * drop all variables starting with an underscore*;
format date yymmdd10. time time8.;
_instant = datetime();
do _i = 1 to 1E5;
date = datepart(_instant);
time = timepart(_instant);
yy = year(date);
mm = month(date);
dd = day(date);
*just some more random data*;
letter = byte(rank('a') +floor(rand('uniform', 0, 26)));
*select week days*;
if weekday(date) in (2,3,4,5,6) then output;
_instant = _instant + 1E5*rand('exponential');
end;
run;
Count the days per month:
proc sql;
create view dayCounts as
select yy, mm, count(distinct dd) as _countInMonth
from inData
group by yy, mm;
quit;
Select the days:
data first_5(drop=_:) last_5(drop=_:);
merge inData dayCounts;
by yy mm;
_newDay = dif(date) ne 0;
retain _nrInMonth;
if first.mm then _nrInMonth = 1;
else if _newDay then _nrInMonth + 1;
if _nrInMonth le 5 then output first_5;
if _nrInMonth gt _countInMonth - 5 then output last_5;
run;
Use the INTNX() function. You can use INTNX('month',...) to find the beginning and ending days of the month and then use INTNX('weekday',...) to find the first 5 week days and last five week days.
You can convert your month, day, year values into a date using the MDY() function. Let's assume that you do that and create a variable called TODAY. Then to test if it is within the first 5 weekdays or last 5 weekdays of the month you could do something like this:
first5 = intnx('weekday',intnx('month',today,0,'B'),0) <= today
<= intnx('weekday',intnx('month',today,0,'B'),4) ;
last5 = intnx('weekday',intnx('month',today,0,'E'),-4) <= today
<= intnx('weekday',intnx('month',today,0,'E'),0) ;
Note that those ranges will include the week-ends, but it shouldn't matter if your data doesn't have those dates.
But you might have issues if your data skips holidays.

Calculating number of correct of multiple choice questions

I have data on questions which students answered. The format is such
Student Q1 Q2 Q3 Q4
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
For this example, let's say 1 is the correct answer for question 1, and 2 is the correct answer for questions 2, 3 and 4.
How would I generate a statistic table that would tell me how many questions a student answered correctly? In the example above, it would say something like
Student Answered Correct:
A 2/4
You can create an array of the correct answers, then just loop through the student answers to compare them.
I've created the final variable as character to display in the format you've shown. Obviously this means you won't have access to the underlying value, so you may want to keep the number of correct answers in the data for other analysis purposes.
data have;
input Student $ Q1 Q2 Q3 Q4;
datalines;
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
;
run;
data want;
set have;
array correct{4} (1 2 2 2); /* create array of correct answers: 1 for Q1, 2 for Q2-Q4 */
array answer{4} q1-q4; /* create array of student answers */
_count=0; /* reset count to 0 */
do i = 1 to dim(correct);
if answer{i} = correct{i} then _count+1; /* compare student answer to correct answer and increment count by 1 if they match */
end;
length answered_correct $8; /* set length for variable */
answered_correct = catx('/',_count,dim(correct)); /* display result in required format */
drop q: correct: i _count; /* drop unwanted variables */
run;
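As noted above, the character result hides the underlying number. A minimal sketch that keeps a numeric count and proportion instead could look like this; SCORES_DEMO, WANT_NUMERIC, n_correct and pct_correct are made-up names, and the answer key (1 2 2 2) is the one stated in the question.

```sas
/* Hypothetical demo data, copied from the question. */
data scores_demo;
  input Student $ Q1 Q2 Q3 Q4;
  datalines;
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
;
run;

data want_numeric;
  set scores_demo;
  /* _TEMPORARY_ keeps the answer key out of the output dataset */
  array correct{4} _temporary_ (1 2 2 2);
  array answer{4} q1-q4;
  n_correct = 0;
  do i = 1 to dim(answer);
    /* A true comparison evaluates to 1, a false one to 0 */
    n_correct = n_correct + (answer{i} = correct{i});
  end;
  pct_correct = n_correct / dim(answer);
  format pct_correct percent8.1;
  drop i;
run;
```

Keeping n_correct numeric means it can still be summarised or sorted later, and the "2/4" style label can be built from it when needed.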
First you have to create variable num_questions and set it to the number of questions. Then you need to write as many if-then-else statements as questions to create binary variables (flags) to check if each answer is correct (e.g. Correct_Q1). Use sum(of Correct:) to get the total of correct answers for each student. Correct: references all variable names starting with 'Correct'.
data want;
set have;
num_questions = 4;
if Q1 = 1 then Correct_Q1 = 1; else Correct_Q1 = 0;
if Q2 = 2 then Correct_Q2 = 1; else Correct_Q2 = 0;
if Q3 = 2 then Correct_Q3 = 1; else Correct_Q3 = 0;
if Q4 = 2 then Correct_Q4 = 1; else Correct_Q4 = 0;
format Answered_Correct $3. Answered_Correct_pct percent.;
Answered_Correct = compress(put(sum(of Correct:), 8.)||'/'||put(num_questions, 8.));
Answered_Correct_pct = sum(of Correct:) / num_questions;
label Student = 'Student' Answered_Correct = 'Answered correct' Answered_Correct_pct = 'Answered correct (%)';
keep Student Answered_Correct Answered_Correct_pct;
run;
proc print data=want noobs label;
run;
If you have just four questions, the fastest solution would probably be to just use conditional statements:
if Q1 = 1 then answer + 1;
For a more general solution using a lookup/answer table:
Transpose the data, merge the answer table, summarize on student.
data broad_data;
infile datalines missover;
input Student $ Q1 Q2 Q3 Q4;
datalines;
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
;
data answers;
infile datalines missover;
input question $ correct_answer ;
datalines;
Q1 1
Q2 2
Q3 2
Q4 2
;
data long_data;
set broad_data;
length question $10 answer 8;
array long[*] Q1--Q4;
do i = 1 to dim(long);
question = vname(long[i]);
answer = long[i];
output;
end;
keep Student question answer;
run;
proc sort data = long_data; by question student; run;
data long_data_answers;
merge long_data
answers
;
by question;
run;
proc sort data = long_data_answers; by student; run;
data result;
do i = 1 by 1 until (last.student);
set long_data_answers;
by student;
count = sum(count, answer eq correct_answer);
end;
result = count/i;
keep student result;
format result fract8.;
run;
If you like sql/want to compress your code you can combine the last two datasteps + sorts into one statement.
proc sql;
create table result as
select student, sum(answer eq correct_answer)/count(*) as result format fract8.
from long_data a
inner join answers b
on a.question eq b.question
group by student
;
quit;

How to sum value group by 4 Quarter data into one value in SAS

I have a data set that contains quarterly data value. But now I want to sum the quarterly values which have the same year.
Data h :
time value
01JAN90 23
01APR90 31
01JUL90 13
01OCT90 45
01JAN91 11
01APR91 4
01JUL91 1
01OCT91 17
I want my result data like this:
time value
1990 112
1991 33
If your time variable is numeric, you can use a FORMAT statement within PROC SUMMARY to automatically extract the year as the PROC runs. (Thanks to @Joe for showing this in comments to my original answer.)
PROC SUMMARY NWAY DATA = h;
CLASS time;
FORMAT time YEAR. ;
OUTPUT
OUT = result (
KEEP = time value
)
SUM (value) =
;
RUN;
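An alternative that produces an explicit numeric year column is to group on YEAR(time) in PROC SQL. This is a sketch under assumptions: it reads the sample rows with the DATE7. informat so that time is a numeric SAS date, relies on the session's YEARCUTOFF setting to place the two-digit years in the 1990s, and uses made-up dataset names H_DEMO and RESULT_SQL.

```sas
/* Hypothetical copy of the sample data with TIME stored as a SAS date. */
data h_demo;
  input time :date7. value;
  format time date7.;
  datalines;
01JAN90 23
01APR90 31
01JUL90 13
01OCT90 45
01JAN91 11
01APR91 4
01JUL91 1
01OCT91 17
;
run;

proc sql;
  create table result_sql as
  select year(time) as year, sum(value) as value
  from h_demo
  group by calculated year;
quit;
```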

SAS - aggregating daily content with same day occurrences

I have daily data, not completely consecutive (i.e., not all days are present in a week) and I need to convert it to weekly totals. The catch is that the data pertains to transactions such that there are multiple observations with the same day. Using the following PROC EXPAND procedure results in an error "The value of the ID variable, FixtureDate=04JAN2011, at observation number 2 in data set RAW.VLCC2011 is the same as the previous observation":
PROC EXPAND DATA = raw.VLCC2011 OUT = raw.VLCC2011_wkly FROM= Day TO = Week;
convert FixtureCargoSize/ OBSERVED=TOTAL method=aggregate;
ID FixtureDate;
run;
Here's a solution with proc sql. You could also do something similar with a data step.
proc sql;
create table VLCC2011_wkly as
select intnx('week', FixtureDate, 0, 'end') as week format=date9., sum(FixtureCargoSize) as FixtureCargoSizeTotal
from raw.VLCC2011
group by calculated week;
quit;
The intnx function takes a date and moves it to some other date. In this case, it takes any date and moves it to the last day of the week. Summing over all dates that have the same end-of-week date in this way will give you what you want.
I've not used PROC EXPAND. However the error message tells you that it doesn't like that there are multiple observations per ID value. Maybe you need to pre-process the input data set RAW.VLCC2011 such that there is at most one observation per FIXTUREDATE.
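One way to do that pre-processing is to total the transactions per day first. A minimal sketch, assuming FixtureDate is a numeric SAS date; WORK.VLCC2011_DAILY is a made-up output name.

```sas
/* Sketch only: collapses RAW.VLCC2011 to one row per FixtureDate. */
proc summary data=raw.VLCC2011 nway;
  class FixtureDate;
  var FixtureCargoSize;
  /* SUM= with no name keeps the original variable name for the daily total */
  output out=work.VLCC2011_daily (drop=_type_ _freq_) sum=;
run;
```

The resulting table has unique FixtureDate values, which removes the duplicate-ID condition the error message complains about. Whether PROC EXPAND then handles the remaining gaps between days as intended has not been checked here.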
Here is how I solved it (the long way):
data raw.VLCC2011_wkly;
set raw.VLCC2011;
IF FixtureDay < 8 then FixtureWeek = 1;
IF FixtureDay > 7 and FixtureDay < 15 then FixtureWeek = 2;
IF FixtureDay > 14 and FixtureDay < 23 then FixtureWeek = 3;
IF FixtureDay > 22 and FixtureDay < 30 then FixtureWeek = 4;
IF FixtureDay > 29 and FixtureDay < 32 then FixtureWeek = 5;
run;
proc sql;
create table raw.VLCC2011_wkly1 as
select FixtureMonth, FixtureDay, FixtureWeek, FixtureCargoSize, sum(FixtureCargoSize) as CargoSizeTotal
from raw.VLCC2011_wkly
group by FixtureMonth, FixtureWeek
Order by FixtureMonth, FixtureWeek, FixtureDay;
quit;