I have data on questions which students answered. The format is such
Student Q1 Q2 Q3 Q4
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
For this example, lets say 1 is the correct answer for question 1, 2 is the correct answer for question 2,3 and 4.
How would I generate a statistic table that would tell me how many questions a student answered correctly? In the example above, it would say something like
Student Answered Correct:
A 2/4
You can create an array of the correct answers, then just loop through the student answers to compare them.
I've created the final variable as character to display in the format you've shown. Obviously this means you won't have access to the underlying value, so you may want to keep the number of correct answers in the data for other analysis purposes.
data have;
input Student $ Q1 Q2 Q3 Q4;
datalines;
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
;
run;
data want;
set have;
array correct{4} (1 2 3 4); /* create array of correct answers */
array answer{4} q1-q4; /* create array of student answers */
_count=0; /* reset count to 0 */
do i = 1 to dim(correct);
if answer{i} = correct{i} then _count+1; /* compare student answer to correct answer and increment count by 1 if they match */
end;
length answered_correct $8; /* set length for variable */
answered_correct = catx('/',_count,dim(correct)); /* display result in required format */
drop q: correct: i _count; /* drop unwanted variables */
run;
First you have to create variable num_questions and set it to the number of questions. Then you need to write as many if-then-else statements as questions to create binary variables (flags) to check if each answer is correct (e.g. Correct_Q1). Use sum(of Correct:) to get the total of correct answers for each student. Correct: references all variable names starting with 'Correct'.
data want;
set have;
num_questions = 4;
if Q1 = 1 then Correct_Q1 = 1; else Correct_Q1 = 0;
if Q2 = 2 then Correct_Q2 = 1; else Correct_Q2 = 0;
if Q3 = 2 then Correct_Q3 = 1; else Correct_Q3 = 0;
if Q4 = 2 then Correct_Q4 = 1; else Correct_Q4 = 0;
format Answered_Correct $3. Answered_Correct_pct percent.;
Answered_Correct = compress(put(sum(of Correct:),$8.)||'/'||put(num_questions, 8.));
Answered_Correct_pct = sum(of Correct:) / num_questions;
label Student = 'Student' Answered_Correct = 'Answered correct' Answered_Correct_pct = 'Answered correct (%)';
keep Student Answered_Correct Answered_Correct_pct;
run;
proc print data=want noobs label;
run;
If you only have just four questions the fastest solution would probably be to just use conditional statements:if Q1 = 1 then answer + 1;
For a more general solution using a lookup/answer table:
Transpose the data, merge the answer table, summarize on student.
data broad_data;
infile datalines missover;
input Student $ Q1 Q2 Q3 Q4;
datalines;
A 1 3 2 3
B 2 3 2 2
C 1 2 1 2
D 3 3 1 2
;
data answers;
infile datalines missover;
input question $ correct_answer ;
datalines;
Q1 1
Q2 2
Q3 2
Q4 2
;
data long_data;
set broad_data;
length question $10 answer 8;
array long[*] Q1--Q4;
do i = 1 to dim(long);
question = vname(long[i]);
answer = long[i];
output;
end;
keep Student question answer;
run;
proc sort data = long_data; by question student; run;
data long_data_answers;
merge long_data
answers
;
by question;
run;
proc sort data = long_data_answers; by student; run;
data result;
do i = 1 by 1 until (last.student);
set long_data_answers;
by student;
count = sum(count, answer eq correct_answer);
end;
result = count/i;
keep student result;
format result fract8.;
run;
If you like sql/want to compress your code you can combine the last two datasteps + sorts into one statement.
proc sql;
create table result as
select student, sum(answer eq correct_answer)/count(*) as result format fract8.
from long_data a
inner join answers b
on a.question eq b.question
group by student
;
quit;
Related
I have a sample that include two variables: ID and ym. ID id refer to the specific ID for each trader and ym refer to the year-month variable. And I want to create a variable that show the number of years over the 10 years period prior month t as shown in the following figure.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempt to use by function and fisrt.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the year in ym is not continuous. When the time gap is higher than 10 years, this method is not working. Although I assume I need to count the number of year in a rolling window (10 years), I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to do interval that is a multiple of a year, but harder to do other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,10);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
I am pretty new is SAS and now i am struggling with division table by table. Tables have the same size.
My goal is to divide each element of table_1 by the corresponding element of table_2, producing a new table.
Google advises to use SAS/IML but i have no access to it.
Is there any option to do that in data step? Another ideas?
For instance first table looks like:
30 30
30 30
Second table:
2 3
5 6
Then output table should be:
15 10
6 5
Thank you a lot in advice!
One way to do would be merge the two tables together and perform the division.
data dividend;
a = 30; b = 30; output;
a = 30; b = 30; output;
run;
data divisor;
c = 2; d = 3; output;
c = 5 ; d = 6 ; output;
run;
data comb;
merge dividend divisor;
q1 = a/c;
q2 = b/d;
keep q1 q2;
run;
This assumes that there a 1 - to - 1 correspondence between the dividend and divisor rows
EDITED TO RESPOND TO QUESTION IN COMMENTS
Assuming you have years 2015 to 2010, each with 4 quarters, you could write a macro loop:
%macro divide;
%let years = %str(2015 2016 2017 2018 2019);
%let qtrs = %str(Q1 Q2 Q3 Q4);
data comb;
merge dividend divisor;
%let i = 1;
%do %while (%scan(&years, &i) ne );
%let year = %scan(&years, &i);
%let j = 1;
%do %while (%scan(&qtrs, &j) ne );
%let q = %scan(&qtrs, &j);
R_&q._&year = &q._&year._D / &q._&year._A;
%let j = %eval(&j + 1);
%end;
%let i = %eval(&i + 1);
%end;
keep R_:;
run;
%mend;
%divide;
You probably should store your data in a different way to make this problem easier (and most problems easier). Instead of storing a "matrix" where the row number and column number have some hidden meaning you could create a dataset where that meaning is stored into variables. So instead of:
30 30
30 30
You could have:
Row Col Value
1 1 30
1 2 30
2 1 30
2 2 30
Just use meaningful names instead of ROW,COL and VALUE. Like COUNTRY, YEAR, POPULATION.
Then your division problem becomes both simpler and clearer. Combine the two datasets by the id variables and divide the variables to make a new variable.
data death_rate ;
merge deaths population;
by country year ;
death_rate = deaths / population;
run;
If you do want to work with matrices then look into PROC IML. Here is example from documentation.
proc iml;
a = {1 2,
3 4};
b = {5 6,
7 8};
c = a/b;
I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 1001
The position of the elements of the dept_vect is organized alphabetically. So a '1' in the first position means that the id is found in deptartment A and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in the department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works but there are a couple of problems. For col4 to appear, I need at least one id to be found on all departments but that could be fixed by creating a dummy id so that id is found on all departments. But the main problem is that this code is not robust. Is there a way to code this so that it would work for any number of departments?
Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=test out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;
how can i perform calculation for the last n observation in a data set
For example if I have 10 observations I would like to create a variable that would sum the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a bit more elegant solution than that.
with the code below alpha is the data set that i have and bravo is the one i need.
data alpha;
input lima ## ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However there is a procedure that can do all kind of cumulative, moving window, etc calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it will expand your series by 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* do whatever calculation is needed */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
output;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
drop i;
run;
proc print data=epsilon;
run;
I can offer rather ugly solution:
run data step and add increasing number to each group.
run sql step and add column of max(group).
run another data step and check if value from (2)-(1) is less than 5. If so, assign to _num_to_sum_ variable (for example) the value that you want to sum, otherwise leave it blank or assign 0.
and last do a sql step with sum(_num_to_sum_) and group results by grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
In case anyone reads this :)
I solved it the way I needed it to be solved. Although now I am more curious which of the two(the retain and my solution) is more optimal in terms of computing/processing time.
Here is my solution:
data bravo(keep = var1 summ);
set alpha;
do i=_n_ to _n_-4 by -1;
set alpha(rename=var1=var2) point=i;
summ=sum(summ,var2);
end;
run;
How to add new observation to already created dataset in SAS ? For example, if I have dataset 'dataX' with variable 'x' and 'y' and I want to add new observation which is multiplication by two of the
of the observation number n, how can I do it ?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is multiplication by ten of observation number two.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use macro to obtain your desired result :
Write a macro which will read first DataSet and when _n_=2 it will multiply x and y with 10.
After that create another DataSet which will hold only your muliplied value let say x'=10x and y'=10y.
Pass both DataSet in another macro which will set the original datset and newly created dataset.
Logic is you have to create another dataset with value 10x and 10y and after that set wih previous dataset.
I hope this will help !