insert columns that are missing in a range - sas

I've created panel data by transposing columns, based on weeks, and some of the weeks never had observations, so those weeks never showed up as columns. Is there a reasonable way to insert the weeks that had no observations?
I need week0-week61, but currently I am missing week0, week4, week8... It seems silly to do this by hand in Excel.

The simplest way is like this:
data ttt;
input id week0 week4;
datalines;
1 10 20
2 11 21
;
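data ttt1;
set ttt;
/* The ARRAY statement creates any of week0-week61 that do not exist yet; the new variables are numeric and missing */
array a{*} week0-week61;
run;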
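One side effect of the step above is column order: the variables read by SET come first, and the newly created weeks are appended after them. A minimal sketch of one way to get the columns in week order, assuming id is numeric as in the sample data (ttt2 is a hypothetical output name), is to declare the variables before the SET statement:

```sas
/* Sketch only: variables are positioned in the order the compiler first sees them */
data ttt2;
  length id 8;               /* assumes id is numeric; keeps id as the first column */
  array a{*} week0-week61;   /* defines week0..week61 in order, creating missing ones */
  set ttt;                   /* existing weeks are filled from ttt, the rest stay missing */
run;
```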

Related

SAS: How can we replace values for bunch of data using loops?

I have been trying to replace bunch of data to different format.
For instance, the variable called Week starts from a value 1124 to a value 1175. I want to change this value starting from 1.
That is,
Week Week
1124 1
1125 2
If this were R, I would use a for-loop and store the results back to Week to replace them, but I am not sure how to formulate something similar in SAS. The only method I came up with was:
if Week = 1124 then Week = 1;
and so forth.
run;
This is very inefficient, as I have to write it 30+ times. Is there a more efficient method to tackle this issue? In other words, is there something similar to for-loops?
SAS DATA step is an implicit loop -- every row in the data set is processed until there are no more rows. Use simple arithmetic to transform the value of the week variable.
data want;
set have;
week = week - 1123;
run;
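If the starting value is not known in advance, the offset can be looked up instead of hard-coded. This is a sketch only, assuming the smallest value of week should map to 1; the macro variable name minweek is hypothetical:

```sas
/* Store the smallest week value in a macro variable */
proc sql noprint;
  select min(week) into :minweek trimmed
  from have;
quit;

/* Shift every week so that the smallest one becomes 1 */
data want;
  set have;
  week = week - &minweek + 1;
run;
```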

SAS is the number lower than the highest number in the column so far

I am using SAS and I have a column in my table that consists of various numbers. I want to go down the column and select each number that is smaller than the highest number so far. I posted a picture of an example of what I am looking for. I also have a column with the year, which I didn't include in the picture, if that matters. I am guessing I will need some sort of loop. n is the original column and output is what I would like my loop to produce.
example:
n - current column
28
22
30
40
39
55
110
89
98
160
155
157
250
output - desired output
22
39
89
98
155
157
I attempted this in proc sql because I am new to SAS and know much more about SQL. As I was attempting it, I realized I am not going to be able to do it in proc sql.
Here is what I tried in proc sql.
I can post more things I have tried as I attempt more loops. As of now my loops are too far off.
proc sql;
select a.*
from homework a
full join homework b on a.make = b.make
and a.model = b.model
where a.[Initial Model Year] < b.[Initial Model Year]
and a.MPH < b.MPH;
quit;
Why always use SQL? SAS has a lot of facilities that are often better suited to the job than SQL. Juniors in SAS tend to use the only thing they know from school, SQL, and neglect all the rest.
By definition, SQL is not suited to this job! SQL does not even guarantee that the order of rows is preserved, let alone that you can use the order of the input rows in your logic. (Yes, there are SQL dialects that can do this, but not standard SQL.)
Use a data step. It reads your data row by row, in the order the rows occur.
Avoid writing loops explicitly whenever you can. The data step implicitly loops over its input.
By default, the data step writes one row for each row read. You can remove a row from the output with a delete statement. You can also write explicit output statements; then only the rows for which output is executed will be in the output. (output is also used if you want more than one row in the output per row in the input.)
However, by default, row by row means it forgets the previous row and everything related to it. So you need to explicitly retain some information.
Attention: by default SAS keeps all intermediate results of calculations as variables in the output. If you don't want that, you need either an explicit keep statement or a drop.
Example solution:
data MY_SELECTION;
set MY_INPUT;
retain largest 0; * largest is initialized to 0 for the first row only *;
if largest < number then largest = number;
else if number < largest then output;
drop largest;
run;
Final remark: By default, SQL writes a report and the data step creates a new data set. If you want SQL to behave like the data step, precede your query with create table MY_SELECTION as. If you want the data step to behave like SQL, insert proc print; before the run;
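The question's SQL attempt compares rows within each make and model, so the running maximum may need to restart for each group. The following is a sketch under assumptions, not the answerer's code: the columns make, model, year and number are hypothetical names, and largest starts as missing rather than 0 so that negative values are handled too.

```sas
/* Sort so that each make/model group is read in year order */
proc sort data=MY_INPUT out=SORTED;
  by make model year;
run;

data MY_SELECTION;
  set SORTED;
  by make model;
  retain largest;
  if first.model then largest = .;              /* restart for every make/model */
  if number > largest then largest = number;    /* missing sorts below every number */
  else if not missing(number) and number < largest then output;
  drop largest;
run;
```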

Compare datasets and plot in sas

I have two variables in 2 separate datasets in sas. Both have a primary key of Customer_Id and another column say LVR . One dataset has old values for the LVR Column. The other one has values from the new calculation for the same column.
I need to show the differences between both on a graph.
I tried to merge them and then tried proc gplot to plot the two LVRs.
Merged dataset looks something like this :
Cust_id LVR_new LVR_old
111 1 2
222 2 .
333 5 4
The dataset containing LVR_new is almost twice the size (in number of rows) of the one containing LVR_old. We got more customers qualifying after the new calculations.
The merged dataset has 3046778 observations and 3 variables.
I tried to use proc gplot using the code below:
proc plot data=djia;
plot LVR_old*LVR_new = Cust_id;
run;
This has been running for a long time, so I don't expect the results to be very useful.
Can anyone please suggest how I can achieve this? I need to show the differences between the two datasets on a graph, to illustrate the shift in the results.
Thanks!
Why not use PROC TTEST? There are some ODS GRAPHICS plots that PROC TTEST makes.
Your problem looks exactly like the paired comparisons example in the documentation.
http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_ttest_examples03.htm
proc ttest;
paired LVR_old*LVR_new;
run;
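Another way to show the shift, with one graph element per bin rather than per customer, is to plot the distribution of the differences. This is only a sketch: merged is a hypothetical name for the merged dataset, and customers present in only one dataset are left out.

```sas
/* Keep customers that have both values and compute the change */
data lvr_diff;
  set merged;                                   /* hypothetical dataset name */
  if not missing(LVR_old) and not missing(LVR_new);
  diff = LVR_new - LVR_old;
run;

/* A histogram summarizes millions of rows in a handful of bars */
proc sgplot data=lvr_diff;
  histogram diff;
  xaxis label="LVR_new - LVR_old";
run;
```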

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
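proc sort data=yourdata;
by site descending date;
run;
data yourdata_want;
set yourdata;
by site descending date;
/* Missing dates sort last within each site, so the first row per site has a date if any row does */
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;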
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it BY SITE DESCENDING DATE, then BY SITE with NODUPKEY. That way the latest date is up top. You could also format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).
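The format-based variant described above could look roughly like this. It is a sketch that assumes date is a numeric SAS date; the format name compfmt and the dataset names sorted and one_per_site are hypothetical. It relies on PROC SORT keeping the first row of each BY group under NODUPKEY with its default EQUALS behavior.

```sas
/* Map missing dates to one label and every other date to another */
proc format;
  value compfmt
    .     = 'Last Month'
    other = 'Complete';
run;

/* Dated rows first within each site, then keep one row per site */
proc sort data=yourdata out=sorted;
  by site descending date;
run;
proc sort data=sorted out=one_per_site nodupkey;
  by site;
run;

/* PROC FREQ groups by formatted value; MISSING keeps the missing dates in the table */
proc freq data=one_per_site;
  tables date / missing;
  format date compfmt.;
run;
```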

SAS: backward looking data step to compute the average

Sorry for the "not really informative" title of this post.
I have the following data set in SAS:
time Add time_delete
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
Here, time is when a new price (Add) was added in an auction, recorded at 3-minute intervals. A price can be deleted within the same time interval or later, as shown in time_delete. My objective is to compute, at every time, the average of the Add prices still standing. For instance, my average price at time=5 is (3.15+3.11)/2, since the 3.00 gets deleted within the interval. Then the average price standing at time=8 is (4.20+3.15+3.11)/3. As you can see, I have to look back from the current time and see which prices are still valid at time=8. Also, I would like a field giving, for every time, the highest available price that was not deleted.
Any help?
You have a variant of a rolling sum here. There's no one straightforward solution (especially as you undoubtedly have a few complications not mentioned); but here are a few pointers.
First, you may want to change the format of your data. This is actually a relatively easy problem to solve if each price has one row for every timepoint at which it is still standing, rather than just a single row.
data have;
input time Add time_delete;
datalines;
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
;;;;
run;
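data want;
set have;
/* A price deleted within its own interval is never standing, so it is dropped. */
/* Otherwise one row is written per time value from time up to time_delete-1. */
/* Rows with a missing time_delete write nothing, because the loop bound is missing. */
if time=time_delete then delete;
else do time=time to time_delete-1;
output;
end;
keep time add;
run;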
proc means data=want mean max n;
class time;
var add;
run;
You could output the proc means to a dataset and have your maximum value plus the average value, and then either put that back on the main dataset or whatever you need.
The main downside to this is that it's a much larger dataset, so if you're looking at hundreds of thousands of data points, this is likely not your best option.
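Writing the PROC MEANS results to a dataset, as mentioned above, might look like the following sketch. The output dataset name stats and the variable names avg_add and max_add are hypothetical.

```sas
/* One row per time value with the average and highest standing price */
proc means data=want noprint nway;
  class time;
  var add;
  output out=stats(drop=_type_ _freq_) mean=avg_add max=max_add;
run;
```

The stats dataset can then be merged back onto the main dataset by time if the values are needed on every row.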
You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.
proc sql;
select H.time, mean(V.add) as mean_add, max(V.add) as max_add
from (select distinct time from have) H
left join have V
on V.time le H.time
and V.time_delete gt H.time
group by H.time;
quit;
Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.
Other options:
Read the data into an array, with a second array tracking the delete points. This can get a bit complex, as you probably need to sort your array by delete point, so rather than just adding a new record at the end, you need to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a C-type language would be.
Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays.
Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available.