Please find the dataset below:
ID amt order_type no_of_order
1 200 em 6
1 300 on 5
2 600 em 10
Output desired:
ID amt order_type no_of_order
1 500 on 11
2 600 em 10
Based on the highest amount, I need to pick the order_type.
How can this be achieved in SAS code?
It sounds like you want the sum of the two numeric variables for each value of ID, plus one value of ORDER_TYPE per ID. You appear to want the value of ORDER_TYPE from the observation with the largest AMT. Here is a simple way using PROC SUMMARY.
data have;
input ID amt order_type $ no_of_order;
cards;
1 200 em 6
1 300 on 5
2 600 em 10
;
proc summary data=have ;
by id;
var amt no_of_order;
output out=want sum= idgroup(max(amt) out[1] (order_type)=);
run;
Results:
Obs    ID    _TYPE_    _FREQ_    amt    no_of_order    order_type
 1      1       0         2      500         11            on
 2      2       0         1      600         10            em
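For comparison, the same result can be sketched with a sort and a DATA step. This is only an illustration, not the PROC SUMMARY method above. The dataset names sorted and want_ds and the variables total_amt and total_orders are made up for the example. If two rows of an ID tie on AMT, the ORDER_TYPE kept is whichever of them PROC SORT leaves last.

```sas
/* Sort so that, within each ID, the row with the largest AMT comes last */
proc sort data=have out=sorted;
  by id amt;
run;

data want_ds;
  set sorted;
  by id;
  /* reset the running totals at the start of each ID */
  if first.id then do;
    total_amt = 0;
    total_orders = 0;
  end;
  total_amt + amt;             /* sum statements retain their values across rows */
  total_orders + no_of_order;
  /* keep only the last row per ID; its ORDER_TYPE belongs to the largest AMT */
  if last.id;
  keep id total_amt total_orders order_type;
run;
```

The PROC SUMMARY version is shorter and needs no extra sort when the data are already ordered by ID; the DATA step version mainly makes the selection logic explicit.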
I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg;
by ID;
model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. It keeps only the observations for which the lag you specify is non-missing.
%macro lag(data=, out=, window=);
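data _want_;
set &data.;
by ID;
/* count the observations within each ID */
if first.ID then _n = 0;
_n + 1;
LAG_EARN = lag&window.(earn);
/* the first &window. rows of an ID would otherwise pick up values from the previous ID */
if _n <= &window. then call missing(lag_earn);
if(NOT missing(lag_earn));
drop _n;
run;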
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);
This is my code:
DATA sales;
INFILE 'D:\Users\...\Desktop\Onions.dat';
INPUT VisitingTeam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28
OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;
PROC PRINT DATA = sales;
TITLE 'SAS Data Set Sales';
RUN;
This is the data, but the spacing may be incorrect.
Columbia Peaches 35 67 1 10 2 1
Plains Peanuts 210 . 2 5 0 2
Gilroy Garlics 151035 12 11 7 6
Sacramento Tomatoes 124 85 15 4 9 1
;
I need to add or delete a blank column at the 19th column. Can someone help?
Just open the dataset and then look at what the variable name is. Then do:
Data Want (drop=variable_name_you_are_dropping); /*This is your output dataset*/
Set have; /*this is your dataset you have*/
Run;
Sorry, I'm new to a lot of the features of SAS. I've only been using it for a couple of months, mostly for survey data analysis, but now I'm working with a dataset that has individual-level data for a cross-over study. It's in the form: ID treatment period measure1 measure2 ....
What I want to do is be able to group these individuals by their treatment group and then output a variable with a group average for measure 1 and measure 2 and another variable with the count of observations in each group.
ie
ID trt per m1 m2
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
what I want is the data in the form:
group a = where trt=1 & per=1
group b = where trt=2 & per=2
group c = where trt=2 & per=1
group d = where trt=1 & per=2
trtgrp avg_m1 avg_m2 n
A 102 76 2
B ... ... ...
C
D
Thank you for the help.
/*Creating Sample dataset*/
data test;
infile datalines dlm=" ";
input ID : 8.
trt : 8.
per : 8.
m1 : 8.
m2 : 8.;
put ID=;
datalines;
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
;
run;
/*Using proc summary to summarize trt and per*/
Variables(dimensions) on which you want to summarize would go into class
Variables(measures) for which you want to have average would go into var
Since you want averages, specify mean as the desired statistic.
Read more about proc summary here
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473735.htm
and here
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/m41/m41_19.htm
proc summary data=test nway;
class trt per;
var m1 m2;
output out=final(drop= _type_)
mean=;
run;
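If you want the output columns named as in the question, one possible variation is to name the means explicitly and rename the automatic _FREQ_ variable. The names final_named, avg_m1, avg_m2 and n are assumptions taken from the desired output in the question.

```sas
proc summary data=test nway;
  class trt per;
  var m1 m2;
  /* _FREQ_ holds the number of observations in each trt*per group */
  output out=final_named(drop=_type_ rename=(_freq_=n))
         mean=avg_m1 avg_m2;
run;
```

The result has one row per trt/per combination with the columns trt, per, n, avg_m1 and avg_m2.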
The alternative method uses PROC SQL, the advantage being that it makes use of plain-English syntax, so the concept of a group in your question is maintained in the syntax:
PROC SQL;
CREATE TABLE final AS
SELECT
trt,
per,
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY trt, per;
QUIT;
You can even add your own group headings by applying conditional CASE logic as you did in your question:
PROC SQL;
CREATE TABLE final AS
SELECT
CASE
WHEN trt=1 AND per=1 THEN 'A'
WHEN trt=2 AND per=2 THEN 'B'
WHEN trt=2 AND per=1 THEN 'C'
WHEN trt=1 AND per=2 THEN 'D'
END AS group,
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY group;
QUIT;
COUNT(*) simply counts the number of rows found within the group. The AVG function calculates the average for the given column.
In each example, you can replace the explicitly named columns in the GROUP BY clause with a number representing column position in the SELECT clause.
GROUP BY 1,2
However, take care with this method, as adding columns to the SELECT clause later can cause problems.
My data set is in this format as mentioned below:
NEWID
Age
H_PERS
Income
OCCU
FAMTYPE
REGION
Metro(Yes/No)
Exp_alcohol
population sample-(This is the weighted population each new id represents) etc.
I would like to generate a summarized view like below:
average expenditure value (This should be sum of (exp_alcohol/population sample))
% of population sample across Region Metro and each demographic variable
Please help me with your ideas.
Since I can't see your data set and your description was not very clear, I'm going to guess that you have data that looks something like this and that you would like to add some new variables that summarize your data...
data alcohol;
input NEWID Age H_PERS Income OCCU $ FAMTYPE $ REGION $ Metro $
Exp_alcohol population_sample;
datalines;
1234 32 4 65000 abc m CA Yes 2 4
5678 23 5 35000 xyz s WA Yes 3 6
9923 34 3 49000 def d OR No 3 9
8844 26 4 54000 gdp m CA No 1 5
;
run;
data summar;
set alcohol;
retain TotalAvg_expend metro_count total_pop;
Divide = exp_alcohol/population_sample;
TotalAvg_expend + Divide;
total_pop + population_sample;
if metro = 'Yes' then metro_count + population_sample;
percent_metro = (metro_count/total_pop)*100;
drop NEWID Age H_PERS Income OCCU FAMTYPE REGION Divide;
run;
Output:
Metro    Exp_alcohol    population_sample    TotalAvg_expend    metro_count    total_pop    percent_metro
Yes           2                 4                0.50000              4             4          100.000
Yes           3                 6                1.00000             10            10          100.000
No            3                 9                1.33333             10            19           52.632
No            1                 5                1.53333             10            24           41.667
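For the percentage part, one option that avoids running totals is PROC FREQ with a WEIGHT statement, which treats population_sample as the number of people each row represents. This is a sketch under the same guessed data layout as above; the output dataset name pct_region_metro is made up for the example.

```sas
proc freq data=alcohol;
  /* each row counts population_sample times instead of once */
  weight population_sample;
  /* one output row per REGION*METRO combination */
  tables region*metro / list out=pct_region_metro;
run;
```

The output dataset contains the weighted COUNT and the PERCENT of the total weighted population for each combination. The same pattern should carry over to the other demographic variables by changing the TABLES statement.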
I have a big panel time series data set. I wish to run this basic SAS regression code:
proc sort data=dataset;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=dataset;
by time_id;
model y=x1 x2 x3....x15;
quit;
run;
I get this error when I run the code:
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=1
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=2....
Why? My time_id variable exists... is it because I have too many time_id variables? If I select firm_id it works but I want time_id.
Here's a sample of my data (panel time series):
y x firm_id time_id
3.4 100 1 1
2.3 200 1 2
6.5 653 1 3
3 50 2 1
4.34 23 2 2
4.8 55 2 3
1.311 400 3 1
1.23 200 3 2
5.63 50 3 3
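You'll get this error message if every observation in a BY group has a missing value in at least one model variable, for example if all values of a particular x variable are missing for a given time_id. PROC REG drops any observation with a missing value in the model, so such a BY group is left with no usable observations. Take a look at the example below, where all values of x2 are missing for time_id 1. When you run the code, the Results Output window details the problem (the number of observations with missing values is the same as the number of observations).
It works for firm_id because each firm_id group contains observations from several time periods, so not all values of a particular x variable are missing within any firm_id.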
data have;
input y x1 x2 firm_id time_id;
cards;
3.4 100 . 1 1
2.3 200 200 1 2
6.5 653 653 1 3
3 50 . 2 1
4.34 23 23 2 2
4.8 55 55 2 3
1.311 400 . 3 1
1.23 200 200 3 2
5.63 50 50 3 3
;
run;
proc sort data=have;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=have;
by time_id;
model y=x1-x2;
run;
quit;
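To see which BY groups are affected before running the regression, a rough check is to count the complete observations per time_id. This is a sketch against the example data above; the names check, usable, complete and n_complete are made up, and for your own data the variable list y x1-x2 would have to be replaced with your model variables.

```sas
data check;
  set have;
  /* 1 when y, x1 and x2 are all non-missing, 0 otherwise */
  complete = (nmiss(of y x1-x2) = 0);
run;

proc means data=check noprint nway;
  class time_id;
  var complete;
  /* n_complete = number of observations PROC REG could actually use */
  output out=usable sum=n_complete;
run;

proc print data=usable;
  var time_id _freq_ n_complete;
run;
```

Any time_id with n_complete equal to 0 will trigger the "No valid observations are found" error.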