How to transpose EG5.1 - sas

I have a data set of approximately this format:
ID 2012 2013 2014
A 1 3 .
B 2 . 4
And I want to transpose it to this format:
ID Source Value
A 2012 1
A 2013 3
B 2012 2
B 2014 4
I'd like to do this using the Transpose task. I'm working in EG 5.1 and I've got a massive mental block on how to do this. Most of the guides cover the opposite direction. Thanks so much in advance for any advice.

Use proc transpose instead. Create a new SAS program and run the following code:
proc transpose data = have
               out = want(rename = (COL1 = Value)
                          where = (not missing(Value)))
               name = Source;
    by id;
    var _NUMERIC_;
run;
Output:
ID Source Value
A 2012 1
A 2013 3
B 2012 2
B 2014 4
In Enterprise Guide, this is the Stack Columns task.

SAS sum by group and then create new variable for each group

I want to do summation for each group and create a new variable for the sum for each group. I tried proc sql, but it only created a new variable.
My dataset looks like:
data have;
input firm year product$ value;
datalines;
1 2012 a 5
1 2012 a 6
1 2012 b 3
1 2013 a 4
1 2013 a 3
1 2013 b 4
1 2013 b 3
2 2012 a 5
2 2012 a 6
2 2012 b 3
2 2012 b 4
2 2012 b 2
2 2013 a 4
2 2013 a 5
2 2013 b 3
2 2013 b 3
;
run;
What I want is a table with four columns: firm, year, productA_sum, productB_sum.
I tried it this way:
proc sql;
create table h.want as
select a.*, sum(a.value) as sumvalue
from h.have as a
group by firm, year, product;
quit;
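But it only creates a new column.
Because you group by three variables but select every column with a.*, the GROUP BY cannot collapse the rows: SAS remerges the sum back onto each original row. Select only the grouping variables and the aggregate instead.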
/*Try this one*/
proc sql;
create table h.want as
select a.firm, a.year, a.product, sum(a.value) as sumvalue
from h.have as a
group by firm, year, product;
quit;
To get separate SUM() results based on another variable's value, use a CASE expression inside the SUM() rather than adding that variable to the GROUP BY list.
proc sql;
create table want as
select firm, year
, sum(case when (product='a') then value else . end) as sum_product_A
, sum(case when (product='b') then value else . end) as sum_product_B
from have
group by firm,year
;
quit;
If you want the sum to be zero instead of missing if the product never appears then replace the missing values in the else clauses with 0 instead.
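A minimal sketch of that zero-filled variant, assuming the same have data set as in the question. The output name want_zero is made up for illustration.

```sas
proc sql;
    create table want_zero as
    select firm, year
         /* else 0 instead of else . so an absent product sums to 0 */
         , sum(case when product='a' then value else 0 end) as sum_product_A
         , sum(case when product='b' then value else 0 end) as sum_product_B
    from have
    group by firm, year
    ;
quit;
```

With the sample data both products appear in every firm and year, so the two versions should give the same table; they only differ when a product is absent from a group.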
You are pivoting an aggregate sum. A two step approach could be more desirable if there are more than two product values to contend with.
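/* Step 1: one row per firm/year/product holding the sum of value */
proc summary data=have nway noprint;
    class firm year product;
    var value;
    output out=class_sums sum=sum;
run;
/* Step 2: pivot the products into columns named a_sum, b_sum, ... */
proc transpose data=class_sums suffix=_sum out=want(drop=_name_);
    by firm year;
    id product;
    var sum;
run;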

Linear Interpolation on missing values at the end of the period

Here is a dataset example :
data data;
input group $ date value;
datalines;
A 2001 1.5
A 2002 2.6
A 2003 2.8
A 2004 2.9
A 2005 .
B 2001 0.1
B 2002 0.6
B 2003 0.7
B 2004 1.4
B 2005 .
C 2001 4.7
C 2002 4.6
C 2003 4.8
C 2004 5.0
C 2005 .
;
run;
I want to replace the missing values of the variable "value" for each group using linear interpolation.
I tried using proc expand :
proc expand data=data method = join out=want;
by group;
id date;
convert value;
run;
But it's not replacing any value in the output database.
Any idea what I'm doing wrong please?
Here are three ways to do it. Your missing data is at the end of the series. You are effectively doing a forecast with a few points. proc expand isn't good for that, but for the purposes of filling in missing values, these are some of the options available.
1. PROC EXPAND
You were close! Your missing data is at the end of the series, which means it has no values to join between. You need to use the extrapolate option in this case. If you have missing values between two data points then you do not need to use extrapolate.
proc expand data=data method = join
out=want
extrapolate;
by group;
id date;
convert value;
run;
2. PROC ESM
You can do interpolation with exponential smoothing models. I like this method since it can account for things like seasonality, trend, etc.
/* Convert Date to SAS date */
data to_sas_date;
set data;
year = mdy(1,1,date);
format year year4.;
run;
proc esm data=to_sas_date
out=want
lead=0;
by group;
id year interval=year;
forecast value / replacemissing;
run;
3. PROC TIMESERIES
This will fill in values using mean/median/first/last/etc. for a timeframe. First convert the year to a SAS date as shown above.
proc timeseries data=to_sas_date
out=want;
by group;
id year interval=year;
var value / setmissing=average;
run;
I don't know much about the expand procedure, but you can add extrapolate to the proc expand statement.
proc expand data=data method = join out=want extrapolate;
by group;
id date;
convert value;
run;
Results in:
Obs group date value
1 A 2001 1.5
2 A 2002 2.6
3 A 2003 2.8
4 A 2004 2.9
5 A 2005 3.0
6 B 2001 0.1
7 B 2002 0.6
8 B 2003 0.7
9 B 2004 1.4
10 B 2005 2.1
11 C 2001 4.7
12 C 2002 4.6
13 C 2003 4.8
14 C 2004 5.0
15 C 2005 5.2
Please take note of this statement from the documentation:
"By default, PROC EXPAND avoids extrapolating values beyond the first or last input value for a series and only interpolates values within the range of the nonmissing input values. Note that the extrapolated values are often not very accurate and for the SPLINE method the EXTRAPOLATE option results may be very unreasonable. The EXTRAPOLATE option is rarely used."
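For intuition, the extrapolated values in the results above are what you get by extending the straight line through the last two known points (for group A, 2.9 + (2.9 - 2.8)). The DATA step below is a rough sketch of that arithmetic, not a replacement for PROC EXPAND. It assumes the data is sorted by group and date, that dates are evenly spaced, and that the gap sits at the end of each series; want_manual, prev1 and prev2 are illustrative names.

```sas
data want_manual;
    set data;
    by group;
    retain prev1 prev2;                     /* last and second-to-last values seen */
    if first.group then call missing(prev1, prev2);
    /* extend the line through the last two points by one step */
    if missing(value) and not missing(prev1) and not missing(prev2) then
        value = prev1 + (prev1 - prev2);
    prev2 = prev1;
    prev1 = value;
    drop prev1 prev2;
run;
```

Up to floating-point rounding, this should give the same 2005 values as the EXTRAPOLATE run on the sample data.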

Tracking ID in SAS

I have a SAS question. I have a dataset containing ID and year. I want to create the dummy variables "2011" and "2012", which should take the value 1 if the ID has an observation in the given year and 0 otherwise. E.g. ID 2 should have 2011=1 and 2012=0, since that ID only has an observation for 2011.
ID Year 2011 2012
1 2011 1 1
1 2012 1 1
2 2011 1 0
3 2012 0 1
Can anyone help? Thanks!
For one thing, 2011 or 2012 are not valid names for SAS variables. SAS variables must start with a letter or an underscore (e.g., _2011).
If you really need to, you can get around that limitation by setting the system option validvarname=any and surrounding your 'invalid' variable names with single quotes and appending an n.
This would do what you want:
data have;
infile datalines;
input ID year;
datalines;
1 2011
1 2012
2 2011
3 2012
;
run;
options validvarname=ANY;
proc sql;
create table want as
select ID
,year
,exists(select * from have b where year=2011 and a.id=b.id) as '2011'n
,exists(select * from have b where year=2012 and a.id=b.id) as '2012'n
from have a
;
quit;
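If ordinary variable names are acceptable, the same flags can be built without validvarname=any. The sketch below assumes the have data set from above; y2011, y2012 and want_valid are names chosen for illustration. It relies on SAS evaluating a comparison such as year=2011 to 1 or 0, and on PROC SQL remerging the group maximum back onto each row.

```sas
proc sql;
    create table want_valid as
    select ID
         , year
         , max(year = 2011) as y2011   /* 1 if any row for this ID is 2011 */
         , max(year = 2012) as y2012   /* 1 if any row for this ID is 2012 */
    from have
    group by ID
    ;
quit;
```

The log should show a note about remerging summary statistics; that is expected here.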

Defining a new field conditionally using put function with user-defined formats

I am trying to define a new value for an observation with a user-defined format. However, my if/then/else statement seems to work only for observations with a year value of "2014". The put statements are not working for other values. In the SAS editor, the put function is blue in the first statement and black in the other two.
Does anyone know what I am missing here? Here is my complete code:
data claims_t03_group;
set output.claims_t02_group;
if year = "2014" then test = put(compress(lookup,"_"),$G_14_PROD35.);
else if year = "2015" then test = put(compress(lookup,"_"),$G_15_PROD35.);
else test = put(compress(lookup,"_"),$G_16_PROD35.);
run;
The process seems to "work" for 2014: when the Year value is 2014, the format lookup works correctly and the test field returns the value I am expecting. However, for years 2015 and 2016, the test field returns the lookup value without any formatting.
Your code uses the user-defined formats $G_14_PROD. through $G_16_PROD. My guess is that there is a problem with one or more of these, but unless you can provide the format definitions it will be difficult to assist you further.
Try running the following and sharing the resulting output dataset work.prdfmts:
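/* Find the catalogue that holds the format */
proc sql noprint;
    select cats(libname,'.',memname) into :myfmtlib
    from sashelp.vcatalg
    where objname = 'G_14_PROD';
quit;
/* Export the definitions; character formats need the $ prefix in SELECT */
proc format cntlout = prdfmts library=&myfmtlib;
    select $G_14_PROD $G_15_PROD $G_16_PROD;
run;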
N.B. this assumes that you only have one catalogue containing a format with that name, and that the format definitions for all 3 formats are contained in the same catalogue. If not, you will need to adapt this a bit and run it once for each format to find and export the definition.
Not that it solves your actual problem, but you could eliminate the IF/THEN by using the PUTC() function instead.
data have ;
do year=2014,2015,2016;
do lookup='00_01','00_02' ;
output;
end;
end;
run;
proc format ;
value $G_14_PROD '0001'='2014 - 1' '0002'='2014 - 2' ;
value $G_15_PROD '0001'='2015 - 1' '0002'='2015 - 2' ;
value $G_16_PROD '0001'='2016 - 1' '0002'='2016 - 2' ;
run;
data want ;
set have ;
length test $35 ;
if 2014 <= year <= 2016 then
test = putc(compress(lookup,'_'),cats('$G_',year-2000,'_PROD.'))
;
run;
Result
Obs year lookup test
1 2014 00_01 2014 - 1
2 2014 00_02 2014 - 2
3 2015 00_01 2015 - 1
4 2015 00_02 2015 - 2
5 2016 00_01 2016 - 1
6 2016 00_02 2016 - 2

sas coding: choosing max variable

I have two tables and need to create a third table from them:
first_table:
id term
3 2014
3 2015
4 2014
4 2015
5 2013
5 2014
6 2013
6 2014
second_table:
id term majr_code
3 2010 ACT
3 2010 ACT
3 2011 GNST
3 2015 BUSA
3 2015 BUSA
4 2009 TIM
4 2010 BAL
4 2014 TAR
5 2011 SAR
5 2013 COR
6 2010 PAT
6 2013 TOR
These are the two tables I have. I need to create another table that is the same as the first table, with one more column, majr_code.
Desired result:
id term majr_code
3 2014 GNST
3 2015 BUSA
4 2014 TAR
4 2015 TAR
5 2013 COR
5 2014 COR
6 2013 TOR
6 2014 TOR
What I need to do is this: for the same id, if the second table has the same term as the first table, I keep that term's majr_code. Otherwise I use the majr_code of the most recent earlier term. For example, if the first table has 2014 and the second table has 2011 and 2015, I need to use 2011's majr_code for the 2014 term. Likewise, if the first table has the 2013 and 2014 terms for the same id and the second table's highest term is 2013, I keep 2013's majr_code for both 2013 and 2014.
I know it's complicated; it should be clearer if you check the tables and the result. If it is still too complicated, I can delete the question. This is the best I can explain it. Thanks!
I think the below code should do the trick. It works as follows:
1) Reads in the sample datasets.
2) Creates a table named second_table_nogaps, which is second_table with no yearly gaps up through 2015. For each ID in the second table, it checks whether a given yearly record exists. If so, the record is output; if not, a new record is created with the prior year's majr_code. If the last record for a given id is not 2015, new records are generated up through 2015 (for example, a new record is created for id=4, term=2015, majr_code=TAR).
3) Merges the unique values of id+term+majr_code onto first_table. The resulting table First_table_2 should be what you're looking for! However, BE CAREFUL: if there are multiple majr_codes for the same id+term, this step will result in duplication.
Hope this helps! The code in step 2 could probably be simplified, as my handling of the first and last record is not particularly efficient.
data first_table;
infile datalines ;
input id term;
datalines ;
3 2014
3 2015
4 2014
4 2015
5 2013
5 2014
6 2013
6 2014
;
run;
data second_table;
infile datalines ;
input id term majr_code $;
datalines ;
3 2010 ACT
3 2010 ACT
3 2011 GNST
3 2015 BUSA
3 2015 BUSA
4 2009 TIM
4 2010 BAL
4 2014 TAR
5 2011 SAR
5 2013 COR
6 2010 PAT
6 2013 TOR
;
run;
proc sort data=second_table ; by id term; run;
data second_table_nogaps (keep=id_nogaps term_nogaps majr_code_nogaps );
set second_table end=eof;
retain id_nogaps term_nogaps majr_code_nogaps ;
*first set up the first row... establishes retained variables and outputs;
if _N_ = 1 then do;
id_nogaps = id ;
term_nogaps = term;
majr_code_nogaps = majr_code;
output;
end;
*for all but the first and last row;
else if not eof then do;
do while ( (term_nogaps + 1 < term ) /*this is to fill in gaps between years. (e.g. major code in 2011 and major code in 2014 within the same id*/
or
((id_nogaps ne id) and term_nogaps < 2015) /*this is to fill major code for all terms up through 2015 (e.g. last major code for id 4 is in 2014)*/
);
term_nogaps = term_nogaps + 1;
output;
end;
id_nogaps=id;
term_nogaps = term;
majr_code_nogaps=majr_code;
output;
end;
else do;
do while (term_nogaps + 1 < term );
term_nogaps = term_nogaps + 1;
output;
end;
id_nogaps=id;
term_nogaps = term;
majr_code_nogaps=majr_code;
output;
do while ( term_nogaps < 2015 );
term_nogaps = term_nogaps + 1;
output;
end;
end;
run;
proc sql;
create table First_table_2 as
Select a.* , b.majr_code_nogaps as majr_code
from first_table a
left join
(select distinct id_nogaps, term_nogaps, majr_code_nogaps from second_table_nogaps) b /*select distinct values to prevent duplication*/
on a.id = b.id_nogaps and a.term = b.term_nogaps;
quit;
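There are a few approaches to this, but SQL is probably the easiest. You don't provide code, so I'll just include a pointer: join the two tables on id, keeping second-table terms no later than the first-table term, group by id and the first-table term, and then use a HAVING clause to keep only the rows where term=max(term).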
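A sketch of how that pointer might look in PROC SQL, assuming the first_table and second_table data sets defined in the answer above. The output name first_table_having is made up, and this is one reading of the HAVING suggestion rather than the answerer's own code.

```sas
proc sql;
    create table first_table_having as
    select distinct a.id, a.term, b.majr_code   /* distinct drops the duplicated second_table rows */
    from first_table a
    left join second_table b
        on a.id = b.id and b.term <= a.term     /* only terms up to the first-table term */
    group by a.id, a.term
    having b.term = max(b.term)                 /* keep the most recent of those terms */
    ;
quit;
```

On the sample data this should reproduce the desired table (for example GNST for id 3 in 2014 and TAR for id 4 in 2015). Two different majr_code values for the same id and term in second_table would still produce duplicate rows.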