transposing in SAS software - sas

I'm having trouble transposing data in SAS. I suspect it is because the column headings are years (2013, 2014, etc.), but I have no idea how to resolve it.
For instance, with this sample data set:
CompanyID 2013 2014 2015 2016 2017 2018 2019 2020 2021
1 5.3 3.4 6.4 7.8 5.4 9.8 2.4 4.2 4.2
2 2.3 ... ... ... ... ... ... ... ...
proc transpose data=sampledata out=long1;
var 2013-2021;
by CompanyID;
run;
So SAS does not seem to recognise '2013-2021' as a variable list. What would you recommend? Thanks!

The only way you could have variables with names like 2013 would be if you had accidentally set the VALIDVARNAME option to ANY.
In that case you need to use NAME LITERALS when referencing names that do not follow the normal rules for names (only contain digits, letters or underscores and do not start with a digit).
proc transpose data=sampledata out=long1;
by CompanyID;
var '2013'n-'2021'n;
run;
If you had imported the data with the VALIDVARNAME option set to V7 then those strings would have been converted into names like _2013, _2014 etc. In which case use
var _2013 - _2021 ;

If the data was imported from Excel, the variable labels will be the original column headers, but the variable names will be the header values transformed to valid SAS names, most likely a leading underscore was added.
The following code would work based on the above presumptions.
proc transpose data=sampledata out=long1;
var _2013 - _2021;
by CompanyID;
run;
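To find out which of these cases applies, you can list the actual variable names next to their labels. This is only a sketch: it assumes the data set is WORK.SAMPLEDATA, as in the question.

```sas
/* List the real variable names and their labels.              */
/* Assumes the data set lives in WORK and is named SAMPLEDATA. */
proc sql;
  select name, label
  from dictionary.columns
  where libname = 'WORK' and memname = 'SAMPLEDATA';
quit;
```

If the NAME column shows _2013, _2014, ... use those names; if it shows 2013, 2014, ... you need name literals.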

SAS has certain rules about variable names, but there is support for variables that break those rules. SAS handles those through a concept called a name literal. Name literals take the form 'variable'n. In your case, you just need to specify name literals.
proc transpose data=sampledata out=long1;
var '2013'n-'2021'n;
by CompanyID;
run;
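Here is a small self-contained sketch of the name literal case. The data values for company 2 and the shortened list of years are made up for illustration, and the variables are listed one by one rather than as a range to keep the example simple.

```sas
/* Allow variable names that break the usual naming rules */
options validvarname=any;

/* Toy data: only three year columns, values partly invented */
data sampledata;
  input CompanyID '2013'n '2014'n '2015'n;
  datalines;
1 5.3 3.4 6.4
2 2.3 1.1 0.9
;
run;

/* One output row per company and year;                        */
/* NAME= names the column that holds the original variable name */
proc transpose data=sampledata
               out=long1(rename=(col1=value))
               name=year;
  by CompanyID;
  var '2013'n '2014'n '2015'n;
run;
```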

Related

Changing the first row name conditionally on character interval in SAS

Consider the following data:
data GDP;
input Year $ Agriculture Industry;
datalines;
2016 195 1634
2017 220 1986
;
When exporting as a .dat file:
proc export
data = GDP
outfile = '....\GDP.dat'
dbms = TAB
replace;
run;
Then I get the following file:
However, I want the following file:
Where:
Mydata is a text I manually add.
The numbers after each name (for instance Year: 1-4) are the character positions that the values occupy. For instance, the values in the Year column run from character 1 to 4, the values in the Agriculture column run from 9 to 11, and so on.
So SAS should count the interval for the values and add it to the first row name. How to do it in SAS?
You can fudge this with labels to your variables and then add the LABEL option to PROC EXPORT.
data GDP;
input Year $ Agriculture Industry;
label Year = "Mydata, Year:1-4" Agriculture = "Agriculture:9-11";
datalines;
2016 195 1634
2017 220 1986
;
run;
proc export
data = GDP
outfile = '....\GDP.dat'
dbms = TAB
LABEL
replace;
run;
FYI - it looks like you're trying to create a fixed width file and put the specifications in the header. I'd advise against this; either put the specifications in a separate file or include them at the top of the file instead.
Putting it in the header makes it harder for any other system to process correctly.
If you really need this for some reason, you may also want to consider using a data step to create your export instead of using PROC EXPORT.
AFAIK there is no easy way to define the specifications automatically though you could push the PROC CONTENTS output to a separate data set.
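A rough sketch of the data step route, in case it helps. The output file name is a placeholder, the Industry positions (17-20) are my assumption since the question only gives Year and Agriculture, and the header layout is a guess at what you want.

```sas
data GDP;
  input Year $ Agriculture Industry;
  datalines;
2016 195 1634
2017 220 1986
;
run;

data _null_;
  set GDP;
  file 'GDP_fixed.dat';   /* placeholder path, adjust as needed */
  /* Write the header once, before the first data line */
  if _n_ = 1 then
    put 'Mydata, Year:1-4 Agriculture:9-11 Industry:17-20';
  /* Column pointers (@n) place each value at a fixed position */
  put @1 Year $4. @9 Agriculture 3. @17 Industry 4.;
run;
```

The positions in the header are typed by hand here, so they must be kept in sync with the column pointers in the second PUT statement.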

Linear Interpolation on missing values at the end of the period

Here is a dataset example :
data data;
input group $ date value;
datalines;
A 2001 1.5
A 2002 2.6
A 2003 2.8
A 2004 2.9
A 2005 .
B 2001 0.1
B 2002 0.6
B 2003 0.7
B 2004 1.4
B 2005 .
C 2001 4.7
C 2002 4.6
C 2003 4.8
C 2004 5.0
C 2005 .
;
run;
I want to replace the missing values of the variable "value" for each group using linear interpolation.
I tried using proc expand :
proc expand data=data method = join out=want;
by group;
id date;
convert value;
run;
But it's not replacing any value in the output database.
Any idea what I'm doing wrong please?
Here are three ways to do it. Your missing data is at the end of the series. You are effectively doing a forecast with a few points. proc expand isn't good for that, but for the purposes of filling in missing values, these are some of the options available.
1. PROC EXPAND
You were close! Your missing data is at the end of the series, which means it has no values to join between. You need to use the extrapolate option in this case. If you have missing values between two data points then you do not need to use extrapolate.
proc expand data=data method = join
out=want
extrapolate;
by group;
id date;
convert value;
run;
2. PROC ESM
You can do interpolation with exponential smoothing models. I like this method since it can account for things like seasonality, trend, etc.
/* Convert Date to SAS date */
data to_sas_date;
set data;
year = mdy(1,1,date);
format year year4.;
run;
proc esm data=to_sas_date
out=want
lead=0;
by group;
id year interval=year;
forecast value / replacemissing;
run;
3. PROC TIMESERIES
This will fill in values using mean/median/first/last/etc. for a timeframe. First convert the year to a SAS date as shown above.
proc timeseries data=to_sas_date
out=want;
by group;
id year interval=year;
var value / setmissing=average;
run;
I don't know much about the expand procedure, but you can add extrapolate to the proc expand statement.
proc expand data=data method = join out=want extrapolate;
by group;
id date;
convert value;
run;
Results in:
Obs group date value
1 A 2001 1.5
2 A 2002 2.6
3 A 2003 2.8
4 A 2004 2.9
5 A 2005 3.0
6 B 2001 0.1
7 B 2002 0.6
8 B 2003 0.7
9 B 2004 1.4
10 B 2005 2.1
11 C 2001 4.7
12 C 2002 4.6
13 C 2003 4.8
14 C 2004 5.0
15 C 2005 5.2
Please take note of the statement here
"By default, PROC EXPAND avoids extrapolating values beyond the first or last input value for a series and only interpolates values within the range of the nonmissing input values. Note that the extrapolated values are often not very accurate and for the SPLINE method the EXTRAPOLATE option results may be very unreasonable. The EXTRAPOLATE option is rarely used."
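For what it's worth, the extrapolated values in the output above are consistent with simply continuing the slope of the last two known points. A data step sketch of that idea, assuming the data set from the question is sorted by group and date and the dates are evenly spaced (want_manual, prev1 and prev2 are names I made up):

```sas
data want_manual;
  set data;
  by group;
  retain prev1 prev2;          /* last and second-to-last values seen */
  if first.group then call missing(prev1, prev2);
  /* Continue the line through the two most recent points */
  if missing(value) and not missing(prev1) and not missing(prev2) then
    value = prev1 + (prev1 - prev2);
  prev2 = prev1;
  prev1 = value;
  drop prev1 prev2;
run;
```

This only handles missing values that follow at least two known values within a group; it is not a replacement for PROC EXPAND in general.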

special characters in alias Proc sql- SAS 9.3

I need to have special characters (% and a space) in the alias name in a proc sql statement.
proc sql DQUOTE=ANSI;
create table final_data as
select a.column1 as XYZ,
((a.colum2/b.colum2)-1) as "% VS LY"
from table1 a
join table2 b on a.colum3=b.colum3;
quit;
According to the documentation, the option proc sql DQUOTE=ANSI should make this work:
http://support.sas.com/documentation/cdl/en/acreldb/63647/HTML/default/viewer.htm#a001393333.htm
However, I'm getting this error in SAS 9.3
ERROR: The value % VS LY is not a valid SAS name.
What should I do to make this work?
Thank you so much in advance!
Perhaps a simpler solution would be to use standard naming and a SAS label. If the computed value is between 0 and 1 you can also add a SAS format.
((a.colum2/b.colum2)-1) as vs_ly_pct label='% VS LY' format=percent5.2
If you truly want non-standard column names, you will also need to set
options validvarname = any;
before the Proc SQL.
In SQL an alias is what you use to prefix variable references to tell which input table (or subquery) the variable comes from. Like the a and b in your query. What you are talking about is the variable NAME.
SAS variable names normally are restricted to underscore and alphanumeric characters (and cannot start with a number), but variable LABELS can be any string. You can just specify the label after the name.
select a.column1 as XYZ
, ((a.colum2/b.colum2)-1) as var2 '% VS LY'
Or use the SAS specific LABEL= syntax
select a.column1 as XYZ
, ((a.colum2/b.colum2)-1) as var2 label='% VS LY'
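A self-contained sketch of the label approach. The two input tables and their values are invented for illustration, and vs_ly_pct is just a name I picked.

```sas
/* Invented sample tables with the column names from the question */
data table1;
  input column1 $ colum2 colum3;
  datalines;
A 120 1
B 90 2
;
run;

data table2;
  input colum2 colum3;
  datalines;
100 1
100 2
;
run;

proc sql;
  create table final_data as
  select a.column1 as XYZ,
         (a.colum2 / b.colum2) - 1 as vs_ly_pct
           label='% VS LY' format=percent8.2
  from table1 a
  inner join table2 b on a.colum3 = b.colum3;
quit;

/* The LABEL option makes PROC PRINT show labels as column headers */
proc print data=final_data label noobs;
run;
```

The variable keeps a valid SAS name, and the label is only used where output is displayed.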

Defining a new field conditionally using put function with user-defined formats

I am trying to define a new value for an observation with a user defined format. However, my if/then/else statement seems to only work for observations with a year value of "2014". The put statements are not working for other values. In SAS, the put statement is blue in the first statement, and black in the other two. Here is a picture of what I mean:
Does anyone know what I am missing here? Here is my complete code:
data claims_t03_group;
set output.claims_t02_group;
if year = "2014" then test = put(compress(lookup,"_"),$G_14_PROD35.);
else if year = "2015" then test = put(compress(lookup,"_"),$G_15_PROD35.);
else test = put(compress(lookup,"_"),$G_16_PROD35.);
run;
Here is an example of what I mean when I say that the process seems to "work" for 2014:
As you can see, when the Year value is 2014, the format lookup works correctly, and the test field returns the value I am expecting. However, for years 2015 and 2016, the test field returns the lookup value without any formatting.
Your code utilises user-defined formats, $G_14_PROD.-$G_16_PROD.. My guess would be that there is a problem with one or more of these, but unless you can provide the format definitions it will be difficult to assist you further.
Try running the following and sharing the resulting output dataset work.prdfmts:
proc sql noprint;
select cats(libname,'.',memname) into :myfmtlib
from sashelp.vcatalg
where objname = 'G_14_PROD';
quit;
proc format cntlout = prdfmts library=&myfmtlib;
select $G_14_PROD $G_15_PROD $G_16_PROD;
run;
N.B. this assumes that you only have one catalogue containing a format with that name, and that the format definitions for all 3 formats are contained in the same catalogue. If not, you will need to adapt this a bit and run it once for each format to find and export the definition.
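If you just want to eyeball the definitions, something like the following may be enough. It assumes the formats are stored in the WORK.FORMATS catalogue, which may not be true in your session; change the library if they live elsewhere.

```sas
/* Show which format catalogues SAS searches, and in what order */
proc options option=fmtsearch;
run;

/* Print the value ranges and labels of the three formats.       */
/* Assumption: the formats are in WORK.FORMATS.                  */
proc format library=work fmtlib;
  select $G_14_PROD $G_15_PROD $G_16_PROD;
run;
```

A lookup value that does not appear in a format's ranges is returned unformatted, which would match the symptom you describe for 2015 and 2016.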
Not that it solves your actual problem, but you could eliminate the IF/THEN by using the PUTC() function instead.
data have ;
do year=2014,2015,2016;
do lookup='00_01','00_02' ;
output;
end;
end;
run;
proc format ;
value $G_14_PROD '0001'='2014 - 1' '0002'='2014 - 2' ;
value $G_15_PROD '0001'='2015 - 1' '0002'='2015 - 2' ;
value $G_16_PROD '0001'='2016 - 1' '0002'='2016 - 2' ;
run;
data want ;
set have ;
length test $35 ;
if 2014 <= year <= 2016 then
test = putc(compress(lookup,'_'),cats('$G_',year-2000,'_PROD.'))
;
run;
Result
Obs year lookup test
1 2014 00_01 2014 - 1
2 2014 00_02 2014 - 2
3 2015 00_01 2015 - 1
4 2015 00_02 2015 - 2
5 2016 00_01 2016 - 1
6 2016 00_02 2016 - 2

SAS: Loop over dataset, make temp data step for ith row, do some proc w/ temp data, return results to first dataset

So I have a dataset_a that looks like this:
Name Month
Dick Aug
Dick Sep
Dick Oct
Jane Aug
Jane Sep
...
And some other, much larger dataset_b like this:
Name Day X Y
Dick 12-Jul-13 14.8 2.3
Jane 05-Sep-13 12.2 2.0
Dick 02-Aug-13 15.1 3.2
Dick 07-Aug-13 14.5 3.0
Jane 05-Aug-13 12.8 2.5
Dick 08-Aug-13 14.5 3.0
Dick 10-Aug-13 13.5 2.3
Jane 31-Jul-13 13.0 2.2
...
I want to iterate over it, and for each row in dataset_a, do a data step that gets the appropriate records from dataset_b and puts them in a temp dataset--temp, let's call it. Then I need to do a proc reg on temp and stick the results (row-vector-style) back into dataset_a, like so:
Name Month Parameter-est.-for-Y p-value R-squared
Dick Aug Some # Some # Some #
Dick Sep Some # Some # Some #
Dick Oct Some # Some # Some #
Jane Aug Some # Some # Some #
Jane Sep Some # Some # Some #
...
Here's some code/pseudocode to illustrate my need:
for each row in dataset_a
data temp;
set dataset_b; where name=['i'th name] and month(day)=['i'th month];
run;
proc reg /*noprint*/ alpha=0.1 outest=[?] tableout; model X = Y; run;
/*somehow put these regression results back into 'i'th row of dataset_a*/
next
Please post a comment if something doesn't make sense. Thanks very much in advance!
The efficient approach for this is somewhat different than what you are listing. In the particular instance you show, the most efficient approach would be to use a format to group the Day values into Months, and run your regression by name day, assuming regression respects formats (if not, then create a new variable month and assign that using the format).
For example:
data for_reg/view=for_reg;
set dataset_b;
month=put(day,MONNAME3.);
run;
Or
proc datasets lib=work;
modify dataset_b;
format day MONNAME3.;
quit;
Then
proc reg data=for_reg;
by name month; *or if using the other one, by name day;
**other proc reg statements**;
run;
Then merge that output dataset with dataset_a if needed. It will run the proc reg as if you'd run it once for each name/month combination, but all in one call and one pass through the data.
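To make that more concrete, here is a sketch of the BY-group version using the model from your pseudocode. It assumes Day in dataset_b is a real SAS date; for_reg2 and est are names I made up.

```sas
/* Derive a month variable to group by */
data for_reg2;
  set dataset_b;
  length month $3;
  month = put(day, monname3.);
run;

/* BY-group processing needs the data sorted by the BY variables */
proc sort data=for_reg2;
  by name month;
run;

/* One regression per name/month, results collected in EST.      */
/* TABLEOUT adds rows such as PVALUE, EDF adds R-square (_RSQ_). */
proc reg data=for_reg2 outest=est tableout edf alpha=0.1 noprint;
  by name month;
  model X = Y;
run;
quit;
```

EST has several rows per BY group, distinguished by _TYPE_ (for example PARMS and PVALUE), so you would pick the rows you need before merging with dataset_a.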
PROC REG does support BY groups, so the approach above should work. If BY-group processing doesn't fit your case for some other reason, the best alternative is something like this: write a macro to run the proc reg taking arguments of name and month, and call the macro once for each row of dataset_a. Then generate common output files (or proc append them into a single master output dataset in the macro) and merge the result to dataset_a if needed at the end.
Something like
%macro run_procreg(name=,month=);
data for_run/view=for_run;
set dataset_b;
where name="&name." and put(day,MONNAME3.)="&month.";
run;
proc reg data=for_run;
*other stuff*;
output out=tempdataset; *or however you create your output;
run;
proc append base=master_output data=tempdataset force;
run;
%mend run_procreg;
proc sql;
select cats('%run_procreg(name=',name,',month=',month,')') into :macrocalllist
separated by ' ' from dataset_a;
quit;
&macrocalllist;
data fin;
merge dataset_a (in=a) master_output(in=b);
by name month;
run;
You probably don't need to merge on dataset_a at the end if it just has those two variables. This will be a lot slower than one call with by, but if it's necessary, this is the way to do it.
You can also use call execute in the data step to drive the macro calls instead of the macro variable list above. That is the closest concept to your stated pseudocode, but it doesn't return the information back to the data step (the generated code executes after the data step completes), and it's slightly more troublesome than the above method. There is also, in 9.3+, the DOSUBL function, which can be called from a data step and gets a bit closer to what you want, but I don't know it well enough to explain it or to know that it does indeed meet your needs.
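For reference, a sketch of the call execute variant. It assumes a macro %run_procreg(name=, month=) like the one above has already been defined, and that name and month are character variables in dataset_a.

```sas
data _null_;
  set dataset_a;
  /* Queue one macro call per row. %nrstr delays execution of the */
  /* macro until after this data step has finished.               */
  call execute(
    cats('%nrstr(%run_procreg)(name=', name, ',month=', month, ')')
  );
run;
```

The queued calls run one after another once the data step ends, so the results still have to be collected (for example with proc append inside the macro) and merged back afterwards.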