How do you enter variables with numeric names in SAS? - sas

I would like to enter the name of my variables as numbers e.g. '1950-1959' and I'm using the INPUT statement, but the output is not appearing correctly.
DATA data1;
INPUT AgeGroup$ 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Could you please tell me if I need to use any special characters to specify that '1950-1959' etc. are names of the numeric variable?
Thanks!

You can use name literals to specify names that don't follow the normal rules, for example '1950-1959'n. Make sure that the VALIDVARNAME option is set to ANY so that SAS will allow the non-standard names. You could use standard names for the variables and use the label to store that description.
input AgeGroup :$5. period1-period6 ;
label period1 = '1950-1959' period2 = '1960-1969' ....
It would probably be more useful to store the time period into a variable instead.
data data1;
length AgeGroup $5 Period $9 count 8;
input AgeGroup #;
do period='1950-1959','1960-1969','1970-1979','1980-1989','1990-1992','Total';
input count #;
output;
end;
datalines;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
In that structure you can more easily filter to the data for subset of the time periods. But you could still easily create a report that displays the data in that tabular layout.
proc report data=data1;
columns agegroup count,period ;
define agegroup / group ;
define period / across ' ';
define count / ' ';
run;
Results:
AgeGr
oup 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449

Enable extended character names with options validvarname=any, then specify each as a name literal like 'this'n:
options validvarname=any;
DATA data1;
INPUT AgeGroup$ '1950-1959'n '1960-1969'n '1970-1979'n '1980-1989'n '1990-1992'n Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Most modern SAS applications automatically specify this option, but occasionally you'll run into systems that still have v7 names.

Related

Replace missing values in SAS by Specific Condition

I have a large dataset named Planes with missing values in Arrival Delays(Arr_Delay).I want to
Replace those missing values by Average delay on the Specific route(Origin - Dest) by Specific
Carrier.
Hereby is the sample of the dataset : -
date carrier Flight tailnum origin dest Distance Air_time Arr_Delay
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
code I tried : -
Proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
var Arrival_delay_minutes;
Run;
But as my problem states..i want to get the mean by Specific Route and Specific Carrier for the Missing Value. Please help me on this!
stdize Procedure does not have a way to include by or class variables. you can use the below code to complete your task:-
Proc means data=cs1.Planes noprint;
var Arr_Delay;
class carrier origin dest;
output out=mean1;
Run;
proc sort data=cs1.Planes;
by carrier origin dest;
run;
proc sort data=mean1;
by carrier origin dest;
run;
data cs1.Complete_data(drop=Arr_Delay1 _stat_);
merge cs1.Planes(in=a) mean1(where=(_stat_="MEAN")
keep=carrier origin dest Arr_Delay _stat_
rename=(Arr_Delay = Arr_Delay1) in=b);
by carrier origin dest;
if a;
if Arr_Delay =. then Arr_Delay=Arr_Delay1;
run;
You just need to sort the table cs1.Planes by origin, dest & carrier before running Proc stdize and add by origin dest carrier; to do the grouping you wanted. The only case the values will remain missing is when there are no other values for this carrier/route.
You can find the SAS documentation here and available options here.
Code:
data have;
informat date ddmmyy10.;
format date ddmmyy10.;
input
date carrier $ Flight tailnum $ origin $ dest $ Distance Air_time Arr_Delay;
datalines;
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 15
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
;
run;
proc sort data=work.have; by origin dest carrier; run;
Proc stdize data=work.have reponly method=mean out=work.Complete_data ;
var Arr_Delay;
by origin dest carrier ;
Run;

Reading and halving a sas data set

I have to read a data set of 50 numbers from a text file. It's all in a row with a space delimiter and in multiple uneven lines. for example:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15
16 17 18 19 20 21
Etc.
The first 25 numbers belong to group 1, and the 2nd 25 belong to group 2. So I need to make a group variable (binary either 1 or 2), a count number (1 to 25), and a value variable which is holding the value of the number.
I am stuck on how to split the data in half when reading it. I tried to use truncover but it did not work.
Try something like this, replacing the datalines keyword with the path to your file:
data groups;
infile datalines;
format number 8. counter 2. group 1.; * Not mandatory, used here to order variables;
retain group (1);
input number ##;
counter + 1;
if counter = 26 then do;
group = 2;
counter = 1;
end;
datalines;
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
run;

How to extract next12 month data from master table for each id in Sample table based on the yearmonth and ID using sas

I am currently practicing SAS programming on using two SAS dataset(sample and master) . Below are the hypothetical or dummy data created for illustration purpose to solve my problem through SAS programming . I would like to extract the data for the id's in sample dataset from master dataset(test). I have given an example with few id's as sample dataset, for which i need to extract next 12 month information from master table(test) for each id's based on the yearmonth information( desired output given in the third output).
Below is the code to extract the previous 12 month data but i am not getting idea to extract next 12 month records as pulled for previous months, Can anyone help me in solving this problem using SAS programming with optimized way.
proc sort data=test;
by id yearmonth;
run;
data result;
set test;
array prev_month {13} PREV_MONTH_0-PREV_MONTH_12;
by id;
if first.id then do;
do i =1 to 13;
prev_month(i)=0;
end;
end;
do i = 13 to 2 by -1;
prev_month(i)=prev_month(i-1);
end;
prev_month(1)=no_of_cust;
drop i prev_month_0;
retain prev_month:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
One sample dataset (dataset name - sample).
ID YEARMONTH NO_OF_CUST
1 200909 50
1 201005 65
1 201008 78
1 201106 95
2 200901 65
2 200902 45
2 200903 69
2 201005 14
2 201006 26
2 201007 98
One master dataset - dataset name (test) (huge dataset over the year for each id from start of the account to till date.)
ID YEARMONTH NO_OF_CUST
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
Desired Output should like below,
ID YEARMONTH NO_OF_CUST AFTER_MONTH_1 AFTER_MONTH_2 AFTER_MONTH_3 AFTER_MONTH_4 AFTER_MONTH_5 AFTER_MONTH_6 AFTER_MONTH_7 AFTER_MONTH_8 AFTER_MONTH_9 AFTER_MONTH_10 AFTER_MONTH_11 AFTER_MONTH_12
1 200909 50 77 86 95 95 87 79 71 65 66 66 78 88
Step1: Join your sample table with the main(test) table and using intnx to get all the values for next 12 months.
Step2: Making a column names "after month"
Step3: Transpose to get your final output
proc sql;
create table abc as
select a.id,a.yearmonth,b.yearmonth as yearmonth1, b.no_of_cust
from
sample a
left join
test b
on a.id = b.id and a.yearmonth <= b.yearmonth <= intnx("month",a.yearmonth,12)
order by a.id,a.yearmonth,b.yearmonth;
quit;
data abc1(drop=col yearmonth1);
set abc;
by id yearmonth;
if first.yearmonth then col=-1;
col+1;
columns = compress("after_month_"||col);
run;
proc transpose data=abc1 out=abc2(rename=(after_month_0 = no_of_cust) drop=_name_);
by id yearmonth;
id columns;
var no_of_cust;
run;
My Output:
Or
If you want to make changes in your query then you could use the below code.
proc sort data=test;
by id descending yearmonth;
run;
data result;
set test;
array after_month {13} after_MONTH_0-after_MONTH_12;
by id;
if first.id then do;
do i = 1 to 13;
after_month(i) = 0;
end;
end;
do i = 13 to 2 by -1;
after_month(i) = after_month(i-1);
end;
after_month(1) = NO_OF_CUST;
drop i after_MONTH_0;
retain after_MONTH:;
run;
data sample1;
set sample(drop=no_of_cust);
run;
proc sort data=result;
by id yearmonth;
run;
proc sort data=sample1;
by id yearmonth;
run;
data all;
merge sample1(in=a) result(in=b);
by id yearmonth;
if a;
run;
Let me know in case of any queries.

Formatting MMSS in SAS

DATA USCFootballStatsProject;
INPUT OPPONENT $ 1-12 WINORLOSS 13-14 TIMEOFPOSSESSION 15-19 THIRDDOWNCONVERSIONPERCENTAGE 21-26 RUSHINGYARDS;
FORMAT TIMEOFPOSSESSION MMSS.;
CARDS;
UCF 1 24:30 0.125 32
GEORGIA 0 26:59 0.333 43
ALABAMA 0 22:53 0.333 71
TROY 1 31:47 0.333 116
AUBURN 0 28:51 0.167 70
KENTUCKY 1 29:11 0.636 139
VANDERBILT 1 24:50 0.333 132
TENNESSEE 1 32:08 0.353 65
ARKANSAS 1 27:53 0.429 45
FLORIDA 1 25:50 0.300 120
CLEMSON 0 28:12 0.250 167
MISSOURI 0 31:19 0.316 142
MISSSTATE 1 32:39 0.231 81
GEORGIA 0 29:08 0.364 35
WOFFORD 1 24:21 0.417 165
FLOATLANTIC 1 32:39 0.429 200
AUBURN 0 30:20 0.462 109
KENTUCKY 1 31:07 0.538 190
VANDERBILT 1 30:54 0.727 194
TENNESSEE 0 31:33 0.417 165
ARKANSAS 0 23:06 0.333 51
FLORIDA 0 30:31 0.500 135
MTENNESSEE 1 28:17 0.778 154
CLEMSON 1 32:57 0.600 208
HOUSTON 1 33:19 0.500 189
LOUISLFY 1 31:38 0.538 195
GEORGIA 1 30:34 0.091 140
SCSTATE 1 26:53 0.364 223
LSU 0 27:09 0.500 17
MISSSTATE 1 29:39 0.500 123
KENTUCKY 1 29:57 0.462 86
NCAROLINA 1 25:56 0.083 110
VANDERBILT 0 26:36 0.083 26
TENNESSEE 0 36:25 0.438 39
ARKANSAS 0 30:55 0.400 125
FLORIDA 0 25:39 0.167 68
CLEMSON 0 21:23 0.375 80
NCSTATE 1 34:15 0.357 171
VANDERBILT 0 29:58 0.400 92
GEORGIA 0 24:47 0.417 18
WOFFORD 1 31:07 0.636 172
UAB 1 34:59 0.500 158
OLEMISS 1 31:21 0.538 78
KENTUCKY 1 30:43 0.471 74
LSU 0 25:28 0.111 39
TENNESSEE 1 32:30 0.500 101
ARKANSAS 1 29:06 0.417 132
FLORIDA 0 29:50 0.067 53
CLEMSON 0 27:13 0.471 92
IOWA 0 24:06 0.455 43
NCSTATE 1 32:25 0.333 108
GEORGIA 0 34:21 0.353 114
FLOATLANTIC 1 27:50 0.300 287
OLEMISS 1 33:35 0.375 65
SCSTATE 1 27:13 0.538 213
KENTUCKY 1 29:17 0.500 128
ALABAMA 0 31:43 0.474 64
VANDERBILT 1 32:22 0.375 119
TENNESSEE 0 26:35 0.267 65
ARKANSAS 0 27:37 0.500 53
FLORIDA 0 31:21 0.250 61
CLEMSON 1 36:31 0.375 223
CONN 0 24:32 0.200 76
SOUTHERMISS 1 28:52 0.400 224
GEORGIA 1 35:15 0.643 189
FURMAN 1 33:30 0.600 182
AUBURN 0 28:52 0.500 79
ALABAMA 1 27:33 0.545 110
KENTUCKY 0 25:13 0.500 90
VANDERBILT 1 37:21 0.529 129
TENNESSEE 1 28:28 0.538 212
ARKANSAS 0 25:40 0.417 105
FLORIDA 1 40:46 0.500 239
TROY 1 30:49 0.455 212
CLEMSON 1 34:43 0.333 95
AUBURN 0 28:59 0.417 156
FLORIDAST 0 26:32 0.500 139
ECU 1 30:02 0.500 220
GEORGIA 1 29:02 0.286 253
NAVY 1 31:15 0.556 254
VANDERBILT 1 34:08 0.526 131
AUBURN 0 24:13 0.200 129
KENTUCKY 1 38:37 0.500 288
MISSSTATE 1 32:34 0.200 110
TENNESSEE 1 36:18 0.556 231
ARKANSAS 0 29:05 0.667 79
FLORIDA 1 32:04 0.214 215
CITADEL 1 26:44 0.667 256
CLEMSON 1 37:17 0.444 210
NEBRASKA 1 29:11 0.308 121
VANDERBILT 1 31:36 0.250 205
ECU 1 28:42 0.533 131
UAB 1 23:47 0.500 179
MISSOURI 1 32:33 0.500 144
KENTUCKY 1 31:17 0.500 200
GEORGIA 1 33:13 0.417 230
LSU 0 23:03 0.231 34
FLORIDA 0 24:32 0.214 36
TENNESSEE 1 35:22 0.400 147
ARKANSAS 1 31:45 0.538 104
WOFFORD 1 29:11 0.538 171
CLEMSON 1 39:58 0.524 134
MICHIGAN 1 22:01 0.300 85
NCAROLINA 1 29:33 0.357 228
GEORGIA 0 24:58 0.455 226
VANDERBILT 1 37:10 0.647 220
UCF 1 30:49 0.556 225
KENTUCKY 1 29:45 0.556 178
ARKANSAS 1 43:25 0.563 277
TENNESSEE 0 27:38 0.286 218
MISSOURI 1 34:27 0.294 75
MISSSTATE 1 26:14 0.091 160
FLORIDA 1 28:59 0.313 164
CCAROLINA 1 34:31 0.556 352
CLEMSON 1 38:09 0.526 140
WISCONSIN 1 30:34 0.444 117
TAMU 0 22:22 0.222 67
ECU 1 36:19 0.538 175
GEORGIA 1 31:27 0.222 176
VANDERBILT 1 31:02 0.583 212
MISSOURI 0 35:55 0.381 119
KENTUCKY 0 34:20 0.600 282
FURMAN 1 29:37 0.333 267
AUBURN 0 33:31 0.429 119
TENNESSEE 0 30:13 0.462 248
FLORIDA 1 31:30 0.471 95
SALABAMA 1 24:12 0.400 210
CLEMSON 0 31:20 0.400 63
MIAMI 1 28:50 0.467 60
;
PROC PRINT DATA = USCFootballStatsProject;
RUN;
As it currently stands, it is not printing any of the times in the column for TIMEOFPOSSESSION, but it is printing everything else fine. Any ideas as to why it isn't printing that column? I'm using FORMAT TIMEOFPOSSESSION MMSS.;
I plan on doing a logistic regression with the WINORLOSS being the response variable, but I want to make sure that the data is being read in correctly.
Thanks.
Because you told it to read columns 15 to 19 as a number, but the value has a colon in it. You need to either use formatted input.
input ... #15 TIMEOFPOSSESSION STIMER5. ... ;
Or use list mode input with an attached INFORMAT.
informat TIMEOFPOSSESSION stimer5.;
input ... #15 TIMEOFPOSSESSION ... ;

Creating statistical data from a table

I have a table with 20 columns of measurements. I would like 'convert' the table into a table with 20 rows with columns of Avg, Min, Max, StdDev, Count types of information. There is another question like this but it was for the 'R' language. Other question here.
I could do the following for each column (processing the results with C++):
Select Count(Case When [avgZ_l1] <= 0.15 and avgZ_l1 > 0 then 1 end) as countValue1,
Count(case when [avgZ_l1] <= 0.16 and avgZ_l1 > 0.15 then 1 end) as countValue2,
Count(case when [avgZ_l1] <= 0.18 and avgZ_l1 > 0.16 then 1 end) as countValue3,
Count(case when [avgZ_l1] <= 0.28 and avgZ_l1 > 0.18 then 1 end) as countValue4,
Avg(avgwall_l1) as avg1, Min(avgwall_l1) as min1, Max(avgZ_l1) as max1,
STDEV(avgZ_l1) as stddev1, count(*) as totalCount from myProject.dbo.table1
But I do not want to process the 50,000 records 20 times (once for each column). I thought there would be away to 'pivot' the table onto its side and process the data at the same time. I have seen examples of the 'Pivot' but they all seem to pivot on a integer type field, Month number or Device Id. Once the table is converted I could then fetch each row with C++. Maybe this is really just 'Insert into ... select ... from' statements.
Would the fastest (execution time) approach be to simply create a really long select statement that returns all the information I want for all the columns?
We might end up with 500,000 rows. I am using C++ and SQL 2014.
Any thoughts or comments are welcome. I just don't want have my naive code to be used as a shining example of how NOT to do something... ;)...
If your table looks the same as the code that you sent in r then the following query should work for you. It selects the data that you requested and pivots it at the same time.
create table #temp(ID int identity(1,1),columnName nvarchar(50));
insert into #temp
SELECT COLUMN_NAME as columnName
FROM myProject.INFORMATION_SCHEMA.COLUMNS -- change myProject to the name of your database. Unless myProject is your database
WHERE TABLE_NAME = N'table1'; --change table1 to your table that your looking at. Unless table1 is your table
declare #TableName nvarchar(50) = 'table1'; --change table1 to your table again
declare #loop int = 1;
declare #query nvarchar(max) = '';
declare #columnName nvarchar(50);
declare #endQuery nvarchar(max)='';
while (#loop <= (select count(*) from #temp))
begin
set #columnName = (select columnName from #temp where ID = #loop);
set #query = 'select t.columnName, avg(['+#columnName+']) as Avg ,min(['+#columnName+']) as min ,max(['+#columnName+'])as max ,stdev(['+#columnName+']) as STDEV,count(*) as totalCount from '+#tablename+' join #temp t on t.columnName = '''+#columnName+''' group by t.columnName';
set #loop += 1;
set #endQuery += 'union all('+ #query + ')';
end;
set #endQuery = stuff(#endQuery,1,9,'')
Execute(#endQuery);
drop table #temp;
It creates a #temp table which stores the values of your column headings next to an ID. It then uses the ID when looping though the number of columns that you have. It then generates a query which selects what you want and then unions it together. This query will work on any number of columns meaning that if you add or remove more columns it should give the correct result.
With this input:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
It will produce this output:
avg min max stdev totalcount
age 20 17 31 3.3 29
height_seca1 1737 1569 1877 91.9 29
height_chad1 1736 1570 1880 92.7 29
height_DL 173 155 188 9.7 29
weight_alog1 73 50 106 14.5 29
Hope this helps and works for you. :)