Extracting Minimum value after subsetting in PROC SQL - sas

I have a sample dataset named Flights. I want to extract from it the origin
airport that has the fewest departure delays.
Sample Flights data:
Date Sched_dep_time dep_time flight origin Dep_delay_min
01-01-2013 5:15 5:17 1545 EWR -2
01-01-2013 5:29 5:33 1714 LGA -4
01-01-2013 5:40 5:42 1141 JFK -2
01-01-2013 21:10 21:04 725 JFK 6
01-01-2013 20:30 21:04 461 LGA -74
01-01-2013 21:06 21:05 1696 EWR 1
01-01-2013 20:55 21:10 507 EWR -55
01-01-2013 20:25 21:14 5708 LGA -89
01-01-2013 21:10 21:15 79 JFK -5
01-01-2013 21:24 21:16 301 LGA 8
01-01-2013 6:00 5:58 49 JFK 42
01-01-2013 6:00 5:58 71 JFK 42
01-01-2013 6:00 5:58 194 JFK 42
Code I tried:
Proc sql;
Create table least_delay as
Select origin,min(number_of_delays)as min_delay from
(Select Origin,Count(Departure_delay_minutes) as Number_of_delays from
Flight
Where (Departure_delay_minutes>0))
Group by Origin
;
Quit;
The output I get is the following:
Origin min_delay
1 NLI 1135504
2 JFK 1135504
3 LGA 1135504
It shows the same result for every origin!
Can anybody help me on this?

The specific problem in your code is that you need to add a group by Origin clause in the sub query. However, all this would do is return the number of delays for each Origin, not the Origin(s) with the least delay. A small change to the code, adding a having clause, fixes this.
data flight;
input Date :ddmmyy10. (Sched_dep_time dep_time) (:time.) flight origin $ Dep_delay_min;
format date date9. Sched_dep_time dep_time time. ;
datalines;
01-01-2013 5:15 5:17 1545 EWR -2
01-01-2013 5:29 5:33 1714 LGA -4
01-01-2013 5:40 5:42 1141 JFK -2
01-01-2013 21:10 21:04 725 JFK 6
01-01-2013 20:30 21:04 461 LGA -74
01-01-2013 21:06 21:05 1696 EWR 1
01-01-2013 20:55 21:10 507 EWR -55
01-01-2013 20:25 21:14 5708 LGA -89
01-01-2013 21:10 21:15 79 JFK -5
01-01-2013 21:24 21:16 301 LGA 8
01-01-2013 6:00 5:58 49 JFK 42
01-01-2013 6:00 5:58 71 JFK 42
01-01-2013 6:00 5:58 194 JFK 42
;
run;
proc sql;
create table least_delay
as select *
from (
select
origin,
count(0) as num_delays
from
flight
where
dep_delay_min>0
group by
origin
)
having num_delays = min(num_delays);
quit;

How do you enter variables with numeric names in SAS?

I would like to enter the name of my variables as numbers e.g. '1950-1959' and I'm using the INPUT statement, but the output is not appearing correctly.
DATA data1;
INPUT AgeGroup$ 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Could you please tell me if I need to use any special characters to specify that '1950-1959' etc. are names of the numeric variable?
Thanks!
You can use name literals to specify names that don't follow the normal rules, for example '1950-1959'n. Make sure that the VALIDVARNAME option is set to ANY so that SAS will allow the non-standard names. Alternatively, you could use standard names for the variables and store the description in the label.
input AgeGroup :$5. period1-period6 ;
label period1 = '1950-1959' period2 = '1960-1969' ....
It would probably be more useful to store the time period into a variable instead.
data data1;
length AgeGroup $5 Period $9 count 8;
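input AgeGroup @; /* trailing @ holds the current line for the next INPUT */
do period='1950-1959','1960-1969','1970-1979','1980-1989','1990-1992','Total';
input count @; /* read the next count from the same held line */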
output;
end;
datalines;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
In that structure you can more easily filter to the data for subset of the time periods. But you could still easily create a report that displays the data in that tabular layout.
proc report data=data1;
columns agegroup count,period ;
define agegroup / group ;
define period / across ' ';
define count / ' ';
run;
Results:
AgeGr
oup 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
Enable extended character names with options validvarname=any, then specify each as a name literal like 'this'n:
options validvarname=any;
DATA data1;
INPUT AgeGroup$ '1950-1959'n '1960-1969'n '1970-1979'n '1980-1989'n '1990-1992'n Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Most modern SAS applications automatically specify this option, but occasionally you'll run into systems that still have v7 names.

Replace missing values in SAS by Specific Condition

I have a large dataset named Planes with missing values in arrival delays (Arr_Delay). I want to
replace those missing values with the average delay on the specific route (Origin - Dest) for the specific
carrier.
Here is a sample of the dataset:
date carrier Flight tailnum origin dest Distance Air_time Arr_Delay
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
Code I tried:
Proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
var Arrival_delay_minutes;
Run;
But as my problem states, I want to fill each missing value with the mean for the specific route and specific carrier. Please help me with this!
PROC STDIZE has no CLASS statement, and its BY statement requires the data to be sorted by the grouping variables first. As an alternative, you can compute the group means with PROC MEANS and merge them back:
Proc means data=cs1.Planes noprint nway; /* nway keeps only the full carrier*origin*dest combinations */
var Arr_Delay;
class carrier origin dest;
output out=mean1;
Run;
proc sort data=cs1.Planes;
by carrier origin dest;
run;
proc sort data=mean1;
by carrier origin dest;
run;
data cs1.Complete_data(drop=Arr_Delay1 _stat_);
merge cs1.Planes(in=a) mean1(where=(_stat_="MEAN")
keep=carrier origin dest Arr_Delay _stat_
rename=(Arr_Delay = Arr_Delay1) in=b);
by carrier origin dest;
if a;
if Arr_Delay =. then Arr_Delay=Arr_Delay1;
run;
You just need to sort the table cs1.Planes by origin, dest & carrier before running Proc stdize and add by origin dest carrier; to do the grouping you wanted. The only case the values will remain missing is when there are no other values for this carrier/route.
See the SAS documentation for PROC STDIZE for the available options.
Code:
data have;
informat date ddmmyy10.;
format date ddmmyy10.;
input
date carrier $ Flight tailnum $ origin $ dest $ Distance Air_time Arr_Delay;
datalines;
01-01-2013 UA 1545 N14228 EWR IAH 1400 227 17
01-01-2013 UA 1714 N24211 LGA IAH 1416 227 .
01-01-2013 AA 1141 N619AA JFK MIA 1089 160 .
01-01-2013 EV 5708 N829AS LGA IAD 229 53 -18
01-01-2013 B6 79 N593JB JFK MCO 944 140 14
01-01-2013 AA 301 N3ALAA LGA ORD 733 138 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 .
01-01-2013 B6 49 N793JB JFK PBI 1028 149 15
01-01-2013 B6 71 N657JB JFK TPA 1005 158 19
01-01-2013 UA 194 N29129 JFK LAX 2475 345 23
01-01-2013 UA 1124 N53441 EWR SFO 2565 361 -29
;
run;
proc sort data=work.have; by origin dest carrier; run;
Proc stdize data=work.have reponly method=mean out=work.Complete_data ;
var Arr_Delay;
by origin dest carrier ;
Run;

I don't understand the behavior of char in this example

So I have the following code :
int main()
{
int b=0;
for(char c=0;c<256;c++)
b++;
cout<<b;
return 0;
}
Why does it run indefinitely? (I tried c<255, because I pictured char as a circle: once it reaches 360 degrees it starts again from 0, so c<255 should break the loop. It still ran indefinitely.)
Why does it run indefinitely?
Irrespective of whether char is signed type or unsigned type, it will not reach the value of 256 if char is represented by 8 bits, which is the most common representation. It will always be less than 256.
Why don't you run the code and print the value of c to see how it changes on each increment?
int b=0;
for(char c=0;c<256;c++)
{
cout<<(int)c<<" ";
b++;
if(b == 256)break;
}
This code snippet outputs the following.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
-128 -127 -126 -125 -124 -123 -122 -121 -120 -119 -118 -117 -116 -115 -114 -113
-112 -111 -110 -109 -108 -107 -106 -105 -104 -103 -102 -101 -100 -99 -98 -97 -96
-95 -94 -93 -92 -91 -90 -89 -88 -87 -86 -85 -84 -83 -82 -81 -80 -79 -78 -77 -76
-75 -74 -73 -72 -71 -70 -69 -68 -67 -66 -65 -64 -63 -62 -61 -60 -59 -58 -57 -56
-55 -54 -53 -52 -51 -50 -49 -48 -47 -46 -45 -44 -43 -42 -41 -40 -39 -38 -37 -36
-35 -34 -33 -32 -31 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16
-15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
So, char never gets to 256. On this platform char is signed: it gets to 127, and when you increment it again, it wraps to -128.
Remember, char is typically represented by 8 bits; for signed char the range is -128 to 127, and for unsigned char the range is 0 to 255.
That's why the condition c < 256 always evaluates to true and your code runs indefinitely.
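The point above can be checked in code. The following is only a minimal sketch, and the helper names count_iterations and char_always_below are made up for illustration. It shows one possible fix (counting with an int, which can hold 256) and a way to ask the implementation whether any char value can reach a given bound.

```cpp
#include <cassert>
#include <limits>

// Counts how many times a loop with an int counter runs before reaching
// `bound`. An int can represent 256, so `i < bound` eventually fails.
int count_iterations(int bound) {
    assert(bound >= 0);  // this illustration only handles non-negative bounds
    int b = 0;
    for (int i = 0; i < bound; ++i) {
        ++b;
    }
    return b;
}

// Returns true when every possible char value is below `bound`, meaning
// `for (char c = 0; c < bound; c++)` has no way to terminate by itself.
bool char_always_below(int bound) {
    return std::numeric_limits<char>::max() < bound;
}
```

With 8-bit char, char_always_below(256) is true whether char is signed or unsigned, while count_iterations(256) returns 256 and stops.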

Formatting MMSS in SAS

DATA USCFootballStatsProject;
INPUT OPPONENT $ 1-12 WINORLOSS 13-14 TIMEOFPOSSESSION 15-19 THIRDDOWNCONVERSIONPERCENTAGE 21-26 RUSHINGYARDS;
FORMAT TIMEOFPOSSESSION MMSS.;
CARDS;
UCF 1 24:30 0.125 32
GEORGIA 0 26:59 0.333 43
ALABAMA 0 22:53 0.333 71
TROY 1 31:47 0.333 116
AUBURN 0 28:51 0.167 70
KENTUCKY 1 29:11 0.636 139
VANDERBILT 1 24:50 0.333 132
TENNESSEE 1 32:08 0.353 65
ARKANSAS 1 27:53 0.429 45
FLORIDA 1 25:50 0.300 120
CLEMSON 0 28:12 0.250 167
MISSOURI 0 31:19 0.316 142
MISSSTATE 1 32:39 0.231 81
GEORGIA 0 29:08 0.364 35
WOFFORD 1 24:21 0.417 165
FLOATLANTIC 1 32:39 0.429 200
AUBURN 0 30:20 0.462 109
KENTUCKY 1 31:07 0.538 190
VANDERBILT 1 30:54 0.727 194
TENNESSEE 0 31:33 0.417 165
ARKANSAS 0 23:06 0.333 51
FLORIDA 0 30:31 0.500 135
MTENNESSEE 1 28:17 0.778 154
CLEMSON 1 32:57 0.600 208
HOUSTON 1 33:19 0.500 189
LOUISLFY 1 31:38 0.538 195
GEORGIA 1 30:34 0.091 140
SCSTATE 1 26:53 0.364 223
LSU 0 27:09 0.500 17
MISSSTATE 1 29:39 0.500 123
KENTUCKY 1 29:57 0.462 86
NCAROLINA 1 25:56 0.083 110
VANDERBILT 0 26:36 0.083 26
TENNESSEE 0 36:25 0.438 39
ARKANSAS 0 30:55 0.400 125
FLORIDA 0 25:39 0.167 68
CLEMSON 0 21:23 0.375 80
NCSTATE 1 34:15 0.357 171
VANDERBILT 0 29:58 0.400 92
GEORGIA 0 24:47 0.417 18
WOFFORD 1 31:07 0.636 172
UAB 1 34:59 0.500 158
OLEMISS 1 31:21 0.538 78
KENTUCKY 1 30:43 0.471 74
LSU 0 25:28 0.111 39
TENNESSEE 1 32:30 0.500 101
ARKANSAS 1 29:06 0.417 132
FLORIDA 0 29:50 0.067 53
CLEMSON 0 27:13 0.471 92
IOWA 0 24:06 0.455 43
NCSTATE 1 32:25 0.333 108
GEORGIA 0 34:21 0.353 114
FLOATLANTIC 1 27:50 0.300 287
OLEMISS 1 33:35 0.375 65
SCSTATE 1 27:13 0.538 213
KENTUCKY 1 29:17 0.500 128
ALABAMA 0 31:43 0.474 64
VANDERBILT 1 32:22 0.375 119
TENNESSEE 0 26:35 0.267 65
ARKANSAS 0 27:37 0.500 53
FLORIDA 0 31:21 0.250 61
CLEMSON 1 36:31 0.375 223
CONN 0 24:32 0.200 76
SOUTHERMISS 1 28:52 0.400 224
GEORGIA 1 35:15 0.643 189
FURMAN 1 33:30 0.600 182
AUBURN 0 28:52 0.500 79
ALABAMA 1 27:33 0.545 110
KENTUCKY 0 25:13 0.500 90
VANDERBILT 1 37:21 0.529 129
TENNESSEE 1 28:28 0.538 212
ARKANSAS 0 25:40 0.417 105
FLORIDA 1 40:46 0.500 239
TROY 1 30:49 0.455 212
CLEMSON 1 34:43 0.333 95
AUBURN 0 28:59 0.417 156
FLORIDAST 0 26:32 0.500 139
ECU 1 30:02 0.500 220
GEORGIA 1 29:02 0.286 253
NAVY 1 31:15 0.556 254
VANDERBILT 1 34:08 0.526 131
AUBURN 0 24:13 0.200 129
KENTUCKY 1 38:37 0.500 288
MISSSTATE 1 32:34 0.200 110
TENNESSEE 1 36:18 0.556 231
ARKANSAS 0 29:05 0.667 79
FLORIDA 1 32:04 0.214 215
CITADEL 1 26:44 0.667 256
CLEMSON 1 37:17 0.444 210
NEBRASKA 1 29:11 0.308 121
VANDERBILT 1 31:36 0.250 205
ECU 1 28:42 0.533 131
UAB 1 23:47 0.500 179
MISSOURI 1 32:33 0.500 144
KENTUCKY 1 31:17 0.500 200
GEORGIA 1 33:13 0.417 230
LSU 0 23:03 0.231 34
FLORIDA 0 24:32 0.214 36
TENNESSEE 1 35:22 0.400 147
ARKANSAS 1 31:45 0.538 104
WOFFORD 1 29:11 0.538 171
CLEMSON 1 39:58 0.524 134
MICHIGAN 1 22:01 0.300 85
NCAROLINA 1 29:33 0.357 228
GEORGIA 0 24:58 0.455 226
VANDERBILT 1 37:10 0.647 220
UCF 1 30:49 0.556 225
KENTUCKY 1 29:45 0.556 178
ARKANSAS 1 43:25 0.563 277
TENNESSEE 0 27:38 0.286 218
MISSOURI 1 34:27 0.294 75
MISSSTATE 1 26:14 0.091 160
FLORIDA 1 28:59 0.313 164
CCAROLINA 1 34:31 0.556 352
CLEMSON 1 38:09 0.526 140
WISCONSIN 1 30:34 0.444 117
TAMU 0 22:22 0.222 67
ECU 1 36:19 0.538 175
GEORGIA 1 31:27 0.222 176
VANDERBILT 1 31:02 0.583 212
MISSOURI 0 35:55 0.381 119
KENTUCKY 0 34:20 0.600 282
FURMAN 1 29:37 0.333 267
AUBURN 0 33:31 0.429 119
TENNESSEE 0 30:13 0.462 248
FLORIDA 1 31:30 0.471 95
SALABAMA 1 24:12 0.400 210
CLEMSON 0 31:20 0.400 63
MIAMI 1 28:50 0.467 60
;
PROC PRINT DATA = USCFootballStatsProject;
RUN;
As it currently stands, it is not printing any of the times in the column for TIMEOFPOSSESSION, but it is printing everything else fine. Any ideas as to why it isn't printing that column? I'm using FORMAT TIMEOFPOSSESSION MMSS.;
I plan on doing a logistic regression with the WINORLOSS being the response variable, but I want to make sure that the data is being read in correctly.
Thanks.
Because you told it to read columns 15 to 19 as a number, but the value has a colon in it. You need to either use formatted input:
input ... @15 TIMEOFPOSSESSION STIMER5. ... ;
Or use list mode input with an attached INFORMAT:
informat TIMEOFPOSSESSION stimer5.;
input ... @15 TIMEOFPOSSESSION ... ;

c++ Multiples of 5 in sieve of atkin implementation

I'm solving a problem over at Project Euler which requires me to find the sum of all primes under 2 million. I tried to implement the sieve of Atkin, and strangely it marks numbers like 65 and 85 as primes. I looked at the code and the algorithm for over a day but can't find anything wrong. I'm sure it must be something silly, but I can't find it. Thanks in advance.
I'm using Visual Studio Express 2012.
Here's the code:
#include "stdafx.h"
#include <iostream>
#include <math.h>
#include <vector>
#include <fstream>
#include <conio.h>
int main(){
long long int limit,n;
std::cout<<"Enter a number...."<<std::endl;
std::cin>>limit;
std::vector<bool> prime;
for(long long int k=0;k<limit;k++){ //sets all entries in the vector 'prime' to false
prime.push_back(false);
}
long long int root_limit= ceil(sqrt(limit));
//sive of atkin implementation
for(long long int x=1;x<=root_limit;x++){
for(long long int y=1;y<=root_limit;y++){
n=(4*x*x)+(y*y);
if(n<=limit && (n%12==1 || n%12==5)){
prime[n]=true;
}
n=(3*x*x)+(y*y);
if(n<=limit && n%12==7){
prime[n]=true;
}
n=(3*x*x)-(y*y);
if(x>y && n<=limit && n%12==11){
prime[n]=true;
}
}
}
//loop to eliminate squares of the primes(making them square free)
for(long long int i=5;i<=root_limit;i++){
if(prime[i]==true){
for(long long int j=i*i;j<limit;j+=(i*i)){
prime[j]=false;
}
}
}
unsigned long long int sum=0;
//print values to a seperate text file
std::ofstream outputfile("data.txt");
outputfile<<"2"<<std::endl;
outputfile<<"3"<<std::endl;
for(long long int l=5;l<limit;l++){
if(prime[l]==true){
sum+=l;
outputfile<<l<<std::endl;;
}
}
outputfile.close();
std::cout<<"The sum is...."<<sum+5<<std::endl;
prime.clear();
return 0;
}
And here's the data.txt; I pointed out a few of the errors:
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47
53
59
61
65<-----
67
71
73
79
83
85<-----
89
91
97
101
103
107
109
113
127
131
137
139
143
145<----
149
151
157
163
167
173
179
181
185<----
191
193
197
199
205
211
221
223
227
229
233
239
241
247
251
257
259
263
265
269
271
277
281
283
293
299
305
307
311
313
317
331
337
347
349
353
359
365
367
373
377
379
383
389
397
401
403
407
409
419
421
427
431
433
439
443
445
449
457
461
463
467
479
481
485
487
491
493
499
503
505
509
511
521
523
533
541
545
547
557
559
563
565
569
571
577
587
593
599
601
607
611
613
617
619
629
631
641
643
647
653
659
661
671
673
677
679
683
685
689
691
697
701
703
709
719
727
733
739
743
745
751
757
761
763
767
769
773
785
787
793
797
803
809
811
821
823
827
829
839
851
853
857
859
863
865
871
877
881
883
887
901
905
907
911
919
923
929
937
941
947
949
953
965
967
971
977
983
985
991
997
1009
1013
1019
1021
1027
1031
1033
1037
1039
1049
1051
1061
1063
1067
1069
1073
1079
1087
1091
1093
1097
1099
1103
1105
1109
1117
1123
1129
1145
1147
1151
1153
1157
1159
1163
1165
1171
1181
1187
1189
1193
1199
1201
1205
1213
1217
1223
1229
1231
1237
1241
1249
1259
1261
1267
1277
1279
1283
1285
1289
1291
1297
1301
1303
1307
1313
1319
1321
1327
1339
1345
1351
1361
1367
1373
1381
1385
1387
1391
1399
1403
1405
1409
1417
1423
1427
1429
1433
1439
1447
1451
1453
1459
1465
1469
1471
1481
1483
1487
1489
1493
1499
1511
1513
1517
1523
1531
1537
1543
1549
1553
1559
1565
1567
1571
1579
1583
1585
1591
1597
1601
1603
1607
1609
1613
1619
1621
1627
1637
1649
1651
1657
1663
1667
1669
1679
1685
1687
1693
1697
1699
1703
1709
1717
1721
1723
1727
1733
1739
1741
1745
1747
1753
1759
1765
1769
1777
1781
1783
1787
1789
1801
1807
1811
1823
1831
1843
1847
1853
1861
1865
1867
1871
1873
1877
1879
1885
1889
1891
1901
1907
1913
1921
1931
1933
1937
1939
1945
1949
1951
1961
1963
1973
1979
1985
1987
1991
1993
1997
1999
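You're supposed to flip the entries in the sieve list, not just set them. In the first nested for loops, instead of prime[n]=true; you should have prime[n]=!prime[n]; A number such as 65 has two solutions to 4*x*x+y*y = n (x=2, y=7 and x=4, y=1), so it must be toggled twice and end up false. Also note that prime holds only limit elements, so prime[n] with n==limit is out of range; size the vector as limit+1 or test n<limit.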
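To make the fix concrete, here is a sketch of the same sieve with the toggle applied. It is not the asker's exact program: the function names atkin_sieve and sum_primes_up_to are chosen for illustration, the vector is sized limit + 1 so that index limit is valid, and the file output is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns a table where is_prime[n] is true exactly for the primes n <= limit.
std::vector<bool> atkin_sieve(std::uint64_t limit) {
    std::vector<bool> is_prime(limit + 1, false);
    for (std::uint64_t x = 1; x * x <= limit; ++x) {
        for (std::uint64_t y = 1; y * y <= limit; ++y) {
            // Each solution of a quadratic form toggles the entry, so only
            // numbers with an odd number of solutions stay marked.
            std::uint64_t n = 4 * x * x + y * y;
            if (n <= limit && (n % 12 == 1 || n % 12 == 5)) is_prime[n] = !is_prime[n];
            n = 3 * x * x + y * y;
            if (n <= limit && n % 12 == 7) is_prime[n] = !is_prime[n];
            if (x > y) {
                n = 3 * x * x - y * y;
                if (n <= limit && n % 12 == 11) is_prime[n] = !is_prime[n];
            }
        }
    }
    // Remove multiples of squares of the primes found so far.
    for (std::uint64_t r = 5; r * r <= limit; ++r) {
        if (is_prime[r]) {
            for (std::uint64_t k = r * r; k <= limit; k += r * r) is_prime[k] = false;
        }
    }
    // 2 and 3 are not produced by the quadratic forms; add them by hand.
    if (limit >= 2) is_prime[2] = true;
    if (limit >= 3) is_prime[3] = true;
    return is_prime;
}

// Sums all primes n <= limit using the sieve above.
std::uint64_t sum_primes_up_to(std::uint64_t limit) {
    const std::vector<bool> is_prime = atkin_sieve(limit);
    assert(is_prime.size() == limit + 1);
    std::uint64_t sum = 0;
    for (std::uint64_t n = 2; n <= limit; ++n) {
        if (is_prime[n]) sum += n;
    }
    return sum;
}
```

With the toggle in place, 65, 85 and 91 are no longer reported, and sum_primes_up_to(100) gives 1060.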