I have a data set with a lot of flight information, in the format below.
carrier flight origin dest air_time
9E 4194 EWR ATL 105
9E 4362 EWR ATL .
9E 4362 EWR ATL 117
9E 3633 EWR ATL 113
The second record does not have air_time data available. The business requirement is that in such cases:
I should find the average air_time for the carrier code,
using the same departure and destination airports,
and populate this average as the air_time for row #2, which has the missing data.
I am unable to code this in SAS. The code should do this every time a missing value is found in air_time. I would appreciate help from the experts.
Thanks in advance!
The solution below worked perfectly for me.
Step #1: sort by the variables I planned to use for finding the average values.
proc sort data=cs1.flights_cln out=cs1.flights_srt;
by carrier origin dest;
run;
Step #2: use the STANDARD procedure. After I ran this code, the missing values in the data set were replaced with the average values.
proc standard data=cs1.flights_srt out=cs1.flights_stn replace;
by carrier origin dest;
run;
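Note that without a VAR statement, PROC STANDARD with the REPLACE option fills in missing values for every numeric variable in the data set (including flight). A minimal sketch, reusing the same data set names as above, that limits the replacement to air_time:
proc standard data=cs1.flights_srt out=cs1.flights_stn replace;
by carrier origin dest;
var air_time; /* only air_time gets its missing values replaced with the BY-group mean */
run;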
I am creating graphs in Power BI.
This graph is a line chart with the fields 'key', 'Cumulative Volume' and 'Cumulative Gross Profit'.
To calculate both cumulative measures, 'key' needs to be ordered by decreasing 'Gross Profit'.
I don't want to calculate the cumulative measures in Power BI. I want them to be calculated at the data source for efficiency. Once they are calculated at the source, how can I sort the visual by 'key' in decreasing order of 'Gross Profit'?
The Sort by column feature is not working; I've tried it already.
For now, I do it by creating an additional column called 'index' in decreasing order of 'Gross Profit'. But I have other similar graphs as well, and it may not be good practice to create index columns on the source side.
Thank you.
Key   Volume   Gross Profit
22    200      52
35    566      95
74    888      32
89    600      54
I want to make Graph C, but I'm getting Graph B or A.
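For the source-side calculation mentioned above, here is a minimal sketch assuming the source happens to be a SAS data set (the data set name sales and the column names Volume and Gross_Profit are assumptions): sort by decreasing Gross Profit, then build the running totals and a sort index in that order.
proc sort data=sales out=sales_sorted;
by descending Gross_Profit; /* order keys by decreasing Gross Profit */
run;
data sales_cum;
set sales_sorted;
Cumulative_Volume + Volume; /* sum statement keeps a running total */
Cumulative_Gross_Profit + Gross_Profit;
index + 1; /* 1, 2, 3, ... usable with Sort by column in Power BI */
run;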
I am relatively new to the PROC IML procedure. I'd like my log to be completely clean: no notes and no "!" marks (and, if possible, none of the wrapped lines shown below). How can I eliminate the note while keeping CPU usage and performance efficient?
Thank you for your help! I appreciate it. - Michelle
71 proc iml;
NOTE: IML Ready
72
72 ! varNames={"NACCZMMS" "NACCZLMI" "NACCZLMD" "NACCZDFT" "NACCAGEB"};
73
73 ! use Class2.exercise2;
NOTE: Data file CLASS2.EXERCISE2.DATA is in a format that is native to another host,
      or the file encoding does not match the session encoding. Cross Environment
      Data Access will be used, which might require additional CPU resources and
      might reduce performance.
74
74 ! read all var varNames into CG;
75
75 ! print CG[c=varNames];
75 ! /*c for colname*/
76 quit;
You can convert the data set to a format that's optimal for your system.
data exercise2;
set class2.exercise2; /* libref from the log; writes a native-format copy to WORK */
run;
Then use the exercise2 data set in your IML code. You only need to do this once per session. The note appears because the data set was created on a different operating system than yours, and SAS is letting you know; it will do the conversion automatically, but that can slow things down.
Alternatively, turn on the option NONOTES, which will suppress all NOTEs in the log (WARNINGs will still be displayed). I don't recommend this, as NOTEs can be very useful for detecting issues in your code.
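For completeness, a minimal sketch of toggling that option around just the step that produces the note, reusing your own IML code, so the rest of the session keeps its notes:
options nonotes; /* suppress NOTEs from this point on (WARNINGs and ERRORs still appear) */
proc iml;
varNames={"NACCZMMS" "NACCZLMI" "NACCZLMD" "NACCZDFT" "NACCAGEB"};
use Class2.exercise2;
read all var varNames into CG;
print CG[c=varNames];
quit;
options notes; /* restore normal logging */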
I am using SAS and I have a column in my table that consists of various numbers. I want to go down the column and select each number that is smaller than the highest number seen so far. I posted an example below of what I am looking for. I also have a column with the year, which I didn't include, if that matters. I am guessing I will need some sort of loop. n is the original column and output is what I would like my loop to produce.
example:
n - current column
28
22
30
40
39
55
110
89
98
160
155
157
250
output - desired output
22
39
89
98
155
157
I attempted this in PROC SQL because I am new to SAS and know much more about SQL. While attempting it, I realized I am not going to be able to do this in PROC SQL.
Here is what I tried in PROC SQL.
I can post more of what I have tried as I attempt more loops; as of now my loops are too far off.
proc sql;
select a.*
from homework a
full join homework b on a.make = b.make
and a.model = b.model
where a.'Initial Model Year'n < b.'Initial Model Year'n /* SAS name literal instead of [bracket] syntax */
and a.MPH < b.MPH;
quit;
Why always use SQL? SAS has a lot of facilities that are often better suited for the job than SQL. Juniors in SAS tend to use the only thing they know from school, SQL, and neglect all the rest.
By definition, SQL is not suited for this job! SQL does not even guarantee that the order of rows is preserved, let alone that you can use the order of the input rows in your logic. (Yes, there are SQL dialects that can do this, but not standard SQL.)
Use a data step. It reads your data row by row, in the order the rows occur.
Avoid writing loops explicitly whenever you can. The data step implicitly loops over its input.
By default, the data step writes one row for each row read. You can remove a row from the output with a delete statement. You can also write explicit output statements; then only the rows for which you execute output will be in the output. (output is also used if you want more than one output row per input row.)
However, by default, row-by-row processing means the data step forgets the previous row and everything related to it. So you need to explicitly retain some information.
Also note that, by default, SAS keeps all intermediate results of calculations. If you don't want that, you need either an explicit keep statement or a drop statement.
Example solution:
data MY_SELECTION;
set MY_INPUT;
retain largest 0; * largest keeps its value across rows and starts at 0 *;
if largest < n then largest = n; * new running maximum, so remember it and write nothing *;
else if n < largest then output; * n is below the highest value so far, so keep this row *;
drop largest;
run;
Final remark: by default, SQL writes a report and the data step creates a new data set. If you want SQL to behave like the data step, precede your query with create table MY_SELECTION as. If you want the data step to behave like SQL, follow it with a proc print step.
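A minimal sketch of both variants, reusing the MY_INPUT and MY_SELECTION names from the example above (the select * is just a placeholder query):
proc sql;
create table MY_SELECTION as /* SQL now creates a data set instead of printing a report */
select * from MY_INPUT;
quit;
proc print data=MY_SELECTION; /* prints a data step (or SQL) result as a report */
run;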
Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS with PROC LOGISTIC and scoring) in logistic regression?
I will take an example. Considering a traditional credit-scoring model for a bank, let's say we have 52,000 customers: 50,000 good and 2,000 bad. Now for my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes the target variable is a rare event, like fraud. In this case, a logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common remedy because of its simplicity.
However, model calibration is required when the scores are used for decisions (your case); nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the order stays the same).
Parameter and odds-ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. With oversampling, the estimate b1' is consistent with the true model, but b0' is not equal to b0.
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let's create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv.
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); * independent variable *;
p=1/(1+exp(-(-6+2*iv))); * true event probability *;
y=ranbin(10000,1,p); * dependent variable 1/0 *;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 979 97.90 979 97.90
1 21 2.10 1000 100.00
Similar to your problem, the event rate is p = 0.0210, in other words very rare.
Let's use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.4337 0.4874 124.3027 <.0001
iv 1 1.8356 0.2776 43.7116 <.0001
The logistic result is quite close to the real model, although, as you already know, the basic assumption does not hold.
Now let's oversample the original dataset by selecting all event cases and keeping non-event cases with probability 1/20 (5%).
data oversampling;
set dummy_data;
if y=1 then output; * keep every event *;
if y=0 then do;
if ranuni(10000)<1/20 then output; * keep roughly 5% of the non-events *;
end;
run;
proc freq data=oversampling;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 54 72.00 54 72.00
1 21 28.00 75 100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.9836 0.6982 18.2622 <.0001
iv 1 2.0068 0.5139 15.2519 <.0001
As you can see, the iv estimate is still close to the real value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between the sample odds of the event and the true population odds; it adjusts the intercept so that it reflects the true distribution of events rather than the sample distribution (the oversampled data set).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.881148, which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.8811 0.6982 70.9582 <.0001
iv 1 2.0068 0.5138 15.2518 <.0001
off 1 1.0000 0 . .
From here all your estimates are correctly adjusted, and analysis and interpretation should be carried out as normal.
Hope it helps.
This is a great explanation.
When you oversample or undersample in the rare-event setting, the intercept is impacted, not the slope. Hence, in the final output you just need to adjust the intercept by adding the offset statement in proc logistic in SAS. Probabilities are impacted by oversampling, but again, ranking is not impacted, as explained above.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations based on their probabilities from the oversampled model and put them into deciles (using PROC RANK as normal, as in the sketch below). However, the actual probability values are impacted, so you cannot use them directly. The ROC curve is not impacted either.
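A minimal sketch of that decile ranking; the data set name new_customers and the output data set names are assumptions, and P_1 is the predicted event probability produced by the SCORE statement:
proc logistic data=oversampling;
model y(event="1")=iv;
score data=new_customers out=scored; * new_customers is a hypothetical data set to score *;
run;
proc rank data=scored out=scored_deciles groups=10 descending;
var P_1; * predicted probability of the event *;
ranks decile; * decile 0 = highest predicted probabilities *;
run;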
We read SAS XPT files to load data into .NET. Everything works fine, but recently we encountered a problem where the customer stored a date as a numeric value in a column and provided a format for it in the file header. The SAS Viewer can display that data correctly using the given format, but we have to load that data into .NET in our program, and we do not use SAS.
I recently found out that you can use the SAS LocalProvider with OLE DB, but it turns out that it does not support numeric formats. So we end up with the wrong data in columns where the data is stored as a numeric value with a format attached.
Can anyone please help me understand and resolve this issue, ideally with some sample code? I have looked for .NET code samples for this issue, but with no luck so far.
Thanks in advance.
Regards,
Nasir
SAS Date values are stored as the number of days since Jan 1, 1960.
122
123 data _null_;
124 x=today();
125 put x=;
126 run;
x=19410
Today (2/21/2013), for example, is 19410 days since 1/1/1960. Assuming you know your own software's date representation (probably a number of days since some other base date), you can perform the transformation yourself.
If it's relevant, SAS datetime values are the number of seconds since 1/1/1960 00:00:00.
128 data _null_;
129 x=datetime();
130 put x=;
131 run;
x=1677052885.5
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
Again, that's the time as of 08:00 2/21/2013.
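Here is a minimal SAS sketch of that arithmetic, using the values shown above; the same logic (a base of Jan 1, 1960 plus the stored number of days or seconds) applies in any language:
data _null_;
sas_date = 19410; * raw numeric value as read from the file *;
real_date = intnx('day', '01JAN1960'd, sas_date); * base date plus day count *;
put real_date= date9.; * prints 21FEB2013 *;
sas_dt = 1677052885.5; * raw datetime value in seconds *;
real_dt = intnx('second', '01JAN1960:00:00:00'dt, sas_dt); * base datetime plus seconds *;
put real_dt= datetime20.; * prints 21FEB2013:08:01:25 *;
run;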