SAS data balancing good & bad - sas

I have a data set where the percentage of bads is quite low. Can anyone suggest a way to balance such a data set using SAS so that the logistic regression gives a better result? Below is a sample. Thanks in advance!
ID X1 X2 X3 X4 X5 Target
1 87 400 2 0 0 0
2 70 620 1 0 0 0
3 66 410 3 0 0 0
4 85 300 1 0 0 0
5 100 200 4 0 0 0
6 201 110 1 0 0 0
7 132 513 3 0 0 0
8 98 417 4 0 0 0
9 397 620 1 0 0 1
10 98 700 5 0 0 1

You can oversample the bads (raise their share of the training data, typically by keeping all bads and sampling down the goods) and then use the PRIOREVENT= option on the SCORE statement of PROC LOGISTIC to correct the scored probabilities for the oversampling. There are plenty of examples online that will help you further with this; a rough sketch is given below.
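Here is a minimal sketch of that approach, assuming your full modelling data set is called FULL with predictors X1-X5 and a 0/1 TARGET as in your sample; the sampling rate and the 2% prior event rate are placeholders you would replace with your own values.

/* split events (bads) from non-events (goods) */
data bads goods;
   set full;
   if target=1 then output bads;
   else output goods;
run;

/* keep every bad and draw a random subset of the goods (rate is a placeholder) */
proc surveyselect data=goods out=goods_samp method=srs samprate=0.1 seed=1234;
run;

data oversampled;
   set bads goods_samp;
run;

/* fit on the balanced data; PRIOREVENT= rescales the scored        */
/* probabilities back to the true population event rate (say 2%)    */
proc logistic data=oversampled;
   model target(event='1') = x1-x5;
   score data=full out=scored priorevent=0.02;
run;

The SCORE statement applies the fitted model to the original, unbalanced data and writes the adjusted probabilities to the SCORED data set.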

Related

Keeping data rows with same Unique ID after an Indicator variable changes from 1 to 0 for the first time in SAS

I need help with SAS code that keeps data rows with the same unique ID after a certain condition is met. For example, suppose I have a dataset called BASE as shown below:
Account_Number Default_Indicator
1010 0
1010 0
1010 1
1010 1
1010 1
1010 0
1010 0
1010 0
1010 1
1010 1
1020 0
1020 0
1020 0
1020 1
1020 1
1020 1
1020 0
1020 0
1020 1
1020 1
I would like the final dataset to keep only the rows after the Default_Indicator changes from 1 to 0 for the first time, as shown below:
Account_Number Default_Indicator
1010 0
1010 0
1010 0
1010 1
1010 1
1020 0
1020 0
1020 1
1020 1
Help with this will be greatly appreciated.
You can use BY-group processing; just add the NOTSORTED keyword to the BY statement. Use the LAG() function to access the value from the previous data step iteration. Retain a flag variable indicating that you have found a 1 -> 0 transition, and make sure to reset it when starting a new account.
data have;
   row+1;
   input Account_Number Default_Indicator @@;
cards;
1010 0 1010 0
1010 1 1010 1 1010 1
1010 0 1010 0 1010 0
1010 1 1010 1
1020 0 1020 0 1020 0
1020 1 1020 1 1020 1
1020 0 1020 0
1020 1 1020 1
;
data want;
   set have;
   by account_number default_indicator notsorted;
   /* value of the indicator on the previous observation */
   lag_indicator=lag(default_indicator);
   if first.account_number then call missing(found,lag_indicator);
   /* FOUND counts 1 -> 0 transitions; once it is nonzero, rows are output */
   if first.default_indicator and default_indicator=0 and lag_indicator=1 then found+1;
   if found then output;
   drop lag_indicator found;
run;
Results (without the DROP statement)
Account_ Default_ lag_
Obs row Number Indicator indicator found
1 6 1010 0 1 1
2 7 1010 0 0 1
3 8 1010 0 0 1
4 9 1010 1 0 1
5 10 1010 1 1 1
6 17 1020 0 1 1
7 18 1020 0 0 1
8 19 1020 1 0 1
9 20 1020 1 1 1

How to loop rows and columns in pandas while replacing values with a constant increment

I am trying to replace values in a DataFrame with 0: in the first column I need to replace the first 3 values, in the next column the first 6 values, and so on, increasing by 3 each time.
import numpy as np
import pandas as pd

a = np.array([133,124,156,189,132,176,189,192,100,120,130,140,150,50,70,133,124,156,189,132])
b = pd.DataFrame(a.reshape(10,2), columns=['s','t'])
for columns in b:
    yy = 3
    for i in xrange(yy):
        b[columns][i] = 0
    yy += 3

print b
the outcome is the following
s t
0 0 0
1 0 0
2 0 0
3 189 189
4 132 132
5 176 176
6 189 189
7 192 192
8 100 100
9 120 120
I am clearly missing something really simple to make the loop replace 6 values instead of only 3 in column t. Any ideas?
I would do it this way:
i = 1
for c in b.columns:
    b.ix[0 : 3*i-1, c] = 0
    i += 1
Demo:
In [86]: b = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
In [87]: %paste
i = 1
for c in b.columns:
    b.ix[0 : 3*i-1, c] = 0
    i += 1
## -- End pasted text --
In [88]: b
Out[88]:
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 10 0 0 0
4 8 0 0 0
5 49 0 0 0
6 55 48 0 0
7 99 43 0 0
8 63 29 0 0
9 61 65 74 0
10 15 29 41 0
11 79 88 3 0
12 91 74 11 4
13 56 71 6 79
14 15 65 46 81
15 81 42 60 24
16 71 57 95 18
17 53 4 80 15
18 42 55 84 11
19 26 80 67 59
You need to initialize yy=3 before the loop:
yy = 3
for columns in b:
    for i in xrange(yy):
        b[columns][i] = 0
    yy += 3

print b
Python 3 solution:
yy = 3
for columns in b:
    for i in range(yy):
        b[columns][i] = 0
    yy += 3

print (b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132
Another solution:
yy = 3
for i, col in enumerate(b.columns):
    b.ix[:i*yy+yy-1, col] = 0

print (b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132

In R, Train/update model with multiple datasets

In R, I'm trying to train a neural network on multiple files. I have performed the multinom function on a single dataset, but I cannot find how to train my model with another dataset.
So I want to apply a model from a previous call to new data without re-estimating the model.
So first you build a model, as explained in Sam Thomas's answer.
#load libraries
library(nnet)
library(MASS)
#Define data
example(birthwt)
# Define training and test data
set.seed(321)
index <- sample(seq_len(nrow(bwt)), 130)
bwt_train <- bwt[index, ]
bwt_test <- bwt[-index, ]
# Build model
bwt.mu <- multinom(low ~ ., data=bwt_train)
Then I have another, similar dataset that I want to use to train/update the earlier created model. So I want to update the model with new data to improve it.
# New data set (for example resampled bwt)
bwt2=sapply(bwt, sample)
head(bwt2,3)
low age lwt race smoke ptd ht ui ftv
[1,] 1 31 115 3 1 1 0 0 2
[2,] 1 20 95 1 0 1 0 0 3
[3,] 2 25 95 2 0 1 0 1 1
# Define training and test data with new dataset
set.seed(321)
index <- sample(seq_len(nrow(bwt2)), 130)
bwt2_train <- bwt2[index, ]
bwt2_test <- bwt2[-index, ]
Now I want to optimize the model with this new dataset. I cannot merge the two datasets, because the model should update over time as new data becomes available. This is also because it is not desirable to recalculate the model every time new data is available.
Thanks in advance,
Adam
Borrowed from an example in ?nnet::multinom
library(nnet)
library(MASS)
example(birthwt)
head(bwt, 2)
low age lwt race smoke ptd ht ui ftv
1 0 19 182 black FALSE FALSE FALSE TRUE 0
2 0 33 155 other FALSE FALSE FALSE FALSE 2+
set.seed(321)
index <- sample(seq_len(nrow(bwt)), 130)
bwt_train <- bwt[index, ]
bwt_test <- bwt[-index, ]
bwt.mu <- multinom(low ~ ., bwt_train)
(pred <- predict(bwt.mu, newdata=bwt_test))
[1] 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
[39] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
Levels: 0 1
Or if you want the probabilities
(pred <- predict(bwt.mu, newdata=bwt_test, type="probs"))
1 5 6 16 19 23 24
0.43672841 0.65881933 0.21958026 0.39061949 0.51970665 0.01627479 0.17210620
26 27 28 29 30 37 40
0.06133368 0.31568117 0.05665126 0.26507476 0.37419673 0.18475433 0.14946268
44 46 47 51 56 58 60
0.09670367 0.72178459 0.06541529 0.37448908 0.31883809 0.09532218 0.27515734
61 64 67 69 72 74 76
0.27515734 0.09456443 0.16829037 0.62285841 0.12026718 0.47417711 0.09603950
78 87 94 99 100 106 114
0.34588019 0.30327432 0.87688323 0.21177276 0.06576210 0.19741587 0.22418653
115 117 118 120 125 126 130
0.14592195 0.19340994 0.14874536 0.30176632 0.09513698 0.08334515 0.03886775
133 134 139 140 145 147 148
0.41216817 0.85046516 0.46344537 0.34219775 0.33673304 0.26894886 0.43778705
152 163 164 165 168 174 180
0.19044485 0.27800125 0.17865143 0.86783149 0.25969355 0.60623964 0.34931986
182 183 185
0.22944657 0.08066599 0.22863967

awk if-greater-than and replacement under condition

I have the following data
......
6 4 4 17 154 93 309 0 11930
7 3 2 233 311 0 11936 11932 111874
8 3 1 15 0 11938 11943 211004 11449
9 3 2 55 102 0 11932 11941 111883
10 3 2 197 231 0 11925 11921 111849
11 3 2 160 777 0 11934 11928 111875
......
I want to replace any value greater than 5000 with 0, from column 4 to column 9. How can I do this with awk?
To print with lots of spaces like the input, something like this:
awk '{for(i=4;i<=NF;i++)if($i>5000)$i=0; for(i=1;i<=NF;i++)printf "%7d",$i;printf"\n"}' file
Output
      6      4      4     17    154     93    309      0      0
      7      3      2    233    311      0      0      0      0
      8      3      1     15      0      0      0      0      0
      9      3      2     55    102      0      0      0      0
     10      3      2    197    231      0      0      0      0
     11      3      2    160    777      0      0      0      0
For scrunched up together (TM) output, you can use this:
awk '{for(i=4;i<=NF;i++)if($i>5000)$i=0}1' file
6 4 4 17 154 93 309 0 0
7 3 2 233 311 0 0 0 0
8 3 1 15 0 0 0 0 0
9 3 2 55 102 0 0 0 0
10 3 2 197 231 0 0 0 0
11 3 2 160 777 0 0 0 0
An alternative approach (requires gawk4+):
{
    patsplit($0, a, "[0-9]+", s)
    printf "%s", s[0]
    for (i=1; i<=length(a); i++) {
        if (i>=4 && a[i]>5000) {          # columns 4 onwards, as in the question
            l=length(a[i])                # remember the original field width
            a[i]=0
        }
        else l=0
        printf "%"l"s%s", a[i], s[i]      # pad the replacement to keep alignment
    }
    printf "\n"
}
It is more flexible when the spacing varies, unlike in the example data. It might also be faster than the accepted answer when the number of fields is much larger than 9.

need help for array [closed]

I need to create some new variables day1, day2, day3, etc. For each run of consecutive rows with readmit=1, each day[i] should be set to that row's gap value. For example, the first two readmit=1 rows should give day1=21 and day2=9; the next run of three readmit=1 rows should give day1=29, day2=12, day3=23; and so on. Hopefully I have expressed this well enough. Thanks in advance.
STUDYID index readmit gap
10001 1 0
10001 1 0 79
10001 1 0 48
10001 1 0 39
10001 1 0 74
10001 1 0 41
10001 0 1 21
10001 0 1 9
10001 0 0 130
10001 0 0 52
10001 0 0 110
10001 1 0 80
10001 0 1 29
10001 0 1 12
10001 0 1 23
10001 1 0 57
10001 0 1 28
10001 0 1 14
10001 1 0 118
10001 0 1 5
10001 0 1 22
10001 1 0 40
10001 0 1 23
10001 0 1 24
10001 0 1 19
I think the code below answers your question. It requires two passes of the data: the first calculates the maximum number of consecutive rows where READMIT=1, which is stored in a macro variable that determines the array size in the second pass.
The key to solving this question is the order of the data and the use of the NOTSORTED option in the BY statement. This enables every change in the READMIT value to be treated as a new section.
Hope this helps, although it would be good if someone could find a method that just uses a single pass of the data.
data have;
input STUDYID index readmit gap;
cards;
10001 1 0 .
10001 1 0 79
10001 1 0 48
10001 1 0 39
10001 1 0 74
10001 1 0 41
10001 0 1 21
10001 0 1 9
10001 0 0 130
10001 0 0 52
10001 0 0 110
10001 1 0 80
10001 0 1 29
10001 0 1 12
10001 0 1 23
10001 1 0 57
10001 0 1 28
10001 0 1 14
10001 1 0 118
10001 0 1 5
10001 0 1 22
10001 1 0 40
10001 0 1 23
10001 0 1 24
10001 0 1 19
;
run;
data _null_;
   set have (keep=readmit) end=last;
   by readmit notsorted;
   if first.readmit then days=0;
   retain max_days;
   if readmit=1 then days+1;
   max_days=max(max_days,days);
   if last then call symput('max_days',strip(max_days));
run;
%put maximum consecutive days = &max_days.;
data want;
   set have;
   by readmit notsorted;
   array dayvar{*} day1-day&max_days.;
   if first.readmit then do;
      num_day=0;
      call missing(of day:);
   end;
   retain day1-day&max_days.;
   if readmit=1 then do;
      num_day+1;
      dayvar{num_day}=gap;
      if last.readmit then output;
   end;
   keep studyid index day: ;
run;